This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests with coverage
make test
# or directly:
pytest --cov-report html --cov-report term --cov=nomenklatura tests/
# Run a single test
pytest tests/test_resolver.py::test_function_name -v
# Type checking
make typecheck
# or:
mypy --strict nomenklatura/
# Run both tests and type check
make check

Nomenklatura is a data deduplication and entity integration framework for Follow the Money (FtM) entities. The core workflow: ingest FtM entities → block candidates → score pairs → record judgements → export merged entities.
Resolver (nomenklatura/resolver/resolver.py) — The central graph structure. Stores edges (Edge) representing Judgements (POSITIVE/NEGATIVE/UNSURE/NO_JUDGEMENT) between entity ID pairs. Implements connected-components to find canonical IDs and evaluate transitive judgements. Backed by SQLAlchemy (defaults to SQLite at nomenklatura.db; configurable via NOMENKLATURA_DB_URL). The Resolver extends Linker, which is the read-only view used throughout the rest of the codebase.
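To make the connected-components idea concrete, here is a minimal sketch in plain Python. It is not the Resolver implementation: `ToyResolver`, its method names, and the "smallest ID wins" canonical rule are assumptions chosen for illustration, and it keeps edges in a dict instead of SQLAlchemy.

```python
from enum import Enum
from typing import Dict, Set, Tuple


class Judgement(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    UNSURE = "unsure"
    NO_JUDGEMENT = "no_judgement"


def _key(left: str, right: str) -> Tuple[str, str]:
    # Sort the pair so (a, b) and (b, a) map to the same edge.
    return (min(left, right), max(left, right))


class ToyResolver:
    """Hypothetical stand-in for illustration only."""

    def __init__(self) -> None:
        self.edges: Dict[Tuple[str, str], Judgement] = {}

    def decide(self, left: str, right: str, judgement: Judgement) -> None:
        self.edges[_key(left, right)] = judgement

    def connected(self, entity_id: str) -> Set[str]:
        # Follow POSITIVE edges only to collect the cluster around entity_id.
        seen = {entity_id}
        queue = [entity_id]
        while queue:
            current = queue.pop()
            for (a, b), judgement in self.edges.items():
                if judgement is not Judgement.POSITIVE or current not in (a, b):
                    continue
                other = b if current == a else a
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        return seen

    def get_canonical(self, entity_id: str) -> str:
        # Assumption: the smallest ID in the cluster acts as the canonical ID.
        return min(self.connected(entity_id))

    def get_judgement(self, left: str, right: str) -> Judgement:
        left_cluster = self.connected(left)
        if right in left_cluster:
            return Judgement.POSITIVE
        # A NEGATIVE edge between any two cluster members separates the clusters.
        for a in left_cluster:
            for b in self.connected(right):
                if self.edges.get(_key(a, b)) is Judgement.NEGATIVE:
                    return Judgement.NEGATIVE
        # Otherwise fall back to the direct edge, if one was recorded.
        return self.edges.get(_key(left, right), Judgement.NO_JUDGEMENT)
```

With `a=b` and `b=c` judged positive and `c≠d` judged negative, this sketch reports `a` and `d` as a negative pair even though no direct edge between them was recorded, which is the transitive evaluation described above.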
Store (nomenklatura/store/base.py) — Generic triple store for FtM statement-based entities. Implementations:
- MemoryStore: in-memory, used for CLI file-based workflows
- SQLStore: PostgreSQL/SQLite persistence
- LevelStore / RedisStore: alternative KV backends (optional deps)
blocker.Index (nomenklatura/blocker/index.py) — DuckDB-backed inverted index for blocking (finding candidate pairs). Tokenizes entities into name parts, phonetics, symbols, and words. Used by xref.
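The blocking step can be illustrated with a small in-memory inverted index. This is a sketch under simplifying assumptions, not the DuckDB-backed implementation: `tokenize` and `candidate_pairs` are hypothetical helpers, and only lowercase word tokens are used (no name parts, phonetics, or symbols).

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def tokenize(name: str) -> Set[str]:
    # Assumption: whitespace-separated lowercase words are the only tokens.
    return set(name.lower().split())


def candidate_pairs(entities: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Map each candidate pair of entity IDs to its number of shared tokens."""
    # Inverted index: token -> IDs of the entities that contain it.
    index: Dict[str, Set[str]] = defaultdict(set)
    for entity_id, name in entities.items():
        for token in tokenize(name):
            index[token].add(entity_id)

    # Any two entities listed under the same token become a candidate pair.
    pairs: Dict[Tuple[str, str], int] = defaultdict(int)
    for ids in index.values():
        for left, right in combinations(sorted(ids), 2):
            pairs[(left, right)] += 1
    return dict(pairs)
```

Only entities that share at least one token are ever compared, which is what keeps blocking cheaper than scoring every possible pair.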
Matching algorithms (nomenklatura/matching/) — Scoring algorithms implement ScoringAlgorithm from matching/types.py. Available algorithms registered in matching/__init__.py:
- RegressionV1: default; sklearn logistic regression trained on FtM feature comparisons
- EntityResolveRegression: entity resolution regression model
- LogicV1, LogicV2: rule-based logic matchers
- NameMatcher, NameQualifiedMatcher: name-only matchers
Feature comparisons are in matching/compare/ (names, dates, countries, identifiers, addresses, gender). These return normalized floats fed to the regression models.
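As a rough illustration of how normalized feature floats turn into a match score, the sketch below applies a logistic function to a weighted sum. The feature names, weights, and bias are invented for this example; the real models are trained with sklearn and their coefficients are not reproduced here.

```python
import math
from typing import Dict

# Hypothetical coefficients, chosen by hand for illustration only.
WEIGHTS: Dict[str, float] = {
    "name_similarity": 4.0,
    "birth_date_match": 2.0,
    "country_mismatch": -3.0,
}
BIAS = -3.0


def score(features: Dict[str, float]) -> float:
    """Combine feature values in [0, 1] into a match probability."""
    # Features missing from the input are treated as 0.0.
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))
```

A pair with matching name and birth date scores well above one where the countries disagree, because the mismatch feature carries a negative weight.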
Enrichment (nomenklatura/enrich/) — Framework for linking entities to external data sources. BaseEnricher is the ABC; implementations include AlephEnricher, YenteEnricher, OpenCorporatesEnricher, WikidataEnricher, etc. Enrichers are configured via YAML and use a SQLite-backed Cache.
TUI (nomenklatura/tui/) — Textual-based terminal UI for interactive deduplication (dedupe CLI command).
xref (nomenklatura/xref.py) — Orchestrates the cross-reference pipeline: builds a blocking index, scores candidate pairs, and saves unsure judgements to the resolver.
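The shape of that pipeline can be sketched as a single function. Everything here is hypothetical (`xref_sketch`, the `limit` parameter, and the plain dict standing in for the resolver); it only mirrors the described order of steps: take blocked pairs, skip those already judged, score the rest, and record the best ones as unsure.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def xref_sketch(
    pairs: Iterable[Pair],
    score: Callable[[str, str], float],
    judgements: Dict[Pair, str],
    limit: int = 10,
) -> List[Tuple[Pair, float]]:
    """Score undecided candidate pairs and mark the top ones as unsure."""
    scored: List[Tuple[Pair, float]] = []
    for left, right in pairs:
        if (left, right) in judgements:
            continue  # a judgement already exists; do not re-suggest the pair
        scored.append(((left, right), score(left, right)))

    # Highest-scoring candidates first, so reviewers see likely matches early.
    scored.sort(key=lambda item: item[1], reverse=True)
    top = scored[:limit]
    for pair, _ in top:
        judgements[pair] = "unsure"
    return top
```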
Both nk and nomenklatura invoke nomenklatura.cli:cli. Key commands:
- xref: generate dedupe candidates using blocking index
- dedupe: interactive TUI for judging candidates
- apply: apply resolver to entity stream (merge duplicates)
- match / enrich: enrichment pipeline commands
- load-resolver / dump-resolver: import/export resolver decisions
Environment variables: NOMENKLATURA_DB_URL, NOMENKLATURA_DB_POOL_SIZE, NOMENKLATURA_REDIS_URL, NOMENKLATURA_STATEMENT_TABLE, NOMENKLATURA_DUCKDB_MEMORY, NOMENKLATURA_DUCKDB_THREADS, NOMENKLATURA_LEVELDB_MAX_FILES.
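A settings lookup for two of these variables might look like the following sketch. The helper names, the exact spelling of the SQLite default URL, and the fallback pool size of 5 are assumptions; check the package's settings module for the real defaults.

```python
import os
from typing import Mapping


def db_url(env: Mapping[str, str] = os.environ) -> str:
    # Assumption: the SQLite default at nomenklatura.db is written as this URL.
    return env.get("NOMENKLATURA_DB_URL", "sqlite:///nomenklatura.db")


def db_pool_size(env: Mapping[str, str] = os.environ, default: int = 5) -> int:
    # Environment values are strings, so convert before use.
    # The fallback of 5 is a placeholder, not the package's documented default.
    return int(env.get("NOMENKLATURA_DB_POOL_SIZE", default))
```

Passing the mapping explicitly, as in `db_url({"NOMENKLATURA_DB_URL": "postgresql://localhost/nk"})`, makes the lookup easy to test without touching the process environment.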
The main classes are importable from the package root:

from nomenklatura import Resolver, Store, View, Judgement, Linker