repowise is an open-source, self-hostable codebase documentation engine. It generates a structured, hierarchical wiki for any codebase, keeps it accurate as code changes, and exposes everything through an MCP server so AI coding assistants can query it in real time.
This document covers how the system is built, why each piece exists, and how they fit together. Read this before contributing.
For per-package detail (installation, full API reference, all CLI flags, file maps):
| Package | README | What it covers |
|---|---|---|
packages/core |
packages/core/README.md |
Ingestion, generation, persistence, providers — all key classes with code examples |
packages/cli |
packages/cli/README.md |
All 10 CLI commands with every flag documented |
packages/server |
packages/server/README.md |
All REST API endpoints, 11 MCP tools, webhook setup, scheduler jobs |
packages/web |
packages/web/README.md |
Every frontend file with purpose — API client, hooks, components, pages |
- System Overview
- Repository Structure
- The Three Stores
- Provider Abstraction Layer
- Init Path — First-Time Documentation
- Maintenance Path — Keeping Docs in Sync
- Git Intelligence
- Dead Code Detection
- Decision Intelligence
- MCP Server
- REST API and Web UI
- Codebase Chat
- Data Flow Diagrams
- Key Design Decisions
- Editor File Generation
- Adding a New Language
- Adding a New LLM Provider
┌─────────────────────────────────────────────────────────────────────┐
│ Sources │
│ GitHub / GitLab repo Local filesystem │
└────────────────────┬────────────────────────┬───────────────────────┘
│ webhook / git diff │ repowise init/update
▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ Core Engine │
│ │
│ ┌─────────────────────┐ ┌──────────────────────────────┐ │
│ │ Ingestion Pipeline │ │ Generation Engine │ │
│ │ │ │ │ │
│ │ FileTraverser │─────▶│ ContextAssembler │ │
│ │ ASTParser │ │ PageGenerator (hierarchical) │ │
│ │ GraphBuilder │ │ JobSystem (resumable) │ │
│ │ ChangeDetector │ │ PromptTemplates (Jinja2) │ │
│ │ GitIndexer │ │ │ │
│ │ DeadCodeAnalyzer │ │ │ │
│ └─────────────────────┘ └──────────────┬───────────────┘ │
│ │ │
│ ┌───────────────▼──────────────┐ │
│ │ LLM Provider Layer │ │
│ │ Anthropic │ OpenAI │ Ollama │ │
│ │ │ LiteLLM │ │ │
│ └───────────────────────────────┘ │
└──────────────┬─────────────────────────┬────────────────────────────┘
│ │
▼ ▼
┌──────────────────────┐ ┌─────────────────────────────────────────┐
│ Three Stores │ │ Consumers │
│ │ │ │
│ SQL (wiki pages, │ │ Web UI MCP Server GitHub Action │
│ jobs, symbols, │ │ (Next.js) (10 tools) (CI/CD) │
│ versions) │ │ │
│ │ │ repowise CLI │
│ Vector (LanceDB / │ │ (init, update, watch, │
│ pgvector, semantic │ │ search, export, serve, mcp) │
│ search, RAG ctx) │ │ │
│ │ └─────────────────────────────────────────┘
│ Graph (NetworkX, │
│ dep graph, │
│ PageRank, SCC) │
└──────────────────────┘
repowise has two operational modes that share the same core engine but follow different strategies:
- Init — first-time documentation of an existing codebase. May take minutes to hours on large repos. Resumable. Supports batch API for cheaper generation.
- Maintenance — incremental updates triggered by git commits. Runs in seconds to minutes. Uses change propagation through the dependency graph to only regenerate what actually changed.
repowise/
├── packages/
│ ├── core/ # Python: ingestion + generation engine
│ │ ├── src/repowise/core/
│ │ │ ├── ingestion/
│ │ │ │ ├── traverser.py # file tree walking + gitignore + per-dir .repowiseIgnore + extra exclude patterns
│ │ │ │ ├── parser.py # ASTParser — one class, all languages; _extract_calls(), _extract_import_bindings()
│ │ │ │ ├── parsers/ # per-language parser helpers
│ │ │ │ ├── graph.py # NetworkX dep graph builder; add_file(), _resolve_calls(), file_subgraph()
│ │ │ │ ├── call_resolver.py # CallResolver — 3-tier call resolution engine (NEW)
│ │ │ │ ├── change_detector.py # git diff + change propagation
│ │ │ │ ├── git_indexer.py # git history mining → git_metadata table
│ │ │ │ ├── special_handlers.py # OpenAPI, Protobuf, GraphQL, Dockerfile, CI YAML
│ │ │ │ └── models.py # ParsedFile, FileInfo, Symbol, Import, CallSite, NamedBinding, EdgeType, etc.
│ │ │ ├── analysis/
│ │ │ │ └── dead_code.py # dead code detection (graph + SQL, no LLM)
│ │ │ ├── generation/
│ │ │ │ ├── page_generator.py # hierarchical page generator
│ │ │ │ ├── context_assembler.py# context assembly (RAG + graph + git + source)
│ │ │ │ ├── job_system.py # resumable job state machine
│ │ │ │ ├── models.py # GeneratedPage, GenerationConfig, PageType, etc.
│ │ │ │ ├── templates/ # Jinja2 prompt templates (one per page type)
│ │ │ │ └── editor_files/ # CLAUDE.md / cursor.md generation (no LLM)
│ │ │ │ ├── base.py # BaseEditorFileGenerator (marker-merge, atomic write)
│ │ │ │ ├── data.py # EditorFileData + sub-dataclasses (frozen)
│ │ │ │ ├── fetcher.py # EditorFileDataFetcher (DB → EditorFileData)
│ │ │ │ ├── tech_stack.py # filesystem tech stack + build command detection
│ │ │ │ └── claude_md.py # ClaudeMdGenerator subclass
│ │ │ ├── pipeline/
│ │ │ │ ├── orchestrator.py # run_pipeline(), PipelineResult
│ │ │ │ ├── persist.py # persist_pipeline_result() — shared by CLI + server
│ │ │ │ └── progress.py # ProgressCallback protocol + LoggingProgressCallback
│ │ │ ├── persistence/
│ │ │ │ ├── models.py # SQLAlchemy ORM models
│ │ │ │ ├── crud.py # async CRUD layer
│ │ │ │ ├── database.py # engine factory, session management
│ │ │ │ ├── search.py # SQLite FTS5 full-text search
│ │ │ │ ├── vector_store.py # VectorStore abstraction (LanceDB / pgvector)
│ │ │ │ └── embedder.py # Embedder base class + MockEmbedder
│ │ │ ├── providers/ # LLM provider abstraction
│ │ │ │ ├── base.py # BaseProvider, GeneratedResponse, ProviderError
│ │ │ │ ├── anthropic.py
│ │ │ │ ├── openai.py
│ │ │ │ ├── ollama.py
│ │ │ │ ├── litellm.py
│ │ │ │ ├── mock.py
│ │ │ │ └── registry.py # get_provider(), register_provider(), list_providers()
│ │ │ └── rate_limiter.py # token-bucket RPM + TPM limiter
│ │ └── queries/ # tree-sitter .scm query files (one per language)
│ │ ├── python.scm
│ │ ├── typescript.scm
│ │ ├── javascript.scm
│ │ ├── go.scm
│ │ ├── rust.scm
│ │ ├── java.scm
│ │ ├── cpp.scm
│ │ ├── c.scm
│ │ ├── ruby.scm
│ │ ├── kotlin.scm
│ │ ├── csharp.scm
│ │ ├── swift.scm
│ │ ├── scala.scm
│ │ └── php.scm
│ │
│ ├── server/ # Python: FastAPI REST API + MCP server
│ │ └── src/repowise/server/
│ │ ├── routers/ # FastAPI routers (repos, pages, jobs, symbols, graph, git, dead-code, decisions, search, claude-md)
│ │ ├── mcp_server/ # MCP server package (10 tools, split into focused modules)
│ │ ├── webhooks/ # GitHub + GitLab handlers
│ │ ├── job_executor.py # Background pipeline executor — bridges REST endpoints to core pipeline
│ │ └── scheduler.py # APScheduler background jobs
│ │
│ ├── cli/ # Python: repowise CLI (click + rich)
│ │ └── src/repowise/cli/
│ │ └── commands/ # init, update, watch, serve, search, export, status, doctor, dead-code, decision, mcp, reindex, generate-claude-md
│ │
│ └── web/ # Next.js 15 frontend
│ ├── src/app/ # App Router pages (dashboard, wiki, search, graph, symbols, overview, …)
│ ├── src/components/ # UI primitives, layout, wiki, repos, jobs, settings, dashboard
│ └── src/lib/ # API client, SWR hooks, utilities (health-score, format), design tokens
│
├── integrations/
│ ├── github-action/ # action.yml + Dockerfile entrypoint
│ └── github-app/ # GitHub App webhook handler
│
├── docker/
│ ├── Dockerfile # multi-stage: Python + Node build + final runtime
│ └── docker-compose.yml # with optional Redis profile
│
├── tests/
│ ├── unit/ # unit tests (no LLM calls, no filesystem)
│ ├── integration/ # integration tests using sample_repo fixture
│ ├── e2e/ # full repowise init + update flows
│ └── fixtures/
│ └── sample_repo/ # 30-file multi-language fixture repo
│
└── docs/ # repowise's own generated wiki (dogfooding)
repowise uses three separate storage systems. They are not redundant — each answers a fundamentally different kind of question that the other two cannot answer efficiently.
Answers: what exists, what changed, when.
The source of truth for all structured data. SQLite in development and single-server deployments; PostgreSQL for multi-worker production deployments. The schema is identical for both — SQLAlchemy abstracts the difference.
Key tables:
| Table | Purpose |
|---|---|
repos |
Registered repositories, sync state, provider config |
wiki_pages |
All generated wiki pages with content, metadata, confidence score, and a short LLM-extracted summary (1–3 sentences) used by get_context to keep responses bounded |
page_versions |
Full version history of every page (for diff view) |
symbols |
Symbol index: every function, class, method across all files |
answer_cache |
Memoised get_answer responses keyed by (repository_id, question_hash) plus the provider/model used. Repeated questions return at zero LLM cost; cache entries are invalidated by repository re-indexing. |
generation_jobs |
Job state machine with checkpoint fields for resumability |
webhook_events |
Every received webhook event (deduplication, audit, retry) |
symbol_rename_history |
Detected renames for auditing and targeted text patching |
graph_nodes / graph_edges |
SQLite-backed graph for repos exceeding 30K nodes |
git_metadata |
Per-file git history: commit counts, ownership, co-change partners, hotspot/stable flags |
dead_code_findings |
Dead code findings: unreachable files, unused exports, zombie packages |
If you delete the SQL store, you lose everything and must re-run repowise init.
Answers: what is semantically similar to this query.
repowise uses a VectorStore abstraction with two backends, selected automatically
based on the configured SQL backend:
LanceDB (default — SQLite mode)
LanceDB runs embedded as a library — no separate server process. Data is stored in
.repowise/lancedb/ using the Lance columnar format. This makes self-hosting trivial
and keeps the Docker setup simple. LanceDB is significantly faster than ChromaDB on
both write throughput (batch embedding) and ANN query latency, and it requires no
C++ build tools to install.
pgvector (PostgreSQL mode)
When repowise is configured with a PostgreSQL database, the wiki_pages table gains
an embedding vector(N) column via the pgvector PostgreSQL extension. Embeddings
are stored directly in the same SQL database — no second storage system required.
Vector similarity search uses <=> (cosine distance) with an HNSW index. This is
the preferred backend for multi-worker production deployments.
Every generated wiki page is embedded and stored immediately after generation. The vector store is used in two distinct ways:
During generation (RAG context): When generating a wiki page for file A, the
ContextAssembler queries the vector store with A's exported symbol names to find
pages for files that A imports from. These are included as context in the generation
prompt. Each generated page is aware of what its dependencies actually do —
not just their names. See Section 5.4 for details.
During search and MCP queries: The search_codebase MCP tool and the web UI
search page use the vector store to find the most semantically relevant pages for a
natural language query. This is better than full-text search for questions like
"how does authentication work?" or "where is rate limiting handled?".
If you delete the vector store (LanceDB directory or pgvector embeddings), search
quality degrades and generation context becomes shallower — rebuild it by running
repowise reindex which re-embeds all existing SQL pages into LanceDB using
the configured embedder (Gemini or OpenAI). No LLM calls — only embedding API calls.
Answers: how are things connected, and how important are they.
The dependency graph is a two-tier directed graph where file nodes represent source
files and symbol nodes represent individual functions, classes, and methods. Edges
include imports, DEFINES, HAS_METHOD, CALLS (with confidence 0.0–1.0),
inherits, implements, and co_changes. CALLS edges are built by CallResolver
using 3-tier resolution; all others are built by GraphBuilder during AST ingestion.
It is built by the ASTParser + GraphBuilder + CallResolver during ingestion and
persisted to .repowise/graph.json (for repos ≤ 30K nodes) or the graph_nodes/graph_edges
SQL tables (for larger repos, using the networkit library as a drop-in). The graph_nodes
table includes a kind column (file, symbol, package, external) and confidence column;
graph_edges includes a confidence column for CALLS edges (Alembic migration 0015).
The graph is used for:
- Generation ordering — topological sort determines what to generate first (files with no dependents get generated before files that import them, so the richer context is available via RAG when the importing file is generated)
- Change propagation — when a file changes, walk the graph to find all pages that reference its symbols and mark them as stale
- PageRank — runs on
file_subgraph()(file + package nodes only) to identify the most central files; these get "spotlight" wiki pages and richer generation prompts - SCC detection — circular dependency clusters require a special generation strategy (see Section 5.3)
- Co-change edges — temporal coupling from git history. Files that frequently
change together (but may have no import relationship) get
co_changesedges. These participate in change propagation and are shown in the graph visualization as dashed purple lines. They do NOT affect PageRank. - Dead code detection — files with
in_degree == 0(no importers) are candidates for unreachable file detection - MCP
get_dependency_pathtool — answers "how is module A connected to module B?" - D3 graph visualization in the web UI
If you delete the graph, repowise loses change propagation and generation ordering. It can be rebuilt from scratch by re-parsing the source files.
Every LLM call in the entire system goes through LLMProvider. No provider SDK
is ever imported from business logic packages.
┌─────────────────┐
│ LLMProvider │ (abstract base class)
│ │
│ generate() │
│ generate_stream│
│ embed() │
│ generate_batch │ (optional, default = sequential)
│ estimate_cost │ (optional, returns None if unknown)
└────────┬────────┘
│
┌──────────────────┼──────────────────┬──────────────────┐
▼ ▼ ▼ ▼
AnthropicProvider OpenAIProvider OllamaProvider LiteLLMProvider
(claude-*) (gpt-*, any (any local model, (100+ providers,
batch API + prompt OpenAI-compat fully offline, optional dep)
caching support) endpoint) no API key)
Each provider instance wraps a RateLimiter using a token-bucket algorithm with
two independent buckets: requests-per-minute (RPM) and tokens-per-minute (TPM).
Before every API call, the limiter acquires from both buckets. On a 429 response,
it calls on_rate_limit_error() which applies exponential backoff and temporarily
reduces the refill rate. This is transparent to all callers.
Default limits are configured per provider in .repowise/config.yaml and can be
adjusted for users with higher API tiers.
For repowise init, the Anthropic provider supports the Message Batches API:
instead of firing concurrent streaming requests, all file-page generation requests
for a given level are submitted as a single batch, which is ~50% cheaper and
processes asynchronously (typically within 1 hour).
Batch mode is enabled by default for repowise init when using the Anthropic
provider. Pass --no-batch to use streaming instead (faster wall-clock time,
higher cost).
For Anthropic: the ContextAssembler marks the system prompt + shared repository
context with cache-control breakpoints. Across the hundreds of file-page generation
calls during a large init, this shared prefix is only billed once. Cost reduction
on large repos is typically 60–90%.
Implement LLMProvider, add an entry to LANGUAGE_CONFIGS in providers/registry.py,
and add a section to .repowise/config.yaml. See Section 12.
repowise init runs when documenting a codebase for the first time. It is the
expensive, one-time operation that builds the full wiki from scratch.
FileTraverser walks the repository tree and produces a FileInfo for every
file that should be documented. It respects six layers of exclusion, applied in
priority order:
.gitignore(parsed withpathspec, not simple glob matching)- Root
.repowiseIgnore(same syntax, user-defined at repo root) - Per-directory
.repowiseIgnore— loaded from each directory visited during theos.walk. Patterns are relative to the directory containing the file (like git's per-directory.gitignore). A spec per directory is loaded once and cached for the traversal. Example:generated/insrc/.repowiseIgnoreskipssrc/generated/without affecting other directories with the same name. extra_exclude_patterns(constructor param) — additional gitignore-style patterns passed at runtime from--exclude/-xCLI flags orrepo.settings["exclude_patterns"](set via Web UI or REST API PATCH). Applied to both directory pruning (entire subtree skipped) and individual file filtering.- Hardcoded blocklist (
node_modules,.git,__pycache__,dist,build,*.lock,*.min.js, generated protobuf files, etc.) - Auto-detection of generated files (files with
// Code generatedheaders) - Binary files (detected by null bytes in first 8KB)
- Files over
max_file_size_kb(default: 500KB)
Constructor signature:
FileTraverser(
repo_root: Path,
*,
max_file_size_kb: int = 500,
extra_ignore_filename: str = ".repowiseIgnore",
extra_exclude_patterns: list[str] | None = None,
)Where extra_exclude_patterns comes from:
| Source | How patterns reach traverser |
|---|---|
repowise init -x vendor/ -x 'src/gen/**' |
Merged with config.yaml exclude_patterns, passed directly |
repowise update |
Read from .repowise/config.yaml exclude_patterns |
| Web UI Excluded Paths section | Saved to repo.settings["exclude_patterns"] via REST API |
| Server sync job | Read from repo.settings.get("exclude_patterns", []) |
The traverser also detects monorepo structure by looking for multiple
pyproject.toml, package.json, go.mod, or Cargo.toml files at depth 1–2.
When detected, each package is documented as a semi-independent unit with
cross-package edges tracked in the graph.
Each FileInfo is tagged with: language, is_test, is_config, is_api_contract,
is_entry_point, git_hash. These tags influence generation priority and prompt choice.
Test files are first-class wiki targets. The page generator includes any file
tagged is_test=True that has at least one extracted symbol, even if the file's
PageRank is near zero (which is typical: nothing imports test files back, so
graph-centrality metrics never select them on their own). Test files answer
questions of the form "what test exercises X" / "where is Y verified", and
the doc layer is the right place to surface those. Users who want to exclude
tests from the wiki entirely can pass --skip-tests to repowise init.
ASTParser is a single class that handles all supported languages. There are no
per-language subclasses.
How it works:
- Calls
get_parser(language)fromtree-sitter-languagesto get the right grammar - Parses the source into a tree-sitter parse tree
- Loads the corresponding
queries/<language>.scmfile (cached after first load) - Runs tree-sitter queries to extract symbols, imports, and exports
- Maps results through
LANGUAGE_CONFIGS[language]for language-specific rules (visibility keywords, entry point patterns, etc.) - Returns a
ParsedFilewith a consistent shape regardless of language
The .scm query files use standard tree-sitter S-expression syntax with
consistent capture name conventions across all languages:
@symbol.def— the full symbol node@symbol.name— the name identifier@symbol.params— parameter list@symbol.return_type— return type annotation@symbol.docstring— existing docstring (expanded, not repeated, by the LLM)@import.statement— full import node@import.module— the module being imported
Adding a new language = write one .scm file + add one entry to LANGUAGE_CONFIGS.
No changes to ASTParser itself. See Section 11.
Special handlers live in core/special_handlers/ and are separate from ASTParser.
They handle file types that are not programming languages (OpenAPI specs, Dockerfiles,
GitHub Actions YAML, Makefiles, Protobuf, GraphQL schemas). They produce the same
ParsedFile shape but use purpose-built parsers (PyYAML, graphql-core, etc.)
instead of tree-sitter.
GraphBuilder takes all ParsedFile outputs and builds a networkx.DiGraph.
Node types:
file— every source filesymbol— every function, class, method, interface, etc. (added viaadd_file())package— every package/module directoryexternal— third-party packages (lightweight node, not fully documented)
Edge types:
imports— file A imports from file BDEFINES— file A defines symbol BHAS_METHOD— class A has method BCALLS— symbol A calls symbol B (with confidence score 0.0–1.0)inherits— class A extends class Bimplements— class A implements interface Binstantiates— code in A creates an instance of Breferences— looser reference (type annotation, generic, etc.)re-exports— A re-exports symbols from B (barrel files)inter_package— edge crossing package boundary (monorepos)co_changes— A and B frequently change in the same commit (from git history, added byGitIndexerafter graph construction). Weight = co-change count. Filtered out of PageRank but included in change propagation and visualization.
Call resolution is handled by the CallResolver module (ingestion/call_resolver.py),
which runs after the static import graph is built. It operates in three tiers:
- Same-file resolution (confidence 0.95) — call target defined in the same file
- Import-scoped resolution (confidence 0.85–0.93) — target matched via named bindings from the file's import list
- Global unique match (confidence 0.50) — target is unique across the whole repo
Call sites are extracted by tree-sitter for all 14 supported languages (Python, TypeScript,
JavaScript, Go, Rust, Java, C++, C, Kotlin, Ruby, C#, Swift, Scala, PHP) using per-language .scm query files. Results are stored
as CallSite dataclasses and become CALLS edges in the graph.
Named binding resolution (NamedBinding dataclass in ingestion/models.py) ensures
that aliased imports, barrel re-exports, and namespace imports resolve to the correct
definition site. The parser's _extract_import_bindings() produces bindings for each
import statement, and GraphBuilder.build() populates Import.resolved_file from them.
Barrel files (__init__.py, index.ts) are followed one hop to resolve re-exports.
Two-tier graph isolation:
Symbol nodes and their DEFINES/HAS_METHOD/CALLS edges are stored in the same
DiGraph as file nodes, but file_subgraph() returns a view containing only file
and package nodes. All file-level metrics (PageRank, betweenness, SCCs, Louvain)
run on this subgraph so that the large number of symbol nodes does not distort centrality
scores.
After graph construction, the builder computes:
- PageRank — nodes with high PageRank are central and well-connected. Used to decide generation priority and which symbols get spotlight pages.
- Strongly Connected Components (SCCs) — groups of files with circular imports. Logged as warnings. Require special generation handling (see below).
- Betweenness centrality — identifies "bridge" symbols whose removal would disconnect the graph. These are the most critical to document well.
- Community detection (Louvain) — discovers logical modules even when the directory structure doesn't reflect them. These communities become module pages.
Circular dependency handling:
Files in an SCC cannot be generated in strict bottom-up order because each one depends on the others. Strategy:
- Generate each SCC member with reduced context (own source + signatures of cycle partners only, no full docs)
- Generate a dedicated
SCCwiki page explaining the cycle and how to navigate it - In a second pass, upgrade each member's page with cross-references to the others
Graph scalability:
For repos with fewer than 30K nodes, the graph lives in memory as a NetworkX
DiGraph and is serialized to .repowise/graph.json.
For repos exceeding 30K nodes (configurable via graph_backend: sqlite), repowise
switches to networkit with SQLite backing via the graph_nodes and graph_edges
tables. Only the subgraph needed for each operation is loaded into memory. The
API is identical — this is transparent to all callers.
This is the most important quality driver in the system. The ContextAssembler
builds the prompt context for each generation call — it does not just dump the
raw source file.
For a given file page, it assembles context in this priority order (dropping lower-priority items if the token budget is exceeded):
- Source code — full source (or chunked if file is large)
- Symbol signatures — all symbols extracted from this file with their signatures and existing docstrings
- Graph context — PageRank score, cluster membership, which module this belongs to, entry point status
- Git context — ownership, significant commit messages explaining why code evolved, hotspot/stable classification, co-change partners. This transforms documentation from "here is what this code does" into "here is what it does and why it was written this way." Git context is the last thing dropped when over budget — it is more valuable than import summaries.
- Import summaries — for each file this file imports from: the summary of that file's already-generated wiki page (if available), or just its public API signatures (if not yet generated)
- RAG context — vector store similarity search (LanceDB or pgvector) using this file's top exported
symbols as the query. Returns the top 3 most relevant already-generated pages.
This propagates understanding upward:
AuthService's page will know whatUserRepositoryactually does, not just that it imports from it. - Co-change context — wiki pages for co-change partners (files that change together without an import relationship). Reveals hidden coupling.
- Dead code findings — symbols in this file flagged as unused (if any). Listed in the generation prompt so the LLM notes them as cleanup candidates.
- Reverse import context — which files import this file, and what they use from it. Helps the LLM understand how this file is used in practice.
Token budget: the total assembled context targets 12K tokens, leaving room for the generation output. Items are dropped in reverse priority order when over budget. The source code is never dropped — it is chunked instead if too large.
Large file chunking:
Files over large_file_threshold_kb (default: 100KB) are handled differently.
Rather than passing the full source, repowise:
- Extracts public symbol signatures only for the top-level context
- Generates one sub-page per major class or function group
- Synthesizes a file page from the sub-pages
Pages generated from chunks are tagged
chunked: truein metadata.
Generation must follow a strict dependency-aware order. This is not optional — generating a module page before its file pages means the module page has no content to draw from.
Level 0: External API contracts (OpenAPI, proto, GraphQL)
→ self-contained, no dependencies on other pages
Level 1: Symbol spotlight pages (top 10% by PageRank)
→ short, fast, parallelizable
→ high-PageRank symbols get richer documentation
Level 2: File pages
→ uses: source + symbol docs + import summaries (from RAG)
→ parallelizable within this level
→ SCC members generate with stubs, upgraded in second pass
Level 3: SCC pages
→ must wait for all member files to complete Level 2
Level 4: Module/package pages
→ uses: all file pages within the module
Level 5: Cross-package relationship pages (monorepos only)
→ uses: all package pages
Level 6: Repository overview + architecture diagram
→ uses: all module pages + graph metrics
Level 7: Config/infra pages (Dockerfile, CI YAML, Makefile, etc.)
→ references code pages, so comes after all code docs
Level 8: Index pages (symbol index, search index)
→ built from all completed pages
Within each level, up to concurrent_jobs tasks run in parallel using
asyncio.Semaphore. The generation engine never exceeds this limit regardless
of how large the repo is.
Each generated page gets a confidence_score of 1.0 when first created.
Init on a 50K-file repo can take 30–90 minutes. A crash or network interruption should not require starting over.
The JobSystem persists checkpoint state after every completed page:
checkpoint_level— which generation level is currently activecheckpoint_file_index— position within the current levelcompleted_page_ids— list of already-generated page IDsfailed_page_ids— pages that failed (retried on resume)
On repowise init --resume, the job is loaded from the database, completed pages
are skipped, and generation continues from the last checkpoint.
repowise init is fully idempotent. Running it twice produces the same result.
Running it after a partial previous run completes only the remaining pages.
The persistence logic for storing a PipelineResult into the database (graph nodes,
edges, symbols, pages, git metadata, dead code findings, decision records) was
extracted from the CLI's init_cmd.py into core/pipeline/persist.py. Both the
CLI and the server's background job executor call persist_pipeline_result() — zero
duplication.
FTS indexing is intentionally excluded from this function. Callers must run it separately after the session closes to avoid SQLite write-lock conflicts.
The server can now run full pipeline jobs in the background, triggered by the
POST /api/repos/{id}/sync and POST /api/repos/{id}/full-resync endpoints.
execute_job() is the single entry point, launched via asyncio.create_task():
- Marks the job as
running - Resolves the LLM provider from server config
- Runs
run_pipeline()with aJobProgressCallbackthat writes progress to theGenerationJobtable (the SSE stream endpoint polls this table) - Persists results via
persist_pipeline_result() - Marks the job as
completed(orfailedon error)
Progress updates are batched (every 5 items) to avoid per-item DB overhead. Before
writing the final job status, all in-flight progress tasks are drained to prevent a
late running update from overwriting completed.
Concurrent pipeline runs on the same repository are prevented at the endpoint level (returns HTTP 409 if a pending/running job already exists).
The pipeline orchestrator now keeps the event loop responsive during CPU-bound work:
- File I/O uses
asyncio.wrap_future()instead of blockingas_completed() - Graph building runs in a thread via
asyncio.to_thread() - The parse loop yields control every 50 files with
asyncio.sleep(0) - Thread pool shutdown is non-blocking via
asyncio.to_thread()
repowise update runs after a git push (triggered by webhook, GitHub Action, or
polling fallback). It is fast, targeted, and avoids regenerating pages that
haven't actually changed in a meaningful way.
ChangeDetector uses GitPython to compute the diff between last_sync_commit
and HEAD:
changed_files = differ.get_changed_files(repo_path, since_commit="a1b2c3d")
# → ChangedFiles(added=[...], modified=[...], deleted=[...], renamed=[...])Renamed files get special treatment: repowise updates all source_files references
in existing wiki pages and regenerates the file page (since the path context has
changed), but does not regenerate pages that only reference the file's symbols
(symbol names don't include paths).
ChangeDetector.detect_symbol_renames() uses a heuristic to identify when a
symbol was renamed rather than deleted-and-recreated:
- Same
kind(function → function) - Similar signature (Levenshtein distance under threshold)
- Git blame on the new file confirms line provenance
When a rename is detected, repowise applies a targeted text patch to all pages that only mention the old name (cheaper than full regeneration), and fully regenerates pages that document the old symbol.
All detected renames are stored in symbol_rename_history for audit purposes.
For each changed file, ChangeDetector.get_affected_pages() walks the dependency
graph to find all pages that need updating:
Changed file → find its symbols
→ find all wiki pages that document those symbols (direct)
→ find all wiki pages that reference those symbols (1-hop: inherits/calls)
→ find all wiki pages that reference referencing pages (2-hop: looser reference)
→ apply cascade budget
Cascade budget:
A change to a central utility module (e.g., utils.py imported by 200 files) would
naively require regenerating 200+ pages on every push. This is too expensive.
The cascade_budget config option (default: 30 pages per maintenance run) caps the
number of pages fully regenerated per push. Pages beyond the budget:
- Have their
confidence_scoredecayed (see below) - Are added to the background staleness queue
- Are regenerated by the nightly background job
Every wiki page has a confidence_score (float 0.0–1.0) representing how fresh
it is relative to the source code:
| Score | Status | Badge | Action |
|---|---|---|---|
| ≥ 0.80 | fresh | green | none |
| 0.60–0.79 | stale | yellow | queued for background regen |
| 0.30–0.59 | outdated | red | auto-queued, warning shown |
| < 0.30 | unusable | red (prominent) | force-regenerated immediately |
Decay rules applied on each maintenance run:
- Source file directly changed:
score *= 0.85 - Referenced symbol changed (1-hop: calls/inherits):
score *= 0.95 - Referenced symbol changed (2-hop: looser reference):
score *= 0.98 - Co-change partner changed (no import relationship):
score *= 0.97 - Page beyond cascade budget (time-based, 1 week since change):
score *= 0.90
Git-informed decay modifiers (applied multiplicatively on top of base decay):
- Hotspot file (
is_hotspot=True) → decays faster: direct*= 0.94, 1-hop*= 0.95 - Stable file (
is_stable=True) → decays slower: direct*= 1.03 - Large change in commit message ("rewrite", "refactor", "migrate") → hard decay:
direct
*= 0.71, 1-hop*= 0.84 - Cosmetic change in commit message ("typo", "lint", "format") → soft decay:
direct
*= 1.12
Pages start at 1.0 and decay over time. Regeneration resets to 1.0.
Webhooks can be missed when the repowise server is down during a push. repowise uses a two-layer sync strategy:
Layer 1 (real-time): GitHub/GitLab webhook → POST /api/webhooks/github →
signature verification → store in webhook_events → enqueue GenerationJob.
Response is always 200 OK immediately — processing is async.
Layer 2 (polling fallback): APScheduler job runs every 15 minutes (configurable).
Compares repos.last_sync_commit against actual HEAD via GitHub API or GitPython.
If they differ, triggers an incremental update for the missing commits.
The combination ensures: even if a webhook is missed entirely, docs are at most
polling_interval minutes stale for the default branch.
APScheduler runs a nightly job (configurable via cron expression) to:
- Query all pages where
confidence_score < staleness_regen_threshold(default: 0.60) - Sort by
confidence_score ASC(most stale first) - Regenerate up to
background_regen_budgetpages (default: 100) - Log token usage and cost
This ensures no page stays stale indefinitely regardless of cascade budget constraints.
On pull_request webhook events, repowise runs ChangeDetector in dry-run mode
and posts a GitHub PR comment listing:
- Pages that will be regenerated (with links to current versions)
- Pages that will have confidence decay
- New pages that will be created (new files)
- Pages that will be deleted (deleted files)
- Estimated token cost for this PR's docs update
On merge, the actual incremental update runs.
repowise mines git history to make documentation significantly richer and more useful.
The GitIndexer runs once during repowise init (after graph construction, before
generation) and incrementally during repowise update. All git features degrade
gracefully when git metadata is unavailable — they simply skip git-enriched context.
The GitIndexer class mines git history into the git_metadata SQL table. For each
tracked file, it computes:
- Commit volume — total, last 90 days, last 30 days (churn signals)
- Timeline — first and last commit dates (file age)
- Ownership — primary owner from
git blame(who wrote the most lines), top 3 contributors by commit count - Significant commits — last 10 meaningful commit messages (filtered: no merges, no dependency bumps, no chore/ci, messages > 20 chars). These explain why the code evolved this way and are included in generation prompts.
- Co-change partners — files that changed in the same commit >= 3 times, even without an import relationship. Reveals hidden structural coupling.
- Derived signals —
is_hotspot(top 25% churn AND complexity),is_stable(>10 commits, 0 in 90 days),churn_percentile(0.0–1.0)
Performance targets:
- 3,000 files, 10K commits → < 3 minutes
- 50,000 files, 100K commits → < 20 minutes (uses shallow history for large repos)
- Uses
asyncio.Semaphore(20)to parallelize git log calls across files
Generation prompts: The file_page.j2 template includes optional git context
blocks (gated on {% if git_metadata %}). These add ownership attribution, evolution
context from significant commits, hotspot/stable warnings, and co-change partner
documentation. The LLM uses commit messages to explain why code is structured as
it is, not just what it does.
Generation ordering: Within each level, files are sorted by priority: entry points first, then hotspots, then high PageRank, then high commit count. This ensures the most important files generate first, making them available as RAG context for less important files.
Generation depth: Files can be auto-upgraded to "thorough" depth if they are
hotspots, have > 100 commits with > 10 in 90 days, have >= 8 significant commits,
or have co-change partners. Conversely, stable + low PageRank + low commit count
files can be auto-downgraded to "minimal". Controlled by git.depth_auto_upgrade.
Maintenance prompts: When regenerating during repowise update, the specific
commit that triggered the update (SHA, author, message, diff) is included in the
prompt. This produces targeted updates rather than full rewrites.
Confidence decay: Git signals modify the base confidence decay multiplicatively. Hotspot files decay faster (bugs are more likely). Stable files decay slower. Large changes ("rewrite", "refactor", "migrate" in commit message) trigger aggressive decay. Cosmetic changes ("typo", "lint", "format") trigger softer decay.
Co-change edges: After GitIndexer runs, co-change edges (edge_type="co_changes")
are added to the dependency graph. They participate in change propagation (when file
A changes, its co-change partners get mild confidence decay at factor 0.97) and are
visible in the graph visualization as dashed purple lines.
CLAUDE.md generation: repowise generate-claude-md includes sections for
hotspots, stable core files, ownership map, and hidden coupling pairs.
Module pages include a team ownership summary: who maintains the most files, who was most active recently. Repo overview pages include codebase health signals: hotspot count, stable file count, top churn files, oldest file.
repowise detects unreachable files, unused exports, and zombie packages using graph traversal and SQL queries. No LLM calls — the analysis completes in < 10 seconds for any repo size.
The analyzer runs after GitIndexer during init (Step 3.6) and optionally during
repowise update. It produces findings with confidence scores:
Finding types:
unreachable_file— file within_degree == 0, not an entry point, test, or configunused_export— public symbol with no incoming edges (but file IS imported)unused_internal— private symbol with nocallsedges from same filezombie_package— monorepo package with no incominginter_packageedges
Confidence scoring (conservative — when in doubt, do NOT flag):
- Unreachable + no commits in 90d + last commit > 6 months → 1.0
- Unreachable + no commits in 90d → 0.7
- Unreachable but recently touched → 0.4
safe_to_deleteonly set at confidence >= 0.7, and NOT for files matching dynamic patterns (*Plugin*,*Handler*,*Adapter*,*Middleware*)
Never flagged as dead:
__init__.pypublic re-exports@pytest.fixture,@pytest.mark.*symbols- Files matching
*migrations*,*schema*,*seed* - TypeScript
.d.tsfiles - Files where
is_api_contract == True - Files in
.repowise/dead_code_whitelist.txt - Symbols matching
config.dead_code.dynamic_patterns
The file_page.j2 template includes an optional block for dead code findings.
When present, the LLM notes unused symbols as cleanup candidates in the
documentation, including confidence percentages and safety assessment.
repowise dead-code [PATH]
--min-confidence FLOAT (default: 0.4)
--safe-only only show safe_to_delete=True findings
--kind TYPE filter by kind
--package PKG filter by monorepo package
--format table|json|md
--output FILE
repowise dead-code resolve [FINDING_ID]
--status acknowledged|resolved|false_positive
--note "reason"
Dead code findings are stored in the dead_code_findings SQL table with a status
field (open / acknowledged / resolved / false_positive). The web UI shows
a dedicated dead code page with sortable tables, confidence sliders, and resolve
buttons. The graph visualization supports a "show dead code" filter that adds red
dashed borders to dead code nodes.
The Decision Intelligence layer captures architectural decisions — the why behind how the system is built, what alternatives were rejected, and what constraints exist. While documentation describes what, decisions capture why.
Decisions are extracted from four sources, each with a different confidence level:
| Source | Confidence | Status | Trigger |
|---|---|---|---|
Inline markers (# WHY:, # DECISION:, # TRADEOFF:, # ADR:, # RATIONALE:, # REJECTED:) |
0.95 | active |
File scanning during init/update |
| Git archaeology (commit messages with migration/refactor signals) | 0.70–0.85 | proposed |
Init only, reuses git layer data |
| README/docs mining (implicit decisions in prose) | 0.60 | proposed |
Init only, LLM extraction |
CLI capture (repowise decision add) |
1.00 | active |
Manual entry |
Stored in the decision_records SQL table. JSON arrays for alternatives, consequences,
affected files, modules, tags, and evidence commits. Deduplication key:
(repository_id, title, source, evidence_file).
Decisions have a staleness_score (0.0 = fresh, 1.0 = very stale) computed during
repowise init and recomputed on every repowise update. Staleness rises when affected
files receive commits after the decision was recorded. Decisions with staleness_score > 0.5
are flagged as stale.
get_why(query?)— three modes: natural language search over decisions, path-based lookup for decisions governing a file, or no-arg for decision health dashboard (stale decisions, ungoverned hotspots, proposed decisions needing review)get_context(targets, include?)— includes decisions governing each target in its response
repowise decision add # interactive capture
repowise decision list # tabular list with filters
repowise decision show <id> # full detail
repowise decision confirm # proposed → active
repowise decision dismiss # delete proposed
repowise decision deprecate # active → deprecated
repowise decision health # health summary
| File | Purpose |
|---|---|
core/analysis/decision_extractor.py |
All 4 capture sources + staleness computation |
core/persistence/models.py |
DecisionRecord ORM model |
core/persistence/crud.py |
8 decision CRUD functions |
server/mcp_server/tool_why.py |
MCP tool get_why (3-mode: search, path, health dashboard) |
server/routers/decisions.py |
REST API endpoints |
cli/commands/decision_cmd.py |
CLI command group (7 subcommands) |
The MCP server is repowise's most valuable integration feature. It exposes the entire wiki as a set of queryable tools that any MCP-compatible AI assistant can call in real time.
Instead of an AI assistant reading 40 source files to understand a codebase,
it calls get_overview() and gets a structured, always-current architecture summary.
Instead of calling 5 tools one at a time, it calls get_context(["src/auth/service.py", "AuthService"]) and gets docs, ownership, decisions, and freshness for all targets in one call.
The server is implemented using the MCP Python SDK and supports two transports:
- stdio — for Claude Code, Cursor, Cline (add to their MCP config)
- SSE — for web-based MCP clients (served on port 7338)
| Tool | What it answers | When to call |
|---|---|---|
get_overview |
Architecture summary, module map, entry points. | First call when exploring an unfamiliar codebase. |
get_context(targets, include?) |
Docs, ownership, history, decisions, freshness for files/modules/symbols. Pass multiple targets in one call. | When you need to understand specific code before reading or modifying it. |
get_risk(targets) |
Hotspot score, dependents, co-change partners, risk summary per target. Also returns top 5 global hotspots. | Before modifying files — assess what could break. |
get_why(query?) |
Three modes: NL search over decisions, path-based decisions for a file, no-arg health dashboard. | Before making architectural changes — understand existing intent. |
update_decision_records(action, ...) |
Full CRUD on decision records: create, update, update_status, delete, list, get. | After every coding task — record new decisions and keep existing ones current. |
search_codebase(query) |
Semantic search over the full wiki. Natural language. | When you don't know where something lives. |
get_dependency_path(from, to) |
Connection path between two files/modules in the dependency graph. | When you need to understand how two things are connected. |
get_dead_code |
Dead/unused code findings sorted by confidence and cleanup impact. | Before cleanup tasks. |
get_architecture_diagram |
Mermaid diagram for repo or specific module. | For documentation or presentation. |
get_answer |
One-call RAG: confidence-gated synthesis with cited answers and question cache. | First call on any code question — collapses search → read → reason. |
get_symbol |
Resolve a qualified symbol id to source body, signature, and docstring. | When the question names a specific class, function, or method. |
annotate_file |
Attach human-authored notes to a wiki page — survives re-indexing. | Adding rationale, known issues, or context the LLM shouldn't overwrite. |
repowise init automatically generates .repowise/mcp.json with ready-to-paste
config blocks for Claude Code (~/.claude/claude.json), Cursor (.cursor/mcp.json),
and Cline. This config is printed at the end of repowise init.
Served on port 7337 alongside the web UI. All endpoints are prefixed with /api/.
Key routers:
/api/repos— register repos, trigger sync, full-resync (now launches background pipeline jobs with concurrent-run prevention)/api/pages— read pages, version history, force-regenerate single page/api/search— semantic (LanceDB or pgvector) and full-text (SQLite FTS5 / PostgreSQL tsvector) search/api/jobs— job status, SSE stream for live progress updates/api/symbols— symbol lookup, dependency path queries/api/graph— graph export in D3-compatible JSON format/api/webhooks/github— GitHub webhook handler with HMAC verification/api/webhooks/gitlab— GitLab webhook handler/api/repos/{id}/git-metadata— per-file git metadata/api/repos/{id}/hotspots— high-churn + high-complexity files/api/repos/{id}/ownership— ownership breakdown (file/module/package granularity)/api/repos/{id}/co-changes— co-change partners for a file/api/repos/{id}/git-summary— aggregate git health signals for dashboard/api/repos/{id}/dead-code— dead code findings (GET list, POST trigger analysis)/api/repos/{id}/dead-code/summary— aggregate dead code stats/api/dead-code/{finding_id}— PATCH to resolve/acknowledge findings/api/repos/{id}/claude-md— GET preview of generated CLAUDE.md section (JSON, no disk write)/api/repos/{id}/claude-md/generate— POST to regenerate and write CLAUDE.md to disk/health— liveness + readiness (checks DB + provider)/metrics— Prometheus-compatible metrics (job counts, token totals, stale count)
Server lifecycle:
- On startup, any jobs left in
runningstate from a previous server instance are automatically reset tofailed(crash recovery). - Background pipeline tasks are tracked in
app.state.background_tasksto prevent garbage collection ofasyncio.Taskreferences.
Authentication is optional. Set REPOWISE_API_KEY to require bearer token auth on
all non-/health endpoints. Default (no key set): fully open, suitable for local use.
Served from the same port as the API. All routes under /:
| Route | Content |
|---|---|
/ |
Dashboard: all repos, recent jobs, stale page counts, token usage |
/repos/[id] |
Repo layout with file tree sidebar |
/repos/[id]/overview |
Overview dashboard — health score ring, attention panel, language donut, ownership treemap, hotspots mini, decisions timeline, module minimap, quick actions, active job banner |
/repos/[id]/wiki/[...slug] |
Individual wiki page with MDX rendering |
/repos/[id]/search |
Semantic search results |
/repos/[id]/graph |
D3 force-directed dependency graph |
/repos/[id]/symbols |
Full symbol index, sortable by PageRank |
/repos/[id]/coverage |
Documentation coverage metrics |
/repos/[id]/ownership |
Ownership treemap — files colored by primary owner, sized by LOC |
/repos/[id]/hotspots |
Hotspot list — top 20 files with churn + complexity bars |
/repos/[id]/dead-code |
Dead code report — three tabs: Files, Exports, Internals |
/repos/[id]/decisions |
Architectural decision records |
/repos/[id]/chat |
Codebase chat with streaming LLM responses |
/settings |
Provider config, polling interval, cascade budget |
Key rendering behavior:
Every backtick identifier in a wiki page that matches a known symbol becomes a hover card showing the symbol's signature, file, and confidence score, plus a link to the symbol's wiki page. This is built by post-processing the MDX content client-side after initial render.
Mermaid diagrams are rendered lazily (only when scrolled into viewport) using
mermaid.js initialized with the repowise design theme.
Code blocks use Shiki for syntax highlighting. Each block has a "View in source" button that deep-links to the relevant line in GitHub/GitLab.
The <GenerationProgress> component connects to /api/jobs/{id}/stream via SSE
and shows a live progress display during generation (pages done/total, current
file, tokens used, estimated cost, estimated time remaining).
repowise includes an interactive chat interface that lets users ask questions about their codebase and receive answers grounded in the wiki, dependency graph, git history, and architectural decisions. The chat agent uses whichever LLM provider the user has configured and has access to all 11 MCP tools.
See docs/CHAT.md for the full technical reference covering the
backend agentic loop, SSE streaming protocol, provider abstraction extensions,
database schema, frontend component architecture, and artifact rendering system.
Key design points:
- Provider-agnostic — the chat agent goes through the same provider abstraction
as documentation generation. A
ChatProviderprotocol extendsBaseProviderwithstream_chat()for streaming + tool use without breaking existing callers. - Tool reuse — the 11 MCP tools are called directly as Python functions (no
subprocess round-trip). Tool schemas are defined once in
chat_tools.pyand fed to both the LLM and the executor. - SSE streaming —
POST /api/repos/{repo_id}/chat/messagesruns the agentic loop and streams back Server-Sent Events (text_delta,tool_start,tool_result,done,error). - Conversation persistence — chat history is stored in
conversationsandchat_messagestables, allowing replay and continuation across page refreshes. - Artifact panel — tool results with rich content (wiki pages, Mermaid diagrams, search results, risk reports) open in a slide-in artifact panel that reuses existing frontend components.
repowise init /path/to/repo
│
▼
FileTraverser
walks repo, produces list[FileInfo]
│
▼
ASTParser ──── queries/<lang>.scm ────▶ ParsedFile per file
(parallel, (symbols, imports, exports)
one parser per worker)
│
▼
GraphBuilder
constructs DiGraph
computes PageRank, SCC, Betweenness, Communities
persists to graph.json (or DB for large repos)
│
▼
GitIndexer (if git.enabled)
mines git history → git_metadata table
computes co-change partners → adds co_changes edges to graph
classifies hotspots (top 25% churn + complexity)
classifies stable files (>10 commits, 0 in 90 days)
│
▼
DeadCodeAnalyzer (if dead_code.enabled)
graph traversal + SQL: unreachable files, unused exports, zombie packages
pure analysis — no LLM calls, < 10 seconds
│
▼
JobSystem
creates GenerationJob in DB
shows cost estimate + confirmation
│
├── Level 0: special handlers (OpenAPI, proto, GraphQL, Dockerfile...)
│
├── Level 1: symbol spotlight pages (top 10% PageRank)
│ ┌────────────────────────────────────┐
│ │ ContextAssembler │
│ │ source + graph metadata │
│ │ + RAG (vector store similarity search)│
│ │ + import summaries │
│ └──────────────┬─────────────────────┘
│ │
│ ▼
│ LLMProvider.generate()
│ │
│ ▼
│ WikiPage stored:
│ → SQL (content, metadata, confidence=1.0)
│ → VectorStore (LanceDB or pgvector — embedding for RAG)
│ → Graph (node.page_id linked)
│
├── Level 2: file pages (parallel, RAG pool growing with each completed page)
├── Level 3: SCC pages
├── Level 4: module pages
├── Level 5: cross-package pages (monorepos)
├── Level 6: repo overview + architecture diagram
├── Level 7: config/infra pages
└── Level 8: index pages
│
▼
.repowise/state.json updated
MCP config printed
CLAUDE.md generated (EditorFileDataFetcher → ClaudeMdGenerator, no LLM)
Summary shown (pages, tokens, cost, time)
git push
│
├── GitHub webhook → POST /api/webhooks/github
│ │ (verified, stored in webhook_events)
│ ▼
│ GenerationJob created (type: incremental)
│
└── (fallback) APScheduler polls every 15min
│ (if HEAD != last_sync_commit)
▼
GenerationJob created (type: incremental)
GenerationJob runs:
│
▼
ChangeDetector.get_changed_files(since=last_sync_commit)
│
├── Renamed files → update source_files refs in SQL
▼
GitIndexer.index_changed_files(changed_file_paths)
│ re-indexes only changed files + their co-change partners
│ updates co-change edges in graph
▼
ChangeDetector.detect_symbol_renames()
│
▼
ChangeDetector.get_affected_pages(cascade_budget=30)
│ includes co-change partners in staleness propagation
│
├── regenerate: list[page_id] → full regeneration (up to cascade_budget)
├── rename_patch: list[page_id] → targeted text find-and-replace
└── decay_only: list[page_id] → confidence score decay, add to staleness queue
│
▼
For each page in `regenerate`:
ContextAssembler → LLMProvider → update SQL + VectorStore + Graph
│
▼
For each page in `rename_patch`:
scan content_md for old symbol name → replace → update SQL
│
▼
For each page in `decay_only`:
confidence_score *= decay_factor
regen_queued = True
│
▼
repos.last_sync_commit = HEAD
state.json updated
Web UI "Sync" / "Full Re-index" button
│
▼
POST /api/repos/{id}/sync (or /full-resync)
│
├── Check: no pending/running job for this repo (else → 409)
▼
Create GenerationJob (status=pending, commit)
│
▼
asyncio.create_task(execute_job(job_id, app_state))
│ (strong ref in app.state.background_tasks)
▼
execute_job():
├── Mark job "running"
├── Resolve LLM provider from server config
├── run_pipeline() with JobProgressCallback
│ └── writes progress to GenerationJob table every 5 items
├── persist_pipeline_result() (shared with CLI)
├── FTS index new pages
├── drain_and_stop() progress tasks
└── Mark job "completed" (or "failed" on error)
Claude Code: "how does the auth module work?"
│
▼
MCP client calls repowise tool: search_codebase(query="auth module")
│
▼
VectorStore similarity search (LanceDB or pgvector) → top 5 page IDs
│
▼
SQL: fetch page content for those IDs
│
▼
Return: list[{title, content_md snippet, relevance_score, confidence_score}]
│
▼
Claude Code has structured, current documentation
instead of reading 40 files
repowise offers three complementary ways to skip paths during traversal, each solving a different problem:
-
Root
.repowiseIgnore— project-wide exclusions committed to the repo (like.gitignore). Shared across all users, version-controlled. -
Per-directory
.repowiseIgnore— exclusions that only apply within a subtree. Lets a monorepo package owner exclude its own generated output without polluting the root ignore file. Patterns are relative to the directory, matching git semantics. Specs are loaded once per directory duringos.walkand cached by absolute path; the root spec is pre-seeded in the cache to avoid reading it twice. -
extra_exclude_patterns(runtime) — patterns injected without touching any file on disk: from--exclude/-xCLI flags (dev workflow), fromconfig.yaml exclude_patterns(persisted per repo), or fromrepo.settings["exclude_patterns"](Web UI / REST API). This lets teams configure exclusions through the UI without git access to the target repo.
All three layers use pathspec with gitwildmatch semantics — the same library used
for .gitignore parsing — so the full gitignore syntax works everywhere.
The original save_config() wrote a fixed three-key file (provider, model,
embedder). This meant any other keys — such as exclude_patterns set via the Web UI
— would be silently dropped the next time repowise init ran. The updated function
loads the existing config, merges in the new values, then writes the result back.
This ensures all config sources (CLI, Web UI, REST API, manual edits) coexist safely.
Per-language differences live in .scm query files and LANGUAGE_CONFIGS dict entries.
This means adding a new language requires no changes to Python business logic — just
a new .scm file and a config entry. The ASTParser class itself never has
if lang == "python" branches.
ChromaDB was the original choice but has notable drawbacks: slow write throughput on
large repos, heavy C++ build dependencies (chroma-hnswlib), and a bloated dependency
tree. LanceDB replaces it as the default embedded vector store:
- No build step — pure Python wheel, no C++ compiler required on Windows
- Faster writes — Lance columnar format is optimised for batch appends; embedding
50K pages during
repowise initis measurably faster - Faster queries — IVF-PQ and HNSW index support with sub-millisecond ANN search
- Simpler data model — tables are Arrow-native; filtering by
repo_idorpage_typeuses SQL-style predicates alongside the vector search, no separate metadata store needed - Zero server process — same embedded story as ChromaDB: data lives in
.repowise/lancedb/, no container needed
When PostgreSQL is already in use (multi-worker prod), pgvector is preferred because
it eliminates the second storage system entirely: embeddings live as a vector column
in the existing wiki_pages table, queries are plain SQL with <=> cosine distance,
and backup/restore is a single pg_dump. The HNSW index (CREATE INDEX ... USING hnsw)
gives query latency on par with LanceDB at typical repowise dataset sizes.
The VectorStore abstraction in packages/core/src/repowise/core/persistence/vector.py
selects the backend at startup based on DATABASE_URL: SQLite → LanceDB, PostgreSQL → pgvector.
NetworkX is a Python library that runs in-process. Neo4j is a separate server (Java process, GBs of RAM, authentication setup). For a graph that is derived from source code (rebuildable at any time), the operational overhead of Neo4j is unjustified. NetworkX handles tens of thousands of nodes comfortably.
The SQLite-backed fallback using networkit handles repos that genuinely exceed
in-memory limits without adding a server dependency.
A binary fresh/stale flag would require either: (a) marking everything stale on every push (triggering full regen, expensive), or (b) only marking directly changed pages (missing transitive effects). The confidence score model captures the gradient: directly changed pages decay sharply, indirectly referenced pages decay gently. Users can see at a glance how much to trust each page.
All LLM prompts are Jinja2 templates in packages/core/queries/prompts/. Users can
override any prompt by placing a file with the same name in .repowise/prompts/.
This lets power users tune generation quality without forking the project.
Without a budget, changing a central utility (imported by 200 files) would regenerate 200+ pages on every push. With a budget (default: 30), repowise regenerates the highest-PageRank affected pages immediately and defers the rest to the nightly background job. No page stays stale indefinitely; the nightly job catches everything the cascade budget missed.
The highest-value use of git metadata is enriching the LLM's generation context — not just showing ownership in a sidebar. By including significant commit messages in the prompt, the LLM can explain why code is structured a certain way (e.g., "this was refactored in March 2024 to separate auth concerns from the request pipeline"). This context is unavailable from static analysis alone.
Co-change relationships are temporal coupling — they cannot be detected by AST
parsing. repowise adds them as co_changes edges after GitIndexer runs, but
deliberately filters them out of PageRank computation (they would skew it
artificially toward files that are often edited together for process reasons, not
architectural ones). They participate in change propagation and visualization only.
repowise's dead code detector errs heavily toward false negatives. safe_to_delete
is set to True only at confidence >= 0.7 and after excluding dynamically-loaded
patterns. Dead code analysis is pure graph traversal + SQL (no LLM calls), so it
completes in seconds and can be re-run cheaply. repowise surfaces candidates —
humans decide before deleting anything.
All database operations use async SQLAlchemy with aiosqlite. The event loop
is never blocked. This matters during generation: the LLM call, the DB write,
and the vector store embed (LanceDB or pgvector) can overlap with the next file's context assembly.
See docs/EDITOR_FILES.md for the complete reference covering
architecture, all data sources, how the marker-merge system works, and how to add
support for a new editor file (cursor.md, copilot-instructions.md, etc.).
Quick summary: repowise can generate and maintain AI-editor configuration files (CLAUDE.md, cursor.md, etc.) from the already-indexed codebase data — no LLM calls. The system uses HTML comment markers to split the file into a user-owned section and a Repowise-managed section. The user section is never touched.
The feature runs automatically after repowise init and repowise update, and can
also be run standalone:
repowise generate-claude-md [PATH]
Config opt-out:
# .repowise/config.yaml
editor_files:
claude_md: falseKey files:
| File | Purpose |
|---|---|
core/generation/editor_files/base.py |
BaseEditorFileGenerator — marker-merge logic shared by all editor-file generators |
core/generation/editor_files/data.py |
EditorFileData frozen dataclass — the data contract between fetcher and template |
core/generation/editor_files/fetcher.py |
EditorFileDataFetcher — queries DB for architecture summary, modules, hotspots, decisions |
core/generation/editor_files/tech_stack.py |
Filesystem scan for languages, frameworks, build commands |
core/generation/editor_files/claude_md.py |
ClaudeMdGenerator — 30-line subclass that binds filename + template |
core/generation/templates/claude_md.j2 |
Jinja2 template for the Repowise-managed section |
cli/commands/claude_md_cmd.py |
repowise generate-claude-md CLI command |
server/routers/claude_md.py |
GET/POST /api/repos/{id}/claude-md REST endpoints |
-
Write
packages/core/queries/<language>.scmUse tree-sitter S-expression syntax. Follow the capture name conventions:
@symbol.def,@symbol.name,@symbol.params,@symbol.return_type,@symbol.docstring,@import.statement,@import.module,@import.names.Check the tree-sitter playground for your language's node type names:
https://tree-sitter.github.io/tree-sitter/playground -
Add a
LanguageConfigentry toLANGUAGE_CONFIGSinparser.py"mylang": LanguageConfig( symbol_node_types={ "function_definition": "function", "class_definition": "class", }, import_node_types=["import_statement"], export_node_types=[], visibility_fn=lambda name, mods: "private" if name.startswith("_") else "public", entry_point_patterns=["main.ml", "app.ml"], ),
-
Add a
LanguageSpectoLanguageRegistryiningestion/languages/registry.pyThis registers the language's identity data (extensions, entry points, manifest files, builtin calls, heritage node types, etc.) centrally.
-
Add the grammar dependency to
pyproject.toml"tree-sitter-mylang>=0.23,<1", -
Add test files to
tests/fixtures/sample_repo/At minimum: one file with a function, one with a class, one with imports.
-
(Optional) Add per-language extractors for bindings, heritage, visibility, docstrings, and a dedicated import resolver in
resolvers/mylang.py. -
Run
pytest tests/unit/test_parser.py -k mylangto verify extraction. -
Open a PR. That's it — no other changes needed.
-
Create
packages/core/providers/<name>.pySubclass
LLMProviderand implement:generate(request: GenerationRequest) -> GenerationResponsegenerate_stream(request: GenerationRequest) -> AsyncIterator[str]embed(request: EmbedRequest) -> EmbedResponsenameproperty
Optionally override:
supports_batch→Trueif the provider has a batch APIgenerate_batch(requests) -> list[GenerationResponse]estimate_cost(input_tokens, output_tokens) -> float
-
Register in
providers/registry.pycase "myprovider": return MyProvider(config)
-
Add config section to
.repowise/config.yamldocsmyprovider: api_key: ${MYPROVIDER_API_KEY} base_url: https://api.myprovider.com/v1
-
Add default rate limits to
PROVIDER_DEFAULT_LIMITSinrate_limiter.py -
Add a
MockProviderfixture for tests if the provider has unique response formats -
Update
CONTRIBUTING.mdwith the new provider's environment variables
Full configuration with defaults (.repowise/config.yaml):
provider: anthropic # anthropic | openai | openrouter | gemini | deepseek | ollama | litellm | mock
model: claude-sonnet-4-5 # passed through to the provider
embedding_provider: anthropic
embedding_model: voyage-3
batch_mode: auto # auto | always | never
prompt_caching: true
anthropic:
api_key: ${ANTHROPIC_API_KEY}
openai:
api_key: ${OPENAI_API_KEY}
base_url: https://api.openai.com/v1
ollama:
base_url: http://localhost:11434
generation:
max_tokens: 4096
temperature: 0.2
concurrent_jobs: 5
request_timeout: 120
max_retries: 3
large_file_threshold_kb: 100
max_file_size_kb: 500
graph:
backend: auto # auto | networkx | sqlite
node_threshold: 30000 # switch to sqlite above this
git:
enabled: true # mine git history for richer generation
co_change_commit_limit: 500 # commits to analyze for co-changes (0 = unlimited)
co_change_min_count: 3 # minimum co-occurrences to register a relationship
blame_enabled: true # git blame for ownership (slightly slower)
prompt_commit_count: 10 # significant commits to include in generation prompts
depth_auto_upgrade: true # auto-upgrade depth for hotspots
dead_code:
enabled: true # detect unreachable files, unused exports, zombie packages
detect_unreachable_files: true
detect_unused_exports: true
detect_unused_internals: false # off by default — higher false positive rate
detect_zombie_packages: true
min_confidence: 0.4
safe_to_delete_threshold: 0.7
dynamic_patterns: # symbols matching these are never flagged
- "*Plugin"
- "*Handler"
- "*Adapter"
- "*Middleware"
- "register_*"
- "on_*"
whitelist_file: .repowise/dead_code_whitelist.txt
analyze_on_update: true # re-analyze dead code on incremental updates
maintenance:
cascade_budget: 30
staleness_decay_direct: 0.85
staleness_decay_referenced: 0.95
staleness_regen_threshold: 0.60
staleness_warn_threshold: 0.75
background_regen_schedule: "0 2 * * *"
background_regen_budget: 100
polling_interval_minutes: 15
rate_limits:
anthropic:
rpm: 50
tpm: 100000
openai:
rpm: 500
tpm: 150000
ollama:
rpm: 999999
tpm: 999999