[Superseded] feat(code-index): add code indexing crate with builtin and QMD backends#752
[Superseded] feat(code-index): add code indexing crate with builtin and QMD backends#752Cstewart-HC wants to merge 15 commits intomoltis-org:mainfrom
Conversation
P1 scope (complete): - types.rs: Language, FileEntry, FilteredFile, CodeChunk, IndexStatus, SearchResult - config.rs: CodeIndexConfig with extension allowlist, skip paths, size limits - discover.rs: git-tracked file enumeration via gix - filter.rs: extension filter, binary detection, size limit, path exclusion - error.rs: Error enum with GitRepoNotFound, GitOperation, Io, etc - backend_qmd.rs: QMD collection config builder (feature-gated) - lib.rs: module declarations, feature-gated backend_qmd Tests: 11 passing, clippy: clean
- Remove Language::Bash variant (merges into Shell for round-trip consistency) - Remove dead GitOperation error variant (YAGNI, add back in P2) - Move unused deps out of [dependencies] (anyhow, async-trait, regex, etc.) into commented P2 section; add sqlx as optional dep for pgvector feature - Add serde_json as dev-dependency (used in test only) - Add trailing newlines to all source files (POSIX compliance) - All 11 tests passing, clippy clean
- Add index.rs: CodeIndex struct with config_only() and new() constructors, list_indexable_files(), index_project(), search(), keyword_search(), status() - Add search.rs: QMD result adapter (from_qmd, from_qmd_results) - Update lib.rs: export search module (feature-gated), index module, CodeIndex re-export - Clean up backend_qmd.rs: remove unused import annotations from P1 - Fix API mismatches: QMD search methods don't take collection param; QmdSearchResult.line is i64, score is f32 - Add tokio dev-dependency for async tests - All 14 tests passing, clippy clean
- Add tools.rs with CodebaseSearchTool, CodebasePeekTool, CodebaseStatusTool implementing AgentTool trait from moltis-agents - Fix deferred issue #1: replace SystemTime epoch math with time crate - Fix deferred issue #4: add enable_embeddings parameter to index_project() - Add register_tools() helper for gateway wiring - Add anyhow, async-trait, serde_json, time dependencies - Feature-gate tools module behind qmd (requires CodeIndex with backend) - 6 new tests: peek, search (backend required), status, parameter schemas
…indexing Adds two new modules: - watcher.rs: CodeIndexWatcher using notify_debouncer_full (same pattern as moltis-skills and moltis-openclaw-import). Watches a project directory for file create/modify/delete events, filters through CodeIndexConfig's extension allowlist and path exclusions, emits CodeWatchEvent::Changed and CodeWatchEvent::Removed via tokio mpsc. Feature-gated behind 'file-watcher'. - delta.rs: compute_delta() and build_initial_snapshot() for incremental reindexing. Computes SyncDelta (added/removed/modified file sets) by comparing current git-tracked filtered files against a previous HashSnapshot (HashMap<String, String> of relative_path → sha256). Uses content_hash from filter.rs for change detection. Feature flag 'file-watcher' added to Cargo.toml, depends on notify-debouncer-full (workspace dep) and tokio sync feature. All 32 tests passing, clippy clean with --all-features.
Review fixes applied: - #1: Extract require_str/opt_usize_or to moltis_tools::params — replaced local helpers with params::require_str() and params::u64_param() from the shared workspace crate - #2: Unified error model — CodebaseSearchTool now returns Ok(json!({error:..., search_available: false})) for BackendUnavailable, matching Peek/Status pattern - #3: u64→usize truncating cast replaced with usize::try_from().unwrap_or() - #4: ensure_collections() error remapped from BackendUnavailable to IndexFailed { project_id, message } - moltis-org#6: result.line as usize now clamped with .max(1) minimum - moltis-org#7: compute_delta carries forward previous hash on hash errors so files aren't spuriously marked as removed - moltis-org#8: Added doc comment noting that watcher batches may contain duplicate paths - moltis-org#9: Extracted effective_extension() from filter.rs, removed duplication between filter.rs and watcher.rs - moltis-org#11: Added tracing::debug! for skipped files in build_initial_snapshot - moltis-org#12: Added 'drop to stop' documentation on CodeIndexWatcher::start() 32 tests pass, clippy clean.
…l delta
Adds SnapshotStore — a file-backed, atomic-write store that persists
HashSnapshot per project to <data_dir>/code-index/<project_id>.json.
- SnapshotStore::new(base_dir) / ::default_path() constructors
- load(project_id) -> Option<HashSnapshot> (None if first run)
- save(project_id, &HashSnapshot) — atomic write via .tmp + rename
- delete(project_id) — removes snapshot file
- sanitize_project_id() rejects path traversal (/ \ .. \0)
- CodeIndex now owns a SnapshotStore, wired via config.data_dir
- CodeIndex::{load_snapshot, save_snapshot} delegate to store
- index_project() saves snapshot after successful reindex
- Added Error::Store variant for snapshot I/O errors
- Added moltis-config workspace dep for data_dir() resolution
- 10 new unit tests, all 42 tests pass, clippy clean
- Add init_code_index module (config-only mode, no backend required) - Thread code_index through PostStateInputs → complete_startup → GatewayState - Add code_index field to GatewayState (immutable, after memory_manager) - GatewayState::new() creates a default config-only index for tests - Gateway compiles cleanly with zero new warnings
- Add trailing newline to init_code_index.rs - Align moltis-code-index entry with neighbours in Cargo.toml - Add init_code_index to module doc comment in server/mod.rs
- init_code_index now async: checks QMD availability at startup - Full mode (CodeIndex::new) when QMD binary is present and reachable - Falls back to config-only mode with warn log if QMD is absent - Falls back gracefully when compiled without qmd feature - Gateway qmd feature now also enables moltis-code-index/qmd - Empty collections at init; per-project registration deferred to index_project()
…egistration - Extract single-collection registration from ensure_collections loop into public ensure_collection() method on QmdManager - ensure_collections() now delegates to ensure_collection() per entry - CodeIndex::index_project() builds project-specific QmdCollection via backend_qmd::project_collection_config and registers it idempotently before refresh_index
- Clones Arc<CodeIndex> before move into GatewayState - Registers codebase_search, codebase_peek, codebase_status tools when the qmd feature is enabled - Tools are available to the LLM agent for indexed workspaces
Implements the builtin backend for code-index using SQLite + FTS5: New files: - store.rs: CodeIndexStore trait, RRF merge, cosine similarity, quantization - store_sqlite.rs: SQLite backend with FTS5 keyword search + i8 embeddings - chunker.rs: Line-based code chunker with overlap and byte-size splitting Changes: - index.rs: Backend enum dispatch (Qmd/Builtin/ConfigOnly), builtin index and search pipelines with hybrid vector+keyword RRF - lib.rs: Wire new modules, restore file-watcher gate - types.rs: Rename CodeChunk → DiscoveredChunk to avoid name collision with store::CodeChunk - error.rs: Add IndexStore variant - tools.rs: Adapt to new Backend enum Review fixes applied: - P0: Add crash-safety TODO for clear-then-reindex; fix init/clear ordering - P0: Rename types::CodeChunk → DiscoveredChunk (name collision) - P1: Document quantization tradeoff (space vs recall on non-normalized) - P1: Add PERF note on brute-force vector search memory usage - P1: Log warning on embedding failure instead of silent fallback - P1: Restore #[cfg(feature = \"file-watcher\")] module gate - P1: Use actual embedder.model_name() instead of \"builtin\" stub - P2: Fix negative keyword scores → 1.0/(1.0+rank) for valid (0,1] range - P2: Remove misleading project_dir.join(file.path), use file.path directly - P2: Replace all unwrap()/expect() with safe alternatives (clippy clean) - P2: Redundant closure fix Clippy: 0 errors, 0 warnings (builtin + qmd + file-watcher features) Tests: 39/39 passing
- Add snapshot-based delta indexing: only reindex changed files - Add list_indexable_files() for previewing what gets indexed - Add filter() and chunker() config methods - Rewrite watcher.rs for notify-debouncer-full 0.7.0 API - Fix ConfigOnly backend to return BackendUnavailable (tool compatibility) - Add SearchFailed error variant, Language::from_path(), SearchResult text/source fields - Add FilterConfig struct - Fix IndexStatus construction (backend, last_sync_ms fields) - Fix builtin chunk handling (file_path/content fields) - Fix QMD backend calls (hybrid_search, status, ensure_collections) - Remove quantize/dequantize from vector search (raw f32 cosine) - Fix snapshot_store calls (synchronous API) - Fix test helpers (temp data dirs to avoid path collisions) 74 tests passing, clippy clean, index.rs under 1,500 lines (1,048)
Clippy and test fixes across the code-index crate: - Gate Arc import on cfg(feature = "file-watcher") - Gate tracing imports on cfg(any(feature = "builtin", feature = "file-watcher")) - Add allow(clippy::large_enum_variant) on Backend enum - Add cfg suppressors for unused vars in no-feature builds - Fix &content to content in chunker utf-8 boundary test - Make crate::Error import unconditional in tools.rs Test allow attributes: - Add allow(clippy::unwrap_used, clippy::expect_used) to all test modules Gateway wiring: - Use struct literal with Default::default() for CodeIndexConfig init - Add code_index param to all GatewayState::with_options() test call sites - Add moltis-code-index as dev-dependency of moltis-httpd Default feature fix: - Change default from ["qmd"] to ["builtin"] - Add dep:moltis-agents to builtin feature Format: cargo +nightly fmt --all
Greptile SummaryThis PR adds a new
Confidence Score: 2/5Not safe to merge — the default build ships a completely non-functional code-index feature due to missing builtin wiring in both init and tool registration. Two P1 bugs in init_code_index.rs and post_state.rs mean the entire feature is dead on the default build. The third P1 in reindex_files corrupts incremental-reindex state whenever the file-watcher feature is used. crates/gateway/src/server/init_code_index.rs and crates/gateway/src/server/prepare_core/post_state.rs (missing builtin branches); crates/code-index/src/index.rs (reindex_files absolute-path bug).
|
| Filename | Overview |
|---|---|
| crates/gateway/src/server/init_code_index.rs | Only initializes a real backend under #[cfg(feature = "qmd")]; the default builtin feature always falls back to config_only, making the builtin SQLite+FTS5 backend dead on all default builds. |
| crates/gateway/src/server/prepare_core/post_state.rs | Tool registration block gated on #[cfg(feature = "qmd")] only — with the default builtin feature the three code-index tools are never added to the agent registry. |
| crates/code-index/src/index.rs | Core orchestrator is well-structured; reindex_files incorrectly sets relative_path to the absolute path provided by the watcher, breaking incremental index consistency. |
| crates/code-index/src/store_sqlite.rs | Well-implemented SQLite+FTS5 backend with idempotent schema creation, atomic upsert via delete+insert transaction, and trigger-based FTS sync. |
| crates/code-index/src/snapshot_store.rs | Atomic write (temp+rename) with robust path-traversal sanitization; well-tested. |
| crates/code-index/src/tools.rs | Tool implementations are correct; CodebaseSearchTool docstring says Requires a QMD backend but it works with builtin too. |
| crates/code-index/Cargo.toml | Feature flags are structurally correct, but tracing and metrics feature gates required by CLAUDE.md are absent. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Gateway startup] --> B[init_code_index]
B --> C{feature = qmd?}
C -- yes --> D[QmdManager::is_available?]
D -- yes --> E[CodeIndex::new QMD backend]
D -- no --> F[config_only]
C -- no --> G[log only]
G --> F
F --> H[GatewayState.code_index]
E --> H
H --> I[post_state: register tools]
I --> J{feature = qmd?}
J -- yes --> K[register_tools]
J -- no --> L[tools NOT registered]
Reviews (1): Last reviewed commit: "fix(code-index): address clippy, test co..." | Re-trigger Greptile
| pub(crate) async fn init_code_index( | ||
| data_dir: &std::path::Path, | ||
| ) -> Arc<moltis_code_index::CodeIndex> { | ||
| let code_index_config = moltis_code_index::CodeIndexConfig { | ||
| data_dir: Some(data_dir.join("code-index")), | ||
| ..Default::default() | ||
| }; | ||
|
|
||
| #[cfg(feature = "qmd")] | ||
| { | ||
| let qmd_config = moltis_qmd::QmdManagerConfig { | ||
| command: "qmd".into(), | ||
| collections: std::collections::HashMap::new(), | ||
| max_results: 20, | ||
| timeout_ms: 30_000, | ||
| work_dir: data_dir.to_path_buf(), | ||
| index_name: format!("code-{}", super::helpers::sanitize_qmd_index_name(data_dir)), | ||
| env_overrides: std::collections::HashMap::new(), | ||
| }; | ||
| let qmd = moltis_qmd::QmdManager::new(qmd_config); | ||
|
|
||
| if qmd.is_available().await { | ||
| info!( | ||
| index = %qmd.index_name(), | ||
| "code-index: QMD backend available, initializing in full mode" | ||
| ); | ||
| return Arc::new(moltis_code_index::CodeIndex::new(code_index_config, qmd)); | ||
| } | ||
|
|
||
| warn!( | ||
| "code-index: QMD binary not found, falling back to config-only mode \ | ||
| (search unavailable until QMD is installed)" | ||
| ); | ||
| } | ||
|
|
||
| #[cfg(not(feature = "qmd"))] | ||
| { | ||
| info!( | ||
| "code-index: initialized in config-only mode \ | ||
| (qmd feature disabled — search unavailable)" | ||
| ); | ||
| } | ||
|
|
||
| Arc::new(moltis_code_index::CodeIndex::config_only(code_index_config)) | ||
| } |
There was a problem hiding this comment.
Builtin backend never initialized
init_code_index only creates a real backend inside the #[cfg(feature = "qmd")] block. When the builtin feature is enabled (which is the default) but qmd is not, the function falls through to config_only mode. The SqliteCodeIndexStore is never constructed, so every index_project and search call returns BackendUnavailable, making the entire builtin feature effectively inoperative on the default build.
A #[cfg(feature = "builtin")] path is needed to construct SqliteCodeIndexStore and call CodeIndex::new_builtin.
| // ── Code index tools ───────────────────────────────────────────── | ||
| #[cfg(feature = "qmd")] | ||
| { | ||
| moltis_code_index::tools::register_tools(&mut tool_registry, code_index_for_tools); | ||
| } |
There was a problem hiding this comment.
Code-index tools never registered on default (
builtin) build
Tool registration is wrapped in #[cfg(feature = "qmd")], but lib.rs makes tools available under any(feature = "qmd", feature = "builtin"). On the default build (only builtin enabled) the register_tools call is never reached, so codebase_search, codebase_peek, and codebase_status are absent from the agent tool registry. The code_index_for_tools variable on line 337 also only exists inside #[cfg(feature = "qmd")].
Both the variable declaration and the register_tools call need to be widened to #[cfg(any(feature = "qmd", feature = "builtin"))].
| // Convert absolute paths to FilteredFile entries. | ||
| let files: Vec<FilteredFile> = paths | ||
| .iter() | ||
| .filter_map(|p| { | ||
| if p.is_file() { | ||
| Some(FilteredFile { | ||
| path: p.clone(), | ||
| relative_path: p.clone(), | ||
| size: p.metadata().map(|m| m.len()).unwrap_or(0), | ||
| language: crate::types::Language::from_path(p), | ||
| }) | ||
| } else { | ||
| None | ||
| } | ||
| }) | ||
| .collect(); |
There was a problem hiding this comment.
reindex_files stores absolute paths as relative_path
The watcher passes absolute PathBufs, and relative_path is set to the same absolute value as path. The comment in index_files_builtin is explicit that relative_path is the repo-relative path used for storage. Storing absolute paths here produces index entries whose file_path key won't match the relative-path keys written by the initial full index, leaving orphaned chunks and breaking incremental search for watcher-triggered updates.
reindex_files needs a project_dir: &Path parameter so it can strip the prefix before constructing FilteredFile.
| }; | ||
| let index = CodeIndex::new_builtin(config, Box::new(store), None); | ||
| // Keep data_dir alive for the test's lifetime. | ||
| std::mem::forget(data_dir); | ||
| (index, repo) |
There was a problem hiding this comment.
std::mem::forget leaks temp directories in tests
std::mem::forget(data_dir) prevents TempDir::drop from running, so the temporary directory is never cleaned up. The same pattern appears in setup_index_with_embedder and setup_index_with_failing_embedder. The correct fix is to return the TempDir from the setup functions so the caller holds it alive for the test's lifetime.
| [features] | ||
| default = ["builtin"] | ||
| qmd = ["dep:moltis-qmd", "dep:moltis-agents"] | ||
| builtin = ["dep:moltis-memory", "dep:moltis-agents", "dep:sqlx", "dep:tokio"] | ||
| file-watcher = ["dep:notify-debouncer-full", "dep:tokio", "dep:tokio-util"] |
|
@penso I broke this into 4 pr for easier reviewing, but if you accept them it will have to be in order. I find greptile not so great at reviewing big diffs too. Right now it uses 50 line chunking,but I plan to add AST aware chunking strategies as well. This PR opens the door to supporting document indexing and retrieval generally in a future release. |
|
Superseded by #756 which contains the complete code-index feature with all review fixes applied. |
This PR has been split into a 4-PR stack for easier review:
Please review the stack above instead. This PR will be closed once the stack is merged.