feat(code-index): builtin SQLite+FTS5 backend, incremental reindexing, clippy cleanup #756
Cstewart-HC wants to merge 21 commits into moltis-org:main
P1 scope (complete):
- types.rs: Language, FileEntry, FilteredFile, CodeChunk, IndexStatus, SearchResult
- config.rs: CodeIndexConfig with extension allowlist, skip paths, size limits
- discover.rs: git-tracked file enumeration via gix
- filter.rs: extension filter, binary detection, size limit, path exclusion
- error.rs: Error enum with GitRepoNotFound, GitOperation, Io, etc.
- backend_qmd.rs: QMD collection config builder (feature-gated)
- lib.rs: module declarations, feature-gated backend_qmd
Tests: 11 passing, clippy: clean

- Remove Language::Bash variant (merges into Shell for round-trip consistency)
- Remove dead GitOperation error variant (YAGNI, add back in P2)
- Move unused deps out of [dependencies] (anyhow, async-trait, regex, etc.) into commented P2 section; add sqlx as optional dep for pgvector feature
- Add serde_json as dev-dependency (used in test only)
- Add trailing newlines to all source files (POSIX compliance)
- All 11 tests passing, clippy clean

- Add index.rs: CodeIndex struct with config_only() and new() constructors, list_indexable_files(), index_project(), search(), keyword_search(), status()
- Add search.rs: QMD result adapter (from_qmd, from_qmd_results)
- Update lib.rs: export search module (feature-gated), index module, CodeIndex re-export
- Clean up backend_qmd.rs: remove unused import annotations from P1
- Fix API mismatches: QMD search methods don't take collection param; QmdSearchResult.line is i64, score is f32
- Add tokio dev-dependency for async tests
- All 14 tests passing, clippy clean

- Add tools.rs with CodebaseSearchTool, CodebasePeekTool, CodebaseStatusTool implementing AgentTool trait from moltis-agents
- Fix deferred issue #1: replace SystemTime epoch math with time crate
- Fix deferred issue #4: add enable_embeddings parameter to index_project()
- Add register_tools() helper for gateway wiring
- Add anyhow, async-trait, serde_json, time dependencies
- Feature-gate tools module behind qmd (requires CodeIndex with backend)
- 6 new tests: peek, search (backend required), status, parameter schemas
…indexing
Adds two new modules:
- watcher.rs: CodeIndexWatcher using notify_debouncer_full (same pattern as moltis-skills and moltis-openclaw-import). Watches a project directory for file create/modify/delete events, filters through CodeIndexConfig's extension allowlist and path exclusions, emits CodeWatchEvent::Changed and CodeWatchEvent::Removed via tokio mpsc. Feature-gated behind 'file-watcher'.
- delta.rs: compute_delta() and build_initial_snapshot() for incremental reindexing. Computes SyncDelta (added/removed/modified file sets) by comparing current git-tracked filtered files against a previous HashSnapshot (HashMap<String, String> of relative_path → sha256). Uses content_hash from filter.rs for change detection.
Feature flag 'file-watcher' added to Cargo.toml, depends on notify-debouncer-full (workspace dep) and tokio sync feature.
All 32 tests passing, clippy clean with --all-features.
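The snapshot comparison described for delta.rs can be sketched as follows. This is a minimal illustration, not the crate's code: the type names mirror the commit message, but the function signature, the field types, and the use of sorted sets are assumptions. The real compute_delta() works from filtered git-tracked files rather than a ready-made second map.

```rust
use std::collections::{BTreeSet, HashMap};

/// Relative path -> content hash (the commit message describes sha256 strings).
type HashSnapshot = HashMap<String, String>;

/// File sets that differ between two snapshots.
/// BTreeSet is an assumption here, chosen for stable ordering.
#[derive(Debug, Default, PartialEq)]
struct SyncDelta {
    added: BTreeSet<String>,
    removed: BTreeSet<String>,
    modified: BTreeSet<String>,
}

/// Hypothetical signature: compare the previous snapshot with the current one.
fn compute_delta(previous: &HashSnapshot, current: &HashSnapshot) -> SyncDelta {
    let mut delta = SyncDelta::default();

    // Paths present now: either new, changed, or untouched.
    for (path, hash) in current {
        match previous.get(path) {
            None => {
                delta.added.insert(path.clone());
            }
            Some(old) if old != hash => {
                delta.modified.insert(path.clone());
            }
            Some(_) => {} // same hash, nothing to reindex
        }
    }

    // Paths that existed before but are gone now.
    for path in previous.keys() {
        if !current.contains_key(path) {
            delta.removed.insert(path.clone());
        }
    }

    delta
}
```

Only the added and modified sets need re-chunking; the removed set only needs its stored chunks deleted.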
Review fixes applied:
- #1: Extract require_str/opt_usize_or to moltis_tools::params; replaced local helpers with params::require_str() and params::u64_param() from the shared workspace crate
- #2: Unified error model: CodebaseSearchTool now returns Ok(json!({error:..., search_available: false})) for BackendUnavailable, matching Peek/Status pattern
- #3: u64→usize truncating cast replaced with usize::try_from().unwrap_or()
- #4: ensure_collections() error remapped from BackendUnavailable to IndexFailed { project_id, message }
- #6: result.line as usize now clamped with .max(1) minimum
- #7: compute_delta carries forward previous hash on hash errors so files aren't spuriously marked as removed
- #8: Added doc comment noting that watcher batches may contain duplicate paths
- #9: Extracted effective_extension() from filter.rs, removed duplication between filter.rs and watcher.rs
- #11: Added tracing::debug! for skipped files in build_initial_snapshot
- #12: Added 'drop to stop' documentation on CodeIndexWatcher::start()
32 tests pass, clippy clean.
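Fixes #3 and #6 both concern integer conversions. A small sketch of the pattern, with hypothetical helper names (limit_param and line_number do not appear in the PR). Note that the line-number helper uses try_from rather than the `as usize` cast the commit describes, so it is a variant and not the actual change.

```rust
/// Convert a u64 tool parameter to usize without a truncating `as` cast.
/// Falls back to `default` when the value does not fit in usize.
fn limit_param(raw: u64, default: usize) -> usize {
    usize::try_from(raw).unwrap_or(default)
}

/// Convert a backend line number (i64) to a 1-based usize.
/// Zero and negative values are clamped to 1.
fn line_number(raw: i64) -> usize {
    usize::try_from(raw).unwrap_or(1).max(1)
}
```

With a plain `as usize` cast, a negative i64 wraps to a very large value, which `.max(1)` alone would not catch; the try_from variant avoids that case.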
…l delta
Adds SnapshotStore — a file-backed, atomic-write store that persists
HashSnapshot per project to <data_dir>/code-index/<project_id>.json.
- SnapshotStore::new(base_dir) / ::default_path() constructors
- load(project_id) -> Option<HashSnapshot> (None if first run)
- save(project_id, &HashSnapshot) — atomic write via .tmp + rename
- delete(project_id) — removes snapshot file
- sanitize_project_id() rejects path traversal (/ \ .. \0)
- CodeIndex now owns a SnapshotStore, wired via config.data_dir
- CodeIndex::{load_snapshot, save_snapshot} delegate to store
- index_project() saves snapshot after successful reindex
- Added Error::Store variant for snapshot I/O errors
- Added moltis-config workspace dep for data_dir() resolution
- 10 new unit tests, all 42 tests pass, clippy clean
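The two safety properties listed above, path-traversal rejection and atomic write via a temporary file plus rename, can be sketched with the standard library alone. The function names follow the commit message, but the signatures, the error type, and the rejection of empty ids are assumptions; the real SnapshotStore serializes a HashSnapshot rather than taking a string.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Reject project ids that could escape the store directory.
/// Checks the characters named in the commit: `/`, `\`, `..`, and NUL.
/// Rejecting the empty string is an extra assumption of this sketch.
fn sanitize_project_id(id: &str) -> Option<&str> {
    let bad = id.is_empty()
        || id.contains('/')
        || id.contains('\\')
        || id.contains("..")
        || id.contains('\0');
    if bad {
        None
    } else {
        Some(id)
    }
}

/// Write `contents` to `<dir>/<project_id>.json` by writing a `.tmp`
/// sibling first and renaming it over the final path.
fn save_atomic(dir: &Path, project_id: &str, contents: &str) -> io::Result<PathBuf> {
    let id = sanitize_project_id(project_id)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "invalid project id"))?;
    fs::create_dir_all(dir)?;
    let final_path = dir.join(format!("{}.json", id));
    let tmp_path = dir.join(format!("{}.json.tmp", id));
    fs::write(&tmp_path, contents)?;
    // The rename replaces any existing snapshot in one step.
    fs::rename(&tmp_path, &final_path)?;
    Ok(final_path)
}
```

Because the temporary file lives in the same directory as the target, the rename stays on one filesystem, so a reader sees either the old snapshot or the new one rather than a partial write.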
- Add init_code_index module (config-only mode, no backend required)
- Thread code_index through PostStateInputs → complete_startup → GatewayState
- Add code_index field to GatewayState (immutable, after memory_manager)
- GatewayState::new() creates a default config-only index for tests
- Gateway compiles cleanly with zero new warnings

- Add trailing newline to init_code_index.rs
- Align moltis-code-index entry with neighbours in Cargo.toml
- Add init_code_index to module doc comment in server/mod.rs

- init_code_index now async: checks QMD availability at startup
- Full mode (CodeIndex::new) when QMD binary is present and reachable
- Falls back to config-only mode with warn log if QMD is absent
- Falls back gracefully when compiled without qmd feature
- Gateway qmd feature now also enables moltis-code-index/qmd
- Empty collections at init; per-project registration deferred to index_project()

…egistration
- Extract single-collection registration from ensure_collections loop into public ensure_collection() method on QmdManager
- ensure_collections() now delegates to ensure_collection() per entry
- CodeIndex::index_project() builds project-specific QmdCollection via backend_qmd::project_collection_config and registers it idempotently before refresh_index

- Clones Arc<CodeIndex> before move into GatewayState
- Registers codebase_search, codebase_peek, codebase_status tools when the qmd feature is enabled
- Tools are available to the LLM agent for indexed workspaces
Implements the builtin backend for code-index using SQLite + FTS5:
New files:
- store.rs: CodeIndexStore trait, RRF merge, cosine similarity, quantization
- store_sqlite.rs: SQLite backend with FTS5 keyword search + i8 embeddings
- chunker.rs: Line-based code chunker with overlap and byte-size splitting
Changes:
- index.rs: Backend enum dispatch (Qmd/Builtin/ConfigOnly), builtin index and search pipelines with hybrid vector+keyword RRF
- lib.rs: Wire new modules, restore file-watcher gate
- types.rs: Rename CodeChunk → DiscoveredChunk to avoid name collision with store::CodeChunk
- error.rs: Add IndexStore variant
- tools.rs: Adapt to new Backend enum
Review fixes applied:
- P0: Add crash-safety TODO for clear-then-reindex; fix init/clear ordering
- P0: Rename types::CodeChunk → DiscoveredChunk (name collision)
- P1: Document quantization tradeoff (space vs recall on non-normalized)
- P1: Add PERF note on brute-force vector search memory usage
- P1: Log warning on embedding failure instead of silent fallback
- P1: Restore #[cfg(feature = "file-watcher")] module gate
- P1: Use actual embedder.model_name() instead of "builtin" stub
- P2: Fix negative keyword scores → 1.0/(1.0+rank) for valid (0,1] range
- P2: Remove misleading project_dir.join(file.path), use file.path directly
- P2: Replace all unwrap()/expect() with safe alternatives (clippy clean)
- P2: Redundant closure fix
Clippy: 0 errors, 0 warnings (builtin + qmd + file-watcher features)
Tests: 39/39 passing
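The scoring pieces named in this commit (keyword score normalization, cosine similarity, and RRF merge) can be illustrated with a short sketch. Everything here is an assumption about shape rather than the crate's API: the signatures are hypothetical, `rank` is read as a non-negative position, the RRF constant `k` is left to the caller, and results are identified by plain string ids.

```rust
use std::collections::HashMap;

/// Map a non-negative keyword rank to a score in (0, 1],
/// following the 1.0/(1.0+rank) formula from the P2 fix.
fn keyword_score(rank: usize) -> f32 {
    1.0 / (1.0 + rank as f32)
}

/// Cosine similarity of two vectors.
/// Returns 0.0 for mismatched lengths or zero-norm input (a choice of this sketch).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Reciprocal rank fusion: every ranked list adds 1 / (k + rank) to an
/// item's score, with rank starting at 1. Returns items by descending score,
/// ties broken by id so the output order is deterministic.
fn rrf_merge(lists: &[Vec<&str>], k: f32) -> Vec<(String, f32)> {
    let mut scores: HashMap<String, f32> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f32);
        }
    }
    let mut merged: Vec<(String, f32)> = scores.into_iter().collect();
    merged.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    merged
}
```

RRF only looks at positions, so the keyword and vector lists can be fused without putting their raw scores on a common scale.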
- Add snapshot-based delta indexing: only reindex changed files
- Add list_indexable_files() for previewing what gets indexed
- Add filter() and chunker() config methods
- Rewrite watcher.rs for notify-debouncer-full 0.7.0 API
- Fix ConfigOnly backend to return BackendUnavailable (tool compatibility)
- Add SearchFailed error variant, Language::from_path(), SearchResult text/source fields
- Add FilterConfig struct
- Fix IndexStatus construction (backend, last_sync_ms fields)
- Fix builtin chunk handling (file_path/content fields)
- Fix QMD backend calls (hybrid_search, status, ensure_collections)
- Remove quantize/dequantize from vector search (raw f32 cosine)
- Fix snapshot_store calls (synchronous API)
- Fix test helpers (temp data dirs to avoid path collisions)
74 tests passing, clippy clean, index.rs under 1,500 lines (1,048)

Clippy and test fixes across the code-index crate:
- Gate Arc import on cfg(feature = "file-watcher")
- Gate tracing imports on cfg(any(feature = "builtin", feature = "file-watcher"))
- Add allow(clippy::large_enum_variant) on Backend enum
- Add cfg suppressors for unused vars in no-feature builds
- Fix &content to content in chunker utf-8 boundary test
- Make crate::Error import unconditional in tools.rs
Test allow attributes:
- Add allow(clippy::unwrap_used, clippy::expect_used) to all test modules
Gateway wiring:
- Use struct literal with Default::default() for CodeIndexConfig init
- Add code_index param to all GatewayState::with_options() test call sites
- Add moltis-code-index as dev-dependency of moltis-httpd
Default feature fix:
- Change default from ["qmd"] to ["builtin"]
- Add dep:moltis-agents to builtin feature
Format: cargo +nightly fmt --all
Greptile Summary
This PR adds the builtin SQLite+FTS5 backend for code-index, incremental delta reindexing, and file-watcher infrastructure, completing the 4-PR stack. The previous Greptile review concerns (debouncer lifetime, absolute path in watcher, gateway initialization, parent directory creation) are addressed in the fixup commits. The implementation is functionally correct for the search and indexing paths; remaining issues are around storage efficiency in the FTS5 schema and whether the file-watcher feature is wired through to a call site.
Confidence Score: 4/5
Safe to merge for search/indexing correctness; file-watcher auto-reindex is compiled in but never activated and the FTS5 schema doubles content storage.
All P0/P1 issues from the prior review rounds (debouncer lifetime, absolute path in watcher, gateway init, parent directory creation) are fixed. Two new P2 findings: the FTS5 migration uses content_rowid without content=, making it a no-op that doubles storage, and start_watcher is never called from any gateway code path, so the file-watcher feature ships as dead code. Neither blocks correctness of the search or index paths, but together they represent meaningful incomplete wiring of an advertised feature.
crates/code-index/migrations/20260416200000_code_index_init.sql (FTS5 double-storage), crates/code-index/src/index.rs (start_watcher not wired to a call site)
| Filename | Overview |
|---|---|
| crates/code-index/src/store_sqlite.rs | SQLite+FTS5 implementation with correct transaction handling, quantized embedding roundtrip, multi-project isolation, and comprehensive tests; no new issues. |
| crates/code-index/migrations/20260416200000_code_index_init.sql | FTS5 table uses content_rowid=rowid without content=, making it a no-op and causing content to be stored twice; triggers and indexes are otherwise correct. |
| crates/code-index/src/index.rs | Full/incremental index orchestration, hybrid search, and watcher management are correctly implemented; start_watcher is never called from any gateway code path so file-watcher auto-reindex is currently inert. |
| crates/code-index/src/watcher.rs | Debouncer now stored in struct (lifetime fixed); Remove events correctly bypass is_file() check via is_indexable_by_extension; WatchHandler still cannot distinguish remove from modify events, so chunk deletion for removed files depends on start_watcher's handler being updated. |
| crates/code-index/src/chunker.rs | Line-based chunker with overlap, byte-size splitting, and final index renumbering; all edge cases tested; single-long-line behavior is documented. |
| crates/code-index/src/delta.rs | Delta computation correctly identifies added/modified/removed files via content hash comparison; snapshot helpers are clean and well-tested. |
| crates/code-index/src/snapshot_store.rs | Atomic file write with PID-suffixed temp file and rename; path traversal sanitization is thorough; round-trip tests cover all edge cases. |
| crates/gateway/src/server/init_code_index.rs | Correctly initializes builtin SQLite backend under the code-index-builtin feature, parent directory now created in SqliteCodeIndexStore::new, falls back to config-only gracefully. |
| crates/code-index/src/store.rs | Clean trait definition, correct cosine similarity, quantize/dequantize roundtrip, and RRF hybrid merge; all paths tested. |
| crates/gateway/Cargo.toml | code-index-builtin and file-watcher features correctly threaded through; both are in the default set. |
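The line-based chunking with overlap noted for chunker.rs can be sketched as below. This is an illustration under stated assumptions: the Chunk struct, the chunk_lines name, and its parameters are hypothetical, and the byte-size splitting and renumbering steps the review mentions are left out.

```rust
/// A run of consecutive lines; line numbers are 1-based and inclusive.
#[derive(Debug, PartialEq)]
struct Chunk {
    index: usize,
    start_line: usize,
    end_line: usize,
    text: String,
}

/// Split `source` into windows of at most `max_lines` lines, where each
/// window repeats the last `overlap` lines of the previous one.
fn chunk_lines(source: &str, max_lines: usize, overlap: usize) -> Vec<Chunk> {
    let lines: Vec<&str> = source.lines().collect();
    let max_lines = max_lines.max(1);
    // Advance by at least one line so the loop always terminates,
    // even if `overlap` is as large as the window.
    let step = max_lines.saturating_sub(overlap).max(1);

    let mut chunks = Vec::new();
    let mut start = 0;
    while start < lines.len() {
        let end = (start + max_lines).min(lines.len());
        chunks.push(Chunk {
            index: chunks.len(),
            start_line: start + 1,
            end_line: end,
            text: lines[start..end].join("\n"),
        });
        if end == lines.len() {
            break; // the final window reached the end of the file
        }
        start += step;
    }
    chunks
}
```

The overlap means a definition that straddles a window boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.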
Sequence Diagram
sequenceDiagram
participant GW as Gateway (init_code_index)
participant CI as CodeIndex
participant SS as SqliteCodeIndexStore
participant FTS as SQLite FTS5
participant SN as SnapshotStore
GW->>SS: new(db_path) — create_dir_all + connect
SS->>FTS: run_migrations (triggers: insert/delete)
GW->>CI: new_builtin(config, store, embedder?)
Note over CI: index_project(project_id, force, project_dir)
CI->>SN: load(project_id)
alt No previous snapshot — full index
CI->>SS: clear_project(project_id)
CI->>SS: upsert_chunks (DELETE + INSERT per file)
SS->>FTS: triggers keep FTS5 in sync
CI->>SN: save(project_id, snapshot)
else Snapshot exists — incremental
CI->>CI: compute_delta(prev_snapshot)
CI->>SS: upsert_chunks for added+modified
CI->>SS: delete_file_chunks for removed
CI->>SN: save(project_id, updated_snapshot)
end
Note over CI: search(project_id, query, limit)
CI->>SS: search_keyword via FTS5 MATCH
opt embedder available
CI->>SS: get_project_chunks (all, brute-force)
CI->>CI: cosine_similarity + RRF merge
end
CI-->>GW: Vec SearchResult
Reviews (5): Last reviewed commit: "fix(code-index): SQLite parent dir creat..."
Phase 4 fixes:
- Issue A: Add builtin SQLite backend initialization in init_code_index.rs
- Issue B: Widen tool registration to code-index-builtin when qmd is absent
- Issue C: Fix reindex_files absolute path bug with strip_prefix
- Issue G: Store debouncer in FileWatcher struct to prevent drop
- Issue H: Return TempDir from test helpers instead of mem::forget
- Issue N: Replace git init subprocess with gix::init in tests
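Issue C concerns watcher events that carry absolute paths while the index is keyed by project-relative paths. A minimal sketch of the strip_prefix approach, assuming a hypothetical helper name and Unix-style paths; how reindex_files actually handles paths outside the project is not stated in the PR.

```rust
use std::path::{Path, PathBuf};

/// Turn an event path into a project-relative path.
/// Absolute paths outside `project_dir` yield None; paths that are
/// already relative are passed through unchanged.
fn relative_to_project(project_dir: &Path, path: &Path) -> Option<PathBuf> {
    if path.is_absolute() {
        path.strip_prefix(project_dir).ok().map(Path::to_path_buf)
    } else {
        Some(path.to_path_buf())
    }
}
```

Joining an absolute path onto the project directory would simply replace the base, which matches the kind of bug the fix describes.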
Contains all fixes from PRs #753, #754, and #755, plus the remaining issues addressed. @greptileai review
@greptileai review
- P1 gateway/Cargo.toml: add moltis-code-index/file-watcher to the gateway file-watcher feature so the auto-reindex watcher is compiled into gateway builds (was silently dead in all configurations)
- P2 store_sqlite.rs: replace inline DDL with sqlx::migrate! using a migrations/ directory (20260416200000_code_index_init.sql) and a public run_migrations() function, matching the project convention
@greptileai review
- project_collection_config → project_collections returning Vec per extension (QMD --mask accepts one glob per collection, not comma-joined)
- qmd_config_for_project now takes explicit work_dir parameter
- Updated tests to match new API
Addresses Greptile review feedback for PR 753.

… watcher
- store_sqlite: create parent directory before SQLite connect (create_if_missing only creates the file, not directories)
- watcher: skip is_file() check for Remove events since the file is already gone; add is_indexable_by_extension for extension-only validation on deleted paths
Addresses Greptile P1 review feedback for PR 756.
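The watcher fix can be sketched like this. The name is_indexable_by_extension comes from the commit, but its signature, the EventKind enum, and should_emit are assumptions made for illustration; the real watcher works with notify-debouncer-full events and CodeIndexConfig.

```rust
use std::path::Path;

/// Extension-only check. It never touches the filesystem, so it also
/// works for paths that have already been deleted.
fn is_indexable_by_extension(path: &Path, allowed: &[&str]) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| allowed.iter().any(|a| a.eq_ignore_ascii_case(ext)))
        .unwrap_or(false)
}

/// Simplified stand-in for the watcher's event kinds.
enum EventKind {
    Changed,
    Removed,
}

/// Decide whether an event should be forwarded to the indexer.
fn should_emit(kind: &EventKind, path: &Path, allowed: &[&str]) -> bool {
    match kind {
        // The file is gone, so is_file() would always be false here.
        EventKind::Removed => is_indexable_by_extension(path, allowed),
        // For creates and modifies the path must still be a regular file.
        EventKind::Changed => path.is_file() && is_indexable_by_extension(path, allowed),
    }
}
```

Without the separate branch, every Remove event would fail the is_file() check and chunks for deleted files would never be cleaned up.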
Cascading fixes through the PR chain; hope this eases review. @greptileai review
…able, snapshot migration
- Gate all tracing imports/calls behind #[cfg(feature = "tracing")]
- Use cfg_attr for #[instrument] on async methods
- Add FileMeta struct with mtime+size for cheap delta skip (fixes incremental being slower than first-run)
- FTS5 external content table avoids storing content twice
- Add FTS5 update trigger for chunk content changes
- Snapshot store: legacy format detection, tempfile for atomic writes
- path_skipped: segment matching for nested vendor/node_modules
- Move tempfile from dev-dep to dep (used in snapshot_store)
- Offload discover+filter to spawn_blocking for builtin/file-watcher
- Re-export FileMeta and HashSnapshot from lib.rs
70 tests passing, clippy clean.
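The mtime+size shortcut can be sketched as follows. Only the FileMeta name and the mtime+size idea come from the commit; the field names, the stored hash field, and the needs_rehash helper are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Per-file metadata kept in the snapshot.
/// Field names and the `hash` field are assumptions of this sketch.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    mtime_ms: u64,
    size: u64,
    hash: String,
}

/// Decide whether a file's content must be read and hashed again.
/// A file whose mtime and size both match the snapshot is treated as
/// unchanged, which skips the read and the hash computation.
fn needs_rehash(
    previous: &HashMap<String, FileMeta>,
    path: &str,
    mtime_ms: u64,
    size: u64,
) -> bool {
    match previous.get(path) {
        Some(meta) => meta.mtime_ms != mtime_ms || meta.size != size,
        None => true, // never seen before, must be hashed
    }
}
```

The trade-off is that an edit preserving both mtime and size goes unnoticed until a forced reindex, which is the usual cost of this kind of metadata check.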
@greptileai review
Superseded by #771 (single consolidated PR).
Summary
Final PR in a 4-PR stack. Add the builtin SQLite+FTS5 backend, incremental reindexing with file watcher, and all clippy/test cleanup. Includes all changes from PRs #753, #754, and #755.
What this adds
- store.rs: CodeStore trait for chunk storage and retrieval
- store_sqlite.rs: SQLite implementation with FTS5 full-text search, embedding roundtrip, multi-project isolation
- chunker.rs: Language-aware code chunking with line-based splitting, UTF-8 boundary handling, max chunk size enforcement
- Incremental reindexing with file watcher (file-watcher feature flag)
- Default feature changed from ["qmd"] to ["builtin"]; added dep:moltis-agents to builtin feature
- #[allow(...)] attributes on test modules, cfg-gated imports, large_enum_variant suppression
- &content → content in chunker test, unconditional crate::Error import in tools.rs
Feature flags
- builtin
- file-watcher
- qmd
Testing
-D warnings)
Depends on
Stack
Breaking changes
None.
GatewayState gains a required code_index field but all internal call sites are updated. Public API additions only.