feat(code-index): builtin SQLite+FTS5 backend, incremental reindexing, clippy cleanup#756

Closed
Cstewart-HC wants to merge 21 commits into moltis-org:main from Cstewart-HC:feat/code-index/4-builtin-backend

Conversation

@Cstewart-HC
Contributor

Summary

Final PR in a 4-PR stack. Add the builtin SQLite+FTS5 backend, incremental reindexing with file watcher, and all clippy/test cleanup. Includes all changes from PRs #753, #754, and #755.

What this adds

  • Builtin backend: SQLite + FTS5 keyword search — no external dependencies required
    • store.rs: CodeStore trait for chunk storage and retrieval
    • store_sqlite.rs: SQLite implementation with FTS5 full-text search, embedding roundtrip, multi-project isolation
    • chunker.rs: Language-aware code chunking with line-based splitting, UTF-8 boundary handling, max chunk size enforcement
  • Incremental reindexing: Delta-based reindexing using file content hashes — only re-index changed files
  • File watcher: Notify-based auto-reindex on file changes (behind file-watcher feature flag)
  • Default feature fix: Changed default from ["qmd"] to ["builtin"]; added dep:moltis-agents to builtin feature
  • Clippy cleanup: All #[allow(...)] attributes on test modules, cfg-gated imports, large_enum_variant suppression
  • Test fixes: Fixed &content to content in chunker test, unconditional crate::Error import in tools.rs
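
To illustrate the chunker bullet above, here is a minimal sketch of line-based chunking with overlap plus a UTF-8-safe byte cut. It is not the code in chunker.rs: the `Chunk` struct, the function names, and the parameters are hypothetical, and the real chunker splits oversized chunks rather than truncating them.

```rust
/// Hypothetical chunk type for this sketch (not the crate's real struct).
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub start_line: usize, // 1-based, inclusive
    pub end_line: usize,   // 1-based, inclusive
    pub text: String,
}

/// Split `source` into chunks of at most `max_lines` lines, repeating the
/// last `overlap` lines of each chunk at the start of the next one.
pub fn chunk_lines(source: &str, max_lines: usize, overlap: usize) -> Vec<Chunk> {
    let lines: Vec<&str> = source.lines().collect();
    let max_lines = max_lines.max(1);
    // The step must be at least 1, otherwise the loop would never advance.
    let step = max_lines.saturating_sub(overlap).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < lines.len() {
        let end = (start + max_lines).min(lines.len());
        chunks.push(Chunk {
            start_line: start + 1,
            end_line: end,
            text: lines[start..end].join("\n"),
        });
        if end == lines.len() {
            break;
        }
        start += step;
    }
    chunks
}

/// Cut `text` to at most `max_bytes` without splitting a UTF-8 character.
pub fn truncate_at_char_boundary(text: &str, max_bytes: usize) -> &str {
    if text.len() <= max_bytes {
        return text;
    }
    let mut cut = max_bytes;
    // Walk backwards until the cut lands on a character boundary.
    while cut > 0 && !text.is_char_boundary(cut) {
        cut -= 1;
    }
    &text[..cut]
}
```

Calling `chunk_lines("a\nb\nc\nd\ne", 3, 1)` yields two chunks covering lines 1 to 3 and 3 to 5, so line 3 is shared between them.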

Feature flags

| Feature | Default | Description |
| --- | --- | --- |
| builtin | yes | SQLite+FTS5 keyword search |
| file-watcher | no | File system watcher for auto-reindex |
| qmd | no | QMD embedding-based backend (additive on builtin) |

Testing

  • 70 tests pass (code-index)
  • 353 + 14 tests pass (gateway)
  • 57 tests pass (httpd auth_middleware)
  • Clippy clean (nightly, -D warnings)
  • Format clean (nightly)

Depends on

Stack

  1. feat(code-index): add crate scaffold, config, and file discovery #753 — crate scaffold, config, filter, discover
  2. feat(code-index): add orchestrator, search, tools, delta sync #754 — orchestrator, search, tools, delta sync
  3. feat(gateway): wire code-index into gateway, QMD backend, and tool registry #755 — gateway wiring, QMD backend, tool registry
  4. feat(code-index): builtin SQLite+FTS5 backend, incremental reindexing, clippy cleanup #756 ← this PR — builtin SQLite+FTS5 backend, incremental reindexing, clippy cleanup

Breaking changes

None. GatewayState gains a required code_index field but all internal call sites are updated. Public API additions only.

P1 scope (complete):
- types.rs: Language, FileEntry, FilteredFile, CodeChunk, IndexStatus, SearchResult
- config.rs: CodeIndexConfig with extension allowlist, skip paths, size limits
- discover.rs: git-tracked file enumeration via gix
- filter.rs: extension filter, binary detection, size limit, path exclusion
- error.rs: Error enum with GitRepoNotFound, GitOperation, Io, etc
- backend_qmd.rs: QMD collection config builder (feature-gated)
- lib.rs: module declarations, feature-gated backend_qmd

Tests: 11 passing, clippy: clean
- Remove Language::Bash variant (merges into Shell for round-trip consistency)
- Remove dead GitOperation error variant (YAGNI, add back in P2)
- Move unused deps out of [dependencies] (anyhow, async-trait, regex, etc.)
  into commented P2 section; add sqlx as optional dep for pgvector feature
- Add serde_json as dev-dependency (used in test only)
- Add trailing newlines to all source files (POSIX compliance)
- All 11 tests passing, clippy clean
- Add index.rs: CodeIndex struct with config_only() and new() constructors,
  list_indexable_files(), index_project(), search(), keyword_search(), status()
- Add search.rs: QMD result adapter (from_qmd, from_qmd_results)
- Update lib.rs: export search module (feature-gated), index module, CodeIndex re-export
- Clean up backend_qmd.rs: remove unused import annotations from P1
- Fix API mismatches: QMD search methods don't take collection param;
  QmdSearchResult.line is i64, score is f32
- Add tokio dev-dependency for async tests
- All 14 tests passing, clippy clean
- Add tools.rs with CodebaseSearchTool, CodebasePeekTool, CodebaseStatusTool
  implementing AgentTool trait from moltis-agents
- Fix deferred issue #1: replace SystemTime epoch math with time crate
- Fix deferred issue #4: add enable_embeddings parameter to index_project()
- Add register_tools() helper for gateway wiring
- Add anyhow, async-trait, serde_json, time dependencies
- Feature-gate tools module behind qmd (requires CodeIndex with backend)
- 6 new tests: peek, search (backend required), status, parameter schemas
…indexing

Adds two new modules:

- watcher.rs: CodeIndexWatcher using notify_debouncer_full (same pattern
  as moltis-skills and moltis-openclaw-import). Watches a project directory
  for file create/modify/delete events, filters through CodeIndexConfig's
  extension allowlist and path exclusions, emits CodeWatchEvent::Changed
  and CodeWatchEvent::Removed via tokio mpsc. Feature-gated behind
  'file-watcher'.

- delta.rs: compute_delta() and build_initial_snapshot() for incremental
  reindexing. Computes SyncDelta (added/removed/modified file sets) by
  comparing current git-tracked filtered files against a previous
  HashSnapshot (HashMap<String, String> of relative_path → sha256).
  Uses content_hash from filter.rs for change detection.
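
The delta computation described above can be sketched roughly as follows. This is a simplified illustration, not the crate's compute_delta(): the function name `diff_snapshots` and its signature are assumptions, and it compares two ready-made snapshots instead of walking git-tracked files.

```rust
use std::collections::{BTreeSet, HashMap};

/// relative_path -> content hash, matching the HashSnapshot shape named in
/// the commit message.
pub type HashSnapshot = HashMap<String, String>;

/// Sets of paths that differ between two snapshots (field names assumed).
#[derive(Debug, Default, PartialEq)]
pub struct SyncDelta {
    pub added: BTreeSet<String>,
    pub modified: BTreeSet<String>,
    pub removed: BTreeSet<String>,
}

/// Hypothetical helper: classify every path by comparing its hash in the
/// previous and current snapshots.
pub fn diff_snapshots(previous: &HashSnapshot, current: &HashSnapshot) -> SyncDelta {
    let mut delta = SyncDelta::default();
    for (path, hash) in current {
        match previous.get(path) {
            // Path was not indexed before.
            None => {
                delta.added.insert(path.clone());
            }
            // Path existed but its content hash changed.
            Some(old) if old != hash => {
                delta.modified.insert(path.clone());
            }
            // Unchanged: nothing to reindex.
            Some(_) => {}
        }
    }
    for path in previous.keys() {
        if !current.contains_key(path) {
            delta.removed.insert(path.clone());
        }
    }
    delta
}
```

Only the paths in `added` and `modified` need to be re-chunked; `removed` paths only need their stored chunks deleted.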

Feature flag 'file-watcher' added to Cargo.toml, depends on
notify-debouncer-full (workspace dep) and tokio sync feature.

All 32 tests passing, clippy clean with --all-features.
Review fixes applied:
- #1: Extract require_str/opt_usize_or to moltis_tools::params — replaced
  local helpers with params::require_str() and params::u64_param() from
  the shared workspace crate
- #2: Unified error model — CodebaseSearchTool now returns
  Ok(json!({error:..., search_available: false})) for
  BackendUnavailable, matching Peek/Status pattern
- #3: u64→usize truncating cast replaced with usize::try_from().unwrap_or()
- #4: ensure_collections() error remapped from BackendUnavailable to
  IndexFailed { project_id, message }
- moltis-org#6: result.line as usize now clamped with .max(1) minimum
- moltis-org#7: compute_delta carries forward previous hash on hash errors so
  files aren't spuriously marked as removed
- moltis-org#8: Added doc comment noting that watcher batches may contain
  duplicate paths
- moltis-org#9: Extracted effective_extension() from filter.rs, removed
  duplication between filter.rs and watcher.rs
- moltis-org#11: Added tracing::debug! for skipped files in build_initial_snapshot
- moltis-org#12: Added 'drop to stop' documentation on CodeIndexWatcher::start()

32 tests pass, clippy clean.
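
Review fixes #3 and moltis-org#6 above concern integer conversions. A minimal sketch of the same idea, with hypothetical helper names; the line-number variant uses try_from where the commit describes a cast followed by .max(1):

```rust
/// Convert a u64 parameter to usize, falling back to `default` when the
/// value does not fit (possible on 32-bit targets) instead of truncating.
pub fn limit_from_u64(value: u64, default: usize) -> usize {
    usize::try_from(value).unwrap_or(default)
}

/// Convert a signed line number to a 1-based usize. Zero, negative, and
/// out-of-range values all become 1.
pub fn line_number(line: i64) -> usize {
    usize::try_from(line).unwrap_or(1).max(1)
}
```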
…l delta

Adds SnapshotStore — a file-backed, atomic-write store that persists
HashSnapshot per project to <data_dir>/code-index/<project_id>.json.

- SnapshotStore::new(base_dir) / ::default_path() constructors
- load(project_id) -> Option<HashSnapshot> (None if first run)
- save(project_id, &HashSnapshot) — atomic write via .tmp + rename
- delete(project_id) — removes snapshot file
- sanitize_project_id() rejects path traversal (/ \ .. \0)
- CodeIndex now owns a SnapshotStore, wired via config.data_dir
- CodeIndex::{load_snapshot, save_snapshot} delegate to store
- index_project() saves snapshot after successful reindex
- Added Error::Store variant for snapshot I/O errors
- Added moltis-config workspace dep for data_dir() resolution
- 10 new unit tests, all 42 tests pass, clippy clean
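
A rough sketch of the id sanitization and the .tmp + rename write described for SnapshotStore, using only the standard library. The signatures are assumptions, and rejecting the empty id is an extra assumption not stated in the commit message.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Reject ids that could escape the snapshot directory. The commit lists
/// `/`, `\`, `..` and NUL; the empty-id check is an assumption.
pub fn sanitize_project_id(id: &str) -> Option<&str> {
    let bad = id.is_empty()
        || id.contains('/')
        || id.contains('\\')
        || id.contains("..")
        || id.contains('\0');
    if bad {
        None
    } else {
        Some(id)
    }
}

/// Write `contents` to `<base_dir>/<project_id>.json` via a sibling .tmp
/// file and a rename, so readers never observe a half-written snapshot.
pub fn save_snapshot(base_dir: &Path, project_id: &str, contents: &str) -> io::Result<PathBuf> {
    let id = sanitize_project_id(project_id)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "invalid project id"))?;
    fs::create_dir_all(base_dir)?;
    let final_path = base_dir.join(format!("{id}.json"));
    let tmp_path = base_dir.join(format!("{id}.json.tmp"));
    fs::write(&tmp_path, contents)?;
    // Rename replaces the destination in one step on the same filesystem.
    fs::rename(&tmp_path, &final_path)?;
    Ok(final_path)
}
```

This sketch does not fsync, so it guards against partial reads but not against data loss on power failure.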
- Add init_code_index module (config-only mode, no backend required)
- Thread code_index through PostStateInputs → complete_startup → GatewayState
- Add code_index field to GatewayState (immutable, after memory_manager)
- GatewayState::new() creates a default config-only index for tests
- Gateway compiles cleanly with zero new warnings
- Add trailing newline to init_code_index.rs
- Align moltis-code-index entry with neighbours in Cargo.toml
- Add init_code_index to module doc comment in server/mod.rs
- init_code_index now async: checks QMD availability at startup
- Full mode (CodeIndex::new) when QMD binary is present and reachable
- Falls back to config-only mode with warn log if QMD is absent
- Falls back gracefully when compiled without qmd feature
- Gateway qmd feature now also enables moltis-code-index/qmd
- Empty collections at init; per-project registration deferred to index_project()
…egistration

- Extract single-collection registration from ensure_collections loop
  into public ensure_collection() method on QmdManager
- ensure_collections() now delegates to ensure_collection() per entry
- CodeIndex::index_project() builds project-specific QmdCollection via
  backend_qmd::project_collection_config and registers it idempotently
  before refresh_index
- Clones Arc<CodeIndex> before move into GatewayState
- Registers codebase_search, codebase_peek, codebase_status tools
  when the qmd feature is enabled
- Tools are available to the LLM agent for indexed workspaces
Implements the builtin backend for code-index using SQLite + FTS5:

New files:
- store.rs: CodeIndexStore trait, RRF merge, cosine similarity, quantization
- store_sqlite.rs: SQLite backend with FTS5 keyword search + i8 embeddings
- chunker.rs: Line-based code chunker with overlap and byte-size splitting

Changes:
- index.rs: Backend enum dispatch (Qmd/Builtin/ConfigOnly), builtin index
  and search pipelines with hybrid vector+keyword RRF
- lib.rs: Wire new modules, restore file-watcher gate
- types.rs: Rename CodeChunk → DiscoveredChunk to avoid name collision
  with store::CodeChunk
- error.rs: Add IndexStore variant
- tools.rs: Adapt to new Backend enum

Review fixes applied:
- P0: Add crash-safety TODO for clear-then-reindex; fix init/clear ordering
- P0: Rename types::CodeChunk → DiscoveredChunk (name collision)
- P1: Document quantization tradeoff (space vs recall on non-normalized)
- P1: Add PERF note on brute-force vector search memory usage
- P1: Log warning on embedding failure instead of silent fallback
- P1: Restore #[cfg(feature = "file-watcher")] module gate
- P1: Use actual embedder.model_name() instead of "builtin" stub
- P2: Fix negative keyword scores → 1.0/(1.0+rank) for valid (0,1] range
- P2: Remove misleading project_dir.join(file.path), use file.path directly
- P2: Replace all unwrap()/expect() with safe alternatives (clippy clean)
- P2: Redundant closure fix

Clippy: 0 errors, 0 warnings (builtin + qmd + file-watcher features)
Tests: 39/39 passing
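
The hybrid search pieces named in this commit (cosine similarity, the 1.0/(1.0+rank) keyword score, and RRF merge) can be sketched as below. This is an illustration under assumptions, not the code in store.rs: the function names are hypothetical, `rank` is treated as a 0-based result position, and the RRF constant `k` is left to the caller.

```rust
use std::collections::HashMap;

/// Cosine similarity of two vectors; returns 0.0 for mismatched lengths,
/// empty input, or a zero-length vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Map a 0-based result position to a score in (0, 1].
pub fn rank_score(rank: usize) -> f32 {
    1.0 / (1.0 + rank as f32)
}

/// Reciprocal rank fusion: each list contributes 1 / (k + rank) per id,
/// with rank counted from 1. Ties are broken by id for stable output.
pub fn rrf_merge(lists: &[Vec<String>], k: f32) -> Vec<(String, f32)> {
    let mut scores: HashMap<String, f32> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.clone()).or_insert(0.0) += 1.0 / (k + (i + 1) as f32);
        }
    }
    let mut merged: Vec<(String, f32)> = scores.into_iter().collect();
    merged.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    merged
}
```

An id that appears in both the keyword list and the vector list accumulates two contributions, which is why it outranks ids found by only one of the searches.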
- Add snapshot-based delta indexing: only reindex changed files
- Add list_indexable_files() for previewing what gets indexed
- Add filter() and chunker() config methods
- Rewrite watcher.rs for notify-debouncer-full 0.7.0 API
- Fix ConfigOnly backend to return BackendUnavailable (tool compatibility)
- Add SearchFailed error variant, Language::from_path(), SearchResult text/source fields
- Add FilterConfig struct
- Fix IndexStatus construction (backend, last_sync_ms fields)
- Fix builtin chunk handling (file_path/content fields)
- Fix QMD backend calls (hybrid_search, status, ensure_collections)
- Remove quantize/dequantize from vector search (raw f32 cosine)
- Fix snapshot_store calls (synchronous API)
- Fix test helpers (temp data dirs to avoid path collisions)

74 tests passing, clippy clean, index.rs under 1,500 lines (1,048)
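
As a rough illustration of the Backend enum dispatch and the ConfigOnly fix listed above: the sketch below omits the Qmd variant, stores chunks as plain strings, and uses substring matching in place of FTS5. All names other than Backend, Builtin, ConfigOnly, and BackendUnavailable are hypothetical.

```rust
/// Simplified error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum Error {
    BackendUnavailable,
}

/// Simplified backend enum; the Qmd variant is left out here.
pub enum Backend {
    Builtin { chunks: Vec<String> },
    ConfigOnly,
}

impl Backend {
    /// Dispatch a keyword search to whichever backend is configured.
    pub fn keyword_search(&self, query: &str) -> Result<Vec<String>, Error> {
        match self {
            // Stand-in for the FTS5 query: plain substring matching.
            Backend::Builtin { chunks } => Ok(chunks
                .iter()
                .filter(|chunk| chunk.contains(query))
                .cloned()
                .collect()),
            // No search backend: report it so tools can say search is unavailable.
            Backend::ConfigOnly => Err(Error::BackendUnavailable),
        }
    }
}
```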
Clippy and test fixes across the code-index crate:
- Gate Arc import on cfg(feature = "file-watcher")
- Gate tracing imports on cfg(any(feature = "builtin", feature = "file-watcher"))
- Add allow(clippy::large_enum_variant) on Backend enum
- Add cfg suppressors for unused vars in no-feature builds
- Fix &content to content in chunker utf-8 boundary test
- Make crate::Error import unconditional in tools.rs

Test allow attributes:
- Add allow(clippy::unwrap_used, clippy::expect_used) to all test modules

Gateway wiring:
- Use struct literal with Default::default() for CodeIndexConfig init
- Add code_index param to all GatewayState::with_options() test call sites
- Add moltis-code-index as dev-dependency of moltis-httpd

Default feature fix:
- Change default from ["qmd"] to ["builtin"]
- Add dep:moltis-agents to builtin feature

Format: cargo +nightly fmt --all
@greptile-apps
Contributor

greptile-apps Bot commented Apr 16, 2026

Greptile Summary

This PR adds the builtin SQLite+FTS5 backend for code-index, incremental delta reindexing, and file-watcher infrastructure, completing the 4-PR stack. The previous Greptile review concerns (debouncer lifetime, absolute path in watcher, gateway initialization, parent directory creation) are addressed in the fixup commits. The implementation is functionally correct for the search and indexing paths; remaining issues are around storage efficiency in the FTS5 schema and whether the file-watcher feature is wired through to a call site.

Confidence Score: 4/5

Safe to merge for search/indexing correctness; file-watcher auto-reindex is compiled in but never activated and the FTS5 schema doubles content storage.

All P0/P1 issues from the prior review rounds (debouncer lifetime, absolute path in watcher, gateway init, parent directory creation) are fixed. Two new P2 findings: the FTS5 migration uses content_rowid without content= making it a no-op that doubles storage, and start_watcher is never called from any gateway code path so the file-watcher feature ships as dead code. Neither blocks correctness of the search or index paths, but together they represent meaningful incomplete wiring of an advertised feature.
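
For context on the FTS5 finding: in SQLite's FTS5, `content_rowid` only has an effect together with the `content=` option, which turns the index into an external-content table that reads text from another table instead of keeping its own copy. The sketch below shows that shape as an SQL string embedded in Rust. The table, column, and trigger names are hypothetical and are not taken from the PR's migration file.

```rust
/// Hypothetical schema sketch: an FTS5 external-content index over a
/// `chunks` table, kept in sync by insert/delete/update triggers.
pub const FTS5_EXTERNAL_CONTENT_SKETCH: &str = r#"
CREATE TABLE chunks (
    id         INTEGER PRIMARY KEY,
    project_id TEXT NOT NULL,
    file_path  TEXT NOT NULL,
    content    TEXT NOT NULL
);

-- content='chunks' makes FTS5 read text from `chunks` instead of copying it.
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    content,
    content='chunks',
    content_rowid='id'
);

CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;

CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;

CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
"#;
```

With an external-content table the triggers are required, because FTS5 no longer sees writes to the base table on its own.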

crates/code-index/migrations/20260416200000_code_index_init.sql (FTS5 double-storage), crates/code-index/src/index.rs (start_watcher not wired to a call site)

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/code-index/src/store_sqlite.rs | SQLite+FTS5 implementation with correct transaction handling, quantized embedding roundtrip, multi-project isolation, and comprehensive tests; no new issues. |
| crates/code-index/migrations/20260416200000_code_index_init.sql | FTS5 table uses content_rowid=rowid without content=, making it a no-op and causing content to be stored twice; triggers and indexes are otherwise correct. |
| crates/code-index/src/index.rs | Full/incremental index orchestration, hybrid search, and watcher management are correctly implemented; start_watcher is never called from any gateway code path so file-watcher auto-reindex is currently inert. |
| crates/code-index/src/watcher.rs | Debouncer now stored in struct (lifetime fixed); Remove events correctly bypass is_file() check via is_indexable_by_extension; WatchHandler still cannot distinguish remove from modify events, so chunk deletion for removed files depends on start_watcher's handler being updated. |
| crates/code-index/src/chunker.rs | Line-based chunker with overlap, byte-size splitting, and final index renumbering; all edge cases tested; single-long-line behavior is documented. |
| crates/code-index/src/delta.rs | Delta computation correctly identifies added/modified/removed files via content hash comparison; snapshot helpers are clean and well-tested. |
| crates/code-index/src/snapshot_store.rs | Atomic file write with PID-suffixed temp file and rename; path traversal sanitization is thorough; round-trip tests cover all edge cases. |
| crates/gateway/src/server/init_code_index.rs | Correctly initializes builtin SQLite backend under the code-index-builtin feature, parent directory now created in SqliteCodeIndexStore::new, falls back to config-only gracefully. |
| crates/code-index/src/store.rs | Clean trait definition, correct cosine similarity, quantize/dequantize roundtrip, and RRF hybrid merge; all paths tested. |
| crates/gateway/Cargo.toml | code-index-builtin and file-watcher features correctly threaded through; both are in the default set. |

Sequence Diagram

sequenceDiagram
    participant GW as Gateway (init_code_index)
    participant CI as CodeIndex
    participant SS as SqliteCodeIndexStore
    participant FTS as SQLite FTS5
    participant SN as SnapshotStore

    GW->>SS: new(db_path) — create_dir_all + connect
    SS->>FTS: run_migrations (triggers: insert/delete)
    GW->>CI: new_builtin(config, store, embedder?)

    Note over CI: index_project(project_id, force, project_dir)
    CI->>SN: load(project_id)
    alt No previous snapshot — full index
        CI->>SS: clear_project(project_id)
        CI->>SS: upsert_chunks (DELETE + INSERT per file)
        SS->>FTS: triggers keep FTS5 in sync
        CI->>SN: save(project_id, snapshot)
    else Snapshot exists — incremental
        CI->>CI: compute_delta(prev_snapshot)
        CI->>SS: upsert_chunks for added+modified
        CI->>SS: delete_file_chunks for removed
        CI->>SN: save(project_id, updated_snapshot)
    end

    Note over CI: search(project_id, query, limit)
    CI->>SS: search_keyword via FTS5 MATCH
    opt embedder available
        CI->>SS: get_project_chunks (all, brute-force)
        CI->>CI: cosine_similarity + RRF merge
    end
    CI-->>GW: Vec SearchResult

Reviews (5): Last reviewed commit: "fix(code-index): SQLite parent dir creat..."

Comment thread crates/code-index/src/watcher.rs
Comment thread crates/code-index/src/index.rs
Comment thread crates/gateway/src/server/init_code_index.rs
Comment thread crates/code-index/src/index.rs
Phase 4 fixes:
- Issue A: Add builtin SQLite backend initialization in init_code_index.rs
- Issue B: Widen tool registration to code-index-builtin when qmd is absent
- Issue C: Fix reindex_files absolute path bug with strip_prefix
- Issue G: Store debouncer in FileWatcher struct to prevent drop
- Issue H: Return TempDir from test helpers instead of mem::forget
- Issue N: Replace git init subprocess with gix::init in tests
@Cstewart-HC
Contributor Author

Contains all fixes from PRs #753, #754, and #755, plus the remaining issues addressed.

@greptileai review

Comment thread crates/code-index/src/store_sqlite.rs
@Cstewart-HC
Contributor Author

@greptileai review

- P1 gateway/Cargo.toml: add moltis-code-index/file-watcher to the gateway
  file-watcher feature so the auto-reindex watcher is compiled into gateway
  builds (was silently dead in all configurations)
- P2 store_sqlite.rs: replace inline DDL with sqlx::migrate! using a
  migrations/ directory (20260416200000_code_index_init.sql) and a
  public run_migrations() function, matching the project convention
@Cstewart-HC
Contributor Author

@greptileai review

Comment thread crates/code-index/src/watcher.rs
- project_collection_config → project_collections returning Vec per extension
  (QMD --mask accepts one glob per collection, not comma-joined)
- qmd_config_for_project now takes explicit work_dir parameter
- Updated tests to match new API

Addresses Greptile review feedback for PR 753.
… watcher

- store_sqlite: create parent directory before SQLite connect
  (create_if_missing only creates the file, not directories)
- watcher: skip is_file() check for Remove events since the file
  is already gone; add is_indexable_by_extension for extension-only
  validation on deleted paths

Addresses Greptile P1 review feedback for PR 756.
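
The watcher fixes in this thread (extension-only validation for removed paths, and strip_prefix to get a project-relative path) can be sketched as follows. The `EventKind` enum, the function `relative_indexable_path`, and both signatures are hypothetical; only the name is_indexable_by_extension comes from the commit message.

```rust
use std::path::{Path, PathBuf};

/// Simplified event kind for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EventKind {
    Changed,
    Removed,
}

/// Extension-only check, usable for paths that no longer exist on disk.
pub fn is_indexable_by_extension(path: &Path, allowed: &[&str]) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| allowed.iter().any(|a| a.eq_ignore_ascii_case(ext)))
        .unwrap_or(false)
}

/// Return the project-relative path if the event should be forwarded.
pub fn relative_indexable_path(
    project_root: &Path,
    absolute: &Path,
    kind: EventKind,
    allowed: &[&str],
) -> Option<PathBuf> {
    if !is_indexable_by_extension(absolute, allowed) {
        return None;
    }
    // A removed file is already gone, so is_file() would always be false;
    // only apply that check to create/modify events.
    if kind == EventKind::Changed && !absolute.is_file() {
        return None;
    }
    // Paths outside the project root are dropped rather than indexed.
    absolute.strip_prefix(project_root).ok().map(Path::to_path_buf)
}
```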
@Cstewart-HC
Contributor Author

Cascading fixes through the PR chain; hope this eases review.

@greptileai review

…able, snapshot migration

- Gate all tracing imports/calls behind #[cfg(feature = "tracing")]
- Use cfg_attr for #[instrument] on async methods
- Add FileMeta struct with mtime+size for cheap delta skip (fixes
  incremental being slower than first-run)
- FTS5 external content table avoids storing content twice
- Add FTS5 update trigger for chunk content changes
- Snapshot store: legacy format detection, tempfile for atomic writes
- path_skipped: segment matching for nested vendor/node_modules
- Move tempfile from dev-dep to dep (used in snapshot_store)
- Offload discover+filter to spawn_blocking for builtin/file-watcher
- Re-export FileMeta and HashSnapshot from lib.rs

70 tests passing, clippy clean.
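
The mtime+size skip mentioned in this commit might look roughly like the sketch below: a file is re-hashed only when its cheap metadata differs from the stored snapshot entry. The field names on `FileMeta`, the tuple layout of `current`, and the function name are assumptions, not the crate's actual definitions.

```rust
use std::collections::HashMap;

/// Hypothetical per-file snapshot entry: cheap metadata plus content hash.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileMeta {
    pub mtime_secs: u64,
    pub size: u64,
    pub hash: String,
}

/// Given the previous snapshot and the current (path, mtime_secs, size)
/// listing, return the paths that must be read and hashed again. Files
/// whose mtime and size both match are assumed unchanged and skipped.
pub fn paths_needing_rehash(
    previous: &HashMap<String, FileMeta>,
    current: &[(String, u64, u64)],
) -> Vec<String> {
    current
        .iter()
        .filter(|(path, mtime_secs, size)| match previous.get(path) {
            Some(meta) => meta.mtime_secs != *mtime_secs || meta.size != *size,
            // New file: no stored hash yet.
            None => true,
        })
        .map(|(path, _, _)| path.clone())
        .collect()
}
```

The trade-off is the usual one for metadata-based checks: an edit that preserves both mtime and size would be missed until a forced full reindex.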
@Cstewart-HC
Contributor Author

Cstewart-HC commented Apr 17, 2026

@greptileai review

@Cstewart-HC
Contributor Author

Superseded by #771 (single consolidated PR).

@Cstewart-HC Cstewart-HC deleted the feat/code-index/4-builtin-backend branch April 17, 2026 15:32