Skip to content

[Superseded] feat(code-index): add code indexing crate with builtin and QMD backends#752

Closed
Cstewart-HC wants to merge 15 commits intomoltis-org:mainfrom
Cstewart-HC:feat/code-index-crate
Closed

[Superseded] feat(code-index): add code indexing crate with builtin and QMD backends#752
Cstewart-HC wants to merge 15 commits intomoltis-org:mainfrom
Cstewart-HC:feat/code-index-crate

Conversation

@Cstewart-HC
Copy link
Copy Markdown
Contributor

@Cstewart-HC Cstewart-HC commented Apr 16, 2026

⚠️ Superseded

This PR has been split into a 4-PR stack for easier review:

  1. feat(code-index): add crate scaffold, config, and file discovery #753 — feat(code-index): add crate scaffold, config, and file discovery
  2. feat(code-index): add orchestrator, search, tools, delta sync #754 — feat(code-index): add orchestrator, search, tools, delta sync
  3. feat(gateway): wire code-index into gateway, QMD backend, and tool registry #755 — feat(gateway): wire code-index into gateway, QMD backend, and tool registry
  4. feat(code-index): builtin SQLite+FTS5 backend, incremental reindexing, clippy cleanup #756 — feat(code-index): builtin SQLite+FTS5 backend, incremental reindexing, clippy cleanup

Please review the stack above instead. This PR will be closed once the stack is merged.

P1 scope (complete):
- types.rs: Language, FileEntry, FilteredFile, CodeChunk, IndexStatus, SearchResult
- config.rs: CodeIndexConfig with extension allowlist, skip paths, size limits
- discover.rs: git-tracked file enumeration via gix
- filter.rs: extension filter, binary detection, size limit, path exclusion
- error.rs: Error enum with GitRepoNotFound, GitOperation, Io, etc
- backend_qmd.rs: QMD collection config builder (feature-gated)
- lib.rs: module declarations, feature-gated backend_qmd

Tests: 11 passing, clippy: clean
- Remove Language::Bash variant (merges into Shell for round-trip consistency)
- Remove dead GitOperation error variant (YAGNI, add back in P2)
- Move unused deps out of [dependencies] (anyhow, async-trait, regex, etc.)
  into commented P2 section; add sqlx as optional dep for pgvector feature
- Add serde_json as dev-dependency (used in test only)
- Add trailing newlines to all source files (POSIX compliance)
- All 11 tests passing, clippy clean
- Add index.rs: CodeIndex struct with config_only() and new() constructors,
  list_indexable_files(), index_project(), search(), keyword_search(), status()
- Add search.rs: QMD result adapter (from_qmd, from_qmd_results)
- Update lib.rs: export search module (feature-gated), index module, CodeIndex re-export
- Clean up backend_qmd.rs: remove unused import annotations from P1
- Fix API mismatches: QMD search methods don't take collection param;
  QmdSearchResult.line is i64, score is f32
- Add tokio dev-dependency for async tests
- All 14 tests passing, clippy clean
- Add tools.rs with CodebaseSearchTool, CodebasePeekTool, CodebaseStatusTool
  implementing AgentTool trait from moltis-agents
- Fix deferred issue #1: replace SystemTime epoch math with time crate
- Fix deferred issue #4: add enable_embeddings parameter to index_project()
- Add register_tools() helper for gateway wiring
- Add anyhow, async-trait, serde_json, time dependencies
- Feature-gate tools module behind qmd (requires CodeIndex with backend)
- 6 new tests: peek, search (backend required), status, parameter schemas
…indexing

Adds two new modules:

- watcher.rs: CodeIndexWatcher using notify_debouncer_full (same pattern
  as moltis-skills and moltis-openclaw-import). Watches a project directory
  for file create/modify/delete events, filters through CodeIndexConfig's
  extension allowlist and path exclusions, emits CodeWatchEvent::Changed
  and CodeWatchEvent::Removed via tokio mpsc. Feature-gated behind
  'file-watcher'.

- delta.rs: compute_delta() and build_initial_snapshot() for incremental
  reindexing. Computes SyncDelta (added/removed/modified file sets) by
  comparing current git-tracked filtered files against a previous
  HashSnapshot (HashMap<String, String> of relative_path → sha256).
  Uses content_hash from filter.rs for change detection.

Feature flag 'file-watcher' added to Cargo.toml, depends on
notify-debouncer-full (workspace dep) and tokio sync feature.

All 32 tests passing, clippy clean with --all-features.
Review fixes applied:
- #1: Extract require_str/opt_usize_or to moltis_tools::params — replaced
  local helpers with params::require_str() and params::u64_param() from
  the shared workspace crate
- #2: Unified error model — CodebaseSearchTool now returns
  Ok(json!({error:..., search_available: false})) for
  BackendUnavailable, matching Peek/Status pattern
- #3: u64→usize truncating cast replaced with usize::try_from().unwrap_or()
- #4: ensure_collections() error remapped from BackendUnavailable to
  IndexFailed { project_id, message }
- moltis-org#6: result.line as usize now clamped with .max(1) minimum
- moltis-org#7: compute_delta carries forward previous hash on hash errors so
  files aren't spuriously marked as removed
- moltis-org#8: Added doc comment noting that watcher batches may contain
  duplicate paths
- moltis-org#9: Extracted effective_extension() from filter.rs, removed
  duplication between filter.rs and watcher.rs
- moltis-org#11: Added tracing::debug! for skipped files in build_initial_snapshot
- moltis-org#12: Added 'drop to stop' documentation on CodeIndexWatcher::start()

32 tests pass, clippy clean.
…l delta

Adds SnapshotStore — a file-backed, atomic-write store that persists
HashSnapshot per project to <data_dir>/code-index/<project_id>.json.

- SnapshotStore::new(base_dir) / ::default_path() constructors
- load(project_id) -> Option<HashSnapshot> (None if first run)
- save(project_id, &HashSnapshot) — atomic write via .tmp + rename
- delete(project_id) — removes snapshot file
- sanitize_project_id() rejects path traversal (/ \ .. \0)
- CodeIndex now owns a SnapshotStore, wired via config.data_dir
- CodeIndex::{load_snapshot, save_snapshot} delegate to store
- index_project() saves snapshot after successful reindex
- Added Error::Store variant for snapshot I/O errors
- Added moltis-config workspace dep for data_dir() resolution
- 10 new unit tests, all 42 tests pass, clippy clean
- Add init_code_index module (config-only mode, no backend required)
- Thread code_index through PostStateInputs → complete_startup → GatewayState
- Add code_index field to GatewayState (immutable, after memory_manager)
- GatewayState::new() creates a default config-only index for tests
- Gateway compiles cleanly with zero new warnings
- Add trailing newline to init_code_index.rs
- Align moltis-code-index entry with neighbours in Cargo.toml
- Add init_code_index to module doc comment in server/mod.rs
- init_code_index now async: checks QMD availability at startup
- Full mode (CodeIndex::new) when QMD binary is present and reachable
- Falls back to config-only mode with warn log if QMD is absent
- Falls back gracefully when compiled without qmd feature
- Gateway qmd feature now also enables moltis-code-index/qmd
- Empty collections at init; per-project registration deferred to index_project()
…egistration

- Extract single-collection registration from ensure_collections loop
  into public ensure_collection() method on QmdManager
- ensure_collections() now delegates to ensure_collection() per entry
- CodeIndex::index_project() builds project-specific QmdCollection via
  backend_qmd::project_collection_config and registers it idempotently
  before refresh_index
- Clones Arc<CodeIndex> before move into GatewayState
- Registers codebase_search, codebase_peek, codebase_status tools
  when the qmd feature is enabled
- Tools are available to the LLM agent for indexed workspaces
Implements the builtin backend for code-index using SQLite + FTS5:

New files:
- store.rs: CodeIndexStore trait, RRF merge, cosine similarity, quantization
- store_sqlite.rs: SQLite backend with FTS5 keyword search + i8 embeddings
- chunker.rs: Line-based code chunker with overlap and byte-size splitting

Changes:
- index.rs: Backend enum dispatch (Qmd/Builtin/ConfigOnly), builtin index
  and search pipelines with hybrid vector+keyword RRF
- lib.rs: Wire new modules, restore file-watcher gate
- types.rs: Rename CodeChunk → DiscoveredChunk to avoid name collision
  with store::CodeChunk
- error.rs: Add IndexStore variant
- tools.rs: Adapt to new Backend enum

Review fixes applied:
- P0: Add crash-safety TODO for clear-then-reindex; fix init/clear ordering
- P0: Rename types::CodeChunk → DiscoveredChunk (name collision)
- P1: Document quantization tradeoff (space vs recall on non-normalized)
- P1: Add PERF note on brute-force vector search memory usage
- P1: Log warning on embedding failure instead of silent fallback
- P1: Restore #[cfg(feature = \"file-watcher\")] module gate
- P1: Use actual embedder.model_name() instead of \"builtin\" stub
- P2: Fix negative keyword scores → 1.0/(1.0+rank) for valid (0,1] range
- P2: Remove misleading project_dir.join(file.path), use file.path directly
- P2: Replace all unwrap()/expect() with safe alternatives (clippy clean)
- P2: Redundant closure fix

Clippy: 0 errors, 0 warnings (builtin + qmd + file-watcher features)
Tests: 39/39 passing
- Add snapshot-based delta indexing: only reindex changed files
- Add list_indexable_files() for previewing what gets indexed
- Add filter() and chunker() config methods
- Rewrite watcher.rs for notify-debouncer-full 0.7.0 API
- Fix ConfigOnly backend to return BackendUnavailable (tool compatibility)
- Add SearchFailed error variant, Language::from_path(), SearchResult text/source fields
- Add FilterConfig struct
- Fix IndexStatus construction (backend, last_sync_ms fields)
- Fix builtin chunk handling (file_path/content fields)
- Fix QMD backend calls (hybrid_search, status, ensure_collections)
- Remove quantize/dequantize from vector search (raw f32 cosine)
- Fix snapshot_store calls (synchronous API)
- Fix test helpers (temp data dirs to avoid path collisions)

74 tests passing, clippy clean, index.rs under 1,500 lines (1,048)
Clippy and test fixes across the code-index crate:
- Gate Arc import on cfg(feature = "file-watcher")
- Gate tracing imports on cfg(any(feature = "builtin", feature = "file-watcher"))
- Add allow(clippy::large_enum_variant) on Backend enum
- Add cfg suppressors for unused vars in no-feature builds
- Fix &content to content in chunker utf-8 boundary test
- Make crate::Error import unconditional in tools.rs

Test allow attributes:
- Add allow(clippy::unwrap_used, clippy::expect_used) to all test modules

Gateway wiring:
- Use struct literal with Default::default() for CodeIndexConfig init
- Add code_index param to all GatewayState::with_options() test call sites
- Add moltis-code-index as dev-dependency of moltis-httpd

Default feature fix:
- Change default from ["qmd"] to ["builtin"]
- Add dep:moltis-agents to builtin feature

Format: cargo +nightly fmt --all
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 16, 2026

Greptile Summary

This PR adds a new moltis-code-index crate providing SQLite+FTS5 keyword search and optional QMD embedding-based search for workspace code intelligence. The crate itself is well-implemented — the chunker, snapshot store, SQLite store, RRF merge, and cosine similarity are solid with good test coverage — but the gateway wiring has two bugs that leave the default build entirely non-functional, plus an incremental-reindex correctness issue in the watcher path.

  • init_code_index.rs never initializes the builtin backend: the only branch that creates a real CodeIndex is gated on #[cfg(feature = \"qmd\")]; the default builtin feature always falls to config_only, so all index_project and search calls return BackendUnavailable on every default deployment.
  • Tools not registered on the default build: register_tools in post_state.rs is wrapped in #[cfg(feature = \"qmd\")], so codebase_search, codebase_peek, and codebase_status never appear in the agent tool registry when only builtin is enabled.
  • reindex_files stores absolute paths as relative_path: chunks written by watcher-triggered reindexes will have absolute-path keys, inconsistent with the relative-path keys from initial full indexing.

Confidence Score: 2/5

Not safe to merge — the default build ships a completely non-functional code-index feature due to missing builtin wiring in both init and tool registration.

Two P1 bugs in init_code_index.rs and post_state.rs mean the entire feature is dead on the default build. The third P1 in reindex_files corrupts incremental-reindex state whenever the file-watcher feature is used.

crates/gateway/src/server/init_code_index.rs and crates/gateway/src/server/prepare_core/post_state.rs (missing builtin branches); crates/code-index/src/index.rs (reindex_files absolute-path bug).

Important Files Changed

Filename Overview
crates/gateway/src/server/init_code_index.rs Only initializes a real backend under #[cfg(feature = "qmd")]; the default builtin feature always falls back to config_only, making the builtin SQLite+FTS5 backend dead on all default builds.
crates/gateway/src/server/prepare_core/post_state.rs Tool registration block gated on #[cfg(feature = "qmd")] only — with the default builtin feature the three code-index tools are never added to the agent registry.
crates/code-index/src/index.rs Core orchestrator is well-structured; reindex_files incorrectly sets relative_path to the absolute path provided by the watcher, breaking incremental index consistency.
crates/code-index/src/store_sqlite.rs Well-implemented SQLite+FTS5 backend with idempotent schema creation, atomic upsert via delete+insert transaction, and trigger-based FTS sync.
crates/code-index/src/snapshot_store.rs Atomic write (temp+rename) with robust path-traversal sanitization; well-tested.
crates/code-index/src/tools.rs Tool implementations are correct; CodebaseSearchTool docstring says Requires a QMD backend but it works with builtin too.
crates/code-index/Cargo.toml Feature flags are structurally correct, but tracing and metrics feature gates required by CLAUDE.md are absent.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Gateway startup] --> B[init_code_index]
    B --> C{feature = qmd?}
    C -- yes --> D[QmdManager::is_available?]
    D -- yes --> E[CodeIndex::new QMD backend]
    D -- no --> F[config_only]
    C -- no --> G[log only]
    G --> F
    F --> H[GatewayState.code_index]
    E --> H
    H --> I[post_state: register tools]
    I --> J{feature = qmd?}
    J -- yes --> K[register_tools]
    J -- no --> L[tools NOT registered]
Loading

Reviews (1): Last reviewed commit: "fix(code-index): address clippy, test co..." | Re-trigger Greptile

Comment on lines +22 to +66
pub(crate) async fn init_code_index(
data_dir: &std::path::Path,
) -> Arc<moltis_code_index::CodeIndex> {
let code_index_config = moltis_code_index::CodeIndexConfig {
data_dir: Some(data_dir.join("code-index")),
..Default::default()
};

#[cfg(feature = "qmd")]
{
let qmd_config = moltis_qmd::QmdManagerConfig {
command: "qmd".into(),
collections: std::collections::HashMap::new(),
max_results: 20,
timeout_ms: 30_000,
work_dir: data_dir.to_path_buf(),
index_name: format!("code-{}", super::helpers::sanitize_qmd_index_name(data_dir)),
env_overrides: std::collections::HashMap::new(),
};
let qmd = moltis_qmd::QmdManager::new(qmd_config);

if qmd.is_available().await {
info!(
index = %qmd.index_name(),
"code-index: QMD backend available, initializing in full mode"
);
return Arc::new(moltis_code_index::CodeIndex::new(code_index_config, qmd));
}

warn!(
"code-index: QMD binary not found, falling back to config-only mode \
(search unavailable until QMD is installed)"
);
}

#[cfg(not(feature = "qmd"))]
{
info!(
"code-index: initialized in config-only mode \
(qmd feature disabled — search unavailable)"
);
}

Arc::new(moltis_code_index::CodeIndex::config_only(code_index_config))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Builtin backend never initialized

init_code_index only creates a real backend inside the #[cfg(feature = "qmd")] block. When the builtin feature is enabled (which is the default) but qmd is not, the function falls through to config_only mode. The SqliteCodeIndexStore is never constructed, so every index_project and search call returns BackendUnavailable, making the entire builtin feature effectively inoperative on the default build.

A #[cfg(feature = "builtin")] path is needed to construct SqliteCodeIndexStore and call CodeIndex::new_builtin.

Comment on lines +821 to +825
// ── Code index tools ─────────────────────────────────────────────
#[cfg(feature = "qmd")]
{
moltis_code_index::tools::register_tools(&mut tool_registry, code_index_for_tools);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Code-index tools never registered on default (builtin) build

Tool registration is wrapped in #[cfg(feature = "qmd")], but lib.rs makes tools available under any(feature = "qmd", feature = "builtin"). On the default build (only builtin enabled) the register_tools call is never reached, so codebase_search, codebase_peek, and codebase_status are absent from the agent tool registry. The code_index_for_tools variable on line 337 also only exists inside #[cfg(feature = "qmd")].

Both the variable declaration and the register_tools call need to be widened to #[cfg(any(feature = "qmd", feature = "builtin"))].

Comment on lines +363 to +378
// Convert absolute paths to FilteredFile entries.
let files: Vec<FilteredFile> = paths
.iter()
.filter_map(|p| {
if p.is_file() {
Some(FilteredFile {
path: p.clone(),
relative_path: p.clone(),
size: p.metadata().map(|m| m.len()).unwrap_or(0),
language: crate::types::Language::from_path(p),
})
} else {
None
}
})
.collect();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 reindex_files stores absolute paths as relative_path

The watcher passes absolute PathBufs, and relative_path is set to the same absolute value as path. The comment in index_files_builtin is explicit that relative_path is the repo-relative path used for storage. Storing absolute paths here produces index entries whose file_path key won't match the relative-path keys written by the initial full index, leaving orphaned chunks and breaking incremental search for watcher-triggered updates.

reindex_files needs a project_dir: &Path parameter so it can strip the prefix before constructing FilteredFile.

Comment on lines +888 to +892
};
let index = CodeIndex::new_builtin(config, Box::new(store), None);
// Keep data_dir alive for the test's lifetime.
std::mem::forget(data_dir);
(index, repo)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 std::mem::forget leaks temp directories in tests

std::mem::forget(data_dir) prevents TempDir::drop from running, so the temporary directory is never cleaned up. The same pattern appears in setup_index_with_embedder and setup_index_with_failing_embedder. The correct fix is to return the TempDir from the setup functions so the caller holds it alive for the test's lifetime.

Comment on lines +7 to +11
[features]
default = ["builtin"]
qmd = ["dep:moltis-qmd", "dep:moltis-agents"]
builtin = ["dep:moltis-memory", "dep:moltis-agents", "dep:sqlx", "dep:tokio"]
file-watcher = ["dep:notify-debouncer-full", "dep:tokio", "dep:tokio-util"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing tracing and metrics feature flags

Per CLAUDE.md, all crates must declare tracing and metrics features gated with #[cfg(feature = "...")]. This crate uses tracing unconditionally but defines no tracing or metrics feature flags.

@Cstewart-HC Cstewart-HC changed the title feat(code-index): add code indexing crate with builtin and QMD backends [Superseded] feat(code-index): add code indexing crate with builtin and QMD backends Apr 16, 2026
@Cstewart-HC
Copy link
Copy Markdown
Contributor Author

Cstewart-HC commented Apr 16, 2026

@penso I broke this into 4 pr for easier reviewing, but if you accept them it will have to be in order. I find greptile not so great at reviewing big diffs too.
In any case, this PR brings improved support for large code bases. I took advantage of the existing fts5 and mqd integration to add code indexing as a new crate. It is feature gated, requires no new deps, and my benchmarks show it is much faster than existing search methods. I't should allow for contextual codebase searching as well , thanks to the embeddings . The feature is activated in projects and search is bound to projects as well. So the boundary of each code index is project wide.

Right now it uses 50 line chunking,but I plan to add AST aware chunking strategies as well.

This PR opens the door to supporting document indexing and retrieval generally in a future release.

@Cstewart-HC
Copy link
Copy Markdown
Contributor Author

Superseded by #756 which contains the complete code-index feature with all review fixes applied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant