feat: add neighborhood-based graph traversal for retrievers #2328
Vasilije1990 wants to merge 1 commit into dev
Add configurable k-hop neighborhood extraction to graph retrievers. When neighborhood_depth is set, the retriever extracts a subgraph around vector-search seed nodes instead of projecting the full graph.

Changes:
- Add get_neighborhood() abstract method to GraphDBInterface
- Implement get_neighborhood() in Kuzu, Neo4j, and Neptune adapters
- Add project_neighborhood_from_db() to CogneeGraph with shared _process_nodes_and_edges() helper to avoid code duplication
- Wire neighborhood_depth parameter through brute_force_triplet_search, GraphCompletionRetriever, GraphCompletionContextExtensionRetriever, search factory, and search API layers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: vasilije <vas.markovic@gmail.com>
Walkthrough
This PR adds neighborhood-based graph querying across the stack.
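To make the idea of a k-hop neighborhood concrete, here is a minimal in-memory sketch. It is not the PR's adapter code: the function name `k_hop_neighborhood` and the tuple layout are assumptions chosen for illustration, and it uses a plain breadth-first search rather than a Cypher path pattern. It treats edges as undirected when expanding, then keeps only the edges whose endpoints both fall inside the collected node set.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source_id, target_id, relationship_name), hypothetical layout


def k_hop_neighborhood(
    edges: Iterable[Edge], seed_ids: Iterable[str], depth: int
) -> Tuple[Set[str], List[Edge]]:
    """Return nodes within `depth` undirected hops of any seed, plus the edges among them."""
    if depth < 1:
        raise ValueError("depth must be >= 1")
    edge_list = list(edges)

    # Build an undirected adjacency map so expansion ignores edge direction.
    adjacency: Dict[str, Set[str]] = {}
    for source, target, _ in edge_list:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)

    # Multi-source BFS: every seed starts at hop 0.
    visited: Set[str] = set(seed_ids)
    frontier = deque((seed, 0) for seed in visited)
    while frontier:
        node, hops = frontier.popleft()
        if hops == depth:
            continue  # do not expand past the requested depth
        for neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, hops + 1))

    # Keep only edges with both endpoints inside the neighborhood.
    induced = [e for e in edge_list if e[0] in visited and e[1] in visited]
    return visited, induced


chain = [("a", "b", "knows"), ("b", "c", "knows"), ("c", "d", "knows")]
nodes_1, edges_1 = k_hop_neighborhood(chain, ["a"], depth=1)
nodes_2, edges_2 = k_hop_neighborhood(chain, ["a"], depth=2)
```

On the four-node chain, a depth of 1 around `a` yields `a` and `b`, and a depth of 2 also pulls in `c`, while `d` stays outside in both cases.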
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
cognee/modules/graph/cognee_graph/CogneeGraph.py (1)
257-259: Preserve the traceback in these projection logs.
Both `except Exception` blocks currently reduce the failure to `str(e)`, which drops the stack trace right where adapter/query diagnostics matter most.
As per coding guidelines, "Prefer explicit, structured error handling in Python code".
🪵 Suggested fix
-        except Exception as e:
-            logger.error(f"Error during graph projection: {str(e)}")
+        except Exception:
+            logger.error("Error during graph projection", exc_info=True)
             raise
@@
-        except Exception as e:
-            logger.error(f"Error during neighborhood projection: {str(e)}")
+        except Exception:
+            logger.error("Error during neighborhood projection", exc_info=True)
             raise
Also applies to: 304-306
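The difference between the two logging styles can be seen in a small standalone sketch. The logger name `projection_demo` and the function `failing_projection` are hypothetical; only the standard-library `logging` behavior is being demonstrated, not the project's code.

```python
import io
import logging

# Capture log output in memory so both styles can be compared.
stream = io.StringIO()
logger = logging.getLogger("projection_demo")  # hypothetical logger name
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def failing_projection() -> None:
    """Stand-in for an adapter call that fails."""
    raise RuntimeError("adapter query failed")


def project(message_only: bool) -> None:
    try:
        failing_projection()
    except Exception as e:
        if message_only:
            # Only the exception message reaches the log.
            logger.error(f"Error during graph projection: {str(e)}")
        else:
            # exc_info=True appends the full traceback to the record.
            logger.error("Error during graph projection", exc_info=True)
        raise


for message_only in (True, False):
    try:
        project(message_only)
    except RuntimeError:
        pass

log_output = stream.getvalue()
```

Only the second call contributes a `Traceback (most recent call last)` section to `log_output`, including the frame of the function that actually raised.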
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/modules/graph/cognee_graph/CogneeGraph.py` around lines 257 - 259, The except blocks in CogneeGraph.py are logging only str(e), which omits the traceback; replace those logger.error(...) calls inside the graph projection error handlers with logger.exception("Error during graph projection") or logger.error("Error during graph projection", exc_info=True) so the stack trace is preserved in logs, and apply the same change to the other similar except block (around lines 304-306) that currently logs str(e); keep the existing bare "raise" to re-raise the original exception.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cognee/api/v1/search/search.py`:
- Around line 44-45: The new public parameter neighborhood_depth is forwarded
unchanged to the adapter get_neighborhood(), allowing 0, negative or non-int
values to create invalid path patterns; validate neighborhood_depth early in the
containing function (the public search handler that returns List[SearchResult])
by checking it is an integer > 0 (and within any configured max if applicable),
and if not raise/return a clear API error (e.g., BadRequest/ValueError) before
calling get_neighborhood(); update callers that pass neighborhood_depth through
(the code around where neighborhood_depth is forwarded) to rely on this
validated value.
In `@cognee/infrastructure/databases/graph/neptune_driver/adapter.py`:
- Around line 690-725: get_neighborhood() mixes external node IDs (~id) and
internal Neptune ids (id(n)), causing mismatches; ensure the same ID domain is
used throughout by returning and filtering on the external id property (`~id`).
Update the path_query to RETURN neighbor.`~id` (collect into neighbor_ids),
build all_ids as union of node_ids and those neighbor `~id`s, change nodes_query
to WHERE n.`~id` IN $ids and RETURN n.`~id` AS node_id, and change edges_query
to WHERE source.`~id` IN $ids AND target.`~id` IN $ids and RETURN source.`~id`
AS source_id, target.`~id` AS target_id (keep function name get_neighborhood and
variables path_query, nodes_query, edges_query, all_ids, neighbor_ids, node_ids
to locate changes).
In `@cognee/modules/graph/cognee_graph/CogneeGraph.py`:
- Around line 280-291: project_neighborhood_from_db currently forwards invalid
inputs (depth <= 0 or empty seed_node_ids) to the adapter and treats any empty
edges_data as an error even when nodes_data contains only the requested seeds;
validate inputs early and relax the empty-edge check: in
project_neighborhood_from_db, before calling adapter.get_neighborhood validate
and raise a clear input error if depth < 1 or seed_node_ids is empty (use
InvalidDimensionsError or a new InvalidInputError), then call
adapter.get_neighborhood; after the call, only raise EntityNotFoundError if
nodes_data is empty (no nodes returned); allow edges_data to be empty when
nodes_data contains the requested seed_node_ids (i.e., accept seed-only
neighborhoods) and only treat missing edges as an error when your logic expects
at least one edge type to be present.
In `@cognee/modules/retrieval/utils/brute_force_triplet_search.py`:
- Around line 55-56: The neighborhood_depth flag is being ignored when
relevant_ids_to_filter is falsy because the code calls project_graph_from_db()
inside the neighborhood branch; change the logic in brute_force_triplet_search
(around the neighborhood_depth check) so that if neighborhood_depth is set and
relevant_ids_to_filter is empty you either (A) fail fast by raising a ValueError
indicating seed IDs are required for neighborhood mode, or (B) compute/derive
seed IDs before entering neighborhood mode (e.g., call the existing
seed-derivation helper or add a new get_seed_ids function) and then proceed to
call project_graph_from_db() only with those seed IDs; ensure references to
neighborhood_seed_top_k and relevant_ids_to_filter are used to derive seeds if
you choose option B and do not fall back to full-graph projection silently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b77f4569-41b9-4b38-a164-9607d8b0a297
📒 Files selected for processing (11)
cognee/api/v1/search/search.py
cognee/infrastructure/databases/graph/graph_db_interface.py
cognee/infrastructure/databases/graph/kuzu/adapter.py
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py
cognee/infrastructure/databases/graph/neptune_driver/adapter.py
cognee/modules/graph/cognee_graph/CogneeGraph.py
cognee/modules/retrieval/graph_completion_context_extension_retriever.py
cognee/modules/retrieval/graph_completion_retriever.py
cognee/modules/retrieval/utils/brute_force_triplet_search.py
cognee/modules/search/methods/get_search_type_retriever_instance.py
cognee/modules/search/methods/search.py
    neighborhood_depth: Optional[int] = None,
) -> List[SearchResult]:
Validate neighborhood_depth before forwarding it.
Line 233 passes the new public parameter through unchanged. 0, negative values, or non-ints will currently reach the adapter get_neighborhood() queries and build invalid [*1..N] path patterns instead of returning a clear API error.
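The boundary cases named above (0, negatives, non-integers) can be exercised in isolation with a small sketch. The helper name `validate_neighborhood_depth` is hypothetical, and it raises the built-in `ValueError` as a stand-in for whatever API error type the project prefers. One extra assumption beyond the comment: `bool` is rejected explicitly, because `True` is an instance of `int` in Python and would otherwise pass as a depth of 1.

```python
def validate_neighborhood_depth(neighborhood_depth: object) -> object:
    """Hypothetical helper: return the depth unchanged if valid, else raise ValueError."""
    if neighborhood_depth is None:
        return None  # neighborhood mode is off; nothing to validate
    # bool is a subclass of int, so it needs an explicit rejection.
    if isinstance(neighborhood_depth, bool) or not isinstance(neighborhood_depth, int):
        raise ValueError("neighborhood_depth must be a positive integer.")
    if neighborhood_depth < 1:
        raise ValueError("neighborhood_depth must be a positive integer.")
    return neighborhood_depth


def is_rejected(value: object) -> bool:
    """Report whether the validator refuses the given value."""
    try:
        validate_neighborhood_depth(value)
    except ValueError:
        return True
    return False


rejected = [is_rejected(v) for v in (0, -1, 1.5, "2", True)]
accepted = [validate_neighborhood_depth(v) for v in (None, 1, 2)]
```

Running the validator at the public entry point means the adapters never see a depth that would produce a malformed `[*1..N]` pattern.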
🛡️ Suggested guard
 async def search(
     query_text: str,
@@
     retriever_specific_config: Optional[dict] = None,
     neighborhood_depth: Optional[int] = None,
 ) -> List[SearchResult]:
+    if neighborhood_depth is not None and (
+        not isinstance(neighborhood_depth, int) or neighborhood_depth < 1
+    ):
+        raise CogneeValidationError(
+            message="neighborhood_depth must be a positive integer.",
+            name="InvalidNeighborhoodDepth",
+        )
+
     """
Also applies to: 217-233
        if edge_types:
            allowed = "|".join(edge_types)
            path_query = f"""
            MATCH (seed:{self._GRAPH_NODE_LABEL})-[:{allowed}*1..{depth}]-(neighbor:{self._GRAPH_NODE_LABEL})
            WHERE seed.`~id` IN $node_ids
            RETURN DISTINCT id(neighbor) AS nid
            """
        else:
            path_query = f"""
            MATCH (seed:{self._GRAPH_NODE_LABEL})-[*1..{depth}]-(neighbor:{self._GRAPH_NODE_LABEL})
            WHERE seed.`~id` IN $node_ids
            RETURN DISTINCT id(neighbor) AS nid
            """

        result = await self.query(path_query, {"node_ids": node_ids})
        neighbor_ids = [record["nid"] for record in result if record.get("nid")]

        all_ids = list(set(node_ids) | set(neighbor_ids))

        # Step 2: Fetch all nodes
        nodes_query = f"""
        MATCH (n:{self._GRAPH_NODE_LABEL})
        WHERE id(n) IN $ids
        RETURN id(n) AS node_id, properties(n) AS properties
        """
        nodes_result = await self.query(nodes_query, {"ids": all_ids})
        nodes = [(r["node_id"], r["properties"]) for r in nodes_result]

        # Step 3: Fetch all edges between collected nodes
        edges_query = f"""
        MATCH (source:{self._GRAPH_NODE_LABEL})-[r]->(target:{self._GRAPH_NODE_LABEL})
        WHERE id(source) IN $ids AND id(target) IN $ids
        RETURN id(source) AS source_id, id(target) AS target_id,
               type(r) AS relationship_name, properties(r) AS properties
        """
        edges_result = await self.query(edges_query, {"ids": all_ids})
Keep get_neighborhood() on a single ID domain.
Line 694 matches seed nodes by ~id, but Lines 712 and 721 switch to id(n) / id(source). That makes all_ids a mix of external IDs and Neptune internal IDs, so seed nodes and their incident edges can disappear from the returned neighborhood.
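The failure mode can be reproduced with a toy model, under the assumption made in this comment that the external identifier and the value returned by `id()` differ. Everything here (the dictionaries, `collect_mixed`, `collect_consistent`) is hypothetical scaffolding, not Neptune behavior; it only shows what happens when a set mixes two ID domains and is then filtered against one of them.

```python
from typing import Dict, List, Set

# Hypothetical toy data: callers pass external IDs; the store also has internal IDs.
EXTERNAL_TO_INTERNAL: Dict[str, int] = {"seed": 1, "neighbor": 2}
NEIGHBORS: Dict[str, List[str]] = {"seed": ["neighbor"]}  # adjacency keyed by external ID


def collect_mixed(seed_ids: List[str]) -> Set[str]:
    """Match seeds by external ID, return neighbors by internal ID, filter by internal ID."""
    neighbor_internal = {
        EXTERNAL_TO_INTERNAL[n] for s in seed_ids for n in NEIGHBORS.get(s, [])
    }
    all_ids: Set[object] = set(seed_ids) | neighbor_internal  # two ID domains in one set
    # Filtering on the internal ID can never match the external seed strings.
    return {ext for ext, internal in EXTERNAL_TO_INTERNAL.items() if internal in all_ids}


def collect_consistent(seed_ids: List[str]) -> Set[str]:
    """Stay on external IDs for matching, collecting, and filtering."""
    neighbor_external = {n for s in seed_ids for n in NEIGHBORS.get(s, [])}
    all_ids = set(seed_ids) | neighbor_external
    return {ext for ext in EXTERNAL_TO_INTERNAL if ext in all_ids}


mixed = collect_mixed(["seed"])
consistent = collect_consistent(["seed"])
```

In the mixed version the seed silently drops out of the result, which is the same shape of problem as seed nodes and their incident edges vanishing from the returned neighborhood.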
🔧 One consistent way to fix it
         if edge_types:
             allowed = "|".join(edge_types)
             path_query = f"""
             MATCH (seed:{self._GRAPH_NODE_LABEL})-[:{allowed}*1..{depth}]-(neighbor:{self._GRAPH_NODE_LABEL})
             WHERE seed.`~id` IN $node_ids
-            RETURN DISTINCT id(neighbor) AS nid
+            RETURN DISTINCT neighbor.`~id` AS nid
             """
         else:
             path_query = f"""
             MATCH (seed:{self._GRAPH_NODE_LABEL})-[*1..{depth}]-(neighbor:{self._GRAPH_NODE_LABEL})
             WHERE seed.`~id` IN $node_ids
-            RETURN DISTINCT id(neighbor) AS nid
+            RETURN DISTINCT neighbor.`~id` AS nid
             """
@@
         nodes_query = f"""
         MATCH (n:{self._GRAPH_NODE_LABEL})
-        WHERE id(n) IN $ids
-        RETURN id(n) AS node_id, properties(n) AS properties
+        WHERE n.`~id` IN $ids
+        RETURN n.`~id` AS node_id, properties(n) AS properties
         """
@@
         edges_query = f"""
         MATCH (source:{self._GRAPH_NODE_LABEL})-[r]->(target:{self._GRAPH_NODE_LABEL})
-        WHERE id(source) IN $ids AND id(target) IN $ids
-        RETURN id(source) AS source_id, id(target) AS target_id,
+        WHERE source.`~id` IN $ids AND target.`~id` IN $ids
+        RETURN source.`~id` AS source_id, target.`~id` AS target_id,
                type(r) AS relationship_name, properties(r) AS properties
         """
📝 Committable suggestion
        if edge_types:
            allowed = "|".join(edge_types)
            path_query = f"""
            MATCH (seed:{self._GRAPH_NODE_LABEL})-[:{allowed}*1..{depth}]-(neighbor:{self._GRAPH_NODE_LABEL})
            WHERE seed.`~id` IN $node_ids
            RETURN DISTINCT neighbor.`~id` AS nid
            """
        else:
            path_query = f"""
            MATCH (seed:{self._GRAPH_NODE_LABEL})-[*1..{depth}]-(neighbor:{self._GRAPH_NODE_LABEL})
            WHERE seed.`~id` IN $node_ids
            RETURN DISTINCT neighbor.`~id` AS nid
            """
        result = await self.query(path_query, {"node_ids": node_ids})
        neighbor_ids = [record["nid"] for record in result if record.get("nid")]
        all_ids = list(set(node_ids) | set(neighbor_ids))
        # Step 2: Fetch all nodes
        nodes_query = f"""
        MATCH (n:{self._GRAPH_NODE_LABEL})
        WHERE n.`~id` IN $ids
        RETURN n.`~id` AS node_id, properties(n) AS properties
        """
        nodes_result = await self.query(nodes_query, {"ids": all_ids})
        nodes = [(r["node_id"], r["properties"]) for r in nodes_result]
        # Step 3: Fetch all edges between collected nodes
        edges_query = f"""
        MATCH (source:{self._GRAPH_NODE_LABEL})-[r]->(target:{self._GRAPH_NODE_LABEL})
        WHERE source.`~id` IN $ids AND target.`~id` IN $ids
        RETURN source.`~id` AS source_id, target.`~id` AS target_id,
               type(r) AS relationship_name, properties(r) AS properties
        """
        edges_result = await self.query(edges_query, {"ids": all_ids})
        if node_dimension < 1 or edge_dimension < 1:
            raise InvalidDimensionsError()
        try:
            logger.info(f"Retrieving {depth}-hop neighborhood for {len(seed_node_ids)} seed nodes.")
            nodes_data, edges_data = await adapter.get_neighborhood(
                node_ids=seed_node_ids,
                depth=depth,
                edge_types=edge_types,
            )

            if not nodes_data or not edges_data:
                raise EntityNotFoundError(message="Empty neighborhood projected from the database.")
Validate neighborhood inputs and allow seed-only results.
project_neighborhood_from_db() currently forwards depth <= 0 and empty seed_node_ids straight to the adapter, and Line 290 also raises when the neighborhood contains seed nodes but no edges. That makes malformed requests and sparse-but-valid neighborhoods fail deep in the backend instead of producing a clear boundary behavior.
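The intended boundary behavior can be written down as a small standalone check. This is a sketch only: `check_neighborhood` and `EmptyNeighborhoodError` are hypothetical names (the latter standing in for the project's `EntityNotFoundError`), and the function collapses the pre-call input validation and the post-call result check into one place for brevity.

```python
from typing import List, Optional, Sequence, Tuple

NodeRow = Tuple[str, dict]            # (node_id, properties), assumed layout
EdgeRow = Tuple[str, str, str, dict]  # (source_id, target_id, relationship_name, properties)


class EmptyNeighborhoodError(LookupError):
    """Hypothetical stand-in for the project's EntityNotFoundError."""


def check_neighborhood(
    seed_node_ids: Sequence[str],
    depth: int,
    nodes_data: Optional[Sequence[NodeRow]],
    edges_data: Optional[Sequence[EdgeRow]],
) -> Tuple[List[NodeRow], List[EdgeRow]]:
    # Malformed requests fail early with a clear message.
    if depth < 1:
        raise ValueError("depth must be >= 1")
    if not seed_node_ids:
        raise ValueError("seed_node_ids must not be empty")
    # Only a result with no nodes at all counts as "not found".
    if not nodes_data:
        raise EmptyNeighborhoodError("Empty neighborhood projected from the database.")
    # A seed with no incident edges is a valid, sparse neighborhood.
    return list(nodes_data), list(edges_data or [])


seed_only = check_neighborhood(["a"], 1, [("a", {"name": "A"})], [])
```

With this split, an isolated seed node produces a one-node, zero-edge neighborhood instead of an error.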
💡 One way to harden this path
         if node_dimension < 1 or edge_dimension < 1:
             raise InvalidDimensionsError()
+        if depth < 1:
+            raise ValueError("depth must be >= 1")
+        if not seed_node_ids:
+            raise ValueError("seed_node_ids must not be empty")
         try:
             logger.info(f"Retrieving {depth}-hop neighborhood for {len(seed_node_ids)} seed nodes.")
             nodes_data, edges_data = await adapter.get_neighborhood(
                 node_ids=seed_node_ids,
                 depth=depth,
                 edge_types=edge_types,
             )
-            if not nodes_data or not edges_data:
+            if not nodes_data:
                 raise EntityNotFoundError(message="Empty neighborhood projected from the database.")
+            edges_data = edges_data or []
    neighborhood_depth: Optional[int] = None,
    neighborhood_seed_top_k: Optional[int] = 10,
Don't silently fall back to full-graph projection in neighborhood mode.
With neighborhood_depth set, Lines 68-88 still call project_graph_from_db() whenever relevant_ids_to_filter is empty/falsy. That makes the new flag a silent no-op and can turn a bounded neighborhood request back into a full-graph projection. Please either fail fast here or derive seed IDs before entering neighborhood mode.
Also applies to: 68-88
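The fail-fast option could look roughly like the following. This is a sketch under stated assumptions: `choose_projection` is a hypothetical helper, the string labels are illustrative, and it only models the branching decision, not the actual calls into `CogneeGraph`.

```python
from typing import Optional, Sequence, Tuple


def choose_projection(
    neighborhood_depth: Optional[int],
    relevant_ids_to_filter: Optional[Sequence[str]],
) -> Tuple[str, Optional[int]]:
    """Decide which projection to run, refusing to downgrade silently."""
    if neighborhood_depth is None:
        # Neighborhood mode was never requested; full-graph projection is expected.
        return ("full_graph", None)
    if not relevant_ids_to_filter:
        # Neighborhood mode was requested but there is nothing to expand from.
        raise ValueError("neighborhood_depth is set but no seed node IDs are available.")
    return ("neighborhood", neighborhood_depth)


default_mode = choose_projection(None, None)
bounded_mode = choose_projection(2, ["node-1", "node-2"])
```

The alternative named in the comment, deriving seed IDs before entering neighborhood mode, would replace the `raise` with a seed lookup; either way the caller never receives a full-graph projection it did not ask for.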
Summary
- Add `get_neighborhood(node_ids, depth, edge_types)` to `GraphDBInterface` and implement in Kuzu, Neo4j, and Neptune adapters using variable-length Cypher path patterns (`[*1..N]`)
- Add `project_neighborhood_from_db()` to `CogneeGraph` with extracted `_process_nodes_and_edges()` helper to eliminate duplication
- Add `neighborhood_depth` hyperparameter to `brute_force_triplet_search`, `GraphCompletionRetriever`, and `GraphCompletionContextExtensionRetriever`
- Wire `neighborhood_depth` end-to-end through the search API (`search()` → `authorized_search()` → `search_in_datasets_context()` → retriever factory)

When `neighborhood_depth` is set, the retriever extracts a k-hop subgraph around the top vector-search seed nodes instead of projecting the full graph. This gives more focused, structurally relevant context for graph-based completions.

Usage:
Test plan
- `neighborhood_depth` is not set (default `None`)
- `neighborhood_depth=1` and `neighborhood_depth=2` on a populated knowledge graph
- `get_neighborhood()` returns correct nodes/edges format
- `get_neighborhood()` returns correct nodes/edges format

🤖 Generated with Claude Code