fix: search both edge directions during deduplication#1303
fix: search both edge directions during deduplication#1303prasmussen15 wants to merge 3 commits intomainfrom
Conversation
The edge deduplication flow only searched for existing edges in the forward direction (source→target). Edges with the same endpoints but reversed direction were missed, leading to duplicate edges. Now searches both directions and merges results before sending candidates to the LLM. Bumps version to 0.29.0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| forward_edges_list: list[list[EntityEdge]] = await semaphore_gather( | ||
| *[ | ||
| EntityEdge.get_between_nodes(driver, edge.source_node_uuid, edge.target_node_uuid) | ||
| for edge in extracted_edges | ||
| ] | ||
| ) | ||
| inverse_edges_list: list[list[EntityEdge]] = await semaphore_gather( | ||
| *[ | ||
| EntityEdge.get_between_nodes(driver, edge.target_node_uuid, edge.source_node_uuid) | ||
| for edge in extracted_edges | ||
| ] | ||
| ) |
There was a problem hiding this comment.
The two semaphore_gather calls are sequential, but forward_edges_list and inverse_edges_list don't depend on each other. Consider combining them into a single parallel call to reduce latency:
| forward_edges_list: list[list[EntityEdge]] = await semaphore_gather( | |
| *[ | |
| EntityEdge.get_between_nodes(driver, edge.source_node_uuid, edge.target_node_uuid) | |
| for edge in extracted_edges | |
| ] | |
| ) | |
| inverse_edges_list: list[list[EntityEdge]] = await semaphore_gather( | |
| *[ | |
| EntityEdge.get_between_nodes(driver, edge.target_node_uuid, edge.source_node_uuid) | |
| for edge in extracted_edges | |
| ] | |
| ) | |
| all_edge_queries = [ | |
| EntityEdge.get_between_nodes(driver, edge.source_node_uuid, edge.target_node_uuid) | |
| for edge in extracted_edges | |
| ] + [ | |
| EntityEdge.get_between_nodes(driver, edge.target_node_uuid, edge.source_node_uuid) | |
| for edge in extracted_edges | |
| ] | |
| all_results: list[list[EntityEdge]] = await semaphore_gather(*all_edge_queries) | |
| n = len(extracted_edges) | |
| forward_edges_list = all_results[:n] | |
| inverse_edges_list = all_results[n:] |
…el call Addresses PR review feedback to reduce latency by running both direction queries in a single semaphore_gather call instead of two sequential ones. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| all_edge_queries = [ | ||
| EntityEdge.get_between_nodes(driver, edge.source_node_uuid, edge.target_node_uuid) | ||
| for edge in extracted_edges | ||
| ] + [ | ||
| EntityEdge.get_between_nodes(driver, edge.target_node_uuid, edge.source_node_uuid) | ||
| for edge in extracted_edges | ||
| ] | ||
| all_results: list[list[EntityEdge]] = await semaphore_gather(*all_edge_queries) | ||
| n = len(extracted_edges) | ||
| forward_edges_list = all_results[:n] | ||
| inverse_edges_list = all_results[n:] | ||
|
|
||
| valid_edges_list: list[list[EntityEdge]] = [] | ||
| for forward_edges, inverse_edges in zip(forward_edges_list, inverse_edges_list, strict=True): | ||
| seen_uuids: set[str] = set() | ||
| combined: list[EntityEdge] = [] | ||
| for edge in [*forward_edges, *inverse_edges]: | ||
| if edge.uuid not in seen_uuids: | ||
| seen_uuids.add(edge.uuid) | ||
| combined.append(edge) | ||
| valid_edges_list.append(combined) |
There was a problem hiding this comment.
Performance concern: This change doubles the number of database queries from n to 2n for edge deduplication. Consider modifying get_between_nodes to use a bidirectional match in the Cypher query instead:
MATCH (n:Entity {uuid: $uuid1})-[e:RELATES_TO]-(m:Entity {uuid: $uuid2})Using - instead of -> would match edges in both directions with a single query, avoiding the performance regression.
If modifying get_between_nodes is not desired (to preserve its current directional semantics), consider adding a new get_between_nodes_bidirectional method instead.
|
|
||
| valid_edges_list: list[list[EntityEdge]] = [] | ||
| for forward_edges, inverse_edges in zip(forward_edges_list, inverse_edges_list, strict=True): | ||
| seen_uuids: set[str] = set() |
There was a problem hiding this comment.
The variable seen_uuids shadows the seen dict used earlier in the function (line 256). While not a bug due to scope, consider using a more descriptive name like combined_edge_uuids to avoid confusion during code review.
Review FeedbackMissing test coverage: This PR adds important bidirectional edge lookup behavior but doesn't include unit tests to validate it. Consider adding tests that verify:
This would help ensure the fix works as intended and prevent future regressions. |
Adds a new EntityEdge.get_between_nodes_bidirectional method that uses undirected Cypher matching (-) instead of directed (->), finding edges in both directions with a single query. This preserves the existing directional semantics of get_between_nodes for other callers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| @classmethod | ||
| async def get_between_nodes_bidirectional( | ||
| cls, driver: GraphDriver, node_uuid_a: str, node_uuid_b: str | ||
| ): |
There was a problem hiding this comment.
Missing graph_operations_interface check: Unlike the existing get_between_nodes method (lines 405-411), this new method doesn't check for a custom graph_operations_interface implementation before executing the default query. This inconsistency means:
- Users with custom
graph_operations_interfaceimplementations will have their custom logic bypassed during edge deduplication - There's no
edge_get_between_nodes_bidirectionalmethod defined inGraphOperationsInterfaceto call even if a check were added
Consider either:
- Adding the interface check with a new
edge_get_between_nodes_bidirectionalmethod toGraphOperationsInterface, or - Documenting that this method intentionally bypasses the interface (if that's the intended behavior)
Summary
source→target) and inverse (target→source) directions when finding candidate duplicates viaget_between_nodesTest plan
🤖 Generated with Claude Code