Conversation
Major changes:
- Add fetch-sql endpoint for NL to SQL generation
- Add unified GopieClient for API requests
- Add new SQL agent workflow with sql_planner_graph and nl_to_sql_graph
- Refactor single/multi dataset graphs with dedicated sql_agent nodes
- Add DuckDB docs search functionality
- Update prompts and validation logic
- Add new tools: respond_to_user, return_result_paths
- Update dependencies: langgraph>=0.6.0, dspy==3.1.0, beautifulsoup4
- Various bug fixes and improvements
- Updated `_build_ilike_query` and `_build_levenshtein_query` to ensure proper casting of columns to VARCHAR for SQL queries.
- Changed `data` type in `VisualizationResult` from `list[bytes]` to `list[str]` for better compatibility.
- Refactored `update_sandbox_timeout` to return a tuple indicating whether a new sandbox was created.
- Adjusted `pre_model_hook` to utilize the updated sandbox handling logic.
- Enhanced prompt guidelines to exclude non-string columns from fuzzy value generation.
- Updated test cases to validate new query handling and data type changes.
Claude finished @paul-tharun's task

PR Review: Feat/e2e datasets metadata

I've conducted a comprehensive review of this large PR with 100 files changed. Here are my findings:
| Category | Rating | Comments |
|---|---|---|
| Architecture | ⭐⭐⭐⭐⚪ | Well-designed but complex |
| Security | ⭐⭐⭐⚪⚪ | Good practices with some gaps |
| Performance | ⭐⭐⭐⭐⚪ | Smart optimizations implemented |
| Testing | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking |
| Code Quality | ⭐⭐⭐⭐⚪ | Clean code with room for improvement |
Overall: This is a substantial improvement to the system with well-thought-out architectural changes. The main concerns are security gaps, insufficient testing of new features, and some complexity that could be better documented. I recommend addressing the security issues before merging.
📝 Walkthrough

Adds org-scoped request handling and a GopieClient; implements NL→SQL and SQL planner graphs with semantic search and SQL agents; introduces OLAP query builders (DuckDB/ClickHouse), sampling helpers, hybrid dense+sparse Qdrant support (SPLADE), refactors single/multi-dataset workflows, and expands tests, viz tooling, and indexing/migration scripts.
Sequence Diagram(s)

sequenceDiagram
participant User
participant API as Fetch SQL API
participant NLToSQL as NL-to-SQL Graph
participant Supervisor
participant Semantic as SemanticSearch
participant Qdrant
participant SQLAgent
participant Planner as SQL Planner Agent
User->>API: POST /fetch-sql (user_query, dataset_ids, project_ids)
API->>NLToSQL: invoke(payload)
NLToSQL->>Supervisor: decide route
alt single dataset
Supervisor->>SQLAgent: goto sql_agent
SQLAgent->>Planner: invoke sql_planning_agent
Planner->>SQLAgent: sql_queries
else multi-dataset
Supervisor->>Semantic: goto semantic_search
Semantic->>Qdrant: search_schemas(user_query,...)
Qdrant->>Semantic: schema results
Semantic->>NLToSQL: semantic results
NLToSQL->>SQLAgent: provide semantic results
SQLAgent->>Planner: invoke sql_planning_agent
Planner->>SQLAgent: sql_queries
end
SQLAgent->>API: return sql_queries + optional message
API->>User: FetchSqlResponse
sequenceDiagram
participant State
participant Prepare as PrepareData
participant SQLAgent
participant Execute as ExecuteSQL
participant Validate
State->>Prepare: dataset_info, user_query
Prepare->>Prepare: fetch schema, sample data, column assumptions
Prepare->>SQLAgent: dataset_info
SQLAgent->>SQLAgent: call sql_planning_agent -> generate sql_queries or non_sql_response
alt sql_queries present
SQLAgent->>Execute: execute_sql (sql_queries)
Execute->>Validate: results
else non_sql_response
SQLAgent->>Validate: non_sql_response
end
Validate->>State: updated query_result + recommendation
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 20
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (15)
chat-server/tests/performance_tools/performance_tracker.py (1)
40-82: Unused `test_cases` parameter detected.

The `test_cases` parameter is accepted by `save_run` but is never used in the method body. The `run_data` dictionary includes `results_file` but omits `test_cases`. If test cases should be persisted with the run data, add them to `run_data`. Otherwise, consider removing the unused parameter.

🔧 Option A: Include test_cases in run_data
 run_data = {
     "run_id": run_id,
     "timestamp": timestamp,
     "model_config": model_config,
     "notes": notes,
     "summary": {
         "total_tests": total,
         "scores": scores,
         "avg_score": round(score_stats.get("average", 0), 2),
         "median_score": round(score_stats.get("median", 0), 2),
         "min_score": round(score_stats.get("min", 0), 2),
         "max_score": round(score_stats.get("max", 0), 2),
         "errors": errors,
         "avg_request_time": round(timing_stats.get("avg_request_time", 0), 2),
         "median_request_time": round(timing_stats.get("median_request_time", 0), 2),
         "min_request_time": round(timing_stats.get("min_request_time", 0), 2),
         "max_request_time": round(timing_stats.get("max_request_time", 0), 2),
         "total_time": round(timing_stats.get("total_time", 0), 2),
     },
     "results_file": results_file,
+    "test_cases": test_cases,
 }

🔧 Option B: Remove the unused parameter
 def save_run(
     self,
     model_config: dict[str, str],
     summary: dict[str, Any],
     notes: str = "",
-    test_cases: list[dict[str, Any]] | None = None,
     results_file: str = "",
 ) -> str:

chat-server/app/workflow/agent/node/multi_dataset.py (1)
80-87: Fix incorrect state field reference: use `previous_sql_queries`, not `relevant_sql_queries`.

Line 86 reads from `state.get("relevant_sql_queries", [])`, but `AgentState` only defines `previous_sql_queries` (line 26 of `agent/types.py`). This causes SQL history to always be empty, breaking multi-dataset reasoning. Change to `state.get("previous_sql_queries", [])`.

The same bug exists in `single_dataset.py:74`.

chat-server/tests/scripts/replicate_prod_to_local.py (1)
576-596: Update the example JSON file path to use an actual file or wildcard pattern.

The example path `tests/chat_server_tests/output/golden_dataset.json` does not exist. Actual golden dataset files generated by the test tools use timestamp suffixes (e.g., `golden_dataset_20250806_010759.json`). Either update the example to reference an actual generated file, use a wildcard pattern like `golden_dataset_*.json`, or clarify that this is a placeholder path. The `output/` directory structure also doesn't currently exist in the repository.

chat-server/app/workflow/agent/node/single_dataset.py (1)
70-75: Fix state key mismatch: use `previous_sql_queries` instead of `relevant_sql_queries`.

Line 74 attempts to retrieve from `state.get("relevant_sql_queries", [])`, but the AgentState defines the key as `previous_sql_queries` (set by context_processor). The `relevant_sql_queries` key exists only in the visualize_data_graph InputState. Using the wrong key means the agent context is lost unless the default empty list masks the issue.

-    "prev_sql_queries": state.get("relevant_sql_queries", []),
+    "prev_sql_queries": state.get("previous_sql_queries", []),

Note: The same issue exists in `chat-server/app/workflow/agent/node/multi_dataset.py:86`.

chat-server/pyproject.toml (1)
10-30: Move `pre-commit` to the dev dependency group.

Pre-commit is a development tool and should be in the `[dependency-groups] dev` section (which currently contains only test tools), not in the main dependencies. The langgraph 1.0.6 APIs (ToolNode, Command, StateGraph, add_messages) are actively used throughout the codebase and appear compatible with the version bump.

Diff to move pre-commit to dev
 dependencies = [
     "aioboto3>=14.3.0",
     "beautifulsoup4>=4.14.2",
     "altair>=5.5.0",
     "dspy==3.1.0",
     "e2b-code-interpreter>=1.5.1",
     "fastapi[standard]>=0.115.12",
     "langchain>=0.3.21",
     "langchain-community>=0.3.20",
     "langchain-openai>=0.3.9",
     "langchain-qdrant>=0.2.0",
     "langgraph>=1.0.6",
     "langgraph-cli[inmem]>=0.3.3",
     "langsmith>=0.3.42",
     "logging>=0.4.9.6",
     "model2vec>=0.7.0",
     "pillow>=12.0.0",
     "portkey-ai>=1.11.1",
-    "pre-commit>=4.2.0",
     "pydantic>=2.10.6",
     "qdrant-client>=1.13.3",
     "vega-datasets>=0.9.0",
     "vl-convert-python>=1.8.0",
 ]

 [dependency-groups]
 dev = [
     "pytest>=8.4.1",
     "pytest-asyncio>=0.24.0",
     "pytest-cache>=1.0",
     "pytest-sugar>=1.0.0",
     "pytest-testmon>=2.1.3",
     "pytest-xdist>=3.8.0",
+    "pre-commit>=4.2.0",
 ]

chat-server/app/tool_utils/tool_node.py (1)
26-64: Ensure tool metadata is always applied regardless of config type.

The current code returns `config` unchanged if it's not a dict (line 64), meaning `tool_text`, `tool_category`, and `should_display_tool` are dropped when config is `None` or non-dict. This can occur in langgraph's ToolNode execution flow. Fix by defaulting to `{}` when config is not a dict and always merging tool_config.

🛠️ Proposed fix
-    if isinstance(config, dict):
-        merged_config: RunnableConfig = config.copy()
+    base_config: RunnableConfig = config if isinstance(config, dict) else {}
+    merged_config: RunnableConfig = base_config.copy()

     if "tags" in tool_config:
         existing_tags = merged_config.get("tags", [])
         if not isinstance(existing_tags, list):
             existing_tags = [existing_tags] if existing_tags else []
         merged_config["tags"] = existing_tags + tool_config["tags"]

     if "metadata" in tool_config:
         existing_metadata = merged_config.get("metadata", {})
         if not isinstance(existing_metadata, dict):
             existing_metadata = {}
         merged_config["metadata"] = {**existing_metadata, **tool_config["metadata"]}

-        return merged_config
-
-    return config
+    return merged_config

chat-server/app/tool_utils/tools/get_table_schema.py (2)
69-73: Filter uses OR (should) instead of AND (must) logic.

When both `dataset_ids` and `org_id` filters are present, using `should` combines them with OR semantics—returning records that match any dataset ID or the org ID. This likely isn't the intended behavior; you probably want records that match the specified dataset IDs and belong to the organization.

🐛 Proposed fix to use AND logic
     if filter_conditions:
         search_result = await client.scroll(
             collection_name=settings.QDRANT_COLLECTION,
-            scroll_filter=Filter(should=filter_conditions),
+            scroll_filter=Filter(must=filter_conditions),
         )
80-86: Exception type mismatch: catching `JSONDecodeError` but code uses Pydantic.

`DatasetSchema(**metadata)` performs Pydantic validation, which raises `ValidationError`, not `json.JSONDecodeError`. The current handler won't catch actual validation errors.

🐛 Proposed fix
+from pydantic import ValidationError
 ...
         try:
             metadata = payload.get("metadata", {})
             dataset_schema = DatasetSchema(**metadata)
             schemas.append(dataset_schema.format_for_prompt())
-        except json.JSONDecodeError as e:
+        except (json.JSONDecodeError, ValidationError) as e:
             logger.exception(f"Error parsing schema JSON: {e}")
             continue

chat-server/tests/unit/test_schema_search.py (1)
153-154: Inconsistent mock signature will cause test failure.

This `fake_get_vector_store` only accepts `embeddings`, while all other test cases in this file were updated to accept both `embeddings=None` and `collection_name=None`. If the real `get_vector_store` is called with `collection_name`, this mock will raise a `TypeError`.

🐛 Proposed fix
-    def fake_get_vector_store(embeddings):
+    def fake_get_vector_store(embeddings=None, collection_name=None):
         return object()

chat-server/app/services/qdrant/schema_search.py (1)
18-61: Fix tenant isolation: move org_id to must filter.

Using `org_id` in `should` allows results to match by project/dataset ID alone, ignoring the organization context. This creates a cross-tenant data leak risk. With OR semantics, a schema from a different organization with the same project ID could be returned. Additionally, `org_id` is often not provided to these functions (callers in `semantic_search.py`, `context_processor.py`, and `identify_datasets.py` omit it), resulting in no tenant filtering in those cases.

Separate `org_id` into a `must` condition to enforce tenant isolation:

Proposed fix
-    if org_id:
-        filter_conditions.append(
-            models.FieldCondition(
-                key="metadata.org_id",
-                match=models.MatchValue(value=org_id),
-            )
-        )
+    must_conditions = []
+    if org_id:
+        must_conditions.append(
+            models.FieldCondition(
+                key="metadata.org_id",
+                match=models.MatchValue(value=org_id),
+            )
+        )

-    if filter_conditions:
-        query_filter = models.Filter(should=filter_conditions)
+    if filter_conditions or must_conditions:
+        query_filter = models.Filter(
+            should=filter_conditions or None,
+            must=must_conditions or None,
+        )
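For reference, a standalone sketch of the intended must/should split; the helper name is hypothetical and the payload keys are taken from this comment rather than verified against the repo:

```python
from qdrant_client import models


def build_schema_filter(
    dataset_ids: list[str],
    project_ids: list[str],
    org_id: str | None,
) -> models.Filter | None:
    """OR the dataset/project conditions, but AND the tenant scope on top of them."""
    should = [
        models.FieldCondition(key="metadata.dataset_id", match=models.MatchValue(value=d))
        for d in dataset_ids
    ] + [
        models.FieldCondition(key="metadata.project_id", match=models.MatchValue(value=p))
        for p in project_ids
    ]
    must = []
    if org_id:
        # The tenant scope must always hold, regardless of which ID condition matched.
        must.append(
            models.FieldCondition(key="metadata.org_id", match=models.MatchValue(value=org_id))
        )
    if not should and not must:
        return None
    return models.Filter(should=should or None, must=must or None)
```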
get_schema_from_qdrant,get_schema_by_dataset_ids, andget_project_schemasinget_schema.py.chat-server/tests/unit/test_column_value_matching.py (1)
71-79: Duplicate dictionary key "error" - only the second definition is used.

The dummy logger has `"error"` defined twice (lines 75 and 77). Python dictionaries only keep the last value for duplicate keys, so this is redundant. While it doesn't affect test behavior (both are no-op lambdas), it should be cleaned up.

🧹 Proposed fix
 dummy_logger = type(
     "L",
     (),
     {
         "error": lambda *args, **kwargs: None,
         "debug": lambda *args, **kwargs: None,
-        "error": lambda *args, **kwargs: None,
     },
 )()

chat-server/app/services/qdrant/get_schema.py (1)
100-112: Filter logic may not behave as intended for multi-dataset + org_id queries.

Using `Filter(should=filter_conditions)` applies OR logic across all conditions. When `org_id` is provided alongside multiple `dataset_ids`, this will match records where `dataset_id` matches ANY of the IDs OR `org_id` matches—potentially returning datasets from other organizations.

If the intent is to filter datasets within a specific organization, consider using nested filters:
🐛 Proposed fix for AND semantics between dataset_ids (OR) and org_id
+    dataset_conditions = [
+        FieldCondition(
+            key="metadata.dataset_id",
+            match=MatchValue(value=dataset_id),
+        )
+        for dataset_id in dataset_ids
+    ]
+
+    filter_obj = Filter(should=dataset_conditions)
+
     if org_id:
-        filter_conditions.append(
-            FieldCondition(
-                key="metadata.org_id",
-                match=MatchValue(value=org_id),
-            )
+        filter_obj = Filter(
+            must=[
+                Filter(should=dataset_conditions),
+                FieldCondition(
+                    key="metadata.org_id",
+                    match=MatchValue(value=org_id),
+                ),
+            ]
         )

     search_result = await client.scroll(
         collection_name=settings.QDRANT_COLLECTION,
-        scroll_filter=Filter(should=filter_conditions),
+        scroll_filter=filter_obj,
         limit=len(dataset_ids),
     )

chat-server/app/tool_utils/tools/list_datasets.py (3)
79-85: Mutable default arguments and type hint issue.

Lines 82-83: Using mutable default arguments (`list[str] = []`) is a Python antipattern that can lead to unexpected behavior if the list is mutated. Line 84: The `config` parameter should use `Optional[RunnableConfig]` type hint for consistency.

🐛 Proposed fix
 @tool
 async def get_all_datasets(
     status_message: str = "",
-    project_ids: list[str] = [],
-    dataset_ids: list[str] = [],
-    config: RunnableConfig = None,
+    project_ids: list[str] | None = None,
+    dataset_ids: list[str] | None = None,
+    config: RunnableConfig | None = None,
 ) -> str:

Then update the usage inside the function:
-    if not project_ids and not dataset_ids:
+    if not project_ids and not dataset_ids:
         ...
-    if project_ids:
+    if project_ids:
         dataset_names = await get_dataset_names_from_project_ids(project_ids, org_id=org_id)
38-52: Missing exception handling for network errors.

Unlike `get_dataset_names_from_project_ids`, which has a try-except block (lines 27-28), this function lacks exception handling for network failures. This inconsistency could cause unhandled exceptions to propagate.

🐛 Proposed fix
 async def get_project_ids_for_datasets_ids(
     dataset_ids: list[str],
     org_id: Optional[str] = None,
 ) -> dict[str, str]:
     dataset_id_project_map = {}
     client = GopieClient(org_id=org_id)
     for dataset_id in dataset_ids:
-        path = f"/v1/api/datasets/{dataset_id}/project"
-        async with await client.get(path, ssl=False) as response:
-            if response.status == 200:
-                dataset_data = await response.json()
-                dataset_id_project_map[dataset_id] = dataset_data.get("project_id", "")
-            else:
-                logger.warning(f"Dataset {dataset_id} not found")
+        try:
+            path = f"/v1/api/datasets/{dataset_id}/project"
+            async with await client.get(path, ssl=False) as response:
+                if response.status == 200:
+                    dataset_data = await response.json()
+                    dataset_id_project_map[dataset_id] = dataset_data.get("project_id", "")
+                else:
+                    logger.warning(f"Dataset {dataset_id} not found")
+        except Exception as e:
+            logger.exception(f"Error fetching project for dataset {dataset_id}: {e}")
     return dataset_id_project_map
55-76: Missing exception handling (same as above).

This function also lacks the try-except block present in `get_dataset_names_from_project_ids`. Consider adding consistent error handling.
🤖 Fix all issues with AI agents
In `@chat-server/app/services/gopie/client.py`:
- Around line 51-97: The get/post methods (get and post in client.py) currently
lack timeouts and can hang; add a GOPIE_API_TIMEOUT integer setting (following
E2B_TIMEOUT pattern) to config/settings and use aiohttp.ClientTimeout with that
value when making outbound calls: either configure the SingletonAiohttp session
(_session) with ClientTimeout(total=GOPIE_API_TIMEOUT) on creation or pass
timeout=aiohttp.ClientTimeout(total=GOPIE_API_TIMEOUT) to self._session.get(...)
and self._session.post(...); update the client to read the GOPIE_API_TIMEOUT
setting and apply it so get/post never send requests without an explicit
timeout.
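As a rough illustration of the timeout wiring suggested above, here is a minimal aiohttp sketch; `GOPIE_API_TIMEOUT` and the helper name are assumptions, and the real GopieClient/SingletonAiohttp session handling may differ:

```python
import aiohttp

GOPIE_API_TIMEOUT = 30  # seconds; in the real client this would come from settings


async def fetch_json(base_url: str, path: str, headers: dict[str, str] | None = None) -> dict:
    """GET a Gopie endpoint with an explicit total timeout so the call can never hang."""
    timeout = aiohttp.ClientTimeout(total=GOPIE_API_TIMEOUT)
    async with aiohttp.ClientSession(base_url=base_url, timeout=timeout) as session:
        async with session.get(path, headers=headers) as response:
            response.raise_for_status()
            return await response.json()
```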
In `@chat-server/app/services/gopie/dataset_info.py`:
- Around line 11-31: The async context manager usage in get_dataset_info (and
similarly in get_project_info) incorrectly uses "async with await
client.get(path)"; remove the explicit await and use "async with
client.get(path)" so the coroutine returned by client.get is awaited by the
context manager itself; update both functions to change the context manager
lines to use client.get(path) without await and keep the inner await
response.json() as-is to preserve reading the body.
In `@chat-server/app/services/qdrant/get_schema.py`:
- Around line 47-55: The code assumes search_result is always set but if
filter_conditions is empty the client.scroll branch won't run and search_result
is unbound; update the function containing this logic (the block using
filter_conditions, search_result, client.scroll and settings.QDRANT_COLLECTION)
to either (a) early-return None when filter_conditions is empty/invalid (e.g.,
when dataset_id is falsy and org_id is None) or (b) initialize search_result to
a safe default before the if (e.g., None) and add a guard before accessing
search_result[0]; pick one approach and apply it consistently so that the
subsequent check "if not search_result[0] or not search_result[0][0]" cannot
raise UnboundLocalError.
In `@chat-server/app/tool_utils/tools/plan_sql_query.py`:
- Around line 39-41: The plan_sql_query function calls
get_prompt_llm_chain("generate_sql", config) but omits schema enforcement;
update that call in plan_sql_query to pass schema=PlanQueryOutput so the LLM
response is validated into the PlanQueryOutput dataclass, and update the
plan_sql_query docstring to describe the actual return contract: a
PlanQueryOutput containing sql_queries (list of SqlQueryOutput objects which
include nested tables_used), non_sql_response, user_friendly_response, and
limitations (instead of the old top-level reasoning/expected_result/tables_used
fields).
In `@chat-server/app/utils/graph_utils/column_value_matching.py`:
- Around line 21-65: The _build_ilike_query function (and related query
builders) currently interpolate table_name and column_name directly into SQL,
creating SQL injection risk; update match_column_values() to validate both table
and column against a whitelist of actual schema identifiers (e.g., query the
database schema or use a precomputed set) and only pass validated names into
_build_ilike_query, and likewise change get_table_estimated_size() to require
validated table_name before interpolation; ensure validation rejects or
normalizes any identifier that is not an exact match to the known table/column
names and fail fast rather than interpolating untrusted values.
In `@chat-server/app/utils/graph_utils/table_utils.py`:
- Around line 45-46: calculate_sampling_percentage currently divides by
estimated_size and will raise ZeroDivisionError when estimated_size is 0; update
the function (calculate_sampling_percentage) to guard against estimated_size <=
0 by returning a safe formatted value instead of dividing (e.g., return
"100.000000" to indicate full sampling) and ensure the computed sample_pct is
clamped between 0 and 100 when using settings.TARGET_ROWS / estimated_size to
avoid extremes; locate the logic that uses settings.TARGET_ROWS and sample_pct
and add the conditional check and clamping there.
- Around line 18-22: The code directly interpolates table_name into size_query
creating an SQL injection risk and also can divide by zero in
calculate_sampling_percentage; to fix, validate/sanitize table_name before
building size_query (e.g., allow only /^[A-Za-z0-9_]+$/ or check against a
whitelist) and reject or raise a clear error for invalid names instead of
interpolating raw input (this change touches the size_query construction and
callers that pass table_name and can optionally use/extend execute_sql to accept
safe parameters later); also add a zero-check in calculate_sampling_percentage
to handle estimated_size == 0 (return 0 sampling percentage or raise a
controlled error) before performing any division.
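A small sketch of the guarded calculation, assuming the function keeps returning a formatted percentage string as it does today; the `target_rows` default is a placeholder for `settings.TARGET_ROWS`:

```python
def calculate_sampling_percentage_safe(estimated_size: int, target_rows: int = 50_000) -> str:
    """Return a sampling percentage string, guarding against zero or negative sizes."""
    if estimated_size <= 0:
        # Nothing reliable to sample from: fall back to scanning everything.
        return "100.000000"
    pct = (target_rows / estimated_size) * 100
    pct = max(0.000001, min(pct, 100.0))  # clamp to a sane 0-100 range
    return f"{pct:.6f}"
```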
In `@chat-server/app/workflow/graph/multi_dataset_graph/node/analyze_dataset.py`:
- Around line 47-49: The indexing of subqueries using subquery_index is unsafe:
before assigning current_query = subqueries[state.get("subquery_index", 0)],
validate that subqueries is a non-empty list and that
state.get("subquery_index") is an int within range; if not, fallback to a safe
default (e.g., None or the first element) and handle that downstream. Update the
code around subqueries, state.get("subquery_index"), and current_query to
perform a type-and-range check (or coerce the index) and choose a safe fallback
to avoid IndexError/TypeError.
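A possible shape for the defensive lookup (names are illustrative, not the actual node code):

```python
def pick_current_subquery(subqueries: object, index: object) -> str | None:
    """Return the subquery at `index`, falling back safely on bad input."""
    if not isinstance(subqueries, list) or not subqueries:
        return None
    if not isinstance(index, int) or not (0 <= index < len(subqueries)):
        index = 0  # safe fallback: first subquery
    return subqueries[index]
```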
In `@chat-server/app/workflow/graph/multi_dataset_graph/node/sql_agent.py`:
- Line 31: The code reads query_result.subqueries[query_index].retry_count
without checking that subqueries exists and query_index is in range; add a
defensive check before access in the function/method where this line appears
(use query_result, subqueries, query_index and retry_count as references):
verify that query_result.subqueries is truthy and that 0 <= query_index <
len(query_result.subqueries) and only then read .retry_count, otherwise assign a
safe default (e.g., 0) or handle the out-of-bounds case (raise a clear error or
skip processing) so an IndexError cannot occur.
In `@chat-server/app/workflow/graph/nl_to_sql_graph/node/semantic_search.py`:
- Around line 8-20: The semantic_search function calls search_schemas without
scoping by organization; extract org_id from the RunnableConfig metadata (e.g.,
config.metadata.get("org_id") or the equivalent pattern used in execute_query.py
/ analyze_dataset.py) and pass it into the search_schemas call as the org_id
parameter so search_schemas(user_query=..., embeddings=..., dataset_ids=...,
project_ids=..., org_id=org_id). Ensure you use the same metadata access pattern
as get_model_provider and other workflow nodes to avoid KeyError when metadata
is absent.
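A minimal sketch of pulling org_id out of a RunnableConfig-style dict; the exact metadata layout used by get_model_provider and the other nodes is assumed here, not verified:

```python
from typing import Any


def org_id_from_config(config: dict[str, Any] | None) -> str | None:
    """Read metadata.org_id from a RunnableConfig-like dict without raising on missing keys."""
    if not isinstance(config, dict):
        return None
    metadata = config.get("metadata") or {}
    return metadata.get("org_id") if isinstance(metadata, dict) else None
```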
In `@chat-server/app/workflow/graph/nl_to_sql_graph/node/sql_agent.py`:
- Around line 35-38: The schema lookup is missing the org_id, so change the
get_schema_from_qdrant call to pass the tenant id (e.g.,
get_schema_from_qdrant(dataset_id=dataset_id, org_id=org_id)); ensure org_id is
obtained from the current context/inputs (where this node receives tenant info)
or added to the function signature if absent, and propagate it alongside
dataset_ids to avoid cross-tenant schema leakage while keeping the same return
handling when schema is None.
In `@chat-server/app/workflow/graph/single_dataset_graph/graph.py`:
- Around line 40-42: The graph currently always routes from prepare_data to
sql_agent even when prepare_data returns no dataset_info; update the graph
and/or agents to handle that case: add a conditional edge from "prepare_data" to
an error handler node (e.g. "prepare_error" or "validate_result") and only add
the edge prepare_data -> "sql_agent" when dataset_info is present, or modify
sql_agent to validate dataset_info (the variable name dataset_info) before
calling sql_planning_agent and route to the error handler if missing; touch the
graph_builder edges around START, "prepare_data", "sql_agent", "execute_sql",
and "validate_result" and the sql_agent/sql_planning_agent dataset_info check to
ensure safe conditional routing.
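A sketch of the conditional routing with langgraph's StateGraph; the state fields and node bodies below are stand-ins, not the real single_dataset_graph implementation:

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class State(TypedDict, total=False):
    dataset_info: dict
    query_result: dict


def route_after_prepare(state: State) -> str:
    # Only hand off to the SQL agent when prepare_data actually produced dataset_info.
    return "sql_agent" if state.get("dataset_info") else "validate_result"


builder = StateGraph(State)
builder.add_node("prepare_data", lambda state: {})
builder.add_node("sql_agent", lambda state: {})
builder.add_node("validate_result", lambda state: {})
builder.add_edge(START, "prepare_data")
builder.add_conditional_edges("prepare_data", route_after_prepare, ["sql_agent", "validate_result"])
builder.add_edge("sql_agent", "validate_result")
builder.add_edge("validate_result", END)
graph = builder.compile()
```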
In `@chat-server/app/workflow/graph/single_dataset_graph/node/execute_sql.py`:
- Around line 47-52: The loop over
query_result.single_dataset_query_result.sql_queries calls
execute_sql_with_full_and_truncated without passing the tenant/org id, causing
org_id to be None; update the call in execute_sql.py to pass the current org id
(e.g., org_id variable or attribute available in the surrounding scope) into
execute_sql_with_full_and_truncated(query=sql_info.sql_query, org_id=org_id) and
ensure sql_info.sql_query_result is assigned the truncated_result_data as before
so tenant isolation is preserved.
- Around line 21-35: The guard incorrectly checks isinstance(last_message,
ErrorMessage) while last_message is a list; change how messages are read so you
inspect the actual last message element: fetch messages = state.get("messages",
[]) and set last_message = messages[-1] if messages else None, then update the
conditional to check if non_sql_response is present or isinstance(last_message,
ErrorMessage) and raise/return accordingly (use the existing ErrorMessage class
and the query_result variable names shown).
In `@chat-server/app/workflow/graph/single_dataset_graph/node/prepare_data.py`:
- Around line 106-109: The code builds sample_data_query via f-string with
dataset_name, which allows SQL injection; instead, stop interpolating raw
dataset_name into SQL in prepare_data.py and use a safe approach when calling
execute_sql_with_limit: either pass a parameterized query where possible (use
placeholders for values) and supply parameters to execute_sql_with_limit, or
(for SQL identifiers) validate/sanitize dataset_name against a strict pattern
(e.g., /^[A-Za-z0-9_]+$/) or use the DB driver's identifier-quoting mechanism
before injecting it; update the sample_data_query construction and the call to
execute_sql_with_limit (referencing sample_data_query, dataset_name and
execute_sql_with_limit) so only a validated/quoted identifier is used and all
dynamic values are passed as parameters, then keep using convert_rows_to_csv on
the returned rows.
In `@chat-server/app/workflow/graph/single_dataset_graph/node/routing.py`:
- Around line 5-10: The code sets query_result = state.get("query_result", [])
which can be a list and causes attribute access errors; change the default to
None (query_result = state.get("query_result")) and ensure you only access
query_result.single_dataset_query_result when query_result is truthy, e.g. set
single_dataset_res = query_result.single_dataset_query_result if query_result
else None, then compute sql_queries from single_dataset_res only when it and its
sql_queries exist (keeping the existing list comprehension for [sq.sql_query for
sq in single_dataset_res.sql_queries if sq.sql_query]).
In `@chat-server/app/workflow/graph/single_dataset_graph/node/sql_agent.py`:
- Around line 23-28: The code accesses query_result.single_dataset_query_result
before verifying query_result is present, which can raise AttributeError; update
the logic in sql_agent.py so you first retrieve query_result =
state.get("query_result") and immediately check if not query_result (or
query_result is None) and raise the ValueError there, and only after that access
query_result.single_dataset_query_result to assign single_dataset_result; ensure
the check covers both query_result and single_dataset_result as intended.
In `@chat-server/tests/e2e/viz_utils/per_example_workflow.py`:
- Around line 401-543: The test scoring logic in process_example assumes a 0–10
scale (score >= 8) but the evaluator may return 0–1; detect and normalize the
score before comparing: after retrieving evaluation =
test_result.get("evaluation", {}) and score = evaluation.get("score", 0), coerce
score to a float, and if score <= 1 then scale it (e.g., score_scaled =
float(score) * 10) else keep as-is (score_scaled = float(score)); then use
passed = score_scaled >= 8 and use score_scaled in any printed messages and
result fields so the pass/fail decision matches the intended 0–10 threshold.
Ensure this change is applied inside process_example where
test_result/evaluation/score are handled.
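One way to express the normalization (a sketch; assumes scores only ever come back on a 0-1 or 0-10 scale):

```python
def normalize_score(raw: object) -> float:
    """Coerce an evaluator score onto the 0-10 scale used by the pass threshold."""
    try:
        score = float(raw)  # handles ints, floats and numeric strings
    except (TypeError, ValueError):
        return 0.0
    return score * 10 if 0 <= score <= 1 else score


# passed = normalize_score(evaluation.get("score", 0)) >= 8
```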
In `@chat-server/tests/e2e/viz_utils/viz_test_case_runner.py`:
- Around line 39-46: The VizEvaluationSchema currently constrains score to 0–1
but the runner uses a 0–10 pass threshold (score >= 8); update the schema field
in VizEvaluationSchema (the score Field) to ge=0.0 and le=10.0 and change its
description to "Similarity score between 0 and 10 (higher is better)"; then
check other places that parse/validate or prompt for the score (any code that
consumes VizEvaluationSchema or performs the score >= 8 check) and ensure
prompts, validation, and test expectations consistently use the 0–10 scale.
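The corresponding field change would look roughly like this; the real schema's other fields are unknown here, so the extra `reasoning` field is only illustrative:

```python
from pydantic import BaseModel, Field


class VizEvaluationSchema(BaseModel):
    score: float = Field(
        ...,
        ge=0.0,
        le=10.0,
        description="Similarity score between 0 and 10 (higher is better)",
    )
    # Illustrative extra field; the actual schema's remaining fields stay unchanged.
    reasoning: str = Field(default="", description="Why the score was assigned")
```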
In `@chat-server/tests/e2e/viz_utils/viz_utils.py`:
- Around line 192-201: The dict_hash function is using hashlib.sha256
incorrectly by slicing the hash object; instead ensure `serialized` is bytes
(encode when needed), pass that to `hashlib.sha256(...)`, then call
`.hexdigest()` (or `.digest()` if you want raw bytes) and slice the resulting
bytes/hex string as intended before returning; update the logic in dict_hash to
encode `serialized` when it's a str, compute the hash via
`hashlib.sha256(serialized_bytes)`, then return the proper sliced `.hexdigest()`
(or slice `.digest()` and then hexlify) so `m.hexdigest()` is called on a real
digest rather than on the hash object.
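A corrected shape for the helper (a sketch; the real dict_hash signature and slice length may differ):

```python
import hashlib
import json


def dict_hash(payload: dict, length: int = 12) -> str:
    """Deterministic short hash: serialize, encode to bytes, hash, then slice the hex digest."""
    serialized = json.dumps(payload, sort_keys=True, default=str)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    return digest[:length]
```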
🟡 Minor comments (10)
chat-server/app/workflow/graph/graph_mermaid/sql_agent_graph.mmd-8-12 (1)
8-12: Connect or remove the orphan `search_duckdb_docs` node.

Right now it's defined but not part of the flow, so the graph renders a floating node and misrepresents the pipeline. If it's intended, wire it in; otherwise remove it.
✅ Example fix (wire into flow)
 __start__ --> generate_sql;
-generate_sql --> __end__;
+generate_sql --> search_duckdb_docs;
+search_duckdb_docs --> __end__;

chat-server/tests/dspy/evaluator.py-82-92 (1)
82-92: Broad exception catch masks specific failure modes that could aid debugging.

The exception handler at line 82 catches all exceptions indiscriminately, making it difficult to distinguish between parsing failures (recoverable) and unexpected errors. Similar code in this file (e.g., `optimize_evaluator.py` line 28) catches specific exception types like `ValueError` and `AttributeError`. Consider narrowing the catch to specific exceptions that the `dspy.ChainOfThought` evaluation is expected to raise (e.g., `ValueError`, `TypeError`, `AttributeError`) to let unexpected errors propagate for visibility.

Additionally, returning `evaluation_score: 0.0` silently on any failure could cause misleading test results—tests expecting high scores will correctly fail, but those expecting low scores might incorrectly pass.

chat-server/tests/dspy/optimize_evaluator.py-40-42 (1)
40-42: `--no-reasoning` flag and `use_reasoning` parameter are unused.

The CLI flag and function parameter are collected and passed through but never actually used to control behavior. EvaluatorModule hardcodes `dspy.ChainOfThought` with no provision to switch to `dspy.Predict`. Remove these unused parameters or implement the feature by adding a constructor parameter to EvaluatorModule and conditionally instantiating the appropriate module type.

chat-server/tests/test_config.py-33-33 (1)
33-33: Update the module docstring default to match the new URL.

The header comment still mentions port 8001.
🔧 Suggested docstring update
-    CHAT_SERVER_URL - Chat server endpoint (default: http://localhost:8001/api/v1/chat/completions)
+    CHAT_SERVER_URL - Chat server endpoint (default: http://localhost:8003/api/v1/chat/completions)

chat-server/tests/test_config.py-22-26 (1)
22-26: Add `python-dotenv` as an explicit dependency in `pyproject.toml`.

The import will work because `python-dotenv` is available as a transitive dependency in the lockfile, but it's not explicitly declared. Add it to the main `dependencies` list to ensure it remains available if upstream dependencies change.

chat-server/tests/e2e/utils/dataset_test_cases.py-11-26 (1)
11-26: Replace non-breaking hyphens and en dashes with regular ASCII hyphens in expected_result strings.

The file contains multiple instances of Unicode non-breaking hyphens (U+2011) and en dashes (U+2013) in expected_result strings (lines 55, 205, 235, 250, 265, 434, 494, 584, 590, 629, 644, 659, 689). These characters differ from regular hyphens (U+002D) and will cause string comparison failures if test assertions compare expected results programmatically.
Examples:
`division‑by‑zero`, `Non‑SQL`, `group‑by`, `side‑by‑side`, `X‑axis`, `Y‑axis`.

Replace all instances with standard ASCII hyphens (the `-` character).

chat-server/app/workflow/graph/multi_dataset_graph/node/sql_agent.py-86-86 (1)
86-86: Typo: "Succesfully" should be "Successfully".

Proposed fix

-    content="Succesfully Completed query planning step in multidataset workflow."
+    content="Successfully completed query planning step in multi-dataset workflow."
21-21: Potential `TypeError` if `subqueries` is `None`.

If `state.get("subqueries")` returns `None`, the indexing `[query_index]` will raise a `TypeError`. The fallback `"No input"` only applies if the entire conditional is falsy.

Proposed fix

-    user_query = state.get("subqueries")[query_index] if state.get("subqueries") else "No input"
+    subqueries = state.get("subqueries")
+    user_query = subqueries[query_index] if subqueries else "No input"

chat-server/app/services/gopie/generate_schema.py-15-49 (1)
15-49: Remove unnecessary else clause and add defensive quoting for dataset_name.

The `else` block on line 37 is unnecessary since the preceding `if` statement returns early. The code can be simplified by removing the else and unindenting the subsequent statements.

Additionally, while `dataset_name` comes from the trusted Gopie API response, consider quoting it defensively to align with test patterns elsewhere in the codebase (e.g., `f'SELECT * FROM "{dataset_name}"'`). This provides defense-in-depth against any unexpected values.

Finally, add error handling around the `execute_sql` call on line 72 to provide comprehensive error messages and logging per coding guidelines.

♻️ Remove unnecessary else clause and add quoting
     if not should_use_sampling(estimated_size):
         logger.debug(
             f"[{dataset_name}] Small dataset detected ({estimated_size} rows). "
             "Using standard nested query logic."
         )
         return f"""
             SELECT DISTINCT * FROM (
-                SELECT * FROM {dataset_name} LIMIT 200000
+                SELECT * FROM "{dataset_name}" LIMIT 200000
             )
             LIMIT {limit}
         """
-    else:
-        pct_str = calculate_sampling_percentage(estimated_size)
+    pct_str = calculate_sampling_percentage(estimated_size)

-        logger.debug(
-            f"[{dataset_name}] Large dataset detected ({estimated_size} rows). "
-            f"Sampling {pct_str}% (system) to retrieve approx {settings.TARGET_ROWS} rows."
-        )
+    logger.debug(
+        f"[{dataset_name}] Large dataset detected ({estimated_size} rows). "
+        f"Sampling {pct_str}% (system) to retrieve approx {settings.TARGET_ROWS} rows."
+    )

-        return f"""
-            SELECT DISTINCT * FROM {dataset_name}
-            USING SAMPLE {pct_str}% (system)
-            LIMIT {limit}
-        """
+    return f"""
+        SELECT DISTINCT * FROM "{dataset_name}"
+        USING SAMPLE {pct_str}% (system)
+        LIMIT {limit}
+    """

chat-server/tests/e2e/viz_utils/per_example_workflow.py-181-238 (1)
181-238: Quote dataset names in sample SQL to avoid syntax errors/injection.

`dataset_name` is interpolated directly into SQL without quoting. If the name contains spaces/reserved words or quotes, this will fail (and can be abused if names aren't sanitized).

✅ Safer quoting

-    body = {"query": f"SELECT * FROM {dataset_name} LIMIT 5"}
+    safe_name = dataset_name.replace('"', '""')
+    body = {"query": f'SELECT * FROM "{safe_name}" LIMIT 5'}
| async def get_dataset_info( | ||
| dataset_id: str, | ||
| project_id: str, | ||
| org_id: Optional[str] = None, | ||
| ) -> DatasetDetails: | ||
| """Get dataset information from Gopie API. | ||
|
|
||
| url = f"{settings.GOPIE_API_ENDPOINT}/v1/api/projects/{project_id}/datasets/{dataset_id}" | ||
| headers = {"accept": "application/json"} | ||
| Args: | ||
| dataset_id: The dataset ID | ||
| project_id: The project ID | ||
| org_id: Optional organization ID for multi-tenant support | ||
|
|
||
| try: | ||
| async with http_session.get(url, headers=headers) as response: | ||
| data = await response.json() | ||
| return DatasetDetails(**data) | ||
| except Exception as e: | ||
| logger.exception(f"Error getting dataset info: {e!s}") | ||
| raise e | ||
| Returns: | ||
| DatasetDetails object | ||
| """ | ||
| client = GopieClient(org_id=org_id) | ||
| path = f"/v1/api/projects/{project_id}/datasets/{dataset_id}" | ||
|
|
||
| async with await client.get(path) as response: | ||
| data = await response.json() | ||
| return DatasetDetails(**data) |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find and inspect GopieClient.get method
find chat-server -name "*.py" -type f | xargs rg -l "class GopieClient" | head -5Repository: factly/gopie
Length of output: 96
🏁 Script executed:
# Read the GopieClient implementation
cat -n chat-server/app/services/gopie/client.pyRepository: factly/gopie
Length of output: 3577
🏁 Script executed:
# Also check lines 34-52 in dataset_info.py to see other affected functions
sed -n '30,55p' chat-server/app/services/gopie/dataset_info.pyRepository: factly/gopie
Length of output: 679
Remove explicit await from async context manager pattern.
async with await client.get(path) is non-standard for aiohttp. The async context manager protocol automatically awaits the coroutine; use async with client.get(path) instead. This applies to both get_dataset_info and get_project_info functions.
🔧 Fix
-    async with await client.get(path) as response:
+    async with client.get(path) as response:
         data = await response.json()
data = await response.json()🤖 Prompt for AI Agents
In `@chat-server/app/services/gopie/dataset_info.py` around lines 11 - 31, The
async context manager usage in get_dataset_info (and similarly in
get_project_info) incorrectly uses "async with await client.get(path)"; remove
the explicit await and use "async with client.get(path)" so the coroutine
returned by client.get is awaited by the context manager itself; update both
functions to change the context manager lines to use client.get(path) without
await and keep the inner await response.json() as-is to preserve reading the
body.
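For context, the recommended pattern looks like this; `GopieClientSketch` is a simplified stand-in for the real GopieClient, whose `get()` should hand back aiohttp's request context manager without being awaited:

```python
import aiohttp


class GopieClientSketch:
    """Thin wrapper whose get() returns aiohttp's request context manager directly."""

    def __init__(self, session: aiohttp.ClientSession, base_path: str = ""):
        self._session = session
        self._base_path = base_path

    def get(self, path: str, **kwargs):
        # No await here: the caller drives the request with `async with`.
        return self._session.get(f"{self._base_path}{path}", **kwargs)


async def get_dataset_details(client: GopieClientSketch, project_id: str, dataset_id: str) -> dict:
    path = f"/v1/api/projects/{project_id}/datasets/{dataset_id}"
    async with client.get(path) as response:  # note: no `await` before the context manager
        return await response.json()
```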
| if filter_conditions: | ||
| search_result = await client.scroll( | ||
| collection_name=settings.QDRANT_COLLECTION, | ||
| scroll_filter=Filter(should=filter_conditions), | ||
| limit=1, | ||
| ) | ||
|
|
||
| if not search_result[0][0]: | ||
| if not search_result[0] or not search_result[0][0]: | ||
| return None |
Potential UnboundLocalError when filter_conditions is empty.
If dataset_id is falsy and org_id is None, filter_conditions will be empty, the if filter_conditions: block won't execute, and search_result will be unbound when accessed on line 54.
🐛 Proposed fix
if filter_conditions:
search_result = await client.scroll(
collection_name=settings.QDRANT_COLLECTION,
scroll_filter=Filter(should=filter_conditions),
limit=1,
)
+ else:
+ return None
if not search_result[0] or not search_result[0][0]:
        return None

Alternatively, add early validation:
+ if not dataset_id:
+ logger.warning("get_schema_from_qdrant called without dataset_id")
+ return None
+
try:
        client = await QdrantSetup.get_async_client(settings.QDRANT_COLLECTION)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if filter_conditions: | |
| search_result = await client.scroll( | |
| collection_name=settings.QDRANT_COLLECTION, | |
| scroll_filter=Filter(should=filter_conditions), | |
| limit=1, | |
| ) | |
| if not search_result[0][0]: | |
| if not search_result[0] or not search_result[0][0]: | |
| return None | |
| if filter_conditions: | |
| search_result = await client.scroll( | |
| collection_name=settings.QDRANT_COLLECTION, | |
| scroll_filter=Filter(should=filter_conditions), | |
| limit=1, | |
| ) | |
| else: | |
| return None | |
| if not search_result[0] or not search_result[0][0]: | |
| return None |
🤖 Prompt for AI Agents
In `@chat-server/app/services/qdrant/get_schema.py` around lines 47 - 55, The code
assumes search_result is always set but if filter_conditions is empty the
client.scroll branch won't run and search_result is unbound; update the function
containing this logic (the block using filter_conditions, search_result,
client.scroll and settings.QDRANT_COLLECTION) to either (a) early-return None
when filter_conditions is empty/invalid (e.g., when dataset_id is falsy and
org_id is None) or (b) initialize search_result to a safe default before the if
(e.g., None) and add a guard before accessing search_result[0]; pick one
approach and apply it consistently so that the subsequent check "if not
search_result[0] or not search_result[0][0]" cannot raise UnboundLocalError.
| try: | ||
| chain = get_prompt_llm_chain("plan_sql_query_tool", config) | ||
| chain = get_prompt_llm_chain("generate_sql", config) | ||
| response = await chain.ainvoke({"user_query": user_query, "dataset_info": dataset_info}) |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find generate_sql prompt definition
find chat-server -type f -name "*.py" -o -name "*.json" -o -name "*.yaml" -o -name "*.yml" | head -50 | xargs grep -l "generate_sql" 2>/dev/nullRepository: factly/gopie
Length of output: 168
🏁 Script executed:
#!/bin/bash
# Search for generate_sql in prompts directory
fd "prompts" chat-server --type dRepository: factly/gopie
Length of output: 168
🏁 Script executed:
#!/bin/bash
# Search for get_prompt_llm_chain implementation
rg -n "def get_prompt_llm_chain" chat-server --type py -A 10Repository: factly/gopie
Length of output: 893
🏁 Script executed:
#!/bin/bash
# Inspect plan_sql_query to see full implementation
cat -n chat-server/app/tool_utils/tools/plan_sql_query.pyRepository: factly/gopie
Length of output: 2585
🏁 Script executed:
#!/bin/bash
# Get full get_prompt_llm_chain implementation
sed -n '86,150p' chat-server/app/utils/langsmith/prompt_manager.pyRepository: factly/gopie
Length of output: 1362
🏁 Script executed:
#!/bin/bash
# Search for generate_sql prompt or prompt mapping
rg -n "generate_sql" chat-server --type py -B 2 -A 5Repository: factly/gopie
Length of output: 9931
🏁 Script executed:
#!/bin/bash
# Search for PromptSelector or NodeName enum/mapping
rg -n "class NodeName|generate_sql" chat-server/app --type pyRepository: factly/gopie
Length of output: 1836
🏁 Script executed:
#!/bin/bash
# Check format_generate_sql_input full signature and expected parameter types
sed -n '111,140p' chat-server/app/workflow/prompts/generate_sql_prompt.pyRepository: factly/gopie
Length of output: 1107
🏁 Script executed:
#!/bin/bash
# Check PlanQueryOutput schema definition
rg -n "class PlanQueryOutput" chat-server --type py -A 15Repository: factly/gopie
Length of output: 1977
🏁 Script executed:
#!/bin/bash
# Get full PlanQueryOutput schema and SqlQueryOutput
sed -n '1,50p' chat-server/app/workflow/graph/sql_planner_graph/generate_sql.pyRepository: factly/gopie
Length of output: 2137
Add missing schema=PlanQueryOutput parameter and update tool docstring to match actual contract.
plan_sql_query calls get_prompt_llm_chain("generate_sql", config) without schema=PlanQueryOutput, causing unstructured LLM output instead of validated PlanQueryOutput. This contradicts the workflow implementation (generate_sql.py:80) which enforces the schema. Additionally, the tool's docstring documents incorrect return fields:
- Promises: `reasoning`, `expected_result`, `tables_used` (top-level)
- Actual `PlanQueryOutput`: `sql_queries` (list of `SqlQueryOutput` objects with nested `tables_used`), `non_sql_response`, `user_friendly_response`, `limitations`
Pass schema=PlanQueryOutput to enforce structured output and correct the docstring to match the actual schema.
🤖 Prompt for AI Agents
In `@chat-server/app/tool_utils/tools/plan_sql_query.py` around lines 39 - 41, The
plan_sql_query function calls get_prompt_llm_chain("generate_sql", config) but
omits schema enforcement; update that call in plan_sql_query to pass
schema=PlanQueryOutput so the LLM response is validated into the PlanQueryOutput
dataclass, and update the plan_sql_query docstring to describe the actual return
contract: a PlanQueryOutput containing sql_queries (list of SqlQueryOutput
objects which include nested tables_used), non_sql_response,
user_friendly_response, and limitations (instead of the old top-level
reasoning/expected_result/tables_used fields).
| def _build_ilike_query( | ||
| table_name: str, | ||
| column_name: str, | ||
| escaped_value: str, | ||
| estimated_size: int, | ||
| limit: int = 5, | ||
| ) -> str: | ||
| """Build optimized ILIKE query based on table size. | ||
|
|
||
| Args: | ||
| table_name: Name of the table | ||
| column_name: Name of the column to search | ||
| escaped_value: Escaped search value | ||
| estimated_size: Estimated row count from metadata | ||
| limit: Number of rows to return | ||
|
|
||
| Returns: | ||
| SQL query string | ||
| """ | ||
| where_clause = f"WHERE LOWER(CAST({column_name} AS VARCHAR)) LIKE '%' || LOWER('{escaped_value}') || '%'" | ||
|
|
||
| if not should_use_sampling(estimated_size): | ||
| return f""" | ||
| SELECT DISTINCT {column_name} | ||
| FROM (SELECT * FROM {table_name} LIMIT 200000) | ||
| {where_clause} | ||
| LIMIT {limit} | ||
| """ | ||
| else: | ||
| pct_str = calculate_sampling_percentage(estimated_size) | ||
|
|
||
| logger.debug( | ||
| f"[{table_name}] Large dataset detected ({estimated_size} rows) for ILIKE search. " | ||
| f"Sampling {pct_str}% (system)." | ||
| ) | ||
|
|
||
| return f""" | ||
| SELECT DISTINCT {column_name} | ||
| FROM ( | ||
| SELECT * FROM {table_name} | ||
| USING SAMPLE {pct_str}% (system) | ||
| ) | ||
| {where_clause} | ||
| LIMIT {limit} | ||
| """ |
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the full file to understand context
cat -n chat-server/app/utils/graph_utils/column_value_matching.py | head -100Repository: factly/gopie
Length of output: 3655
🏁 Script executed:
# Find who calls _build_ilike_query and related functions
rg -n "_build_ilike_query|_build_levenshtein_query|verify_fuzzy_values|find_similar_values" --type=py chat-server/app/ -A 3 -B 3Repository: factly/gopie
Length of output: 5354
🏁 Script executed:
# Check if there's any schema validation or whitelisting
rg -n "table_name|column_name" --type=py chat-server/app/utils/graph_utils/ -B 2 -A 2 | head -150Repository: factly/gopie
Length of output: 13328
🏁 Script executed:
# Find callers of find_similar_column_values
rg -n "find_similar_column_values" --type=py chat-server/app/ -B 2 -A 5Repository: factly/gopie
Length of output: 38
🏁 Script executed:
# Check if there's any schema validation or whitelist before these values are used
rg -n "schema|whitelist|validate.*column|validate.*table" --type=py chat-server/app/utils/graph_utils/ -iRepository: factly/gopie
Length of output: 162
🏁 Script executed:
# Look for how database schema metadata is accessed
rg -n "get_table|get_column|table_metadata|column_metadata|schema" --type=py chat-server/app/utils/graph_utils/ -i | head -50Repository: factly/gopie
Length of output: 543
🏁 Script executed:
# Get the full file to understand the function signatures and data sources
cat -n chat-server/app/utils/graph_utils/column_value_matching.py | tail -200Repository: factly/gopie
Length of output: 8903
🏁 Script executed:
# Look at the beginning of the main public function
cat -n chat-server/app/utils/graph_utils/column_value_matching.py | head -160Repository: factly/gopie
Length of output: 6204
🏁 Script executed:
# Search for calls to these functions from other modules
rg -n "from.*column_value_matching|import.*find_similar\|import.*verify_fuzzy" --type=py chat-server/app/ -A 3Repository: factly/gopie
Length of output: 533
🏁 Script executed:
# Check get_table_estimated_size to see if it validates table names
cat -n chat-server/app/utils/graph_utils/table_utils.py | head -50Repository: factly/gopie
Length of output: 1996
🏁 Script executed:
# Find who calls analyze_dataset
rg -n "analyze_dataset" --type=py chat-server/app/ -B 2 -A 5 | head -100Repository: factly/gopie
Length of output: 9503
🏁 Script executed:
# Check if there's any schema discovery or validation before these values are used
rg -n "get_columns|describe|schema.*table" --type=py chat-server/app/ -i | head -40Repository: factly/gopie
Length of output: 2051
🏁 Script executed:
# Check the execute_sql implementation
cat -n chat-server/app/services/gopie/sql_executor.py | head -100Repository: factly/gopie
Length of output: 2584
🌐 Web query:
DuckDB parameterized identifiers support table column names
💡 Result:
Short answer: No — DuckDB prepared parameters can only substitute values, not arbitrary identifiers (like column names). You can parameterize table names using query_table/query($1) or build SQL on the client side, but you cannot bind a parameter to an identifier (column name) inside a prepared statement. [1][2][3]
References:
- Prepared statements / parameters (values only). [1]
- query_table / query functions (allow passing a table name string). [2]
- Identifiers / quoting rules (identifiers must be literal/quoted; not parameter placeholders). [3]
SQL injection risk: table_name and column_name are interpolated directly into queries without parameterization or validation.
While escaped_value is escaped for single quotes, the table_name and column_name parameters are interpolated directly into the SQL string via f-strings. Since these values are extracted from user-provided column_assumptions with only minimal falsy checks, an attacker could inject arbitrary SQL by crafting malicious table or column names.
DuckDB prepared statements do not support parameterized identifiers (only values), making whitelist validation against known schema elements the appropriate mitigation. Apply this validation in match_column_values() before passing these values to the query builders, or ensure the upstream data source guarantees only valid schema identifiers.
Additionally, get_table_estimated_size() in table_utils.py (line 21) has the same vulnerability when interpolating table_name.
🧰 Tools
🪛 Ruff (0.14.13)
43-48: Possible SQL injection vector through string-based query construction
(S608)
57-65: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In `@chat-server/app/utils/graph_utils/column_value_matching.py` around lines 21 -
65, The _build_ilike_query function (and related query builders) currently
interpolate table_name and column_name directly into SQL, creating SQL injection
risk; update match_column_values() to validate both table and column against a
whitelist of actual schema identifiers (e.g., query the database schema or use a
precomputed set) and only pass validated names into _build_ilike_query, and
likewise change get_table_estimated_size() to require validated table_name
before interpolation; ensure validation rejects or normalizes any identifier
that is not an exact match to the known table/column names and fail fast rather
than interpolating untrusted values.
| query_result = state.get("query_result", []) | ||
| single_dataset_res = query_result.single_dataset_query_result | ||
|
|
||
| sql_queries = ( | ||
| [sq.sql_query for sq in single_dataset_res.sql_queries if sq.sql_query] | ||
| if single_dataset_res and single_dataset_res.sql_queries |
Guard against missing query_result before attribute access.
state.get("query_result", []) can return [], and [] has no single_dataset_query_result, which will raise at runtime. Default to None and short-circuit when missing.
✅ Suggested fix
-def route_after_sql_generation(state: State) -> str:
- query_result = state.get("query_result", [])
- single_dataset_res = query_result.single_dataset_query_result
+def route_after_sql_generation(state: State) -> str:
+ query_result = state.get("query_result")
+ if not query_result or not getattr(query_result, "single_dataset_query_result", None):
+ return "no_sql_queries"
+ single_dataset_res = query_result.single_dataset_query_result🤖 Prompt for AI Agents
In `@chat-server/app/workflow/graph/single_dataset_graph/node/routing.py` around
lines 5 - 10, The code sets query_result = state.get("query_result", []) which
can be a list and causes attribute access errors; change the default to None
(query_result = state.get("query_result")) and ensure you only access
query_result.single_dataset_query_result when query_result is truthy, e.g. set
single_dataset_res = query_result.single_dataset_query_result if query_result
else None, then compute sql_queries from single_dataset_res only when it and its
sql_queries exist (keeping the existing list comprehension for [sq.sql_query for
sq in single_dataset_res.sql_queries if sq.sql_query]).
| query_result = state.get("query_result") | ||
|
|
||
| single_dataset_result = query_result.single_dataset_query_result | ||
|
|
||
| if not query_result or not single_dataset_result: | ||
| raise ValueError("query_result is not properly initialized") |
Potential AttributeError: accessing query_result before null check.
Line 25 accesses query_result.single_dataset_query_result before line 27 validates that query_result is not None. If state.get("query_result") returns None, this will raise AttributeError instead of the intended ValueError.
🐛 Proposed fix - reorder validation
query_result = state.get("query_result")
- single_dataset_result = query_result.single_dataset_query_result
-
if not query_result or not single_dataset_result:
+ if not query_result:
+ raise ValueError("query_result is not properly initialized")
+
+ single_dataset_result = query_result.single_dataset_query_result
+
+ if not single_dataset_result:
raise ValueError("query_result is not properly initialized")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| query_result = state.get("query_result") | |
| single_dataset_result = query_result.single_dataset_query_result | |
| if not query_result or not single_dataset_result: | |
| raise ValueError("query_result is not properly initialized") | |
| query_result = state.get("query_result") | |
| if not query_result: | |
| raise ValueError("query_result is not properly initialized") | |
| single_dataset_result = query_result.single_dataset_query_result | |
| if not single_dataset_result: | |
| raise ValueError("query_result is not properly initialized") |
🧰 Tools
🪛 Ruff (0.14.13)
28-28: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@chat-server/app/workflow/graph/single_dataset_graph/node/sql_agent.py` around
lines 23 - 28, The code accesses query_result.single_dataset_query_result before
verifying query_result is present, which can raise AttributeError; update the
logic in sql_agent.py so you first retrieve query_result =
state.get("query_result") and immediately check if not query_result (or
query_result is None) and raise the ValueError there, and only after that access
query_result.single_dataset_query_result to assign single_dataset_result; ensure
the check covers both query_result and single_dataset_result as intended.
    async def process_example(
        self, example_name: str, example_code: str, example_index: int, total_examples: int
    ) -> ExampleResult:
        """
        Complete workflow for ONE example.

        Args:
            example_name: Name of the example
            example_code: Python code from example
            example_index: Current example number (1-indexed)
            total_examples: Total number of examples

        Returns:
            ExampleResult with status and details
        """
        print(f"\n{'='*70}")
        print(f"[{example_index}/{total_examples}] Processing: {example_name}")
        print(f"{'='*70}")

        result = ExampleResult(example_name=example_name, status="error")

        # Step 1: Execute code and generate image
        print("  → Step 1: Generating chart image...")
        image_path, error = self._execute_python_code(example_code)

        if error or not image_path:
            result.status = "skip"
            result.error_message = f"Failed to generate image: {error}"
            print(f"  ⚠️ Skipped - {result.error_message}")
            return result

        print(f"  ✓ Image saved: {Path(image_path).name}")
        result.image_path = image_path

        # Step 2: Extract dataset names
        print("  → Step 2: Extracting dataset names...")
        dataset_names = self._extract_dataset_names(example_code)

        if not dataset_names:
            result.status = "skip"
            result.error_message = "No datasets found in code"
            print(f"  ⚠️ Skipped - {result.error_message}")
            return result

        print(f"  ✓ Found datasets: {', '.join(dataset_names)}")
        result.datasets = dataset_names

        # Step 3: Find datasets in Gopie (assumes already uploaded)
        print("  → Step 3: Finding datasets in Gopie...")
        dataset_ids = []
        for ds_name in dataset_names:
            ds_id = await self._find_dataset_in_gopie(ds_name)
            if ds_id:
                dataset_ids.append(ds_id)
                print(f"  ✓ Found '{ds_name}' (ID: {ds_id[:8]}...)")
            else:
                print(f"  ⚠️ Dataset '{ds_name}' not found in Gopie")

        if not dataset_ids:
            result.status = "skip"
            result.error_message = "No matching datasets found in Gopie"
            print(f"  ⚠️ Skipped - {result.error_message}")
            return result

        # Step 4: Fetch schemas
        print(f"  → Step 4: Fetching {len(dataset_ids)} schema(s)...")
        fetch_tasks = [
            self._fetch_schema_for_dataset(dataset_id=did, project_id=self.project_id)
            for did in dataset_ids
        ]
        schemas_results = await asyncio.gather(*fetch_tasks, return_exceptions=True)
        schemas = [s for s in schemas_results if isinstance(s, dict)]

        if not schemas:
            result.status = "error"
            result.error_message = "Failed to fetch schemas"
            print(f"  ❌ {result.error_message}")
            return result

        print(f"  ✓ Fetched {len(schemas)} schema(s)")

        # Step 5: Generate test case with LLM
        print("  → Step 5: Generating test case with LLM...")
        try:
            prompt = self._create_prompt_for_image(image_path=image_path, datasets_schema=schemas)
            test_case = await self._call_llm(prompt)

            unique_project_ids = {s.get("project_id", "") for s in schemas if s.get("project_id")}
            unique_dataset_ids = [s.get("dataset_id", "") for s in schemas if s.get("dataset_id")]
            unique_dataset_ids = [d for d in unique_dataset_ids if d]

            test_case.project_id = next(iter(unique_project_ids)) if unique_project_ids else ""
            test_case.dataset_id = unique_dataset_ids[0] if len(unique_dataset_ids) == 1 else ""

            dataset_names_for_sql = [
                s.get("dataset_name", "") for s in schemas if s.get("dataset_name")
            ]
            test_case.sql_queries = [f'SELECT * FROM "{dn}"' for dn in dataset_names_for_sql]
            test_case.image_path = image_path

            result.test_case = test_case
            print(f"  ✓ Test case generated: {test_case.query[:60]}...")

        except Exception as e:
            result.status = "error"
            result.error_message = f"LLM generation failed: {e}"
            print(f"  ❌ {result.error_message}")
            return result

        # Step 6: Run test case immediately
        print("  → Step 6: Running test case...")
        try:
            case_dict = {
                "project_id": test_case.project_id,
                "dataset_id": test_case.dataset_id,
                "query": test_case.query,
                "sql_queries": test_case.sql_queries,
                "image_path": test_case.image_path,
            }

            test_result = await run_viz_test_case(case_dict)
            result.test_result = test_result

            if test_result.get("success"):
                evaluation = test_result.get("evaluation", {})
                score = evaluation.get("score", 0)
                passed = score >= 8
                status_emoji = "✅" if passed else "❌"
                result.status = "success" if passed else "error"
                print(
                    f"  {status_emoji} Test {'PASSED' if passed else 'FAILED'} (score: {score}/10)"
                )
            else:
                result.status = "error"
                result.error_message = test_result.get("error", "Unknown error")
                print(f"  ❌ Test error: {result.error_message}")

        except Exception as e:
            result.status = "error"
            result.error_message = f"Test execution failed: {e}"
            print(f"  ❌ {result.error_message}")

        return result
Score threshold assumes a 0–10 scale but evaluator appears 0–1.
If the evaluator returns 0–1 (as in the current prompt/schema), score >= 8 will always fail. Align the scale or convert before comparison.
🔧 Option if keeping 0–1 scale
- score = evaluation.get("score", 0)
- passed = score >= 8
+ score = evaluation.get("score", 0)
+ passed = score >= 0.8
status_emoji = "✅" if passed else "❌"
result.status = "success" if passed else "error"
print(
- f" {status_emoji} Test {'PASSED' if passed else 'FAILED'} (score: {score}/10)"
+ f" {status_emoji} Test {'PASSED' if passed else 'FAILED'} (score: {score:.2f}/1.0)"
        )

🧰 Tools
🪛 Ruff (0.14.13)
498-498: Possible SQL injection vector through string-based query construction
(S608)
504-504: Do not catch blind exception: Exception
(BLE001)
538-538: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@chat-server/tests/e2e/viz_utils/per_example_workflow.py` around lines 401 -
543, The test scoring logic in process_example assumes a 0–10 scale (score >= 8)
but the evaluator may return 0–1; detect and normalize the score before
comparing: after retrieving evaluation = test_result.get("evaluation", {}) and
score = evaluation.get("score", 0), coerce score to a float, and if score <= 1
then scale it (e.g., score_scaled = float(score) * 10) else keep as-is
(score_scaled = float(score)); then use passed = score_scaled >= 8 and use
score_scaled in any printed messages and result fields so the pass/fail decision
matches the intended 0–10 threshold. Ensure this change is applied inside
process_example where test_result/evaluation/score are handled.
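The agent prompt above describes normalizing the evaluator score before comparing it to the 0–10 threshold. A minimal sketch of that idea follows; the helper name `normalize_score` and the example `evaluation` payload are illustrative, not taken from the repository.

```python
# A minimal sketch of the normalization described above; the helper name and
# the example payload shape are illustrative.
def normalize_score(raw_score: float) -> float:
    """Map a 0-1 evaluator score onto the 0-10 scale; pass 0-10 scores through."""
    score = float(raw_score)
    return score * 10 if score <= 1 else score


evaluation = {"score": 0.85}  # payload shape assumed from the snippet above
score_scaled = normalize_score(evaluation.get("score", 0))
passed = score_scaled >= 8
print(f"{'PASSED' if passed else 'FAILED'} (score: {score_scaled:.1f}/10)")
```

Either this normalization or the 0.8 threshold shown in the diff above would work; the important part is that the prompt, the schema, and the comparison agree on one scale.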
class VizEvaluationSchema(BaseModel):
    passed: bool = Field(description="Whether the generated chart matches the reference")
    score: float = Field(
        ge=0.0,
        le=1.0,
        description="Similarity score between 0 and 1 (higher is better)",
    )
    reasoning: str = Field(description="Short explanation of the judgment")
Align evaluation score scale with the pass/fail threshold.
The schema/prompt specify a 0–1 score, but the runner treats it as 0–10 (score >= 8). This will mark nearly all cases as failed or fail schema validation. Consider standardizing on 0–10 to match existing threshold.
🔧 Align schema & prompt to 0–10
class VizEvaluationSchema(BaseModel):
passed: bool = Field(description="Whether the generated chart matches the reference")
score: float = Field(
ge=0.0,
- le=1.0,
- description="Similarity score between 0 and 1 (higher is better)",
+ le=10.0,
+ description="Similarity score between 0 and 10 (higher is better)",
)
    reasoning: str = Field(description="Short explanation of the judgment")

-    "passed (boolean), score (float 0-1), reasoning (short string)."
+    "passed (boolean), score (float 0-10), reasoning (short string)."

Also applies to: 69-83, 314-321
🤖 Prompt for AI Agents
In `@chat-server/tests/e2e/viz_utils/viz_test_case_runner.py` around lines 39 -
46, The VizEvaluationSchema currently constrains score to 0–1 but the runner
uses a 0–10 pass threshold (score >= 8); update the schema field in
VizEvaluationSchema (the score Field) to ge=0.0 and le=10.0 and change its
description to "Similarity score between 0 and 10 (higher is better)"; then
check other places that parse/validate or prompt for the score (any code that
consumes VizEvaluationSchema or performs the score >= 8 check) and ensure
prompts, validation, and test expectations consistently use the 0–10 scale.
def dict_hash(dct: dict[Any, Any]) -> Any:
    """Return a hash of the contents of a dictionary."""
    serialized = json.dumps(dct, sort_keys=True)

    try:
        m = hashlib.sha256(serialized)[:32]  # pyright: ignore[reportArgumentType,reportIndexIssue]
    except TypeError:
        m = hashlib.sha256(serialized.encode())[:32]  # pyright: ignore[reportIndexIssue]

    return m.hexdigest()
Bug: hashlib.sha256() returns a hash object, not bytes—slicing it won't work.
The code attempts to slice hashlib.sha256(serialized)[:32], but hashlib.sha256() returns a hash object, not bytes. This will raise a TypeError (which the code then catches and tries the encoded version, which also won't work correctly). The # pyright: ignore comments are masking the issue.
🐛 Proposed fix
def dict_hash(dct: dict[Any, Any]) -> Any:
"""Return a hash of the contents of a dictionary."""
serialized = json.dumps(dct, sort_keys=True)
-
- try:
- m = hashlib.sha256(serialized)[:32] # pyright: ignore[reportArgumentType,reportIndexIssue]
- except TypeError:
- m = hashlib.sha256(serialized.encode())[:32] # pyright: ignore[reportIndexIssue]
-
- return m.hexdigest()
+    return hashlib.sha256(serialized.encode()).hexdigest()[:32]

🤖 Prompt for AI Agents
In `@chat-server/tests/e2e/viz_utils/viz_utils.py` around lines 192 - 201, The
dict_hash function is using hashlib.sha256 incorrectly by slicing the hash
object; instead ensure `serialized` is bytes (encode when needed), pass that to
`hashlib.sha256(...)`, then call `.hexdigest()` (or `.digest()` if you want raw
bytes) and slice the resulting bytes/hex string as intended before returning;
update the logic in dict_hash to encode `serialized` when it's a str, compute
the hash via `hashlib.sha256(serialized_bytes)`, then return the proper sliced
`.hexdigest()` (or slice `.digest()` and then hexlify) so `m.hexdigest()` is
called on a real digest rather than on the hash object.
fix(workflow): standardize SQL query state key naming

- Changed state key from "relevant_sql_queries" to "previous_sql_queries" in multi_dataset and single_dataset agent nodes
- Changed state key from "previous_sql_queries" to "prev_sql_queries" in analyze_query node to match input parameter naming
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@chat-server/app/utils/providers/embedding_providers/vllm.py`:
- Around line 26-31: The VLLMEmbeddingProvider __init__ currently accepts but
never stores metadata and calls no super; either remove the unused metadata
parameter from VLLMEmbeddingProvider or persist it similarly to
LiteLLMEmbeddingProvider (e.g., store self.metadata and extract org-scoped keys
like user/trace_id as PortkeyEmbeddingProvider does) so downstream methods can
use it; additionally implement a fallback strategy by updating either
VLLMEmbeddingProvider (add a guarded get_embedding call that catches failures
and delegates to a configured fallback provider) or enhance
get_embedding_provider() to return a wrapper that attempts the primary provider
(VLLMEmbeddingProvider) and on exception switches to a secondary provider,
ensuring clear identifiers (VLLMEmbeddingProvider, LiteLLMEmbeddingProvider,
PortkeyEmbeddingProvider, get_embedding_provider) are used to locate and wire
the fallback logic.
🧹 Nitpick comments (5)
chat-server/app/workflow/graph/multi_dataset_graph/node/analyze_query.py (2)
71-93: Consider generating a unique ID for the fallback tool_call.
The hardcoded "id": "fallback_respond_to_user" may cause issues if any downstream code validates or indexes by tool_call_id expecting unique values. LangChain typically uses UUIDs.
♻️ Suggested improvement
+import uuid + ... fallback = AIMessage( content=getattr(response, "content", "") or "", tool_calls=[ { "name": "respond_to_user", "args": { "query_type": "data_query", "confidence_score": 5, "reasoning": "Model did not return a tool call; defaulting to full workflow.", "clarification_needed": "", "status_message": "Analyzing your question...", "response_data": None, }, - "id": "fallback_respond_to_user", + "id": f"fallback_{uuid.uuid4().hex[:12]}", } ], )
179-185: Consider adding defensive checks for robustness.
Direct access to response.tool_calls[0]["args"] could raise IndexError or KeyError if the response structure is unexpected. While the outer try-except catches this, a defensive check would provide clearer error handling.
♻️ Suggested defensive check
async def _handle_analysis_response( response: Any, query_result: QueryResult, tool_call_count: int, config: RunnableConfig, ) -> dict: - tool_args = response.tool_calls[0]["args"] + tool_calls = getattr(response, "tool_calls", None) or [] + if not tool_calls or "args" not in tool_calls[0]: + logger.warning("_handle_analysis_response: missing tool_calls or args; using defaults") + tool_args = {} + else: + tool_args = tool_calls[0]["args"] query_type = tool_args.get("query_type", "conversational") confidence_score = tool_args.get("confidence_score", 5)chat-server/app/core/config.py (1)
113-115: Consider surfacing these sampling knobs in env/docs.
Since TARGET_ROWS and SAMPLING_THRESHOLD are new tunables, adding them to .env.example or config docs would make them more discoverable for operators.
chat-server/app/utils/providers/embedding_providers/vllm.py (1)
12-23: Avoid clobbering or duplicating model_kwargs in embed_query.
If a caller supplies model_kwargs, this currently risks a TypeError (duplicate keyword) or silently discarding caller options. Prefer merging.
♻️ Suggested merge pattern
class VLLMEmbeddingFunctionProvider(OpenAIEmbeddings): def embed_query(self, text: str, **kwargs: Any) -> list[float]: text = f"Instruct: {TASK_DESCRIPTION}\nQuery: {text}" + model_kwargs = {"extra_body": {"truncate_prompt_tokens": -1}} + model_kwargs |= kwargs.pop("model_kwargs", {}) return super().embed_query( - text, model_kwargs={"extra_body": {"truncate_prompt_tokens": -1}}, **kwargs + text, model_kwargs=model_kwargs, **kwargs ) async def aembed_query(self, text: str, **kwargs: Any) -> list[float]: text = f"Instruct: {TASK_DESCRIPTION}\nQuery: {text}" + model_kwargs = {"extra_body": {"truncate_prompt_tokens": -1}} + model_kwargs |= kwargs.pop("model_kwargs", {}) return await super().aembed_query( - text, model_kwargs={"extra_body": {"truncate_prompt_tokens": -1}}, **kwargs + text, model_kwargs=model_kwargs, **kwargs )chat-server/app/utils/model_registry/model_provider.py (1)
175-203: Guard retries when tools are bound to avoid duplicate side effects in downstream chains.
bind_tools() binds tool schemas only and doesn't execute tools directly. However, when this LLM is used downstream in a chain that executes tools, retries at higher levels can cause tool side effects to repeat. As a defensive measure, skip with_retry() when tools are bound, since side-effectful operations (database writes, external API calls) should be idempotent or prevented from retrying altogether.
♻️ Suggested guard to avoid retrying tool calls
- if schema: - structured_llm = llm.with_structured_output(schema=schema, method="json_schema") # type: ignore - return structured_llm.with_retry() # type: ignore + if schema: + structured_llm = llm.with_structured_output(schema=schema, method="json_schema") # type: ignore + return structured_llm.with_retry() # type: ignore elif json_mode: llm = llm.bind(response_format={"type": "json_object"}) if force_tool_calls: llm = llm.bind(tool_choice="required") - return llm.with_retry() # type: ignore + if tool_names or force_tool_calls: + return llm # avoid retrying side-effectful tool calls + return llm.with_retry() # type: ignore
class VLLMEmbeddingProvider(BaseEmbeddingProvider):
    def __init__(
        self,
        metadata: dict[str, str],
    ):
        pass
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find and examine BaseEmbeddingProvider and other embedding providers
echo "=== Finding base.py ==="
fd -t f 'base.py' chat-server/app/utils/providers/embedding_providers
echo -e "\n=== Listing embedding providers directory ==="
fd -t f '\.py$' chat-server/app/utils/providers/embedding_providers
echo -e "\n=== Checking BaseEmbeddingProvider definition ==="
rg -n "class BaseEmbeddingProvider" chat-server/app/utils/providers/embedding_providers -A 20Repository: factly/gopie
Length of output: 1120
🏁 Script executed:
#!/bin/bash
# Get full BaseEmbeddingProvider implementation
echo "=== Full BaseEmbeddingProvider class ==="
cat -n chat-server/app/utils/providers/embedding_providers/base.py
echo -e "\n=== VLLMEmbeddingProvider __init__ ==="
rg -n "class VLLMEmbeddingProvider" chat-server/app/utils/providers/embedding_providers/vllm.py -A 10
echo -e "\n=== OpenAI provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/openai.py -A 10
echo -e "\n=== LiteLLM provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/litellm.py -A 10
echo -e "\n=== Local provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/local.py -A 10
echo -e "\n=== Custom provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/custom.py -A 10
echo -e "\n=== Portkey provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/portkey.py -A 10Repository: factly/gopie
Length of output: 3474
🏁 Script executed:
#!/bin/bash
# Check embedding provider usage and instantiation
echo "=== Check __init__.py for embedding providers ==="
cat -n chat-server/app/utils/providers/embedding_providers/__init__.py
echo -e "\n=== Search for VLLMEmbeddingProvider usage ==="
rg -r "VLLMEmbeddingProvider" chat-server -g '*.py' -B 2 -A 2
echo -e "\n=== Check provider manager or factory for fallback logic ==="
fd -t f '*.py' chat-server/app/utils/providers | head -20 | xargs -I {} bash -c 'echo "=== {} ===" && head -5 {}'
echo -e "\n=== Look for fallback or routing patterns ==="
rg -i "fallback|fallback_provider" chat-server -g '*.py' -A 2Repository: factly/gopie
Length of output: 14425
🏁 Script executed:
#!/bin/bash
# Check how embedding providers are instantiated/used
echo "=== Search for embedding provider instantiation ==="
rg "EmbeddingProvider\(" chat-server/app -g '*.py' -B 2 -A 2
echo -e "\n=== Check if there's a provider manager/factory ==="
fd -t f '*.py' chat-server/app/utils/providers | grep -E "(manager|factory|init)" | xargs -I {} bash -c 'echo "=== {} ===" && head -20 {}'
echo -e "\n=== Check __init__.py in providers parent ==="
cat -n chat-server/app/utils/providers/__init__.py 2>/dev/null || echo "File not found"
echo -e "\n=== Check for EmbeddingProvider in main app init ==="
rg "EmbeddingProvider" chat-server/app -g '*.py' | head -20Repository: factly/gopie
Length of output: 8631
🏁 Script executed:
#!/bin/bash
# Get complete model_provider.py context
echo "=== Full model_provider.py or relevant section ==="
wc -l chat-server/app/utils/model_registry/model_provider.py
head -80 chat-server/app/utils/model_registry/model_provider.py
echo -e "\n=== Check for fallback in get_embedding_provider ==="
rg -n "get_embedding_provider" chat-server/app/utils/model_registry/model_provider.py -A 20
echo -e "\n=== Check if metadata is used in VLLM get_embeddings_model ==="
cat -n chat-server/app/utils/providers/embedding_providers/vllm.pyRepository: factly/gopie
Length of output: 6287
Remove unused metadata parameter and implement fallback mechanism.
BaseEmbeddingProvider has no __init__ method, so super().__init__() is not applicable. However, VLLMEmbeddingProvider accepts metadata but never stores or uses it, unlike PortkeyEmbeddingProvider (which extracts user/trace_id for org-scoped routing) and LiteLLMEmbeddingProvider (which stores metadata for auth). Either remove the unused parameter for consistency, or document why it's needed.
Additionally, per coding guidelines, implement fallback mechanisms at the provider level or in get_embedding_provider() factory—currently there is no fallback strategy if the primary provider fails.
🤖 Prompt for AI Agents
In `@chat-server/app/utils/providers/embedding_providers/vllm.py` around lines 26
- 31, The VLLMEmbeddingProvider __init__ currently accepts but never stores
metadata and calls no super; either remove the unused metadata parameter from
VLLMEmbeddingProvider or persist it similarly to LiteLLMEmbeddingProvider (e.g.,
store self.metadata and extract org-scoped keys like user/trace_id as
PortkeyEmbeddingProvider does) so downstream methods can use it; additionally
implement a fallback strategy by updating either VLLMEmbeddingProvider (add a
guarded get_embedding call that catches failures and delegates to a configured
fallback provider) or enhance get_embedding_provider() to return a wrapper that
attempts the primary provider (VLLMEmbeddingProvider) and on exception switches
to a secondary provider, ensuring clear identifiers (VLLMEmbeddingProvider,
LiteLLMEmbeddingProvider, PortkeyEmbeddingProvider, get_embedding_provider) are
used to locate and wire the fallback logic.
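Beyond removing the unused parameter, the fallback suggested above could live in a small wrapper returned by get_embedding_provider(). The sketch below is hypothetical: it assumes providers expose an async get_embedding(text) coroutine, which is not confirmed by this thread.

```python
# Hypothetical sketch of a factory-level fallback. The class name and the
# get_embedding(text) coroutine are assumptions, not the project's actual API.
class FallbackEmbeddingProvider:
    def __init__(self, primary, secondary, logger=None):
        self.primary = primary
        self.secondary = secondary
        self.logger = logger

    async def get_embedding(self, text: str) -> list[float]:
        try:
            return await self.primary.get_embedding(text)
        except Exception as exc:  # deliberately broad so any primary failure falls back
            if self.logger:
                self.logger.warning(f"Primary embedding provider failed, falling back: {exc}")
            return await self.secondary.get_embedding(text)


# get_embedding_provider() could then return, e.g.:
# FallbackEmbeddingProvider(vllm_provider, litellm_provider, logger=custom_logger)
```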
Claude finished @paul-tharun's task —— View job

🔍 PR Review: Feat/e2e datasets metadata

Review Progress

🏗️ Architecture Review
This PR introduces significant architectural improvements with a well-structured approach to NL-to-SQL processing and multi-tenant organization support.

✅ Positive Changes
1. Clean NL-to-SQL Graph Architecture
2. Multi-Tenant Organization Support
3. SQL Injection Prevention

🚨 Critical Security Issues

1. Missing Organization ID in Semantic Search
The semantic search function calls:

    # VULNERABLE: No org_id passed
    results = await search_schemas(
        user_query=user_query,
        embeddings=embeddings_model,
        dataset_ids=dataset_ids,
        project_ids=project_ids,
        # Missing: org_id=org_id
    )

Fix Required: Extract org_id from config and pass it to search_schemas.

2. Division by Zero Risk

    # VULNERABLE: Will crash if estimated_size is 0
    sample_pct = (settings.TARGET_ROWS / estimated_size) * 100

Fix Required: Add zero-check before division.

3. Direct SQL Interpolation

    # POTENTIAL RISK: Direct interpolation without validation
    sample_data_query = f"SELECT * FROM {dataset_name} LIMIT 50"

While
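A rough sketch of the first two fixes follows. The search_schemas stub, the config metadata shape, and the fallback percentage are assumptions for illustration only; only the org_id pass-through and the zero-division guard reflect the requested changes.

```python
# Rough sketch of the guards above; the stub and the config shape are assumed.
async def search_schemas(user_query, embeddings, dataset_ids, project_ids, org_id=None):
    """Stand-in for the real Qdrant search; included only so the example runs."""
    return []


async def run_semantic_search(user_query, embeddings_model, dataset_ids, project_ids, config):
    org_id = config.get("metadata", {}).get("org_id")  # assumed config shape
    return await search_schemas(
        user_query=user_query,
        embeddings=embeddings_model,
        dataset_ids=dataset_ids,
        project_ids=project_ids,
        org_id=org_id,  # tenant scoping restored
    )


def compute_sample_pct(target_rows: int, estimated_size: int) -> float:
    # Guard against a zero or unknown row estimate before dividing.
    if estimated_size <= 0:
        return 100.0  # illustrative fallback: read the (apparently tiny) table in full
    return (target_rows / estimated_size) * 100
```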
| Category | Rating | Comments |
|---|---|---|
| Architecture | ⭐⭐⭐⭐⭐ | Excellent modular design |
| Security | ⭐⭐⭐⚪⚪ | Good foundations, critical gaps |
| Performance | ⭐⭐⭐⭐⚪ | Smart optimizations implemented |
| Testing | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking |
| Code Quality | ⭐⭐⭐⭐⚪ | Clean code, room for improvement |
Recommendation: This is a solid architectural improvement, but the security issue in semantic search must be fixed before merging to prevent cross-tenant data access.
Actionable comments posted: 18
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
chat-server/tests/e2e/utils/generate_app_cases.py (1)
290-309: ⚠️ Potential issue | 🔴 Critical
Complete the multi-dataset routing description and add "ai" → "assistant" role normalization.
Lines 290–309: The routing overview for identify_datasets omits the retry_semantic_search path. Update line 295 to:
• identify_datasets → routes: plan_query | route_response (no_datasets_found) | retry_semantic_search
Lines 401–420: The _normalize_message function converts "human" to "user" but does not convert "ai" to "assistant". Since LangChain's AIMessage.type is "ai" but provider APIs (OpenAI, etc.) require role: "assistant", add this mapping after line 405:
if role == "ai":
    role = "assistant"
These omissions will cause incomplete test-path coverage and generated test cases with invalid message roles that fail API compatibility checks.
chat-server/tests/unit/test_column_value_matching.py (2)
36-61:⚠️ Potential issue | 🟠 MajorBug:
captureddict is never populated, assertion will fail.The test declares
captured = {}on line 39 but never assignscaptured["query"]in the mock function. The assertion on line 61 will raise aKeyError.🐛 Proposed fix
`@pytest.mark.asyncio` async def test_find_similar_values_falls_back_to_levenshtein(monkeypatch): """Test fallback to Levenshtein when ILIKE returns empty results.""" calls = {"count": 0} captured = {} async def fake_execute_sql(query: str, org_id=None): calls["count"] += 1 + captured["query"] = query # Check for both DuckDB (levenshtein) and ClickHouse (levenshteinDistance) syntax if "levenshtein" in query.lower(): return [ {"name": "Alison"}, {"name": "Alicia"}, ] return []
73-83:⚠️ Potential issue | 🟡 MinorBug: Duplicate dictionary key
"error"in dummy logger.The dictionary literal has
"error"defined twice. The second definition silently overwrites the first. This appears to be a copy-paste error.🐛 Proposed fix
dummy_logger = type( "L", (), { "error": lambda *args, **kwargs: None, "debug": lambda *args, **kwargs: None, - "error": lambda *args, **kwargs: None, }, )()chat-server/tests/unit/test_vector_store.py (1)
48-77:⚠️ Potential issue | 🟠 MajorTest mocks
get_vector_storebut implementation usesget_async_client.The test mocks
QdrantSetup.get_vector_store, but examining the actualadd_document_to_vector_storeimplementation (invector_store.pylines 73-90), the function callsQdrantSetup.get_async_clientand performs a directupsert, notget_vector_store.aadd_documents. This test will pass but doesn't exercise the actual code path.🔧 Proposed fix - update test to match implementation
`@pytest.mark.asyncio` async def test_add_document_to_vector_store(self, mock_document): - mock_vector_store = AsyncMock() + mock_async_client = AsyncMock() with ( patch("app.services.qdrant.vector_store.QdrantSetup") as mock_qdrant_setup_class, patch("app.services.qdrant.vector_store.get_model_provider") as mock_get_model_provider, + patch("app.services.qdrant.vector_store.generate_sparse_vector") as mock_sparse, ): mock_model_provider = Mock() mock_embeddings = Mock() + mock_embeddings.embed_query.return_value = [0.1, 0.2, 0.3] mock_model_provider.get_embeddings_model.return_value = mock_embeddings mock_get_model_provider.return_value = mock_model_provider - mock_qdrant_setup_class.get_vector_store.return_value = mock_vector_store + mock_qdrant_setup_class.get_async_client = AsyncMock(return_value=mock_async_client) mock_qdrant_setup_class.get_document_id.return_value = "doc_id_123" + mock_sparse.return_value = Mock(indices=[1, 2], values=[0.5, 0.5]) await add_document_to_vector_store(mock_document) - mock_qdrant_setup_class.get_vector_store.assert_called_once_with( - embeddings=mock_embeddings, collection_name="dataset_collection" - ) mock_qdrant_setup_class.get_document_id.assert_called_once_with("proj1", "ds1") - mock_vector_store.aadd_documents.assert_called_once_with( - documents=[mock_document], ids=["doc_id_123"] - ) + mock_async_client.upsert.assert_called_once()chat-server/tests/unit/test_prompts.py (1)
107-168:⚠️ Potential issue | 🟠 MajorAdd
@pytest.mark.unitto mark these tests for unit-only test runs.The test file is missing the module-level pytest marker required for unit tests in
chat-server/tests/unit/. This causes these tests to be skipped when runningpytest -m unit.✅ Suggested fix
from unittest.mock import Mock, patch import pytest +pytestmark = pytest.mark.unit + from app.utils.langsmith.prompt_manager import PromptManager, get_prompt from app.workflow.prompts.prompt_selector import PromptSelectorchat-server/app/api/v1/routers/dataset_upload.py (1)
21-50:⚠️ Potential issue | 🟠 MajorGuard against missing
x-organization-idheader before Qdrant storage.
store_schema_in_qdrantandcreate_dataset_schemaboth requireorg_id: str, but the header is optional (str | None). PassingNonerisks type errors and tenant-mis-scoping. Make the header required or validate/reject requests without it.Suggested fix (header required)
async def upload_schema( payload: UploadSchemaRequest, x_organization_id: Annotated[str | None, Header()] = None, ): """ Processes and index dataset schema. @@ """ try: + if x_organization_id is None: + raise HTTPException( + status_code=status.HTTP_400_BAD_REQUEST, + detail="x-organization-id header is required", + ) project_id = payload.project_id dataset_id = payload.dataset_id dataset_details, project_details = await asyncio.gather(chat-server/app/services/gopie/generate_schema.py (1)
70-76:⚠️ Potential issue | 🟡 MinorMissing error handling for the GET request.
Unlike
execute_sqlwhich checksresponse.status != HTTPStatus.OK, the GET request here doesn't validate the response status before parsing JSON. A non-200 response could result in unexpected behavior or unclear error messages.🛡️ Proposed fix to add error handling
+ from http import HTTPStatus + async with await client.get(path) as response: + if response.status != HTTPStatus.OK: + error_data = await response.json() + raise Exception( + f"Failed to fetch summary for {dataset_name}: " + f"{error_data.get('error', 'Unknown error')}" + ) data = await response.json()chat-server/app/tool_utils/tools/list_datasets.py (2)
38-52:⚠️ Potential issue | 🟡 MinorAdd error handling for consistency with other functions.
This function lacks try/except handling unlike
get_dataset_names_from_project_ids(lines 17-28) andget_dataset_names_for_dataset_ids(lines 62-70). An unhandled exception here could propagate and cause issues in the calling code.🛡️ Proposed fix to add error handling
async def get_project_ids_for_datasets_ids( dataset_ids: list[str], org_id: Optional[str] = None, ) -> dict[str, str]: dataset_id_project_map = {} client = GopieClient(org_id=org_id) for dataset_id in dataset_ids: - path = f"/v1/api/datasets/{dataset_id}/project" - async with await client.get(path, ssl=False) as response: - if response.status == 200: - dataset_data = await response.json() - dataset_id_project_map[dataset_id] = dataset_data.get("project_id", "") - else: - logger.warning(f"Dataset {dataset_id} not found") + try: + path = f"/v1/api/datasets/{dataset_id}/project" + async with await client.get(path, ssl=False) as response: + if response.status == 200: + dataset_data = await response.json() + dataset_id_project_map[dataset_id] = dataset_data.get("project_id", "") + else: + logger.warning(f"Dataset {dataset_id} not found") + except Exception as e: + logger.exception(f"Error fetching project for dataset {dataset_id}: {e}") return dataset_id_project_map
55-76:⚠️ Potential issue | 🟡 MinorPotential data loss: only storing one dataset per project.
In
get_dataset_names_for_dataset_ids, when multiple datasets belong to the same project, only the last one is retained becauseproject_dataset_map[project_id]is overwritten on each iteration (line 68).🐛 Proposed fix to accumulate datasets per project
async def get_dataset_names_for_dataset_ids( dataset_project_ids_map: dict[str, str], org_id: Optional[str] = None, ) -> str: - project_dataset_map = {} + project_dataset_map: dict[str, list[str]] = {} client = GopieClient(org_id=org_id) for dataset_id, project_id in dataset_project_ids_map.items(): path = f"/v1/api/projects/{project_id}/datasets/{dataset_id}" async with await client.get(path, ssl=False) as response: if response.status == 200: dataset_data = await response.json() - project_dataset_map[project_id] = dataset_data.get("alias", "") + if project_id not in project_dataset_map: + project_dataset_map[project_id] = [] + project_dataset_map[project_id].append(dataset_data.get("alias", "")) else: logger.warning(f"Dataset {dataset_id} not found") list_of_datasets_str = "" - for project_id, dataset_name in project_dataset_map.items(): + for project_id, dataset_names in project_dataset_map.items(): list_of_datasets_str += f"Project {project_id}:\n" - list_of_datasets_str += f" - {dataset_name}\n" + for dataset_name in dataset_names: + list_of_datasets_str += f" - {dataset_name}\n" return list_of_datasets_strchat-server/app/workflow/graph/multi_dataset_graph/node/identify_datasets.py (2)
88-111:⚠️ Potential issue | 🟡 MinorIncorrect data type passed to
set_node_message.At line 91-94, a
setis being passed as the message argument instead of astr. The curly braces{...}create a set, not a dict. This should be a string.🐛 Proposed fix
query_result.set_node_message( "identify_datasets", - { - "No relevant datasets found by doing semantic search. This subquery is " - "not relevant to any datasets. Treating as conversational query." - }, + "No relevant datasets found by doing semantic search. This subquery is " + "not relevant to any datasets. Treating as conversational query.", )
197-218:⚠️ Potential issue | 🟡 MinorDocstring mentions "analyze_dataset" but returns "plan_query".
The docstring at line 202 still references
"analyze_dataset"as a possible return value, but the function now returns"plan_query"at line 218.📝 Proposed fix to update docstring
""" Route to the appropriate next node based on dataset identification results. Returns: - str: One of "no_datasets_found", "retry_semantic_search", or "analyze_dataset" + str: One of "no_datasets_found", "retry_semantic_search", or "plan_query" """
🤖 Fix all issues with AI agents
In `@chat-server/app/api/v1/routers/dataset_upload.py`:
- Around line 2-4: Remove the duplicate Header import (only import Header once)
and add validation for x_organization_id before calling store_schema_in_qdrant:
ensure x_organization_id (the header bound variable) is not None/empty and if
missing raise an HTTPException with status code 400; then pass the validated
org_id (str) into store_schema_in_qdrant to satisfy its required str parameter
(refer to Header import, x_organization_id, and store_schema_in_qdrant).
In `@chat-server/app/core/log.py`:
- Around line 14-17: The custom logger method exception in
chat-server/app/core/log.py currently calls self._logger.critical(...,
stack_info=True) which records the call stack but loses the actual exception
traceback; update the exception method to pass exc_info=True (and keep
stack_info=True if desired) to self._logger so the logged output includes the
exception traceback; modify the exception function (named exception) to call
self._logger with exc_info=True (and stack_info=True if you want both) and
preserve the existing behavior of raising when self.dev_mode is true.
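Following the item above, a minimal sketch of the exc_info change, assuming the wrapper keeps a stdlib logger on self._logger and a dev_mode flag; the class shape is illustrative rather than the project's actual logger.

```python
import logging


class CustomLogger:
    """Illustrative wrapper mirroring the exception() method described above."""

    def __init__(self, dev_mode: bool = False):
        self._logger = logging.getLogger("app")
        self.dev_mode = dev_mode

    def exception(self, msg: str) -> None:
        # exc_info=True attaches the active exception's traceback to the record;
        # stack_info=True alone only captures the call stack.
        self._logger.critical(msg, exc_info=True, stack_info=True)
        if self.dev_mode:
            raise  # re-raises the exception currently being handled
```

Called from inside an except block, the log record then carries the full traceback instead of only the call stack.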
In `@chat-server/app/models/data.py`:
- Around line 28-37: DatasetDetails.columns is now Optional and can be None;
update all consumers to guard against None before iterating or accessing it —
e.g., in regenerate_fuzzy_values_prompt.py where matching_schema.columns is used
in list/generator comprehensions and in dataset_info.py where schema.columns is
looped, replace direct iterations with a null-safe form (check
matching_schema.columns is not None or use a fallback like
(matching_schema.columns or [])). Also add null checks before using row_count in
calculations and before using file_path for file operations so you don't assume
they are present (guard expressions around row_count and file_path where used).
Ensure you update usages referencing DatasetDetails, matching_schema.columns,
schema.columns, row_count, and file_path accordingly.
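For illustration, a null-safe consumption pattern along the lines of the item above; the stub stands in for the real DatasetDetails model and its fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetDetailsStub:  # stand-in for the real DatasetDetails model
    columns: Optional[list[str]] = None
    row_count: Optional[int] = None


schema = DatasetDetailsStub()
for column in schema.columns or []:  # safe even when columns is None
    print(column)

rows = schema.row_count or 0  # guard before using row_count in arithmetic
```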
In `@chat-server/app/services/gopie/dataset_info.py`:
- Around line 34-43: Replace the re-raising pattern in the except blocks that
currently use "raise e" with a bare "raise" so the original traceback is
preserved; specifically update the except Exception as e blocks in
chat-server/app/services/gopie/dataset_info.py (the block that logs via
logger.exception("Error getting dataset info...") and the other similar block
later) to log with logger.exception(...) and then perform a bare raise rather
than "raise e".
In `@chat-server/app/services/qdrant/schema_search.py`:
- Around line 86-88: The async function search_schemas is calling the blocking
embeddings.embed_query which will block the event loop; change the call to the
async variant await embeddings.aembed_query(user_query) and assign its result
back to dense_vector, leaving generate_sparse_vector(user_query) as-is; ensure
the embeddings object supports aembed_query (as used elsewhere, e.g., health.py)
and update any nearby error handling or type expectations for dense_vector if
needed.
In `@chat-server/app/utils/olap/clickhouse.py`:
- Around line 13-19: The SQL builder currently injects table_name and
column_name directly (e.g., in get_estimated_size_query and the other builder
methods referenced), which permits SQL injection; update all methods that
interpolate identifiers to validate them against a strict identifier whitelist
or a conservative regex (e.g., only ASCII letters, numbers, and underscores,
starting with a letter) before using them, and then format identifiers using a
safe quoting strategy (e.g., wrap with ClickHouse backticks after validation and
escape any backticks found), ensuring you apply this change to
get_estimated_size_query and every builder method that inserts table_name or
column_name (the ones in the review ranges) so all identifier interpolation uses
the validated-and-quoted identifier rather than raw f-string insertion.
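A sketch of the whitelist-then-quote approach described above; the regex, helper name, and query shape are assumptions rather than the project's actual builder API.

```python
import re

_IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote_identifier(name: str) -> str:
    """Reject anything outside a conservative identifier whitelist, then backtick-quote."""
    if not _IDENTIFIER_RE.match(name):
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return f"`{name}`"


def get_estimated_size_query(table_name: str) -> str:
    # Identifier is validated and quoted before interpolation.
    return f"SELECT count() AS estimated_size FROM {quote_identifier(table_name)}"
```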
In `@chat-server/app/utils/olap/duckdb.py`:
- Around line 65-91: The two methods build_large_table_ilike_query and
build_large_table_levenshtein_query currently interpolate user inputs directly
into SQL and need the same sanitization as other query builders: stop inserting
raw value, table_name, and column_name into f-strings; instead use the existing
sanitization routine used elsewhere (e.g., the same escaping/quoting used by
build_sample_query or the project's SQL sanitizer) for table and column
identifiers and use parameterized placeholders or safely-escaped literal values
for the search term (escape single quotes and wrap properly) so that value,
table_name, and column_name cannot inject SQL; update both functions to call the
sanitizer for identifiers and to bind or escape the value consistently with
other query builders.
- Around line 38-63: The Levenshtein and ILIKE query builders
(build_levenshtein_query and build_ilike_query) directly interpolate the value
and identifiers into SQL, creating an injection risk; fix by
validating/whitelisting table_name and column_name (e.g., allow only
[A-Za-z_][A-Za-z0-9_]*), and sanitize or escape the value string (e.g., ensure
it's a str and escape single quotes) or better yet switch to DuckDB parameter
binding for the value; update these two methods to call the new
_validate_identifier and _sanitize_sql_string helpers (or use parameter
placeholders) before constructing the final SQL string.
- Around line 13-19: The get_estimated_size_query and other query-builder
methods currently interpolate table_name/column_name directly; to fix, add
parameterized query support to the execution layer (extend execute_sql to accept
a query plus params and use parameter binding) and update these methods (e.g.,
get_estimated_size_query) to return a parameterized SQL string with placeholders
and a corresponding params list instead of raw string interpolation; as an
immediate mitigation also validate identifier inputs (table_name, column_name)
in the class using a strict regex allowing only [A-Za-z0-9_] and raise on
invalid values before building the query so only safe identifiers are used until
execute_sql parameterization is implemented.
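A sketch of moving the search value into a bound parameter as suggested above; whether execute_sql accepts a params argument is an assumption, and the helper shown is illustrative rather than the project's builder.

```python
def build_ilike_query(table_name: str, column_name: str, value: str) -> tuple[str, list]:
    # Identifiers still need whitelisting (see the ClickHouse sketch above);
    # the user-supplied value is passed as a bound parameter, not inlined.
    query = (
        f'SELECT DISTINCT "{column_name}" FROM "{table_name}" '
        f'WHERE CAST("{column_name}" AS VARCHAR) ILIKE ? LIMIT 20'
    )
    return query, [f"%{value}%"]


# A hypothetical execute_sql(query, params) would then pass params through
# DuckDB's native binding, e.g. conn.execute(query, params).
```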
In
`@chat-server/app/workflow/graph/multi_dataset_graph/node/generate_subqueries.py`:
- Around line 18-67: _format_context_info currently only uses project_ids[0];
update it to iterate over all provided project_ids and append each project's
context and datasets to context_info: loop for pid in project_ids, call
get_project_info(pid, org_id=org_id) and get_project_schemas(pid, limit=50,
org_id=org_id), prepend each block with a project header (e.g., "PROJECT
CONTEXT: <pid> - <name>"), and for datasets slice the returned list to respect
max_datasets (or distribute max_datasets across projects if desired) before
formatting names/descriptions as the existing code does; keep the try/except
around per-project calls to continue on errors. Ensure you reference the same
function name (_format_context_info), and helper calls (get_project_info,
get_project_schemas) when making edits.
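A sketch of the multi-project loop described above. The helpers are stubbed so the example runs; their real signatures and return shapes in the repository may differ.

```python
# Stand-ins for the real helpers named in the item above.
async def get_project_info(project_id, org_id=None):
    return {"name": f"project-{project_id}"}


async def get_project_schemas(project_id, limit=50, org_id=None):
    return [{"name": "example_dataset", "description": ""}]


async def _format_context_info(project_ids, org_id=None, max_datasets=50):
    context_info = ""
    for pid in project_ids:
        try:
            project = await get_project_info(pid, org_id=org_id)
            datasets = await get_project_schemas(pid, limit=max_datasets, org_id=org_id)
        except Exception:
            continue  # skip unreadable projects, keep the rest
        context_info += f"PROJECT CONTEXT: {pid} - {project.get('name', '')}\n"
        for ds in datasets[:max_datasets]:
            context_info += f"  - {ds.get('name', '')}: {ds.get('description', '')}\n"
    return context_info
```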
In `@chat-server/app/workflow/graph/multi_dataset_graph/node/sql_agent.py`:
- Around line 82-84: The success message string in the IntermediateStep call
contains a typo ("Succesfully"); update the content passed to IntermediateStep
in sql_agent.py (the IntermediateStep(...) invocation) to read "Successfully
Completed query planning step in multidataset workflow." so the spelling is
corrected while preserving the rest of the message and punctuation.
In `@chat-server/app/workflow/graph/single_dataset_graph/node/prepare_data.py`:
- Around line 85-112: The code extracts org_id but doesn't pass it into the
schema lookup, which breaks tenant isolation; update the call in prepare_data.py
where get_schema_from_qdrant is invoked (currently
get_schema_from_qdrant(dataset_id=dataset_id)) to include the org_id parameter
(e.g., get_schema_from_qdrant(dataset_id=dataset_id, org_id=org_id)), and if
necessary update get_schema_from_qdrant's signature/implementation to accept and
use org_id for tenant-scoped schema resolution; keep dataset_id usage and
existing error handling intact.
In
`@chat-server/app/workflow/prompts/multi_dataset_prompts/analyze_query_prompt.py`:
- Around line 177-186: The human prompt path is using the raw
previous_sql_queries list (previous_sql_queries variable) which differs from the
formatted representation produced by format_analyze_query_input used elsewhere;
to fix it, normalize previous_sql_queries before formatting the
human_template_str by running the same formatting helper
(format_analyze_query_input) on previous_sql_queries (preserving the "None"
fallback) so human_template_str.format receives the consistent "[id:1] SELECT
..." style strings as other code paths expect; update the place where
human_template_str.format is called to pass the formatted_previous_sql_queries
instead of the raw list.
In `@chat-server/scripts/scrape_duckdb_docs.py`:
- Around line 5-6: The project imports httpx in scrape_duckdb_docs.py but
pyproject.toml lacks that dependency; update the pyproject.toml dependencies
section to include "httpx" (alongside the existing "beautifulsoup4") so CI and
production environments can install it—ensure the exact package name "httpx" is
added under the same dependencies list used for other runtime packages.
- Around line 16-23: The _fetch_page function currently catches all Exceptions
and uses print; change it to catch httpx.RequestError for transport-level
failures around await client.get(url, ...) and httpx.HTTPStatusError for non-2xx
responses raised by response.raise_for_status(), log errors via a module logger
(e.g., logger = logging.getLogger(__name__)) instead of print and include the
URL and exception details in the log (use logger.error or logger.exception), and
still return response.text on success or None on handled errors; keep the
function signature and return semantics the same.
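A sketch of the narrower error handling suggested above. httpx.RequestError and httpx.HTTPStatusError are the documented httpx exception types for transport and status failures; the signature shown (taking an explicit AsyncClient) is an assumption and may differ from the script's.

```python
import logging

import httpx

logger = logging.getLogger(__name__)


async def _fetch_page(client: httpx.AsyncClient, url: str) -> str | None:
    try:
        response = await client.get(url, follow_redirects=True)
        response.raise_for_status()
    except httpx.HTTPStatusError as exc:
        logger.error("Non-2xx response for %s: %s", url, exc)
        return None
    except httpx.RequestError as exc:
        logger.error("Request failed for %s: %s", url, exc)
        return None
    return response.text
```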
In `@chat-server/tests/e2e/utils/generate_app_cases.py`:
- Around line 401-420: The _normalize_message function currently escapes content
but not role, which can allow special chars in role to break generated Python
code; update _normalize_message to escape role the same way as content (replace
backslash with \\\\, double-quote with \\", newline with \\n, and remove \r) and
return the escaped role, and also normalize role values by mapping
"human"→"user" and "ai"→"assistant" (in addition to the existing "type"
fallback) so generated templates that inject role values use safe, normalized
strings.
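A sketch of escaping the role alongside the content and normalizing role names, per the item above; the dict-based message shape and return type are assumptions for illustration.

```python
def _normalize_message(message: dict) -> tuple[str, str]:
    """Return an (escaped_role, escaped_content) pair safe to inject into generated code."""

    def _escape(text: str) -> str:
        return (
            text.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "")
        )

    role = message.get("role") or message.get("type") or "user"
    role = {"human": "user", "ai": "assistant"}.get(role, role)
    content = message.get("content", "")
    return _escape(role), _escape(content)
```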
In `@chat-server/tests/unit/test_generate_sql_prompt_formatting.py`:
- Around line 31-33: Add the missing module-level pytest marker by declaring
pytestmark = pytest.mark.unit near the top of the test module (after imports and
any module docstring) so this file follows the same convention as other unit
tests; modify the TestGenerateSqlPromptFormatting test module to include that
pytestmark declaration (referencing the module that contains class
TestGenerateSqlPromptFormatting).
In `@chat-server/tests/unit/test_schema_search.py`:
- Around line 48-51: The test accesses query call kwargs with
call_args.kwargs["filter"] which can raise KeyError because search_schemas
places filter inside each Prefetch rather than as a top-level kwarg; update the
assertion to use call_args.kwargs.get("filter") (matching other tests) and keep
the existing assertion on prefetch length (mock_client.query_points,
search_schemas, Prefetch are the symbols to update/inspect).
🧹 Nitpick comments (23)
chat-server/tests/e2e/utils/test_utils.py (1)
202-219: Consider using aiohttp for async HTTP requests.
The function is declared as async but uses synchronous requests.post, which blocks the event loop. Since the codebase already uses aiohttp (as seen in gopie/client.py), consider migrating to aiohttp for consistency and true non-blocking behavior.
This is a pre-existing pattern, so deferring is acceptable.
♻️ Suggested approach using aiohttp
import aiohttp async def send_chat_request(test_case: Dict[str, Any], url: str) -> Dict[str, Any]: query_copy = test_case.copy() query_copy.pop("expected_result", None) try: headers = { "Content-Type": "application/json", "Accept": "text/event-stream", "X-organization-id": TestConfig.GOPIE_ORG_ID, "X-user-id": TestConfig.GOPIE_USER_ID, } async with aiohttp.ClientSession() as session: async with session.post( url, json={**query_copy, "chat_id": "", "trace_id": ""}, headers=headers, timeout=aiohttp.ClientTimeout(total=REQUEST_TIMEOUT), ) as response: response.raise_for_status() # Process streaming response...As per coding guidelines: "Use async/await for non-blocking I/O operations throughout the application".
chat-server/tests/unit/test_schema_search.py (1)
14-39: Consider extracting common mocks into fixtures.The mock setup for
mock_embeddings,mock_client, and the context manager patches are duplicated across all tests. Extracting these into pytest fixtures would reduce boilerplate and improve maintainability.Example fixture approach
`@pytest.fixture` def mock_embeddings(): mock = Mock() mock.embed_query.return_value = [0.1, 0.2, 0.3] return mock `@pytest.fixture` def mock_qdrant_client(): return AsyncMock() `@pytest.fixture` def patch_qdrant_and_sparse(mock_qdrant_client): with ( patch( "app.services.qdrant.schema_search.QdrantSetup.get_async_client", return_value=mock_qdrant_client, ), patch("app.services.qdrant.schema_search.generate_sparse_vector") as mock_sparse, ): mock_sparse.return_value = models.SparseVector(indices=[1, 2], values=[0.5, 0.5]) yield mock_qdrant_clientAlso applies to: 57-70, 87-100, 117-130, 147-160
chat-server/app/workflow/prompts/multi_dataset_prompts/analyze_query_prompt.py (1)
188-191: Inconsistent use oflangsmith_compatibleformatter.The template path (line 172) applies
langsmith_compatible()tosystem_content, but the direct message path (line 189) uses the rawsystem_content. This could lead to different prompt behavior depending on which path is used.Proposed fix for consistency
return [ - SystemMessage(content=system_content), + SystemMessage(content=langsmith_compatible(system_content)), HumanMessage(content=human_content), ]chat-server/scripts/scrape_duckdb_docs.py (1)
112-138: Replace progressUse the project’s JSON logger for progress and completion messages to keep observability consistent.
As per coding guidelines: Use JSON structured logging with context for observability.🔧 Suggested update
- print(f"Scraping [{len(self.visited_urls)}/{self.max_pages}]: {url}") + logger.info( + "duckdb_docs_scrape_progress", + extra={"current": len(self.visited_urls), "max_pages": self.max_pages, "url": url}, + ) @@ - print(f"Scraped {len(self.scraped_content)} pages") + logger.info("duckdb_docs_scrape_complete", extra={"pages": len(self.scraped_content)}) @@ - print(f"Saved {len(self.scraped_content)} pages to {output_file}") + logger.info( + "duckdb_docs_saved", + extra={"pages": len(self.scraped_content), "path": str(output_file)}, + )chat-server/scripts/index_duckdb_docs.py (1)
19-72: Use the structured logger instead of print for observability.
Replacing print statements with the app logger keeps output consistent and JSON-structured across the chat-server.As per coding guidelines: Use JSON structured logging with context for observability.
chat-server/app/services/qdrant/vector_store.py (1)
47-49: Prefernext(iter(...))over single-element slice for efficiency.Using
list(...)[0]creates an unnecessary intermediate list. The generator-based approach is more idiomatic and efficient.♻️ Proposed fix
def generate_sparse_vector(text: str) -> models.SparseVector: sparse_model = SparseModelManager.get_model() - result = list(sparse_model.embed([text]))[0] + result = next(iter(sparse_model.embed([text]))) return models.SparseVector(indices=result.indices.tolist(), values=result.values.tolist())chat-server/tests/unit/test_column_value_matching.py (1)
85-90: Consider using underscore prefix for unused unpacked variables.The
match_sourceanderror_messagevariables are unpacked but never used. Prefixing with underscore communicates intent and silences linter warnings.♻️ Proposed fix
- similar_values, match_source, error_message = await find_similar_values( + similar_values, _match_source, _error_message = await find_similar_values( "foo", "name", "t", estimated_size=1000 )chat-server/app/workflow/graph/single_dataset_graph/node/execute_sql.py (1)
52-57: Redundantorg_idextraction inside loop.
org_idis already extracted at lines 22-24 before entering the loop. The re-extraction on line 54 is redundant but harmless.♻️ Proposed fix - remove redundant extraction
for sql_info in query_result.single_dataset_query_result.sql_queries: try: - org_id = config.get("metadata", {}).get("org_id") result_data, truncated_result_data = await execute_sql_with_full_and_truncated( query=sql_info.sql_query, org_id=org_id )chat-server/app/workflow/prompts/multi_dataset_prompts/identify_datasets_prompt.py (1)
126-131: Avoid mutable default arguments.Using mutable default arguments (
[]) can lead to unexpected behavior if the function modifies them. While this function doesn't mutate the lists, usingNonewith a default is the safer pattern.♻️ Proposed fix
def format_identify_datasets_input( user_query: str, - relevant_dataset_schemas: list[DatasetSchema] = [], - semantic_searched_datasets: list[DatasetSchema] = [], + relevant_dataset_schemas: list[DatasetSchema] | None = None, + semantic_searched_datasets: list[DatasetSchema] | None = None, validation_result: str | None = None, ) -> dict: + if relevant_dataset_schemas is None: + relevant_dataset_schemas = [] + if semantic_searched_datasets is None: + semantic_searched_datasets = [] input_str = f"USER QUERY: {user_query}"chat-server/app/workflow/prompts/multi_dataset_prompts/generate_subqueries_prompt.py (1)
8-79: Document template placeholders and expected kwargs.
The prompt now relies on{context_info}and{user_input}, but the function doesn’t describe these variables. Add a short docstring or inline note listing required kwargs and template parameters.📌 Suggested docstring
def generate_subqueries_prompt(**kwargs) -> list | ChatPromptTemplate: + """ + Build the subquery-decomposition prompt. + + Template variables: + - context_info: Optional project/dataset context text. + - user_input: Raw user query. + """ prompt_template = kwargs.get("prompt_template", False)As per coding guidelines: Document prompt templates and parameters in prompt files.
chat-server/tests/scripts/reset_and_reindex_collection.py (1)
20-25: Remove unused# noqa: E402directives.
Ruff reports these as unused, so they can be dropped to keep lint clean.🧹 Suggested cleanup
-from qdrant_client import QdrantClient, models # noqa: E402 +from qdrant_client import QdrantClient, models @@ -from app.core.config import settings # noqa: E402 -from app.core.log import custom_logger as logger # noqa: E402 -from app.services.qdrant.qdrant_setup import QdrantSetup # noqa: E402 -from app.services.qdrant.vector_store import ( # noqa: E402 +from app.core.config import settings +from app.core.log import custom_logger as logger +from app.services.qdrant.qdrant_setup import QdrantSetup +from app.services.qdrant.vector_store import ( generate_sparse_vector, )chat-server/app/utils/olap/__init__.py (1)
8-15: Sort__all__to satisfy RUF022.Keeps export lists deterministic and aligns with Ruff’s expectation.
🧹 Suggested change
__all__ = [ - "OlapQueryBuilder", - "DuckDBQueryBuilder", "ClickHouseQueryBuilder", + "DuckDBQueryBuilder", + "OlapQueryBuilder", "get_query_builder", - "is_duckdb_family", "is_clickhouse_family", + "is_duckdb_family", ]chat-server/app/utils/olap/factory.py (1)
3-8: Use structuredcustom_loggerinstead of stdlib logging.Keeps logging consistent with the rest of the chat server’s structured logging approach.
🧾 Suggested change
-import logging - -from app.core.config import settings +from app.core.config import settings +from app.core.log import custom_logger as logger from app.utils.olap.base import OlapQueryBuilder - -logger = logging.getLogger(__name__)As per coding guidelines: Use JSON structured logging with context for observability.
chat-server/tests/unit/test_match_columns.py (1)
189-258: Consider adding assertion for_regenerate_fuzzy_valuescall arguments.The retry test verifies the outcome but doesn't assert that
_regenerate_fuzzy_valueswas called with expected arguments. This could help catch regressions if the function signature changes.💡 Proposed enhancement
with ( patch( "app.workflow.graph.sql_planner_graph.match_columns.match_column_values", new_callable=AsyncMock, return_value=mock_column_mappings, ), patch( "app.workflow.graph.sql_planner_graph.match_columns.validate_match_relevance", new_callable=AsyncMock, return_value=mock_column_mappings, ), patch( "app.workflow.graph.sql_planner_graph.match_columns._regenerate_fuzzy_values", new_callable=AsyncMock, return_value=mock_regenerate_response, - ), + ) as mock_regenerate, ): state = { "user_query": "show finance department", "messages": [], "multi_datasets_info": sample_multi_datasets_info, "match_columns_retry_count": 0, } result = await match_columns(state, mock_config) assert result["match_columns_retry_count"] == 1 assert "multi_datasets_info" in result assert len(result["messages"]) == 1 assert "Retrying" in result["messages"][0].content + mock_regenerate.assert_called_once()chat-server/app/workflow/graph/single_dataset_graph/node/sql_agent.py (2)
57-68: Consider using f-string conversion flag for cleaner formatting.The static analysis suggests using an explicit conversion flag. While minor, using
{e!s}is more explicit than{str(e)}.♻️ Proposed fix
except Exception as e: - error_message = f"Error in SQL planner agent: {str(e)}" + error_message = f"Error in SQL planner agent: {e!s}" single_dataset_result.error = error_message
63-68: Consider passing the `config` parameter to `adispatch_custom_event` for proper config propagation in LangGraph workflows.

While `config` is optional in the `adispatch_custom_event` signature, passing it explicitly when available (as it is in this function) ensures proper context propagation in LangGraph-based workflows. The `config: RunnableConfig` parameter is already available in scope on line 13, so update the call at lines 63-68 to:

```python
await adispatch_custom_event(
    "gopie-agent",
    {
        "content": "Error planning SQL queries",
    },
    config=config,
)
```

Apply the same change to other `adispatch_custom_event` calls in this file and similar files in the workflow directory.
26-29: Consider documenting the expected format of the `pct` parameter.

The `pct` parameter is typed as `str` across multiple methods (`build_sample_query`, `build_large_table_ilike_query`, `build_large_table_levenshtein_query`). Based on the usage in `generate_schema.py`, it's a formatted percentage string (e.g., `"0.001234"`). Adding this to the docstring would help implementers.

📝 Proposed docstring enhancement

```diff
 @abstractmethod
 def build_sample_query(self, table_name: str, pct: str, limit: int) -> str:
-    """Return query with sampling for large tables."""
+    """Return query with sampling for large tables.
+
+    Args:
+        table_name: Name of the table to sample
+        pct: Sampling percentage as a string (e.g., "0.001234")
+        limit: Maximum number of rows to return
+    """
     pass
```

Also applies to: 48-58
chat-server/app/workflow/graph/single_dataset_graph/types.py (1)
36-36: Consider using more specific types for list fields.

The `prev_sql_queries`, `sql_queries`, and `explanations` fields are typed as `list | None` or `list[str] | None`. For better type safety and IDE support, consider using explicit element types.

💡 Proposed type refinement

```diff
 class State(TypedDict):
     messages: Annotated[list[BaseMessage], add_messages]
     dataset_id: str | None
     user_query: str | None
     query_result: QueryResult
     dataset_info: SingleDatasetInfo
-    prev_sql_queries: list | None
+    prev_sql_queries: list[str] | None
     validation_result: str | None
     recommendation: str
     sql_queries: list[str] | None
     explanations: list[str] | None
```

Also applies to: 39-40
chat-server/app/tool_utils/tools/list_datasets.py (1)
79-85: Avoid mutable default arguments.

Using mutable default arguments (`list[str] = []`) is a known Python anti-pattern that can cause unexpected behavior. Additionally, `config: RunnableConfig = None` should use `Optional` for type correctness.

♻️ Proposed fix

```diff
 @tool
 async def get_all_datasets(
     status_message: str = "",
-    project_ids: list[str] = [],
-    dataset_ids: list[str] = [],
-    config: RunnableConfig = None,
+    project_ids: Optional[list[str]] = None,
+    dataset_ids: Optional[list[str]] = None,
+    config: Optional[RunnableConfig] = None,
 ) -> str:
     """
     ...
     """
-    if not project_ids and not dataset_ids:
+    if not project_ids:
+        project_ids = []
+    if not dataset_ids:
+        dataset_ids = []
+
+    if not project_ids and not dataset_ids:
```
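For illustration, a standalone sketch (not from the PR) of how a mutable default leaks state across calls:

```python
def append_bad(item: str, items: list[str] = []) -> list[str]:
    # The same list object is reused by every call that omits `items`
    items.append(item)
    return items


def append_good(item: str, items: list[str] | None = None) -> list[str]:
    # A fresh list is created per call when the caller passes nothing
    if items is None:
        items = []
    items.append(item)
    return items


print(append_bad("a"), append_bad("b"))    # ['a', 'b'] ['a', 'b'] – shared state
print(append_good("a"), append_good("b"))  # ['a'] ['b']
```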
chat-server/app/workflow/graph/sql_planner_graph/generate_sql.py (1)

86-130: Condition check at line 90 could be simplified.

The check `len(sql_queries) == 0` is more idiomatically written as `not sql_queries` in Python.

♻️ Minor style improvement

```diff
-    if non_sql_response and len(sql_queries) == 0:
+    if non_sql_response and not sql_queries:
```

chat-server/app/workflow/prompts/generate_sql_prompt.py (2)
20-134: Add prompt/parameter documentation for this prompt module.

There's no docstring describing the template's purpose, required kwargs, and expected input sections. Add brief module/function docs so the prompt contract is explicit and maintainable.

As per coding guidelines: Document prompt templates and parameters in prompt files.

📝 Suggested docstrings

```diff
+"""Prompt templates for SQL generation and input formatting."""
+
 def generate_sql_prompt(**kwargs) -> list[BaseMessage] | ChatPromptTemplate:
+    """Build the SQL-generation prompt.
+
+    Args:
+        prompt_template: If True, return ChatPromptTemplate.
+        input: Preformatted input string from format_generate_sql_input().
+    """
     prompt_template = kwargs.get("prompt_template", False)
     input_content = kwargs.get("input", "")
@@
 def format_generate_sql_input(
@@
-) -> dict:
+) -> dict:
+    """Format user query + dataset context into a single input string."""
```
124-134: Avoid unused kwargs in `format_generate_sql_input`.

`**kwargs` is unused (ruff ARG001). Either remove it or rename to `**_kwargs` to explicitly mark it as unused.

♻️ Minimal lint fix

```diff
 def format_generate_sql_input(
@@
-    **kwargs,
+    **_kwargs,
 ) -> dict:
```
30-74: Reuse the shared CSV conversion helper to avoid drift.
`convert_rows_to_csv` already exists in `app.utils.csv_utils`. Duplicating logic here risks divergence over time; prefer importing the shared helper.

♻️ Example delegation

```diff
 def convert_rows_to_csv(rows: list[dict]) -> str:
-    """
-    Convert a list of dictionaries into a CSV-formatted string with special handling for certain cell values.
-    ...
-    """
-    if not rows:
-        return ""
-
-    output = io.StringIO()
-    writer = csv.writer(output)
-    ...
-    return output.getvalue()
+    """Delegate to the shared CSV formatter for consistent behavior."""
+    from app.utils.csv_utils import convert_rows_to_csv as _convert_rows_to_csv
+    return _convert_rows_to_csv(rows)
```
```diff
 def exception(self, msg, *args, **kwargs):
-    self._logger.exception(msg, *args, stack_info=True, **kwargs)
+    self._logger.critical(msg, *args, stack_info=True, **kwargs)
     if self.dev_mode:
         raise
```
Loss of exception traceback information.
The change from logger.exception() to logger.critical() with stack_info=True loses the actual exception traceback:
- `stack_info=True` logs the call stack (where the log statement was invoked)
- `exc_info=True` (used internally by `logger.exception()`) logs the exception traceback (the actual error and its origin)
This means when an exception occurs, you'll see where logger.exception() was called, but not the actual exception details and traceback that caused the error.
🐛 Proposed fix to preserve exception traceback
```diff
 def exception(self, msg, *args, **kwargs):
-    self._logger.critical(msg, *args, stack_info=True, **kwargs)
+    self._logger.critical(msg, *args, exc_info=True, **kwargs)
     if self.dev_mode:
         raise
```

If you need both the call stack and exception info, use `exc_info=True, stack_info=True`.
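A standalone sketch (not from the repo) showing what each flag captures:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")

try:
    1 / 0
except ZeroDivisionError:
    # exc_info=True attaches the exception traceback (ZeroDivisionError and where it was raised)
    logger.critical("with exc_info", exc_info=True)
    # stack_info=True only attaches the call stack of this logging call, not the exception
    logger.critical("with stack_info", stack_info=True)
```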
🤖 Prompt for AI Agents
In `@chat-server/app/core/log.py` around lines 14 - 17, The custom logger method
exception in chat-server/app/core/log.py currently calls
self._logger.critical(..., stack_info=True) which records the call stack but
loses the actual exception traceback; update the exception method to pass
exc_info=True (and keep stack_info=True if desired) to self._logger so the
logged output includes the exception traceback; modify the exception function
(named exception) to call self._logger with exc_info=True (and stack_info=True
if you want both) and preserve the existing behavior of raising when
self.dev_mode is true.
```python
    row_count: Optional[int] = None
    columns: Optional[list[ColumnDetails]] = None
    size: Optional[int] = None
    file_path: Optional[str] = None
    created_at: Optional[str] = None
    updated_at: Optional[str] = None
    created_by: Optional[str] = None
    updated_by: Optional[str] = None
    custom_prompt: Optional[str] = None
    sql_query: Optional[str] = None
```
🧩 Analysis chain
🏁 Script executed:
# First, let's find where DatasetDetails is defined and used
rg -n "class DatasetDetails|DatasetDetails" --type py chat-server/ | head -20Repository: factly/gopie
Length of output: 769
🏁 Script executed:
# Find all accesses to the optional fields mentioned in the review
rg -n "\.columns\s*[^=]|\.row_count\s*[^=]|\.size\s*[^=]|\.file_path\s*[^=]" --type py chat-server/app/ -A2 -B2 | head -50Repository: factly/gopie
Length of output: 5588
🏁 Script executed:
# Look for iteration or direct access patterns on these fields without null checks
rg -n "for .* in .*\.columns|\.columns\[|\w+\.row_count \+|\w+\.row_count -" --type py chat-server/app/Repository: factly/gopie
Length of output: 1470
🏁 Script executed:
# Check the context around the unsafe column iteration in regenerate_fuzzy_values_prompt.py
sed -n '100,120p' chat-server/app/workflow/prompts/multi_dataset_prompts/regenerate_fuzzy_values_prompt.pyRepository: factly/gopie
Length of output: 1060
🏁 Script executed:
# Check the context in dataset_info.py around line 135
sed -n '130,140p' chat-server/app/services/gopie/dataset_info.pyRepository: factly/gopie
Length of output: 583
🏁 Script executed:
# Let's look at the full DatasetDetails model definition to confirm the field types
sed -n '20,50p' chat-server/app/models/data.pyRepository: factly/gopie
Length of output: 890
Fix unsafe access to optional columns field in DatasetDetails consumers.
The columns: Optional[list[ColumnDetails]] field can now be None, but multiple locations iterate over it without null checks:
- `chat-server/app/workflow/prompts/multi_dataset_prompts/regenerate_fuzzy_values_prompt.py:109, 113` - List comprehension and generator expression iterate over `matching_schema.columns` without checking for `None`
- `chat-server/app/services/gopie/dataset_info.py:135` - Direct for-loop over `schema.columns` without null guard
Add null checks before any iteration or access to columns. Similarly, verify that row_count usage in calculations and file_path access also handle None values.
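A minimal sketch of the guard pattern, using a simplified stand-in model rather than the real `DatasetDetails`:

```python
from typing import Optional

from pydantic import BaseModel


class ColumnDetails(BaseModel):
    column_name: str


class DatasetDetails(BaseModel):
    columns: Optional[list[ColumnDetails]] = None
    row_count: Optional[int] = None


def column_names(details: DatasetDetails) -> list[str]:
    # `or []` keeps iteration safe when columns is None
    return [c.column_name for c in (details.columns or [])]


print(column_names(DatasetDetails()))                                            # []
print(column_names(DatasetDetails(columns=[ColumnDetails(column_name="id")])))   # ['id']
```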
🤖 Prompt for AI Agents
In `@chat-server/app/models/data.py` around lines 28 - 37, DatasetDetails.columns
is now Optional and can be None; update all consumers to guard against None
before iterating or accessing it — e.g., in regenerate_fuzzy_values_prompt.py
where matching_schema.columns is used in list/generator comprehensions and in
dataset_info.py where schema.columns is looped, replace direct iterations with a
null-safe form (check matching_schema.columns is not None or use a fallback like
(matching_schema.columns or [])). Also add null checks before using row_count in
calculations and before using file_path for file operations so you don't assume
they are present (guard expressions around row_count and file_path where used).
Ensure you update usages referencing DatasetDetails, matching_schema.columns,
schema.columns, row_count, and file_path accordingly.
```diff
     try:
-        async with http_session.get(url, headers=headers) as response:
-            data = await response.json()
-            return DatasetDetails(**data)
+        async with await client.get(path) as response:
+            response_data = await response.json()
+            data = response_data["data"] if is_view else response_data
+            if is_view and data.get("sql_query"):
+                data["description"] = f"{data['description']}\n\nSQL Query:\n{data['sql_query']}"
+            return DatasetDetails(**data)
     except Exception as e:
         logger.exception(f"Error getting dataset info: {e!s}")
         raise e
```
🧩 Analysis chain
🌐 Web query:
Python raise vs raise e exception traceback difference
💡 Result:
Inside an except … as e: block:
- `raise` (bare raise) re-raises the currently handled ("active") exception without changing its traceback (so you keep the original stack frames showing where it first failed). This is the intended re-raise form. [1]
- `raise e` raises the exception object `e` via the "raise an exception object" form of `raise` (i.e., it's treated like raising an explicit exception expression). Practically, this changes what traceback you see by making the re-throw point (`raise e`) the reported raise location (often obscuring the original failure site compared to bare `raise`). [1][2]

If you're converting/wrapping exceptions:

- Prefer `raise NewError(...) from e` to keep a clear chain (`__cause__`) from the original to the new exception. [1][3]
- Use `raise NewError(...) from None` to suppress the original context in the default traceback output. [3]
Sources:
[1] Python Language Reference – raise statement (re-raise vs raising an explicit expression) https://docs.python.org/3/reference/simple_stmts.html#the-raise-statement
[2] Discussion/examples of traceback differences between raise and raise e https://stackoverflow.com/questions/36153805/difference-between-raise-and-raise-e
[3] PEP 409 / exception context suppression (raise ... from None) https://peps.python.org/pep-0409/
🏁 Script executed:
# Check if the file exists and examine the actual code at lines 34-43 and 62-68
if [ -f "chat-server/app/services/gopie/dataset_info.py" ]; then
echo "=== Lines 34-43 ==="
sed -n '34,43p' chat-server/app/services/gopie/dataset_info.py
echo ""
echo "=== Lines 62-68 ==="
sed -n '62,68p' chat-server/app/services/gopie/dataset_info.py
else
echo "File not found at expected path"
# Try to find it
find . -name "dataset_info.py" -type f 2>/dev/null
fiRepository: factly/gopie
Length of output: 810
Use bare raise to preserve the original exception traceback.
Inside the except blocks at lines 34–43 and 62–68, raise e changes the reported raise location and obscures the original traceback. Since logger.exception() already captures the full exception details, use bare raise instead to maintain the complete call stack for debugging and error diagnostics.
🔧 Suggested fix
```diff
     except Exception as e:
         logger.exception(f"Error getting dataset info: {e!s}")
-        raise e
+        raise
```

Also applies to: 62-68
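A standalone sketch (not from the repo) of why bare `raise` keeps the original failure site:

```python
def parse(value: str) -> int:
    return int(value)  # the ValueError for bad input originates here


def load(value: str) -> int:
    try:
        return parse(value)
    except ValueError:
        print("handling/logging the error here")
        raise  # re-raises the active exception; the traceback still points at parse()


# Running this crashes with a traceback whose last frame is parse(), not the raise line.
load("not-a-number")
```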
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 43-43: Use raise without specifying exception name
Remove exception name
(TRY201)
🤖 Prompt for AI Agents
In `@chat-server/app/services/gopie/dataset_info.py` around lines 34 - 43, Replace
the re-raising pattern in the except blocks that currently use "raise e" with a
bare "raise" so the original traceback is preserved; specifically update the
except Exception as e blocks in chat-server/app/services/gopie/dataset_info.py
(the block that logs via logger.exception("Error getting dataset info...") and
the other similar block later) to log with logger.exception(...) and then
perform a bare raise rather than "raise e".
```python
dense_vector = embeddings.embed_query(user_query)
sparse_vector = generate_sparse_vector(user_query)
```
🧩 Analysis chain
🏁 Script executed:
find . -name "schema_search.py" -type f 2>/dev/null | head -20Repository: factly/gopie
Length of output: 106
🏁 Script executed:
# Check for requirements files to determine langchain version
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | head -20Repository: factly/gopie
Length of output: 153
🏁 Script executed:
# Also check the actual file content around lines 86-88
fd schema_search.py --type f | head -1 | xargs cat -n 2>/dev/null | sed -n '75,100p'Repository: factly/gopie
Length of output: 1155
🏁 Script executed:
# Get the function signature and broader context
cat -n ./chat-server/app/services/qdrant/schema_search.py | sed -n '60,90p'Repository: factly/gopie
Length of output: 1311
🏁 Script executed:
# Check the pyproject.toml for langchain/langchain-openai version
cat ./chat-server/pyproject.toml | head -100Repository: factly/gopie
Length of output: 1748
🏁 Script executed:
# Search for aembed_query usage in the codebase to see if it's already used elsewhere
rg -i "aembed_query" --type pyRepository: factly/gopie
Length of output: 650
🌐 Web query:
langchain-openai OpenAIEmbeddings aembed_query async method version 0.3.9
💡 Result:
In langchain-openai==0.3.9, OpenAIEmbeddings includes an async aembed_query method.
- Signature: `async aembed_query(text: str, **kwargs) -> list[float]` (returns a single embedding vector for one input string) [1].
- Typical use: [1]

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vec = await embeddings.aembed_query("The meaning of life is 42")
```
Implementation-wise, aembed_query is a thin wrapper that calls aembed_documents([text]) and returns the first result [2].
Sources: LangChain OpenAIEmbeddings reference docs [1], LangChain source module showing aembed_query implementation [2].
🏁 Script executed:
# Check if there are other async embedding calls in the codebase as reference
rg "await.*embed" --type py -A 2 -B 2Repository: factly/gopie
Length of output: 3850
Use the async aembed_query method to avoid blocking the event loop. The search_schemas function is async (line 68), but embed_query is a synchronous call that will block at line 86. Use await embeddings.aembed_query(user_query) instead. The method is available in langchain-openai ≥ 0.3.9 and is already used elsewhere in the codebase (e.g., health.py).
Suggested fix
```diff
-    dense_vector = embeddings.embed_query(user_query)
+    dense_vector = await embeddings.aembed_query(user_query)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-dense_vector = embeddings.embed_query(user_query)
+dense_vector = await embeddings.aembed_query(user_query)
 sparse_vector = generate_sparse_vector(user_query)
```
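To make the concern concrete, a generic asyncio sketch (unrelated to LangChain or this repo) showing how a synchronous call inside a coroutine stalls other tasks:

```python
import asyncio
import time


async def blocking_worker() -> None:
    time.sleep(1)  # synchronous call: the event loop cannot run anything else meanwhile


async def async_worker() -> None:
    await asyncio.sleep(1)  # yields control, so other coroutines keep running


async def heartbeat() -> None:
    for _ in range(5):
        print("tick")
        await asyncio.sleep(0.25)


async def main() -> None:
    # Swap async_worker for blocking_worker to see the heartbeat stall for a full second.
    await asyncio.gather(heartbeat(), async_worker())


asyncio.run(main())
```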
🤖 Prompt for AI Agents
In `@chat-server/app/services/qdrant/schema_search.py` around lines 86 - 88, The
async function search_schemas is calling the blocking embeddings.embed_query
which will block the event loop; change the call to the async variant await
embeddings.aembed_query(user_query) and assign its result back to dense_vector,
leaving generate_sparse_vector(user_query) as-is; ensure the embeddings object
supports aembed_query (as used elsewhere, e.g., health.py) and update any nearby
error handling or type expectations for dense_vector if needed.
```python
import httpx
from bs4 import BeautifulSoup
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify httpx is declared in dependency manifests.
set -euo pipefail
deps=$(fd -a -t f 'pyproject.toml|poetry.lock|Pipfile|setup.cfg|requirements.*|constraints.*' || true)
if [ -z "$deps" ]; then
echo "No dependency manifests found."
exit 0
fi
echo "=== Found dependency manifest files ==="
echo "$deps"
echo
echo "=== Checking for httpx declaration ==="
rg -n 'httpx' $deps || echo "httpx not found in dependency manifests"Repository: factly/gopie
Length of output: 324
🏁 Script executed:
#!/bin/bash
# Check for BeautifulSoup/beautifulsoup4 in dependency manifests
rg -n 'beautifulsoup|bs4' /home/jailuser/git/chat-server/pyproject.toml /home/jailuser/git/pulumi/requirements.txt /home/jailuser/git/scripts/starter-project/requirements.txt 2>/dev/null || echo "beautifulsoup4/bs4 not found in dependency manifests"Repository: factly/gopie
Length of output: 132
Add httpx to chat-server/pyproject.toml dependencies.
The script imports httpx at line 5 but it is not declared in the project's dependency manifest. This will cause an ImportError in CI and production environments. Add httpx to the dependencies section of pyproject.toml alongside the existing beautifulsoup4 declaration.
🤖 Prompt for AI Agents
In `@chat-server/scripts/scrape_duckdb_docs.py` around lines 5 - 6, The project
imports httpx in scrape_duckdb_docs.py but pyproject.toml lacks that dependency;
update the pyproject.toml dependencies section to include "httpx" (alongside the
existing "beautifulsoup4") so CI and production environments can install
it—ensure the exact package name "httpx" is added under the same dependencies
list used for other runtime packages.
```python
async def _fetch_page(self, client: httpx.AsyncClient, url: str) -> str | None:
    try:
        response = await client.get(url, timeout=30.0)
        response.raise_for_status()
        return response.text
    except Exception as e:
        print(f"Failed to fetch {url}: {e}")
        return None
```
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and examine the relevant section
find . -name "scrape_duckdb_docs.py" -type fRepository: factly/gopie
Length of output: 99
🏁 Script executed:
# Look for logging configuration and patterns in the chat-server directory
find chat-server -type f -name "*.py" | head -20Repository: factly/gopie
Length of output: 792
🏁 Script executed:
# Search for logging imports and usage patterns in the codebase
rg "import logging" chat-server/ -t py -cRepository: factly/gopie
Length of output: 125
🏁 Script executed:
# Search for JSON structured logging patterns
rg "json.*log|structlog|pydantic.*log" chat-server/ -t py -A 2Repository: factly/gopie
Length of output: 38
🏁 Script executed:
# Examine the logging configuration
cat -n chat-server/app/core/log.pyRepository: factly/gopie
Length of output: 1941
🏁 Script executed:
# Examine the actual file content
cat -n chat-server/scripts/scrape_duckdb_docs.pyRepository: factly/gopie
Length of output: 6154
🏁 Script executed:
# Look for other examples of error handling in the codebase
rg "httpx\." chat-server/ -t py -B 2 -A 5 | head -60Repository: factly/gopie
Length of output: 2021
🏁 Script executed:
# Check for other error handling patterns
rg "except.*Error|except.*Exception" chat-server/ -t py -B 2 -A 3 | head -80Repository: factly/gopie
Length of output: 6095
🏁 Script executed:
# Check if scripts are expected to follow the same logging standards
rg "from app.core.log|CustomLogger" chat-server/scripts/ -t pyRepository: factly/gopie
Length of output: 38
🏁 Script executed:
# Look for other scripts to see their logging patterns
ls -la chat-server/scripts/Repository: factly/gopie
Length of output: 308
🏁 Script executed:
# Verify if there's JSON logging anywhere in the codebase
rg "json.*format|JSONFormatter|structlog" chat-server/ -t pyRepository: factly/gopie
Length of output: 38
🏁 Script executed:
# Verify httpx exception types exist
python3 << 'EOF'
import httpx
print("HTTPStatusError:", hasattr(httpx, 'HTTPStatusError'))
print("RequestError:", hasattr(httpx, 'RequestError'))
print("HTTPError:", hasattr(httpx, 'HTTPError'))
EOFRepository: factly/gopie
Length of output: 112
Avoid broad Exception + print; use proper error handling with specific httpx exceptions and logging.
Catching all exceptions hides real bugs and masks errors. Handle httpx.HTTPStatusError (status codes) and httpx.RequestError (transport errors) explicitly, and emit logs instead of print statements. This aligns with the error handling and logging requirements in coding guidelines.
🔧 Proposed fix
```diff
 import asyncio
 import json
+import logging
 from pathlib import Path
 import httpx
 from bs4 import BeautifulSoup
+logger = logging.getLogger(__name__)
+
 class DuckDBDocsScraper:
     async def _fetch_page(self, client: httpx.AsyncClient, url: str) -> str | None:
         try:
             response = await client.get(url, timeout=30.0)
             response.raise_for_status()
             return response.text
-        except Exception as e:
-            print(f"Failed to fetch {url}: {e}")
+        except httpx.HTTPStatusError as e:
+            logger.warning(
+                f"Failed to fetch {url}: HTTP {e.response.status_code}",
+                extra={"url": url, "status_code": e.response.status_code},
+            )
+            return None
+        except httpx.RequestError as e:
+            logger.error(
+                f"Failed to fetch {url}: {e}",
+                extra={"url": url, "error": str(e)},
+            )
             return None
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 20-20: Consider moving this statement to an else block
(TRY300)
[warning] 21-21: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@chat-server/scripts/scrape_duckdb_docs.py` around lines 16 - 23, The
_fetch_page function currently catches all Exceptions and uses print; change it
to catch httpx.RequestError for transport-level failures around await
client.get(url, ...) and httpx.HTTPStatusError for non-2xx responses raised by
response.raise_for_status(), log errors via a module logger (e.g., logger =
logging.getLogger(__name__)) instead of print and include the URL and exception
details in the log (use logger.error or logger.exception), and still return
response.text on success or None on handled errors; keep the function signature
and return semantics the same.
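For reference, a self-contained sketch of that exception split using httpx's public exception types (the URL is only illustrative):

```python
import asyncio

import httpx


async def fetch(url: str) -> str | None:
    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(url, timeout=5.0)
            response.raise_for_status()
            return response.text
        except httpx.HTTPStatusError as exc:
            # The server responded, but with a 4xx/5xx status
            print(f"HTTP {exc.response.status_code} for {url}")
        except httpx.RequestError as exc:
            # Transport-level failure: DNS, connection, timeout, ...
            print(f"request failed for {url}: {exc}")
        return None


asyncio.run(fetch("https://example.com/does-not-exist"))
```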
```python
def _normalize_message(self, message: Any) -> tuple[str, str]:
    """Extract role and content from a message dict; handle missing/alternate keys."""
    if isinstance(message, dict):
        role = message.get("role") or message.get("type") or "user"
        if role == "human":
            role = "user"
        content = (
            message.get("content")
            or message.get("text")
            or message.get("message")
            or ""
        )
        content = str(content) if content is not None else ""
    else:
        role = "user"
        content = str(message) if message is not None else ""
    content_escaped = (content or "").replace("\\", "\\\\").replace(
        '"', '\\"'
    ).replace("\n", "\\n").replace("\r", "")
    return role, content_escaped
```
🧩 Analysis chain
🌐 Web query:
LangChain message role type values ai human system tool OpenAI mapping
💡 Result:
LangChain message role/type values (core)
LangChain’s main message classes (Python/JS) correspond to these conceptual roles: system, user, assistant, tool; plus a legacy function role for OpenAI’s older function-calling API. [1], [2], [3]
- `SystemMessage` → role system [2], [3]
- `HumanMessage` → role user [2], [3]
- `AIMessage` (and streaming `AIMessageChunk`) → role assistant [2]
- `ToolMessage` → role tool [2], [3]
- `FunctionMessage` (legacy) → role function (legacy) [2], [3]
OpenAI role mapping (what LangChain converts to)
When LangChain converts messages to OpenAI Chat Completions message params, it maps to OpenAI-style roles: user, assistant, system, tool, function, developer. [4], [5]
Typical mapping:
- `HumanMessage` → `{"role":"user", ...}` [2], [3]
- `AIMessage` → `{"role":"assistant", ...}` (and tool calls are carried on the assistant message) [6]
- `SystemMessage` → `{"role":"system", ...}` [6]
- `ToolMessage` → `{"role":"tool", ...}` (includes `tool_call_id`; may include `name`) [3], [6]
- `FunctionMessage` (legacy) → `{"role":"function", ...}` [2], [3]
Note on developer: LangChain’s OpenAI adapter mentions model-specific mappings such as converting system → developer for certain “reasoning models,” and generally supports the developer role in its conversion layer. [4], [5]
Built-in converters
- Python: `convert_to_openai_messages(...)` shows the concrete OpenAI dicts produced (including assistant `tool_calls` and `tool` messages). [6]
- JS (`@langchain/openai`): `convertMessagesToCompletionsMessageParams` / `convertStandardContentMessageToCompletionsMessage` document the role mapping behavior (including `developer`). [4], [5]
Sources: LangChain concepts/docs and reference pages for message roles/types and OpenAI conversion utilities. [1]-[6]
🏁 Script executed:
# First, let's examine the actual file to understand the context
cat -n chat-server/tests/e2e/utils/generate_app_cases.py | sed -n '380,430p'Repository: factly/gopie
Length of output: 2662
🏁 Script executed:
# Search for how messages are used in this file and if role normalization matters
rg -n "role" chat-server/tests/e2e/utils/generate_app_cases.py | head -20Repository: factly/gopie
Length of output: 838
🏁 Script executed:
# Check if there are any tests or usages that show what roles are expected
rg -n "_normalize_message" chat-server/tests/e2e/utils/generate_app_cases.py -A 5 -B 5Repository: factly/gopie
Length of output: 1575
🏁 Script executed:
# Search for OpenAI API schema or message format requirements in the codebase
rg -n "OpenAI\|openai" chat-server/tests/e2e/utils/generate_app_cases.pyRepository: factly/gopie
Length of output: 38
🏁 Script executed:
# Search for where messages are created or come from
rg -n "messages\s*=" chat-server/tests/e2e/utils/generate_app_cases.py | head -20Repository: factly/gopie
Length of output: 38
🏁 Script executed:
# Look for any place where "ai" role is created
rg -n '"ai"|\'ai\'' chat-server/tests/e2e/utils/generate_app_cases.pyRepository: factly/gopie
Length of output: 124
🏁 Script executed:
# Check the TestCasesList and test case structure
rg -n "class.*TestCase|messages.*:" chat-server/tests/e2e/utils/generate_app_cases.py -B 2 -A 5Repository: factly/gopie
Length of output: 3974
🏁 Script executed:
# Search for LangChain imports and usage
rg -n "langchain|LangChain" chat-server/tests/e2e/utils/generate_app_cases.py -iRepository: factly/gopie
Length of output: 109
Add role escaping to prevent code generation injection.
The current code escapes message content but leaves role values unescaped. Since role is directly inserted into generated Python code templates (line 448, 481), any special characters (quotes, backslashes) in a role value could break the output. Add escaping for role similar to content:
```diff
     if isinstance(message, dict):
         role = message.get("role") or message.get("type") or "user"
         if role == "human":
             role = "user"
         content = (
             message.get("content")
             or message.get("text")
             or message.get("message")
             or ""
         )
         content = str(content) if content is not None else ""
     else:
         role = "user"
         content = str(message) if message is not None else ""
+    role_escaped = str(role).replace("\\", "\\\\").replace('"', '\\"')
     content_escaped = (content or "").replace("\\", "\\\\").replace(
         '"', '\\"'
     ).replace("\n", "\\n").replace("\r", "")
-    return role, content_escaped
+    return role_escaped, content_escaped
```

Also consider normalizing "ai" → "assistant" if test case messages come from LangChain sources (consistent with existing "human" → "user" normalization), though this may be defensive if the current data only produces OpenAI-standard roles.
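As an alternative worth considering (not what the utility currently does), `json.dumps` produces an escaped double-quoted literal for both role and content in one step, which is harder to get wrong than hand-rolled replaces:

```python
import json


def message_literal(role: str, content: str) -> str:
    # json.dumps yields a double-quoted literal with quotes, backslashes and
    # newlines already escaped, so the result can be spliced safely into
    # generated Python source.
    return f"dict(role={json.dumps(role)}, content={json.dumps(content)})"


print(message_literal('user"', 'line1\nline2 "quoted" \\ backslash'))
```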
🤖 Prompt for AI Agents
In `@chat-server/tests/e2e/utils/generate_app_cases.py` around lines 401 - 420,
The _normalize_message function currently escapes content but not role, which
can allow special chars in role to break generated Python code; update
_normalize_message to escape role the same way as content (replace backslash
with \\\\, double-quote with \\", newline with \\n, and remove \r) and return
the escaped role, and also normalize role values by mapping "human"→"user" and
"ai"→"assistant" (in addition to the existing "type" fallback) so generated
templates that inject role values use safe, normalized strings.
```python
class TestGenerateSqlPromptFormatting:
    @pytest.fixture
    def mock_dataset_schema(self):
```
🧩 Analysis chain
🏁 Script executed:
find chat-server -name "pytest.ini" -o -name "pyproject.toml" -o -name "setup.cfg" | head -10Repository: factly/gopie
Length of output: 82
🏁 Script executed:
ls -la chat-server/tests/unit/ 2>/dev/null | head -20Repository: factly/gopie
Length of output: 1337
🏁 Script executed:
head -50 chat-server/tests/unit/test_generate_sql_prompt_formatting.py 2>/dev/nullRepository: factly/gopie
Length of output: 1594
🏁 Script executed:
grep -r "pytest.mark.unit" chat-server/tests/unit/ 2>/dev/null | head -5Repository: factly/gopie
Length of output: 443
🏁 Script executed:
cat chat-server/pyproject.tomlRepository: factly/gopie
Length of output: 1748
🏁 Script executed:
cat -n chat-server/tests/unit/test_generate_sql_prompt_formatting.pyRepository: factly/gopie
Length of output: 7576
Add the unit test marker to align with project conventions.
Unit tests under chat-server/tests/unit require the pytest.mark.unit marker for reliable filtering. This file is currently missing it, while other tests in the same directory follow this requirement.
Add pytestmark = pytest.mark.unit at the module level (after the imports and docstring, as established in other test files in this directory):
🔧 Suggested change
```diff
@@
 """
 Unit tests for generate_sql_prompt formatting logic, specifically verifying column value inclusion.
 """
 from typing import TypedDict
 from unittest.mock import MagicMock
 import pytest
 from app.models.data import ColumnValueMatching
 from app.models.schema import DatasetSchema
 from app.workflow.graph.sql_planner_graph.types import DatasetsInfo
 from app.workflow.prompts.generate_sql_prompt import format_generate_sql_input
+pytestmark = pytest.mark.unit
```

🤖 Prompt for AI Agents
In `@chat-server/tests/unit/test_generate_sql_prompt_formatting.py` around lines
31 - 33, Add the missing module-level pytest marker by declaring pytestmark =
pytest.mark.unit near the top of the test module (after imports and any module
docstring) so this file follows the same convention as other unit tests; modify
the TestGenerateSqlPromptFormatting test module to include that pytestmark
declaration (referencing the module that contains class
TestGenerateSqlPromptFormatting).
```python
# Verify query_points was called with correct structure
call_args = mock_client.query_points.call_args
assert call_args.kwargs["filter"] is None
assert len(call_args.kwargs["prefetch"]) == 2  # Dense and sparse prefetch
```
Potential KeyError when accessing kwargs["filter"].
Based on the search_schemas implementation in the relevant code snippet, the filter parameter is passed inside each Prefetch object, not at the top level of query_points. If filter is not explicitly passed as a keyword argument, accessing call_args.kwargs["filter"] will raise a KeyError instead of returning None.
Other tests in this file (lines 77, 107, 137, 165) use the safer pattern: call_args.kwargs.get("filter").
Proposed fix
```diff
 # Verify query_points was called with correct structure
 call_args = mock_client.query_points.call_args
-assert call_args.kwargs["filter"] is None
+assert call_args.kwargs.get("filter") is None
 assert len(call_args.kwargs["prefetch"]) == 2  # Dense and sparse prefetch
```

🤖 Prompt for AI Agents
In `@chat-server/tests/unit/test_schema_search.py` around lines 48 - 51, The test
accesses query call kwargs with call_args.kwargs["filter"] which can raise
KeyError because search_schemas places filter inside each Prefetch rather than
as a top-level kwarg; update the assertion to use call_args.kwargs.get("filter")
(matching other tests) and keep the existing assertion on prefetch length
(mock_client.query_points, search_schemas, Prefetch are the symbols to
update/inspect).
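A tiny standalone sketch of the difference, using only `unittest.mock`:

```python
from unittest.mock import MagicMock

client = MagicMock()
client.query_points(prefetch=["dense", "sparse"])  # no explicit filter kwarg passed

call_args = client.query_points.call_args
assert call_args.kwargs.get("filter") is None       # safe: returns None when the kwarg is absent
assert len(call_args.kwargs["prefetch"]) == 2
# call_args.kwargs["filter"] would raise KeyError here, since the kwarg was never passed
```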
…onous source-to-destination migration with hybrid vector support.
|
Claude finished @paul-tharun's task —— View job

🔍 PR Review: Feat/e2e datasets metadata

Review Progress

🏗️ Architecture Review

This PR introduces significant architectural improvements with a well-structured approach to NL-to-SQL processing and multi-tenant organization support.

✅ Positive Changes

1. Clean NL-to-SQL Graph Architecture
2. Multi-Tenant Organization Support
3. SQL Injection Prevention
🚨 Critical Security Issues

1. Missing Organization ID in Semantic Search

Location: The semantic search function calls

```python
# VULNERABLE: No org_id passed
results = await search_schemas(
    user_query=user_query,
    embeddings=embeddings_model,
    dataset_ids=dataset_ids,
    project_ids=project_ids,
    # Missing: org_id=org_id
)
```

Fix Required: Extract org_id from config and pass it to search_schemas.

2. Division by Zero Risk

Location:

```python
# VULNERABLE: Will crash if estimated_size is 0
sample_pct = (settings.TARGET_ROWS / estimated_size) * 100
```

Fix Required: Add zero-check before division.

3. Direct SQL Interpolation

Location:

```python
# POTENTIAL RISK: Direct interpolation without validation
sample_data_query = f"SELECT * FROM {dataset_name} LIMIT 50"
```

While
|
| Category | Rating | Comments |
|---|---|---|
| Architecture | ⭐⭐⭐⭐⭐ | Excellent modular design |
| Security | ⭐⭐⭐⚪⚪ | Good foundations, critical gaps |
| Performance | ⭐⭐⭐⭐⚪ | Smart optimizations implemented |
| Testing | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking |
| Code Quality | ⭐⭐⭐⭐⚪ | Clean code, room for improvement |
Recommendation: This is a solid architectural improvement, but the security issue in semantic search must be fixed before merging to prevent cross-tenant data access.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@chat-server/tests/scripts/reset_and_reindex_collection.py`:
- Around line 1-33: The imports at the top of reset_and_reindex_collection.py
include unnecessary "# noqa: E402" markers (on AsyncQdrantClient, models,
settings, logger, QdrantSetup, generate_sparse_vector); remove these trailing "#
noqa: E402" annotations from those import lines so Ruff can report real
import-order issues in future, leaving the import order and the sys.path
insertion unchanged and without adding other noqa comments.
- Around line 78-86: In run_generate_sparse_vector, replace the current re-raise
"raise e" with a bare "raise" to preserve the original traceback when an
exception is caught; keep the logger.error call as-is so the error is logged but
ensure you re-raise with bare raise in the except block of
generate_sparse_vector wrapper to retain original exception context.
|
Claude finished @paul-tharun's task —— View job

PR Review: Feat/e2e datasets metadata

I've conducted a comprehensive review of this large PR with 100 files changed. Here are my findings:

Review Checklist

🏗️ Architecture Review

This PR introduces significant architectural improvements with a well-structured approach to NL-to-SQL processing and multi-tenant organization support.

✅ Positive Changes

1. Clean NL-to-SQL Graph Architecture
2. Multi-Tenant Organization Support
3. SQL Injection Prevention
🚨 Critical Security Issues

1. Missing Organization ID in Semantic Search

Location: The semantic search function calls

```python
# VULNERABLE: No org_id passed
results = await search_schemas(
    user_query=user_query,
    embeddings=embeddings_model,
    dataset_ids=dataset_ids,
    project_ids=project_ids,
    # Missing: org_id=org_id
)
```

Fix Required: Extract org_id from config and pass it to search_schemas.

2. Division by Zero Risk

Location:

```python
# VULNERABLE: Will crash if estimated_size is 0
sample_pct = (settings.TARGET_ROWS / estimated_size) * 100
```

Fix Required: Add zero-check before division.

3. Direct SQL Interpolation

Location:

```python
# POTENTIAL RISK: Direct interpolation without validation
sample_data_query = f"SELECT * FROM {dataset_name} LIMIT 50"
```

While
|
| Category | Rating | Comments |
|---|---|---|
| Architecture | ⭐⭐⭐⭐⭐ | Excellent modular design |
| Security | ⭐⭐⭐⚪⚪ | Good foundations, critical gaps |
| Performance | ⭐⭐⭐⭐⚪ | Smart optimizations implemented |
| Testing | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking |
| Code Quality | ⭐⭐⭐⭐⚪ | Clean code, room for improvement |
Recommendation: This is a solid architectural improvement, but the security issue in semantic search must be fixed before merging to prevent cross-tenant data access.
…ation and synchronously initialize the sparse vector model.
|
Claude finished @paul-tharun's task —— View job

🔍 PR Review: Feat/e2e datasets metadata

I've conducted a comprehensive review of this large PR with 100 files changed. Here are my findings:

Review Progress

🏗️ Architecture & Design

✅ Positive Changes
|
| Category | Rating | Comments |
|---|---|---|
| Architecture | ⭐⭐⭐⭐⚪ | Well-designed but complex |
| Security | ⭐⭐⭐⚪⚪ | Good practices with some gaps |
| Performance | ⭐⭐⭐⭐⚪ | Smart optimizations implemented |
| Testing | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking |
| Code Quality | ⭐⭐⭐⭐⚪ | Clean code with room for improvement |
Overall: This is a substantial improvement to the system with well-thought-out architectural changes. The main concerns are security gaps, insufficient testing of new features, and some complexity that could be better documented. I recommend addressing the security issues before merging.
|
Claude finished @paul-tharun's task —— View job

PR Review: Feat/e2e datasets metadata

Review Progress

🏗️ Architecture Review

This PR introduces significant architectural improvements with a well-structured approach to NL-to-SQL processing and multi-tenant organization support.

✅ Positive Changes

1. Clean NL-to-SQL Graph Architecture
2. Multi-Tenant Organization Support
3. SQL Injection Prevention
🚨 Critical Security Issues

1. Missing Organization ID in Semantic Search

Location: The semantic search function calls

```python
# VULNERABLE: No org_id passed
results = await search_schemas(
    user_query=user_query,
    embeddings=embeddings_model,
    dataset_ids=dataset_ids,
    project_ids=project_ids,
    # Missing: org_id=org_id
)
```

Fix Required: Extract org_id from config and pass it to search_schemas.

2. Division by Zero Risk

Location:

```python
# VULNERABLE: Will crash if estimated_size is 0
sample_pct = (settings.TARGET_ROWS / estimated_size) * 100
```

Fix Required: Add zero-check before division.

3. Direct SQL Interpolation

Location:

```python
# POTENTIAL RISK: Direct interpolation without validation
sample_data_query = f"SELECT * FROM {dataset_name} LIMIT 50"
```

While
|
| Category | Rating | Comments |
|---|---|---|
| Architecture | ⭐⭐⭐⭐⭐ | Excellent modular design |
| Security | ⭐⭐⭐⚪⚪ | Good foundations, critical gaps |
| Performance | ⭐⭐⭐⭐⚪ | Smart optimizations implemented |
| Testing | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking |
| Code Quality | ⭐⭐⭐⭐⚪ | Clean code, room for improvement |
Recommendation: This is a solid architectural improvement, but the security issue in semantic search must be fixed before merging to prevent cross-tenant data access.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
web/src/lib/stores/auth-store.ts (1)
202-211: ⚠️ Potential issue | 🟡 Minor

Missing `role: null` in catch block of `checkSession`.

When session check throws an error, `role` is not reset to `null`, unlike the successful `!response.ok` branch at line 197. This could leave stale role data in state after a network error.

🛡️ Proposed fix

```diff
 } catch (error) {
   console.error("Session check error:", error);
   set({
     user: null,
     isAuthenticated: false,
     accessToken: null,
     organizationId: null,
+    role: null,
   });
   setGlobalAccessToken(null);
   setGlobalOrganizationId(null);
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@web/src/lib/stores/auth-store.ts` around lines 202 - 211, In checkSession's catch block the state reset omits role, leaving stale role values after errors; update the set({...}) call inside the catch of checkSession to include role: null (matching the !response.ok branch) so user/role/accessToken/organizationId are all cleared on error, keeping state consistent with setGlobalAccessToken(null) and setGlobalOrganizationId(null).

server/interfaces/http/routes/api/projects/datasets/update.go (1)
66-70:⚠️ Potential issue | 🟡 MinorIncorrect error message: "deleting" should be "updating".
The error response message at line 68 says "Error deleting dataset" but this is the update handler. This appears to be a copy-paste error.
🐛 Proposed fix
return ctx.Status(fiber.StatusInternalServerError).JSON(fiber.Map{ "error": err.Error(), - "message": "Error deleting dataset", + "message": "Error updating dataset", "code": fiber.StatusInternalServerError, })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/interfaces/http/routes/api/projects/datasets/update.go` around lines 66 - 70, The error response in the update handler incorrectly says "Error deleting dataset"; update the message string to "Error updating dataset" in the JSON returned by the update route (the block that calls ctx.Status(fiber.StatusInternalServerError).JSON with "error": err.Error()); ensure the same handler/return (the update.go update route) uses "Error updating dataset" so the message matches the update operation.server/interfaces/http/routes/source/s3/create.go (2)
154-156:⚠️ Potential issue | 🔴 CriticalCleanup struct is missing
userIDandroleinitialization.When the
cleanupstruct is initialized and later updated,userIDandroleare never set. However,cleanupResourcesuses these fields when callingh.datasetSvc.Delete(rc.datasetID, rc.orgID, rc.userID, rc.role)at line 56.This means cleanup operations after dataset creation will pass empty/zero values for
userIDandrole, which could cause authorization failures or unexpected behavior during error recovery.🐛 Proposed fix to initialize userID
// Initialize cleanup resource object cleanup := resourceCleanup{ tableName: res.TableName, + userID: userID, + orgID: orgID, }And later, only update the dataset-specific fields:
// Update cleanup object to include dataset info cleanup.hasDataset = true cleanup.datasetID = dataset.ID - cleanup.orgID = dataset.OrgIDAlso applies to: 194-197
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/interfaces/http/routes/source/s3/create.go` around lines 154 - 156, resourceCleanup is constructed without setting userID and role, causing cleanupResources to call datasetSvc.Delete with empty auth fields; when creating the cleanup variable in create.go (resourceCleanup{...}) set userID and role from the current handler/request context (the same userID and role values used elsewhere in this handler) so those fields are populated, and ensure subsequent updates (the later assignments at the other initialization site around lines 194-197) only overwrite dataset-specific fields (datasetID, orgID, tableName) and not userID/role so datasetSvc.Delete receives correct auth parameters.
37-37:⚠️ Potential issue | 🔴 CriticalThe
rolefield is actively used and required; it should be assigned, not removed.The
rolefield is passed toh.datasetSvc.Delete()at line 56 and is essential for authorization checks (if role == models.Admin). However, bothroleanduserIDare never assigned to thecleanupstruct during initialization, causing the Delete call to receive zero-values. SinceuserIDandroleare available from the context in theuploadfunction, they should be populated when initializing thecleanupstruct.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/interfaces/http/routes/source/s3/create.go` at line 37, The cleanup struct is missing assignment of the required authorization fields; set cleanup.userID and cleanup.role when you create the cleanup instance in the upload handler so the later call to h.datasetSvc.Delete(...) receives the actual context values instead of zero-values. Locate the upload function where the cleanup struct is initialized (the struct with fields userID and role used by h.datasetSvc.Delete) and populate those fields from the request context (the same place you already read userID and role) so authorization checks (e.g., if role == models.Admin) work correctly.
🧹 Nitpick comments (4)
web/src/app/projects/[projectId]/page.tsx (1)
19-23: Consider addingProtectedPagewrapper for authentication.Similar to other authenticated pages in this PR (e.g.,
HomePage), this page should be wrapped withProtectedPageto ensure unauthenticated users are redirected appropriately.As per coding guidelines: "Protect routes using the ProtectedRoute wrapper HOC for authentication-required pages".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@web/src/app/projects/`[projectId]/page.tsx around lines 19 - 23, The ProjectPage component is not wrapped with the ProtectedPage auth wrapper; update the default export so ProjectPage is returned/wrapped by ProtectedPage (i.e., export default ProtectedPage(ProjectPage) or wrap the JSX returned by ProjectPage with <ProtectedPage>), ensuring the params prop handling (params: Promise<{ projectId: string }>) is preserved and any server/client boundaries remain correct; locate the ProjectPage function and apply the same ProtectedPage pattern used by HomePage to enforce authentication and redirection for unauthenticated users.web/src/components/dataset/column-descriptions-modal.tsx (1)
233-243: Consider updating help text when editing is disabled.The message "Click edit to add one" is shown even when
canEditisfalse, which could confuse users who don't see an edit button.💡 Suggested improvement
) : ( <p className="text-muted-foreground italic"> - No description available. Click edit to add one. + No description available.{canEdit && " Click edit to add one."} </p> )}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@web/src/components/dataset/column-descriptions-modal.tsx` around lines 233 - 243, The "Click edit to add one." help text is shown even when editing is disabled; update the JSX in column-descriptions-modal.tsx (the block using descriptions[column.column_name] and the canEdit prop/variable) to conditionally render the message: if descriptions[column.column_name] is missing and canEdit is true show "Click edit to add one.", otherwise show a neutral message such as "No description available." or "Editing disabled" when canEdit is false; adjust the conditional around the <p> elements so the text reflects the canEdit state.server/application/services/store.go (2)
61-68: Drop dead list parameters (or enforce them) for a clearer service contract.
ProjectService.Liststill acceptscreatedBybut does not use it, and this pattern now mirrors the role-removal transition. Tightening the signature here will reduce ambiguity for callers.Also applies to: 107-107
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/application/services/store.go` around lines 61 - 68, ProjectService.List currently accepts an unused createdBy parameter; either remove it from the method signature and all callers or wire it through to the repository call (update ProjectService.List signature and the call to projectRepo.SearchProject to accept and forward createdBy, or update repository if necessary). Locate the ProjectService.List method and the projectRepo.SearchProject invocation to either delete the createdBy argument from List and callers, or add createdBy handling in List and modify SearchProject's signature and implementations to accept and filter by createdBy so the parameter is not left unused.
28-29: Remove unusedcreatedByfromDetailsservice methods.
createdByis accepted but ignored in both methods, which makes the API contract look stricter than the actual authorization path.♻️ Proposed cleanup
-func (service *ProjectService) Details(id, orgID, createdBy string) (*models.Project, error) { +func (service *ProjectService) Details(id, orgID string) (*models.Project, error) { return service.projectRepo.Details(context.Background(), id, orgID) } -func (service *DatasetService) Details(id, orgID, createdBy string) (*models.Dataset, error) { +func (service *DatasetService) Details(id, orgID string) (*models.Dataset, error) { return service.datasetRepo.Details(context.Background(), id, orgID) }Also applies to: 93-94
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/application/services/store.go` around lines 28 - 29, The Details method on ProjectService currently accepts an unused createdBy parameter; remove createdBy from the ProjectService.Details signature and from any other service Details variants, update all callers to stop passing createdBy, and adjust any interface definitions and tests that reference ProjectService.Details; keep the internal call to projectRepo.Details(context.Background(), id, orgID) unchanged. Ensure you update method declarations (ProjectService.Details) and any implementing interfaces so signatures match, run tests, and remove dead parameters in related methods mentioned in the review (the other Details method at lines 93-94).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/application/repositories/store.go`:
- Line 80: The ListByProjectAndOrg implementation is ignoring orgID and
createdBy when calling s.List; update the function (ListByProjectAndOrg) to pass
a filter that enforces tenant scoping and optional creator filtering instead of
only projectID and pagination—modify the call to s.List to include an
appropriate filter/map/selector that adds org_id = orgID and, if createdBy is
non-empty, created_by = createdBy (or whatever field names Dataset expects), so
the query returned by s.List is correctly restricted by org and creator.
In `@server/infrastructure/postgres/store/datasets/list_by_project_and_org.go`:
- Around line 9-11: The ListByProjectAndOrg implementation ignores orgID and
createdBy (it just calls s.List), so update it to apply org and creator
filtering: either (A) extend or call an underlying query method that accepts
orgID and createdBy and pass those values into the SQL/ORM filter, or (B) if
filtering is intentionally not applied, add a clear comment on
PgDatasetStore.ListByProjectAndOrg stating that org/user scoping is deferred and
why; specifically modify the function PgDatasetStore.ListByProjectAndOrg to use
orgID and createdBy in the dataset query (or call a new/internal method like
s.listWithFilters(ctx, projectID, orgID, createdBy, pagination)) to enforce
tenant/user isolation and return the paginated result.
In `@server/interfaces/http/routes/api/projects/details.go`:
- Around line 20-24: Replace the header-based org lookup with the
middleware-local org value: stop using ctx.Get(middleware.OrganizationIDHeader)
and instead read the org ID from the middleware-populated locals (e.g.,
ctx.Locals(<middleware's org-local-key>)) before calling
h.projectSvc.Details(projectID, orgID, userID); update the orgID variable
initialization so authorization uses the middleware-scoped org context (use the
exact local key constant provided by the middleware rather than the header
constant) and keep projectID and userID retrievals as-is.
In `@web/src/app/projects/`[projectId]/upload/page.tsx:
- Around line 9-36: The page lacks the authentication wrapper: wrap the
UploadDatasetPage with the app's ProtectedPage/ProtectedRoute so unauthenticated
users are redirected to login before any canEdit checks run; locate the
UploadDatasetPage export in web/src/app/projects/[projectId]/upload/page.tsx and
either return/render <ProtectedPage> (or the project's ProtectedRoute HOC)
around the component's JSX (ensuring DatasetUploadWizard remains inside) or
export the component via ProtectedPage(UploadDatasetPage), keep the existing
useAuthStore, useProject, and canEdit logic intact inside the wrapped component,
and ensure ProtectedPage handles redirect-to-login for unauthenticated users.
---
Outside diff comments:
In `@server/interfaces/http/routes/api/projects/datasets/update.go`:
- Around line 66-70: The error response in the update handler incorrectly says
"Error deleting dataset"; update the message string to "Error updating dataset"
in the JSON returned by the update route (the block that calls
ctx.Status(fiber.StatusInternalServerError).JSON with "error": err.Error());
ensure the same handler/return (the update.go update route) uses "Error updating
dataset" so the message matches the update operation.
In `@server/interfaces/http/routes/source/s3/create.go`:
- Around line 154-156: resourceCleanup is constructed without setting userID and
role, causing cleanupResources to call datasetSvc.Delete with empty auth fields;
when creating the cleanup variable in create.go (resourceCleanup{...}) set
userID and role from the current handler/request context (the same userID and
role values used elsewhere in this handler) so those fields are populated, and
ensure subsequent updates (the later assignments at the other initialization
site around lines 194-197) only overwrite dataset-specific fields (datasetID,
orgID, tableName) and not userID/role so datasetSvc.Delete receives correct auth
parameters.
- Line 37: The cleanup struct is missing assignment of the required
authorization fields; set cleanup.userID and cleanup.role when you create the
cleanup instance in the upload handler so the later call to
h.datasetSvc.Delete(...) receives the actual context values instead of
zero-values. Locate the upload function where the cleanup struct is initialized
(the struct with fields userID and role used by h.datasetSvc.Delete) and
populate those fields from the request context (the same place you already read
userID and role) so authorization checks (e.g., if role == models.Admin) work
correctly.
In `@web/src/lib/stores/auth-store.ts`:
- Around line 202-211: In checkSession's catch block the state reset omits role,
leaving stale role values after errors; update the set({...}) call inside the
catch of checkSession to include role: null (matching the !response.ok branch)
so user/role/accessToken/organizationId are all cleared on error, keeping state
consistent with setGlobalAccessToken(null) and setGlobalOrganizationId(null).
---
Nitpick comments:
In `@server/application/services/store.go`:
- Around line 61-68: ProjectService.List currently accepts an unused createdBy
parameter; either remove it from the method signature and all callers or wire it
through to the repository call (update ProjectService.List signature and the
call to projectRepo.SearchProject to accept and forward createdBy, or update
repository if necessary). Locate the ProjectService.List method and the
projectRepo.SearchProject invocation to either delete the createdBy argument
from List and callers, or add createdBy handling in List and modify
SearchProject's signature and implementations to accept and filter by createdBy
so the parameter is not left unused.
- Around line 28-29: The Details method on ProjectService currently accepts an
unused createdBy parameter; remove createdBy from the ProjectService.Details
signature and from any other service Details variants, update all callers to
stop passing createdBy, and adjust any interface definitions and tests that
reference ProjectService.Details; keep the internal call to
projectRepo.Details(context.Background(), id, orgID) unchanged. Ensure you
update method declarations (ProjectService.Details) and any implementing
interfaces so signatures match, run tests, and remove dead parameters in related
methods mentioned in the review (the other Details method at lines 93-94).
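For either bullet the change is small. The sketch below shows both resolutions under simplified, assumed types (projectRepository and Project are stand-ins, not the real signatures): List forwards createdBy to SearchProject, while Details simply drops the dead parameter and keeps the repository call unchanged.
package services

import "context"

// Simplified stand-ins for the real repository interface and model.
type Project struct{ ID string }

type projectRepository interface {
	SearchProject(ctx context.Context, orgID, createdBy string) ([]*Project, error)
	Details(ctx context.Context, id, orgID string) (*Project, error)
}

type ProjectService struct {
	projectRepo projectRepository
}

// List wires createdBy through to the repository instead of leaving it unused.
func (s *ProjectService) List(orgID, createdBy string) ([]*Project, error) {
	return s.projectRepo.SearchProject(context.Background(), orgID, createdBy)
}

// Details drops the unused createdBy parameter; the repository call is unchanged.
func (s *ProjectService) Details(id, orgID string) (*Project, error) {
	return s.projectRepo.Details(context.Background(), id, orgID)
}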
In `@web/src/app/projects/[projectId]/page.tsx`:
- Around line 19-23: The ProjectPage component is not wrapped with the
ProtectedPage auth wrapper; update the default export so ProjectPage is
returned/wrapped by ProtectedPage (i.e., export default
ProtectedPage(ProjectPage) or wrap the JSX returned by ProjectPage with
<ProtectedPage>), ensuring the params prop handling (params: Promise<{
projectId: string }>) is preserved and any server/client boundaries remain
correct; locate the ProjectPage function and apply the same ProtectedPage
pattern used by HomePage to enforce authentication and redirection for
unauthenticated users.
In `@web/src/components/dataset/column-descriptions-modal.tsx`:
- Around line 233-243: The "Click edit to add one." help text is shown even when
editing is disabled; update the JSX in column-descriptions-modal.tsx (the block
using descriptions[column.column_name] and the canEdit prop/variable) to
conditionally render the message: if descriptions[column.column_name] is missing
and canEdit is true show "Click edit to add one.", otherwise show a neutral
message such as "No description available." or "Editing disabled" when canEdit
is false; adjust the conditional around the <p> elements so the text reflects
the canEdit state.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
- server/application/repositories/store.go
- server/application/services/store.go
- server/infrastructure/postgres/store/datasets/list_by_project_and_org.go
- server/infrastructure/postgres/store/datasets/list_by_project_and_role.go
- server/interfaces/http/middleware/validate.go
- server/interfaces/http/routes/api/datasets/details.go
- server/interfaces/http/routes/api/datasets/get-column-descriptions.go
- server/interfaces/http/routes/api/datasets/update-column-descriptions.go
- server/interfaces/http/routes/api/projects/datasets/delete.go
- server/interfaces/http/routes/api/projects/datasets/details.go
- server/interfaces/http/routes/api/projects/datasets/update.go
- server/interfaces/http/routes/api/projects/details.go
- server/interfaces/http/routes/api/projects/list.go
- server/interfaces/http/routes/source/database/create.go
- server/interfaces/http/routes/source/database/refresh.go
- server/interfaces/http/routes/source/s3/create.go
- web/src/app/page.tsx
- web/src/app/projects/[projectId]/datasets/[datasetId]/page.tsx
- web/src/app/projects/[projectId]/page.tsx
- web/src/app/projects/[projectId]/upload/page.tsx
- web/src/components/dataset/column-descriptions-modal.tsx
- web/src/components/dataset/dataset-header.tsx
- web/src/components/project/inline-project-editor.tsx
- web/src/components/project/project-card.tsx
- web/src/lib/stores/auth-store.ts
💤 Files with no reviewable changes (1)
- server/infrastructure/postgres/store/datasets/list_by_project_and_role.go
DeleteByOrgAndCreator(ctx context.Context, datasetID, orgID, createdBy string) error
UpdateByOrgAndCreator(ctx context.Context, datasetID, orgID, createdBy string, params *models.UpdateDatasetParams) (*models.Dataset, error)
ListByProjectAndRole(ctx context.Context, projectID, orgID, createdBy string, role models.Role, pagination models.Pagination) (*models.PaginationView[*models.Dataset], error)
ListByProjectAndOrg(ctx context.Context, projectID, orgID, createdBy string, pagination models.Pagination) (*models.PaginationView[*models.Dataset], error)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== ListByProjectAndOrg definitions =="
rg -n -C6 '\bListByProjectAndOrg\s*\(' server
echo
echo "== Candidate implementation files =="
fd 'list_by_project_and_org\.go$' server
echo
echo "== Inspect org/creator usage in implementations =="
for f in $(fd 'list_by_project_and_org\.go$' server); do
echo "---- $f ----"
rg -n -C6 'ListByProjectAndOrg|orgID|createdBy|org_id|created_by|WHERE|project_id' "$f"
done
Repository: factly/gopie
Length of output: 4469
ListByProjectAndOrg ignores orgID and createdBy parameters—org scoping is not enforced.
The implementation at server/infrastructure/postgres/store/datasets/list_by_project_and_org.go accepts these parameters but forwards only projectID and pagination to s.List(), ignoring tenant and creator constraints. Update the implementation to apply orgID (and createdBy if intended) to query filters.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/application/repositories/store.go` at line 80, The ListByProjectAndOrg
implementation is ignoring orgID and createdBy when calling s.List; update the
function (ListByProjectAndOrg) to pass a filter that enforces tenant scoping and
optional creator filtering instead of only projectID and pagination—modify the
call to s.List to include an appropriate filter/map/selector that adds org_id =
orgID and, if createdBy is non-empty, created_by = createdBy (or whatever field
names Dataset expects), so the query returned by s.List is correctly restricted
by org and creator.
func (s *PgDatasetStore) ListByProjectAndOrg(ctx context.Context, projectID, orgID, createdBy string, pagination models.Pagination) (*models.PaginationView[*models.Dataset], error) {
	return s.List(ctx, projectID, pagination)
}
Parameters orgID and createdBy are ignored in the implementation.
The method signature accepts orgID and createdBy but these values are never used—the call simply delegates to s.List(ctx, projectID, pagination). This creates a misleading API contract where callers may assume org-scoped or user-scoped filtering is applied when it is not.
If org/user filtering is intentionally deferred, consider adding a comment explaining this. Otherwise, incorporate these parameters into the query to enforce proper tenant isolation.
🔍 Verification: Check if List method supports org filtering
#!/bin/bash
# Description: Check the List method signature and implementation to see if it supports org/user filtering
# Find the List method implementation
ast-grep --pattern $'func ($_ *PgDatasetStore) List($$$) {
$$$
}'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/infrastructure/postgres/store/datasets/list_by_project_and_org.go`
around lines 9 - 11, The ListByProjectAndOrg implementation ignores orgID and
createdBy (it just calls s.List), so update it to apply org and creator
filtering: either (A) extend or call an underlying query method that accepts
orgID and createdBy and pass those values into the SQL/ORM filter, or (B) if
filtering is intentionally not applied, add a clear comment on
PgDatasetStore.ListByProjectAndOrg stating that org/user scoping is deferred and
why; specifically modify the function PgDatasetStore.ListByProjectAndOrg to use
orgID and createdBy in the dataset query (or call a new/internal method like
s.listWithFilters(ctx, projectID, orgID, createdBy, pagination)) to enforce
tenant/user isolation and return the paginated result.
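A sketch of the org-scoped variant both comments ask for. The helper name listWithFilters and the simplified types below are assumptions, not the project's real code; the point is only that orgID and createdBy reach the query filter instead of being dropped.
package datasets

import "context"

// Simplified stand-ins for the real models types.
type Pagination struct{ Limit, Offset int }
type Dataset struct{ ID string }
type PaginationView[T any] struct{ Items []T }

type PgDatasetStore struct{} // db handle omitted for brevity

// listWithFilters is a hypothetical internal query builder that applies
// project_id = projectID AND org_id = orgID and, when createdBy is non-empty,
// created_by = createdBy before paginating.
func (s *PgDatasetStore) listWithFilters(ctx context.Context, projectID, orgID, createdBy string, p Pagination) (*PaginationView[*Dataset], error) {
	// ... build and run the filtered, paginated query ...
	return &PaginationView[*Dataset]{}, nil
}

// ListByProjectAndOrg now forwards every parameter it accepts.
func (s *PgDatasetStore) ListByProjectAndOrg(ctx context.Context, projectID, orgID, createdBy string, p Pagination) (*PaginationView[*Dataset], error) {
	return s.listWithFilters(ctx, projectID, orgID, createdBy, p)
}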
projectID := ctx.Params("projectID")
orgID := ctx.Get(middleware.OrganizationIDHeader)
userID := ctx.Locals(middleware.UserCtxKey).(string)
role := ctx.Locals(middleware.RoleCtxKey).(models.Role)

- project, err := h.projectSvc.Details(projectID, orgID, userID, role)
+ project, err := h.projectSvc.Details(projectID, orgID, userID)
Use middleware-scoped org context instead of request headers for tenant enforcement.
Line 21 currently trusts a user-provided header for orgID. For authorization-sensitive reads, use middleware locals to avoid tenant spoofing.
🔒 Suggested fix
- orgID := ctx.Get(middleware.OrganizationIDHeader)
+ orgID := ctx.Locals(middleware.OrganizationCtxKey).(string)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
projectID := ctx.Params("projectID")
orgID := ctx.Get(middleware.OrganizationIDHeader)
userID := ctx.Locals(middleware.UserCtxKey).(string)
role := ctx.Locals(middleware.RoleCtxKey).(models.Role)
project, err := h.projectSvc.Details(projectID, orgID, userID, role)
project, err := h.projectSvc.Details(projectID, orgID, userID)

projectID := ctx.Params("projectID")
orgID := ctx.Locals(middleware.OrganizationCtxKey).(string)
userID := ctx.Locals(middleware.UserCtxKey).(string)
project, err := h.projectSvc.Details(projectID, orgID, userID)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/interfaces/http/routes/api/projects/details.go` around lines 20 - 24,
Replace the header-based org lookup with the middleware-local org value: stop
using ctx.Get(middleware.OrganizationIDHeader) and instead read the org ID from
the middleware-populated locals (e.g., ctx.Locals(<middleware's org-local-key>))
before calling h.projectSvc.Details(projectID, orgID, userID); update the orgID
variable initialization so authorization uses the middleware-scoped org context
(use the exact local key constant provided by the middleware rather than the
header constant) and keep projectID and userID retrievals as-is.
export default function UploadDatasetPage({
  params,
}: {
  params: Promise<{ projectId: string }>;
}) {
  const { projectId } = React.use(params);
  const { role, user } = useAuthStore();
  const router = useRouter();

  const { data: project, isLoading } = useProject({
    variables: { projectId },
  });

  React.useEffect(() => {
    if (!isLoading && project) {
      const canEdit = role === "admin" || project.created_by === user?.id;
      if (!canEdit) {
        router.replace(`/projects/${projectId}`);
      }
    }
  }, [isLoading, project, role, user, projectId, router]);

  if (isLoading || !project) return null;

  const canEdit = role === "admin" || project.created_by === user?.id;
  if (!canEdit) return null;

  return <DatasetUploadWizard projectId={projectId} />;
Missing ProtectedPage wrapper for authentication.
Per coding guidelines, authentication-required pages should use the ProtectedRoute or ProtectedPage wrapper. While the canEdit check handles authorization, unauthenticated users should be redirected to login rather than seeing a blank page.
🛡️ Proposed fix
+import { ProtectedPage } from "@/components/auth/protected-page";
+
export default function UploadDatasetPage({
params,
}: {
params: Promise<{ projectId: string }>;
}) {
// ... existing code ...
- return <DatasetUploadWizard projectId={projectId} />;
+ return (
+ <ProtectedPage>
+ <DatasetUploadWizard projectId={projectId} />
+ </ProtectedPage>
+ );
}
As per coding guidelines: "Protect routes using the ProtectedRoute wrapper HOC for authentication-required pages".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web/src/app/projects/[projectId]/upload/page.tsx` around lines 9 - 36, The
page lacks the authentication wrapper: wrap the UploadDatasetPage with the app's
ProtectedPage/ProtectedRoute so unauthenticated users are redirected to login
before any canEdit checks run; locate the UploadDatasetPage export in
web/src/app/projects/[projectId]/upload/page.tsx and either return/render
<ProtectedPage> (or the project's ProtectedRoute HOC) around the component's JSX
(ensuring DatasetUploadWizard remains inside) or export the component via
ProtectedPage(UploadDatasetPage), keep the existing useAuthStore, useProject,
and canEdit logic intact inside the wrapped component, and ensure ProtectedPage
handles redirect-to-login for unauthenticated users.
7b6a888 to bda820b
Summary by CodeRabbit
New Features
Improvements
Tests / Docs