
Feat/e2e datasets metadata #49

Open
paul-tharun wants to merge 15 commits into main from feat/e2e-datasets-metadata

Conversation

@paul-tharun
Contributor

@paul-tharun paul-tharun commented Jan 22, 2026

Summary by CodeRabbit

  • New Features

    • Natural-language → SQL generation endpoint; DuckDB documentation lookup; visualization end-to-end tooling and runner; multi-organization header support.
  • Improvements

    • Hybrid semantic+sparse search and size-aware OLAP sampling for better results; smarter SQL planning with clearer non‑SQL responses; CSV conversion utility; enhanced health and sandbox handling.
  • Tests / Docs

    • Extensive new/updated unit and e2e tests and expanded README.
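The CSV conversion utility mentioned above could look roughly like the sketch below; the actual helper in csv_utils.py may differ in signature and edge-case handling:

```python
import csv
import io


def convert_rows_to_csv(rows: list[dict]) -> str:
    """Hedged sketch of a rows-to-CSV helper; the real utility may differ."""
    if not rows:
        return ""
    buf = io.StringIO()
    # Use the first row's keys as the header; DictWriter quotes values
    # containing commas or newlines automatically.
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```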

Major changes:
- Add fetch-sql endpoint for NL to SQL generation
- Add unified GopieClient for API requests
- Add new SQL agent workflow with sql_planner_graph and nl_to_sql_graph
- Refactor single/multi dataset graphs with dedicated sql_agent nodes
- Add DuckDB docs search functionality
- Update prompts and validation logic
- Add new tools: respond_to_user, return_result_paths
- Update dependencies: langgraph>=0.6.0, dspy==3.1.0, beautifulsoup4
- Various bug fixes and improvements
- Updated `_build_ilike_query` and `_build_levenshtein_query` to ensure proper casting of columns to VARCHAR for SQL queries.
- Changed `data` type in `VisualizationResult` from `list[bytes]` to `list[str]` for better compatibility.
- Refactored `update_sandbox_timeout` to return a tuple indicating whether a new sandbox was created.
- Adjusted `pre_model_hook` to utilize the updated sandbox handling logic.
- Enhanced prompt guidelines to exclude non-string columns from fuzzy value generation.
- Updated test cases to validate new query handling and data type changes.
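The VARCHAR-casting change to `_build_ilike_query` can be sketched as follows; this is a hypothetical simplification, and the real helper in column_value_matching.py takes more parameters:

```python
def build_ilike_query(table: str, column: str, value: str) -> str:
    """Sketch of a case-insensitive fuzzy-match query with VARCHAR casting."""
    # Escape quotes and LIKE wildcards so the user-supplied value cannot
    # break out of the string literal or match unintended rows.
    escaped = value.replace("'", "''").replace("%", r"\%").replace("_", r"\_")
    # CAST(... AS VARCHAR) lets LIKE work on numeric/date columns as well.
    return (
        f"SELECT DISTINCT {column} FROM {table} "
        f"WHERE LOWER(CAST({column} AS VARCHAR)) "
        f"LIKE '%' || LOWER('{escaped}') || '%' ESCAPE '\\'"
    )
```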
@claude

claude Bot commented Jan 22, 2026

Claude finished @paul-tharun's task.


PR Review: Feat/e2e datasets metadata

I've conducted a comprehensive review of this large PR with 100 files changed. Here are my findings:

Review Checklist

  • Read repository setup and guidelines (CLAUDE.md, configs)
  • Review architectural changes (new graphs, workflows)
  • Analyze API security and validation
  • Check error handling and edge cases
  • Evaluate performance implications
  • Assess test coverage
  • Provide comprehensive feedback

🏗️ Architecture & Design

Positive Changes

  • Well-structured agent architecture: New nl_to_sql_graph and sql_planner_graph follow LangGraph patterns correctly
  • Separation of concerns: Clear separation between SQL planning, execution, and validation workflows
  • Type safety: Good use of Pydantic models for API validation (FetchSqlRequest, FetchSqlResponse)
  • Unified client pattern: New GopieClient centralizes API communication with proper header management

⚠️ Architectural Concerns

  • Commented code in workflows: chat-server/app/workflow/graph/sql_planner_graph/graph.py:19-30 has extensive commented conditional edges that should either be implemented or removed
  • Complex state management: The addition of multiple new graph types increases complexity without clear documentation of when each should be used

🔒 Security Analysis

Good Security Practices

  • Input validation: Proper Pydantic validation in new API endpoints (fetch_sql.py:21-22)
  • SQL injection prevention: Using parameterized queries and proper escaping (column_value_matching.py:40)
  • Header handling: Proper organization ID header management in GopieClient

🚨 Security Issues

  1. Insufficient input sanitization in column_value_matching.py:40:

    WHERE LOWER(CAST({column_name} AS VARCHAR)) LIKE '%' || LOWER('{escaped_value}') || '%'

    The escaped_value variable name suggests sanitization, but no actual escaping is shown in this snippet.

  2. Generic exception handling in fetch_sql.py:36-42:

    except Exception as e:
        logger.error(f"Error in fetch_sql: {e}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="Failed to generate SQL queries, please try again.",
        ) from e

    This could mask important security errors and returns an overly generic error message to the client.


🚀 Performance Analysis

Performance Improvements

  • Smart sampling: New table size estimation and sampling logic in table_utils.py prevents performance issues on large datasets
  • Connection pooling: Using singleton pattern for HTTP client connections (SingletonAiohttp)
  • Async operations: Proper use of async/await throughout new components

⚠️ Performance Concerns

  • Multiple vector searches: New DuckDB documentation search (duckdb_docs/search.py) adds another vector search operation that could impact latency
  • Memory usage: Complex state objects with nested query results could consume significant memory in high-concurrency scenarios

🧪 Testing & Quality

Testing Infrastructure

  • Good test structure: Maintained separation between unit and e2e tests
  • DSPy integration: Advanced evaluation system with optimized evaluators

Testing Gaps

  1. No tests for new API endpoint: fetch_sql.py lacks dedicated test coverage
  2. Missing edge case tests: New graph workflows need tests for failure scenarios
  3. Integration tests: No tests covering the interaction between new nl_to_sql_graph and existing workflows

🐛 Bug & Error Handling Issues

  1. State validation inconsistency in sql_agent.py:27-28:

    if not query_result or not single_dataset_result:
        raise ValueError("query_result is not properly initialized")

    This doesn't provide context about what went wrong or how to fix it.
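    A more actionable variant of that check might look like this (hypothetical; the node and state names are taken from the surrounding review):

    ```python
    def validate_sql_agent_state(query_result, single_dataset_result) -> None:
        """Raise errors that say which field is missing and what to check."""
        if not query_result:
            raise ValueError(
                "query_result is not initialized; ensure the prepare_data "
                "node ran before sql_agent in the graph."
            )
        if not single_dataset_result:
            raise ValueError(
                "single_dataset_result is missing; check that the "
                "single-dataset graph wired its state into sql_agent."
            )
    ```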

  2. Resource cleanup: Missing cleanup logic for failed operations in new graph workflows

  3. Retry logic: Inconsistent retry mechanisms across different agents - some have configurable retry counts, others don't


📋 Code Quality

Good Practices

  • Type hints: Comprehensive type annotations throughout new code
  • Documentation: Good docstrings in new functions like get_duckdb_docs()
  • Code organization: Logical file structure following established patterns

⚠️ Areas for Improvement

  • Hardcoded values: Magic numbers like LIMIT 200000 in SQL queries should be configurable
  • Long functions: Some functions exceed 50 lines and could benefit from decomposition
  • Duplicate logic: Similar error handling patterns repeated across multiple files

🔧 Specific Recommendations

  1. Security:

    # In column_value_matching.py - add proper escaping
    def escape_sql_value(value: str) -> str:
        return value.replace("'", "''").replace("%", "\\%").replace("_", "\\_")
  2. Error handling:

    # In fetch_sql.py - more specific exception handling
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except ConnectionError as e:
        raise HTTPException(status_code=503, detail="Service unavailable")
  3. Configuration:

    # Add to config.py
    DEFAULT_QUERY_LIMIT: int = 200000
    MAX_SAMPLING_PERCENTAGE: float = 5.0
  4. Testing:

    • Add unit tests for fetch_sql_router
    • Add integration tests for new graph workflows
    • Add performance tests for large dataset scenarios

📊 Overall Assessment

Category | Rating | Comments
--- | --- | ---
Architecture | ⭐⭐⭐⭐⚪ | Well-designed but complex
Security | ⭐⭐⭐⚪⚪ | Good practices with some gaps
Performance | ⭐⭐⭐⭐⚪ | Smart optimizations implemented
Testing | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking
Code Quality | ⭐⭐⭐⭐⚪ | Clean code with room for improvement

Overall: This is a substantial improvement to the system with well-thought-out architectural changes. The main concerns are security gaps, insufficient testing of new features, and some complexity that could be better documented. I recommend addressing the security issues before merging.


@coderabbitai
Contributor

coderabbitai Bot commented Jan 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds org-scoped request handling and a GopieClient; implements NL→SQL and SQL planner graphs with semantic search and SQL agents; introduces OLAP query builders (DuckDB/ClickHouse), sampling helpers, hybrid dense+sparse Qdrant support (SPLADE), refactors single/multi-dataset workflows, and expands tests, viz tooling, and indexing/migration scripts.

Changes

Cohort / File(s) Summary
API & Org Propagation
chat-server/app/api/v1/routers/dataset_upload.py, chat-server/app/api/v1/routers/query.py, chat-server/app/api/v1/routers/fetch_sql.py, chat-server/app/main.py
Add x_organization_id Header param and propagate org_id through endpoints and service calls; add new fetch-sql router.
Gopie client & services
chat-server/app/services/gopie/client.py, .../dataset_info.py, .../generate_schema.py, .../sql_executor.py
Introduce GopieClient, replace direct HTTP sessions, add org-scoped dataset/project fetch, sampling-aware summary generation, and org-scoped SQL execution.
Qdrant & Hybrid Vector Store
chat-server/app/services/qdrant/qdrant_setup.py, .../vector_store.py, .../schema_search.py, .../get_schema.py, .../schema_vectorization.py
Per-collection client/config, async/sync client singletons, collection existence handling, SPLADE sparse support, hybrid dense+sparse upserts and hybrid search, and org_id filtering for schema queries.
OLAP query builders & table utils
chat-server/app/utils/olap/*, chat-server/app/utils/graph_utils/table_utils.py, chat-server/app/utils/graph_utils/column_value_matching.py
Add OlapQueryBuilder interface and DuckDB/ClickHouse implementations; table size estimation, sampling decision helpers, and size-aware fuzzy SQL generation with org_id propagation.
NL→SQL & SQL Planner Graphs
chat-server/app/workflow/graph/nl_to_sql_graph/*, chat-server/app/workflow/graph/sql_planner_graph/*, related node/*, types.py
Add NL→SQL state graph (supervisor, semantic_search, sql_agent) and SQL planner graph (generate_sql, match_columns, search_docs) with typed states and nodes.
Single / Multi-dataset workflow refactor
chat-server/app/workflow/graph/single_dataset_graph/*, chat-server/app/workflow/graph/multi_dataset_graph/*
Split single-dataset flow into prepare_data, sql_agent, execute_sql with routing; replace plan_query with sql_agent in multi-dataset graph; rename/reshape state keys (previous_sql_queries/prev_sql_queries).
Prompts & selector updates
chat-server/app/workflow/prompts/*, chat-server/app/workflow/prompts/generate_sql_prompt.py
Rename/add many prompt APIs (generate_sql, extract_column_assumptions, etc.), add DB-aware prompt pieces, update prompt_map and input formatters.
Model & embedding providers
chat-server/app/utils/providers/embedding_providers/vllm.py, .../__init__.py, chat-server/app/models/provider.py, chat-server/app/utils/model_registry/model_provider.py
Add VLLM embedding provider and wiring; expose VLLM provider and adjust LLM retry behavior in provider selection.
Qdrant tooling & scripts
tests/scripts/reset_and_reindex_collection.py, chat-server/scripts/index_duckdb_docs.py, chat-server/scripts/scrape_duckdb_docs.py
New scripts to scrape/index DuckDB docs and migrate/reindex Qdrant collections to hybrid dense+sparse vectors.
Visualization & sandbox
chat-server/app/workflow/graph/visualize_data_graph/*, tests/e2e/viz_utils/*
Add VizIOManager, PerExampleWorkflow, viz test runner; update sandbox timeout helper to create sandbox when None and return (sandbox, created_new); pre_process_data and visualization nodes now pass org_id to SQL executor.
Chat history, formatting & models
chat-server/app/utils/chat_history/processor.py, chat-server/app/workflow/prompts/formatters/format_query_result.py, chat-server/app/models/query.py, chat-server/app/models/schema.py
Refactor chat history to inline SQL IDs; rename fields (no_sql_response→non_sql_response, sql_results→sql_queries), add org_id to DatasetSchema, update model serialization.
Tools & RunnableConfig plumbing
chat-server/app/tool_utils/*, chat-server/app/tool_utils/tools/*
Tool enum updates (RESPOND_TO_USER, RETURN_RESULT_PATHS), pass RunnableConfig into tools to extract org_id, and refactor tool-node metadata application.
CSV & utilities
chat-server/app/utils/csv_utils.py, chat-server/app/utils/graph_utils/result_validation.py
Add convert_rows_to_csv utility and adjust result validation/truncation messaging and JSON serialization defaults.
Tests & test infra
chat-server/tests/**/*
Large test additions and updates: unit tests for OLAP builders, schema search, match_columns, prompts; e2e viz tooling, dataset managers, test config updates, and changed .gitignore for datasets.
Config, deps, docs & infra
chat-server/app/core/config.py, chat-server/pyproject.toml, chat-server/README.md, .gitignore, docker-compose*.yaml, tests/test_config.py
New settings (duckdb collection, sparse model, sampling thresholds, OLAP_DB_TYPE), dependency updates/additions, README and test config updates, narrowed .gitignore entry for datasets.
Minor renames & logging
assorted files (app/core/log.py, event utils, constants)
Logger exception level tweak, removal of some streaming helpers, and assorted renames/refactors across workflow nodes and prompts.
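The hybrid dense+sparse retrieval described above is usually paired with a rank-fusion step; Reciprocal Rank Fusion is one common approach (a sketch only; the PR may rely on Qdrant's built-in hybrid query API instead):

```python
def rrf_fuse(dense_ranked: list[str], sparse_ranked: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked result lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked):
            # Documents ranked highly in either list accumulate score;
            # appearing in both lists boosts them further.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```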

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant API as Fetch SQL API
    participant NLToSQL as NL-to-SQL Graph
    participant Supervisor
    participant Semantic as SemanticSearch
    participant Qdrant
    participant SQLAgent
    participant Planner as SQL Planner Agent

    User->>API: POST /fetch-sql (user_query, dataset_ids, project_ids)
    API->>NLToSQL: invoke(payload)
    NLToSQL->>Supervisor: decide route
    alt single dataset
        Supervisor->>SQLAgent: goto sql_agent
        SQLAgent->>Planner: invoke sql_planning_agent
        Planner->>SQLAgent: sql_queries
    else multi-dataset
        Supervisor->>Semantic: goto semantic_search
        Semantic->>Qdrant: search_schemas(user_query,...)
        Qdrant->>Semantic: schema results
        Semantic->>NLToSQL: semantic results
        NLToSQL->>SQLAgent: provide semantic results
        SQLAgent->>Planner: invoke sql_planning_agent
        Planner->>SQLAgent: sql_queries
    end
    SQLAgent->>API: return sql_queries + optional message
    API->>User: FetchSqlResponse
sequenceDiagram
    participant State
    participant Prepare as PrepareData
    participant SQLAgent
    participant Execute as ExecuteSQL
    participant Validate

    State->>Prepare: dataset_info, user_query
    Prepare->>Prepare: fetch schema, sample data, column assumptions
    Prepare->>SQLAgent: dataset_info
    SQLAgent->>SQLAgent: call sql_planning_agent -> generate sql_queries or non_sql_response
    alt sql_queries present
        SQLAgent->>Execute: execute_sql (sql_queries)
        Execute->>Validate: results
    else non_sql_response
        SQLAgent->>Validate: non_sql_response
    end
    Validate->>State: updated query_result + recommendation

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • Feat/role authorization #39: Changes to middleware and repository/service signatures around org/role-scoped dataset/project access—overlaps with removed role parameter and org scoping.
  • factly/gopie#88: Modifies GOPIE integration and dataset upload code—overlaps with introduction of GopieClient and org-scoped dataset/project fetches.
  • factly/gopie#194: Modifies Qdrant schema retrieval and org-scoped filtering—overlaps with get_schema/get_project_schemas changes.

Suggested reviewers

  • paul-tharun

Poem

🐰 I hopped through graphs and vectors new,

I fetched schemas, sampled, and knew,
Dense and sparse now dance as one,
Org-scoped queries get their run,
Tiny rabbit, big refactor — done! 🥕


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 20

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (15)
chat-server/tests/performance_tools/performance_tracker.py (1)

40-82: Unused test_cases parameter detected.

The test_cases parameter is accepted by save_run but is never used in the method body. The run_data dictionary includes results_file but omits test_cases.

If test cases should be persisted with the run data, add them to run_data. Otherwise, consider removing the unused parameter.

🔧 Option A: Include test_cases in run_data
         run_data = {
             "run_id": run_id,
             "timestamp": timestamp,
             "model_config": model_config,
             "notes": notes,
             "summary": {
                 "total_tests": total,
                 "scores": scores,
                 "avg_score": round(score_stats.get("average", 0), 2),
                 "median_score": round(score_stats.get("median", 0), 2),
                 "min_score": round(score_stats.get("min", 0), 2),
                 "max_score": round(score_stats.get("max", 0), 2),
                 "errors": errors,
                 "avg_request_time": round(timing_stats.get("avg_request_time", 0), 2),
                 "median_request_time": round(timing_stats.get("median_request_time", 0), 2),
                 "min_request_time": round(timing_stats.get("min_request_time", 0), 2),
                 "max_request_time": round(timing_stats.get("max_request_time", 0), 2),
                 "total_time": round(timing_stats.get("total_time", 0), 2),
             },
             "results_file": results_file,
+            "test_cases": test_cases,
         }
🔧 Option B: Remove the unused parameter
     def save_run(
         self,
         model_config: dict[str, str],
         summary: dict[str, Any],
         notes: str = "",
-        test_cases: list[dict[str, Any]] | None = None,
         results_file: str = "",
     ) -> str:
chat-server/app/workflow/agent/node/multi_dataset.py (1)

80-87: Fix incorrect state field reference: use previous_sql_queries not relevant_sql_queries.

Line 86 reads from state.get("relevant_sql_queries", []), but AgentState only defines previous_sql_queries (line 26 of agent/types.py). This causes SQL history to always be empty, breaking multi-dataset reasoning. Change to state.get("previous_sql_queries", []).

The same bug exists in single_dataset.py:74.

chat-server/tests/scripts/replicate_prod_to_local.py (1)

576-596: Update the example JSON file path to use an actual file or wildcard pattern.

The example path tests/chat_server_tests/output/golden_dataset.json does not exist. Actual golden dataset files generated by the test tools use timestamp suffixes (e.g., golden_dataset_20250806_010759.json). Either update the example to reference an actual generated file, use a wildcard pattern like golden_dataset_*.json, or clarify that this is a placeholder path. The output/ directory structure also doesn't currently exist in the repository.

chat-server/app/workflow/agent/node/single_dataset.py (1)

70-75: Fix state key mismatch: use previous_sql_queries instead of relevant_sql_queries

Line 74 attempts to retrieve from state.get("relevant_sql_queries", []), but the AgentState defines the key as previous_sql_queries (set by context_processor). The relevant_sql_queries key exists only in the visualize_data_graph InputState. Using the wrong key means the agent context is lost unless the default empty list masks the issue.

-        "prev_sql_queries": state.get("relevant_sql_queries", []),
+        "prev_sql_queries": state.get("previous_sql_queries", []),

Note: The same issue exists in chat-server/app/workflow/agent/node/multi_dataset.py:86.

chat-server/pyproject.toml (1)

10-30: Move pre-commit to the dev dependency group.

Pre-commit is a development tool and should be in the [dependency-groups] dev section (which currently contains only test tools), not in the main dependencies. The langgraph 1.0.6 APIs (ToolNode, Command, StateGraph, add_messages) are actively used throughout the codebase and appear compatible with the version bump.

Diff to move pre-commit to dev
dependencies = [
  "aioboto3>=14.3.0",
  "beautifulsoup4>=4.14.2",
  "altair>=5.5.0",
  "dspy==3.1.0",
  "e2b-code-interpreter>=1.5.1",
  "fastapi[standard]>=0.115.12",
  "langchain>=0.3.21",
  "langchain-community>=0.3.20",
  "langchain-openai>=0.3.9",
  "langchain-qdrant>=0.2.0",
  "langgraph>=1.0.6",
  "langgraph-cli[inmem]>=0.3.3",
  "langsmith>=0.3.42",
  "logging>=0.4.9.6",
  "model2vec>=0.7.0",
  "pillow>=12.0.0",
  "portkey-ai>=1.11.1",
- "pre-commit>=4.2.0",
  "pydantic>=2.10.6",
  "qdrant-client>=1.13.3",
  "vega-datasets>=0.9.0",
  "vl-convert-python>=1.8.0",
]

[dependency-groups]
dev = [
  "pytest>=8.4.1",
  "pytest-asyncio>=0.24.0",
  "pytest-cache>=1.0",
  "pytest-sugar>=1.0.0",
  "pytest-testmon>=2.1.3",
  "pytest-xdist>=3.8.0",
+ "pre-commit>=4.2.0",
]
chat-server/app/tool_utils/tool_node.py (1)

26-64: Ensure tool metadata is always applied regardless of config type.

The current code returns config unchanged if it's not a dict (line 64), meaning tool_text, tool_category, and should_display_tool are dropped when config is None or non-dict. This can occur in langgraph's ToolNode execution flow. Fix by defaulting to {} when config is not a dict and always merging tool_config.

🛠️ Proposed fix
-        if isinstance(config, dict):
-            merged_config: RunnableConfig = config.copy()
+        base_config: RunnableConfig = config if isinstance(config, dict) else {}
+        merged_config: RunnableConfig = base_config.copy()
             if "tags" in tool_config:
                 existing_tags = merged_config.get("tags", [])
                 if not isinstance(existing_tags, list):
                     existing_tags = [existing_tags] if existing_tags else []
                 merged_config["tags"] = existing_tags + tool_config["tags"]

             if "metadata" in tool_config:
                 existing_metadata = merged_config.get("metadata", {})
                 if not isinstance(existing_metadata, dict):
                     existing_metadata = {}
                 merged_config["metadata"] = {**existing_metadata, **tool_config["metadata"]}

-            return merged_config
-
-        return config
+        return merged_config
chat-server/app/tool_utils/tools/get_table_schema.py (2)

69-73: Filter uses OR (should) instead of AND (must) logic.

When both dataset_ids and org_id filters are present, using should combines them with OR semantics—returning records that match any dataset ID or the org ID. This likely isn't the intended behavior; you probably want records that match the specified dataset IDs and belong to the organization.

🐛 Proposed fix to use AND logic
         if filter_conditions:
             search_result = await client.scroll(
                 collection_name=settings.QDRANT_COLLECTION,
-                scroll_filter=Filter(should=filter_conditions),
+                scroll_filter=Filter(must=filter_conditions),
             )

80-86: Exception type mismatch: catching JSONDecodeError but code uses Pydantic.

DatasetSchema(**metadata) performs Pydantic validation, which raises ValidationError, not json.JSONDecodeError. The current handler won't catch actual validation errors.

🐛 Proposed fix
+from pydantic import ValidationError
...
                     try:
                         metadata = payload.get("metadata", {})
                         dataset_schema = DatasetSchema(**metadata)
                         schemas.append(dataset_schema.format_for_prompt())
-                    except json.JSONDecodeError as e:
+                    except (json.JSONDecodeError, ValidationError) as e:
                         logger.exception(f"Error parsing schema JSON: {e}")
                         continue
chat-server/tests/unit/test_schema_search.py (1)

153-154: Inconsistent mock signature will cause test failure.

This fake_get_vector_store only accepts embeddings, while all other test cases in this file were updated to accept both embeddings=None and collection_name=None. If the real get_vector_store is called with collection_name, this mock will raise a TypeError.

🐛 Proposed fix
-    def fake_get_vector_store(embeddings):
+    def fake_get_vector_store(embeddings=None, collection_name=None):
         return object()
chat-server/app/services/qdrant/schema_search.py (1)

18-61: Fix tenant isolation: move org_id to must filter.

Using org_id in should allows results to match by project/dataset ID alone, ignoring the organization context. This creates a cross-tenant data leak risk. With OR semantics, a schema from a different organization with the same project ID could be returned. Additionally, org_id is often not provided to these functions (callers in semantic_search.py, context_processor.py, and identify_datasets.py omit it), resulting in no tenant filtering in those cases.

Separate org_id into a must condition to enforce tenant isolation:

Proposed fix
-        if org_id:
-            filter_conditions.append(
-                models.FieldCondition(
-                    key="metadata.org_id",
-                    match=models.MatchValue(value=org_id),
-                )
-            )
+        must_conditions = []
+        if org_id:
+            must_conditions.append(
+                models.FieldCondition(
+                    key="metadata.org_id",
+                    match=models.MatchValue(value=org_id),
+                )
+            )
 
-        if filter_conditions:
-            query_filter = models.Filter(should=filter_conditions)
+        if filter_conditions or must_conditions:
+            query_filter = models.Filter(
+                should=filter_conditions or None,
+                must=must_conditions or None,
+            )

Apply the same fix to get_schema_from_qdrant, get_schema_by_dataset_ids, and get_project_schemas in get_schema.py.
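The must/should distinction can be illustrated with a tiny evaluator of Qdrant-style filter semantics (a simplification; real Qdrant filters support nesting and more condition types):

```python
def matches(point: dict, must: list[tuple], should: list[tuple]) -> bool:
    """Qdrant-style semantics: all `must` conditions must hold, and when
    `should` is non-empty, at least one `should` condition must hold."""
    must_ok = all(point.get(key) == value for key, value in must)
    should_ok = not should or any(point.get(key) == value for key, value in should)
    return must_ok and should_ok


# A point from another org still matches a should-only filter (the leak):
other_org = {"org_id": "org-b", "project_id": "p1"}
leaky = matches(other_org, must=[], should=[("org_id", "org-a"), ("project_id", "p1")])
# Moving org_id into `must` enforces tenant isolation:
safe = matches(other_org, must=[("org_id", "org-a")], should=[("project_id", "p1")])
```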

chat-server/tests/unit/test_column_value_matching.py (1)

71-79: Duplicate dictionary key "error" - only the second definition is used.

The dummy logger has "error" defined twice (lines 75 and 77). Python dictionaries only keep the last value for duplicate keys, so this is redundant. While it doesn't affect test behavior (both are no-op lambdas), it should be cleaned up.

🧹 Proposed fix
     dummy_logger = type(
         "L",
         (),
         {
             "error": lambda *args, **kwargs: None,
             "debug": lambda *args, **kwargs: None,
-            "error": lambda *args, **kwargs: None,
         },
     )()
chat-server/app/services/qdrant/get_schema.py (1)

100-112: Filter logic may not behave as intended for multi-dataset + org_id queries.

Using Filter(should=filter_conditions) applies OR logic across all conditions. When org_id is provided alongside multiple dataset_ids, this will match records where dataset_id matches ANY of the IDs OR org_id matches—potentially returning datasets from other organizations.

If the intent is to filter datasets within a specific organization, consider using nested filters:

🐛 Proposed fix for AND semantics between dataset_ids (OR) and org_id
+        dataset_conditions = [
+            FieldCondition(
+                key="metadata.dataset_id",
+                match=MatchValue(value=dataset_id),
+            )
+            for dataset_id in dataset_ids
+        ]
+
+        filter_obj = Filter(should=dataset_conditions)
+
         if org_id:
-            filter_conditions.append(
-                FieldCondition(
-                    key="metadata.org_id",
-                    match=MatchValue(value=org_id),
-                )
+            filter_obj = Filter(
+                must=[
+                    Filter(should=dataset_conditions),
+                    FieldCondition(
+                        key="metadata.org_id",
+                        match=MatchValue(value=org_id),
+                    ),
+                ]
             )

         search_result = await client.scroll(
             collection_name=settings.QDRANT_COLLECTION,
-            scroll_filter=Filter(should=filter_conditions),
+            scroll_filter=filter_obj,
             limit=len(dataset_ids),
         )
chat-server/app/tool_utils/tools/list_datasets.py (3)

79-85: Mutable default arguments and type hint issue.

Line 82-83: Using mutable default arguments (list[str] = []) is a Python antipattern that can lead to unexpected behavior if the list is mutated. Line 84: The config parameter should use Optional[RunnableConfig] type hint for consistency.

🐛 Proposed fix
 @tool
 async def get_all_datasets(
     status_message: str = "",
-    project_ids: list[str] = [],
-    dataset_ids: list[str] = [],
-    config: RunnableConfig = None,
+    project_ids: list[str] | None = None,
+    dataset_ids: list[str] | None = None,
+    config: RunnableConfig | None = None,
 ) -> str:

Then update the usage inside the function:

-    if not project_ids and not dataset_ids:
+    if not project_ids and not dataset_ids:
         ...
 
-    if project_ids:
+    if project_ids:
         dataset_names = await get_dataset_names_from_project_ids(project_ids, org_id=org_id)

38-52: Missing exception handling for network errors.

Unlike get_dataset_names_from_project_ids which has a try-except block (lines 27-28), this function lacks exception handling for network failures. This inconsistency could cause unhandled exceptions to propagate.

🐛 Proposed fix
 async def get_project_ids_for_datasets_ids(
     dataset_ids: list[str],
     org_id: Optional[str] = None,
 ) -> dict[str, str]:
     dataset_id_project_map = {}
     client = GopieClient(org_id=org_id)
     for dataset_id in dataset_ids:
-        path = f"/v1/api/datasets/{dataset_id}/project"
-        async with await client.get(path, ssl=False) as response:
-            if response.status == 200:
-                dataset_data = await response.json()
-                dataset_id_project_map[dataset_id] = dataset_data.get("project_id", "")
-            else:
-                logger.warning(f"Dataset {dataset_id} not found")
+        try:
+            path = f"/v1/api/datasets/{dataset_id}/project"
+            async with await client.get(path, ssl=False) as response:
+                if response.status == 200:
+                    dataset_data = await response.json()
+                    dataset_id_project_map[dataset_id] = dataset_data.get("project_id", "")
+                else:
+                    logger.warning(f"Dataset {dataset_id} not found")
+        except Exception as e:
+            logger.exception(f"Error fetching project for dataset {dataset_id}: {e}")
     return dataset_id_project_map

55-76: Missing exception handling (same as above).

This function also lacks the try-except block present in get_dataset_names_from_project_ids. Consider adding consistent error handling.

🤖 Fix all issues with AI agents
In `@chat-server/app/services/gopie/client.py`:
- Around line 51-97: The get/post methods (get and post in client.py) currently
lack timeouts and can hang; add a GOPIE_API_TIMEOUT integer setting (following
E2B_TIMEOUT pattern) to config/settings and use aiohttp.ClientTimeout with that
value when making outbound calls: either configure the SingletonAiohttp session
(_session) with ClientTimeout(total=GOPIE_API_TIMEOUT) on creation or pass
timeout=aiohttp.ClientTimeout(total=GOPIE_API_TIMEOUT) to self._session.get(...)
and self._session.post(...); update the client to read the GOPIE_API_TIMEOUT
setting and apply it so get/post never send requests without an explicit
timeout.
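The timeout wiring described above can be sketched roughly as below; `get_api_timeout`, the `GOPIE_API_TIMEOUT` environment lookup, and the 30-second default are assumptions for illustration, and the `aiohttp.ClientTimeout` wiring is shown only in comments since it depends on how the singleton session is created:

```python
import os

def get_api_timeout(default: int = 30) -> int:
    """Read GOPIE_API_TIMEOUT (seconds) from the environment, falling back to a default."""
    raw = os.environ.get("GOPIE_API_TIMEOUT", "")
    try:
        return int(raw) if raw else default
    except ValueError:
        # Malformed value: fall back rather than crash at startup
        return default

# The client would then apply it to every outbound call, e.g. (hypothetical wiring):
#   timeout = aiohttp.ClientTimeout(total=get_api_timeout())
#   session = aiohttp.ClientSession(timeout=timeout)
```

Configuring the timeout on the session (rather than per request) keeps `get`/`post` call sites unchanged.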

In `@chat-server/app/services/gopie/dataset_info.py`:
- Around line 11-31: The async context manager usage in get_dataset_info (and
similarly in get_project_info) incorrectly uses "async with await
client.get(path)"; remove the explicit await and use "async with
client.get(path)" so the coroutine returned by client.get is awaited by the
context manager itself; update both functions to change the context manager
lines to use client.get(path) without await and keep the inner await
response.json() as-is to preserve reading the body.

In `@chat-server/app/services/qdrant/get_schema.py`:
- Around line 47-55: The code assumes search_result is always set but if
filter_conditions is empty the client.scroll branch won't run and search_result
is unbound; update the function containing this logic (the block using
filter_conditions, search_result, client.scroll and settings.QDRANT_COLLECTION)
to either (a) early-return None when filter_conditions is empty/invalid (e.g.,
when dataset_id is falsy and org_id is None) or (b) initialize search_result to
a safe default before the if (e.g., None) and add a guard before accessing
search_result[0]; pick one approach and apply it consistently so that the
subsequent check "if not search_result[0] or not search_result[0][0]" cannot
raise UnboundLocalError.

In `@chat-server/app/tool_utils/tools/plan_sql_query.py`:
- Around line 39-41: The plan_sql_query function calls
get_prompt_llm_chain("generate_sql", config) but omits schema enforcement;
update that call in plan_sql_query to pass schema=PlanQueryOutput so the LLM
response is validated into the PlanQueryOutput dataclass, and update the
plan_sql_query docstring to describe the actual return contract: a
PlanQueryOutput containing sql_queries (list of SqlQueryOutput objects which
include nested tables_used), non_sql_response, user_friendly_response, and
limitations (instead of the old top-level reasoning/expected_result/tables_used
fields).

In `@chat-server/app/utils/graph_utils/column_value_matching.py`:
- Around line 21-65: The _build_ilike_query function (and related query
builders) currently interpolate table_name and column_name directly into SQL,
creating SQL injection risk; update match_column_values() to validate both table
and column against a whitelist of actual schema identifiers (e.g., query the
database schema or use a precomputed set) and only pass validated names into
_build_ilike_query, and likewise change get_table_estimated_size() to require
validated table_name before interpolation; ensure validation rejects or
normalizes any identifier that is not an exact match to the known table/column
names and fail fast rather than interpolating untrusted values.
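The fail-fast whitelist check described above could be sketched like this; `validate_identifier` and the regex are hypothetical names, not part of the codebase, and the `allowed` set would come from the actual schema metadata:

```python
import re

# Only plain SQL identifiers: letter/underscore start, then letters/digits/underscores
_IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def validate_identifier(name: str, allowed: set) -> str:
    """Reject any identifier that is not an exact match to a known schema name."""
    if not _IDENT_RE.match(name) or name not in allowed:
        raise ValueError(f"Unknown or invalid SQL identifier: {name!r}")
    return name
```

Callers such as `match_column_values()` would validate both table and column names before passing them to `_build_ilike_query`, so untrusted values never reach string interpolation.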

In `@chat-server/app/utils/graph_utils/table_utils.py`:
- Around line 45-46: calculate_sampling_percentage currently divides by
estimated_size and will raise ZeroDivisionError when estimated_size is 0; update
the function (calculate_sampling_percentage) to guard against estimated_size <=
0 by returning a safe formatted value instead of dividing (e.g., return
"100.000000" to indicate full sampling) and ensure the computed sample_pct is
clamped between 0 and 100 when using settings.TARGET_ROWS / estimated_size to
avoid extremes; locate the logic that uses settings.TARGET_ROWS and sample_pct
and add the conditional check and clamping there.
- Around line 18-22: The code directly interpolates table_name into size_query
creating an SQL injection risk and also can divide by zero in
calculate_sampling_percentage; to fix, validate/sanitize table_name before
building size_query (e.g., allow only /^[A-Za-z0-9_]+$/ or check against a
whitelist) and reject or raise a clear error for invalid names instead of
interpolating raw input (this change touches the size_query construction and
callers that pass table_name and can optionally use/extend execute_sql to accept
safe parameters later); also add a zero-check in calculate_sampling_percentage
to handle estimated_size == 0 (return 0 sampling percentage or raise a
controlled error) before performing any division.
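A minimal sketch of the guarded and clamped calculation; the 50,000-row target default and the six-decimal string format are assumptions standing in for `settings.TARGET_ROWS` and the module's actual formatting:

```python
def calculate_sampling_percentage(estimated_size: int, target_rows: int = 50_000) -> str:
    """Return a sampling percentage string, guarding against zero/negative sizes."""
    if estimated_size <= 0:
        # Nothing reliable to divide by: fall back to full sampling
        return "100.000000"
    pct = (target_rows / estimated_size) * 100.0
    pct = min(max(pct, 0.0), 100.0)  # clamp to a valid SAMPLE percentage
    return f"{pct:.6f}"
```

With the clamp in place, tiny tables no longer produce percentages above 100 and empty tables no longer raise `ZeroDivisionError`.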

In `@chat-server/app/workflow/graph/multi_dataset_graph/node/analyze_dataset.py`:
- Around line 47-49: The indexing of subqueries using subquery_index is unsafe:
before assigning current_query = subqueries[state.get("subquery_index", 0)],
validate that subqueries is a non-empty list and that
state.get("subquery_index") is an int within range; if not, fallback to a safe
default (e.g., None or the first element) and handle that downstream. Update the
code around subqueries, state.get("subquery_index"), and current_query to
perform a type-and-range check (or coerce the index) and choose a safe fallback
to avoid IndexError/TypeError.

In `@chat-server/app/workflow/graph/multi_dataset_graph/node/sql_agent.py`:
- Line 31: The code reads query_result.subqueries[query_index].retry_count
without checking that subqueries exists and query_index is in range; add a
defensive check before access in the function/method where this line appears
(use query_result, subqueries, query_index and retry_count as references):
verify that query_result.subqueries is truthy and that 0 <= query_index <
len(query_result.subqueries) and only then read .retry_count, otherwise assign a
safe default (e.g., 0) or handle the out-of-bounds case (raise a clear error or
skip processing) so an IndexError cannot occur.

In `@chat-server/app/workflow/graph/nl_to_sql_graph/node/semantic_search.py`:
- Around line 8-20: The semantic_search function calls search_schemas without
scoping by organization; extract org_id from the RunnableConfig metadata (e.g.,
config.metadata.get("org_id") or the equivalent pattern used in execute_query.py
/ analyze_dataset.py) and pass it into the search_schemas call as the org_id
parameter so search_schemas(user_query=..., embeddings=..., dataset_ids=...,
project_ids=..., org_id=org_id). Ensure you use the same metadata access pattern
as get_model_provider and other workflow nodes to avoid KeyError when metadata
is absent.
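The metadata access pattern could be sketched as below; `get_org_id` is a hypothetical helper, and the config shape mirrors LangChain's `RunnableConfig` dict where `metadata` may be missing or `None`:

```python
from typing import Optional

def get_org_id(config: Optional[dict]) -> Optional[str]:
    """Extract org_id from a RunnableConfig-style dict without raising when metadata is absent."""
    if not config:
        return None
    # `or {}` guards against metadata being present but None
    return (config.get("metadata") or {}).get("org_id")
```

`semantic_search` would then pass `org_id=get_org_id(config)` into `search_schemas`, keeping tenant scoping consistent with the other workflow nodes.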

In `@chat-server/app/workflow/graph/nl_to_sql_graph/node/sql_agent.py`:
- Around line 35-38: The schema lookup is missing the org_id, so change the
get_schema_from_qdrant call to pass the tenant id (e.g.,
get_schema_from_qdrant(dataset_id=dataset_id, org_id=org_id)); ensure org_id is
obtained from the current context/inputs (where this node receives tenant info)
or added to the function signature if absent, and propagate it alongside
dataset_ids to avoid cross-tenant schema leakage while keeping the same return
handling when schema is None.

In `@chat-server/app/workflow/graph/single_dataset_graph/graph.py`:
- Around line 40-42: The graph currently always routes from prepare_data to
sql_agent even when prepare_data returns no dataset_info; update the graph
and/or agents to handle that case: add a conditional edge from "prepare_data" to
an error handler node (e.g. "prepare_error" or "validate_result") and only add
the edge prepare_data -> "sql_agent" when dataset_info is present, or modify
sql_agent to validate dataset_info (the variable name dataset_info) before
calling sql_planning_agent and route to the error handler if missing; touch the
graph_builder edges around START, "prepare_data", "sql_agent", "execute_sql",
and "validate_result" and the sql_agent/sql_planning_agent dataset_info check to
ensure safe conditional routing.
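The conditional routing could look roughly like this; `route_after_prepare` and the `"validate_result"` fallback target are illustrative names, and the graph wiring is shown only as a comment since it depends on the builder's actual node names:

```python
def route_after_prepare(state: dict) -> str:
    """Route to sql_agent only when prepare_data produced dataset_info; else surface an error path."""
    return "sql_agent" if state.get("dataset_info") else "validate_result"

# Hypothetical wiring in graph.py:
#   graph_builder.add_conditional_edges("prepare_data", route_after_prepare)
```

This keeps `sql_agent` free of defensive checks: by the time it runs, `dataset_info` is guaranteed present.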

In `@chat-server/app/workflow/graph/single_dataset_graph/node/execute_sql.py`:
- Around line 47-52: The loop over
query_result.single_dataset_query_result.sql_queries calls
execute_sql_with_full_and_truncated without passing the tenant/org id, causing
org_id to be None; update the call in execute_sql.py to pass the current org id
(e.g., org_id variable or attribute available in the surrounding scope) into
execute_sql_with_full_and_truncated(query=sql_info.sql_query, org_id=org_id) and
ensure sql_info.sql_query_result is assigned the truncated_result_data as before
so tenant isolation is preserved.
- Around line 21-35: The guard incorrectly checks isinstance(last_message,
ErrorMessage) while last_message is a list; change how messages are read so you
inspect the actual last message element: fetch messages = state.get("messages",
[]) and set last_message = messages[-1] if messages else None, then update the
conditional to check if non_sql_response is present or isinstance(last_message,
ErrorMessage) and raise/return accordingly (use the existing ErrorMessage class
and the query_result variable names shown).
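A small sketch of the corrected message access (`get_last_message` is a hypothetical helper, not an existing function):

```python
def get_last_message(state: dict):
    """Return the last element of state["messages"], or None when the list is empty or absent."""
    messages = state.get("messages") or []
    return messages[-1] if messages else None
```

The `isinstance(last_message, ErrorMessage)` check then inspects an actual message object instead of the whole list, which can never be an `ErrorMessage`.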

In `@chat-server/app/workflow/graph/single_dataset_graph/node/prepare_data.py`:
- Around line 106-109: The code builds sample_data_query via f-string with
dataset_name, which allows SQL injection; instead, stop interpolating raw
dataset_name into SQL in prepare_data.py and use a safe approach when calling
execute_sql_with_limit: either pass a parameterized query where possible (use
placeholders for values) and supply parameters to execute_sql_with_limit, or
(for SQL identifiers) validate/sanitize dataset_name against a strict pattern
(e.g., /^[A-Za-z0-9_]+$/) or use the DB driver's identifier-quoting mechanism
before injecting it; update the sample_data_query construction and the call to
execute_sql_with_limit (referencing sample_data_query, dataset_name and
execute_sql_with_limit) so only a validated/quoted identifier is used and all
dynamic values are passed as parameters, then keep using convert_rows_to_csv on
the returned rows.

In `@chat-server/app/workflow/graph/single_dataset_graph/node/routing.py`:
- Around line 5-10: The code sets query_result = state.get("query_result", [])
which can be a list and causes attribute access errors; change the default to
None (query_result = state.get("query_result")) and ensure you only access
query_result.single_dataset_query_result when query_result is truthy, e.g. set
single_dataset_res = query_result.single_dataset_query_result if query_result
else None, then compute sql_queries from single_dataset_res only when it and its
sql_queries exist (keeping the existing list comprehension for [sq.sql_query for
sq in single_dataset_res.sql_queries if sq.sql_query]).

In `@chat-server/app/workflow/graph/single_dataset_graph/node/sql_agent.py`:
- Around line 23-28: The code accesses query_result.single_dataset_query_result
before verifying query_result is present, which can raise AttributeError; update
the logic in sql_agent.py so you first retrieve query_result =
state.get("query_result") and immediately check if not query_result (or
query_result is None) and raise the ValueError there, and only after that access
query_result.single_dataset_query_result to assign single_dataset_result; ensure
the check covers both query_result and single_dataset_result as intended.

In `@chat-server/tests/e2e/viz_utils/per_example_workflow.py`:
- Around line 401-543: The test scoring logic in process_example assumes a 0–10
scale (score >= 8) but the evaluator may return 0–1; detect and normalize the
score before comparing: after retrieving evaluation =
test_result.get("evaluation", {}) and score = evaluation.get("score", 0), coerce
score to a float, and if score <= 1 then scale it (e.g., score_scaled =
float(score) * 10) else keep as-is (score_scaled = float(score)); then use
passed = score_scaled >= 8 and use score_scaled in any printed messages and
result fields so the pass/fail decision matches the intended 0–10 threshold.
Ensure this change is applied inside process_example where
test_result/evaluation/score are handled.
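The normalization could be sketched as below; note the assumption that any score <= 1 is on the 0-1 scale, which misreads a literal score of exactly 1 on the 0-10 scale (an ambiguity worth documenting wherever this lands):

```python
def normalize_score(score) -> float:
    """Coerce a score to float and scale 0-1 values up to the 0-10 range."""
    s = float(score)
    return s * 10.0 if s <= 1.0 else s

def is_passing(score, threshold: float = 8.0) -> bool:
    """Apply the 0-10 pass threshold to a score on either scale."""
    return normalize_score(score) >= threshold
```

Using `normalize_score` in the printed messages and result fields keeps the reported number consistent with the pass/fail decision.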

In `@chat-server/tests/e2e/viz_utils/viz_test_case_runner.py`:
- Around line 39-46: The VizEvaluationSchema currently constrains score to 0–1
but the runner uses a 0–10 pass threshold (score >= 8); update the schema field
in VizEvaluationSchema (the score Field) to ge=0.0 and le=10.0 and change its
description to "Similarity score between 0 and 10 (higher is better)"; then
check other places that parse/validate or prompt for the score (any code that
consumes VizEvaluationSchema or performs the score >= 8 check) and ensure
prompts, validation, and test expectations consistently use the 0–10 scale.

In `@chat-server/tests/e2e/viz_utils/viz_utils.py`:
- Around line 192-201: The dict_hash function is using hashlib.sha256
incorrectly by slicing the hash object; instead ensure `serialized` is bytes
(encode when needed), pass that to `hashlib.sha256(...)`, then call
`.hexdigest()` (or `.digest()` if you want raw bytes) and slice the resulting
bytes/hex string as intended before returning; update the logic in dict_hash to
encode `serialized` when it's a str, compute the hash via
`hashlib.sha256(serialized_bytes)`, then return the proper sliced `.hexdigest()`
(or slice `.digest()` and then hexlify) so `m.hexdigest()` is called on a real
digest rather than on the hash object.
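A corrected `dict_hash` might look like this; the JSON canonicalization (sorted keys, compact separators) and the 16-character slice are assumptions about the intended behavior:

```python
import hashlib
import json

def dict_hash(d: dict, length: int = 16) -> str:
    """Stable short hash of a dict: serialize deterministically, encode, hash, slice the hex digest."""
    serialized = json.dumps(d, sort_keys=True, separators=(",", ":"))
    # hexdigest() is called on a real digest; slicing happens on the resulting string
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()[:length]
```

Sorting keys makes the hash independent of dict insertion order, which matters for cache keys.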
🟡 Minor comments (10)
chat-server/app/workflow/graph/graph_mermaid/sql_agent_graph.mmd-8-12 (1)

8-12: Connect or remove the orphan search_duckdb_docs node.

Right now it’s defined but not part of the flow, so the graph renders a floating node and misrepresents the pipeline. If it’s intended, wire it in; otherwise remove it.

✅ Example fix (wire into flow)
 __start__ --> generate_sql;
-generate_sql --> __end__;
+generate_sql --> search_duckdb_docs;
+search_duckdb_docs --> __end__;
chat-server/tests/dspy/evaluator.py-82-92 (1)

82-92: Broad exception catch masks specific failure modes that could aid debugging.

The exception handler at line 82 catches all exceptions indiscriminately, making it hard to distinguish parsing failures (recoverable) from unexpected errors. Comparable code in optimize_evaluator.py (line 28) catches specific exception types like ValueError and AttributeError. Consider narrowing the catch to the exceptions the dspy.ChainOfThought evaluation is expected to raise (e.g., ValueError, TypeError, AttributeError) so unexpected errors propagate and stay visible.

Additionally, returning evaluation_score: 0.0 silently on any failure could cause misleading test results—tests expecting high scores will correctly fail, but those expecting low scores might incorrectly pass.

chat-server/tests/dspy/optimize_evaluator.py-40-42 (1)

40-42: --no-reasoning flag and use_reasoning parameter are unused

The CLI flag and function parameter are collected and passed through but never used to control behavior: EvaluatorModule hardcodes dspy.ChainOfThought with no provision to switch to dspy.Predict. Either remove the unused parameters or implement the feature by adding a constructor parameter to EvaluatorModule and conditionally instantiating the appropriate module type.

chat-server/tests/test_config.py-33-33 (1)

33-33: Update the module docstring default to match the new URL.

The header comment still mentions port 8001.

🔧 Suggested docstring update
-    CHAT_SERVER_URL        - Chat server endpoint (default: http://localhost:8001/api/v1/chat/completions)
+    CHAT_SERVER_URL        - Chat server endpoint (default: http://localhost:8003/api/v1/chat/completions)
chat-server/tests/test_config.py-22-26 (1)

22-26: Add python-dotenv as an explicit dependency in pyproject.toml.

The import will work because python-dotenv is available as a transitive dependency in the lockfile, but it's not explicitly declared. Add it to the main dependencies list to ensure it remains available if upstream dependencies change.

chat-server/tests/e2e/utils/dataset_test_cases.py-11-26 (1)

11-26: Replace non-breaking hyphens and en dashes with regular ASCII hyphens in expected_result strings.

The file contains multiple instances of Unicode non-breaking hyphens (U+2011) and en dashes (U+2013) in expected_result strings (lines 55, 205, 235, 250, 265, 434, 494, 584, 590, 629, 644, 659, 689). These characters differ from regular hyphens (U+002D) and will cause string comparison failures if test assertions compare expected results programmatically.

Examples: division‑by‑zero, Non‑SQL, group‑by, side‑by‑side, X‑axis, Y‑axis.

Replace all instances with standard ASCII hyphens (the - character).
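A one-liner normalization pass (`normalize_dashes` is a hypothetical helper) that could be applied to the fixture strings or run once over the file:

```python
def normalize_dashes(text: str) -> str:
    """Replace non-breaking hyphens (U+2011) and en dashes (U+2013) with ASCII '-'."""
    return text.replace("\u2011", "-").replace("\u2013", "-")
```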

chat-server/app/workflow/graph/multi_dataset_graph/node/sql_agent.py-86-86 (1)

86-86: Typo: "Succesfully" should be "Successfully".

Proposed fix
-                    content="Succesfully Completed query planning step in multidataset workflow."
+                    content="Successfully completed query planning step in multi-dataset workflow."
chat-server/app/workflow/graph/multi_dataset_graph/node/sql_agent.py-21-21 (1)

21-21: Potential TypeError if subqueries is None.

If state.get("subqueries") returns None, the indexing [query_index] will raise a TypeError. The fallback "No input" only applies if the entire conditional is falsy.

Proposed fix
-    user_query = state.get("subqueries")[query_index] if state.get("subqueries") else "No input"
+    subqueries = state.get("subqueries")
+    user_query = subqueries[query_index] if subqueries else "No input"
chat-server/app/services/gopie/generate_schema.py-15-49 (1)

15-49: Remove unnecessary else clause and add defensive quoting for dataset_name.

The else block on line 37 is unnecessary since the preceding if statement returns early. The code can be simplified by removing the else and unindenting the subsequent statements.

Additionally, while dataset_name comes from the trusted Gopie API response, consider quoting it defensively to align with test patterns elsewhere in the codebase (e.g., f'SELECT * FROM "{dataset_name}"'). This provides defense-in-depth against any unexpected values.

Finally, add error handling around the execute_sql call on line 72 to provide comprehensive error messages and logging per coding guidelines.

♻️ Remove unnecessary else clause and add quoting
     if not should_use_sampling(estimated_size):
         logger.debug(
             f"[{dataset_name}] Small dataset detected ({estimated_size} rows). "
             "Using standard nested query logic."
         )
         return f"""
         SELECT DISTINCT * FROM (
-            SELECT * FROM {dataset_name} LIMIT 200000
+            SELECT * FROM "{dataset_name}" LIMIT 200000
         )
         LIMIT {limit}
         """
-    else:
-        pct_str = calculate_sampling_percentage(estimated_size)
+    pct_str = calculate_sampling_percentage(estimated_size)

-        logger.debug(
-            f"[{dataset_name}] Large dataset detected ({estimated_size} rows). "
-            f"Sampling {pct_str}% (system) to retrieve approx {settings.TARGET_ROWS} rows."
-        )
+    logger.debug(
+        f"[{dataset_name}] Large dataset detected ({estimated_size} rows). "
+        f"Sampling {pct_str}% (system) to retrieve approx {settings.TARGET_ROWS} rows."
+    )

-        return f"""
-        SELECT DISTINCT * FROM {dataset_name}
-        USING SAMPLE {pct_str}% (system)
-        LIMIT {limit}
-        """
+    return f"""
+    SELECT DISTINCT * FROM "{dataset_name}"
+    USING SAMPLE {pct_str}% (system)
+    LIMIT {limit}
+    """
chat-server/tests/e2e/viz_utils/per_example_workflow.py-181-238 (1)

181-238: Quote dataset names in sample SQL to avoid syntax errors/injection.

dataset_name is interpolated directly into SQL without quoting. If the name contains spaces/reserved words or quotes, this will fail (and can be abused if names aren’t sanitized).

✅ Safer quoting
-            body = {"query": f"SELECT * FROM {dataset_name} LIMIT 5"}
+            safe_name = dataset_name.replace('"', '""')
+            body = {"query": f'SELECT * FROM "{safe_name}" LIMIT 5'}
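The quoting pattern from the diff can be factored into a small helper (`quote_ident` is a hypothetical name) that follows standard SQL identifier escaping, doubling any embedded double quotes:

```python
def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping embedded double quotes by doubling them."""
    return '"' + name.replace('"', '""') + '"'

# body = {"query": f"SELECT * FROM {quote_ident(dataset_name)} LIMIT 5"}
```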

Comment thread chat-server/app/services/gopie/dataset_info.py
Comment on lines +11 to +31
 async def get_dataset_info(
     dataset_id: str,
     project_id: str,
     org_id: Optional[str] = None,
 ) -> DatasetDetails:
     """Get dataset information from Gopie API.

     Args:
         dataset_id: The dataset ID
         project_id: The project ID
         org_id: Optional organization ID for multi-tenant support

     Returns:
         DatasetDetails object
     """
-    url = f"{settings.GOPIE_API_ENDPOINT}/v1/api/projects/{project_id}/datasets/{dataset_id}"
-    headers = {"accept": "application/json"}
-    try:
-        async with http_session.get(url, headers=headers) as response:
-            data = await response.json()
-            return DatasetDetails(**data)
-    except Exception as e:
-        logger.exception(f"Error getting dataset info: {e!s}")
-        raise e
+    client = GopieClient(org_id=org_id)
+    path = f"/v1/api/projects/{project_id}/datasets/{dataset_id}"
+
+    async with await client.get(path) as response:
+        data = await response.json()
+        return DatasetDetails(**data)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Repository: factly/gopie

🏁 Scripts executed:

#!/bin/bash
# Find and inspect GopieClient.get method
find chat-server -name "*.py" -type f | xargs rg -l "class GopieClient" | head -5

# Read the GopieClient implementation
cat -n chat-server/app/services/gopie/client.py

# Also check lines 34-52 in dataset_info.py to see other affected functions
sed -n '30,55p' chat-server/app/services/gopie/dataset_info.py


Remove explicit await from async context manager pattern.

async with await client.get(path) is non-standard for aiohttp. The async context manager protocol automatically awaits the coroutine; use async with client.get(path) instead. This applies to both get_dataset_info and get_project_info functions.

🔧 Fix
-    async with await client.get(path) as response:
+    async with client.get(path) as response:
         data = await response.json()
🤖 Prompt for AI Agents
In `@chat-server/app/services/gopie/dataset_info.py` around lines 11 - 31, The
async context manager usage in get_dataset_info (and similarly in
get_project_info) incorrectly uses "async with await client.get(path)"; remove
the explicit await and use "async with client.get(path)" so the coroutine
returned by client.get is awaited by the context manager itself; update both
functions to change the context manager lines to use client.get(path) without
await and keep the inner await response.json() as-is to preserve reading the
body.

Comment on lines 47 to 55
         if filter_conditions:
             search_result = await client.scroll(
                 collection_name=settings.QDRANT_COLLECTION,
                 scroll_filter=Filter(should=filter_conditions),
                 limit=1,
             )

-        if not search_result[0][0]:
+        if not search_result[0] or not search_result[0][0]:
             return None

⚠️ Potential issue | 🟠 Major

Potential UnboundLocalError when filter_conditions is empty.

If dataset_id is falsy and org_id is None, filter_conditions will be empty, the if filter_conditions: block won't execute, and search_result will be unbound when accessed on line 54.

🐛 Proposed fix
         if filter_conditions:
             search_result = await client.scroll(
                 collection_name=settings.QDRANT_COLLECTION,
                 scroll_filter=Filter(should=filter_conditions),
                 limit=1,
             )
+        else:
+            return None

         if not search_result[0] or not search_result[0][0]:
             return None

Alternatively, add early validation:

+    if not dataset_id:
+        logger.warning("get_schema_from_qdrant called without dataset_id")
+        return None
+
     try:
         client = await QdrantSetup.get_async_client(settings.QDRANT_COLLECTION)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
         if filter_conditions:
             search_result = await client.scroll(
                 collection_name=settings.QDRANT_COLLECTION,
                 scroll_filter=Filter(should=filter_conditions),
                 limit=1,
             )
+        else:
+            return None

         if not search_result[0] or not search_result[0][0]:
             return None
🤖 Prompt for AI Agents
In `@chat-server/app/services/qdrant/get_schema.py` around lines 47 - 55, The code
assumes search_result is always set but if filter_conditions is empty the
client.scroll branch won't run and search_result is unbound; update the function
containing this logic (the block using filter_conditions, search_result,
client.scroll and settings.QDRANT_COLLECTION) to either (a) early-return None
when filter_conditions is empty/invalid (e.g., when dataset_id is falsy and
org_id is None) or (b) initialize search_result to a safe default before the if
(e.g., None) and add a guard before accessing search_result[0]; pick one
approach and apply it consistently so that the subsequent check "if not
search_result[0] or not search_result[0][0]" cannot raise UnboundLocalError.

Comment on lines 39 to 41
     try:
-        chain = get_prompt_llm_chain("plan_sql_query_tool", config)
+        chain = get_prompt_llm_chain("generate_sql", config)
         response = await chain.ainvoke({"user_query": user_query, "dataset_info": dataset_info})

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Repository: factly/gopie

🏁 Scripts executed:

#!/bin/bash
# Find generate_sql prompt definition
find chat-server -type f -name "*.py" -o -name "*.json" -o -name "*.yaml" -o -name "*.yml" | head -50 | xargs grep -l "generate_sql" 2>/dev/null

# Search for generate_sql in prompts directory
fd "prompts" chat-server --type d

# Search for get_prompt_llm_chain implementation
rg -n "def get_prompt_llm_chain" chat-server --type py -A 10

# Inspect plan_sql_query to see full implementation
cat -n chat-server/app/tool_utils/tools/plan_sql_query.py

# Get full get_prompt_llm_chain implementation
sed -n '86,150p' chat-server/app/utils/langsmith/prompt_manager.py

# Search for generate_sql prompt or prompt mapping
rg -n "generate_sql" chat-server --type py -B 2 -A 5

# Search for PromptSelector or NodeName enum/mapping
rg -n "class NodeName|generate_sql" chat-server/app --type py

# Check format_generate_sql_input full signature and expected parameter types
sed -n '111,140p' chat-server/app/workflow/prompts/generate_sql_prompt.py

# Check PlanQueryOutput schema definition
rg -n "class PlanQueryOutput" chat-server --type py -A 15

# Get full PlanQueryOutput schema and SqlQueryOutput
sed -n '1,50p' chat-server/app/workflow/graph/sql_planner_graph/generate_sql.py


Add missing schema=PlanQueryOutput parameter and update tool docstring to match actual contract.

plan_sql_query calls get_prompt_llm_chain("generate_sql", config) without schema=PlanQueryOutput, causing unstructured LLM output instead of validated PlanQueryOutput. This contradicts the workflow implementation (generate_sql.py:80) which enforces the schema. Additionally, the tool's docstring documents incorrect return fields:

  • Promises: reasoning, expected_result, tables_used (top-level)
  • Actual PlanQueryOutput: sql_queries (list of SqlQueryOutput objects with nested tables_used), non_sql_response, user_friendly_response, limitations

Pass schema=PlanQueryOutput to enforce structured output and correct the docstring to match the actual schema.

🤖 Prompt for AI Agents
In `@chat-server/app/tool_utils/tools/plan_sql_query.py` around lines 39 - 41, The
plan_sql_query function calls get_prompt_llm_chain("generate_sql", config) but
omits schema enforcement; update that call in plan_sql_query to pass
schema=PlanQueryOutput so the LLM response is validated into the PlanQueryOutput
dataclass, and update the plan_sql_query docstring to describe the actual return
contract: a PlanQueryOutput containing sql_queries (list of SqlQueryOutput
objects which include nested tables_used), non_sql_response,
user_friendly_response, and limitations (instead of the old top-level
reasoning/expected_result/tables_used fields).

Comment on lines +21 to +65
def _build_ilike_query(
    table_name: str,
    column_name: str,
    escaped_value: str,
    estimated_size: int,
    limit: int = 5,
) -> str:
    """Build optimized ILIKE query based on table size.

    Args:
        table_name: Name of the table
        column_name: Name of the column to search
        escaped_value: Escaped search value
        estimated_size: Estimated row count from metadata
        limit: Number of rows to return

    Returns:
        SQL query string
    """
    where_clause = f"WHERE LOWER(CAST({column_name} AS VARCHAR)) LIKE '%' || LOWER('{escaped_value}') || '%'"

    if not should_use_sampling(estimated_size):
        return f"""
        SELECT DISTINCT {column_name}
        FROM (SELECT * FROM {table_name} LIMIT 200000)
        {where_clause}
        LIMIT {limit}
        """
    else:
        pct_str = calculate_sampling_percentage(estimated_size)

        logger.debug(
            f"[{table_name}] Large dataset detected ({estimated_size} rows) for ILIKE search. "
            f"Sampling {pct_str}% (system)."
        )

        return f"""
        SELECT DISTINCT {column_name}
        FROM (
            SELECT * FROM {table_name}
            USING SAMPLE {pct_str}% (system)
        )
        {where_clause}
        LIMIT {limit}
        """

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Repository: factly/gopie

🏁 Scripts executed:

#!/bin/bash
# First, look at the full file to understand context
cat -n chat-server/app/utils/graph_utils/column_value_matching.py | head -100

# Find who calls _build_ilike_query and related functions
rg -n "_build_ilike_query|_build_levenshtein_query|verify_fuzzy_values|find_similar_values" --type=py chat-server/app/ -A 3 -B 3

# Check if there's any schema validation or whitelisting
rg -n "table_name|column_name" --type=py chat-server/app/utils/graph_utils/ -B 2 -A 2 | head -150

# Find callers of find_similar_column_values
rg -n "find_similar_column_values" --type=py chat-server/app/ -B 2 -A 5

# Check for schema validation or whitelists before these values are used
rg -n "schema|whitelist|validate.*column|validate.*table" --type=py chat-server/app/utils/graph_utils/ -i

# Look for how database schema metadata is accessed
rg -n "get_table|get_column|table_metadata|column_metadata|schema" --type=py chat-server/app/utils/graph_utils/ -i | head -50

# Read the rest of column_value_matching.py and its public entry points
cat -n chat-server/app/utils/graph_utils/column_value_matching.py | tail -200
cat -n chat-server/app/utils/graph_utils/column_value_matching.py | head -160

# Search for imports of these functions from other modules
rg -n "from.*column_value_matching|import.*find_similar\|import.*verify_fuzzy" --type=py chat-server/app/ -A 3

# Check get_table_estimated_size to see if it validates table names
cat -n chat-server/app/utils/graph_utils/table_utils.py | head -50

# Find who calls analyze_dataset
rg -n "analyze_dataset" --type=py chat-server/app/ -B 2 -A 5 | head -100

# Check for schema discovery before these values are used
rg -n "get_columns|describe|schema.*table" --type=py chat-server/app/ -i | head -40

# Check the execute_sql implementation
cat -n chat-server/app/services/gopie/sql_executor.py | head -100


🌐 Web query:

DuckDB parameterized identifiers support table column names

💡 Result:

Short answer: No — DuckDB prepared parameters can only substitute values, not arbitrary identifiers (like column names). You can parameterize table names using query_table/query($1) or build SQL on the client side, but you cannot bind a parameter to an identifier (column name) inside a prepared statement. [1][2][3]

References:

  • Prepared statements / parameters (values only). [1]
  • query_table / query functions (allow passing a table name string). [2]
  • Identifiers / quoting rules (identifiers must be literal/quoted; not parameter placeholders). [3]

SQL injection risk: table_name and column_name are interpolated directly into queries without parameterization or validation.

While escaped_value is escaped for single quotes, the table_name and column_name parameters are interpolated directly into the SQL string via f-strings. Since these values are extracted from user-provided column_assumptions with only minimal falsy checks, an attacker could inject arbitrary SQL by crafting malicious table or column names.

DuckDB prepared statements do not support parameterized identifiers (only values), making whitelist validation against known schema elements the appropriate mitigation. Apply this validation in match_column_values() before passing these values to the query builders, or ensure the upstream data source guarantees only valid schema identifiers.

Additionally, get_table_estimated_size() in table_utils.py (line 21) has the same vulnerability when interpolating table_name.

🧰 Tools
🪛 Ruff (0.14.13)

43-48: Possible SQL injection vector through string-based query construction

(S608)


57-65: Possible SQL injection vector through string-based query construction

(S608)

🤖 Prompt for AI Agents
In `@chat-server/app/utils/graph_utils/column_value_matching.py` around lines 21 -
65, The _build_ilike_query function (and related query builders) currently
interpolate table_name and column_name directly into SQL, creating SQL injection
risk; update match_column_values() to validate both table and column against a
whitelist of actual schema identifiers (e.g., query the database schema or use a
precomputed set) and only pass validated names into _build_ilike_query, and
likewise change get_table_estimated_size() to require validated table_name
before interpolation; ensure validation rejects or normalizes any identifier
that is not an exact match to the known table/column names and fail fast rather
than interpolating untrusted values.
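A minimal sketch of the whitelist validation the comment recommends (names like `validate_identifier` and `build_ilike_query` are illustrative, not taken from the codebase):

```python
def validate_identifier(name: str, allowed: frozenset[str]) -> str:
    """Fail fast on any table/column name that is not an exact match
    to a known schema identifier, instead of interpolating it."""
    if name not in allowed:
        raise ValueError(f"Unknown identifier: {name!r}")
    return name

def build_ilike_query(table: str, column: str,
                      tables: frozenset[str], columns: frozenset[str]) -> str:
    table = validate_identifier(table, tables)
    column = validate_identifier(column, columns)
    # Only values still need escaping; identifiers are now whitelist-checked.
    return f'SELECT DISTINCT "{column}" FROM "{table}"'
```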

Comment on lines +5 to +10
query_result = state.get("query_result", [])
single_dataset_res = query_result.single_dataset_query_result

sql_queries = (
[sq.sql_query for sq in single_dataset_res.sql_queries if sq.sql_query]
if single_dataset_res and single_dataset_res.sql_queries

⚠️ Potential issue | 🟠 Major

Guard against missing query_result before attribute access.

state.get("query_result", []) can return [], and [] has no single_dataset_query_result, which will raise at runtime. Default to None and short-circuit when missing.

✅ Suggested fix
-def route_after_sql_generation(state: State) -> str:
-    query_result = state.get("query_result", [])
-    single_dataset_res = query_result.single_dataset_query_result
+def route_after_sql_generation(state: State) -> str:
+    query_result = state.get("query_result")
+    if not query_result or not getattr(query_result, "single_dataset_query_result", None):
+        return "no_sql_queries"
+    single_dataset_res = query_result.single_dataset_query_result
🤖 Prompt for AI Agents
In `@chat-server/app/workflow/graph/single_dataset_graph/node/routing.py` around
lines 5 - 10, The code sets query_result = state.get("query_result", []) which
can be a list and causes attribute access errors; change the default to None
(query_result = state.get("query_result")) and ensure you only access
query_result.single_dataset_query_result when query_result is truthy, e.g. set
single_dataset_res = query_result.single_dataset_query_result if query_result
else None, then compute sql_queries from single_dataset_res only when it and its
sql_queries exist (keeping the existing list comprehension for [sq.sql_query for
sq in single_dataset_res.sql_queries if sq.sql_query]).

Comment on lines +23 to +28
query_result = state.get("query_result")

single_dataset_result = query_result.single_dataset_query_result

if not query_result or not single_dataset_result:
raise ValueError("query_result is not properly initialized")

⚠️ Potential issue | 🟠 Major

Potential AttributeError: accessing query_result before null check.

Line 25 accesses query_result.single_dataset_query_result before line 27 validates that query_result is not None. If state.get("query_result") returns None, this will raise AttributeError instead of the intended ValueError.

🐛 Proposed fix - reorder validation
     query_result = state.get("query_result")
 
-    single_dataset_result = query_result.single_dataset_query_result
-
     if not query_result or not single_dataset_result:
+    if not query_result:
+        raise ValueError("query_result is not properly initialized")
+
+    single_dataset_result = query_result.single_dataset_query_result
+
+    if not single_dataset_result:
         raise ValueError("query_result is not properly initialized")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    query_result = state.get("query_result")
-
-    single_dataset_result = query_result.single_dataset_query_result
-
-    if not query_result or not single_dataset_result:
-        raise ValueError("query_result is not properly initialized")
+    query_result = state.get("query_result")
+
+    if not query_result:
+        raise ValueError("query_result is not properly initialized")
+
+    single_dataset_result = query_result.single_dataset_query_result
+
+    if not single_dataset_result:
+        raise ValueError("query_result is not properly initialized")
🧰 Tools
🪛 Ruff (0.14.13)

28-28: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@chat-server/app/workflow/graph/single_dataset_graph/node/sql_agent.py` around
lines 23 - 28, The code accesses query_result.single_dataset_query_result before
verifying query_result is present, which can raise AttributeError; update the
logic in sql_agent.py so you first retrieve query_result =
state.get("query_result") and immediately check if not query_result (or
query_result is None) and raise the ValueError there, and only after that access
query_result.single_dataset_query_result to assign single_dataset_result; ensure
the check covers both query_result and single_dataset_result as intended.

Comment on lines +401 to +543
async def process_example(
self, example_name: str, example_code: str, example_index: int, total_examples: int
) -> ExampleResult:
"""
Complete workflow for ONE example.

Args:
example_name: Name of the example
example_code: Python code from example
example_index: Current example number (1-indexed)
total_examples: Total number of examples

Returns:
ExampleResult with status and details
"""
print(f"\n{'='*70}")
print(f"[{example_index}/{total_examples}] Processing: {example_name}")
print(f"{'='*70}")

result = ExampleResult(example_name=example_name, status="error")

# Step 1: Execute code and generate image
print(" → Step 1: Generating chart image...")
image_path, error = self._execute_python_code(example_code)

if error or not image_path:
result.status = "skip"
result.error_message = f"Failed to generate image: {error}"
print(f" ⚠️ Skipped - {result.error_message}")
return result

print(f" ✓ Image saved: {Path(image_path).name}")
result.image_path = image_path

# Step 2: Extract dataset names
print(" → Step 2: Extracting dataset names...")
dataset_names = self._extract_dataset_names(example_code)

if not dataset_names:
result.status = "skip"
result.error_message = "No datasets found in code"
print(f" ⚠️ Skipped - {result.error_message}")
return result

print(f" ✓ Found datasets: {', '.join(dataset_names)}")
result.datasets = dataset_names

# Step 3: Find datasets in Gopie (assumes already uploaded)
print(" → Step 3: Finding datasets in Gopie...")
dataset_ids = []
for ds_name in dataset_names:
ds_id = await self._find_dataset_in_gopie(ds_name)
if ds_id:
dataset_ids.append(ds_id)
print(f" ✓ Found '{ds_name}' (ID: {ds_id[:8]}...)")
else:
print(f" ⚠️ Dataset '{ds_name}' not found in Gopie")

if not dataset_ids:
result.status = "skip"
result.error_message = "No matching datasets found in Gopie"
print(f" ⚠️ Skipped - {result.error_message}")
return result

# Step 4: Fetch schemas
print(f" → Step 4: Fetching {len(dataset_ids)} schema(s)...")
fetch_tasks = [
self._fetch_schema_for_dataset(dataset_id=did, project_id=self.project_id)
for did in dataset_ids
]
schemas_results = await asyncio.gather(*fetch_tasks, return_exceptions=True)
schemas = [s for s in schemas_results if isinstance(s, dict)]

if not schemas:
result.status = "error"
result.error_message = "Failed to fetch schemas"
print(f" ❌ {result.error_message}")
return result

print(f" ✓ Fetched {len(schemas)} schema(s)")

# Step 5: Generate test case with LLM
print(" → Step 5: Generating test case with LLM...")
try:
prompt = self._create_prompt_for_image(image_path=image_path, datasets_schema=schemas)
test_case = await self._call_llm(prompt)

unique_project_ids = {s.get("project_id", "") for s in schemas if s.get("project_id")}
unique_dataset_ids = [s.get("dataset_id", "") for s in schemas if s.get("dataset_id")]
unique_dataset_ids = [d for d in unique_dataset_ids if d]

test_case.project_id = next(iter(unique_project_ids)) if unique_project_ids else ""
test_case.dataset_id = unique_dataset_ids[0] if len(unique_dataset_ids) == 1 else ""

dataset_names_for_sql = [
s.get("dataset_name", "") for s in schemas if s.get("dataset_name")
]
test_case.sql_queries = [f'SELECT * FROM "{dn}"' for dn in dataset_names_for_sql]
test_case.image_path = image_path

result.test_case = test_case
print(f" ✓ Test case generated: {test_case.query[:60]}...")

except Exception as e:
result.status = "error"
result.error_message = f"LLM generation failed: {e}"
print(f" ❌ {result.error_message}")
return result

# Step 6: Run test case immediately
print(" → Step 6: Running test case...")
try:
case_dict = {
"project_id": test_case.project_id,
"dataset_id": test_case.dataset_id,
"query": test_case.query,
"sql_queries": test_case.sql_queries,
"image_path": test_case.image_path,
}

test_result = await run_viz_test_case(case_dict)
result.test_result = test_result

if test_result.get("success"):
evaluation = test_result.get("evaluation", {})
score = evaluation.get("score", 0)
passed = score >= 8
status_emoji = "✅" if passed else "❌"
result.status = "success" if passed else "error"
print(
f" {status_emoji} Test {'PASSED' if passed else 'FAILED'} (score: {score}/10)"
)
else:
result.status = "error"
result.error_message = test_result.get("error", "Unknown error")
print(f" ❌ Test error: {result.error_message}")

except Exception as e:
result.status = "error"
result.error_message = f"Test execution failed: {e}"
print(f" ❌ {result.error_message}")

return result

⚠️ Potential issue | 🟠 Major

Score threshold assumes a 0–10 scale but evaluator appears 0–1.

If the evaluator returns 0–1 (as in the current prompt/schema), score >= 8 will always fail. Align the scale or convert before comparison.

🔧 Option if keeping 0–1 scale
-                score = evaluation.get("score", 0)
-                passed = score >= 8
+                score = evaluation.get("score", 0)
+                passed = score >= 0.8
                 status_emoji = "✅" if passed else "❌"
                 result.status = "success" if passed else "error"
                 print(
-                    f"    {status_emoji} Test {'PASSED' if passed else 'FAILED'} (score: {score}/10)"
+                    f"    {status_emoji} Test {'PASSED' if passed else 'FAILED'} (score: {score:.2f}/1.0)"
                 )
🧰 Tools
🪛 Ruff (0.14.13)

498-498: Possible SQL injection vector through string-based query construction

(S608)


504-504: Do not catch blind exception: Exception

(BLE001)


538-538: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@chat-server/tests/e2e/viz_utils/per_example_workflow.py` around lines 401 -
543, The test scoring logic in process_example assumes a 0–10 scale (score >= 8)
but the evaluator may return 0–1; detect and normalize the score before
comparing: after retrieving evaluation = test_result.get("evaluation", {}) and
score = evaluation.get("score", 0), coerce score to a float, and if score <= 1
then scale it (e.g., score_scaled = float(score) * 10) else keep as-is
(score_scaled = float(score)); then use passed = score_scaled >= 8 and use
score_scaled in any printed messages and result fields so the pass/fail decision
matches the intended 0–10 threshold. Ensure this change is applied inside
process_example where test_result/evaluation/score are handled.
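The normalization described in the prompt could look like this sketch (the threshold and field names follow the snippet above; `normalize_score` is a hypothetical helper name):

```python
def normalize_score(raw: float) -> float:
    """Map an evaluator score onto the 0-10 scale used by the pass check;
    values at or below 1 are assumed to be on a 0-1 scale."""
    score = float(raw)
    return score * 10 if score <= 1 else score

def passed(evaluation: dict) -> bool:
    # Matches the existing `score >= 8` threshold in process_example.
    return normalize_score(evaluation.get("score", 0)) >= 8
```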

Comment on lines +39 to +46
class VizEvaluationSchema(BaseModel):
passed: bool = Field(description="Whether the generated chart matches the reference")
score: float = Field(
ge=0.0,
le=1.0,
description="Similarity score between 0 and 1 (higher is better)",
)
reasoning: str = Field(description="Short explanation of the judgment")

⚠️ Potential issue | 🟠 Major

Align evaluation score scale with the pass/fail threshold.

The schema/prompt specify a 0–1 score, but the runner treats it as 0–10 (score >= 8). This will mark nearly all cases as failed or fail schema validation. Consider standardizing on 0–10 to match existing threshold.

🔧 Align schema & prompt to 0–10
 class VizEvaluationSchema(BaseModel):
     passed: bool = Field(description="Whether the generated chart matches the reference")
     score: float = Field(
         ge=0.0,
-        le=1.0,
-        description="Similarity score between 0 and 1 (higher is better)",
+        le=10.0,
+        description="Similarity score between 0 and 10 (higher is better)",
     )
     reasoning: str = Field(description="Short explanation of the judgment")
-                    "passed (boolean), score (float 0-1), reasoning (short string)."
+                    "passed (boolean), score (float 0-10), reasoning (short string)."

Also applies to: 69-83, 314-321

🤖 Prompt for AI Agents
In `@chat-server/tests/e2e/viz_utils/viz_test_case_runner.py` around lines 39 -
46, The VizEvaluationSchema currently constrains score to 0–1 but the runner
uses a 0–10 pass threshold (score >= 8); update the schema field in
VizEvaluationSchema (the score Field) to ge=0.0 and le=10.0 and change its
description to "Similarity score between 0 and 10 (higher is better)"; then
check other places that parse/validate or prompt for the score (any code that
consumes VizEvaluationSchema or performs the score >= 8 check) and ensure
prompts, validation, and test expectations consistently use the 0–10 scale.

Comment on lines +192 to +201
def dict_hash(dct: dict[Any, Any]) -> Any:
"""Return a hash of the contents of a dictionary."""
serialized = json.dumps(dct, sort_keys=True)

try:
m = hashlib.sha256(serialized)[:32] # pyright: ignore[reportArgumentType,reportIndexIssue]
except TypeError:
m = hashlib.sha256(serialized.encode())[:32] # pyright: ignore[reportIndexIssue]

return m.hexdigest()

⚠️ Potential issue | 🔴 Critical

Bug: hashlib.sha256() returns a hash object, not bytes—slicing it won't work.

The code attempts to slice hashlib.sha256(serialized)[:32], but hashlib.sha256() returns a hash object, not bytes. This will raise a TypeError (which the code then catches and tries the encoded version, which also won't work correctly). The # pyright: ignore comments are masking the issue.

🐛 Proposed fix
 def dict_hash(dct: dict[Any, Any]) -> Any:
     """Return a hash of the contents of a dictionary."""
     serialized = json.dumps(dct, sort_keys=True)
-
-    try:
-        m = hashlib.sha256(serialized)[:32]  # pyright: ignore[reportArgumentType,reportIndexIssue]
-    except TypeError:
-        m = hashlib.sha256(serialized.encode())[:32]  # pyright: ignore[reportIndexIssue]
-
-    return m.hexdigest()
+    return hashlib.sha256(serialized.encode()).hexdigest()[:32]
🤖 Prompt for AI Agents
In `@chat-server/tests/e2e/viz_utils/viz_utils.py` around lines 192 - 201, The
dict_hash function is using hashlib.sha256 incorrectly by slicing the hash
object; instead ensure `serialized` is bytes (encode when needed), pass that to
`hashlib.sha256(...)`, then call `.hexdigest()` (or `.digest()` if you want raw
bytes) and slice the resulting bytes/hex string as intended before returning;
update the logic in dict_hash to encode `serialized` when it's a str, compute
the hash via `hashlib.sha256(serialized_bytes)`, then return the proper sliced
`.hexdigest()` (or slice `.digest()` and then hexlify) so `m.hexdigest()` is
called on a real digest rather than on the hash object.
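Applied, the corrected helper from the diff above is deterministic and order-insensitive; a quick sanity sketch:

```python
import hashlib
import json
from typing import Any

def dict_hash(dct: dict[Any, Any]) -> str:
    """Return a stable 32-character hash of a dictionary's contents."""
    serialized = json.dumps(dct, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()[:32]
```

Because keys are sorted before serialization, insertion order does not change the hash.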

Tharun Paul added 2 commits January 24, 2026 05:54
```
fix(workflow): standardize SQL query state key naming

- Changed state key from "relevant_sql_queries" to "previous_sql_queries" in multi_dataset and single_dataset agent nodes
- Changed state key from "previous_sql_queries" to "prev_sql_queries" in analyze_query node to match input parameter naming
```
@claude

claude Bot commented Jan 24, 2026

Claude finished @paul-tharun's task —— View job


PR Review: Feat/e2e datasets metadata

I've conducted a comprehensive review of this large PR with 100 files changed. Here are my findings:

Review Checklist

  • Read repository setup and guidelines (CLAUDE.md, configs)
  • Review architectural changes (new graphs, workflows)
  • Analyze API security and validation
  • Check error handling and edge cases
  • Evaluate performance implications
  • Assess test coverage
  • Provide comprehensive feedback

🏗️ Architecture & Design

Positive Changes

  • Well-structured agent architecture: New nl_to_sql_graph and sql_planner_graph follow LangGraph patterns correctly
  • Separation of concerns: Clear separation between SQL planning, execution, and validation workflows
  • Type safety: Good use of Pydantic models for API validation (FetchSqlRequest, FetchSqlResponse)
  • Unified client pattern: New GopieClient centralizes API communication with proper header management

⚠️ Architectural Concerns

  • Commented-out code in workflows: chat-server/app/workflow/graph/sql_planner_graph/graph.py:19-30 contains extensive commented-out conditional edges that should either be implemented or removed
  • Complex state management: The addition of multiple new graph types increases complexity without clear documentation of when each should be used

🔒 Security Analysis

Good Security Practices

  • Input validation: Proper Pydantic validation in new API endpoints (fetch_sql.py:21-22)
  • SQL injection prevention: Using parameterized queries and proper escaping (column_value_matching.py:40)
  • Header handling: Proper organization ID header management in GopieClient

🚨 Security Issues

  1. Insufficient input sanitization in column_value_matching.py:40:

    WHERE LOWER(CAST({column_name} AS VARCHAR)) LIKE '%' || LOWER('{escaped_value}') || '%'

    The escaped_value variable name suggests sanitization, but no actual escaping is shown in this snippet.

  2. Generic exception handling in fetch_sql.py:36-42:

    except Exception as e:
        logger.error(f"Error in fetch_sql: {e}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="Failed to generate SQL queries, please try again.",
        ) from e

    This could mask important security errors and provide too generic error messages.


🚀 Performance Analysis

Performance Improvements

  • Smart sampling: New table size estimation and sampling logic in table_utils.py prevents performance issues on large datasets
  • Connection pooling: Using singleton pattern for HTTP client connections (SingletonAiohttp)
  • Async operations: Proper use of async/await throughout new components

⚠️ Performance Concerns

  • Multiple vector searches: New DuckDB documentation search (duckdb_docs/search.py) adds another vector search operation that could impact latency
  • Memory usage: Complex state objects with nested query results could consume significant memory in high-concurrency scenarios

🧪 Testing & Quality

Testing Infrastructure

  • Good test structure: Maintained separation between unit and e2e tests
  • DSPy integration: Advanced evaluation system with optimized evaluators

Testing Gaps

  1. No tests for new API endpoint: fetch_sql.py lacks dedicated test coverage
  2. Missing edge case tests: New graph workflows need tests for failure scenarios
  3. Integration tests: No tests covering the interaction between new nl_to_sql_graph and existing workflows

🐛 Bug & Error Handling Issues

  1. State validation inconsistency in sql_agent.py:27-28:

    if not query_result or not single_dataset_result:
        raise ValueError("query_result is not properly initialized")

    This doesn't provide context about what went wrong or how to fix it.

  2. Resource cleanup: Missing cleanup logic for failed operations in new graph workflows

  3. Retry logic: Inconsistent retry mechanisms across different agents - some have configurable retry counts, others don't


📋 Code Quality

Good Practices

  • Type hints: Comprehensive type annotations throughout new code
  • Documentation: Good docstrings in new functions like get_duckdb_docs()
  • Code organization: Logical file structure following established patterns

⚠️ Areas for Improvement

  • Hardcoded values: Magic numbers like LIMIT 200000 in SQL queries should be configurable
  • Long functions: Some functions exceed 50 lines and could benefit from decomposition
  • Duplicate logic: Similar error handling patterns repeated across multiple files

🔧 Specific Recommendations

  1. Security:

    # In column_value_matching.py - add proper escaping
    def escape_sql_value(value: str) -> str:
        return value.replace("'", "''").replace("%", "\\%").replace("_", "\\_")
  2. Error handling:

    # In fetch_sql.py - more specific exception handling
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except ConnectionError as e:
        raise HTTPException(status_code=503, detail="Service unavailable")
  3. Configuration:

    # Add to config.py
    DEFAULT_QUERY_LIMIT: int = 200000
    MAX_SAMPLING_PERCENTAGE: float = 5.0
  4. Testing:

    • Add unit tests for fetch_sql_router
    • Add integration tests for new graph workflows
    • Add performance tests for large dataset scenarios
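Note that the `escape_sql_value` helper from recommendation 1 only neutralizes LIKE wildcards if the query also declares the escape character; a hedged sketch (the clause shape is illustrative, not the project's actual query):

```python
def escape_sql_value(value: str) -> str:
    """Escape single quotes plus LIKE wildcards, as suggested above."""
    return value.replace("'", "''").replace("%", "\\%").replace("_", "\\_")

def build_contains_clause(column: str, value: str) -> str:
    # Without ESCAPE '\', the backslashes would be matched literally
    # instead of disabling % and _ as wildcards.
    return f"{column} ILIKE '%' || '{escape_sql_value(value)}' || '%' ESCAPE '\\'"
```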

📊 Overall Assessment

Category Rating Comments
Architecture ⭐⭐⭐⭐⚪ Well-designed but complex
Security ⭐⭐⭐⚪⚪ Good practices with some gaps
Performance ⭐⭐⭐⭐⚪ Smart optimizations implemented
Testing ⭐⭐⚪⚪⚪ Infrastructure good, coverage lacking
Code Quality ⭐⭐⭐⭐⚪ Clean code with room for improvement

Overall: This is a substantial improvement to the system with well-thought-out architectural changes. The main concerns are security gaps, insufficient testing of new features, and some complexity that could be better documented. I recommend addressing the security issues before merging.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@chat-server/app/utils/providers/embedding_providers/vllm.py`:
- Around line 26-31: The VLLMEmbeddingProvider __init__ currently accepts but
never stores metadata and calls no super; either remove the unused metadata
parameter from VLLMEmbeddingProvider or persist it similarly to
LiteLLMEmbeddingProvider (e.g., store self.metadata and extract org-scoped keys
like user/trace_id as PortkeyEmbeddingProvider does) so downstream methods can
use it; additionally implement a fallback strategy by updating either
VLLMEmbeddingProvider (add a guarded get_embedding call that catches failures
and delegates to a configured fallback provider) or enhance
get_embedding_provider() to return a wrapper that attempts the primary provider
(VLLMEmbeddingProvider) and on exception switches to a secondary provider,
ensuring clear identifiers (VLLMEmbeddingProvider, LiteLLMEmbeddingProvider,
PortkeyEmbeddingProvider, get_embedding_provider) are used to locate and wire
the fallback logic.
🧹 Nitpick comments (5)
chat-server/app/workflow/graph/multi_dataset_graph/node/analyze_query.py (2)

71-93: Consider generating a unique ID for the fallback tool_call.

The hardcoded "id": "fallback_respond_to_user" may cause issues if any downstream code validates or indexes by tool_call_id expecting unique values. LangChain typically uses UUIDs.

♻️ Suggested improvement
+import uuid
+
 ...
             fallback = AIMessage(
                 content=getattr(response, "content", "") or "",
                 tool_calls=[
                     {
                         "name": "respond_to_user",
                         "args": {
                             "query_type": "data_query",
                             "confidence_score": 5,
                             "reasoning": "Model did not return a tool call; defaulting to full workflow.",
                             "clarification_needed": "",
                             "status_message": "Analyzing your question...",
                             "response_data": None,
                         },
-                        "id": "fallback_respond_to_user",
+                        "id": f"fallback_{uuid.uuid4().hex[:12]}",
                     }
                 ],
             )

179-185: Consider adding defensive checks for robustness.

Direct access to response.tool_calls[0]["args"] could raise IndexError or KeyError if the response structure is unexpected. While the outer try-except catches this, a defensive check would provide clearer error handling.

♻️ Suggested defensive check
 async def _handle_analysis_response(
     response: Any,
     query_result: QueryResult,
     tool_call_count: int,
     config: RunnableConfig,
 ) -> dict:
-    tool_args = response.tool_calls[0]["args"]
+    tool_calls = getattr(response, "tool_calls", None) or []
+    if not tool_calls or "args" not in tool_calls[0]:
+        logger.warning("_handle_analysis_response: missing tool_calls or args; using defaults")
+        tool_args = {}
+    else:
+        tool_args = tool_calls[0]["args"]

     query_type = tool_args.get("query_type", "conversational")
     confidence_score = tool_args.get("confidence_score", 5)
chat-server/app/core/config.py (1)

113-115: Consider surfacing these sampling knobs in env/docs.
Since TARGET_ROWS and SAMPLING_THRESHOLD are new tunables, adding them to .env.example or config docs would make them more discoverable for operators.
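A sketch of how these knobs might be surfaced via environment variables (the variable names mirror the config fields; the defaults are assumptions, not the project's actual values):

```python
def load_sampling_config(env: dict[str, str]) -> tuple[int, int]:
    """Read TARGET_ROWS and SAMPLING_THRESHOLD with assumed defaults."""
    target_rows = int(env.get("TARGET_ROWS", "200000"))
    sampling_threshold = int(env.get("SAMPLING_THRESHOLD", "1000000"))
    return target_rows, sampling_threshold
```

Passing `os.environ` at startup keeps the tunables overridable from .env without code changes.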

chat-server/app/utils/providers/embedding_providers/vllm.py (1)

12-23: Avoid clobbering or duplicating model_kwargs in embed_query.

If a caller supplies model_kwargs, this currently risks a TypeError (duplicate keyword) or silently discarding caller options. Prefer merging.

♻️ Suggested merge pattern
 class VLLMEmbeddingFunctionProvider(OpenAIEmbeddings):
     def embed_query(self, text: str, **kwargs: Any) -> list[float]:
         text = f"Instruct: {TASK_DESCRIPTION}\nQuery: {text}"
+        model_kwargs = {"extra_body": {"truncate_prompt_tokens": -1}}
+        model_kwargs |= kwargs.pop("model_kwargs", {})
         return super().embed_query(
-            text, model_kwargs={"extra_body": {"truncate_prompt_tokens": -1}}, **kwargs
+            text, model_kwargs=model_kwargs, **kwargs
         )

     async def aembed_query(self, text: str, **kwargs: Any) -> list[float]:
         text = f"Instruct: {TASK_DESCRIPTION}\nQuery: {text}"
+        model_kwargs = {"extra_body": {"truncate_prompt_tokens": -1}}
+        model_kwargs |= kwargs.pop("model_kwargs", {})
         return await super().aembed_query(
-            text, model_kwargs={"extra_body": {"truncate_prompt_tokens": -1}}, **kwargs
+            text, model_kwargs=model_kwargs, **kwargs
         )
chat-server/app/utils/model_registry/model_provider.py (1)

175-203: Guard retries when tools are bound to avoid duplicate side effects in downstream chains.

bind_tools() binds tool schemas only and doesn't execute tools directly. However, when this LLM is used downstream in a chain that executes tools, retries at higher levels can cause tool side effects to repeat. As a defensive measure, skip with_retry() when tools are bound, since side-effectful operations (database writes, external API calls) should be idempotent or prevented from retrying altogether.

♻️ Suggested guard to avoid retrying tool calls
-    if schema:
-        structured_llm = llm.with_structured_output(schema=schema, method="json_schema")  # type: ignore
-        return structured_llm.with_retry()  # type: ignore
+    if schema:
+        structured_llm = llm.with_structured_output(schema=schema, method="json_schema")  # type: ignore
+        return structured_llm.with_retry()  # type: ignore
     elif json_mode:
         llm = llm.bind(response_format={"type": "json_object"})
     if force_tool_calls:
         llm = llm.bind(tool_choice="required")
-    return llm.with_retry()  # type: ignore
+    if tool_names or force_tool_calls:
+        return llm  # avoid retrying side-effectful tool calls
+    return llm.with_retry()  # type: ignore

Comment on lines +26 to +31
class VLLMEmbeddingProvider(BaseEmbeddingProvider):
def __init__(
self,
metadata: dict[str, str],
):
pass

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find and examine BaseEmbeddingProvider and other embedding providers
echo "=== Finding base.py ==="
fd -t f 'base.py' chat-server/app/utils/providers/embedding_providers

echo -e "\n=== Listing embedding providers directory ==="
fd -t f '\.py$' chat-server/app/utils/providers/embedding_providers

echo -e "\n=== Checking BaseEmbeddingProvider definition ==="
rg -n "class BaseEmbeddingProvider" chat-server/app/utils/providers/embedding_providers -A 20

Repository: factly/gopie

Length of output: 1120


🏁 Script executed:

#!/bin/bash
# Get full BaseEmbeddingProvider implementation
echo "=== Full BaseEmbeddingProvider class ==="
cat -n chat-server/app/utils/providers/embedding_providers/base.py

echo -e "\n=== VLLMEmbeddingProvider __init__ ==="
rg -n "class VLLMEmbeddingProvider" chat-server/app/utils/providers/embedding_providers/vllm.py -A 10

echo -e "\n=== OpenAI provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/openai.py -A 10

echo -e "\n=== LiteLLM provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/litellm.py -A 10

echo -e "\n=== Local provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/local.py -A 10

echo -e "\n=== Custom provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/custom.py -A 10

echo -e "\n=== Portkey provider __init__ ==="
rg -n "class.*EmbeddingProvider" chat-server/app/utils/providers/embedding_providers/portkey.py -A 10

Repository: factly/gopie

Length of output: 3474


🏁 Script executed:

#!/bin/bash
# Check embedding provider usage and instantiation
echo "=== Check __init__.py for embedding providers ==="
cat -n chat-server/app/utils/providers/embedding_providers/__init__.py

echo -e "\n=== Search for VLLMEmbeddingProvider usage ==="
rg -r "VLLMEmbeddingProvider" chat-server -g '*.py' -B 2 -A 2

echo -e "\n=== Check provider manager or factory for fallback logic ==="
fd -t f '*.py' chat-server/app/utils/providers | head -20 | xargs -I {} bash -c 'echo "=== {} ===" && head -5 {}'

echo -e "\n=== Look for fallback or routing patterns ==="
rg -i "fallback|fallback_provider" chat-server -g '*.py' -A 2

Repository: factly/gopie

Length of output: 14425


🏁 Script executed:

#!/bin/bash
# Check how embedding providers are instantiated/used
echo "=== Search for embedding provider instantiation ==="
rg "EmbeddingProvider\(" chat-server/app -g '*.py' -B 2 -A 2

echo -e "\n=== Check if there's a provider manager/factory ==="
fd -t f '*.py' chat-server/app/utils/providers | grep -E "(manager|factory|init)" | xargs -I {} bash -c 'echo "=== {} ===" && head -20 {}'

echo -e "\n=== Check __init__.py in providers parent ==="
cat -n chat-server/app/utils/providers/__init__.py 2>/dev/null || echo "File not found"

echo -e "\n=== Check for EmbeddingProvider in main app init ==="
rg "EmbeddingProvider" chat-server/app -g '*.py' | head -20

Repository: factly/gopie

Length of output: 8631


🏁 Script executed:

#!/bin/bash
# Get complete model_provider.py context
echo "=== Full model_provider.py or relevant section ==="
wc -l chat-server/app/utils/model_registry/model_provider.py
head -80 chat-server/app/utils/model_registry/model_provider.py

echo -e "\n=== Check for fallback in get_embedding_provider ==="
rg -n "get_embedding_provider" chat-server/app/utils/model_registry/model_provider.py -A 20

echo -e "\n=== Check if metadata is used in VLLM get_embeddings_model ==="
cat -n chat-server/app/utils/providers/embedding_providers/vllm.py

Repository: factly/gopie

Length of output: 6287


Remove unused metadata parameter and implement fallback mechanism.

BaseEmbeddingProvider has no __init__ method, so super().__init__() is not applicable. However, VLLMEmbeddingProvider accepts metadata but never stores or uses it, unlike PortkeyEmbeddingProvider (which extracts user/trace_id for org-scoped routing) and LiteLLMEmbeddingProvider (which stores metadata for auth). Either remove the unused parameter for consistency, or document why it's needed.

Additionally, per coding guidelines, implement fallback mechanisms at the provider level or in get_embedding_provider() factory—currently there is no fallback strategy if the primary provider fails.
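A provider-agnostic sketch of that fallback idea (the `FallbackEmbeddings` class and its wiring are illustrative, not the project's API):

```python
# Hypothetical fallback wrapper: try the primary embedding provider and
# delegate to a secondary one on any failure. In the real codebase this
# would wrap VLLMEmbeddingProvider etc. inside get_embedding_provider().
class FallbackEmbeddings:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def embed_query(self, text: str) -> list[float]:
        try:
            return self.primary.embed_query(text)
        except Exception:
            # Primary failed (network error, quota, bad config): fall back.
            return self.secondary.embed_query(text)
```

The same pattern applies to the async variants; logging the primary's failure before falling back would aid debugging.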

🤖 Prompt for AI Agents
In `@chat-server/app/utils/providers/embedding_providers/vllm.py` around lines 26
- 31, The VLLMEmbeddingProvider __init__ currently accepts but never stores
metadata and calls no super; either remove the unused metadata parameter from
VLLMEmbeddingProvider or persist it similarly to LiteLLMEmbeddingProvider (e.g.,
store self.metadata and extract org-scoped keys like user/trace_id as
PortkeyEmbeddingProvider does) so downstream methods can use it; additionally
implement a fallback strategy by updating either VLLMEmbeddingProvider (add a
guarded get_embedding call that catches failures and delegates to a configured
fallback provider) or enhance get_embedding_provider() to return a wrapper that
attempts the primary provider (VLLMEmbeddingProvider) and on exception switches
to a secondary provider, ensuring clear identifiers (VLLMEmbeddingProvider,
LiteLLMEmbeddingProvider, PortkeyEmbeddingProvider, get_embedding_provider) are
used to locate and wire the fallback logic.

@claude

claude Bot commented Feb 2, 2026

Claude finished @paul-tharun's task —— View job


🔍 PR Review: Feat/e2e datasets metadata

Review Progress

  • Read repository setup and guidelines
  • Review architectural changes
  • Analyze security implementations
  • Check error handling patterns
  • Evaluate performance implications
  • Assess test coverage
  • Provide comprehensive feedback

🏗️ Architecture Review

This PR introduces significant architectural improvements with a well-structured approach to NL-to-SQL processing and multi-tenant organization support.

Positive Changes

1. Clean NL-to-SQL Graph Architecture

  • Supervisor routing: Clean logic in nl_to_sql_graph/node/supervisor.py routing single vs multi-dataset queries
  • Separation of concerns: Clear split between semantic search, SQL agent, and planning
  • LangGraph patterns: Proper use of StateGraph, Command, and typed state management

2. Multi-Tenant Organization Support

  • Proper isolation: Organization ID properly threaded through GopieClient and Qdrant filters
  • Security boundaries: Clear separation using must conditions for org_id in schema search
  • Header management: Consistent X-Organization-id header handling

3. SQL Injection Prevention

  • Query builders: Safe abstraction through utils/olap/ query builders
  • Parameterized approach: SQL generation through builders rather than string interpolation
  • Validation patterns: Good use of escaped values in column matching

🚨 Critical Security Issues

1. Missing Organization ID in Semantic Search

Location: chat-server/app/workflow/graph/nl_to_sql_graph/node/semantic_search.py:20

The semantic search function calls search_schemas without passing org_id, allowing cross-tenant data leakage:

# VULNERABLE: No org_id passed
results = await search_schemas(
    user_query=user_query,
    embeddings=embeddings_model,
    dataset_ids=dataset_ids,
    project_ids=project_ids,
    # Missing: org_id=org_id
)

Fix Required: Extract org_id from config and pass it to search_schemas.

2. Division by Zero Risk

Location: chat-server/app/utils/graph_utils/table_utils.py:43

# VULNERABLE: Will crash if estimated_size is 0
sample_pct = (settings.TARGET_ROWS / estimated_size) * 100

Fix Required: Add zero-check before division.
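A guarded version of that calculation might look like the sketch below; the clamp to 100% and the full-sample fallback for empty tables are assumed policy choices, not the project's actual behavior:

```python
# Sketch of the guarded sampling calculation. TARGET_ROWS mirrors the
# setting named in the review; the 100% fallback/clamp is an assumption.
def calculate_sampling_percentage(target_rows: int, estimated_size: int) -> float:
    """Percentage of rows to sample, safe for empty or unknown table sizes."""
    if estimated_size <= 0:
        return 100.0  # nothing to scale against: sample the whole table
    return min((target_rows / estimated_size) * 100.0, 100.0)
```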

3. Direct SQL Interpolation

Location: chat-server/app/workflow/graph/single_dataset_graph/node/prepare_data.py:121

# POTENTIAL RISK: Direct interpolation without validation
sample_data_query = f"SELECT * FROM {dataset_name} LIMIT 50"

While dataset_name comes from trusted Qdrant storage, consider defensive identifier quoting.
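One defensive option is to whitelist and quote the identifier before interpolation; the regex below is an illustrative whitelist, not the project's validator:

```python
import re

# Defensive identifier quoting sketch using DuckDB-style double quotes.
# The whitelist regex is an assumption about what dataset names look like.
def quote_identifier(name: str) -> str:
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return '"' + name + '"'

def build_sample_query(dataset_name: str, limit: int = 50) -> str:
    # int() guards the limit; quote_identifier guards the table name.
    return f"SELECT * FROM {quote_identifier(dataset_name)} LIMIT {int(limit)}"
```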


⚠️ Error Handling Issues

1. Inconsistent Error Guards

  • execute_sql.py:34: Checks isinstance(last_message, ErrorMessage) but last_message is a list
  • sql_agent.py:25: Accesses query_result.single_dataset_query_result before null check

2. Missing Resource Cleanup

  • No explicit cleanup logic for failed operations in new graph workflows
  • Broad exception handling masks specific failure modes

3. State Validation Issues

  • prepare_data: Returns without dataset_info but graph always routes to sql_agent
  • routing.py: Uses state.get("query_result", []) which can be a list, causing an AttributeError

🚀 Performance Considerations

Good Optimizations

  • Smart sampling: Table size estimation prevents performance issues on large datasets
  • Vector search optimization: Hybrid search with proper filtering
  • Connection pooling: Singleton pattern for HTTP client connections
  • Async operations: Proper async/await throughout

⚠️ Potential Concerns

  • Multiple vector searches: DuckDB docs search adds latency to SQL planning
  • Memory usage: Complex nested state objects in high-concurrency scenarios
  • Missing timeouts: GopieClient lacks explicit timeout configuration
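
On the timeout point, a stdlib-only sketch of a per-request deadline follows; the wrapper name and default are assumptions (inside GopieClient itself, passing aiohttp's ClientTimeout to the session would be the idiomatic fix):

```python
import asyncio

# Illustrative request-deadline wrapper; DEFAULT_TIMEOUT_S and
# get_with_timeout are sketch names, not GopieClient's actual API.
DEFAULT_TIMEOUT_S = 30.0

async def get_with_timeout(coro, timeout: float = DEFAULT_TIMEOUT_S):
    """Await coro, but fail fast instead of hanging on a slow upstream."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise TimeoutError(f"upstream request exceeded {timeout}s") from exc
```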

🧪 Test Coverage Assessment

Good Coverage

  • Existing unit tests cover GopieClient org_id handling
  • DSPy integration provides advanced evaluation
  • E2E visualization test infrastructure

Missing Tests

  • No tests for new /fetch-sql endpoint
  • No integration tests for NL-to-SQL graph workflow
  • Missing edge case tests for new graph failure scenarios
  • No tests for org_id security boundaries

📋 Code Quality Issues

1. Error Message Inconsistencies

# Vague error without context
raise ValueError("query_result is not properly initialized")

2. Magic Numbers

# Should be configurable
SELECT * FROM {table} LIMIT 200000
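
A minimal settings-driven version could look like this (MAX_SCAN_ROWS is an assumed field name, not the project's actual configuration key):

```python
from dataclasses import dataclass

# Sketch: lift the hard-coded LIMIT into a settings object so it can be
# tuned per environment instead of living as a magic number in the query.
@dataclass(frozen=True)
class Settings:
    MAX_SCAN_ROWS: int = 200_000

settings = Settings()

def build_scan_query(table: str) -> str:
    return f"SELECT * FROM {table} LIMIT {settings.MAX_SCAN_ROWS}"
```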

3. Missing Function Documentation

Several new functions lack comprehensive docstrings explaining return contracts.


🔧 Specific Recommendations

High Priority Fixes

  1. Add org_id to semantic search to prevent data leakage
  2. Add zero-check in calculate_sampling_percentage
  3. Fix error guard in execute_sql.py (list vs ErrorMessage)
  4. Add validation in sql_agent.py before accessing query_result

Medium Priority

  1. Add timeouts to GopieClient requests
  2. Improve error messages with specific context
  3. Add integration tests for new workflows
  4. Configure magic numbers via settings

Code Quality

  1. Add comprehensive docstrings to new functions
  2. Implement consistent retry logic across agents
  3. Add resource cleanup for failed operations

📊 Overall Assessment

| Category | Rating | Comments |
| --- | --- | --- |
| Architecture | ⭐⭐⭐⭐⭐ | Excellent modular design |
| Security | ⭐⭐⭐⚪⚪ | Good foundations, critical gaps |
| Performance | ⭐⭐⭐⭐⚪ | Smart optimizations implemented |
| Testing | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking |
| Code Quality | ⭐⭐⭐⭐⚪ | Clean code, room for improvement |

Recommendation: This is a solid architectural improvement, but the security issue in semantic search must be fixed before merging to prevent cross-tenant data access.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
chat-server/tests/e2e/utils/generate_app_cases.py (1)

290-309: ⚠️ Potential issue | 🔴 Critical

Complete the multi-dataset routing description and add "ai" → "assistant" role normalization.

Lines 290–309: The routing overview for identify_datasets omits the retry_semantic_search path. Update line 295 to:

  • identify_datasets → routes: plan_query | route_response (no_datasets_found) | retry_semantic_search

Lines 401–420: The _normalize_message function converts "human" to "user" but does not convert "ai" to "assistant". Since LangChain's AIMessage.type is "ai" but provider APIs (OpenAI, etc.) require role: "assistant", add this mapping after line 405:

            if role == "ai":
                role = "assistant"

These omissions will cause incomplete test-path coverage and generated test cases with invalid message roles that fail API compatibility checks.
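
The role normalization the review asks for can be sketched as a small mapping (the pass-through for unknown roles is an assumption about intended behavior):

```python
# LangChain message types "human"/"ai" mapped to the "user"/"assistant"
# roles that provider APIs (OpenAI, etc.) expect.
ROLE_MAP = {"human": "user", "ai": "assistant"}

def normalize_role(role: str) -> str:
    # Unknown roles (e.g. "system", "tool") pass through unchanged.
    return ROLE_MAP.get(role, role)
```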

chat-server/tests/unit/test_column_value_matching.py (2)

36-61: ⚠️ Potential issue | 🟠 Major

Bug: captured dict is never populated, assertion will fail.

The test declares captured = {} on line 39 but never assigns captured["query"] in the mock function. The assertion on line 61 will raise a KeyError.

🐛 Proposed fix
 @pytest.mark.asyncio
 async def test_find_similar_values_falls_back_to_levenshtein(monkeypatch):
     """Test fallback to Levenshtein when ILIKE returns empty results."""
     calls = {"count": 0}
     captured = {}
 
     async def fake_execute_sql(query: str, org_id=None):
         calls["count"] += 1
+        captured["query"] = query
         # Check for both DuckDB (levenshtein) and ClickHouse (levenshteinDistance) syntax
         if "levenshtein" in query.lower():
             return [
                 {"name": "Alison"},
                 {"name": "Alicia"},
             ]
         return []

73-83: ⚠️ Potential issue | 🟡 Minor

Bug: Duplicate dictionary key "error" in dummy logger.

The dictionary literal has "error" defined twice. The second definition silently overwrites the first. This appears to be a copy-paste error.

🐛 Proposed fix
     dummy_logger = type(
         "L",
         (),
         {
             "error": lambda *args, **kwargs: None,
             "debug": lambda *args, **kwargs: None,
-            "error": lambda *args, **kwargs: None,
         },
     )()
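
The silent overwrite is core Python semantics and easy to verify:

```python
# In a dict literal, a duplicated key keeps only the last value;
# the earlier "error" entry is discarded without any warning.
handlers = {"error": "first", "debug": "dbg", "error": "second"}
print(handlers)  # the first "error" value is gone
```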
chat-server/tests/unit/test_vector_store.py (1)

48-77: ⚠️ Potential issue | 🟠 Major

Test mocks get_vector_store but implementation uses get_async_client.

The test mocks QdrantSetup.get_vector_store, but examining the actual add_document_to_vector_store implementation (in vector_store.py lines 73-90), the function calls QdrantSetup.get_async_client and performs a direct upsert, not get_vector_store.aadd_documents. This test will pass but doesn't exercise the actual code path.

🔧 Proposed fix - update test to match implementation
     @pytest.mark.asyncio
     async def test_add_document_to_vector_store(self, mock_document):
-        mock_vector_store = AsyncMock()
+        mock_async_client = AsyncMock()

         with (
             patch("app.services.qdrant.vector_store.QdrantSetup") as mock_qdrant_setup_class,
             patch("app.services.qdrant.vector_store.get_model_provider") as mock_get_model_provider,
+            patch("app.services.qdrant.vector_store.generate_sparse_vector") as mock_sparse,
         ):
             mock_model_provider = Mock()
             mock_embeddings = Mock()
+            mock_embeddings.embed_query.return_value = [0.1, 0.2, 0.3]
             mock_model_provider.get_embeddings_model.return_value = mock_embeddings
             mock_get_model_provider.return_value = mock_model_provider

-            mock_qdrant_setup_class.get_vector_store.return_value = mock_vector_store
+            mock_qdrant_setup_class.get_async_client = AsyncMock(return_value=mock_async_client)
             mock_qdrant_setup_class.get_document_id.return_value = "doc_id_123"
+            mock_sparse.return_value = Mock(indices=[1, 2], values=[0.5, 0.5])

             await add_document_to_vector_store(mock_document)

-            mock_qdrant_setup_class.get_vector_store.assert_called_once_with(
-                embeddings=mock_embeddings, collection_name="dataset_collection"
-            )
             mock_qdrant_setup_class.get_document_id.assert_called_once_with("proj1", "ds1")
-            mock_vector_store.aadd_documents.assert_called_once_with(
-                documents=[mock_document], ids=["doc_id_123"]
-            )
+            mock_async_client.upsert.assert_called_once()
chat-server/tests/unit/test_prompts.py (1)

107-168: ⚠️ Potential issue | 🟠 Major

Add @pytest.mark.unit to mark these tests for unit-only test runs.

The test file is missing the module-level pytest marker required for unit tests in chat-server/tests/unit/. This causes these tests to be skipped when running pytest -m unit.

✅ Suggested fix
 from unittest.mock import Mock, patch

 import pytest

+pytestmark = pytest.mark.unit
+
 from app.utils.langsmith.prompt_manager import PromptManager, get_prompt
 from app.workflow.prompts.prompt_selector import PromptSelector
chat-server/app/api/v1/routers/dataset_upload.py (1)

21-50: ⚠️ Potential issue | 🟠 Major

Guard against missing x-organization-id header before Qdrant storage.

store_schema_in_qdrant and create_dataset_schema both require org_id: str, but the header is optional (str | None). Passing None risks type errors and tenant mis-scoping. Make the header required, or validate and reject requests without it.

Suggested fix (header required)
 async def upload_schema(
     payload: UploadSchemaRequest,
     x_organization_id: Annotated[str | None, Header()] = None,
 ):
     """
     Processes and index dataset schema.
@@
     """
     try:
+        if x_organization_id is None:
+            raise HTTPException(
+                status_code=status.HTTP_400_BAD_REQUEST,
+                detail="x-organization-id header is required",
+            )
         project_id = payload.project_id
         dataset_id = payload.dataset_id
         dataset_details, project_details = await asyncio.gather(
chat-server/app/services/gopie/generate_schema.py (1)

70-76: ⚠️ Potential issue | 🟡 Minor

Missing error handling for the GET request.

Unlike execute_sql which checks response.status != HTTPStatus.OK, the GET request here doesn't validate the response status before parsing JSON. A non-200 response could result in unexpected behavior or unclear error messages.

🛡️ Proposed fix to add error handling
+    from http import HTTPStatus
+
     async with await client.get(path) as response:
+        if response.status != HTTPStatus.OK:
+            error_data = await response.json()
+            raise Exception(
+                f"Failed to fetch summary for {dataset_name}: "
+                f"{error_data.get('error', 'Unknown error')}"
+            )
         data = await response.json()
chat-server/app/tool_utils/tools/list_datasets.py (2)

38-52: ⚠️ Potential issue | 🟡 Minor

Add error handling for consistency with other functions.

This function lacks try/except handling unlike get_dataset_names_from_project_ids (lines 17-28) and get_dataset_names_for_dataset_ids (lines 62-70). An unhandled exception here could propagate and cause issues in the calling code.

🛡️ Proposed fix to add error handling
 async def get_project_ids_for_datasets_ids(
     dataset_ids: list[str],
     org_id: Optional[str] = None,
 ) -> dict[str, str]:
     dataset_id_project_map = {}
     client = GopieClient(org_id=org_id)
     for dataset_id in dataset_ids:
-        path = f"/v1/api/datasets/{dataset_id}/project"
-        async with await client.get(path, ssl=False) as response:
-            if response.status == 200:
-                dataset_data = await response.json()
-                dataset_id_project_map[dataset_id] = dataset_data.get("project_id", "")
-            else:
-                logger.warning(f"Dataset {dataset_id} not found")
+        try:
+            path = f"/v1/api/datasets/{dataset_id}/project"
+            async with await client.get(path, ssl=False) as response:
+                if response.status == 200:
+                    dataset_data = await response.json()
+                    dataset_id_project_map[dataset_id] = dataset_data.get("project_id", "")
+                else:
+                    logger.warning(f"Dataset {dataset_id} not found")
+        except Exception as e:
+            logger.exception(f"Error fetching project for dataset {dataset_id}: {e}")
     return dataset_id_project_map

55-76: ⚠️ Potential issue | 🟡 Minor

Potential data loss: only storing one dataset per project.

In get_dataset_names_for_dataset_ids, when multiple datasets belong to the same project, only the last one is retained because project_dataset_map[project_id] is overwritten on each iteration (line 68).

🐛 Proposed fix to accumulate datasets per project
 async def get_dataset_names_for_dataset_ids(
     dataset_project_ids_map: dict[str, str],
     org_id: Optional[str] = None,
 ) -> str:
-    project_dataset_map = {}
+    project_dataset_map: dict[str, list[str]] = {}
     client = GopieClient(org_id=org_id)
 
     for dataset_id, project_id in dataset_project_ids_map.items():
         path = f"/v1/api/projects/{project_id}/datasets/{dataset_id}"
 
         async with await client.get(path, ssl=False) as response:
             if response.status == 200:
                 dataset_data = await response.json()
-                project_dataset_map[project_id] = dataset_data.get("alias", "")
+                if project_id not in project_dataset_map:
+                    project_dataset_map[project_id] = []
+                project_dataset_map[project_id].append(dataset_data.get("alias", ""))
             else:
                 logger.warning(f"Dataset {dataset_id} not found")
 
     list_of_datasets_str = ""
-    for project_id, dataset_name in project_dataset_map.items():
+    for project_id, dataset_names in project_dataset_map.items():
         list_of_datasets_str += f"Project {project_id}:\n"
-        list_of_datasets_str += f"  - {dataset_name}\n"
+        for dataset_name in dataset_names:
+            list_of_datasets_str += f"  - {dataset_name}\n"
     return list_of_datasets_str
chat-server/app/workflow/graph/multi_dataset_graph/node/identify_datasets.py (2)

88-111: ⚠️ Potential issue | 🟡 Minor

Incorrect data type passed to set_node_message.

At line 91-94, a set is being passed as the message argument instead of a str. The curly braces {...} create a set, not a dict. This should be a string.

🐛 Proposed fix
             query_result.set_node_message(
                 "identify_datasets",
-                {
-                    "No relevant datasets found by doing semantic search. This subquery is "
-                    "not relevant to any datasets. Treating as conversational query."
-                },
+                "No relevant datasets found by doing semantic search. This subquery is "
+                "not relevant to any datasets. Treating as conversational query.",
             )

197-218: ⚠️ Potential issue | 🟡 Minor

Docstring mentions "analyze_dataset" but returns "plan_query".

The docstring at line 202 still references "analyze_dataset" as a possible return value, but the function now returns "plan_query" at line 218.

📝 Proposed fix to update docstring
     """
     Route to the appropriate next node based on dataset identification results.
 
     Returns:
-        str: One of "no_datasets_found", "retry_semantic_search", or "analyze_dataset"
+        str: One of "no_datasets_found", "retry_semantic_search", or "plan_query"
     """
🤖 Fix all issues with AI agents
In `@chat-server/app/api/v1/routers/dataset_upload.py`:
- Around line 2-4: Remove the duplicate Header import (only import Header once)
and add validation for x_organization_id before calling store_schema_in_qdrant:
ensure x_organization_id (the header bound variable) is not None/empty and if
missing raise an HTTPException with status code 400; then pass the validated
org_id (str) into store_schema_in_qdrant to satisfy its required str parameter
(refer to Header import, x_organization_id, and store_schema_in_qdrant).

In `@chat-server/app/core/log.py`:
- Around line 14-17: The custom logger method exception in
chat-server/app/core/log.py currently calls self._logger.critical(...,
stack_info=True) which records the call stack but loses the actual exception
traceback; update the exception method to pass exc_info=True (and keep
stack_info=True if desired) to self._logger so the logged output includes the
exception traceback; modify the exception function (named exception) to call
self._logger with exc_info=True (and stack_info=True if you want both) and
preserve the existing behavior of raising when self.dev_mode is true.

In `@chat-server/app/models/data.py`:
- Around line 28-37: DatasetDetails.columns is now Optional and can be None;
update all consumers to guard against None before iterating or accessing it —
e.g., in regenerate_fuzzy_values_prompt.py where matching_schema.columns is used
in list/generator comprehensions and in dataset_info.py where schema.columns is
looped, replace direct iterations with a null-safe form (check
matching_schema.columns is not None or use a fallback like
(matching_schema.columns or [])). Also add null checks before using row_count in
calculations and before using file_path for file operations so you don't assume
they are present (guard expressions around row_count and file_path where used).
Ensure you update usages referencing DatasetDetails, matching_schema.columns,
schema.columns, row_count, and file_path accordingly.

In `@chat-server/app/services/gopie/dataset_info.py`:
- Around line 34-43: Replace the re-raising pattern in the except blocks that
currently use "raise e" with a bare "raise" so the original traceback is
preserved; specifically update the except Exception as e blocks in
chat-server/app/services/gopie/dataset_info.py (the block that logs via
logger.exception("Error getting dataset info...") and the other similar block
later) to log with logger.exception(...) and then perform a bare raise rather
than "raise e".

In `@chat-server/app/services/qdrant/schema_search.py`:
- Around line 86-88: The async function search_schemas is calling the blocking
embeddings.embed_query which will block the event loop; change the call to the
async variant await embeddings.aembed_query(user_query) and assign its result
back to dense_vector, leaving generate_sparse_vector(user_query) as-is; ensure
the embeddings object supports aembed_query (as used elsewhere, e.g., health.py)
and update any nearby error handling or type expectations for dense_vector if
needed.

In `@chat-server/app/utils/olap/clickhouse.py`:
- Around line 13-19: The SQL builder currently injects table_name and
column_name directly (e.g., in get_estimated_size_query and the other builder
methods referenced), which permits SQL injection; update all methods that
interpolate identifiers to validate them against a strict identifier whitelist
or a conservative regex (e.g., only ASCII letters, numbers, and underscores,
starting with a letter) before using them, and then format identifiers using a
safe quoting strategy (e.g., wrap with ClickHouse backticks after validation and
escape any backticks found), ensuring you apply this change to
get_estimated_size_query and every builder method that inserts table_name or
column_name (the ones in the review ranges) so all identifier interpolation uses
the validated-and-quoted identifier rather than raw f-string insertion.

In `@chat-server/app/utils/olap/duckdb.py`:
- Around line 65-91: The two methods build_large_table_ilike_query and
build_large_table_levenshtein_query currently interpolate user inputs directly
into SQL and need the same sanitization as other query builders: stop inserting
raw value, table_name, and column_name into f-strings; instead use the existing
sanitization routine used elsewhere (e.g., the same escaping/quoting used by
build_sample_query or the project's SQL sanitizer) for table and column
identifiers and use parameterized placeholders or safely-escaped literal values
for the search term (escape single quotes and wrap properly) so that value,
table_name, and column_name cannot inject SQL; update both functions to call the
sanitizer for identifiers and to bind or escape the value consistently with
other query builders.
- Around line 38-63: The Levenshtein and ILIKE query builders
(build_levenshtein_query and build_ilike_query) directly interpolate the value
and identifiers into SQL, creating an injection risk; fix by
validating/whitelisting table_name and column_name (e.g., allow only
[A-Za-z_][A-Za-z0-9_]*), and sanitize or escape the value string (e.g., ensure
it's a str and escape single quotes) or better yet switch to DuckDB parameter
binding for the value; update these two methods to call the new
_validate_identifier and _sanitize_sql_string helpers (or use parameter
placeholders) before constructing the final SQL string.
- Around line 13-19: The get_estimated_size_query and other query-builder
methods currently interpolate table_name/column_name directly; to fix, add
parameterized query support to the execution layer (extend execute_sql to accept
a query plus params and use parameter binding) and update these methods (e.g.,
get_estimated_size_query) to return a parameterized SQL string with placeholders
and a corresponding params list instead of raw string interpolation; as an
immediate mitigation also validate identifier inputs (table_name, column_name)
in the class using a strict regex allowing only [A-Za-z0-9_] and raise on
invalid values before building the query so only safe identifiers are used until
execute_sql parameterization is implemented.

In
`@chat-server/app/workflow/graph/multi_dataset_graph/node/generate_subqueries.py`:
- Around line 18-67: _format_context_info currently only uses project_ids[0];
update it to iterate over all provided project_ids and append each project's
context and datasets to context_info: loop for pid in project_ids, call
get_project_info(pid, org_id=org_id) and get_project_schemas(pid, limit=50,
org_id=org_id), prepend each block with a project header (e.g., "PROJECT
CONTEXT: <pid> - <name>"), and for datasets slice the returned list to respect
max_datasets (or distribute max_datasets across projects if desired) before
formatting names/descriptions as the existing code does; keep the try/except
around per-project calls to continue on errors. Ensure you reference the same
function name (_format_context_info), and helper calls (get_project_info,
get_project_schemas) when making edits.

In `@chat-server/app/workflow/graph/multi_dataset_graph/node/sql_agent.py`:
- Around line 82-84: The success message string in the IntermediateStep call
contains a typo ("Succesfully"); update the content passed to IntermediateStep
in sql_agent.py (the IntermediateStep(...) invocation) to read "Successfully
Completed query planning step in multidataset workflow." so the spelling is
corrected while preserving the rest of the message and punctuation.

In `@chat-server/app/workflow/graph/single_dataset_graph/node/prepare_data.py`:
- Around line 85-112: The code extracts org_id but doesn't pass it into the
schema lookup, which breaks tenant isolation; update the call in prepare_data.py
where get_schema_from_qdrant is invoked (currently
get_schema_from_qdrant(dataset_id=dataset_id)) to include the org_id parameter
(e.g., get_schema_from_qdrant(dataset_id=dataset_id, org_id=org_id)), and if
necessary update get_schema_from_qdrant's signature/implementation to accept and
use org_id for tenant-scoped schema resolution; keep dataset_id usage and
existing error handling intact.

In
`@chat-server/app/workflow/prompts/multi_dataset_prompts/analyze_query_prompt.py`:
- Around line 177-186: The human prompt path uses the raw previous_sql_queries
list, which differs from the formatted representation produced by
format_analyze_query_input elsewhere; normalize previous_sql_queries with that
same helper (preserving the "None" fallback) before formatting
human_template_str, so human_template_str.format receives the consistent
"[id:1] SELECT ..." style strings other code paths expect, and pass the
formatted result to human_template_str.format instead of the raw list.

In `@chat-server/scripts/scrape_duckdb_docs.py`:
- Around line 5-6: The project imports httpx in scrape_duckdb_docs.py but
pyproject.toml lacks that dependency; update the pyproject.toml dependencies
section to include "httpx" (alongside the existing "beautifulsoup4") so CI and
production environments can install it—ensure the exact package name "httpx" is
added under the same dependencies list used for other runtime packages.
- Around line 16-23: The _fetch_page function currently catches all Exceptions
and uses print; change it to catch httpx.RequestError for transport-level
failures around await client.get(url, ...) and httpx.HTTPStatusError for non-2xx
responses raised by response.raise_for_status(), log errors via a module logger
(e.g., logger = logging.getLogger(__name__)) instead of print and include the
URL and exception details in the log (use logger.error or logger.exception), and
still return response.text on success or None on handled errors; keep the
function signature and return semantics the same.

In `@chat-server/tests/e2e/utils/generate_app_cases.py`:
- Around line 401-420: The _normalize_message function currently escapes content
but not role, which can allow special chars in role to break generated Python
code; update _normalize_message to escape role the same way as content (replace
backslash with \\\\, double-quote with \\", newline with \\n, and remove \r) and
return the escaped role, and also normalize role values by mapping
"human"→"user" and "ai"→"assistant" (in addition to the existing "type"
fallback) so generated templates that inject role values use safe, normalized
strings.
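The role handling alone can be sketched like this (normalize_role is a hypothetical stand-in for the role branch of _normalize_message):

```python
def normalize_role(role: str) -> str:
    # Map LangChain-style roles to the template's expected names first,
    # then escape so the value is safe inside generated Python source.
    role = {"human": "user", "ai": "assistant"}.get(role, role)
    return (
        role.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "")
    )
```

Escaping backslashes first matters; otherwise the backslashes introduced by the quote and newline replacements would be escaped again.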

In `@chat-server/tests/unit/test_generate_sql_prompt_formatting.py`:
- Around line 31-33: Add the missing module-level pytest marker by declaring
pytestmark = pytest.mark.unit near the top of the test module (after imports and
any module docstring) so this file follows the same convention as other unit
tests; modify the TestGenerateSqlPromptFormatting test module to include that
pytestmark declaration (referencing the module that contains class
TestGenerateSqlPromptFormatting).

In `@chat-server/tests/unit/test_schema_search.py`:
- Around line 48-51: The test accesses query call kwargs with
call_args.kwargs["filter"] which can raise KeyError because search_schemas
places filter inside each Prefetch rather than as a top-level kwarg; update the
assertion to use call_args.kwargs.get("filter") (matching other tests) and keep
the existing assertion on prefetch length (mock_client.query_points,
search_schemas, Prefetch are the symbols to update/inspect).
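A standalone illustration of why .get avoids the KeyError when a kwarg is absent from the recorded call (plain unittest.mock, no project code):

```python
from unittest.mock import Mock

client = Mock()
# Simulate a call shaped like search_schemas' query_points invocation, where
# the filter lives inside each prefetch entry rather than as a top-level kwarg.
client.query_points(prefetch=[{"filter": "f1"}, {"filter": "f2"}], limit=10)

call_args = client.query_points.call_args
top_level_filter = call_args.kwargs.get("filter")  # None, no KeyError raised
prefetch = call_args.kwargs["prefetch"]
```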
🧹 Nitpick comments (23)
chat-server/tests/e2e/utils/test_utils.py (1)

202-219: Consider using aiohttp for async HTTP requests.

The function is declared as async but uses synchronous requests.post, which blocks the event loop. Since the codebase already uses aiohttp (as seen in gopie/client.py), consider migrating to aiohttp for consistency and true non-blocking behavior.

This is a pre-existing pattern, so deferring is acceptable.

♻️ Suggested approach using aiohttp
import aiohttp

async def send_chat_request(test_case: Dict[str, Any], url: str) -> Dict[str, Any]:
    query_copy = test_case.copy()
    query_copy.pop("expected_result", None)

    try:
        headers = {
            "Content-Type": "application/json",
            "Accept": "text/event-stream",
            "X-organization-id": TestConfig.GOPIE_ORG_ID,
            "X-user-id": TestConfig.GOPIE_USER_ID,
        }
        async with aiohttp.ClientSession() as session:
            async with session.post(
                url,
                json={**query_copy, "chat_id": "", "trace_id": ""},
                headers=headers,
                timeout=aiohttp.ClientTimeout(total=REQUEST_TIMEOUT),
            ) as response:
                response.raise_for_status()
                # Process streaming response...

As per coding guidelines: "Use async/await for non-blocking I/O operations throughout the application".

chat-server/tests/unit/test_schema_search.py (1)

14-39: Consider extracting common mocks into fixtures.

The mock setup for mock_embeddings, mock_client, and the context manager patches are duplicated across all tests. Extracting these into pytest fixtures would reduce boilerplate and improve maintainability.

Example fixture approach
`@pytest.fixture`
def mock_embeddings():
    mock = Mock()
    mock.embed_query.return_value = [0.1, 0.2, 0.3]
    return mock


`@pytest.fixture`
def mock_qdrant_client():
    return AsyncMock()


`@pytest.fixture`
def patch_qdrant_and_sparse(mock_qdrant_client):
    with (
        patch(
            "app.services.qdrant.schema_search.QdrantSetup.get_async_client",
            return_value=mock_qdrant_client,
        ),
        patch("app.services.qdrant.schema_search.generate_sparse_vector") as mock_sparse,
    ):
        mock_sparse.return_value = models.SparseVector(indices=[1, 2], values=[0.5, 0.5])
        yield mock_qdrant_client

Also applies to: 57-70, 87-100, 117-130, 147-160

chat-server/app/workflow/prompts/multi_dataset_prompts/analyze_query_prompt.py (1)

188-191: Inconsistent use of langsmith_compatible formatter.

The template path (line 172) applies langsmith_compatible() to system_content, but the direct message path (line 189) uses the raw system_content. This could lead to different prompt behavior depending on which path is used.

Proposed fix for consistency
     return [
-        SystemMessage(content=system_content),
+        SystemMessage(content=langsmith_compatible(system_content)),
         HumanMessage(content=human_content),
     ]
chat-server/scripts/scrape_duckdb_docs.py (1)

112-138: Replace progress print statements with structured logs.

Use the project’s JSON logger for progress and completion messages to keep observability consistent.
As per coding guidelines: Use JSON structured logging with context for observability.

🔧 Suggested update
-        print(f"Scraping [{len(self.visited_urls)}/{self.max_pages}]: {url}")
+        logger.info(
+            "duckdb_docs_scrape_progress",
+            extra={"current": len(self.visited_urls), "max_pages": self.max_pages, "url": url},
+        )
@@
-        print(f"Scraped {len(self.scraped_content)} pages")
+        logger.info("duckdb_docs_scrape_complete", extra={"pages": len(self.scraped_content)})
@@
-        print(f"Saved {len(self.scraped_content)} pages to {output_file}")
+        logger.info(
+            "duckdb_docs_saved",
+            extra={"pages": len(self.scraped_content), "path": str(output_file)},
+        )
chat-server/scripts/index_duckdb_docs.py (1)

19-72: Use the structured logger instead of print for observability.
Replacing print statements with the app logger keeps output consistent and JSON-structured across the chat-server.

As per coding guidelines: Use JSON structured logging with context for observability.

chat-server/app/services/qdrant/vector_store.py (1)

47-49: Prefer next(iter(...)) over single-element slice for efficiency.

Using list(...)[0] creates an unnecessary intermediate list. The generator-based approach is more idiomatic and efficient.

♻️ Proposed fix
 def generate_sparse_vector(text: str) -> models.SparseVector:
     sparse_model = SparseModelManager.get_model()
-    result = list(sparse_model.embed([text]))[0]
+    result = next(iter(sparse_model.embed([text])))
     return models.SparseVector(indices=result.indices.tolist(), values=result.values.tolist())
chat-server/tests/unit/test_column_value_matching.py (1)

85-90: Consider using underscore prefix for unused unpacked variables.

The match_source and error_message variables are unpacked but never used. Prefixing with underscore communicates intent and silences linter warnings.

♻️ Proposed fix
-    similar_values, match_source, error_message = await find_similar_values(
+    similar_values, _match_source, _error_message = await find_similar_values(
         "foo", "name", "t", estimated_size=1000
     )
chat-server/app/workflow/graph/single_dataset_graph/node/execute_sql.py (1)

52-57: Redundant org_id extraction inside loop.

org_id is already extracted at lines 22-24 before entering the loop. The re-extraction on line 54 is redundant but harmless.

♻️ Proposed fix - remove redundant extraction
         for sql_info in query_result.single_dataset_query_result.sql_queries:
             try:
-                org_id = config.get("metadata", {}).get("org_id")
                 result_data, truncated_result_data = await execute_sql_with_full_and_truncated(
                     query=sql_info.sql_query, org_id=org_id
                 )
chat-server/app/workflow/prompts/multi_dataset_prompts/identify_datasets_prompt.py (1)

126-131: Avoid mutable default arguments.

Using mutable default arguments ([]) can lead to unexpected behavior if the function modifies them. While this function doesn't mutate the lists, using None with a default is the safer pattern.

♻️ Proposed fix
 def format_identify_datasets_input(
     user_query: str,
-    relevant_dataset_schemas: list[DatasetSchema] = [],
-    semantic_searched_datasets: list[DatasetSchema] = [],
+    relevant_dataset_schemas: list[DatasetSchema] | None = None,
+    semantic_searched_datasets: list[DatasetSchema] | None = None,
     validation_result: str | None = None,
 ) -> dict:
+    if relevant_dataset_schemas is None:
+        relevant_dataset_schemas = []
+    if semantic_searched_datasets is None:
+        semantic_searched_datasets = []
     input_str = f"USER QUERY: {user_query}"
chat-server/app/workflow/prompts/multi_dataset_prompts/generate_subqueries_prompt.py (1)

8-79: Document template placeholders and expected kwargs.
The prompt now relies on {context_info} and {user_input}, but the function doesn’t describe these variables. Add a short docstring or inline note listing required kwargs and template parameters.

📌 Suggested docstring
 def generate_subqueries_prompt(**kwargs) -> list | ChatPromptTemplate:
+    """
+    Build the subquery-decomposition prompt.
+
+    Template variables:
+      - context_info: Optional project/dataset context text.
+      - user_input: Raw user query.
+    """
     prompt_template = kwargs.get("prompt_template", False)

As per coding guidelines: Document prompt templates and parameters in prompt files.

chat-server/tests/scripts/reset_and_reindex_collection.py (1)

20-25: Remove unused # noqa: E402 directives.
Ruff reports these as unused, so they can be dropped to keep lint clean.

🧹 Suggested cleanup
-from qdrant_client import QdrantClient, models  # noqa: E402
+from qdrant_client import QdrantClient, models
@@
-from app.core.config import settings  # noqa: E402
-from app.core.log import custom_logger as logger  # noqa: E402
-from app.services.qdrant.qdrant_setup import QdrantSetup  # noqa: E402
-from app.services.qdrant.vector_store import (  # noqa: E402
+from app.core.config import settings
+from app.core.log import custom_logger as logger
+from app.services.qdrant.qdrant_setup import QdrantSetup
+from app.services.qdrant.vector_store import (
     generate_sparse_vector,
 )
chat-server/app/utils/olap/__init__.py (1)

8-15: Sort __all__ to satisfy RUF022.

Keeps export lists deterministic and aligns with Ruff’s expectation.

🧹 Suggested change
 __all__ = [
-    "OlapQueryBuilder",
-    "DuckDBQueryBuilder",
     "ClickHouseQueryBuilder",
+    "DuckDBQueryBuilder",
+    "OlapQueryBuilder",
     "get_query_builder",
-    "is_duckdb_family",
     "is_clickhouse_family",
+    "is_duckdb_family",
 ]
chat-server/app/utils/olap/factory.py (1)

3-8: Use structured custom_logger instead of stdlib logging.

Keeps logging consistent with the rest of the chat server’s structured logging approach.

🧾 Suggested change
-import logging
-
-from app.core.config import settings
+from app.core.config import settings
+from app.core.log import custom_logger as logger
 from app.utils.olap.base import OlapQueryBuilder
-
-logger = logging.getLogger(__name__)

As per coding guidelines: Use JSON structured logging with context for observability.

chat-server/tests/unit/test_match_columns.py (1)

189-258: Consider adding assertion for _regenerate_fuzzy_values call arguments.

The retry test verifies the outcome but doesn't assert that _regenerate_fuzzy_values was called with expected arguments. This could help catch regressions if the function signature changes.

💡 Proposed enhancement
         with (
             patch(
                 "app.workflow.graph.sql_planner_graph.match_columns.match_column_values",
                 new_callable=AsyncMock,
                 return_value=mock_column_mappings,
             ),
             patch(
                 "app.workflow.graph.sql_planner_graph.match_columns.validate_match_relevance",
                 new_callable=AsyncMock,
                 return_value=mock_column_mappings,
             ),
             patch(
                 "app.workflow.graph.sql_planner_graph.match_columns._regenerate_fuzzy_values",
                 new_callable=AsyncMock,
                 return_value=mock_regenerate_response,
-            ),
+            ) as mock_regenerate,
         ):
             state = {
                 "user_query": "show finance department",
                 "messages": [],
                 "multi_datasets_info": sample_multi_datasets_info,
                 "match_columns_retry_count": 0,
             }

             result = await match_columns(state, mock_config)

             assert result["match_columns_retry_count"] == 1
             assert "multi_datasets_info" in result
             assert len(result["messages"]) == 1
             assert "Retrying" in result["messages"][0].content
+            mock_regenerate.assert_called_once()
chat-server/app/workflow/graph/single_dataset_graph/node/sql_agent.py (2)

57-68: Consider using f-string conversion flag for cleaner formatting.

The static analysis suggests using an explicit conversion flag. While minor, using {e!s} is more explicit than {str(e)}.

♻️ Proposed fix
     except Exception as e:
-        error_message = f"Error in SQL planner agent: {str(e)}"
+        error_message = f"Error in SQL planner agent: {e!s}"
         single_dataset_result.error = error_message

63-68: Consider passing config parameter to adispatch_custom_event for proper config propagation in LangGraph workflows.

While config is optional in the adispatch_custom_event signature, passing it explicitly when available (as it is in this function) ensures proper context propagation in LangGraph-based workflows. The config: RunnableConfig parameter is already available in scope on line 13, so update the call at lines 63-68 to:

await adispatch_custom_event(
    "gopie-agent",
    {
        "content": "Error planning SQL queries",
    },
    config=config,
)

Apply the same change to other adispatch_custom_event calls in this file and similar files in the workflow directory.

chat-server/app/utils/olap/base.py (1)

26-29: Consider documenting the expected format of the pct parameter.

The pct parameter is typed as str across multiple methods (build_sample_query, build_large_table_ilike_query, build_large_table_levenshtein_query). Based on the usage in generate_schema.py, it's a formatted percentage string (e.g., "0.001234"). Adding this to the docstring would help implementers.

📝 Proposed docstring enhancement
     `@abstractmethod`
     def build_sample_query(self, table_name: str, pct: str, limit: int) -> str:
-        """Return query with sampling for large tables."""
+        """Return query with sampling for large tables.
+
+        Args:
+            table_name: Name of the table to sample
+            pct: Sampling percentage as a string (e.g., "0.001234")
+            limit: Maximum number of rows to return
+        """
         pass

Also applies to: 48-58

chat-server/app/workflow/graph/single_dataset_graph/types.py (1)

36-36: Consider using more specific types for list fields.

The prev_sql_queries, sql_queries, and explanations fields are typed as list | None or list[str] | None. For better type safety and IDE support, consider using explicit element types.

💡 Proposed type refinement
 class State(TypedDict):
     messages: Annotated[list[BaseMessage], add_messages]
     dataset_id: str | None
     user_query: str | None
     query_result: QueryResult
     dataset_info: SingleDatasetInfo
-    prev_sql_queries: list | None
+    prev_sql_queries: list[str] | None
     validation_result: str | None
     recommendation: str
     sql_queries: list[str] | None
     explanations: list[str] | None

Also applies to: 39-40

chat-server/app/tool_utils/tools/list_datasets.py (1)

79-85: Avoid mutable default arguments.

Using mutable default arguments (list[str] = []) is a known Python anti-pattern that can cause unexpected behavior. Additionally, config: RunnableConfig = None should use Optional for type correctness.

♻️ Proposed fix
 `@tool`
 async def get_all_datasets(
     status_message: str = "",
-    project_ids: list[str] = [],
-    dataset_ids: list[str] = [],
-    config: RunnableConfig = None,
+    project_ids: Optional[list[str]] = None,
+    dataset_ids: Optional[list[str]] = None,
+    config: Optional[RunnableConfig] = None,
 ) -> str:
     """
     ...
     """
-    if not project_ids and not dataset_ids:
+    if not project_ids:
+        project_ids = []
+    if not dataset_ids:
+        dataset_ids = []
+
+    if not project_ids and not dataset_ids:
chat-server/app/workflow/graph/sql_planner_graph/generate_sql.py (1)

86-130: Condition check at line 90 could be simplified.

The check len(sql_queries) == 0 is more idiomatically written as not sql_queries in Python.

♻️ Minor style improvement
-        if non_sql_response and len(sql_queries) == 0:
+        if non_sql_response and not sql_queries:
chat-server/app/workflow/prompts/generate_sql_prompt.py (2)

20-134: Add prompt/parameter documentation for this prompt module.

There’s no docstring describing the template’s purpose, required kwargs, and expected input sections. Add brief module/function docs so the prompt contract is explicit and maintainable.

📝 Suggested docstrings
+"""Prompt templates for SQL generation and input formatting."""
+
 def generate_sql_prompt(**kwargs) -> list[BaseMessage] | ChatPromptTemplate:
+    """Build the SQL-generation prompt.
+
+    Args:
+        prompt_template: If True, return ChatPromptTemplate.
+        input: Preformatted input string from format_generate_sql_input().
+    """
     prompt_template = kwargs.get("prompt_template", False)
     input_content = kwargs.get("input", "")
@@
 def format_generate_sql_input(
@@
-) -> dict:
+) -> dict:
+    """Format user query + dataset context into a single input string."""
As per coding guidelines: Document prompt templates and parameters in prompt files.

124-134: Avoid unused kwargs in format_generate_sql_input.

**kwargs is unused (ruff ARG001). Either remove it or rename to **_kwargs to explicitly mark it as unused.

♻️ Minimal lint fix
 def format_generate_sql_input(
@@
-    **kwargs,
+    **_kwargs,
 ) -> dict:
chat-server/app/workflow/graph/single_dataset_graph/node/prepare_data.py (1)

30-74: Reuse the shared CSV conversion helper to avoid drift.

convert_rows_to_csv already exists in app.utils.csv_utils. Duplicating logic here risks divergence over time; prefer importing the shared helper.

♻️ Example delegation
 def convert_rows_to_csv(rows: list[dict]) -> str:
-    """
-    Convert a list of dictionaries into a CSV-formatted string with special handling for certain cell values.
-    ...
-    """
-    if not rows:
-        return ""
-
-    output = io.StringIO()
-    writer = csv.writer(output)
-    ...
-    return output.getvalue()
+    """Delegate to the shared CSV formatter for consistent behavior."""
+    from app.utils.csv_utils import convert_rows_to_csv as _convert_rows_to_csv
+    return _convert_rows_to_csv(rows)

Comment thread chat-server/app/api/v1/routers/dataset_upload.py Outdated
Comment on lines 14 to 17
 def exception(self, msg, *args, **kwargs):
-    self._logger.exception(msg, *args, stack_info=True, **kwargs)
+    self._logger.critical(msg, *args, stack_info=True, **kwargs)
     if self.dev_mode:
         raise

⚠️ Potential issue | 🟠 Major

Loss of exception traceback information.

The change from logger.exception() to logger.critical() with stack_info=True loses the actual exception traceback:

  • stack_info=True logs the call stack (where the log statement was invoked)
  • exc_info=True (used internally by logger.exception()) logs the exception traceback (the actual error and its origin)

This means when an exception occurs, you'll see where logger.exception() was called, but not the actual exception details and traceback that caused the error.

🐛 Proposed fix to preserve exception traceback
 def exception(self, msg, *args, **kwargs):
-    self._logger.critical(msg, *args, stack_info=True, **kwargs)
+    self._logger.critical(msg, *args, exc_info=True, **kwargs)
     if self.dev_mode:
         raise

If you need both the call stack and exception info, use exc_info=True, stack_info=True.

🤖 Prompt for AI Agents
In `@chat-server/app/core/log.py` around lines 14-17, the custom logger method
exception currently calls self._logger.critical(..., stack_info=True), which
records the call stack but loses the actual exception traceback; update the
exception method to pass exc_info=True (keeping stack_info=True if desired) so
the logged output includes the exception traceback, and preserve the existing
behavior of raising when self.dev_mode is true.
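The exc_info distinction is easy to confirm with plain stdlib logging (no project code involved):

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
log = logging.getLogger("exc_demo")
log.addHandler(handler)
log.setLevel(logging.DEBUG)

try:
    1 / 0
except ZeroDivisionError:
    # exc_info=True attaches the exception traceback to the log record;
    # stack_info=True alone would only record the call stack at this line.
    log.critical("failed", exc_info=True)

output = stream.getvalue()
```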

Comment on lines +28 to +37
row_count: Optional[int] = None
columns: Optional[list[ColumnDetails]] = None
size: Optional[int] = None
file_path: Optional[str] = None
created_at: Optional[str] = None
updated_at: Optional[str] = None
created_by: Optional[str] = None
updated_by: Optional[str] = None
custom_prompt: Optional[str] = None
sql_query: Optional[str] = None

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find where DatasetDetails is defined and used
rg -n "class DatasetDetails|DatasetDetails" --type py chat-server/ | head -20

Repository: factly/gopie

Length of output: 769


🏁 Script executed:

# Find all accesses to the optional fields mentioned in the review
rg -n "\.columns\s*[^=]|\.row_count\s*[^=]|\.size\s*[^=]|\.file_path\s*[^=]" --type py chat-server/app/ -A2 -B2 | head -50

Repository: factly/gopie

Length of output: 5588


🏁 Script executed:

# Look for iteration or direct access patterns on these fields without null checks
rg -n "for .* in .*\.columns|\.columns\[|\w+\.row_count \+|\w+\.row_count -" --type py chat-server/app/

Repository: factly/gopie

Length of output: 1470


🏁 Script executed:

# Check the context around the unsafe column iteration in regenerate_fuzzy_values_prompt.py
sed -n '100,120p' chat-server/app/workflow/prompts/multi_dataset_prompts/regenerate_fuzzy_values_prompt.py

Repository: factly/gopie

Length of output: 1060


🏁 Script executed:

# Check the context in dataset_info.py around line 135
sed -n '130,140p' chat-server/app/services/gopie/dataset_info.py

Repository: factly/gopie

Length of output: 583


🏁 Script executed:

# Let's look at the full DatasetDetails model definition to confirm the field types
sed -n '20,50p' chat-server/app/models/data.py

Repository: factly/gopie

Length of output: 890


Fix unsafe access to optional columns field in DatasetDetails consumers.

The columns: Optional[list[ColumnDetails]] field can now be None, but multiple locations iterate over it without null checks:

  • chat-server/app/workflow/prompts/multi_dataset_prompts/regenerate_fuzzy_values_prompt.py:109, 113 - List comprehension and generator expression iterate over matching_schema.columns without checking for None
  • chat-server/app/services/gopie/dataset_info.py:135 - Direct for-loop over schema.columns without null guard

Add null checks before any iteration or access to columns. Similarly, verify that row_count usage in calculations and file_path access also handle None values.

🤖 Prompt for AI Agents
In `@chat-server/app/models/data.py` around lines 28 - 37, DatasetDetails.columns
is now Optional and can be None; update all consumers to guard against None
before iterating or accessing it — e.g., in regenerate_fuzzy_values_prompt.py
where matching_schema.columns is used in list/generator comprehensions and in
dataset_info.py where schema.columns is looped, replace direct iterations with a
null-safe form (check matching_schema.columns is not None or use a fallback like
(matching_schema.columns or [])). Also add null checks before using row_count in
calculations and before using file_path for file operations so you don't assume
they are present (guard expressions around row_count and file_path where used).
Ensure you update usages referencing DatasetDetails, matching_schema.columns,
schema.columns, row_count, and file_path accordingly.
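The null-safe pattern in isolation (dataclass stand-ins for the project's Pydantic models):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnDetails:  # simplified stand-in for the project's model
    column_name: str


@dataclass
class DatasetDetails:
    columns: Optional[list[ColumnDetails]] = None


def column_names(schema: DatasetDetails) -> list[str]:
    # `or []` makes the iteration safe when columns is None
    return [col.column_name for col in (schema.columns or [])]
```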

Comment on lines 34 to 43
 try:
-    async with http_session.get(url, headers=headers) as response:
-        data = await response.json()
-        return DatasetDetails(**data)
+    async with await client.get(path) as response:
+        response_data = await response.json()
+        data = response_data["data"] if is_view else response_data
+        if is_view and data.get("sql_query"):
+            data["description"] = f"{data['description']}\n\nSQL Query:\n{data['sql_query']}"
+        return DatasetDetails(**data)
 except Exception as e:
     logger.exception(f"Error getting dataset info: {e!s}")
     raise e

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Python raise vs raise e exception traceback difference

💡 Result:

Inside an except … as e: block:

  • raise (bare raise) re-raises the currently handled (“active”) exception without changing its traceback (so you keep the original stack frames showing where it first failed). This is the intended re-raise form. [1]

  • raise e raises the exception object e via the “raise an exception object” form of raise (i.e., it’s treated like raising an explicit exception expression). Practically, this changes what traceback you see by making the re-throw point (raise e) the reported raise location (often obscuring the original failure site compared to bare raise). [1][2]

If you’re converting/wrapping exceptions:

  • Prefer raise NewError(...) from e to keep a clear chain (__cause__) from the original to the new exception. [1][3]
  • Use raise NewError(...) from None to suppress the original context in the default traceback output. [3]

Sources:
[1] Python Language Reference – raise statement (re-raise vs raising an explicit expression) https://docs.python.org/3/reference/simple_stmts.html#the-raise-statement
[2] Discussion/examples of traceback differences between raise and raise e https://stackoverflow.com/questions/36153805/difference-between-raise-and-raise-e
[3] PEP 409 / exception context suppression (raise ... from None) https://peps.python.org/pep-0409/
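Both wrapping forms cited above are easy to verify in plain Python: raise ... from e records the chain on __cause__, while from None sets __suppress_context__:

```python
def wrap_with_cause():
    try:
        raise ValueError("original")
    except ValueError as e:
        raise RuntimeError("wrapped") from e


def wrap_without_context():
    try:
        raise ValueError("original")
    except ValueError:
        raise RuntimeError("wrapped") from None


try:
    wrap_with_cause()
except RuntimeError as exc:
    chained_cause = exc.__cause__  # the original ValueError

try:
    wrap_without_context()
except RuntimeError as exc:
    suppressed = exc.__suppress_context__  # True: context hidden in tracebacks
```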


🏁 Script executed:

# Check if the file exists and examine the actual code at lines 34-43 and 62-68
if [ -f "chat-server/app/services/gopie/dataset_info.py" ]; then
  echo "=== Lines 34-43 ==="
  sed -n '34,43p' chat-server/app/services/gopie/dataset_info.py
  echo ""
  echo "=== Lines 62-68 ==="
  sed -n '62,68p' chat-server/app/services/gopie/dataset_info.py
else
  echo "File not found at expected path"
  # Try to find it
  find . -name "dataset_info.py" -type f 2>/dev/null
fi

Repository: factly/gopie

Length of output: 810


Use bare raise to preserve the original exception traceback.

Inside the except blocks at lines 34–43 and 62–68, raise e changes the reported raise location and obscures the original traceback. Since logger.exception() already captures the full exception details, use bare raise instead to maintain the complete call stack for debugging and error diagnostics.

🔧 Suggested fix
     except Exception as e:
         logger.exception(f"Error getting dataset info: {e!s}")
-        raise e
+        raise

Also applies to: 62-68

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 43-43: Use raise without specifying exception name

Remove exception name

(TRY201)

🤖 Prompt for AI Agents
In `@chat-server/app/services/gopie/dataset_info.py` around lines 34-43,
replace the re-raising pattern "raise e" with a bare "raise" so the original
traceback is preserved; update both except Exception as e blocks (the one that
logs via logger.exception("Error getting dataset info...") and the similar
block later in the file) to log with logger.exception(...) and then re-raise
with a bare raise.

Comment on lines +86 to 88
dense_vector = embeddings.embed_query(user_query)
sparse_vector = generate_sparse_vector(user_query)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "schema_search.py" -type f 2>/dev/null | head -20

Repository: factly/gopie

Length of output: 106


🏁 Script executed:

# Check for requirements files to determine langchain version
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | head -20

Repository: factly/gopie

Length of output: 153


🏁 Script executed:

# Also check the actual file content around lines 86-88
fd schema_search.py --type f | head -1 | xargs cat -n 2>/dev/null | sed -n '75,100p'

Repository: factly/gopie

Length of output: 1155


🏁 Script executed:

# Get the function signature and broader context
cat -n ./chat-server/app/services/qdrant/schema_search.py | sed -n '60,90p'

Repository: factly/gopie

Length of output: 1311


🏁 Script executed:

# Check the pyproject.toml for langchain/langchain-openai version
cat ./chat-server/pyproject.toml | head -100

Repository: factly/gopie

Length of output: 1748


🏁 Script executed:

# Search for aembed_query usage in the codebase to see if it's already used elsewhere
rg -i "aembed_query" --type py

Repository: factly/gopie

Length of output: 650


🌐 Web query:

langchain-openai OpenAIEmbeddings aembed_query async method version 0.3.9

💡 Result:

In langchain-openai==0.3.9, OpenAIEmbeddings includes an async aembed_query method.

  • Signature: async aembed_query(text: str, **kwargs) -> list[float] (returns a single embedding vector for one input string) [1].
  • Typical use:
    from langchain_openai import OpenAIEmbeddings
    
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
    vec = await embeddings.aembed_query("The meaning of life is 42")
    [1]

Implementation-wise, aembed_query is a thin wrapper that calls aembed_documents([text]) and returns the first result [2].

Sources: LangChain OpenAIEmbeddings reference docs [1], LangChain source module showing aembed_query implementation [2].


🏁 Script executed:

# Check if there are other async embedding calls in the codebase as reference
rg "await.*embed" --type py -A 2 -B 2

Repository: factly/gopie

Length of output: 3850


Use the async aembed_query method to avoid blocking the event loop. The search_schemas function is async (line 68), but embed_query is a synchronous call that will block at line 86. Use await embeddings.aembed_query(user_query) instead. The method is available in langchain-openai ≥ 0.3.9 and is already used elsewhere in the codebase (e.g., health.py).

Suggested fix
-        dense_vector = embeddings.embed_query(user_query)
+        dense_vector = await embeddings.aembed_query(user_query)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
dense_vector = embeddings.embed_query(user_query)
sparse_vector = generate_sparse_vector(user_query)
dense_vector = await embeddings.aembed_query(user_query)
sparse_vector = generate_sparse_vector(user_query)
🤖 Prompt for AI Agents
In `@chat-server/app/services/qdrant/schema_search.py` around lines 86 - 88, The
async function search_schemas is calling the blocking embeddings.embed_query
which will block the event loop; change the call to the async variant await
embeddings.aembed_query(user_query) and assign its result back to dense_vector,
leaving generate_sparse_vector(user_query) as-is; ensure the embeddings object
supports aembed_query (as used elsewhere, e.g., health.py) and update any nearby
error handling or type expectations for dense_vector if needed.
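
As a general pattern, when a client exposes no native async method, a blocking call can be pushed off the event loop with `asyncio.to_thread`; here `aembed_query` is the right fix, but the sketch below (all names hypothetical, with a stand-in for the embedding call) shows the fallback:

```python
import asyncio
import time

def blocking_embed(text: str) -> list[float]:
    # Stand-in for a synchronous embedding call (hypothetical).
    time.sleep(0.05)  # simulates network latency that would block the loop
    return [float(len(text))]

async def main() -> list[float]:
    # asyncio.to_thread runs the blocking call in a worker thread, keeping
    # the event loop free; a native async method such as aembed_query is
    # preferable whenever the client provides one.
    return await asyncio.to_thread(blocking_embed, "hello")

result = asyncio.run(main())
print(result)  # -> [5.0]
```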

Comment on lines +5 to +6
import httpx
from bs4 import BeautifulSoup

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify httpx is declared in dependency manifests.
set -euo pipefail

deps=$(fd -a -t f 'pyproject.toml|poetry.lock|Pipfile|setup.cfg|requirements.*|constraints.*' || true)
if [ -z "$deps" ]; then
  echo "No dependency manifests found."
  exit 0
fi

echo "=== Found dependency manifest files ==="
echo "$deps"
echo
echo "=== Checking for httpx declaration ==="
rg -n 'httpx' $deps || echo "httpx not found in dependency manifests"

Repository: factly/gopie

Length of output: 324


🏁 Script executed:

#!/bin/bash
# Check for BeautifulSoup/beautifulsoup4 in dependency manifests
rg -n 'beautifulsoup|bs4' /home/jailuser/git/chat-server/pyproject.toml /home/jailuser/git/pulumi/requirements.txt /home/jailuser/git/scripts/starter-project/requirements.txt 2>/dev/null || echo "beautifulsoup4/bs4 not found in dependency manifests"

Repository: factly/gopie

Length of output: 132


Add httpx to chat-server/pyproject.toml dependencies.

The script imports httpx at line 5 but it is not declared in the project's dependency manifest. This will cause an ImportError in CI and production environments. Add httpx to the dependencies section of pyproject.toml alongside the existing beautifulsoup4 declaration.

🤖 Prompt for AI Agents
In `@chat-server/scripts/scrape_duckdb_docs.py` around lines 5 - 6, The project
imports httpx in scrape_duckdb_docs.py but pyproject.toml lacks that dependency;
update the pyproject.toml dependencies section to include "httpx" (alongside the
existing "beautifulsoup4") so CI and production environments can install
it—ensure the exact package name "httpx" is added under the same dependencies
list used for other runtime packages.

Comment on lines +16 to +23
async def _fetch_page(self, client: httpx.AsyncClient, url: str) -> str | None:
try:
response = await client.get(url, timeout=30.0)
response.raise_for_status()
return response.text
except Exception as e:
print(f"Failed to fetch {url}: {e}")
return None

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and examine the relevant section
find . -name "scrape_duckdb_docs.py" -type f

Repository: factly/gopie

Length of output: 99


🏁 Script executed:

# Look for logging configuration and patterns in the chat-server directory
find chat-server -type f -name "*.py" | head -20

Repository: factly/gopie

Length of output: 792


🏁 Script executed:

# Search for logging imports and usage patterns in the codebase
rg "import logging" chat-server/ -t py -c

Repository: factly/gopie

Length of output: 125


🏁 Script executed:

# Search for JSON structured logging patterns
rg "json.*log|structlog|pydantic.*log" chat-server/ -t py -A 2

Repository: factly/gopie

Length of output: 38


🏁 Script executed:

# Examine the logging configuration
cat -n chat-server/app/core/log.py

Repository: factly/gopie

Length of output: 1941


🏁 Script executed:

# Examine the actual file content
cat -n chat-server/scripts/scrape_duckdb_docs.py

Repository: factly/gopie

Length of output: 6154


🏁 Script executed:

# Look for other examples of error handling in the codebase
rg "httpx\." chat-server/ -t py -B 2 -A 5 | head -60

Repository: factly/gopie

Length of output: 2021


🏁 Script executed:

# Check for other error handling patterns
rg "except.*Error|except.*Exception" chat-server/ -t py -B 2 -A 3 | head -80

Repository: factly/gopie

Length of output: 6095


🏁 Script executed:

# Check if scripts are expected to follow the same logging standards
rg "from app.core.log|CustomLogger" chat-server/scripts/ -t py

Repository: factly/gopie

Length of output: 38


🏁 Script executed:

# Look for other scripts to see their logging patterns
ls -la chat-server/scripts/

Repository: factly/gopie

Length of output: 308


🏁 Script executed:

# Verify if there's JSON logging anywhere in the codebase
rg "json.*format|JSONFormatter|structlog" chat-server/ -t py

Repository: factly/gopie

Length of output: 38


🏁 Script executed:

# Verify httpx exception types exist
python3 << 'EOF'
import httpx
print("HTTPStatusError:", hasattr(httpx, 'HTTPStatusError'))
print("RequestError:", hasattr(httpx, 'RequestError'))
print("HTTPError:", hasattr(httpx, 'HTTPError'))
EOF

Repository: factly/gopie

Length of output: 112


Avoid broad Exception + print; use proper error handling with specific httpx exceptions and logging.

Catching all exceptions hides real bugs and masks errors. Handle httpx.HTTPStatusError (status codes) and httpx.RequestError (transport errors) explicitly, and emit logs instead of print statements. This aligns with the error handling and logging requirements in coding guidelines.

🔧 Proposed fix
 import asyncio
 import json
+import logging
 from pathlib import Path

 import httpx
 from bs4 import BeautifulSoup

+logger = logging.getLogger(__name__)
+
 class DuckDBDocsScraper:
     async def _fetch_page(self, client: httpx.AsyncClient, url: str) -> str | None:
         try:
             response = await client.get(url, timeout=30.0)
             response.raise_for_status()
             return response.text
-        except Exception as e:
-            print(f"Failed to fetch {url}: {e}")
+        except httpx.HTTPStatusError as e:
+            logger.warning(
+                f"Failed to fetch {url}: HTTP {e.response.status_code}",
+                extra={"url": url, "status_code": e.response.status_code},
+            )
+            return None
+        except httpx.RequestError as e:
+            logger.error(
+                f"Failed to fetch {url}: {e}",
+                extra={"url": url, "error": str(e)},
+            )
             return None
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 20-20: Consider moving this statement to an else block

(TRY300)


[warning] 21-21: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@chat-server/scripts/scrape_duckdb_docs.py` around lines 16 - 23, The
_fetch_page function currently catches all Exceptions and uses print; change it
to catch httpx.RequestError for transport-level failures around await
client.get(url, ...) and httpx.HTTPStatusError for non-2xx responses raised by
response.raise_for_status(), log errors via a module logger (e.g., logger =
logging.getLogger(__name__)) instead of print and include the URL and exception
details in the log (use logger.error or logger.exception), and still return
response.text on success or None on handled errors; keep the function signature
and return semantics the same.

Comment on lines +401 to +420
def _normalize_message(self, message: Any) -> tuple[str, str]:
"""Extract role and content from a message dict; handle missing/alternate keys."""
if isinstance(message, dict):
role = message.get("role") or message.get("type") or "user"
if role == "human":
role = "user"
content = (
message.get("content")
or message.get("text")
or message.get("message")
or ""
)
content = str(content) if content is not None else ""
else:
role = "user"
content = str(message) if message is not None else ""
content_escaped = (content or "").replace("\\", "\\\\").replace(
'"', '\\"'
).replace("\n", "\\n").replace("\r", "")
return role, content_escaped

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

LangChain message role type values ai human system tool OpenAI mapping

💡 Result:

LangChain message role/type values (core)

LangChain’s main message classes (Python/JS) correspond to these conceptual roles: system, user, assistant, tool; plus a legacy function role for OpenAI’s older function-calling API. [1], [2], [3]

  • SystemMessage → role system [2], [3]
  • HumanMessage → role user [2], [3]
  • AIMessage (and streaming AIMessageChunk) → role assistant [2]
  • ToolMessage → role tool [2], [3]
  • FunctionMessage (legacy) → role function (legacy) [2], [3]

OpenAI role mapping (what LangChain converts to)

When LangChain converts messages to OpenAI Chat Completions message params, it maps to OpenAI-style roles: user, assistant, system, tool, function, developer. [4], [5]

Typical mapping:

  • HumanMessage → {"role":"user", ...} [2], [3]
  • AIMessage → {"role":"assistant", ...} (and tool calls are carried on the assistant message) [6]
  • SystemMessage → {"role":"system", ...} [6]
  • ToolMessage → {"role":"tool", ...} (includes tool_call_id; may include name) [3], [6]
  • FunctionMessage (legacy) → {"role":"function", ...} [2], [3]

Note on developer: LangChain’s OpenAI adapter mentions model-specific mappings such as converting system → developer for certain “reasoning models,” and generally supports the developer role in its conversion layer. [4], [5]

Built-in converters

  • Python: convert_to_openai_messages(...) shows the concrete OpenAI dicts produced (including assistant tool_calls and tool messages). [6]
  • JS (@langchain/openai): convertMessagesToCompletionsMessageParams / convertStandardContentMessageToCompletionsMessage document the role mapping behavior (including developer). [4], [5]

Sources: LangChain concepts/docs and reference pages for message roles/types and OpenAI conversion utilities. [1]-[6]


🏁 Script executed:

# First, let's examine the actual file to understand the context
cat -n chat-server/tests/e2e/utils/generate_app_cases.py | sed -n '380,430p'

Repository: factly/gopie

Length of output: 2662


🏁 Script executed:

# Search for how messages are used in this file and if role normalization matters
rg -n "role" chat-server/tests/e2e/utils/generate_app_cases.py | head -20

Repository: factly/gopie

Length of output: 838


🏁 Script executed:

# Check if there are any tests or usages that show what roles are expected
rg -n "_normalize_message" chat-server/tests/e2e/utils/generate_app_cases.py -A 5 -B 5

Repository: factly/gopie

Length of output: 1575


🏁 Script executed:

# Search for OpenAI API schema or message format requirements in the codebase
rg -n "OpenAI\|openai" chat-server/tests/e2e/utils/generate_app_cases.py

Repository: factly/gopie

Length of output: 38


🏁 Script executed:

# Search for where messages are created or come from
rg -n "messages\s*=" chat-server/tests/e2e/utils/generate_app_cases.py | head -20

Repository: factly/gopie

Length of output: 38


🏁 Script executed:

# Look for any place where "ai" role is created
rg -n '"ai"|\'ai\'' chat-server/tests/e2e/utils/generate_app_cases.py

Repository: factly/gopie

Length of output: 124


🏁 Script executed:

# Check the TestCasesList and test case structure
rg -n "class.*TestCase|messages.*:" chat-server/tests/e2e/utils/generate_app_cases.py -B 2 -A 5

Repository: factly/gopie

Length of output: 3974


🏁 Script executed:

# Search for LangChain imports and usage
rg -n "langchain|LangChain" chat-server/tests/e2e/utils/generate_app_cases.py -i

Repository: factly/gopie

Length of output: 109


Add role escaping to prevent code generation injection.

The current code escapes message content but leaves role values unescaped. Since role is inserted directly into generated Python code templates (lines 448 and 481), any special characters (quotes, backslashes) in a role value could break the output. Escape role the same way content is escaped:

         if isinstance(message, dict):
             role = message.get("role") or message.get("type") or "user"
             if role == "human":
                 role = "user"
             content = (
                 message.get("content")
                 or message.get("text")
                 or message.get("message")
                 or ""
             )
             content = str(content) if content is not None else ""
         else:
             role = "user"
             content = str(message) if message is not None else ""
+        role_escaped = str(role).replace("\\", "\\\\").replace('"', '\\"')
         content_escaped = (content or "").replace("\\", "\\\\").replace(
             '"', '\\"'
         ).replace("\n", "\\n").replace("\r", "")
-        return role, content_escaped
+        return role_escaped, content_escaped

Also consider normalizing "ai" → "assistant" if test case messages come from LangChain sources (consistent with the existing "human" → "user" normalization), though this may be defensive if the current data only produces OpenAI-standard roles.

🤖 Prompt for AI Agents
In `@chat-server/tests/e2e/utils/generate_app_cases.py` around lines 401 - 420,
The _normalize_message function currently escapes content but not role, which
can allow special chars in role to break generated Python code; update
_normalize_message to escape role the same way as content (replace backslash
with \\\\, double-quote with \\", newline with \\n, and remove \r) and return
the escaped role, and also normalize role values by mapping "human"→"user" and
"ai"→"assistant" (in addition to the existing "type" fallback) so generated
templates that inject role values use safe, normalized strings.
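
The escaping chain discussed above can be exercised directly; a short sketch (function name hypothetical) mirroring the replace sequence and checking that the escaped value round-trips through a generated double-quoted literal:

```python
def escape_for_template(value: str) -> str:
    # Mirror of the replace chain discussed above: backslashes first,
    # then double quotes, newlines, and carriage returns.
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "")
    )

raw = 'say "hi"\nline2\\end'
escaped = escape_for_template(raw)
# Embedding the escaped value in a generated double-quoted literal
# parses back to the original string (minus any \r characters).
generated = f'"{escaped}"'
assert eval(generated) == raw
print(escaped)
```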

Comment on lines +31 to +33
class TestGenerateSqlPromptFormatting:
@pytest.fixture
def mock_dataset_schema(self):

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find chat-server -name "pytest.ini" -o -name "pyproject.toml" -o -name "setup.cfg" | head -10

Repository: factly/gopie

Length of output: 82


🏁 Script executed:

ls -la chat-server/tests/unit/ 2>/dev/null | head -20

Repository: factly/gopie

Length of output: 1337


🏁 Script executed:

head -50 chat-server/tests/unit/test_generate_sql_prompt_formatting.py 2>/dev/null

Repository: factly/gopie

Length of output: 1594


🏁 Script executed:

grep -r "pytest.mark.unit" chat-server/tests/unit/ 2>/dev/null | head -5

Repository: factly/gopie

Length of output: 443


🏁 Script executed:

cat chat-server/pyproject.toml

Repository: factly/gopie

Length of output: 1748


🏁 Script executed:

cat -n chat-server/tests/unit/test_generate_sql_prompt_formatting.py

Repository: factly/gopie

Length of output: 7576


Add the unit test marker to align with project conventions.

Unit tests under chat-server/tests/unit require the pytest.mark.unit marker for reliable filtering. This file is currently missing it, while other tests in the same directory follow this requirement.

Add pytestmark = pytest.mark.unit at the module level (after the imports and docstring, as established in other test files in this directory):

🔧 Suggested change
@@
"""
Unit tests for generate_sql_prompt formatting logic, specifically verifying column value inclusion.
"""
from typing import TypedDict
from unittest.mock import MagicMock

import pytest

from app.models.data import ColumnValueMatching
from app.models.schema import DatasetSchema
from app.workflow.graph.sql_planner_graph.types import DatasetsInfo
from app.workflow.prompts.generate_sql_prompt import format_generate_sql_input

+pytestmark = pytest.mark.unit
🤖 Prompt for AI Agents
In `@chat-server/tests/unit/test_generate_sql_prompt_formatting.py` around lines
31 - 33, Add the missing module-level pytest marker by declaring pytestmark =
pytest.mark.unit near the top of the test module (after imports and any module
docstring) so this file follows the same convention as other unit tests; modify
the TestGenerateSqlPromptFormatting test module to include that pytestmark
declaration (referencing the module that contains class
TestGenerateSqlPromptFormatting).

Comment on lines +48 to +51
# Verify query_points was called with correct structure
call_args = mock_client.query_points.call_args
assert call_args.kwargs["filter"] is None
assert len(call_args.kwargs["prefetch"]) == 2 # Dense and sparse prefetch

⚠️ Potential issue | 🟡 Minor

Potential KeyError when accessing kwargs["filter"].

Based on the search_schemas implementation in the relevant code snippet, the filter parameter is passed inside each Prefetch object, not at the top level of query_points. If filter is not explicitly passed as a keyword argument, accessing call_args.kwargs["filter"] will raise a KeyError instead of returning None.

Other tests in this file (lines 77, 107, 137, 165) use the safer pattern: call_args.kwargs.get("filter").

Proposed fix
         # Verify query_points was called with correct structure
         call_args = mock_client.query_points.call_args
-        assert call_args.kwargs["filter"] is None
+        assert call_args.kwargs.get("filter") is None
         assert len(call_args.kwargs["prefetch"]) == 2  # Dense and sparse prefetch
🤖 Prompt for AI Agents
In `@chat-server/tests/unit/test_schema_search.py` around lines 48 - 51, The test
accesses query call kwargs with call_args.kwargs["filter"] which can raise
KeyError because search_schemas places filter inside each Prefetch rather than
as a top-level kwarg; update the assertion to use call_args.kwargs.get("filter")
(matching other tests) and keep the existing assertion on prefetch length
(mock_client.query_points, search_schemas, Prefetch are the symbols to
update/inspect).
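
The difference between the two access patterns can be demonstrated with a minimal mock (collection name and prefetch values are illustrative only):

```python
from unittest.mock import MagicMock

client = MagicMock()
# Simulate the call under test: no top-level `filter` kwarg, only
# `prefetch` (as search_schemas is described to do above).
client.query_points(collection_name="schemas", prefetch=["dense", "sparse"])

call_args = client.query_points.call_args
# .get() returns None for an absent kwarg instead of raising.
assert call_args.kwargs.get("filter") is None
assert len(call_args.kwargs["prefetch"]) == 2

# Direct indexing on a missing key raises KeyError:
try:
    call_args.kwargs["filter"]
except KeyError:
    print("KeyError raised")
```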

…onous source-to-destination migration with hybrid vector support.
@claude

claude Bot commented Feb 4, 2026

Claude finished @paul-tharun's task —— View job


🔍 PR Review: Feat/e2e datasets metadata

Review Progress

  • Read repository setup and guidelines
  • Review architectural changes
  • Analyze security implementations
  • Check error handling patterns
  • Evaluate performance implications
  • Assess test coverage
  • Provide comprehensive feedback

🏗️ Architecture Review

This PR introduces significant architectural improvements with a well-structured approach to NL-to-SQL processing and multi-tenant organization support.

Positive Changes

1. Clean NL-to-SQL Graph Architecture

  • Supervisor routing: Clean logic in nl_to_sql_graph/node/supervisor.py routing single vs multi-dataset queries
  • Separation of concerns: Clear split between semantic search, SQL agent, and planning
  • LangGraph patterns: Proper use of StateGraph, Command, and typed state management

2. Multi-Tenant Organization Support

  • Proper isolation: Organization ID properly threaded through GopieClient and Qdrant filters
  • Security boundaries: Clear separation using must conditions for org_id in schema search
  • Header management: Consistent X-Organization-id header handling

3. SQL Injection Prevention

  • Query builders: Safe abstraction through utils/olap/ query builders
  • Parameterized approach: SQL generation through builders rather than string interpolation
  • Validation patterns: Good use of escaped values in column matching

🚨 Critical Security Issues

1. Missing Organization ID in Semantic Search

Location: chat-server/app/workflow/graph/nl_to_sql_graph/node/semantic_search.py:20

The semantic search function calls search_schemas without passing org_id, allowing cross-tenant data leakage:

# VULNERABLE: No org_id passed
results = await search_schemas(
    user_query=user_query,
    embeddings=embeddings_model,
    dataset_ids=dataset_ids,
    project_ids=project_ids,
    # Missing: org_id=org_id
)

Fix Required: Extract org_id from config and pass it to search_schemas.

2. Division by Zero Risk

Location: chat-server/app/utils/graph_utils/table_utils.py:43

# VULNERABLE: Will crash if estimated_size is 0
sample_pct = (settings.TARGET_ROWS / estimated_size) * 100

Fix Required: Add zero-check before division.
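
A guarded version of the calculation can be sketched as follows (the TARGET_ROWS value and function name are assumed for illustration, not taken from the codebase):

```python
TARGET_ROWS = 10_000  # assumed setting

def calculate_sampling_percentage(estimated_size: int) -> float:
    # Guard against empty or unknown table sizes: fall back to 100%
    # sampling instead of dividing by zero, and cap the result at 100.
    if estimated_size <= 0:
        return 100.0
    return min((TARGET_ROWS / estimated_size) * 100, 100.0)

print(calculate_sampling_percentage(0))          # -> 100.0 (empty table)
print(calculate_sampling_percentage(1_000_000))  # -> 1.0
```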

3. Direct SQL Interpolation

Location: chat-server/app/workflow/graph/single_dataset_graph/node/prepare_data.py:121

# POTENTIAL RISK: Direct interpolation without validation
sample_data_query = f"SELECT * FROM {dataset_name} LIMIT 50"

While dataset_name comes from trusted Qdrant storage, consider defensive identifier quoting.
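
Defensive identifier quoting can use standard SQL double-quote escaping, which DuckDB accepts; a minimal sketch (helper name hypothetical):

```python
def quote_identifier(name: str) -> str:
    # Standard SQL identifier quoting (also valid in DuckDB): wrap the
    # name in double quotes and double any embedded quote characters.
    return '"' + name.replace('"', '""') + '"'

dataset_name = "my_table"  # illustrative value from trusted storage
query = f"SELECT * FROM {quote_identifier(dataset_name)} LIMIT 50"
print(query)  # -> SELECT * FROM "my_table" LIMIT 50
```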


⚠️ Error Handling Issues

1. Inconsistent Error Guards

  • execute_sql.py:34: Checks isinstance(last_message, ErrorMessage) but last_message is a list
  • sql_agent.py:25: Accesses query_result.single_dataset_query_result before null check

2. Missing Resource Cleanup

  • No explicit cleanup logic for failed operations in new graph workflows
  • Broad exception handling masks specific failure modes

3. State Validation Issues

  • prepare_data: Returns without dataset_info, yet the graph always routes to sql_agent
  • routing.py: Uses state.get("query_result", []), which can be a list, causing an AttributeError

🚀 Performance Considerations

Good Optimizations

  • Smart sampling: Table size estimation prevents performance issues on large datasets
  • Vector search optimization: Hybrid search with proper filtering
  • Connection pooling: Singleton pattern for HTTP client connections
  • Async operations: Proper async/await throughout
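
The singleton-plus-timeout idea noted here can be sketched generically; the client class below is a stand-in for an HTTP client such as httpx.AsyncClient, not the real GopieClient API:

```python
from functools import lru_cache

class FakeAsyncClient:
    # Stand-in for an HTTP client (assumption: the real client would be
    # httpx.AsyncClient); only the timeout configuration matters here.
    def __init__(self, timeout: float):
        self.timeout = timeout

@lru_cache(maxsize=1)
def get_client() -> FakeAsyncClient:
    # lru_cache yields a process-wide singleton, so connections are pooled
    # and the timeout is configured once rather than per request.
    return FakeAsyncClient(timeout=30.0)

assert get_client() is get_client()  # same instance on every call
print(get_client().timeout)  # -> 30.0
```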

⚠️ Potential Concerns

  • Multiple vector searches: DuckDB docs search adds latency to SQL planning
  • Memory usage: Complex nested state objects in high-concurrency scenarios
  • Missing timeouts: GopieClient lacks explicit timeout configuration

🧪 Test Coverage Assessment

Good Coverage

  • Existing unit tests cover GopieClient org_id handling
  • DSPy integration provides advanced evaluation
  • E2E visualization test infrastructure

Missing Tests

  • No tests for new /fetch-sql endpoint
  • No integration tests for NL-to-SQL graph workflow
  • Missing edge case tests for new graph failure scenarios
  • No tests for org_id security boundaries

📋 Code Quality Issues

1. Error Message Inconsistencies

# Vague error without context
raise ValueError("query_result is not properly initialized")

2. Magic Numbers

# Should be configurable
SELECT * FROM {table} LIMIT 200000

3. Missing Function Documentation

Several new functions lack comprehensive docstrings explaining return contracts.


🔧 Specific Recommendations

High Priority Fixes

  1. Add org_id to semantic search to prevent data leakage
  2. Add zero-check in calculate_sampling_percentage
  3. Fix error guard in execute_sql.py (list vs ErrorMessage)
  4. Add validation in sql_agent.py before accessing query_result

Medium Priority

  1. Add timeouts to GopieClient requests
  2. Improve error messages with specific context
  3. Add integration tests for new workflows
  4. Configure magic numbers via settings

Code Quality

  1. Add comprehensive docstrings to new functions
  2. Implement consistent retry logic across agents
  3. Add resource cleanup for failed operations

📊 Overall Assessment

Category     | Rating | Comments
Architecture | ⭐⭐⭐⭐⭐ | Excellent modular design
Security     | ⭐⭐⭐⚪⚪ | Good foundations, critical gaps
Performance  | ⭐⭐⭐⭐⚪ | Smart optimizations implemented
Testing      | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking
Code Quality | ⭐⭐⭐⭐⚪ | Clean code, room for improvement

Recommendation: This is a solid architectural improvement, but the security issue in semantic search must be fixed before merging to prevent cross-tenant data access.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@chat-server/tests/scripts/reset_and_reindex_collection.py`:
- Around line 1-33: The imports at the top of reset_and_reindex_collection.py
include unnecessary "# noqa: E402" markers (on AsyncQdrantClient, models,
settings, logger, QdrantSetup, generate_sparse_vector); remove these trailing "#
noqa: E402" annotations from those import lines so Ruff can report real
import-order issues in future, leaving the import order and the sys.path
insertion unchanged and without adding other noqa comments.
- Around line 78-86: In run_generate_sparse_vector, replace the current re-raise
"raise e" with a bare "raise" to preserve the original traceback when an
exception is caught; keep the logger.error call as-is so the error is logged but
ensure you re-raise with bare raise in the except block of
generate_sparse_vector wrapper to retain original exception context.

Comment thread chat-server/tests/scripts/reset_and_reindex_collection.py
Comment thread chat-server/tests/scripts/reset_and_reindex_collection.py
@claude

claude Bot commented Feb 5, 2026

Claude finished @paul-tharun's task —— View job


PR Review: Feat/e2e datasets metadata

I've conducted a comprehensive review of this large PR with 100 files changed. Here are my findings:

Review Checklist

  • Read repository setup and guidelines (CLAUDE.md, configs)
  • Review architectural changes (new graphs, workflows)
  • Analyze API security and validation
  • Check error handling and edge cases
  • Evaluate performance implications
  • Assess test coverage
  • Provide comprehensive feedback



…ation and synchronously initialize the sparse vector model.
@claude

claude Bot commented Feb 5, 2026

Claude finished @paul-tharun's task —— View job


🔍 PR Review: Feat/e2e datasets metadata

I've conducted a comprehensive review of this large PR with 100 files changed. Here are my findings:

Review Progress

  • Read repository setup and guidelines (CLAUDE.md, configs)
  • Review architectural changes (new graphs, workflows)
  • Analyze API security and validation
  • Check error handling and edge cases
  • Evaluate performance implications
  • Assess test coverage
  • Provide comprehensive feedback

🏗️ Architecture & Design

Positive Changes

  • Well-structured agent architecture: New nl_to_sql_graph and sql_planner_graph follow LangGraph patterns correctly
  • Separation of concerns: Clear separation between SQL planning, execution, and validation workflows
  • Type safety: Good use of Pydantic models for API validation (FetchSqlRequest, FetchSqlResponse)
  • Unified client pattern: New GopieClient centralizes API communication with proper header management

⚠️ Architectural Concerns

  • Commented code in workflows: chat-server/app/workflow/graph/sql_planner_graph/graph.py:19-30 has extensive commented conditional edges that should either be implemented or removed
  • Complex state management: The addition of multiple new graph types increases complexity without clear documentation of when each should be used

🔒 Security Analysis

Good Security Practices

  • Input validation: Proper Pydantic validation in new API endpoints (fetch_sql.py:21-22)
  • SQL injection prevention: Using parameterized queries and proper escaping (column_value_matching.py:40)
  • Header handling: Proper organization ID header management in GopieClient

🚨 Security Issues

  1. Insufficient input sanitization in column_value_matching.py:40:

    WHERE LOWER(CAST({column_name} AS VARCHAR)) LIKE '%' || LOWER('{escaped_value}') || '%'

    The escaped_value variable name suggests sanitization, but no actual escaping is shown in this snippet.

  2. Generic exception handling in fetch_sql.py:36-42:

    except Exception as e:
        logger.error(f"Error in fetch_sql: {e}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="Failed to generate SQL queries, please try again.",
        ) from e

    This could mask important security errors and returns overly generic error messages to the client.


🚀 Performance Analysis

Performance Improvements

  • Smart sampling: New table size estimation and sampling logic in table_utils.py prevents performance issues on large datasets
  • Connection pooling: Using singleton pattern for HTTP client connections (SingletonAiohttp)
  • Async operations: Proper use of async/await throughout new components

⚠️ Performance Concerns

  • Multiple vector searches: New DuckDB documentation search (duckdb_docs/search.py) adds another vector search operation that could impact latency
  • Memory usage: Complex state objects with nested query results could consume significant memory in high-concurrency scenarios

🧪 Testing & Quality

Testing Infrastructure

  • Good test structure: Maintained separation between unit and e2e tests
  • DSPy integration: Advanced evaluation system with optimized evaluators

Testing Gaps

  1. No tests for new API endpoint: fetch_sql.py lacks dedicated test coverage
  2. Missing edge case tests: New graph workflows need tests for failure scenarios
  3. Integration tests: No tests covering the interaction between new nl_to_sql_graph and existing workflows

🐛 Bug & Error Handling Issues

  1. State validation inconsistency in sql_agent.py:27-28:

    if not query_result or not single_dataset_result:
        raise ValueError("query_result is not properly initialized")

    This doesn't provide context about what went wrong or how to fix it.

  2. Resource cleanup: Missing cleanup logic for failed operations in new graph workflows

  3. Retry logic: Inconsistent retry mechanisms across different agents - some have configurable retry counts, others don't
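On point 3, a generic backoff helper is one way to make retry behavior uniform across agents. This is a sketch under assumptions, not the repository's implementation:

```python
import asyncio


async def with_retries(make_call, attempts: int = 3, base_delay: float = 0.1):
    # Retry an async callable with exponential backoff; re-raise the last
    # exception once attempts are exhausted.
    last_exc: Exception | None = None
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception as exc:  # narrow the exception types in real code
            last_exc = exc
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_exc
```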


📋 Code Quality

Good Practices

  • Type hints: Comprehensive type annotations throughout new code
  • Documentation: Good docstrings in new functions like get_duckdb_docs()
  • Code organization: Logical file structure following established patterns

⚠️ Areas for Improvement

  • Hardcoded values: Magic numbers like LIMIT 200000 in SQL queries should be configurable
  • Long functions: Some functions exceed 50 lines and could benefit from decomposition
  • Duplicate logic: Similar error handling patterns repeated across multiple files

🔧 Specific Recommendations

  1. Security:

    # In column_value_matching.py - add proper escaping
    def escape_sql_value(value: str) -> str:
        return value.replace("'", "''").replace("%", "\\%").replace("_", "\\_")
  2. Error handling:

    # In fetch_sql.py - more specific exception handling
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except ConnectionError as e:
        raise HTTPException(status_code=503, detail="Service unavailable")
  3. Configuration:

    # Add to config.py
    DEFAULT_QUERY_LIMIT: int = 200000
    MAX_SAMPLING_PERCENTAGE: float = 5.0
  4. Testing:

    • Add unit tests for fetch_sql_router
    • Add integration tests for new graph workflows
    • Add performance tests for large dataset scenarios
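One caveat on the escaping sketch in recommendation 1: backslash-escaped % and _ only behave as literals when the LIKE predicate also declares an ESCAPE character. A fuller (still hypothetical) version:

```python
def escape_like_value(value: str) -> str:
    # Escape the escape character first, then quotes and LIKE wildcards.
    # The generated SQL must append an ESCAPE clause, e.g.:
    #   ... LIKE '%' || ? || '%' ESCAPE '\'
    return (
        value.replace("\\", "\\\\")
        .replace("'", "''")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
```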

📊 Overall Assessment

Category     | Rating | Comments
Architecture | ⭐⭐⭐⭐⚪ | Well-designed but complex
Security     | ⭐⭐⭐⚪⚪ | Good practices with some gaps
Performance  | ⭐⭐⭐⭐⚪ | Smart optimizations implemented
Testing      | ⭐⭐⚪⚪⚪ | Infrastructure good, coverage lacking
Code Quality | ⭐⭐⭐⭐⚪ | Clean code with room for improvement

Overall: This is a substantial improvement to the system with well-thought-out architectural changes. The main concerns are security gaps, insufficient testing of new features, and some complexity that could be better documented. I recommend addressing the security issues before merging.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
web/src/lib/stores/auth-store.ts (1)

202-211: ⚠️ Potential issue | 🟡 Minor

Missing role: null in catch block of checkSession.

When session check throws an error, role is not reset to null, unlike the successful !response.ok branch at line 197. This could leave stale role data in state after a network error.

🛡️ Proposed fix
       } catch (error) {
         console.error("Session check error:", error);
         set({
           user: null,
           isAuthenticated: false,
           accessToken: null,
           organizationId: null,
+          role: null,
         });
         setGlobalAccessToken(null);
         setGlobalOrganizationId(null);
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/lib/stores/auth-store.ts` around lines 202 - 211, In checkSession's
catch block the state reset omits role, leaving stale role values after errors;
update the set({...}) call inside the catch of checkSession to include role:
null (matching the !response.ok branch) so user/role/accessToken/organizationId
are all cleared on error, keeping state consistent with
setGlobalAccessToken(null) and setGlobalOrganizationId(null).
server/interfaces/http/routes/api/projects/datasets/update.go (1)

66-70: ⚠️ Potential issue | 🟡 Minor

Incorrect error message: "deleting" should be "updating".

The error response message at line 68 says "Error deleting dataset" but this is the update handler. This appears to be a copy-paste error.

🐛 Proposed fix
 		return ctx.Status(fiber.StatusInternalServerError).JSON(fiber.Map{
 			"error":   err.Error(),
-			"message": "Error deleting dataset",
+			"message": "Error updating dataset",
 			"code":    fiber.StatusInternalServerError,
 		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/interfaces/http/routes/api/projects/datasets/update.go` around lines
66 - 70, The error response in the update handler incorrectly says "Error
deleting dataset"; update the message string to "Error updating dataset" in the
JSON returned by the update route (the block that calls
ctx.Status(fiber.StatusInternalServerError).JSON with "error": err.Error());
ensure the same handler/return (the update.go update route) uses "Error updating
dataset" so the message matches the update operation.
server/interfaces/http/routes/source/s3/create.go (2)

154-156: ⚠️ Potential issue | 🔴 Critical

Cleanup struct is missing userID and role initialization.

When the cleanup struct is initialized and later updated, userID and role are never set. However, cleanupResources uses these fields when calling h.datasetSvc.Delete(rc.datasetID, rc.orgID, rc.userID, rc.role) at line 56.

This means cleanup operations after dataset creation will pass empty/zero values for userID and role, which could cause authorization failures or unexpected behavior during error recovery.

🐛 Proposed fix to initialize userID
 	// Initialize cleanup resource object
 	cleanup := resourceCleanup{
 		tableName: res.TableName,
+		userID:    userID,
+		orgID:     orgID,
 	}

And later, only update the dataset-specific fields:

 	// Update cleanup object to include dataset info
 	cleanup.hasDataset = true
 	cleanup.datasetID = dataset.ID
-	cleanup.orgID = dataset.OrgID

Also applies to: 194-197

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/interfaces/http/routes/source/s3/create.go` around lines 154 - 156,
resourceCleanup is constructed without setting userID and role, causing
cleanupResources to call datasetSvc.Delete with empty auth fields; when creating
the cleanup variable in create.go (resourceCleanup{...}) set userID and role
from the current handler/request context (the same userID and role values used
elsewhere in this handler) so those fields are populated, and ensure subsequent
updates (the later assignments at the other initialization site around lines
194-197) only overwrite dataset-specific fields (datasetID, orgID, tableName)
and not userID/role so datasetSvc.Delete receives correct auth parameters.

37-37: ⚠️ Potential issue | 🔴 Critical

The role field is actively used and required; it should be assigned, not removed.

The role field is passed to h.datasetSvc.Delete() at line 56 and is essential for authorization checks (if role == models.Admin). However, both role and userID are never assigned to the cleanup struct during initialization, causing the Delete call to receive zero-values. Since userID and role are available from the context in the upload function, they should be populated when initializing the cleanup struct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/interfaces/http/routes/source/s3/create.go` at line 37, The cleanup
struct is missing assignment of the required authorization fields; set
cleanup.userID and cleanup.role when you create the cleanup instance in the
upload handler so the later call to h.datasetSvc.Delete(...) receives the actual
context values instead of zero-values. Locate the upload function where the
cleanup struct is initialized (the struct with fields userID and role used by
h.datasetSvc.Delete) and populate those fields from the request context (the
same place you already read userID and role) so authorization checks (e.g., if
role == models.Admin) work correctly.
🧹 Nitpick comments (4)
web/src/app/projects/[projectId]/page.tsx (1)

19-23: Consider adding ProtectedPage wrapper for authentication.

Similar to other authenticated pages in this PR (e.g., HomePage), this page should be wrapped with ProtectedPage to ensure unauthenticated users are redirected appropriately.

As per coding guidelines: "Protect routes using the ProtectedRoute wrapper HOC for authentication-required pages".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/app/projects/`[projectId]/page.tsx around lines 19 - 23, The
ProjectPage component is not wrapped with the ProtectedPage auth wrapper; update
the default export so ProjectPage is returned/wrapped by ProtectedPage (i.e.,
export default ProtectedPage(ProjectPage) or wrap the JSX returned by
ProjectPage with <ProtectedPage>), ensuring the params prop handling (params:
Promise<{ projectId: string }>) is preserved and any server/client boundaries
remain correct; locate the ProjectPage function and apply the same ProtectedPage
pattern used by HomePage to enforce authentication and redirection for
unauthenticated users.
web/src/components/dataset/column-descriptions-modal.tsx (1)

233-243: Consider updating help text when editing is disabled.

The message "Click edit to add one" is shown even when canEdit is false, which could confuse users who don't see an edit button.

💡 Suggested improvement
                     ) : (
                       <p className="text-muted-foreground italic">
-                        No description available. Click edit to add one.
+                        No description available.{canEdit && " Click edit to add one."}
                       </p>
                     )}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/components/dataset/column-descriptions-modal.tsx` around lines 233 -
243, The "Click edit to add one." help text is shown even when editing is
disabled; update the JSX in column-descriptions-modal.tsx (the block using
descriptions[column.column_name] and the canEdit prop/variable) to conditionally
render the message: if descriptions[column.column_name] is missing and canEdit
is true show "Click edit to add one.", otherwise show a neutral message such as
"No description available." or "Editing disabled" when canEdit is false; adjust
the conditional around the <p> elements so the text reflects the canEdit state.
server/application/services/store.go (2)

61-68: Drop dead list parameters (or enforce them) for a clearer service contract.

ProjectService.List still accepts createdBy but does not use it, and this pattern now mirrors the role-removal transition. Tightening the signature here will reduce ambiguity for callers.

Also applies to: 107-107

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/application/services/store.go` around lines 61 - 68,
ProjectService.List currently accepts an unused createdBy parameter; either
remove it from the method signature and all callers or wire it through to the
repository call (update ProjectService.List signature and the call to
projectRepo.SearchProject to accept and forward createdBy, or update repository
if necessary). Locate the ProjectService.List method and the
projectRepo.SearchProject invocation to either delete the createdBy argument
from List and callers, or add createdBy handling in List and modify
SearchProject's signature and implementations to accept and filter by createdBy
so the parameter is not left unused.

28-29: Remove unused createdBy from Details service methods.

createdBy is accepted but ignored in both methods, which makes the API contract look stricter than the actual authorization path.

♻️ Proposed cleanup
-func (service *ProjectService) Details(id, orgID, createdBy string) (*models.Project, error) {
+func (service *ProjectService) Details(id, orgID string) (*models.Project, error) {
 	return service.projectRepo.Details(context.Background(), id, orgID)
 }

-func (service *DatasetService) Details(id, orgID, createdBy string) (*models.Dataset, error) {
+func (service *DatasetService) Details(id, orgID string) (*models.Dataset, error) {
 	return service.datasetRepo.Details(context.Background(), id, orgID)
 }

Also applies to: 93-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/application/services/store.go` around lines 28 - 29, The Details
method on ProjectService currently accepts an unused createdBy parameter; remove
createdBy from the ProjectService.Details signature and from any other service
Details variants, update all callers to stop passing createdBy, and adjust any
interface definitions and tests that reference ProjectService.Details; keep the
internal call to projectRepo.Details(context.Background(), id, orgID) unchanged.
Ensure you update method declarations (ProjectService.Details) and any
implementing interfaces so signatures match, run tests, and remove dead
parameters in related methods mentioned in the review (the other Details method
at lines 93-94).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/application/repositories/store.go`:
- Line 80: The ListByProjectAndOrg implementation is ignoring orgID and
createdBy when calling s.List; update the function (ListByProjectAndOrg) to pass
a filter that enforces tenant scoping and optional creator filtering instead of
only projectID and pagination—modify the call to s.List to include an
appropriate filter/map/selector that adds org_id = orgID and, if createdBy is
non-empty, created_by = createdBy (or whatever field names Dataset expects), so
the query returned by s.List is correctly restricted by org and creator.

In `@server/infrastructure/postgres/store/datasets/list_by_project_and_org.go`:
- Around line 9-11: The ListByProjectAndOrg implementation ignores orgID and
createdBy (it just calls s.List), so update it to apply org and creator
filtering: either (A) extend or call an underlying query method that accepts
orgID and createdBy and pass those values into the SQL/ORM filter, or (B) if
filtering is intentionally not applied, add a clear comment on
PgDatasetStore.ListByProjectAndOrg stating that org/user scoping is deferred and
why; specifically modify the function PgDatasetStore.ListByProjectAndOrg to use
orgID and createdBy in the dataset query (or call a new/internal method like
s.listWithFilters(ctx, projectID, orgID, createdBy, pagination)) to enforce
tenant/user isolation and return the paginated result.

In `@server/interfaces/http/routes/api/projects/details.go`:
- Around line 20-24: Replace the header-based org lookup with the
middleware-local org value: stop using ctx.Get(middleware.OrganizationIDHeader)
and instead read the org ID from the middleware-populated locals (e.g.,
ctx.Locals(<middleware's org-local-key>)) before calling
h.projectSvc.Details(projectID, orgID, userID); update the orgID variable
initialization so authorization uses the middleware-scoped org context (use the
exact local key constant provided by the middleware rather than the header
constant) and keep projectID and userID retrievals as-is.

In `@web/src/app/projects/`[projectId]/upload/page.tsx:
- Around line 9-36: The page lacks the authentication wrapper: wrap the
UploadDatasetPage with the app's ProtectedPage/ProtectedRoute so unauthenticated
users are redirected to login before any canEdit checks run; locate the
UploadDatasetPage export in web/src/app/projects/[projectId]/upload/page.tsx and
either return/render <ProtectedPage> (or the project's ProtectedRoute HOC)
around the component's JSX (ensuring DatasetUploadWizard remains inside) or
export the component via ProtectedPage(UploadDatasetPage), keep the existing
useAuthStore, useProject, and canEdit logic intact inside the wrapped component,
and ensure ProtectedPage handles redirect-to-login for unauthenticated users.

---

Outside diff comments:
In `@server/interfaces/http/routes/api/projects/datasets/update.go`:
- Around line 66-70: The error response in the update handler incorrectly says
"Error deleting dataset"; update the message string to "Error updating dataset"
in the JSON returned by the update route (the block that calls
ctx.Status(fiber.StatusInternalServerError).JSON with "error": err.Error());
ensure the same handler/return (the update.go update route) uses "Error updating
dataset" so the message matches the update operation.

In `@server/interfaces/http/routes/source/s3/create.go`:
- Around line 154-156: resourceCleanup is constructed without setting userID and
role, causing cleanupResources to call datasetSvc.Delete with empty auth fields;
when creating the cleanup variable in create.go (resourceCleanup{...}) set
userID and role from the current handler/request context (the same userID and
role values used elsewhere in this handler) so those fields are populated, and
ensure subsequent updates (the later assignments at the other initialization
site around lines 194-197) only overwrite dataset-specific fields (datasetID,
orgID, tableName) and not userID/role so datasetSvc.Delete receives correct auth
parameters.
- Line 37: The cleanup struct is missing assignment of the required
authorization fields; set cleanup.userID and cleanup.role when you create the
cleanup instance in the upload handler so the later call to
h.datasetSvc.Delete(...) receives the actual context values instead of
zero-values. Locate the upload function where the cleanup struct is initialized
(the struct with fields userID and role used by h.datasetSvc.Delete) and
populate those fields from the request context (the same place you already read
userID and role) so authorization checks (e.g., if role == models.Admin) work
correctly.

In `@web/src/lib/stores/auth-store.ts`:
- Around line 202-211: In checkSession's catch block the state reset omits role,
leaving stale role values after errors; update the set({...}) call inside the
catch of checkSession to include role: null (matching the !response.ok branch)
so user/role/accessToken/organizationId are all cleared on error, keeping state
consistent with setGlobalAccessToken(null) and setGlobalOrganizationId(null).

---

Nitpick comments:
In `@server/application/services/store.go`:
- Around line 61-68: ProjectService.List currently accepts an unused createdBy
parameter; either remove it from the method signature and all callers or wire it
through to the repository call (update ProjectService.List signature and the
call to projectRepo.SearchProject to accept and forward createdBy, or update
repository if necessary). Locate the ProjectService.List method and the
projectRepo.SearchProject invocation to either delete the createdBy argument
from List and callers, or add createdBy handling in List and modify
SearchProject's signature and implementations to accept and filter by createdBy
so the parameter is not left unused.
- Around line 28-29: The Details method on ProjectService currently accepts an
unused createdBy parameter; remove createdBy from the ProjectService.Details
signature and from any other service Details variants, update all callers to
stop passing createdBy, and adjust any interface definitions and tests that
reference ProjectService.Details; keep the internal call to
projectRepo.Details(context.Background(), id, orgID) unchanged. Ensure you
update method declarations (ProjectService.Details) and any implementing
interfaces so signatures match, run tests, and remove dead parameters in related
methods mentioned in the review (the other Details method at lines 93-94).

In `@web/src/app/projects/`[projectId]/page.tsx:
- Around line 19-23: The ProjectPage component is not wrapped with the
ProtectedPage auth wrapper; update the default export so ProjectPage is
returned/wrapped by ProtectedPage (i.e., export default
ProtectedPage(ProjectPage) or wrap the JSX returned by ProjectPage with
<ProtectedPage>), ensuring the params prop handling (params: Promise<{
projectId: string }>) is preserved and any server/client boundaries remain
correct; locate the ProjectPage function and apply the same ProtectedPage
pattern used by HomePage to enforce authentication and redirection for
unauthenticated users.

In `@web/src/components/dataset/column-descriptions-modal.tsx`:
- Around line 233-243: The "Click edit to add one." help text is shown even when
editing is disabled; update the JSX in column-descriptions-modal.tsx (the block
using descriptions[column.column_name] and the canEdit prop/variable) to
conditionally render the message: if descriptions[column.column_name] is missing
and canEdit is true show "Click edit to add one.", otherwise show a neutral
message such as "No description available." or "Editing disabled" when canEdit
is false; adjust the conditional around the <p> elements so the text reflects
the canEdit state.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40d0dbb and bda820b.

📒 Files selected for processing (25)
  • server/application/repositories/store.go
  • server/application/services/store.go
  • server/infrastructure/postgres/store/datasets/list_by_project_and_org.go
  • server/infrastructure/postgres/store/datasets/list_by_project_and_role.go
  • server/interfaces/http/middleware/validate.go
  • server/interfaces/http/routes/api/datasets/details.go
  • server/interfaces/http/routes/api/datasets/get-column-descriptions.go
  • server/interfaces/http/routes/api/datasets/update-column-descriptions.go
  • server/interfaces/http/routes/api/projects/datasets/delete.go
  • server/interfaces/http/routes/api/projects/datasets/details.go
  • server/interfaces/http/routes/api/projects/datasets/update.go
  • server/interfaces/http/routes/api/projects/details.go
  • server/interfaces/http/routes/api/projects/list.go
  • server/interfaces/http/routes/source/database/create.go
  • server/interfaces/http/routes/source/database/refresh.go
  • server/interfaces/http/routes/source/s3/create.go
  • web/src/app/page.tsx
  • web/src/app/projects/[projectId]/datasets/[datasetId]/page.tsx
  • web/src/app/projects/[projectId]/page.tsx
  • web/src/app/projects/[projectId]/upload/page.tsx
  • web/src/components/dataset/column-descriptions-modal.tsx
  • web/src/components/dataset/dataset-header.tsx
  • web/src/components/project/inline-project-editor.tsx
  • web/src/components/project/project-card.tsx
  • web/src/lib/stores/auth-store.ts
💤 Files with no reviewable changes (1)
  • server/infrastructure/postgres/store/datasets/list_by_project_and_role.go

DeleteByOrgAndCreator(ctx context.Context, datasetID, orgID, createdBy string) error
UpdateByOrgAndCreator(ctx context.Context, datasetID, orgID, createdBy string, params *models.UpdateDatasetParams) (*models.Dataset, error)
ListByProjectAndRole(ctx context.Context, projectID, orgID, createdBy string, role models.Role, pagination models.Pagination) (*models.PaginationView[*models.Dataset], error)
ListByProjectAndOrg(ctx context.Context, projectID, orgID, createdBy string, pagination models.Pagination) (*models.PaginationView[*models.Dataset], error)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== ListByProjectAndOrg definitions =="
rg -n -C6 '\bListByProjectAndOrg\s*\(' server

echo
echo "== Candidate implementation files =="
fd 'list_by_project_and_org\.go$' server

echo
echo "== Inspect org/creator usage in implementations =="
for f in $(fd 'list_by_project_and_org\.go$' server); do
  echo "---- $f ----"
  rg -n -C6 'ListByProjectAndOrg|orgID|createdBy|org_id|created_by|WHERE|project_id' "$f"
done

Repository: factly/gopie

Length of output: 4469


ListByProjectAndOrg ignores orgID and createdBy parameters—org scoping is not enforced.

The implementation at server/infrastructure/postgres/store/datasets/list_by_project_and_org.go accepts these parameters but forwards only projectID and pagination to s.List(), ignoring tenant and creator constraints. Update the implementation to apply orgID (and createdBy if intended) to query filters.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/application/repositories/store.go` at line 80, The ListByProjectAndOrg
implementation is ignoring orgID and createdBy when calling s.List; update the
function (ListByProjectAndOrg) to pass a filter that enforces tenant scoping and
optional creator filtering instead of only projectID and pagination—modify the
call to s.List to include an appropriate filter/map/selector that adds org_id =
orgID and, if createdBy is non-empty, created_by = createdBy (or whatever field
names Dataset expects), so the query returned by s.List is correctly restricted
by org and creator.
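The fix the prompt describes can be sketched as a small filter builder: project and org clauses are always applied, and the creator clause is added only when `createdBy` is non-empty. This is a minimal, self-contained illustration; the column names (`project_id`, `org_id`, `created_by`) and the positional-placeholder style are assumptions for the sketch, not the actual `PgDatasetStore` schema:

```go
package main

import (
	"fmt"
	"strings"
)

// buildDatasetFilter sketches how ListByProjectAndOrg could scope its query.
// Column names here (project_id, org_id, created_by) are assumptions; the
// real schema behind PgDatasetStore may differ.
func buildDatasetFilter(projectID, orgID, createdBy string) (string, []any) {
	clauses := []string{"project_id = $1", "org_id = $2"}
	args := []any{projectID, orgID}
	if createdBy != "" {
		// Creator filtering is optional: only constrain created_by when given.
		clauses = append(clauses, fmt.Sprintf("created_by = $%d", len(args)+1))
		args = append(args, createdBy)
	}
	return "WHERE " + strings.Join(clauses, " AND "), args
}

func main() {
	where, args := buildDatasetFilter("p1", "org1", "u1")
	fmt.Println(where) // WHERE project_id = $1 AND org_id = $2 AND created_by = $3
	fmt.Println(len(args))
}
```

When `createdBy` is empty the creator clause is simply omitted, matching the "if createdBy is non-empty" behavior the prompt asks for.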

Comment on lines +9 to +11
func (s *PgDatasetStore) ListByProjectAndOrg(ctx context.Context, projectID, orgID, createdBy string, pagination models.Pagination) (*models.PaginationView[*models.Dataset], error) {
	return s.List(ctx, projectID, pagination)
}

⚠️ Potential issue | 🟠 Major

Parameters orgID and createdBy are ignored in the implementation.

The method signature accepts orgID and createdBy but these values are never used—the call simply delegates to s.List(ctx, projectID, pagination). This creates a misleading API contract where callers may assume org-scoped or user-scoped filtering is applied when it is not.

If org/user filtering is intentionally deferred, consider adding a comment explaining this. Otherwise, incorporate these parameters into the query to enforce proper tenant isolation.

🔍 Verification: Check if List method supports org filtering
#!/bin/bash
# Description: Check the List method signature and implementation to see if it supports org/user filtering

# Find the List method implementation (signature plus body context).
# Note: an ast-grep pattern of the form `func ($_ *PgDatasetStore) List($$$) { $$$ }`
# would not match here because it omits the method's return values, so use rg:
rg -n -A 15 'func \(\w+ \*PgDatasetStore\) List\(' server
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/infrastructure/postgres/store/datasets/list_by_project_and_org.go`
around lines 9 - 11, The ListByProjectAndOrg implementation ignores orgID and
createdBy (it just calls s.List), so update it to apply org and creator
filtering: either (A) extend or call an underlying query method that accepts
orgID and createdBy and pass those values into the SQL/ORM filter, or (B) if
filtering is intentionally not applied, add a clear comment on
PgDatasetStore.ListByProjectAndOrg stating that org/user scoping is deferred and
why; specifically modify the function PgDatasetStore.ListByProjectAndOrg to use
orgID and createdBy in the dataset query (or call a new/internal method like
s.listWithFilters(ctx, projectID, orgID, createdBy, pagination)) to enforce
tenant/user isolation and return the paginated result.
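To make the contract violation concrete, here is a toy in-memory store in the same shape: `List` filters only by project, so delegating to it blindly leaks rows across orgs, while a correct `ListByProjectAndOrg` applies the org filter itself. All types below are illustrative stand-ins, not the real `PgDatasetStore`:

```go
package main

import "fmt"

// dataset and memStore are toy stand-ins for the real models and store.
type dataset struct {
	ID, ProjectID, OrgID string
}

type memStore struct{ rows []dataset }

// List filters by project only, like the underlying s.List the review flags.
func (s *memStore) List(projectID string) []dataset {
	var out []dataset
	for _, d := range s.rows {
		if d.ProjectID == projectID {
			out = append(out, d)
		}
	}
	return out
}

// ListByProjectAndOrg narrows the project-scoped result to one tenant,
// instead of returning s.List unchanged.
func (s *memStore) ListByProjectAndOrg(projectID, orgID string) []dataset {
	var out []dataset
	for _, d := range s.List(projectID) {
		if d.OrgID == orgID {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	s := &memStore{rows: []dataset{
		{"d1", "p1", "org-a"},
		{"d2", "p1", "org-b"},
	}}
	fmt.Println(len(s.List("p1")))                         // 2: leaks across orgs
	fmt.Println(len(s.ListByProjectAndOrg("p1", "org-a"))) // 1: tenant-scoped
}
```

The same shape applies whether the filter lives in an internal helper (the `s.listWithFilters` the prompt suggests) or in the SQL itself; what matters is that `orgID` reaches the query.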

Comment on lines 20 to +24
projectID := ctx.Params("projectID")
orgID := ctx.Get(middleware.OrganizationIDHeader)
userID := ctx.Locals(middleware.UserCtxKey).(string)
role := ctx.Locals(middleware.RoleCtxKey).(models.Role)

project, err := h.projectSvc.Details(projectID, orgID, userID, role)
project, err := h.projectSvc.Details(projectID, orgID, userID)

⚠️ Potential issue | 🟠 Major

Use middleware-scoped org context instead of request headers for tenant enforcement.

Line 21 currently trusts a user-provided header for orgID. For authorization-sensitive reads, use middleware locals to avoid tenant spoofing.

🔒 Suggested fix
-	orgID := ctx.Get(middleware.OrganizationIDHeader)
+	orgID := ctx.Locals(middleware.OrganizationCtxKey).(string)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
projectID := ctx.Params("projectID")
orgID := ctx.Get(middleware.OrganizationIDHeader)
userID := ctx.Locals(middleware.UserCtxKey).(string)
role := ctx.Locals(middleware.RoleCtxKey).(models.Role)
project, err := h.projectSvc.Details(projectID, orgID, userID, role)
project, err := h.projectSvc.Details(projectID, orgID, userID)
projectID := ctx.Params("projectID")
orgID := ctx.Locals(middleware.OrganizationCtxKey).(string)
userID := ctx.Locals(middleware.UserCtxKey).(string)
project, err := h.projectSvc.Details(projectID, orgID, userID)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/interfaces/http/routes/api/projects/details.go` around lines 20 - 24,
Replace the header-based org lookup with the middleware-local org value: stop
using ctx.Get(middleware.OrganizationIDHeader) and instead read the org ID from
the middleware-populated locals (e.g., ctx.Locals(<middleware's org-local-key>))
before calling h.projectSvc.Details(projectID, orgID, userID); update the orgID
variable initialization so authorization uses the middleware-scoped org context
(use the exact local key constant provided by the middleware rather than the
header constant) and keep projectID and userID retrievals as-is.
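The header-vs-locals distinction can be shown with a stripped-down sketch: the middleware writes the org ID it has validated (from the session or token, not from a client-controlled header) into request-scoped locals, and the handler reads only that. The `ctxStore` map and key name below stand in for Fiber's `ctx.Locals` and `middleware.OrganizationCtxKey`; they are assumptions for illustration:

```go
package main

import "fmt"

// ctxStore mimics request-scoped locals (what ctx.Locals provides in Fiber).
type ctxStore map[string]any

const organizationCtxKey = "orgID" // stand-in for middleware.OrganizationCtxKey

// orgMiddleware stores the org ID it validated against the authenticated
// session; handlers never see the raw request header.
func orgMiddleware(locals ctxStore, validatedOrg string) {
	locals[organizationCtxKey] = validatedOrg
}

// detailsHandler reads the middleware-scoped value, so a spoofed
// X-Organization-ID header cannot change which tenant it queries.
func detailsHandler(locals ctxStore) string {
	orgID, _ := locals[organizationCtxKey].(string)
	return orgID
}

func main() {
	locals := ctxStore{}
	orgMiddleware(locals, "org-123")
	fmt.Println(detailsHandler(locals)) // org-123
}
```

In the real handler the only change is the source of `orgID`; `projectID` and `userID` retrieval stay as they are.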

Comment on lines 9 to 36
export default function UploadDatasetPage({
  params,
}: {
  params: Promise<{ projectId: string }>;
}) {
  const { projectId } = React.use(params);
  const { role, user } = useAuthStore();
  const router = useRouter();

  const { data: project, isLoading } = useProject({
    variables: { projectId },
  });

  React.useEffect(() => {
    if (!isLoading && project) {
      const canEdit = role === "admin" || project.created_by === user?.id;
      if (!canEdit) {
        router.replace(`/projects/${projectId}`);
      }
    }
  }, [isLoading, project, role, user, projectId, router]);

  if (isLoading || !project) return null;

  const canEdit = role === "admin" || project.created_by === user?.id;
  if (!canEdit) return null;

  return <DatasetUploadWizard projectId={projectId} />;

⚠️ Potential issue | 🟡 Minor

Missing ProtectedPage wrapper for authentication.

Per coding guidelines, authentication-required pages should use the ProtectedRoute or ProtectedPage wrapper. While the canEdit check handles authorization, unauthenticated users should be redirected to login rather than seeing a blank page.

🛡️ Proposed fix
+import { ProtectedPage } from "@/components/auth/protected-page";
+
 export default function UploadDatasetPage({
   params,
 }: {
   params: Promise<{ projectId: string }>;
 }) {
   // ... existing code ...
 
-  return <DatasetUploadWizard projectId={projectId} />;
+  return (
+    <ProtectedPage>
+      <DatasetUploadWizard projectId={projectId} />
+    </ProtectedPage>
+  );
 }

As per coding guidelines: "Protect routes using the ProtectedRoute wrapper HOC for authentication-required pages".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/app/projects/[projectId]/upload/page.tsx` around lines 9 - 36, The
page lacks the authentication wrapper: wrap the UploadDatasetPage with the app's
ProtectedPage/ProtectedRoute so unauthenticated users are redirected to login
before any canEdit checks run; locate the UploadDatasetPage export in
web/src/app/projects/[projectId]/upload/page.tsx and either return/render
<ProtectedPage> (or the project's ProtectedRoute HOC) around the component's JSX
(ensuring DatasetUploadWizard remains inside) or export the component via
ProtectedPage(UploadDatasetPage), keep the existing useAuthStore, useProject,
and canEdit logic intact inside the wrapped component, and ensure ProtectedPage
handles redirect-to-login for unauthenticated users.

@surajmn1 surajmn1 force-pushed the feat/e2e-datasets-metadata branch 2 times, most recently from 7b6a888 to bda820b Compare March 20, 2026 08:03

3 participants