@dinesh-verma-datahub dinesh-verma-datahub commented Dec 17, 2025

Summary

Adds server-side (pushdown) filtering by user email for BigQuery query extraction, so that rows for unwanted users are excluded inside BigQuery before results are returned to DataHub. This reduces data transfer and improves ingestion performance for large query volumes.

This feature mirrors the existing Snowflake pushdown_deny_usernames / pushdown_allow_usernames implementation for cross-platform consistency.

Related PR: connector-tests#612 - Integration tests for BigQuery pushdown filters

New Configuration Options

Field | Type | Description
pushdown_deny_usernames | List[str] | SQL LIKE patterns to exclude (e.g., bot_%, %@%.iam.gserviceaccount.com)
pushdown_allow_usernames | List[str] | SQL LIKE patterns to include (e.g., %@company.com)

Requires: use_queries_v2: true

Pattern Syntax

Uses standard SQL LIKE syntax:

  • % - matches any sequence of characters
  • _ - matches any single character
  • Matching is case-insensitive
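
The wildcard rules above can be checked locally with a small Python sketch that translates a LIKE pattern into an anchored, case-insensitive regular expression. The helper names `like_to_regex` and `like_match` are hypothetical and exist only for illustration; BigQuery evaluates LIKE itself, and this approximation ignores LIKE escape sequences.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into a case-insensitive regex (illustrative only)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # % matches any sequence of characters, including none
        elif ch == "_":
            parts.append(".")  # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def like_match(pattern: str, value: str) -> bool:
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

Note that `_` is itself a wildcard, so a pattern such as bot_% matches any value that starts with "bot" followed by at least one more character, not only values that contain a literal underscore.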

Behavior

  • Deny wins: If a user matches both allow and deny patterns, they are excluded
  • Empty allow list: All users allowed (except those in deny list)
  • Server-side filtering: Patterns are pushed to BigQuery as LOWER(user_email) LIKE 'pattern' clauses

Example Usage

source:
  type: bigquery
  config:
    use_queries_v2: true
    
    # Filter out service accounts and bots
    pushdown_deny_usernames:
      - "%@%.iam.gserviceaccount.com"
      - "bot_%"
      - "%_bot@%"
    
    # Only include company emails
    pushdown_allow_usernames:
      - "%@company.com"
      - "%@subsidiary.com"

Generated SQL

WHERE (LOWER(user_email) NOT LIKE '%@%.iam.gserviceaccount.com')
  AND (LOWER(user_email) NOT LIKE 'bot_%')
  AND (LOWER(user_email) NOT LIKE '%_bot@%')
  AND (LOWER(user_email) LIKE '%@company.com' OR LOWER(user_email) LIKE '%@subsidiary.com')
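
A minimal Python sketch of how the deny and allow lists could be combined into a WHERE condition like the one above. The names `build_like_user_filter` and `_quote`, as well as the lowercasing and escaping choices, are assumptions made for illustration and are not taken from queries_extractor.py.

```python
from typing import List


def _quote(pattern: str) -> str:
    # Assumption: lowercase the pattern so it lines up with LOWER(user_email),
    # then escape backslashes and single quotes for a quoted string literal.
    return pattern.lower().replace("\\", "\\\\").replace("'", "\\'")


def build_like_user_filter(deny: List[str], allow: List[str]) -> str:
    """Build a SQL condition; returns an empty string when no patterns are given."""
    # Every deny pattern becomes its own AND-ed NOT LIKE condition.
    conditions = [f"(LOWER(user_email) NOT LIKE '{_quote(p)}')" for p in deny]
    if allow:
        # Allow patterns are OR-ed together inside a single group.
        allowed = " OR ".join(f"LOWER(user_email) LIKE '{_quote(p)}'" for p in allow)
        conditions.append(f"({allowed})")
    return " AND ".join(conditions)
```

Because the deny conditions are AND-ed with the allow group, a user who matches both lists is excluded, which is the "deny wins" behavior described earlier. An empty allow list simply adds no allow group.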

Files Changed

  • bigquery_config.py - New config fields with validators
  • queries_extractor.py - SQL filter generation logic
  • bigquery.py - Wire config to extractor
  • bigquery_pre.md - User documentation
  • test_bigquery_queries_extractor.py - 50 unit tests

Security

  • SQL injection prevention via quote escaping (''')
  • Empty/whitespace patterns rejected by validator
  • Patterns validated before SQL generation
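
The validation rules could look roughly like the following plain-Python sketch. The PR implements them as Pydantic validators in bigquery_config.py; the function names and error messages here are hypothetical and avoid the Pydantic dependency so the example stays self-contained.

```python
from typing import List


def validate_patterns(patterns: List[str], field_name: str) -> List[str]:
    """Reject empty or whitespace-only patterns (illustrative check)."""
    for pattern in patterns:
        if not pattern or not pattern.strip():
            raise ValueError(
                f"{field_name} contains an empty or whitespace-only pattern"
            )
    return patterns


def validate_requires_queries_v2(
    use_queries_v2: bool, deny: List[str], allow: List[str]
) -> None:
    """Pushdown filters only make sense when use_queries_v2 is enabled."""
    if (deny or allow) and not use_queries_v2:
        raise ValueError(
            "pushdown_deny_usernames / pushdown_allow_usernames "
            "require use_queries_v2: true"
        )
```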

Testing

  • 50 unit tests covering core logic, edge cases, and security
  • Integration tests in separate connector-tests repository

Checklist

  • Consistent with Snowflake's pushdown implementation
  • Case-insensitive matching (matches Snowflake's ILIKE behavior)
  • SQL injection protection
  • Requires use_queries_v2: true (validated)
  • Documentation with examples
  • Pre-commit hooks pass

…iltering to BigQuery

Add a new `pushdown_user_filter` configuration option that enables pushing
the existing `user_email_pattern` filtering to BigQuery's INFORMATION_SCHEMA.JOBS
query using REGEXP_CONTAINS for improved performance.

Changes:
- Add `pushdown_user_filter` boolean config (default: false)
- Add `_build_user_filter_from_pattern()` to convert AllowDenyPattern to SQL
- Update query builder to accept user_filter parameter
- Wire config from BigQueryV2Config to the extractor
- Add comprehensive unit tests (30+ test cases)

Benefits:
- Single source of truth: reuses existing `user_email_pattern` config
- Backward compatible: disabled by default
- Full regex support via BigQuery REGEXP_CONTAINS()
- Improved performance for large query volumes

This follows the same pattern as Snowflake's pushdown filtering.

@github-actions github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata) and community-contribution (PR or Issue raised by member(s) of DataHub Community) labels Dec 17, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review (Label for PRs that need review from a maintainer) label Dec 17, 2025
codecov bot commented Dec 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Address security review feedback:

1. SQL Injection Prevention:
   - Switch from raw strings (r'...') to regular string literals
   - Properly escape backslashes first, then single quotes
   - This prevents quote breakout attacks like: test') OR 1=1 --

2. Improved Allow-All Pattern Detection:
   - Add _is_allow_all_pattern() helper function
   - Recognize common allow-all patterns: .*, .+, ^.*$, ^.+$
   - Reduces unnecessary filtering overhead

3. Add Security Tests:
   - Quote breakout SQL injection attempts
   - Backslash-quote escape bypass attempts
   - Multiple backslash edge cases
   - Full integration security test

4. Add Helper Function Tests:
   - TestEscapeForBigQueryString class
   - TestIsAllowAllPattern class
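
The ordering in item 1 (backslashes first, then single quotes) matters, and a short Python sketch makes the reason visible. The function names below are hypothetical stand-ins and are not the PR's own helper; the sketch assumes the value ends up inside a single-quoted, non-raw string literal where backslash is the escape character.

```python
def escape_for_string_literal(value: str) -> str:
    """Escape a value for a single-quoted, non-raw SQL string literal."""
    escaped = value.replace("\\", "\\\\")  # step 1: double every backslash
    return escaped.replace("'", "\\'")  # step 2: escape single quotes


def escape_wrong_order(value: str) -> str:
    """Counter-example: escaping quotes first lets the quote break out."""
    escaped = value.replace("'", "\\'")
    return escaped.replace("\\", "\\\\")
```

With the correct order, the breakout attempt test') OR 1=1 -- becomes test\') OR 1=1 --, so the quote stays inside the literal. With the wrong order, a lone quote turns into two backslashes followed by an unescaped quote, which terminates the literal early.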
…00% code coverage

Address security review feedback and improve code quality:

Security Fixes:
- Switch from raw strings (r'...') to regular string literals
- Implement two-step escaping: backslashes first, then quotes
- Add comprehensive security tests for SQL injection prevention

Code Improvements:
- Add _is_allow_all_pattern() helper for pattern detection
- Use List[str] type hints instead of bare list
- Add detailed security notes in docstrings
- Enhance module-level docstring with test organization

Test Coverage (100%):
- Add TestFetchRegionQueryLogWithPushdown for integration tests
- Cover pushdown_user_filter=True path (lines 410-413)
- Cover pushdown_user_filter=False path (lines 414-416)
- 55+ test cases across 6 test classes
…er_filter

1. Missing Test Coverage:
   - Add test_whitespace_nonwhitespace_star_is_allow_all for [\s\S]*
   - Add test_whitespace_nonwhitespace_plus_is_allow_all for [\s\S]+
   - All 6 patterns in _is_allow_all_pattern() now have test coverage

2. User Documentation Enhancement:
   - Add comprehensive 'User Email Filtering Pushdown' section to bigquery_pre.md
   - Document when to use, example configuration, behavior, and prerequisites
   - Link from features list to new detailed section

3. Python 3.9 Compatibility Fix:
   - Fix parenthesized with statement syntax (Python 3.10+ only)
   - Use traditional 'with a, b:' syntax for Python 3.9 compatibility
   - This ensures TestFetchRegionQueryLogWithPushdown tests run on CI
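
For the six allow-all patterns referred to in item 1 above, a detection helper could be as simple as a set lookup. This is a sketch under that assumption: the name `is_allow_all` and the exact-match strategy are illustrative, and the PR's `_is_allow_all_pattern()` may be implemented differently.

```python
# The six patterns named in the commit messages of this PR.
ALLOW_ALL_PATTERNS = frozenset(
    {".*", ".+", "^.*$", "^.+$", r"[\s\S]*", r"[\s\S]+"}
)


def is_allow_all(pattern: str) -> bool:
    """Return True if the regex is one of the known match-everything patterns."""
    return pattern in ALLOW_ALL_PATTERNS
```

Skipping such patterns means no filter clause is generated for them, which avoids a condition that would match every row anyway.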

Address CI test failures:

1. Fix failing tests:
   - Add ignoreCase=False to tests that check pattern translation logic
   - AllowDenyPattern defaults to ignoreCase=True which adds (?i) prefix
   - Tests now explicitly test pattern translation in isolation

2. Improve _is_allow_all_pattern() docstring:
   - List all 6 recognized 'allow all' patterns with descriptions
   - Document why multiple patterns are never considered 'allow all'

3. Add debug logging in _build_user_filter_from_pattern():
   - Log input patterns at translation start
   - Log each pattern's escape transformation
   - Log when 'allow all' patterns are detected and skipped
   - Log final generated SQL filter

4. Add documentation note to test file:
   - Explain why most tests use ignoreCase=False
   - Reference dedicated case-sensitivity tests for maintainers
@sgomezvillamor sgomezvillamor left a comment

I left some comments

…mes config

Implements server-side user filtering for BigQuery ingestion, consistent
with Snowflake's pushdown filtering approach.

Changes:
- Add pushdown_deny_usernames and pushdown_allow_usernames config fields
- Add _build_user_filter() to generate REGEXP_CONTAINS SQL conditions
- Add SQL injection protection via _escape_for_bigquery_string()
- Add Pydantic validators for pattern validation and use_queries_v2 check
- Add comprehensive unit tests (65 tests)
- Update documentation with regex vs LIKE comparison table

This feature pushes user email filtering directly to BigQuery's
INFORMATION_SCHEMA.JOBS query, reducing data transfer and improving
performance for large query volumes.

Related: acryldata/connector-tests (integration tests)
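
This commit message describes REGEXP_CONTAINS conditions rather than LIKE. A rough Python sketch of that variant follows; the function names, the shape of the output, and the lack of any case-insensitivity handling are all assumptions and do not reproduce the PR's `_build_user_filter()`.

```python
from typing import List


def _escape(pattern: str) -> str:
    # Backslashes first, then single quotes, for a non-raw string literal.
    return pattern.replace("\\", "\\\\").replace("'", "\\'")


def build_regex_user_filter(deny: List[str], allow: List[str]) -> str:
    """Build a REGEXP_CONTAINS-based condition from deny/allow regex lists."""

    def condition(pattern: str) -> str:
        return f"REGEXP_CONTAINS(user_email, '{_escape(pattern)}')"

    # Deny patterns are negated and AND-ed; allow patterns are OR-ed in one group.
    parts = [f"NOT {condition(p)}" for p in deny]
    if allow:
        parts.append("(" + " OR ".join(condition(p) for p in allow) + ")")
    return " AND ".join(parts)
```

Doubling the backslashes means a regex such as `\.` is written as `\\.` inside the SQL string literal, which the SQL parser turns back into `\.` before the regex engine sees it.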

@datahub-cyborg datahub-cyborg bot added the needs-review (Label for PRs that need review from a maintainer) label and removed the pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter) label Jan 5, 2026
@dinesh-verma-datahub dinesh-verma-datahub changed the title from "feat(bigquery): Add pushdown_user_filter option to push user_email_pattern filtering to BigQuery SQL for improved performance" to "feat(bigquery): Add pushdown_deny_usernames and pushdown_allow_usernames for server-side user filtering" Jan 5, 2026

- Trim verbose _is_allow_all_pattern docstring (34 lines -> 12 lines)
  while keeping the important design choice explanation (all vs any)
- Fix stale reference to _build_user_filter_from_pattern() in docstring,
  now correctly references _build_user_filter()

Review feedback addressed
@sgomezvillamor sgomezvillamor left a comment

🎖️

@datahub-cyborg datahub-cyborg bot added the pending-submitter-merge label and removed the needs-review (Label for PRs that need review from a maintainer) label Jan 7, 2026
@dinesh-verma-datahub dinesh-verma-datahub merged commit 81edffc into master Jan 7, 2026
73 of 74 checks passed
@dinesh-verma-datahub dinesh-verma-datahub deleted the feature/bigquery-pushdown-user-filter branch January 7, 2026 11:54