feat(bigquery): Add pushdown_deny_usernames and pushdown_allow_usernames for server-side user filtering #15699
…iltering to BigQuery

Add a new `pushdown_user_filter` configuration option that enables pushing the existing `user_email_pattern` filtering to BigQuery's INFORMATION_SCHEMA.JOBS query using REGEXP_CONTAINS for improved performance.

Changes:
- Add `pushdown_user_filter` boolean config (default: false)
- Add `_build_user_filter_from_pattern()` to convert AllowDenyPattern to SQL
- Update query builder to accept user_filter parameter
- Wire config from BigQueryV2Config to the extractor
- Add comprehensive unit tests (30+ test cases)

Benefits:
- Single source of truth: reuses existing `user_email_pattern` config
- Backward compatible: disabled by default
- Full regex support via BigQuery REGEXP_CONTAINS()
- Improved performance for large query volumes

This follows the same pattern as Snowflake's pushdown filtering.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Address security review feedback:

1. SQL Injection Prevention:
   - Switch from raw strings (r'...') to regular string literals
   - Properly escape backslashes first, then single quotes
   - This prevents quote breakout attacks like: test') OR 1=1 --
2. Improved Allow-All Pattern Detection:
   - Add _is_allow_all_pattern() helper function
   - Recognize common allow-all patterns: .*, .+, ^.*$, ^.+$
   - Reduces unnecessary filtering overhead
3. Add Security Tests:
   - Quote breakout SQL injection attempts
   - Backslash-quote escape bypass attempts
   - Multiple backslash edge cases
   - Full integration security test
4. Add Helper Function Tests:
   - TestEscapeForBigQueryString class
   - TestIsAllowAllPattern class
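The two-step escaping described in this commit (backslashes first, then single quotes) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `escape_for_bigquery_string` and `regexp_contains_condition` are hypothetical stand-ins, and the `user_email` column name is assumed from the PR description.

```python
def escape_for_bigquery_string(value: str) -> str:
    """Escape a value for use inside a single-quoted, non-raw SQL string literal.

    Hypothetical helper. Order matters: backslashes are doubled first so that
    the backslash added in front of each quote is not doubled again.
    """
    escaped = value.replace("\\", "\\\\")  # step 1: \  ->  \\
    return escaped.replace("'", "\\'")     # step 2: '  ->  \'


def regexp_contains_condition(pattern: str, column: str = "user_email") -> str:
    """Build one REGEXP_CONTAINS condition (column name is an assumption)."""
    return f"REGEXP_CONTAINS({column}, '{escape_for_bigquery_string(pattern)}')"


if __name__ == "__main__":
    # The quote-breakout attempt from the commit message stays inside the literal.
    print(regexp_contains_condition("test') OR 1=1 --"))
```

Reversing the two steps would turn an input quote into `\'` and then double that backslash, leaving the quote unescaped, which is the backslash-quote bypass the security tests target.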
…00% code coverage

Address security review feedback and improve code quality:

Security Fixes:
- Switch from raw strings (r'...') to regular string literals
- Implement two-step escaping: backslashes first, then quotes
- Add comprehensive security tests for SQL injection prevention

Code Improvements:
- Add _is_allow_all_pattern() helper for pattern detection
- Use List[str] type hints instead of bare list
- Add detailed security notes in docstrings
- Enhance module-level docstring with test organization

Test Coverage (100%):
- Add TestFetchRegionQueryLogWithPushdown for integration tests
- Cover pushdown_user_filter=True path (lines 410-413)
- Cover pushdown_user_filter=False path (lines 414-416)
- 55+ test cases across 6 test classes
…er_filter

1. Missing Test Coverage:
   - Add test_whitespace_nonwhitespace_star_is_allow_all for [\s\S]*
   - Add test_whitespace_nonwhitespace_plus_is_allow_all for [\s\S]+
   - All 6 patterns in _is_allow_all_pattern() now have test coverage
2. User Documentation Enhancement:
   - Add comprehensive 'User Email Filtering Pushdown' section to bigquery_pre.md
   - Document when to use, example configuration, behavior, and prerequisites
   - Link from features list to new detailed section
3. Python 3.9 Compatibility Fix:
   - Fix parenthesized with statement syntax (Python 3.10+ only)
   - Use traditional 'with a, b:' syntax for Python 3.9 compatibility
   - This ensures TestFetchRegionQueryLogWithPushdown tests run on CI
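For the Python 3.9 compatibility fix, the comma-separated `with a, b:` form looks roughly like the sketch below. The patched targets (`os.getcwd`, `time.time`) are arbitrary stand-ins chosen for illustration and are not taken from the PR's tests.

```python
import os
import time
from unittest.mock import patch


def patched_values():
    """Enter two context managers with the comma form that Python 3.9 accepts."""
    # The parenthesized multi-line form `with (a, b):` is only officially
    # supported from Python 3.10, so both managers stay on one logical line.
    with patch("os.getcwd", return_value="/tmp/example"), patch(
        "time.time", return_value=0.0
    ):
        return os.getcwd(), time.time()


if __name__ == "__main__":
    print(patched_values())
```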
Address CI test failures:

1. Fix failing tests:
   - Add ignoreCase=False to tests that check pattern translation logic
   - AllowDenyPattern defaults to ignoreCase=True which adds (?i) prefix
   - Tests now explicitly test pattern translation in isolation
2. Improve _is_allow_all_pattern() docstring:
   - List all 6 recognized 'allow all' patterns with descriptions
   - Document why multiple patterns are never considered 'allow all'
3. Add debug logging in _build_user_filter_from_pattern():
   - Log input patterns at translation start
   - Log each pattern's escape transformation
   - Log when 'allow all' patterns are detected and skipped
   - Log final generated SQL filter
4. Add documentation note to test file:
   - Explain why most tests use ignoreCase=False
   - Reference dedicated case-sensitivity tests for maintainers
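The allow-all detection mentioned in these commits might look something like the sketch below. It assumes the six recognized patterns are exactly the ones named in the commit messages (`.*`, `.+`, `^.*$`, `^.+$`, `[\s\S]*`, `[\s\S]+`) and reads "multiple patterns are never considered 'allow all'" as "only a single-pattern list qualifies". The function name and signature are hypothetical.

```python
from typing import List

# Assumed set of patterns, collected from the commit messages above.
ALLOW_ALL_PATTERNS = frozenset(
    {".*", ".+", "^.*$", "^.+$", r"[\s\S]*", r"[\s\S]+"}
)


def is_allow_all_pattern(patterns: List[str]) -> bool:
    """Return True only for a single pattern that is a known allow-all regex."""
    # A list with several patterns is never treated as allow-all in this sketch,
    # even if one of its entries would match everything on its own.
    return len(patterns) == 1 and patterns[0] in ALLOW_ALL_PATTERNS


if __name__ == "__main__":
    print(is_allow_all_pattern([".*"]))             # single allow-all pattern
    print(is_allow_all_pattern([".*", "a@b.com"]))  # multiple patterns
```

When the check succeeds, the filter can be skipped entirely, which is the "reduces unnecessary filtering overhead" point from the earlier commit.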
sgomezvillamor
left a comment
I left some comments
…shdown-user-filter
…mes config

Implements server-side user filtering for BigQuery ingestion, consistent with Snowflake's pushdown filtering approach.

Changes:
- Add pushdown_deny_usernames and pushdown_allow_usernames config fields
- Add _build_user_filter() to generate REGEXP_CONTAINS SQL conditions
- Add SQL injection protection via _escape_for_bigquery_string()
- Add Pydantic validators for pattern validation and use_queries_v2 check
- Add comprehensive unit tests (65 tests)
- Update documentation with regex vs LIKE comparison table

This feature pushes user email filtering directly to BigQuery's INFORMATION_SCHEMA.JOBS query, reducing data transfer and improving performance for large query volumes.

Related: acryldata/connector-tests (integration tests)
- Trim verbose _is_allow_all_pattern docstring (34 lines -> 12 lines) while keeping the important design choice explanation (all vs any)
- Fix stale reference to _build_user_filter_from_pattern() in docstring, now correctly references _build_user_filter()

Review feedback addressed
sgomezvillamor
left a comment
🎖️
Summary
Adds server-side user email filtering (pushdown) for BigQuery queries, allowing users to filter query data at the BigQuery level before it's returned to DataHub. This significantly improves performance for large datasets by reducing data transfer.
This feature mirrors the existing Snowflake `pushdown_deny_usernames`/`pushdown_allow_usernames` implementation for cross-platform consistency.

Related PR: connector-tests#612 - Integration tests for BigQuery pushdown filters
New Configuration Options
- `pushdown_deny_usernames` (`List[str]`): e.g. `bot_%`, `%@%.iam.gserviceaccount.com`
- `pushdown_allow_usernames` (`List[str]`): e.g. `%@company.com`

Requires: `use_queries_v2: true`

Pattern Syntax
Uses standard SQL LIKE syntax:
- `%` - matches any sequence of characters
- `_` - matches any single character

Behavior
- Patterns are applied as `LOWER(user_email) LIKE 'pattern'` clauses

Example Usage
Generated SQL
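As a rough illustration of how allow and deny LIKE patterns could be turned into a single SQL condition, here is a minimal sketch. It is not the PR's implementation: `build_user_filter` and `escape_like_literal` are hypothetical names, the way allow and deny clauses are combined (allow clauses OR-ed, deny clauses negated, both AND-ed) is an assumption, lowercasing the pattern to match `LOWER(user_email)` is an assumption, and the quote doubling follows the PR's Security note.

```python
from typing import List


def escape_like_literal(pattern: str) -> str:
    """Double single quotes, as described in the PR's Security note."""
    return pattern.replace("'", "''")


def build_user_filter(allow: List[str], deny: List[str]) -> str:
    """Combine LIKE patterns into one WHERE condition (hypothetical sketch)."""

    def clause(pattern: str) -> str:
        # Lowercasing the pattern is an assumption made here so that it lines
        # up with LOWER(user_email).
        return f"LOWER(user_email) LIKE '{escape_like_literal(pattern.lower())}'"

    conditions = []
    if allow:
        conditions.append("(" + " OR ".join(clause(p) for p in allow) + ")")
    if deny:
        conditions.append("NOT (" + " OR ".join(clause(p) for p in deny) + ")")
    # With no patterns at all, fall back to a condition that filters nothing.
    return " AND ".join(conditions) if conditions else "TRUE"


if __name__ == "__main__":
    print(build_user_filter(["%@company.com"], ["bot_%"]))
```

The resulting string would be appended to the WHERE clause of the INFORMATION_SCHEMA.JOBS query, so rows for excluded users never leave BigQuery.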
Files Changed
- `bigquery_config.py` - New config fields with validators
- `queries_extractor.py` - SQL filter generation logic
- `bigquery.py` - Wire config to extractor
- `bigquery_pre.md` - User documentation
- `test_bigquery_queries_extractor.py` - 50 unit tests

Security
- Single quotes are escaped (`'` → `''`)

Testing
- Integration tests in the `connector-tests` repository

Checklist
- Requires `use_queries_v2: true` (validated)