Skip to content

Bigquery does not work cleanly with data/job project separation #32789

Open
@withnale

Description

@withnale

Bug description

It would be ideal that the BigQuery support allowed for a separation between GCP projects. Often the datasets live in one project which is notionally owned by a different team. Allowing for this provides:

  • better isolation of concerns and therefore a better understanding of security roles and permissions
  • a clear separation of usage where the visualisation cost of superset (namely BQ slots) can be clearly recognised and budgeted for
  • simplified incident resolution in case of 'slowness' where BQ backend usage can be correlated to activity on a given superset instance

IMPLEMENTATION

So this would effectively need to formalise the separation between

  • project_id - the project in which the data is held
  • job_project_id - the project in which queries are run, where the slot allocation is used

From an implementation perspective this would require:

  1. all BQ queries to be fully qualified like select * from {project}.{dataset}.{table}
  2. all non-query usage to be able to differentiate when you are making queries (default project job_project_id) and when you are querying metadata (default project project_id)

From experimentation it seems that [1] is relatively easy to solve. If you create a dataset with a schema of project.dataset instead of just dataset then any chart that uses it will work correctly. However, this does not work when building a dataset through the UI, but a little use of SQLA_TABLE_MUTATOR seems to work (a shortened version minus error handling of what I've used is below):

def SQLA_TABLE_MUTATOR(table):
    if table.database.driver == 'bigquery':
        table.schema = f'{table.catalog}.{table.schema}'
    return table

However point [2] is a harder nut to crack. I've been experimenting with DB_CONNECTION_MUTATOR to achieve this but not sure this will be reliable enough. The other alternative would be a small monkey patch within superset/db_engine_specs/bigquery.py

It would be great if this logic was available out of the box.

Superset version

master / latest-dev

Python version

3.9

Node version

16

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions