Can't use Pandas to upload a REPEATED field (e.g. list of strings) #913

Closed
@emma-brainlabs

I am trying to insert a list of strings stored in a pandas DataFrame into a BigQuery table with a REPEATED field. When running this code:

import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account

df = pd.DataFrame([{"repeated": ["hi", "hello"], "not_repeated": "a_string"}])

table = bigquery.Table(
    "project.dataset_name.table_name",
    schema=[
        bigquery.SchemaField("repeated", "string", "REPEATED"),
        bigquery.SchemaField("not_repeated", "string", "NULLABLE"),
    ],
)

bigquery_client = bigquery.Client(
    credentials=service_account.Credentials.from_service_account_file(
        "service-account-credentials.json"
    )
)
bigquery_client.insert_rows_from_dataframe(table, df)

I get this error:

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    bigquery_client.insert_rows_from_dataframe(table, df)
  File "/Users/emmacombes/.local/share/virtualenvs/bq-stats-sAw4GWcD/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 3433, in insert_rows_from_dataframe
    result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
  File "/Users/emmacombes/.local/share/virtualenvs/bq-stats-sAw4GWcD/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 3381, in insert_rows
    json_rows = [_record_field_to_json(schema, row) for row in rows]
  File "/Users/emmacombes/.local/share/virtualenvs/bq-stats-sAw4GWcD/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 3381, in <listcomp>
    json_rows = [_record_field_to_json(schema, row) for row in rows]
  File "/Users/emmacombes/.local/share/virtualenvs/bq-stats-sAw4GWcD/lib/python3.7/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 800, in dataframe_to_json_generator
    if pandas.isna(value):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This stops execution, so nothing is uploaded to BigQuery. I can confirm that if I run the same code without the list column (i.e. df = pd.DataFrame([{"not_repeated": "a_string"}])), the error does not occur.

I think this can be traced back to the recently changed line if pandas.isna(value):, which came from a previous PR (use pandas function to check for NaN #750) that solved an earlier issue (dataframe_to_json_generator doesn't support pandas.NA type #729). Evaluating pandas.isna(value) on a list returns an array of bools, and the if statement cannot reduce an array with more than one element to a single truth value.
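For illustration, here is a minimal sketch that reproduces the ambiguity outside of BigQuery and shows one possible guard. The helper name is_missing is hypothetical and is not part of google-cloud-bigquery; it only assumes that pandas.isna returns a single bool for scalars and an array for list-like values.

```python
import numpy as np
import pandas as pd

# pandas.isna on a list returns an element-wise boolean array, not a bool.
list_result = pd.isna(["hi", "hello"])

# Using that array directly in an `if` raises the ValueError from the traceback.
error_message = None
try:
    if pd.isna(["hi", "hello"]):
        pass
except ValueError as exc:
    error_message = str(exc)


def is_missing(value):
    """Hypothetical guard: True only when `value` is a scalar missing value."""
    result = pd.isna(value)
    # Scalars give a single boolean; list-likes give an array, which is
    # treated here as "not missing" so the REPEATED value is kept.
    if isinstance(result, (bool, np.bool_)):
        return bool(result)
    return False
```

With a guard like this, a list value such as ["hi", "hello"] is never treated as missing, while None, NaN, and pandas.NA still are. Whether a list that itself contains missing elements should be handled differently is left open here.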

I can confirm that if I go to an older version of this library before this change was made, the code works.

Environment details

  • OS type and version: macOS Big Sur 11.5.2
  • Python version: Python 3.7.5
  • pip version: pip 19.2.3
  • google-cloud-bigquery version: 2.24.0

Labels

  • api: bigquery (Issues related to the googleapis/python-bigquery API.)
  • priority: p2 (Moderately-important priority. Fix may not be included in next release.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)