load_table_from_dataframe support JSON column dtype #1966

Open · jlynchMicron opened this issue Jul 2, 2024 · 4 comments

Labels: api: bigquery (Issues related to the googleapis/python-bigquery API) · type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design)

jlynchMicron commented Jul 2, 2024

Is your feature request related to a problem? Please describe.
python-bigquery does not currently appear to support uploading DataFrames when one of the destination table's columns is of JSON dtype.

Quick partial code example:

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("system_id", "STRING"),
    bigquery.SchemaField("summery_type", "STRING"),
    bigquery.SchemaField("summery_datetime", "DATETIME"),
    bigquery.SchemaField("summery_is_good", "BOOL"),
    bigquery.SchemaField("summery_data", "JSON"),  # Stored in the DataFrame as a JSON-formatted string
]
job_config = bigquery.LoadJobConfig(schema=schema)

# `data` is a pandas DataFrame; `table_obj` is the destination table reference.
job = client.load_table_from_dataframe(data, table_obj, job_config=job_config)
print(job.result())

Result:
google.api_core.exceptions.BadRequest: 400 Unsupported field type: JSON; reason: invalid, message: Unsupported field type: JSON
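
For reference, a minimal `data` DataFrame along these lines reproduces the error (the values are illustrative):

import datetime
import pandas as pd

# Illustrative rows; the JSON column is held as serialized JSON strings.
data = pd.DataFrame(
    {
        "system_id": ["sys-01"],
        "summery_type": ["daily"],
        "summery_datetime": [datetime.datetime(2024, 7, 1, 12, 30)],
        "summery_is_good": [True],
        "summery_data": ['{"metric": 1.5, "tags": ["a", "b"]}'],
    }
)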

Describe the solution you'd like
Implement support for loading DataFrames that contain JSON columns into BigQuery.

Additional context

$ conda list | grep bigquery
google-cloud-bigquery-core 3.25.0             pyhd8ed1ab_0  
google-cloud-bigquery-storage 2.25.0             pyhca7485f_0  
google-cloud-bigquery-storage-core 2.25.0             pyhca7485f_0

Related issues:
googleapis/python-bigquery-sqlalchemy#399
googleapis/python-bigquery-pandas#698
googleapis/python-bigquery-dataframes#816

product-auto-label bot added the api: bigquery (Issues related to the googleapis/python-bigquery API) label Jul 2, 2024
jlynchMicron (Author) commented Jul 2, 2024

I have also tried uploading the data with the column typed as STRING, in the hope that it would be converted to JSON server-side, but that also fails:

google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/gdw-dev-sse/jobs?uploadType=resumable: Provided Schema does not match Table XXXXXXXXXXXXXXXX. Field summery_data has changed type from JSON to STRING

NOTE: Table name redacted.
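
For clarity, that attempt only swapped the field type in the schema; a sketch, assuming everything else stayed the same:

# Failed attempt: declare the JSON column as STRING and hope the backend
# coerces it (instead, it rejects the schema mismatch).
schema = [
    bigquery.SchemaField("system_id", "STRING"),
    bigquery.SchemaField("summery_type", "STRING"),
    bigquery.SchemaField("summery_datetime", "DATETIME"),
    bigquery.SchemaField("summery_is_good", "BOOL"),
    bigquery.SchemaField("summery_data", "STRING"),  # destination column is JSON
]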

jlynchMicron (Author) commented
I was able to upload my data with the following code, but native JSON support should still be added for pandas DataFrames.

Code:

# Datetime objects are not JSON serializable; convert them to strings first.
data['summery_datetime'] = data['summery_datetime'].astype(str)
# Load via JSON rows instead of Parquet, reusing the schema-bearing job_config.
data = data.to_dict(orient='records')
job = client.load_table_from_json(data, table_obj, job_config=job_config)
print(job.result())
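
One caveat if you copy this: astype(str) turns missing values (NaT) into the literal string "NaT". If the datetime column can contain nulls, something along these lines (a sketch, not part of the library) is safer:

# Convert datetimes to ISO-format strings, keeping missing values as None
# so they serialize as JSON null rather than the string "NaT".
data['summery_datetime'] = data['summery_datetime'].apply(
    lambda ts: None if pd.isna(ts) else ts.isoformat()
)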

tswast (Contributor) commented Jul 8, 2024

Thanks @jlynchMicron for providing a workaround. I think there are a few problems we'll need to work through. One is that the BigQuery backend doesn't support JSON in load jobs from Parquet files (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions). Please file a feature request at https://cloud.google.com/support/docs/issue-trackers (specifically, "Create new BigQuery issue").

Another possible workaround is to use CSV as the source format. In your example:

schema = [
    bigquery.SchemaField("system_id", "STRING"),
    bigquery.SchemaField("summery_type", "STRING"),
    bigquery.SchemaField("summery_datetime", "DATETIME"),
    bigquery.SchemaField("summery_is_good", "BOOL"),
    bigquery.SchemaField("summery_data", "JSON"),  # Stored in the DataFrame as a JSON-formatted string
]
job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.CSV,  # CSV, unlike Parquet, can load into JSON columns
)

job = client.load_table_from_dataframe(data, table_obj, job_config=job_config)
print(job.result())

For googlers watching this issue: I have proposed a design, go/bf-json, which adds a JSONDtype to https://github.com/googleapis/python-db-dtypes-pandas and would let us autodetect when JSON is used in a DataFrame. Before that lands, though, we could choose the appropriate serialization format based on the provided schema: Parquet must be used for STRUCT/ARRAY columns, while CSV must be used for JSON.
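
A rough sketch of that schema-based selection (the helper below is hypothetical, not an existing API, and it ignores schemas that mix JSON with STRUCT/ARRAY columns):

from google.cloud import bigquery

def _pick_source_format(schema):
    """Hypothetical helper: choose a load-job serialization format from the schema."""
    field_types = {field.field_type for field in schema}
    if "JSON" in field_types:
        # Parquet load jobs reject JSON columns, so fall back to CSV.
        return bigquery.SourceFormat.CSV
    # Parquet handles STRUCT/ARRAY columns and is the default otherwise.
    return bigquery.SourceFormat.PARQUET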

tswast added the type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design) label Jul 8, 2024
leahecole assigned Linchin and unassigned leahecole Jul 16, 2024
Linchin assigned chelsea-lin and unassigned Linchin Jul 23, 2024
Linchin (Contributor) commented Jul 23, 2024

googleapis/python-db-dtypes-pandas#284 can potentially fulfill this feature request.
