Arrow dtypes get cast only on first write #1528
I have encountered the same issue. I wrote a Delta table to S3 first with the following parameters:

data_to_write.write_delta(
    target=s3_location,
    mode="error",
    storage_options={
        "AWS_REGION": self.region_name,
        "AWS_ACCESS_KEY_ID": self.boto_session.get_credentials().access_key,
        "AWS_SECRET_ACCESS_KEY": self.boto_session.get_credentials().secret_key,
        "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    },
    overwrite_schema=True,
    delta_write_options={
        "partition_by": [
            "ingested_at_year",
            "ingested_at_month",
            "ingested_at_day",
            "ingested_at_hour",
        ],
        "name": "raw_events",
        "description": "Events loaded from source bucket",
    },
)

On the next run, it fails with the following error:

E ValueError: Schema of data does not match table schema
E Table schema:
E obj_key: large_string
E data: large_string
E ingested_at: timestamp[us, tz=UTC]
E ingested_at_year: int32
E ingested_at_month: uint32
E ingested_at_day: uint32
E ingested_at_hour: uint32
E ingested_at_minute: uint32
E ingested_at_second: uint32
E Data Schema:
E obj_key: string
E data: string
E ingested_at: timestamp[us]
E ingested_at_year: int32
E ingested_at_month: int32
E ingested_at_day: int32
E ingested_at_hour: int32
E ingested_at_minute: int32
E ingested_at_second: int32

I haven't found a workable solution.
Got it. It sounds like we need to make a smarter comparison function here:

delta-rs/python/deltalake/writer.py, lines 177 to 180 in da10127
That would allow these writes to go through. Integer signedness will require casting, though: unsigned integers are not supported in Delta Lake (see the list of supported types here), so we need to cast them to the corresponding signed type. We can add a casting step in the write function as well.
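A minimal sketch of what such a normalizing comparison and cast could look like (hypothetical helper names, assuming pyarrow; this is not the actual writer code):

import pyarrow as pa

# Map Arrow types that have no Delta Lake equivalent onto the types
# delta-rs actually writes, then compare schemas field by field.
_UNSIGNED_TO_SIGNED = {
    pa.uint8(): pa.int8(),
    pa.uint16(): pa.int16(),
    pa.uint32(): pa.int32(),
    pa.uint64(): pa.int64(),
}

def _normalize(dtype: pa.DataType) -> pa.DataType:
    if dtype == pa.large_string():
        return pa.string()
    if pa.types.is_large_list(dtype):
        # large_list<T> round-trips as list<T>
        return pa.list_(_normalize(dtype.value_type))
    return _UNSIGNED_TO_SIGNED.get(dtype, dtype)

def schemas_match(data_schema: pa.Schema, table_schema: pa.Schema) -> bool:
    return len(data_schema) == len(table_schema) and all(
        a.name == b.name and _normalize(a.type) == _normalize(b.type)
        for a, b in zip(data_schema, table_schema)
    )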
Thanks!
I also encountered the same issue with Polars and delta-rs. If you want, you can assign it to me @wjones127.
Assigned!
Actually, there are two issues at play here: one is on the Polars side, which does not handle the unsignedness while creating the schema to write with, and the other is the wrong comparison in the writer. Polars is doing this:

# Workaround to prevent manual casting of large types
table = try_get_deltatable(target, storage_options)  # type: ignore[arg-type]
if table is not None:
    table_schema = table.schema()
    if data_schema == table_schema.to_pyarrow(as_large_types=True):
        data_schema = table_schema.to_pyarrow()
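For reference, a small illustration of the two renderings that snippet compares (the table URI here is hypothetical):

from deltalake import DeltaTable

# The same Delta schema comes back with string/list by default, and with
# large_string/large_list when as_large_types=True.
dt = DeltaTable("s3://bucket/raw_events")
print(dt.schema().to_pyarrow())                      # obj_key: string, ...
print(dt.schema().to_pyarrow(as_large_types=True))   # obj_key: large_string, ...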
The issue is resolved by this PR: #1668
Environment
Delta-rs version: 0.10.0
Binding: python
Environment:
Bug
What happened: When I write my Delta table from a Polars DataFrame, or after converting the Polars DataFrame to PyArrow, delta-rs casts some large Arrow dtypes to their primitive types. The original Arrow schema uses large_string, large_list, and sometimes unsigned integers. However, the Delta primitive types are only converted to string and list once you read the Delta table back into Arrow.
What you expected to happen:
This casting should work the second time as well, when you're appending to the table. The primitive types should also be mapped back to the large Arrow dtypes without explicitly passing the schema. What I am doing now is reading dt.schema().to_pyarrow() and passing it back while writing, but this seems clunky (see the sketch below). Also, when I read a Delta table with Polars, it easily casts the Arrow dtypes into the Polars Arrow dtypes, so the behavior seems to be one-directional here.
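A minimal sketch of that workaround, assuming data_to_write is a Polars DataFrame and reusing the s3_location and storage_options names from the comment above:

from deltalake import DeltaTable, write_deltalake

# Read back the table's pyarrow schema and cast the data to it before
# appending, so the large/unsigned dtypes line up with what is on disk.
dt = DeltaTable(s3_location, storage_options=storage_options)
target_schema = dt.schema().to_pyarrow()

arrow_table = data_to_write.to_arrow().cast(target_schema)
write_deltalake(s3_location, arrow_table, mode="append",
                storage_options=storage_options)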
How to reproduce it:
More details: