
Arrow Dtypes gets casted only on first write. #1528

Closed
ion-elgreco opened this issue Jul 10, 2023 · 7 comments
Labels: binding/python (Issues for the Python package), bug (Something isn't working), good first issue (Good for newcomers)

Comments

@ion-elgreco
Collaborator

ion-elgreco commented Jul 10, 2023

Environment

Delta-rs version: 0.10.0

Binding: python

Environment:

  • Cloud provider: Azure
  • OS: Linux
  • Other:

Bug

What happened: When I write my Delta table from a Polars DataFrame, or after converting the Polars DataFrame to PyArrow, delta-rs casts some large Arrow dtypes to its primitive types. The original Arrow schema uses large_string and large_list, and sometimes unsigned integers. However, the Delta primitive types are only converted back to plain string and list once you read the Delta table back into Arrow, so a second write fails:

ValueError: Schema of data does not match table schema

Table schema:
col1: large_string
col2: large_string
col3: large_string
col4: large_list
  child 0, item: uint32
col5: large_list
  child 0, item: double
col6: large_list
  child 0, item: uint32

Data Schema:
col1: string
col2: string
col3: string
col4: list
  child 0, item: int32
col5: list
  child 0, item: double
col6: list
  child 0, item: int32

What you expected to happen:

I expected this casting to work on subsequent appends to the table as well. The primitive types should also be mapped back to the corresponding Arrow dtypes without explicitly passing the schema. My current workaround is to read dt.schema().to_pyarrow() and pass it back while writing, but this feels clunky.

Also, when I read a Delta table with Polars, it easily casts the Arrow dtypes into the Polars Arrow dtypes, so the conversion currently seems to be one-directional.
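As a pure-Python sketch of the mismatch (schemas modeled as plain dicts of column name to dtype string; the helper name is hypothetical, not delta-rs code): the two schemas in the error above differ only in "large_" prefixes and integer signedness, which strict equality rejects.

```python
# Simplified model of the failing comparison: dicts of column -> dtype string.
table_schema = {
    "col1": "large_string",
    "col4": "large_list<uint32>",
}
data_schema = {
    "col1": "string",
    "col4": "list<int32>",
}

def normalize(dtype: str) -> str:
    """Drop the 'large_' prefix and map unsigned ints to signed ones."""
    return dtype.replace("large_", "").replace("uint", "int")

# Strict equality (what the writer does today) rejects the append:
assert table_schema != data_schema

# A large-type- and signedness-aware comparison would accept it:
assert {k: normalize(v) for k, v in table_schema.items()} == {
    k: normalize(v) for k, v in data_schema.items()
}
```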

How to reproduce it:

More details:

@ion-elgreco ion-elgreco added the bug Something isn't working label Jul 10, 2023
@ion-elgreco ion-elgreco changed the title Arrow Dtypes get casted only in direction Arrow Dtypes get casted only on first write Jul 10, 2023
@ion-elgreco ion-elgreco changed the title Arrow Dtypes get casted only on first write Arrow Dtypes gets casted only on first write. Jul 10, 2023
@Tomperez98

I have encountered the same issue. I first wrote a Delta table to S3 with the following parameters:

data_to_write.write_delta(
    target=s3_location,
    mode="error",
    storage_options={
        "AWS_REGION": self.region_name,
        "AWS_ACCESS_KEY_ID": self.boto_session.get_credentials().access_key,
        "AWS_SECRET_ACCESS_KEY": self.boto_session.get_credentials().secret_key,
        "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    },
    overwrite_schema=True,
    delta_write_options={
        "partition_by": [
            "ingested_at_year",
            "ingested_at_month",
            "ingested_at_day",
            "ingested_at_hour",
        ],
        "name": "raw_events",
        "description": "Events loaded from source bucket",
    },
)

On the next run, it fails with the following error:

E               ValueError: Schema of data does not match table schema
E               Table schema:
E               obj_key: large_string
E               data: large_string
E               ingested_at: timestamp[us, tz=UTC]
E               ingested_at_year: int32
E               ingested_at_month: uint32
E               ingested_at_day: uint32
E               ingested_at_hour: uint32
E               ingested_at_minute: uint32
E               ingested_at_second: uint32
E               Data Schema:
E               obj_key: string
E               data: string
E               ingested_at: timestamp[us]
E               ingested_at_year: int32
E               ingested_at_month: int32
E               ingested_at_day: int32
E               ingested_at_hour: int32
E               ingested_at_minute: int32
E               ingested_at_second: int32

I haven't found a possible solution.

@wjones127
Collaborator

Got it. It sounds like we need to make a smarter comparison function here:

if table:  # already exists
    if schema != table.schema().to_pyarrow() and not (
        mode == "overwrite" and overwrite_schema
    ):

That would allow string == large_string and list == large_list. Those can be passed through as-is.

Integer signedness will require casting, though. Unsigned integers are not supported in Delta Lake (see the list of supported types here), so we need to cast them to the corresponding signed type. We can add a casting step in the write function as well.
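The two-part fix suggested above could look roughly like this; the mapping tables and helper name are hypothetical illustrations, not delta-rs code:

```python
# Large types are equivalent to their non-large counterparts and can be
# passed through as-is when comparing schemas (hypothetical mapping).
LARGE_EQUIVALENTS = {
    "large_string": "string",
    "large_binary": "binary",
    "large_list": "list",
}
# Unsigned integers have no Delta Lake equivalent and need an actual cast
# to the corresponding signed type before writing.
UNSIGNED_TO_SIGNED = {
    "uint8": "int8",
    "uint16": "int16",
    "uint32": "int32",
    "uint64": "int64",
}

def delta_compatible(dtype: str) -> str:
    """Map an Arrow dtype name to the type Delta Lake would store."""
    dtype = LARGE_EQUIVALENTS.get(dtype, dtype)
    return UNSIGNED_TO_SIGNED.get(dtype, dtype)

assert delta_compatible("large_string") == "string"  # pass-through
assert delta_compatible("uint32") == "int32"         # requires a cast
assert delta_compatible("double") == "double"        # unchanged
```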

@wjones127 wjones127 added good first issue Good for newcomers binding/python Issues for the Python package labels Jul 16, 2023
@Tomperez98

Thanks!

@BinaryTree1

BinaryTree1 commented Jul 18, 2023

I also encountered the same issue with Polars and delta-rs. If you want, you can assign it to me, @wjones127.

@wjones127
Collaborator

Assigned!

@ion-elgreco
Collaborator Author

ion-elgreco commented Jul 28, 2023

@wjones127 @BinaryTree1

Actually, there are two issues at play here: one is on the Polars side, which doesn't handle unsigned types while creating the schema to write with; the other is the incorrect comparison in the writer.

Polars is doing this:

        # Workaround to prevent manual casting of large types
        table = try_get_deltatable(target, storage_options)  # type: ignore[arg-type]

        if table is not None:
            table_schema = table.schema()

            if data_schema == table_schema.to_pyarrow(as_large_types=True):
                data_schema = table_schema.to_pyarrow()

data_schema == table_schema.to_pyarrow(as_large_types=True) will evaluate to False because the left-hand side has unsigned integers while the right-hand side has signed integers.
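A tiny pure-Python mock of that short-circuit (schemas as dicts, field set reduced from the error message above; not the real PyArrow objects): even with large types on both sides, signedness alone makes the equality fail, so the workaround never kicks in.

```python
# What data_schema looks like coming from Polars (unsigned ints kept):
data_schema = {"obj_key": "large_string", "ingested_at_month": "uint32"}

# What table_schema.to_pyarrow(as_large_types=True) would produce:
# large types, but signed ints, since Delta Lake stores no unsigned types.
table_as_large = {"obj_key": "large_string", "ingested_at_month": "int32"}

# The equality guard in the Polars snippet evaluates to False here,
# so data_schema is never swapped for the table's PyArrow schema.
assert data_schema != table_as_large
```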

@ion-elgreco
Collaborator Author

The issue is resolved by this PR: #1668
