Add argument for DeltaTable storage options #6

Merged

Conversation

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Apr 28, 2023

This adds a keyword for users to control the storage_options provided to DeltaTable.

Without this, users can't easily read from Azure Blob Storage unless they first configure environment variables:

import os
import dask_deltatable
import deltalake

storage_options = {
    "account_name": "tomsynapsetest",
    "sas_token": os.environ["AZURE_SAS_TOKEN"],
}

df = dask_deltatable.read_delta_table(
    "az://test/delta/delta-table-361822/",
    storage_options=storage_options,  # these are provided to fsspec
)
print(df)

This raises:

python example.py
Traceback (most recent call last):
  File "/home/taugspurger/src/dask/dask_deltatable/example.py", line 10, in <module>
    df = dask_deltatable.read_delta_table(
  File "/home/taugspurger/src/dask/dask_deltatable/dask_deltatable/core.py", line 284, in read_delta_table
    dtw = DeltaTableWrapper(
  File "/home/taugspurger/src/dask/dask_deltatable/dask_deltatable/core.py", line 39, in __init__
    self.dt = DeltaTable(table_uri=self.path, version=self.version, storage_options=delta_storage_options)
  File "/home/taugspurger/src/dask/.direnv/python-3.10.8/lib/python3.10/site-packages/deltalake/table.py", line 122, in __init__
    self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Failed to read delta log object: Generic MicrosoftAzure error: Account must be specified

By providing delta_storage_options, we can get things working:

python example.py
Dask DataFrame Structure:
                  id
npartitions=5
               int64
                 ...
...              ...
                 ...
                 ...
Dask Name: from-delayed, 6 graph layers
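
For reference, the working call might look like the sketch below. It assumes the new keyword is named delta_storage_options, as it appears in the traceback above, and that the same account_name/sas_token pair is accepted by both fsspec and deltalake; that held for this example but is not guaranteed in general. The helper name build_read_kwargs is hypothetical and is not part of dask_deltatable.

```python
import os


def build_read_kwargs(account_name, sas_token):
    """Build keyword arguments for read_delta_table (hypothetical helper)."""
    options = {"account_name": account_name, "sas_token": sas_token}
    return {
        # Provided to fsspec, as in the original example.
        "storage_options": dict(options),
        # Provided to deltalake.DeltaTable; assumed here to accept the same keys.
        "delta_storage_options": dict(options),
    }


def read_example_table():
    # Imported lazily so the helper above can be used without dask_deltatable installed.
    import dask_deltatable

    kwargs = build_read_kwargs("tomsynapsetest", os.environ["AZURE_SAS_TOKEN"])
    return dask_deltatable.read_delta_table(
        "az://test/delta/delta-table-361822/", **kwargs
    )
```

Calling read_example_table() requires the AZURE_SAS_TOKEN environment variable and access to the storage account used in the example.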

Note that I chose to add a new keyword argument rather than passing the fsspec storage_options through. I don't think we can assume that they're compatible.
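
The design choice can be illustrated with a minimal sketch. It is not the code in core.py: read_with_separate_options, open_table, and open_files are hypothetical stand-ins for the wrapper, the DeltaTable constructor, and the fsspec-backed file reader. The point is only that each dictionary is forwarded to the library it was written for and the two are never merged.

```python
from typing import Any, Callable, Optional


def read_with_separate_options(
    path: str,
    open_table: Callable[..., Any],
    open_files: Callable[..., Any],
    storage_options: Optional[dict] = None,
    delta_storage_options: Optional[dict] = None,
):
    """Forward each options dict only to the consumer it was written for."""
    # Stand-in for DeltaTable(...): sees only delta_storage_options.
    table = open_table(path, storage_options=delta_storage_options)
    # Stand-in for the fsspec-backed reader: sees only storage_options.
    files = open_files(path, storage_options=storage_options)
    return table, files
```

With this shape, a key that only one library understands never reaches the other, at the cost of asking users to pass credentials twice when the keys happen to match.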

Collaborator

@rajagurunath rajagurunath left a comment

Thanks a lot for this great suggestion along with the PR fix!

It looks like a much cleaner way to handle cloud access credentials. The earlier version reads credentials from the boto3 session and saves them in os.environ, which seems like a hack and makes it a little difficult to use this framework alongside other libraries, as you mentioned here.

  • Initially, I also thought of using the same storage options for both fsspec and deltalake. I agree that they may not always be compatible, so handling them as two separate options looks good to me.

This PR opens up using this framework for another cloud filesystem!! Thanks to you! 🥇

@codecov-commenter

codecov-commenter commented Apr 29, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (e2e0805) 95.41% compared to head (4ca8e35) 95.41%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main       #6   +/-   ##
=======================================
  Coverage   95.41%   95.41%           
=======================================
  Files           2        2           
  Lines         109      109           
=======================================
  Hits          104      104           
  Misses          5        5           
Impacted Files Coverage Δ
dask_deltatable/core.py 95.37% <100.00%> (ø)


@rajagurunath
Collaborator

Once again, thanks for this PR :) Merging it.

@rajagurunath rajagurunath merged commit 4ce6272 into dask-contrib:main Apr 29, 2023
@TomAugspurger TomAugspurger deleted the tom/feature/delta-options branch April 30, 2023 12:08
3 participants