Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

POC: Add an inexpensive test to set anon if no creds are found in gcs #985

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

amritghimire
Copy link
Contributor

@amritghimire amritghimire commented Mar 20, 2025

Regarding the discussion on #740, this is a possible approach to specify
anon token in gcs if no creds is found.
This is the similar to the approach we already have s3.

The approach is to try different method of connection before defaulting
back to anon as gcs filesystem does. But we are skipping the metadata
check for the case of google cloud engine.

Testing the improvement with the code:

import time
from datachain import DataChain

start = time.time()
images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")
images.show()
end = time.time()
print(f"Time taken: {end - start} seconds")

The improvement was from:
27.73566699028015 seconds

to:
Time taken: 6.101557016372681 seconds

Closes #740

Note: this is just a proof of concept PR, if we want to go this route or not. Needs refinement if we want to go this way. Created a PR this way so it is easier to view the difference.

Regarding the discussion on #740, this is a possible approach to specify
anon token in gcs if no creds is found.
This is the similar to the approach we already have s3.

The approach is to try different method of connection before defaulting
back to anon as gcs filesystem does. But we are skipping the metadata
check for the case of google cloud engine.

Testing the improvement with the code:
```py
import time
from datachain import DataChain

start = time.time()
images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg")
images.show()
end = time.time()
print(f"Time taken: {end - start} seconds")
```

The improvement was from:
27.73566699028015 seconds

to:
Time taken: 6.101557016372681 seconds

Closes #740
Copy link

cloudflare-workers-and-pages bot commented Mar 20, 2025

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: e85357c
Status: ✅  Deploy successful!
Preview URL: https://6600a416.datachain-documentation.pages.dev
Branch Preview URL: https://gcs-pre-check-poc.datachain-documentation.pages.dev

View logs

@amritghimire amritghimire self-assigned this Mar 20, 2025
@amritghimire amritghimire requested review from ilongin, dreadatour and a team March 20, 2025 12:14
Copy link

codecov bot commented Mar 20, 2025

Codecov Report

Attention: Patch coverage is 40.00000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 87.18%. Comparing base (e4cb8f5) to head (e85357c).

Files with missing lines Patch % Lines
src/datachain/client/gcs.py 40.00% 19 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #985      +/-   ##
==========================================
- Coverage   88.18%   87.18%   -1.00%     
==========================================
  Files         133      133              
  Lines       12161    12193      +32     
  Branches     1698     1702       +4     
==========================================
- Hits        10724    10631      -93     
- Misses       1018     1138     +120     
- Partials      419      424       +5     
Flag Coverage Δ
datachain 87.14% <40.00%> (-0.97%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


# Define authentication methods to try
auth_methods = ["google_default", "cache"]
if is_on_gce(request_callback):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amritghimire what is the logic behind the request_callback? will always raise or only in certain cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to disable the connection to metadata server. It raises exception and the is_on_gce will check for existence of a file to determine if certain file exist. You can look into the code of is_on_gce for more context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't it be disabled on the fsspec side? does it mean that metadata server can be ignored, or does it mean that it won't be retrying anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't be retrying it anymore.
Also, this is just a proof of concept PR, if we want to go this route or not. Needs refinement if we want to go this way. Created a PR this way so it is easier to view the difference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, why can't we suggest the same improvement to the fsspec itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we go that route, let's see if this works for us or not first internally IMO.

This is a hacky fix for now to be sure.

@0x2b3bfa0 may have more context on this. If we can skip the metadata check or not.

I was wondering is it safe for datachain to skip the metadata check altogether or not?

So, users on google compute engine could use token to cloud to force this way of communication.

Copy link
Member

@0x2b3bfa0 0x2b3bfa0 Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got sidetracked by a lot of things and I lack enough context to respond — and also lack time to get that context. Please ping me in a few day if you still need my input. Sorry! 🙈

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering is it safe for datachain to skip the metadata check altogether or not?

answer should be the same as for fsspec to my mind, or google lib

So, users on google compute engine could use token to cloud to force this way of communication.

they could, but probably won't expect the discrepancy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned, this is a hacky fix. We can look for proper way to fix this or look for proper way either to suggest it to fsspec or google lib or use a request which doesn't retry for 5 times when it fails for gce check.

We first need to figure out or discuss if we want to implement something like this. @shcheklein cc. @iterative/datachain

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't want some hacky fix like this here, I don't think we should spend more time to see how to skip the metadata check here.


# Try various authentication methods
gcreds = GoogleCredentials(
token=ANONYMOUS_TOKEN,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we pass anon token here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to initialize the creds. If anon is not passed, it will try to connect with other approaches.


# If anonymous access requested, force anonymous access
if kwargs.get("anon"):
kwargs["token"] = ANONYMOUS_TOKEN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a way fsspec internally distinguishes it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes,

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GCP missing creds slows down significantly
3 participants