-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
POC: Add an inexpensive test to set anon if no creds are found in gcs #985
base: main
Are you sure you want to change the base?
Conversation
Regarding the discussion on #740, this is a possible approach to specify anon token in gcs if no creds is found. This is the similar to the approach we already have s3. The approach is to try different method of connection before defaulting back to anon as gcs filesystem does. But we are skipping the metadata check for the case of google cloud engine. Testing the improvement with the code: ```py import time from datachain import DataChain start = time.time() images = DataChain.from_storage("gs://datachain-demo/dogs-and-cats/*jpg") images.show() end = time.time() print(f"Time taken: {end - start} seconds") ``` The improvement was from: 27.73566699028015 seconds to: Time taken: 6.101557016372681 seconds Closes #740
Deploying datachain-documentation with
|
Latest commit: |
e85357c
|
Status: | ✅ Deploy successful! |
Preview URL: | https://6600a416.datachain-documentation.pages.dev |
Branch Preview URL: | https://gcs-pre-check-poc.datachain-documentation.pages.dev |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #985 +/- ##
==========================================
- Coverage 88.18% 87.18% -1.00%
==========================================
Files 133 133
Lines 12161 12193 +32
Branches 1698 1702 +4
==========================================
- Hits 10724 10631 -93
- Misses 1018 1138 +120
- Partials 419 424 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
||
# Define authentication methods to try | ||
auth_methods = ["google_default", "cache"] | ||
if is_on_gce(request_callback): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amritghimire what is the logic behind the request_callback? will always raise or only in certain cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is to disable the connection to metadata server. It raises exception and the is_on_gce will check for existence of a file to determine if certain file exist. You can look into the code of is_on_gce for more context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't it be disabled on the fsspec side? does it mean that metadata server can be ignored, or does it mean that it won't be retrying anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't be retrying it anymore.
Also, this is just a proof of concept PR, if we want to go this route or not. Needs refinement if we want to go this way. Created a PR this way so it is easier to view the difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, why can't we suggest the same improvement to the fsspec itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we go that route, let's see if this works for us or not first internally IMO.
This is a hacky fix for now to be sure.
@0x2b3bfa0 may have more context on this. If we can skip the metadata check or not.
I was wondering is it safe for datachain to skip the metadata check altogether or not?
So, users on google compute engine could use token to cloud to force this way of communication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got sidetracked by a lot of things and I lack enough context to respond — and also lack time to get that context. Please ping me in a few day if you still need my input. Sorry! 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering is it safe for datachain to skip the metadata check altogether or not?
answer should be the same as for fsspec to my mind, or google lib
So, users on google compute engine could use token to cloud to force this way of communication.
they could, but probably won't expect the discrepancy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned, this is a hacky fix. We can look for proper way to fix this or look for proper way either to suggest it to fsspec or google lib or use a request which doesn't retry for 5 times when it fails for gce check.
We first need to figure out or discuss if we want to implement something like this. @shcheklein cc. @iterative/datachain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't want some hacky fix like this here, I don't think we should spend more time to see how to skip the metadata check here.
|
||
# Try various authentication methods | ||
gcreds = GoogleCredentials( | ||
token=ANONYMOUS_TOKEN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we pass anon token here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to initialize the creds. If anon is not passed, it will try to connect with other approaches.
|
||
# If anonymous access requested, force anonymous access | ||
if kwargs.get("anon"): | ||
kwargs["token"] = ANONYMOUS_TOKEN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it a way fsspec internally distinguishes it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes,
Regarding the discussion on #740, this is a possible approach to specify
anon token in gcs if no creds is found.
This is the similar to the approach we already have s3.
The approach is to try different method of connection before defaulting
back to anon as gcs filesystem does. But we are skipping the metadata
check for the case of google cloud engine.
Testing the improvement with the code:
The improvement was from:
27.73566699028015 seconds
to:
Time taken: 6.101557016372681 seconds
Closes #740