
Deriva support #510

Merged 16 commits into master on Feb 11, 2022
Conversation

@hategan hategan commented Nov 10, 2021

This, together with the corresponding dms PR (whole-tale/girder_wt_data_manager#51), adds:

  • support for importing data from remote zip/bdbags using io_over_http
  • integration with deriva under api/v1/integrations/deriva
  • DMS downloads from http(s)://<loc>/<path>.zip?path=<path_in_zip>, again, using io_over_http
  • the deriva scope in Globus auth, which is needed to fetch files from deriva locations requiring authentication

I don't have a dashboard running, and I don't really understand the integrations thing (e.g., at what point in the integrations flow does the import actually happen?), so I tried to copy what I saw in other cases. That said, this can be tested by invoking the familiar importData endpoint with a link to a deriva bdbag, such as the one appearing in https://identifiers.fair-research.org/hdl:20.500.12633/11RHwdYqWNBZL, creating a session with the resulting folder, and mounting that session with girderfs/wt_dms_fs. That's somewhat tedious, but I don't know of a better solution.


IMPORT_PROVIDERS = ImportProviders()
IMPORT_PROVIDERS.addProvider(BDBagProvider())
IMPORT_PROVIDERS.addProvider(DerivaProvider())
A collaborator commented:

Order here matters unfortunately. BDBagProvider just checks if url/pid ends with .zip, which is true for Deriva resolved urls, e.g.:

https://identifiers.fair-research.org/hdl:20.500.12633/Uzn848Mmg1Wk ->
https://pbcconsortium.s3.amazonaws.com/wholetale/7c066384317e1395b28e082c66209a25/2021-12-01_13.38.24/Dataset_1-882P.zip

For now, I'd suggest pushing DerivaProvider before BDBagProvider until we figure out a better solution.
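The ordering issue can be sketched with simplified stand-ins for the two providers. The class bodies below are hypothetical reductions of the real girder_wholetale classes, not the actual implementations; the point is only that the first registered provider whose matches() accepts the URL claims it:

```python
class BDBagProvider:
    name = "BDBag"

    def matches(self, url: str) -> bool:
        # The real provider just checks the suffix.
        return url.endswith(".zip")


class DerivaProvider:
    name = "DERIVA"

    def matches(self, url: str) -> bool:
        # Stand-in for a check against known Deriva hosts.
        return url.startswith("https://pbcconsortium.s3.amazonaws.com/")


def resolve(providers, url):
    # First provider whose matches() accepts the url wins.
    return next(p for p in providers if p.matches(url))


url = ("https://pbcconsortium.s3.amazonaws.com/wholetale/"
       "7c066384317e1395b28e082c66209a25/2021-12-01_13.38.24/Dataset_1-882P.zip")

# BDBag first: it claims the Deriva URL because it ends in ".zip".
assert resolve([BDBagProvider(), DerivaProvider()], url).name == "BDBag"
# Deriva first: the more specific provider wins.
assert resolve([DerivaProvider(), BDBagProvider()], url).name == "DERIVA"
```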

@hategan (author) replied:

Yes, indeed! Will fix.

@hategan (author) replied:

Should be fixed. Not sure why GitHub is not marking this as outdated.

(Three further review threads on server/lib/bdbag/bdbag_provider.py were marked outdated and resolved.)
Comment on lines -124 to +163
-line = f.readline()
+line = _text(f.readline())
 while line:
     self._parse_fetch_line(root, line.strip())
-    line = f.readline()
+    line = _text(f.readline())
A collaborator commented:

I'm not really sure why _text is necessary here. Shouldn't a proper mode in open() be used instead?

@hategan (author) replied:

Not all versions of zipfile support that, at least as far as I can tell. In particular, the one I have, which appears to be from Python 3.8, complains when zipfile.Path.open is given anything other than 'r' or 'w'.

sz = -1
if 'size' in entity:
sz = entity['size']
name = None
A collaborator commented:

Can we default to something nice like:

name = pathlib.Path(urlparse(entity.getValue()).path).name  # where urlparse is from urllib.parse

potentially with:

if name.endswith(".zip"):
    name = name[:-4]
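Wrapped into a small helper, the suggestion looks like the sketch below (assuming, as in the snippet above, that `entity.getValue()` returns the source URL as a string):

```python
import pathlib
from urllib.parse import urlparse


def default_name(url: str) -> str:
    # Fall back to the last path component of the URL, minus a .zip suffix.
    name = pathlib.Path(urlparse(url).path).name
    if name.endswith(".zip"):
        name = name[:-4]
    return name


print(default_name(
    "https://pbcconsortium.s3.amazonaws.com/wholetale/"
    "7c066384317e1395b28e082c66209a25/2021-12-01_13.38.24/Dataset_1-882P.zip"
))  # Dataset_1-882P
```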

@hategan (author) replied:

We probably should.

codecov bot commented Dec 6, 2021

Codecov Report

Merging #510 (9788b95) into master (0711238) will decrease coverage by 1.11%.
The diff coverage is 63.79%.


@@            Coverage Diff             @@
##           master     #510      +/-   ##
==========================================
- Coverage   92.89%   91.78%   -1.12%     
==========================================
  Files          55       58       +3     
  Lines        4294     4443     +149     
==========================================
+ Hits         3989     4078      +89     
- Misses        305      365      +60     
Impacted Files                        Coverage Δ
server/__init__.py                    90.20% <42.85%> (-3.05%) ⬇️
server/lib/deriva/integration.py      45.16% <45.16%> (ø)
server/lib/deriva/auth.py             47.36% <47.36%> (ø)
server/lib/resolvers.py               82.08% <56.52%> (-13.57%) ⬇️
server/lib/deriva/provider.py         68.00% <68.00%> (ø)
server/lib/bdbag/bdbag_provider.py    90.90% <86.66%> (-0.93%) ⬇️
server/constants.py                   89.13% <100.00%> (+0.75%) ⬆️
server/lib/__init__.py                98.07% <100.00%> (+0.07%) ⬆️
server/rest/integration.py            100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@craig-willis craig-willis left a comment


I tested the following URLs:

The basic import works, but I noticed the following:

  • The dataset identifier is set to unknown on registration/import. I expected this to be set to the identifier of the bag (minid or URL?). A side effect of this is if I download the bag and import it directly via /importBag, the dataset record is updated (provider changes from DERIVA to BDBag). I expected to have two different datasets. I suggest adding a valid identifier value and checking it during import.
  • A user without permission on the DERIVA dataset is able to import the bag, but reading a remote file (i.e., listed in fetch.txt) fails due to permission denied. (With fuse, this comes across as input/output error). I expect this is a DERIVA configuration issue, but I didn't expect any user to be able to import the zip.

    super().__init__('DERIVA')

def matches(self, entity: Entity) -> bool:
    return str(entity.getValue()).startswith('https://pbcconsortium.s3.amazonaws.com/')

@craig-willis commented:

Consider putting this in a configuration setting instead of hardcoding, à la https://github.com/whole-tale/girder_wholetale/blob/master/server/lib/dataverse/provider.py#L96? This way we can add new DERIVA installations without re-deploying.
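A config-driven matches() could be sketched as below. The DERIVA_HOSTS name and its module-level definition are illustrative assumptions; in Girder the value would come from the plugin settings store rather than a constant:

```python
from urllib.parse import urlparse

# Hypothetical setting value; in the plugin this would be read from
# Girder's settings store so new hosts can be added without redeploying.
DERIVA_HOSTS = ["pbcconsortium.s3.amazonaws.com"]


def matches(url: str) -> bool:
    # Compare the URL's host against the configured list instead of a
    # hardcoded prefix.
    return urlparse(url).netloc in DERIVA_HOSTS
```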

@Xarthisius (Collaborator) commented:
@hategan ^^ just a friendly ping.

@hategan commented Feb 9, 2022

Ok, so I rebased this (and the DMS) on auth_requests and used that scheme for the headers. It wasn't entirely smooth, since the DERIVA token does not have the resources_server pointing to the FQDN, so I had to override the default mechanism.

@Xarthisius otherwise, auth_requests seems like a nice way of doing things and I approve of that general message. There were some inconsistencies in the verificators (some inherit from Verificator, some don't), but I guess that's fine in Python.

I also added settings wherever there were hardcoded URLs, as @craig-willis suggested. It is somewhat clear that, should this be adopted by the larger DERIVA community, there will be multiple URLs.

I did a manual test. At first I got an auth error inside the DMS. We have no way of reporting these to the user or offering the chance to take some pre-defined corrective action, but that's another story. It worked after I logged in and out of both deriva and Globus.

@craig-willis craig-willis self-requested a review February 11, 2022 21:52

@craig-willis craig-willis left a comment


Basic import is working. After discussion with @Xarthisius, I recognize now that the issue with overwriting the existing dataset is actually bigger than just Deriva and will file a separate issue. From my understanding export/publish has yet to be implemented, so will leave those for future testing.

@Xarthisius Xarthisius merged commit fd35a45 into master Feb 11, 2022