Deriva support #510
Conversation
```python
IMPORT_PROVIDERS = ImportProviders()
IMPORT_PROVIDERS.addProvider(BDBagProvider())
IMPORT_PROVIDERS.addProvider(DerivaProvider())
```
Order here matters, unfortunately. `BDBagProvider` just checks whether the url/pid ends with `.zip`, which is true for Deriva resolved urls, e.g.:

https://identifiers.fair-research.org/hdl:20.500.12633/Uzn848Mmg1Wk ->
https://pbcconsortium.s3.amazonaws.com/wholetale/7c066384317e1395b28e082c66209a25/2021-12-01_13.38.24/Dataset_1-882P.zip

For now, I'd suggest pushing `DerivaProvider` before `BDBagProvider` until we figure out a better solution.
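For illustration, the suggested ordering would look like this (reusing the registry and provider names from the hunk above; the first matching provider wins, as described in this thread):

```python
IMPORT_PROVIDERS = ImportProviders()
# Register the more specific provider first so that DERIVA-resolved *.zip URLs are
# not claimed by the generic BDBag matcher.
IMPORT_PROVIDERS.addProvider(DerivaProvider())
IMPORT_PROVIDERS.addProvider(BDBagProvider())
```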
Yes, indeed! Will fix.
Should be fixed. Not sure why GitHub is not marking this as outdated.
```diff
-        line = f.readline()
+        line = _text(f.readline())
         while line:
             self._parse_fetch_line(root, line.strip())
-            line = f.readline()
+            line = _text(f.readline())
```
I'm not really sure why `_text` is necessary here. Shouldn't a proper mode in `open()` be used instead?
Not all versions of zipfile support that, at least as far as I can tell. In particular, the one I have, which appears to be from Python 3.8, complains when `zipfile.Path.open` is given anything other than 'r' or 'w'.
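For context, a minimal `_text` helper consistent with that explanation could look like the sketch below; this is a guess at the intent, not necessarily the helper used in this PR:

```python
def _text(line):
    # zipfile.Path.open() on Python 3.8 only accepts mode 'r' or 'w' and yields
    # bytes, so decode defensively; on newer Pythons the line may already be str.
    return line.decode("utf-8") if isinstance(line, bytes) else line
```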
server/lib/deriva/provider.py
Outdated
```python
sz = -1
if 'size' in entity:
    sz = entity['size']
name = None
```
Can we default to something nice like:

```python
name = pathlib.Path(urlparse(entity.getValue()).path).name  # where urlparse is from urllib.parse
```

potentially with:

```python
if name.endswith(".zip"):
    name = name[:-4]
```
We probably should.
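Putting the hunk and the suggestion together, a hedged sketch of the defaulting logic might look like this (`default_name_and_size` and `entity_url` are illustrative names; `entity_url` stands in for `entity.getValue()`):

```python
import pathlib
from urllib.parse import urlparse

def default_name_and_size(entity: dict, entity_url: str):
    """Pick up 'size' if present and derive a fallback display name from the URL path."""
    sz = entity.get("size", -1)
    name = pathlib.Path(urlparse(entity_url).path).name
    if name.endswith(".zip"):
        name = name[:-4]
    return name, sz
```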
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #510      +/-   ##
==========================================
- Coverage   92.89%   91.78%   -1.12%
==========================================
  Files          55       58       +3
  Lines        4294     4443     +149
==========================================
+ Hits         3989     4078      +89
- Misses        305      365      +60
```
Continue to review full report at Codecov.
I tested the following URLs:
- https://girder.local.wholetale.org/api/v1/integration/deriva?url=https%3A%2F%2Fpbcconsortium.s3.amazonaws.com%2Fwholetale%2F5ad7cdf55b0d5007601015b7ff1ea8d6%2F2021-11-09_21.47.58%2FDataset_1-882P.zip&force=false
- https://girder.local.wholetale.org/api/v1/integration/deriva?url=https%3A%2F%2Fidentifiers.fair-research.org%2Fhdl%3A20.500.12633%2FUzn848Mmg1Wk
The basic import works, but I noticed the following:
- The dataset `identifier` is set to `unknown` on registration/import. I expected this to be set to the identifier of the bag (minid or URL?). A side effect of this is that if I download the bag and import it directly via `/importBag`, the dataset record is updated (the provider changes from `DERIVA` to `BDBag`). I expected to have two different datasets. I suggest adding a valid identifier value and checking it during import (see the sketch after this list).
- A user without permission on the DERIVA dataset is able to import the bag, but reading a remote file (i.e., one listed in fetch.txt) fails with permission denied. (With FUSE, this comes across as an input/output error.) I expect this is a DERIVA configuration issue, but I didn't expect any user to be able to import the zip.
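A minimal sketch of the identifier check suggested above, assuming a simple in-memory registry; the field names (`identifier`, `provider`) and the registry itself are illustrative, not the actual girder_wholetale data model:

```python
# Illustrative stand-in for the dataset registry; in the real plugin these records
# live in Girder's database.
_DATASETS = {}

def register_dataset(identifier: str, provider: str) -> dict:
    """Register a dataset under its bag identifier (minid or URL), refusing to
    silently rewrite a record imported through a different provider."""
    existing = _DATASETS.get(identifier)
    if existing and existing["provider"] != provider:
        raise ValueError(
            f"{identifier} already registered via {existing['provider']}; "
            "refusing to overwrite it"
        )
    record = {"identifier": identifier, "provider": provider}
    _DATASETS[identifier] = record
    return record
```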
server/lib/deriva/provider.py
Outdated
```python
        super().__init__('DERIVA')

    def matches(self, entity: Entity) -> bool:
        return str(entity.getValue()).startswith('https://pbcconsortium.s3.amazonaws.com/')
```
Consider putting this in a configuration setting instead of hardcoding it, as in https://github.com/whole-tale/girder_wholetale/blob/master/server/lib/dataverse/provider.py#L96? This way we can add new DERIVA installations without re-deploying.
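A hedged sketch of that idea, with the allowed base URLs pulled out of `matches()`; the `DERIVA_BASE_URLS` name is illustrative, not an actual girder_wholetale settings key:

```python
# Illustrative default; in the plugin this list would come from a Girder setting so
# new DERIVA installations can be added without redeploying.
DERIVA_BASE_URLS = ["https://pbcconsortium.s3.amazonaws.com/"]

def matches_deriva(url: str, allowed_prefixes=None) -> bool:
    """Return True if the URL points at one of the configured DERIVA installations."""
    prefixes = allowed_prefixes or DERIVA_BASE_URLS
    return any(url.startswith(prefix) for prefix in prefixes)
```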
@hategan ^^ just a friendly ping.
Ok, so I rebased this (and the DMS) on auth_requests and used that scheme for the headers. It wasn't entirely smooth, since the DERIVA token does not have the resources_server pointing to the FQDN, so I had to override the default mechanism. @Xarthisius otherwise, auth_requests seems like a nice way of doing things and I approve of that general message. There were some inconsistencies in the verificators (some inherit from Verificator, some don't), but I guess that's fine in Python.

I also added settings wherever there were hardcoded URLs, as @craig-willis suggested. It is somewhat clear that, should this be adopted by the larger DERIVA community, there will be multiple URLs.

I did a manual test. At first I got an auth error inside the DMS. We have no way of reporting these to the user or offering the chance to take some pre-defined corrective action, but that's another story. It worked after I logged in and out of both deriva and Globus.
…s order, the Deriva one would never get hit.
Co-authored-by: Kacper Kowalik <xarthisius.kk@gmail.com>
… be specific to DERIVA instances.
Force-pushed from 8b5988d to a869b2a.
Basic import is working. After discussion with @Xarthisius, I recognize now that the issue with overwriting the existing dataset is actually bigger than just Deriva, and I will file a separate issue. From my understanding, export/publish has yet to be implemented, so I will leave those for future testing.
This, together with the corresponding DMS PR (whole-tale/girder_wt_data_manager#51), adds DERIVA support.
I don't have a dashboard running, and I don't really understand the integrations thing (e.g., at what point in the integrations flow does the import actually happen?), so I tried to copy what I saw in other cases. That said, this can be tested by invoking the familiar importData endpoint with a link to a deriva bdbag, such as the one appearing in https://identifiers.fair-research.org/hdl:20.500.12633/11RHwdYqWNBZL, creating a session with the resulting folder, and mounting that session with girderfs/wt_dms_fs. That's somewhat tedious, but I don't know of a better solution.
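For reference, the manual test described above might look roughly like this with girder_client; the endpoint name and parameter shape are assumptions based on how other providers are invoked, not a verified API contract:

```python
import json
from girder_client import GirderClient

gc = GirderClient(apiUrl="https://girder.local.wholetale.org/api/v1")
gc.authenticate(apiKey="<your-api-key>")  # or username/password

# Assumed payload shape: a single dataMap entry pointing at the DERIVA bdbag minid.
data_map = [{
    "uri": "https://identifiers.fair-research.org/hdl:20.500.12633/11RHwdYqWNBZL",
    "name": "Dataset_1-882P",
    "repository": "DERIVA",
}]
gc.post("importData", parameters={"dataMap": json.dumps(data_map)})

# Next steps (not shown): create a DM session with the resulting folder and mount it
# with girderfs/wt_dms_fs, as described above.
```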