Deriva support #510
Conversation
```python
IMPORT_PROVIDERS = ImportProviders()
IMPORT_PROVIDERS.addProvider(BDBagProvider())
IMPORT_PROVIDERS.addProvider(DerivaProvider())
```
Order here matters, unfortunately. `BDBagProvider` just checks whether the url/pid ends with `.zip`, which is true for Deriva resolved urls, e.g.:

https://identifiers.fair-research.org/hdl:20.500.12633/Uzn848Mmg1Wk ->
https://pbcconsortium.s3.amazonaws.com/wholetale/7c066384317e1395b28e082c66209a25/2021-12-01_13.38.24/Dataset_1-882P.zip

For now, I'd suggest pushing `DerivaProvider` before `BDBagProvider` until we figure out a better solution.
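For illustration, the suggested ordering would look like this (reusing the registry and provider names from the hunk above; the first matching provider wins, as described in this thread):

```python
IMPORT_PROVIDERS = ImportProviders()
# Register the more specific provider first so that DERIVA-resolved *.zip URLs are
# not claimed by the generic BDBag matcher.
IMPORT_PROVIDERS.addProvider(DerivaProvider())
IMPORT_PROVIDERS.addProvider(BDBagProvider())
```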
Yes, indeed! Will fix.
Should be fixed. Not sure why GitHub is not marking this as outdated.
```diff
-        line = f.readline()
+        line = _text(f.readline())
         while line:
             self._parse_fetch_line(root, line.strip())
-            line = f.readline()
+            line = _text(f.readline())
```
I'm not really sure why `_text` is necessary here. Shouldn't a proper mode in `open()` be used instead?
Not all versions of zipfile support that, at least as far as I can tell. In particular, the one I have, which appears to be from Python 3.8, complains when `zipfile.Path.open` is given anything other than 'r' or 'w'.
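For context, a minimal `_text` helper consistent with that explanation could look like the sketch below; this is a guess at the intent, not necessarily the helper used in this PR:

```python
def _text(line):
    # zipfile.Path.open() on Python 3.8 only accepts mode 'r' or 'w' and yields
    # bytes, so decode defensively; on newer Pythons the line may already be str.
    return line.decode("utf-8") if isinstance(line, bytes) else line
```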
server/lib/deriva/provider.py
Outdated
```python
sz = -1
if 'size' in entity:
    sz = entity['size']
name = None
```
Can we default to something nice like:

```python
name = pathlib.Path(urlparse(entity.getValue()).path).name  # where urlparse is from urllib.parse
```

potentially with:

```python
if name.endswith(".zip"):
    name = name[:-4]
```
We probably should.
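Putting the hunk and the suggestion together, a hedged sketch of the defaulting logic might look like this (`default_name_and_size` and `entity_url` are illustrative names; `entity_url` stands in for `entity.getValue()`):

```python
import pathlib
from urllib.parse import urlparse

def default_name_and_size(entity: dict, entity_url: str):
    """Pick up 'size' if present and derive a fallback display name from the URL path."""
    sz = entity.get("size", -1)
    name = pathlib.Path(urlparse(entity_url).path).name
    if name.endswith(".zip"):
        name = name[:-4]
    return name, sz
```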
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #510      +/-   ##
==========================================
- Coverage   92.89%   91.78%   -1.12%
==========================================
  Files          55       58       +3
  Lines        4294     4443     +149
==========================================
+ Hits         3989     4078      +89
- Misses        305      365      +60
```
Continue to review full report at Codecov.
I tested the following URLs:
- https://girder.local.wholetale.org/api/v1/integration/deriva?url=https%3A%2F%2Fpbcconsortium.s3.amazonaws.com%2Fwholetale%2F5ad7cdf55b0d5007601015b7ff1ea8d6%2F2021-11-09_21.47.58%2FDataset_1-882P.zip&force=false
- https://girder.local.wholetale.org/api/v1/integration/deriva?url=https%3A%2F%2Fidentifiers.fair-research.org%2Fhdl%3A20.500.12633%2FUzn848Mmg1Wk
The basic import works, but I noticed the following:
- The dataset `identifier` is set to `unknown` on registration/import. I expected this to be set to the identifier of the bag (minid or URL?). A side effect of this is that if I download the bag and import it directly via `/importBag`, the dataset record is updated (the provider changes from `DERIVA` to `BDBag`). I expected to have two different datasets. I suggest adding a valid identifier value and checking it during import (see the sketch after this list).
- A user without permission on the DERIVA dataset is able to import the bag, but reading a remote file (i.e., one listed in fetch.txt) fails with permission denied. (With FUSE, this comes across as an input/output error.) I expect this is a DERIVA configuration issue, but I didn't expect any user to be able to import the zip.
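A minimal sketch of the identifier check suggested above, assuming a simple in-memory registry; the field names (`identifier`, `provider`) and the registry itself are illustrative, not the actual girder_wholetale data model:

```python
# Illustrative stand-in for the dataset registry; in the real plugin these records
# live in Girder's database.
_DATASETS = {}

def register_dataset(identifier: str, provider: str) -> dict:
    """Register a dataset under its bag identifier (minid or URL), refusing to
    silently rewrite a record imported through a different provider."""
    existing = _DATASETS.get(identifier)
    if existing and existing["provider"] != provider:
        raise ValueError(
            f"{identifier} already registered via {existing['provider']}; "
            "refusing to overwrite it"
        )
    record = {"identifier": identifier, "provider": provider}
    _DATASETS[identifier] = record
    return record
```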
server/lib/deriva/provider.py
Outdated
```python
        super().__init__('DERIVA')

    def matches(self, entity: Entity) -> bool:
        return str(entity.getValue()).startswith('https://pbcconsortium.s3.amazonaws.com/')
```
Consider putting this in a configuration setting instead of hardcoding it, as in https://github.com/whole-tale/girder_wholetale/blob/master/server/lib/dataverse/provider.py#L96? This way we can add new DERIVA installations without re-deploying.
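A hedged sketch of that idea, with the allowed base URLs pulled out of `matches()`; the `DERIVA_BASE_URLS` name is illustrative, not an actual girder_wholetale settings key:

```python
# Illustrative default; in the plugin this list would come from a Girder setting so
# new DERIVA installations can be added without redeploying.
DERIVA_BASE_URLS = ["https://pbcconsortium.s3.amazonaws.com/"]

def matches_deriva(url: str, allowed_prefixes=None) -> bool:
    """Return True if the URL points at one of the configured DERIVA installations."""
    prefixes = allowed_prefixes or DERIVA_BASE_URLS
    return any(url.startswith(prefix) for prefix in prefixes)
```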
@hategan ^^ just a friendly ping.
Ok, so I rebased this (and the DMS) on auth_requests and used that scheme for the headers. It wasn't entirely smooth, since the DERIVA token does not have the resources_server pointing to the FQDN, so I had to override the default mechanism. @Xarthisius otherwise, auth_requests seems like a nice way of doing things and I approve of that general message. There were some inconsistencies in the verificators (some inherit from Verificator, some don't), but I guess that's fine in Python.

I also added settings wherever there were hardcoded URLs, as @craig-willis suggested. It is somewhat clear that, should this be adopted by the larger DERIVA community, there will be multiple URLs.

I did a manual test. At first I got an auth error inside the DMS. We have no way of reporting these to the user or offering the chance to take some pre-defined corrective action, but that's another story. It worked after I logged in and out of both deriva and Globus.
…s order, the Deriva one would never get hit.
Co-authored-by: Kacper Kowalik <xarthisius.kk@gmail.com>
… be specific to DERIVA instances.
Force-pushed from 8b5988d to a869b2a.
Basic import is working. After discussion with @Xarthisius, I recognize now that the issue with overwriting the existing dataset is actually bigger than just Deriva, and I will file a separate issue. From my understanding, export/publish has yet to be implemented, so I will leave those for future testing.
This, together with the corresponding DMS PR (whole-tale/girder_wt_data_manager#51), adds DERIVA support.
I don't have a dashboard running, and I don't really understand the integrations thing (e.g., at what point in the integrations flow does the import actually happen?), so I tried to copy what I saw in other cases. That said, this can be tested by invoking the familiar importData endpoint with a link to a deriva bdbag, such as the one appearing in https://identifiers.fair-research.org/hdl:20.500.12633/11RHwdYqWNBZL, creating a session with the resulting folder, and mounting that session with girderfs/wt_dms_fs. That's somewhat tedious, but I don't know of a better solution.
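For reference, the manual test described above might look roughly like this with girder_client; the endpoint name and parameter shape are assumptions based on how other providers are invoked, not a verified API contract:

```python
import json
from girder_client import GirderClient

gc = GirderClient(apiUrl="https://girder.local.wholetale.org/api/v1")
gc.authenticate(apiKey="<your-api-key>")  # or username/password

# Assumed payload shape: a single dataMap entry pointing at the DERIVA bdbag minid.
data_map = [{
    "uri": "https://identifiers.fair-research.org/hdl:20.500.12633/11RHwdYqWNBZL",
    "name": "Dataset_1-882P",
    "repository": "DERIVA",
}]
gc.post("importData", parameters={"dataMap": json.dumps(data_map)})

# Next steps (not shown): create a DM session with the resulting folder and mount it
# with girderfs/wt_dms_fs, as described above.
```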