Version Exported Bags #5
Comments
Thanks, @ThomasThelen. I prefer the first option, but I can see why you might like the second. As we discussed, I support these changes and it will be good to focus on improving any interoperability issues now if we are making changes. I also spent a few minutes comparing your proposal to the RDA BagPack recommendation, and that might suggest a few alterations to be considered. In particular, they have other recommendations on where to include the metadata files, ORE files, etc. So that would make a third option, in which they suggest the following locations for the Option 3: RDA "BagPack" recommendation
This supports multiple metadata dialects (in this case, EML and ORE), and allows multiple metadata files for each object in the bag, named using the object identifier. So that might work for SystemMetadata documents (of which we would have 5 in this particular example). In addition, our ORE file and EML file would both have their own identifiers, checksums, and SystemMetadata. By treating them as 'data', they would be recognized as first-class objects in the package. So I wonder if it's better to put them in the data dir or metadata dir. In addition, BagPack requires the use of a datacite.xml file. And the RDA spec also requires the use of a BagIt Profile document. So, the issues I see to be resolved:
|
Regarding the first three points: it's my opinion that any metadata should be kept out of the I'm in the camp of placing Supporting datacite xml doesn't sound like a bad idea, there may be some isomorphisms between the EML document that we can take advantage of. The "running Tales locally" feature requires a specific structure (see the run_tpl variable). When the DataONE structure is decided, I'm going to modify that template to account for the change. |
Re:
👍 I prefer putting oai-ore in
Which is both a nod towards Re:
I'm not sure I totally follow. I figure the Here's how I think it could look in an end-to-end sense: User has folder,
They create a DataONE Data Package out of it (upload each object, create metadata, create ORE). The ORE would have triples like:
Then, when serialized as a BagIt bag, the ORE would remain untouched and go in the
After all that, if we wanted to re-ingest the Bagit Bag back into DataONE as a Data Package, I guess we could just populate the Re:
I like the idea of having System Metadata files in the bag but this doesn't scale well at all so I think we should skip it. If we really want additional metadata about each object in the bag, we could add another tag file with a subset of what's in System Metadata. The Bag already has sizes and checksums so I think the next most useful thing would be the format ID.
I wasn't thinking this adds that requirement. I thought the |
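The tag-file idea in this comment (a small subset of System Metadata, starting with the format ID) could be sketched roughly as below. This is only an illustration: the file name format-ids.txt, the tab-separated layout, and the function name are assumptions made for the example, not part of BagIt or of any DataONE tool.

```python
from pathlib import Path


def write_format_tag_file(bag_root, format_ids, tag_name="format-ids.txt"):
    """Write one '<formatId><TAB><payload path>' line per object.

    `format_ids` maps a payload path (relative to the bag root) to the
    DataONE formatId of that object. The file name and the line layout
    are hypothetical; they only mirror the look of a BagIt manifest.
    """
    lines = [f"{fmt}\t{path}" for path, fmt in sorted(format_ids.items())]
    tag_path = Path(bag_root) / tag_name
    tag_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return tag_path
```

A file like this grows by one short line per object, which is the scaling advantage over one System Metadata document per object, at the cost of dropping everything except the format ID.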
These are all really good points, and it seems we're mostly in agreement, with possibly a few minor differences. I think we should plan a time to discuss this and reach consensus. Some notes from questions above:
I would like to be able to round-trip a DataONE datapackage using only the information in the bag. If we omit SystemMetadata, we lose critical metadata to that round trip, including formatId, access policies, replication policies, and others. So, I think we should include a sysmeta for each file.
The RDA spec is clear that conformant packages must include a datacite.xml. It would be cool if we conformed, but I agree it seems arbitrary to require just that metadata spec. So I would be ok with leaving it out.
I definitely see the advantage of putting the machine-readable metadata in
This proposal means that every package has a single root folder, whereas I think the same thing can happen if the user is allowed to add folders at the root of the package. I'd prefer not forcing a single root folder, which allows users to include multiple folders at the top level of the package. In a bag, this might look like:
Let's discuss before a final determination is made. |
I'm currently summarizing and organizing these points in a google document that I'll share before the meeting starts. I'll send an invitation out targeting next Tuesday. |
@ThomasThelen You might want to review the doc I wrote up about filename issues again as well: https://hpad.dataone.org/GYUwjAxswGwOwFoBMwAsBOBrhwEYNwFY4AOBMSYAZmFziUIAYBDIA===#
I think it's important that we address the issues of filename sanitization and uniqueness, and that we try to make filenames consistent across everything. We should define how ORE filenames interact with the system metadata fileName field, so that we know what the procedure is for dealing with conflicts, whether it's ok to substitute one for the other if one is missing or malformed, whether we should add a missing extension from the formatId, etc.
My vote is to have the science metadata go in the data dir and the system metadata in the metadata dir. I think putting the science metadata in the data dir shows how important we think science metadata is. And I think it's good for people to take a look at the science metadata files, or at least be aware that they're there, before they start trying to use the data files. The system metadata files are generally on the order of 1-2K, so it's no problem to include them.
I think the package should only have a single root folder, so that when you extract it, it adds only one directory in the folder you're in. It's annoying when you extract a package and end up with a bunch of files and dirs mixed in with whatever is in your current dir. |
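To make the sanitization and uniqueness concerns above concrete, here is a minimal sketch. The allowed character set, the "unnamed" fallback, and the "_1", "_2" suffix scheme are all assumptions chosen for illustration; the thread does not settle on any of them.

```python
import re
from pathlib import PurePosixPath

# Hypothetical policy: keep ASCII letters, digits, dot, underscore, hyphen.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def sanitize(name):
    """Replace runs of unsafe characters with '_' and trim edge dots/underscores."""
    cleaned = _UNSAFE.sub("_", name).strip("._")
    return cleaned or "unnamed"


def unique_names(names):
    """Sanitize each name and append _1, _2, ... until no two results collide."""
    used = set()
    result = []
    for raw in names:
        candidate = sanitize(raw)
        parts = PurePosixPath(candidate)
        stem, suffix = parts.stem, parts.suffix
        n = 1
        while candidate in used:
            candidate = f"{stem}_{n}{suffix}"
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result
```

Because path separators are treated as unsafe, a name such as "../etc/passwd" collapses to a single flat file name instead of escaping the target directory. A real policy would also need to say what happens when the ORE path and the system metadata fileName disagree, which this sketch does not attempt.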
Background
With PR #4, we no longer need to have a pid mapping file for packages whose resource maps encode the file path. If this change is made, we should consider making any additional changes to the package export format now.
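As a rough sketch of what this means for software that reads a bag: prefer the file path recorded in the resource map, and fall back to pid-mapping.txt only for objects that have none. The function names are hypothetical, and the "pid, tab, path" line layout assumed for pid-mapping.txt is an assumption of this example rather than something stated in this issue.

```python
def parse_pid_mapping(text):
    """Parse pid-mapping.txt content, assuming one 'pid<TAB>path' pair per line."""
    mapping = {}
    for line in text.splitlines():
        pid, sep, path = line.partition("\t")
        if sep and pid.strip():
            mapping[pid.strip()] = path.strip()
    return mapping


def resolve_paths(pids, resource_map_paths, pid_mapping_text=""):
    """Return {pid: path}, using the resource map first and the legacy file second.

    `resource_map_paths` holds the paths already extracted from the ORE
    document (extraction itself is out of scope for this sketch).
    """
    legacy = parse_pid_mapping(pid_mapping_text)
    resolved = {}
    for pid in pids:
        path = resource_map_paths.get(pid) or legacy.get(pid)
        if path is None:
            raise KeyError(f"no file path known for {pid}")
        resolved[pid] = path
    return resolved
```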
Proposals
Deprecate the pid-mapping.txt file in the exported bags.
This can be done for packages with file paths in the resource map; however, for older packages the pid-mapping.txt file is probably still required.
Relocate oai-ore.txt inside data/
By relocating the file into the data/ directory, we no longer have to declare it as a tag file (which doesn't actually break the BagIt spec). Other systems that may ingest our bags won't have to worry about parsing the additional tag file if we do this.
I've outlined two possible formats for a V2 export format. I'm leaning towards the second suggestion because it draws a clearer distinction as to which files are relevant to the data package.
Consider a package named Frog Counts that is exported in the proposed V2 format.
Option 1
data/
(see Option 2)
data/
root
Option 2
data/
data/
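One practical effect of relocating oai-ore.txt into data/ is that it becomes a payload file and is therefore listed in the payload manifest rather than in the tag manifest. The sketch below illustrates that split for an already-exported bag directory. It is not the exporter's actual code; the function name is hypothetical and SHA-256 is chosen arbitrarily among the checksum algorithms BagIt allows.

```python
import hashlib
from pathlib import Path


def manifest_lines(bag_root):
    """Return (payload_lines, tag_lines) in '<checksum>  <path>' form.

    Files under data/ are payload; everything else is treated as a tag
    file. Existing manifest files are skipped to keep the sketch simple.
    """
    root = Path(bag_root)
    payload, tags = [], []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        if path.name.startswith(("manifest-", "tagmanifest-")):
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        line = f"{digest}  {rel}"
        (payload if rel.startswith("data/") else tags).append(line)
    return payload, tags
```

With oai-ore.txt at the bag root it would show up in the second list; once it lives at data/oai-ore.txt it shows up in the first, next to the data files.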
Scope of Changes
Changes will have to be made to software projects in the DataONE ecosystem that handle exporting and importing. These include
I'd like to gather questions, comments, and concerns in this issue. Feel free to reply below.
People interested in this probably include @mbjones, @datadavev, @taojing2002, @amoeba, @csjx