Version Exported Bags #5

Open
ThomasThelen opened this issue Aug 16, 2019 · 6 comments

Comments

@ThomasThelen
Member

Background

With PR #4, we no longer need to have a pid mapping file for packages whose resource maps encode the file path. If this change is made, we should consider making any additional changes to the package export format now.

Proposals

  1. Deprecate the pid-mapping.txt file in the exported bags.
    This can be done for packages with file paths in the resource map; for older packages, however, the pid-mapping.txt file is probably still required.

  2. Relocate oai-ore.txt inside data/
    By relocating the file into the data/ directory, we no longer have to declare it as a tag file (which doesn't actually break the BagIt spec). Other systems that may ingest our bags won't have to worry about parsing the additional tag file if we do this.

I've outlined two possible layouts for a V2 export format. I'm leaning towards the second because it draws a clearer distinction as to which files are relevant to the data package.

Consider a package named Frog Counts that is exported in the proposed V2 format.

Option 1

  1. The root directory of the package is not placed in data/ (see Option 2)
  2. The ORE is at the data/ root
<base directory>/
├── bagit.txt
├── bag-info.txt
├── manifest-<algorithm>.txt
└── data
    ├── oai-ore.txt
    ├── data-file-1.csv
    ├── data-file-2.csv
    ├── data-file-3.hdf
    └── metadata-file-1.xml

Option 2

  1. The data package is placed in a folder within data/
  2. The ORE document is placed within data/
<base directory>/
├── bagit.txt
├── bag-info.txt
├── manifest-<algorithm>.txt
└── data/
    ├── oai-ore.txt
    └── Frog Counts/
        ├── data-file-1.csv
        ├── data-file-2.csv
        ├── data-file-3.hdf
        └── metadata-file-1.xml
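
To make the effect of Proposal 2 concrete, here is a minimal Python sketch of how the payload manifest could be generated for either layout. It assumes the usual BagIt manifest line shape of a checksum followed by a path relative to the bag root; the function name manifest_lines is made up for illustration, and path escaping is not handled. Because oai-ore.txt sits under data/, it gets listed like any other payload file and needs no separate tag-file handling.

```python
import hashlib
from pathlib import Path


def manifest_lines(bag_dir: Path, algorithm: str = "md5") -> list[str]:
    """Return '<checksum> <path>' lines for every file under data/.

    Hypothetical helper: the name and signature are illustrative only.
    """
    lines = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if path.is_file():  # skip directories such as "Frog Counts/"
            digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
            # Paths are written relative to the bag root, with forward slashes.
            lines.append(f"{digest} {path.relative_to(bag_dir).as_posix()}")
    return lines
```

The same function works for Option 1 and Option 2, since the only difference between them is whether the payload files sit directly in data/ or one folder deeper.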

Scope of Changes

Changes will have to be made to the software projects in the DataONE ecosystem that handle exporting and importing. These include:

  1. GMN
  2. Metacat

I'd like to gather questions, comments, and concerns in this issue. Feel free to reply below.

People interested in this probably include @mbjones, @datadavev, @taojing2002, @amoeba, @csjx

@mbjones
Member

mbjones commented Aug 17, 2019

Thanks, @ThomasThelen. I prefer the first option, but I can see why you might like the second. As we discussed, I support these changes and it will be good to focus on improving any interoperability issues now if we are making changes.

I also spent a few minutes comparing your proposal to the RDA BagPack recommendation, and that might suggest a few alterations to be considered. In particular, they have other recommendations on where to include the metadata files, ORE files, etc. So that would make a third option, in which they suggest the following locations for the metadata:

Option 3: RDA "BagPack" recommendation

<base directory>/
├── bagit.txt
├── bag-info.txt
├── manifest-<algorithm>.txt
├── data
│   ├── data-file-1.csv
│   ├── data-file-2.csv
│   └── data-file-3.hdf
└── metadata
    ├── oai-ore.xml
    ├── eml.xml
    ├── datacite.xml
    ├── sysmeta-id01.xml
    ├── sysmeta-id02.xml
    ├── sysmeta-id03.xml
    ├── sysmeta-id04.xml
    └── sysmeta-id05.xml

This supports multiple metadata dialects (in this case, EML and ORE), and allows multiple metadata files for each object in the bag, named using the object identifier. So that might work for SystemMetadata documents (of which we would have 5 in this particular example). In addition, our ORE file and EML file would both have their own identifiers, checksums, and SystemMetadata. By treating them as 'data', they would be recognized as first-class objects in the package. So I wonder if it's better to put them in the data dir or the metadata dir. In addition, BagPack requires the use of a datacite.xml file, and the RDA spec also requires the use of a BagIt Profile document. So, the issues I see to be resolved:

  • Does the ORE document go in the root, in metadata, or in data?
  • Does the main science metadata (e.g., EML) go in data, where it currently is, or in metadata?
  • Where do the series of SystemMetadata files go (one per data and metadata file)?
  • Do we really want to introduce the requirement that all packages have a datacite.xml file?
  • Do we want to follow or use the BagIt profile for RDA BagPacks?

@ThomasThelen
Member Author

ThomasThelen commented Aug 19, 2019

Regarding the first three points: it's my opinion that any metadata should be kept out of the data/ folder for the benefit of the user. I imagine that most users downloading a package are only concerned with the actual data files. I think that adding anything other than the "desired" data files in data/ will be confusing and require them to sift through the directory to pull out the "desired" files.

I'm in the camp of placing oai-ore.xml at the root because it's going to potentially describe file paths. If we place oai-ore.xml in metadata/, we'd have to append each path with ../data (which isn't a big deal since we already have to append it with data/) during export. I don't think we'd want to store this path in the ORE that exists on DataONE. It might make more sense to keep the pre-pending consistent by putting metadata/ in front of each metadata document. This is possible if we place the ORE at the root.

Supporting datacite.xml doesn't sound like a bad idea; there may be some isomorphisms with the EML document that we can take advantage of.

The "running Tales locally" feature requires a specific structure (see the run_tpl variable). When the DataONE structure is decided, I'm going to modify that template to account for the change.

@amoeba
Contributor

amoeba commented Aug 19, 2019

Re:

@ThomasThelen: I think that adding anything other than the "desired" data files in data/ will be confusing and require them to sift through the directory to pull out the "desired" files.

👍

I prefer putting oai-ore in ./metadata as it's not a BagIt-specific file. RDA's spec says ./metadata should:

be used for all kinds of accompanying metadata in the bag, e.g. provenance information in ProvONE or access information following the WebACL standard.

Which is a nod towards ./metadata being the place for both RDF/XML and SystemMetadata.

Re:

@ThomasThelen: we'd have to append each path with ../data (which isn't a big deal since we already have to append it with data/) during export.

I'm not sure I totally follow. I figure the prov:locatedAt triple would always express the file's intended path when materialized on disk from some shared top level. Whether it's in a bag or not seems like a separate issue, or at least I think we can keep them separate.

Here's how I think it could look in an end-to-end sense:

The user has a folder, my_folder, on their computer with a subfolder and a CSV in it. These are the "desired" files you mention above:

my_folder
└── subfolder
    └── mydata.csv

They create a DataONE Data Package out of it (upload each object, create metadata, create ORE). The ORE would have triples like:

<https://cn.dataone.org/cn/v2/resolve/mydata.csv> prov:locatedAt "./subfolder/mydata.csv"

Then, when serialized as a BagIt bag, the ORE would remain untouched and go in the ./metadata folder. We'd use the ORE's prov:locatedAt triples to populate the ./data folder and related files. Then the fetch.txt file would have a line like:

https://cn.dataone.org/cn/v2/resolve/mydata.csv 12345 data/subfolder/mydata.csv

After all that, if we wanted to re-ingest the BagIt bag back into DataONE as a Data Package, I guess we could just populate the prov:locatedAt triples from the disk. The ORE ends up just being useful metadata as far as the bag is concerned.
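
As a rough sketch of the export step described above (Python; the function name fetch_line and its arguments are made up for illustration and are not an existing DataONE API), turning one prov:locatedAt value into a fetch.txt line could look like this. It assumes the path in the ORE is relative to the package root and does not handle escaping of special characters.

```python
from pathlib import PurePosixPath


def fetch_line(url: str, size: int, located_at: str) -> str:
    """Build one fetch.txt line: '<url> <length> <path under data/>'.

    Hypothetical helper: `located_at` is the prov:locatedAt literal from the ORE.
    """
    rel = PurePosixPath(located_at)  # "./subfolder/x.csv" becomes "subfolder/x.csv"
    if rel.is_absolute() or ".." in rel.parts:
        # Refuse paths that would land outside the bag's data/ directory.
        raise ValueError(f"path escapes the package: {located_at}")
    return f"{url} {size} {PurePosixPath('data') / rel}"


print(fetch_line(
    "https://cn.dataone.org/cn/v2/resolve/mydata.csv",
    12345,
    "./subfolder/mydata.csv",
))
```

The ORE itself is never rewritten here; the data/ prefix is added only when the bag is serialized, which keeps the bag layout and the prov:locatedAt values separate concerns.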

Re:

@mbjones: where do the series of SystemMetadata files go (one per data and metadata file)?

I like the idea of having System Metadata files in the bag but this doesn't scale well at all so I think we should skip it. If we really want additional metadata about each object in the bag, we could add another tag file with a subset of what's in System Metadata. The Bag already has sizes and checksums so I think the next most useful thing would be the format ID.

@mbjones: do we really want to introduce the requirement that all packages have a datacite.xml file?

I wasn't thinking this adds that requirement. I thought the datacite.xml file would be produced when converting a Data Package to BagIt and possibly on re-ingest but not described in the ORE.

@mbjones
Member

mbjones commented Aug 25, 2019

These are all really good points, and it seems we're mostly in agreement, with possibly a few minor differences. I think we should plan a time to discuss this and reach consensus. Some notes from questions above:

@amoeba: I like the idea of having System Metadata files in the bag but this doesn't scale well at all so I think we should skip it.

I would like to be able to round-trip a DataONE datapackage using only the information in the bag. If we omit SystemMetadata, we lose critical metadata to that round trip, including formatId, access policies, replication policies, and others. So, I think we should include a sysmeta for each file.

@amoeba: I wasn't thinking this adds that requirement [for datacite.xml].

The RDA spec is clear that conformant packages must include a datacite.xml. It would be cool if we conformed, but I agree it seems arbitrary to require just that metadata spec. So I would be ok with leaving it out.

@amoeba: I prefer putting oai-ore in ./metadata as it's not a BagIt-specific file.

I definitely see the advantage of putting the machine-readable metadata in metadata. The only downside is that it fails to recognize that files like our oai-ore.xml and eml.xml in the DataONE model are fully versioned members of the data package, and thus need to have their own identifier, sysmeta, and entries in the ORE resource map. We treat science metadata as a first-class citizen in our data packages. As long as we are aware of that, I think it's helpful to put them all in metadata.

@ThomasThelen: The data package is placed in a folder within data/

This proposal means that every package has a single root folder, whereas I think the same thing can happen if the user is allowed to add folders at the root of the package. I'd prefer not forcing a single root folder, which allows users to include multiple folders at the top level of the package. In a bag, this might look like:

data
├── Frog\ Counts
│   └── counts.csv
├── inputs
│   ├── input1.csv
│   └── input2.csv
├── outputs
│   └── output1.csv
├── table1.csv
└── table2.csv

Let's discuss before a final determination is made.

@ThomasThelen
Member Author

I'm currently summarizing and organizing these points in a Google document that I'll share before the meeting starts. I'll send an invitation out targeting next Tuesday.

@rogerdahl
Contributor

@ThomasThelen You might want to review the doc I wrote up about filename issues again as well:

https://hpad.dataone.org/GYUwjAxswGwOwFoBMwAsBOBrhwEYNwFY4AOBMSYAZmFziUIAYBDIA===#

I think it's important that we address filename sanitization and uniqueness, and that we try to make filenames consistent across everything.

We should define how ORE filenames interact with the system metadata fileName field, so that we know what the procedure is for dealing with conflicts, whether it's OK to substitute one for the other if one is missing or malformed, whether we should add a missing extension from the formatId, etc.
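
Purely as a strawman for that discussion, one possible precedence rule could be sketched in Python as below. Everything here is an assumption rather than decided behavior: the function name resolve_filename, the rule that the ORE path wins over the sysmeta fileName, and the tiny formatId-to-extension table, which is a stand-in for whatever mapping we agree on.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping; a real table would have to be agreed on separately.
EXTENSION_FOR_FORMAT = {"text/csv": ".csv", "text/plain": ".txt"}


def resolve_filename(ore_path: Optional[str],
                     sysmeta_file_name: Optional[str],
                     format_id: str) -> str:
    """Pick a relative path for an object (assumed rule: ORE path first)."""
    chosen = ore_path or sysmeta_file_name
    if not chosen:
        raise ValueError("neither the ORE nor the sysmeta provides a filename")
    path = PurePosixPath(chosen)
    # Only add an extension when there is none and the formatId is known.
    if not path.suffix and format_id in EXTENSION_FOR_FORMAT:
        path = path.with_name(path.name + EXTENSION_FOR_FORMAT[format_id])
    return str(path)
```

Conflicts between the two sources, sanitization, and uniqueness are deliberately left out; the sketch only shows where those decisions would plug in.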

My vote is to have the science metadata go in the data dir and the system metadata in the metadata dir. I think putting the science metadata in the data dir shows how important we think science metadata is. And I think it's good for people to take a look at the science metadata files, or at least be aware that they're there, before they start trying to use the data files.

The system metadata files are generally on the order of 1-2K, so no problem to include.

I think the package should only have a single root folder, so that when you extract it, it adds only one directory in the folder you're in. It's annoying when you extract a package and end up with a bunch of files and dirs mixed in with whatever is in your current dir.

4 participants