Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Clarify how can applications tell which top level element is intended #360

Open
cmungall opened this issue Nov 17, 2022 · 5 comments
Open

Comments

@cmungall
Copy link
Member

There are 3 top level elements (TLE) in phenopackets:

https://phenopacket-schema.readthedocs.io/en/latest/toplevel.html

All of the examples here:

https://phenopacket-schema.readthedocs.io/en/latest/examples.html

Use a phenopacket as a top level element

However, other repos have examples that use other top level elements; e.g. https://github.com/phenopackets/phenopacket-tools/blob/gh-pages/examples/families/family.yml

If an application is presented with a phenopacket document D, how should the application determine how to interpret it?

  1. Attempt to parse using each TLE schema until it finds one that passes. Note that in certain perverse cases, this could lead to abiguity
  2. The application should attempt to sniff the right TLE from the filename. E.g. in the example above "family.yml" looks like family should be the TLE
  3. Behavior is undefined, and a phenopacket-conforming application must receive a tuple of two messages, both the document D plus an additional TLE type designator T

None of these seem particularly satisfactory. Perhaps future versions of phenopackets could include a type designator field in each TLE so applications can clearly and unambiguously interpret a document

@cmungall
Copy link
Member Author

cmungall commented Nov 17, 2022

It looks like phenopacket-tools is going with strategy 3 and 1:

i.e the user should specify the element type (strategy 3), otherwise default to strategy 1.

I think this should be better documented in the main schema repo so all applications can implement analogous strategies

@julesjacobsen
Copy link
Collaborator

julesjacobsen commented Jan 6, 2023

A possible extension could be to use a wrapper object which will explicitly provide the type. Also this could allow other structures, such as the Phenopacket GA4GH Pedigree implementation to be added.

message PhenopacketWrapper {
    oneOf message {
        Phenopacket phenopacket = 1;
        Family family = 2;
        Cohort cohort = 3;
        // Ga4ghPedigree pedigree 4;  // possible additions might include this
    }
}

e.g.

# this is definitely a phenopacket, because it says so...
---
phenopacket:
    id: 12345
    subject:
        id: "Bart"
    phenotypicFeatures:
        - type:
              id: "HP:0000952"
              label: "Jaundice"

rather than this

# this is probably a phenopacket, because it has a top-level 'subject' field 
---
id: 12345
subject:
    id: "Bart"
phenotypicFeatures:
    - type:
          id: "HP:0000952"
          label: "Jaundice"

@ielis
Copy link
Contributor

ielis commented Apr 5, 2023

It is possible to implement "sniffing" - determine the format (YAML, JSON, or Protobuf) and the top-level element, at least to some extent. A simple strategy can test if the document looks like YAML or JSON using file signatures. JSON should start with {, YAML has lines with comments (#), document separators (although I'd prefer not seeing them in our setting), or top-level fields. If this fails, the sniffer can look for magic bytes (e.g. gzip) or fall back to Protobuf (or throw).

Sniffing top-level element can also be done to some extent. For JSON and YAML, it is possible to search for discriminatory top-level fields - fields that can be found only in specific elements (e.g. pedigree in Family). Unfortunately, I am not sure this simple algorithm will work with Protobuf, since the field names are not part of the payload.

Sniffing can help but it will always be fallible (I think) unless we add a wrapper element. The wrapper would, however, cause other pain..

I implemented the sniffing in phenopacket-tools

@pnrobinson
Copy link
Collaborator

This is a great feature! It might be good to add a --strict flag (or to make this feature optional), because sometimes it is good for software to die if the input is not clear.

@ielis
Copy link
Contributor

ielis commented Apr 6, 2023

@pnrobinson yeah, the sniffing is turned off if the user is explicit with the input using -f | --format or -e | --element CLI options.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants