Clarify how can applications tell which top level element is intended #360

cmungall · 2022-11-17T17:02:54Z

There are 3 top level elements (TLE) in phenopackets:

https://phenopacket-schema.readthedocs.io/en/latest/toplevel.html

All of the examples here:

https://phenopacket-schema.readthedocs.io/en/latest/examples.html

Use a phenopacket as a top level element

However, other repos have examples that use other top level elements; e.g. https://github.com/phenopackets/phenopacket-tools/blob/gh-pages/examples/families/family.yml

If an application is presented with a phenopacket document D, how should the application determine how to interpret it?

Attempt to parse using each TLE schema until it finds one that passes. Note that in certain perverse cases, this could lead to abiguity
The application should attempt to sniff the right TLE from the filename. E.g. in the example above "family.yml" looks like family should be the TLE
Behavior is undefined, and a phenopacket-conforming application must receive a tuple of two messages, both the document D plus an additional TLE type designator T

None of these seem particularly satisfactory. Perhaps future versions of phenopackets could include a type designator field in each TLE so applications can clearly and unambiguously interpret a document

cmungall · 2022-11-17T17:14:45Z

It looks like phenopacket-tools is going with strategy 3 and 1:

Implement element sniffing phenopacket-tools#111

i.e the user should specify the element type (strategy 3), otherwise default to strategy 1.

I think this should be better documented in the main schema repo so all applications can implement analogous strategies

julesjacobsen · 2023-01-06T13:34:25Z

A possible extension could be to use a wrapper object which will explicitly provide the type. Also this could allow other structures, such as the Phenopacket GA4GH Pedigree implementation to be added.

message PhenopacketWrapper {
    oneOf message {
        Phenopacket phenopacket = 1;
        Family family = 2;
        Cohort cohort = 3;
        // Ga4ghPedigree pedigree 4;  // possible additions might include this
    }
}

e.g.

# this is definitely a phenopacket, because it says so...
---
phenopacket:
    id: 12345
    subject:
        id: "Bart"
    phenotypicFeatures:
        - type:
              id: "HP:0000952"
              label: "Jaundice"

rather than this

# this is probably a phenopacket, because it has a top-level 'subject' field 
---
id: 12345
subject:
    id: "Bart"
phenotypicFeatures:
    - type:
          id: "HP:0000952"
          label: "Jaundice"

ielis · 2023-04-05T20:46:35Z

It is possible to implement "sniffing" - determine the format (YAML, JSON, or Protobuf) and the top-level element, at least to some extent. A simple strategy can test if the document looks like YAML or JSON using file signatures. JSON should start with {, YAML has lines with comments (#), document separators (although I'd prefer not seeing them in our setting), or top-level fields. If this fails, the sniffer can look for magic bytes (e.g. gzip) or fall back to Protobuf (or throw).

Sniffing top-level element can also be done to some extent. For JSON and YAML, it is possible to search for discriminatory top-level fields - fields that can be found only in specific elements (e.g. pedigree in Family). Unfortunately, I am not sure this simple algorithm will work with Protobuf, since the field names are not part of the payload.

Sniffing can help but it will always be fallible (I think) unless we add a wrapper element. The wrapper would, however, cause other pain..

I implemented the sniffing in phenopacket-tools

pnrobinson · 2023-04-05T21:58:35Z

This is a great feature! It might be good to add a --strict flag (or to make this feature optional), because sometimes it is good for software to die if the input is not clear.

ielis · 2023-04-06T14:11:57Z

@pnrobinson yeah, the sniffing is turned off if the user is explicit with the input using -f | --format or -e | --element CLI options.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify how can applications tell which top level element is intended #360

Clarify how can applications tell which top level element is intended #360

cmungall commented Nov 17, 2022

cmungall commented Nov 17, 2022 •

edited

Loading

julesjacobsen commented Jan 6, 2023 •

edited

Loading

ielis commented Apr 5, 2023

pnrobinson commented Apr 5, 2023

ielis commented Apr 6, 2023

Clarify how can applications tell which top level element is intended #360

Clarify how can applications tell which top level element is intended #360

Comments

cmungall commented Nov 17, 2022

cmungall commented Nov 17, 2022 • edited Loading

julesjacobsen commented Jan 6, 2023 • edited Loading

ielis commented Apr 5, 2023

pnrobinson commented Apr 5, 2023

ielis commented Apr 6, 2023

cmungall commented Nov 17, 2022 •

edited

Loading

julesjacobsen commented Jan 6, 2023 •

edited

Loading