Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

ga4gh_identify adds digest properties in-place regardless of in_place argument #440

Open
jsstevenson opened this issue Jul 25, 2024 · 1 comment · May be fixed by #461
Open

ga4gh_identify adds digest properties in-place regardless of in_place argument #440

jsstevenson opened this issue Jul 25, 2024 · 1 comment · May be fixed by #461
Labels
bug Something isn't working question Further information is requested

Comments

@jsstevenson
Copy link
Contributor

I have a basic allele. If I manually add the VA ID, the compacted JSON dump is 323 chars:

>>> allele = Allele(location={"end": 44908822, "start": 44908821, "sequenceReference": {"id": "NC_0000019.10", "type": "SequenceReference", "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"}, "type": "SequenceLocation"}, state={"sequence": "T", "type":  "LiteralSequenceExpression"}, type="Allele")
>>> allele.id = "ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt"
>>> len(allele.model_dump_json(exclude_none=True))
323

If I use ga4gh_identify, though, even if I set in_place to "never", digest fields are added that makes the final JSON over 400 characters:

>>> allele = Allele(location={"end": 44908822, "start": 44908821, "sequenceReference": {"id": "NC_0000019.10", "type": "SequenceReference", "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"}, "type": "SequenceLocation"}, state={"sequence": "T", "type":  "LiteralSequenceExpression"}, type="Allele")
>>> allele.id = ga4gh_identify(allele, "never")
>>> len(allele.model_dump_json(exclude_none=True))
411
>>> print(allele.model_dump_json(exclude_none=True, indent=2))
{
  "id": "ga4gh:VA.0AePZIWZUNsUlQTamyLrjm2HWUw2opLt",
  "type": "Allele",
  "digest": "0AePZIWZUNsUlQTamyLrjm2HWUw2opLt",
  "location": {
    "type": "SequenceLocation",
    "digest": "wIlaGykfwHIpPY2Fcxtbx4TINbbODFVz",
    "sequenceReference": {
      "id": "NC_0000019.10",
      "type": "SequenceReference",
      "refgetAccession": "SQ.IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl"
    },
    "start": 44908821,
    "end": 44908822
  },
  "state": {
    "type": "LiteralSequenceExpression",
    "sequence": "T"
  }
}

This isn't the end of the world, but an increase in size of ~25% ends up being pretty hefty in extreme cases. is it intentional/necessary for ga4gh_identify to add these digest fields in-place, or can that be taken out? Or alternatively, maybe a ga4gh_compact method could strip unset/redundant fields?

@jsstevenson jsstevenson added bug Something isn't working question Further information is requested labels Jul 25, 2024
@korikuzma
Copy link
Contributor

Since digest is optional, I think it's a bug. I think in_place should apply to both id and digest fields. @ga4gh/vrs-python-maintainers thoughts?

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants