Skip to content

Add the skip_reference_sequence and ignore_reference_sequence options #2019

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 1 commit into from
Dec 7, 2021

Conversation

clwgg
Copy link
Contributor

@clwgg clwgg commented Dec 6, 2021

Description

Fixes #1971

Adds two new C flags: TSK_LOAD_SKIP_REFERENCE_SEQUENCE and TSK_CMP_IGNORE_REFERENCE_SEQUENCE.
These are exposed to python as the skip_reference_sequence flag to TableCollection.load and TreeSequence.load, as well as the ignore_reference_sequence flag to TableCollection.equals and TreeSequence.equals.
Since the flags were written this way in the issue I kept them "written out", though I think an argument could be made to shorten reference_sequence to refseq in all four cases (your call @jeromekelleher, @benjeffery).

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@clwgg clwgg requested a review from jeromekelleher December 6, 2021 06:28
@codecov
Copy link

codecov bot commented Dec 6, 2021

Codecov Report

Merging #2019 (b983188) into main (0c7549e) will increase coverage by 11.79%.
The diff coverage is 97.05%.

❗ Current head b983188 differs from pull request most recent head 21bc675. Consider uploading reports for the commit 21bc675 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##             main    #2019       +/-   ##
===========================================
+ Coverage   81.56%   93.35%   +11.79%     
===========================================
  Files          27       27               
  Lines       25404    25536      +132     
  Branches     1112     1112               
===========================================
+ Hits        20720    23840     +3120     
+ Misses       4619     1660     -2959     
+ Partials       65       36       -29     
Flag Coverage Δ
c-tests 92.37% <100.00%> (-0.01%) ⬇️
lwt-tests 89.05% <ø> (ø)
python-c-tests 72.14% <95.65%> (+0.06%) ⬆️
python-tests 98.78% <87.50%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/tables.py 98.76% <75.00%> (+14.82%) ⬆️
c/tskit/tables.c 90.26% <100.00%> (-0.01%) ⬇️
python/_tskitmodule.c 90.96% <100.00%> (+0.01%) ⬆️
python/tskit/trees.py 97.80% <100.00%> (+50.97%) ⬆️
python/tskit/provenance.py 100.00% <0.00%> (+5.71%) ⬆️
python/tskit/metadata.py 99.01% <0.00%> (+18.90%) ⬆️
python/tskit/util.py 100.00% <0.00%> (+40.43%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c7549e...21bc675. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic, thanks @clwgg. I spotted a few minor issues.

Can you update the changelogs also please? (Should be a copy/paste from the ignore_tables entries)

c/tskit/tables.c Outdated
int kas_flags = options & TSK_LOAD_SKIP_TABLES ? 0 : KAS_READ_ALL;
int kas_flags = KAS_READ_ALL;
if ((options & TSK_LOAD_SKIP_TABLES)
| (options & TSK_LOAD_SKIP_REFERENCE_SEQUENCE)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bitwise or I think, we want boolean or ||.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if you wanted to be fancy you could say options & (TSK_LOAD_SKIP_TABLES|TSK_LOAD_SKIP_REFERENCE_SEQUENCE), but I think I'd prefer the more obvious version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooops :shame:

@@ -2997,11 +2997,18 @@ def load(file, *, skip_tables=False):
Please note that with this option set, it is not possible to load data from
a stream of multiple tree sequences using consecutive calls to
:meth:`tskit.load`.
:param bool skip_reference_sequence: If True, the tree sequence is read
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should say that the skip_x options require a seekable stream, and they'll fail if you call them on a socket or stdin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was wondering about that but wasn't sure how to fit it into the :param structure. should I just throw it in at the top-level before the :param blocks?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw it in before the :param: blocks yeah, maybe as a .. warning:: or something?

@jeromekelleher
Copy link
Member

Since the flags were written this way in the issue I kept them "written out", though I think an argument could be made to shorten reference_sequence to refseq in all four cases

I agree there's an argument, but we've erred on the side of verbosity elsewhere so we might as well stick with what we have now, IMO. I'm happy to change if someone else feels strongly about it, though, I'm not bothered either way.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though one doc string needed a little tweak.

c/tskit/tables.h Outdated
Comment on lines 3551 to 3552
If the TSK_LOAD_SKIP_TABLES option is set, only the top-level
information of the table collection will be read, leaving all tables empty.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the TSK_LOAD_SKIP_TABLES option is set, only the top-level
information of the table collection will be read, leaving all tables empty.
If the TSK_LOAD_SKIP_TABLES option is set, only the non-table
information from the table collection will be read, leaving all tables with zero rows and no metadata or schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @benjeffery! since that's the skip_tables docstring, should I open another PR for that? or are y'all ok with throwing that in with the skip_refseq PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be in this PR as that text only needs changing because of the ref seq.

@clwgg
Copy link
Contributor Author

clwgg commented Dec 6, 2021

  • added documentation review suggestions
  • added code review suggestions
  • added changelogs
  • rebased & squashed

@jeromekelleher
Copy link
Member

Looks great, thanks @clwgg, let's merge!

@benjeffery benjeffery added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Dec 7, 2021
@mergify mergify bot merged commit afb1848 into tskit-dev:main Dec 7, 2021
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Dec 7, 2021
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add the skip_reference_sequence and ignore_reference_sequence options
3 participants