Add the skip_reference_sequence and ignore_reference_sequence options #1971

jeromekelleher · 2021-11-30T12:59:21Z

We recently added the concept of "table data" in 82a56e7 with the addition of the skip_tables flag to tskit.load() and the ignore_tables flag to TableCollection.equals() (and the corresponding flags to the C API). Since that change was made, we have also added basic support for reference sequence data in parallel. As @bhaller points out (#1854 (comment)), the skip_tables option still loads the reference sequence data.

The skip_tables option was initially motivated by the desire to get access to the top-level metadata only (#1854). Providing access only to the metadata is a non-starter I think, because it's much simpler to skip loading stuff into the table collection than it is to provide separate APIs for accessing the metadata. So, there will always be some extra info that comes with the metadata, and this is correct I think: what if I was going through a bunch of files just to read their uuid values? This isn't metadata, and I wouldn't want to read the whole file just to get them either.

The question then is what we do from this point. Since we want the option of not loading reference sequence data, the options as I see it are:

1. Add similar flags like skip_reference_sequence and ignore_reference_sequence to load and equals
2. Regard reference_sequence as table data, and document as such
3. Rename the skip_tables and ignore_tables flags to something like top_level_only and be clear that we don't consider reference_sequence as top level data.

Any thoughts @bhaller @clwgg?

benjeffery · 2021-11-30T13:33:26Z

To me, the reference sequence is in the same category as the table data, and we use the existing flags. Happy to hear of any use cases that need just the reference though.

jeromekelleher · 2021-11-30T13:35:14Z

I agree - any better name than skip_tables which captures this though?

benjeffery · 2021-11-30T13:41:18Z

As you say metadata_only doesn't work, skip_data might work, but is a little confusing?

bhaller · 2021-11-30T15:36:20Z

> To me, the reference sequence is in the same category as the table data, and we use the existing flags.

This seems fine to me. Of course, who knows what different functionality end users might request next :->. It might be that simply providing multiple flags – load this, don't load that – would provide flexibility for the future that might prove useful. But perhaps it makes sense to wait and see, rather than trying to put all that flexibility in at the outset (perhaps complicating the API unnecessarily).

If the policy choice is that the reference sequence is "table data", then perhaps rather than renaming the flag away from skip_tables, the reference sequence ought to be renamed to reflect that policy? It could be called, e.g., refseq_table or some such?

I'm not sure I understand the objection to the flag name metadata_only, though. If the reference sequence is not loaded, what remains besides metadata that is loaded? @jeromekelleher you wrote "So, there will always be some extra info that comes with the metadata" – what is that extra info, exactly?

clwgg · 2021-11-30T17:39:08Z

I agree with the above -- it doesn't feel like there is the need right now for a proliferation of flags to load to cover essentially a single use case.
I could see it being confusing though to treat the reference sequence as "table data" since we don't really have the notion of rows for this data structure, right? All the table data have a fairly common interface with add_row etc, so calling the reference a table, but then not providing the otherwise consistent table interface could be confusing I think.
I'm wondering if the reference has to be either top-level or table, or if it could be of its own "class" in a way? I suppose this is basically Option 3 from Jerome above. If it's clearly documented that the reference isn't top-level (though I am not sure what to call it instead), I think it would be easier to come up with a flag name that is exclusive of both tables and reference (and leave it open to skip other things in the future by calling it "that thing" instead of top-level).

jeromekelleher · 2021-11-30T19:04:03Z

> @jeromekelleher you wrote "So, there will always be some extra info that comes with the metadata" – what is that extra info, exactly?

Right now, there's sequence_length and file_uuid. We haven't done much with the UUIDs, but they are there and may prove useful in the future.

I think you're going to have the casting vote @bhaller - if you want to be able to load a tree sequence but skipping the reference sequence data like you mention here, then by far the simplest way to facilitate this is to have a skip_reference_sequence flag.

bhaller · 2021-11-30T19:19:02Z

> I think you're going to have the casting vote @bhaller - if you want to be able to load a tree sequence but skipping the reference sequence data like you mention here, then by far the simplest way to facilitate this is to have a skip_reference_sequence flag.

Yes, if that flag is available I will use it (barring unforeseen snafus), and I think it will make a real difference to end users in terms of memory usage. Thanks for listening. :->

jeromekelleher · 2021-11-30T20:10:50Z

OK, sounds like a decision then? We add the skip_reference_sequence and ignore_reference_sequence flags once #1944 is in.
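To make the decision concrete, here is a minimal sketch of the idea in plain Python: the small top-level fields (metadata, sequence_length, file_uuid) are always read, while each bulky section gets its own independent skip flag. Everything here is hypothetical and for illustration only. `ToyCollection`, `toy_load` and the dict standing in for a file are invented names and do not reflect how tskit implements loading.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyCollection:
    """Hypothetical stand-in for a table collection (not tskit's class)."""

    sequence_length: float
    file_uuid: str
    metadata: dict = field(default_factory=dict)
    tables: Optional[dict] = None  # None means "not loaded"
    reference_sequence: Optional[str] = None  # None means "not loaded"


def toy_load(store, *, skip_tables=False, skip_reference_sequence=False):
    """Copy sections of `store` (a dict standing in for a file) into memory.

    The small top-level fields are always read; each bulky section has
    its own independent skip flag.
    """
    return ToyCollection(
        sequence_length=store["sequence_length"],
        file_uuid=store["file_uuid"],
        metadata=dict(store["metadata"]),
        tables=None if skip_tables else dict(store["tables"]),
        reference_sequence=(
            None if skip_reference_sequence else store["reference_sequence"]
        ),
    )


# Made-up file contents for illustration.
store = {
    "sequence_length": 1000.0,
    "file_uuid": "00000000-0000-0000-0000-000000000000",
    "metadata": {"generation": 5},
    "tables": {"nodes": [0, 1, 2], "edges": [(0, 2), (1, 2)]},
    "reference_sequence": "ACGT" * 250,
}

# Tables are loaded, the (potentially large) reference is not.
no_ref = toy_load(store, skip_reference_sequence=True)

# Only the top-level fields are loaded.
top_only = toy_load(store, skip_tables=True, skip_reference_sequence=True)
```

Keeping the two flags independent covers both use cases raised in this thread: reading only top-level information from many files, and loading the tables while avoiding the memory cost of the reference sequence.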

bhaller · 2021-11-30T20:38:35Z

@jeromekelleher skip_reference_sequence, yes. What is ignore_reference_sequence again? Sorry if I lost the thread here somewhere. :->

clwgg · 2021-11-30T21:24:53Z

@bhaller together with the skip_tables flag to load, we also introduced an ignore_tables flag to TableCollection.equals() which we used in the PR primarily for testing the skip_tables functionality. The proposal here is to do essentially the same for the reference_sequence.
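As a rough illustration of why an ignore flag on the comparison is handy for testing a skip flag on the loader, the sketch below compares a fully loaded object with one whose reference sequence was skipped. The `ToyCollection` class and its `equals` method are hypothetical stand-ins written for this example, and are not tskit's TableCollection.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCollection:
    """Hypothetical stand-in for a table collection (not tskit's class)."""

    sequence_length: float
    metadata: dict = field(default_factory=dict)
    tables: dict = field(default_factory=dict)
    reference_sequence: str = ""

    def equals(self, other, *, ignore_tables=False, ignore_reference_sequence=False):
        """Compare two collections, optionally leaving out bulky sections."""
        # Top-level fields are always compared.
        if self.sequence_length != other.sequence_length:
            return False
        if self.metadata != other.metadata:
            return False
        if not ignore_tables and self.tables != other.tables:
            return False
        if (
            not ignore_reference_sequence
            and self.reference_sequence != other.reference_sequence
        ):
            return False
        return True


# A fully loaded collection and one loaded without its reference sequence.
full = ToyCollection(100.0, {"name": "demo"}, {"nodes": [0, 1]}, "ACGT")
skipped = ToyCollection(100.0, {"name": "demo"}, {"nodes": [0, 1]}, "")

# A plain comparison sees the missing reference as a difference...
assert not full.equals(skipped)
# ...while ignoring the reference confirms everything else matches.
assert full.equals(skipped, ignore_reference_sequence=True)
```

A test for the skip flag can then assert that the partially loaded object equals the fully loaded one once the skipped section is ignored.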

jeromekelleher · 2021-12-01T09:16:35Z

I've changed the title of this issue accordingly. @clwgg is there any chance we could use your expertise here? 😄

clwgg · 2021-12-01T17:54:32Z

sure, I'm happy to work on it! what is the approx. merge window time line for 0.4.0/1.0.0 at this point? (just to see if I can get it done in time)

benjeffery · 2021-12-01T18:10:32Z

We're hoping about a week.

jeromekelleher · 2021-12-01T18:58:38Z

I'll try to get #1944 merged tomorrow @clwgg

jeromekelleher · 2021-12-02T17:22:39Z

The big update for reference sequences is merged @clwgg, so the way is clear if you'd like to pick this one up!

bhaller · 2021-12-05T05:20:09Z

@clwgg it would be great if this happened soon; we're getting down to the wire on getting this stuff in before SLiM 3.7 needs to ship. Just FYI, if you are able to get to it. Thanks!

jeromekelleher · 2021-12-05T12:25:39Z

There's no pressure on you to do this @clwgg, but could you let us know if you'll be able to get to it in the next day or two? @benjeffery or I would be happy to pick it up instead, as we're very keen to tag 0.4.0 (and a C 0.99 release) so that we can unblock some downstream stuff.

clwgg · 2021-12-05T20:22:22Z

yup, on it today!

bhaller · 2021-12-07T20:23:05Z

Just an update: the tskit_one_point_oh branch has been updated to tskit 0.99.15, and appropriate fixes have been put in to match the changes done on the tskit side with respect to reading/writing the reference sequence. Changes pushed to GitHub. Seems good so far; running the full test suite now, which takes several hours.

jeromekelleher added this to the Python 0.4.0 milestone Nov 30, 2021

jeromekelleher mentioned this issue Nov 30, 2021

avoid double copies of the reference sequence? #1970

Closed

jeromekelleher changed the title ~~Should reference_sequence be consider "table" data, top-level data, or something else?~~ Add the skip_reference_sequence and ignore_reference_sequence options Dec 1, 2021

benjeffery added the enhancement New feature or request label Dec 1, 2021

bhaller mentioned this issue Dec 4, 2021

Take string methods for reference_sequence #2011

Merged

clwgg mentioned this issue Dec 6, 2021

Add the skip_reference_sequence and ignore_reference_sequence options #2019

Merged

3 tasks

clwgg added a commit to clwgg/tskit that referenced this issue Dec 6, 2021

Add options to skip/ignore the reference sequence

21bc675

Closes tskit-dev#1971

mergify bot closed this as completed in #2019 Dec 7, 2021

mergify bot pushed a commit that referenced this issue Dec 7, 2021

Add options to skip/ignore the reference sequence

afb1848

Closes #1971
