Skip to content

Add the skip_reference_sequence and ignore_reference_sequence options #1971

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
jeromekelleher opened this issue Nov 30, 2021 · 19 comments · Fixed by #2019
Closed

Add the skip_reference_sequence and ignore_reference_sequence options #1971

jeromekelleher opened this issue Nov 30, 2021 · 19 comments · Fixed by #2019
Labels
enhancement New feature or request
Milestone

Comments

@jeromekelleher
Copy link
Member

We recently added the concept of "table data" in 82a56e7 with the addition of the skip_tables flag to tskit.load() and the ignore_tables flag to TableCollection.equals() (and the corresponding flags to the C API). Since that change was made we also in parallel added basic support for reference sequence data. As @bhaller points out (#1854 (comment)) the skip_tables option loads the reference sequence data.

The skip_tables option was initially motivated by the desire to get access to the top-level metadata only (#1854). Providing access only to the metadata is a non-starter I think, because it's much simpler to skip loading stuff into the table collection that it is to provide separate APIs for accessing the metadata. So, there will always be some extra info that comes with the metadata, and this is correct I think: what if I was going through a bunch of files just to read their uuid values? This isn't metadata, and I wouldn't want to read the whole file just to get them either.

The question then is what we do from this point. Since we want the option of not loading reference sequence data, the options as I see it are:

  1. Add similar flags like skip_reference_sequence and ignore_reference_sequence to load and equals
  2. Regard reference_sequence as table data, and document as such
  3. Rename the skip_tables and ignore_tables flags to something like top_level_only and be clear that we don't consider reference_sequence as top level data.

Any thoughts @bhaller @clwgg?

@jeromekelleher jeromekelleher added this to the Python 0.4.0 milestone Nov 30, 2021
@benjeffery
Copy link
Member

To me, the reference sequence is in the same category as the table data, and we use the existing flags. Happy to hear of any use cases that need just the reference though.

@jeromekelleher
Copy link
Member Author

I agree - any better name skip_tables which captures this though?

@benjeffery
Copy link
Member

As you say metadata_only doesn't work, skip_data might work, but is a little confusing?

@bhaller
Copy link

bhaller commented Nov 30, 2021

To me, the reference sequence is in the same category as the table data, and we use the existing flags.

This seems fine to me. Of course, who knows what different functionality end users might request next :->. It might be that simply providing multiple flags – load this, don't load that – would provide flexibility for the future that might prove useful. But perhaps it makes sense to wait and see, rather than trying to put all that flexibility in at the outset (perhaps complicating the API unnecessarily).

If the policy choice is that the reference sequence is "table data", then perhaps rather than renaming the flag away from skip_tables, the reference sequence ought to be renamed to reflect that policy? It could be called, e.g., refseq_table or some such?

I'm not sure I understand the objection to the flag name metadata_only, though. If the reference sequence is not loaded, what remains besides metadata that is loaded? @jeromekelleher you wrote "So, there will always be some extra info that comes with the metadata" – what is that extra info, exactly?

@clwgg
Copy link
Contributor

clwgg commented Nov 30, 2021

I agree with the above -- it doesn't feel like there is the need right now for a proliferation of flags to load to cover essentially a single use case.
I could see it be confusing though to treat the reference sequence as "table data" since we don't really have the notion of rows for this data structure, right? All the table data have a fairly common interface with add_row etc, so calling the reference a table, but then not providing the otherwise consistent table interface could be confusing I think.
I'm wondering if the reference has to be either top-level or table, or if it could be of its own "class" in a way? I suppose this is basically Option 3 from Jerome above. If it's clearly documented that the reference isn't top-level (though I am not sure what to call it instead), I think it would be easier to come up with a flag name that is exclusive of both tables and reference (and leave it open to skip other things in the future by calling it "that thing" instead of top-level).

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Nov 30, 2021

@jeromekelleher you wrote "So, there will always be some extra info that comes with the metadata" – what is that extra info, exactly?

Right now, there's sequence_length and file_uuid. We haven't done much with the UUIDs, but they are there and may prove useful in the future.

I think you're going to have the casting vote @bhaller - if you want to be able to load a tree sequence but skipping the reference sequence data like you mention here, then by far the simplest way to facilitate this is to have a skip_reference_sequence flag.

@bhaller
Copy link

bhaller commented Nov 30, 2021

I think you're going to have the casting vote @bhaller - if you want to be able to load a tree sequence but skipping the reference sequence data like you mention here, then by far the simplest way to facilitate this is to have a skip_reference_sequence flag.

Yes, if that flag is available I will use it (barring unforeseen snafus), and I think it will make a real difference to end users in terms of memory usage. Thanks for listening. :->

@jeromekelleher
Copy link
Member Author

OK, sounds like a decision then? We add the skip_reference_sequence and ignore_reference_sequence flags once #1944 is in.

@bhaller
Copy link

bhaller commented Nov 30, 2021

@jeromekelleher skip_reference_sequence, yes. What is ignore_reference_sequence again? Sorry if I lost the thread here somewhere. :->

@clwgg
Copy link
Contributor

clwgg commented Nov 30, 2021

@bhaller together with the skip_tables flag to load, we also introduced a ignore_tables flag to TableCollection.equals() which we used in the PR primarily for testing the skip_tables functionality. The proposal here is to do essentially the same for the reference_sequence.

@jeromekelleher jeromekelleher changed the title Should reference_sequence be consider "table" data, top-level data, or something else? Add the skip_reference_sequence and ignore_reference_sequence options Dec 1, 2021
@jeromekelleher
Copy link
Member Author

I've changed the title of this issue accordingly. @clwgg is there any chance we could use your expertise here? 😄

@benjeffery benjeffery added the enhancement New feature or request label Dec 1, 2021
@clwgg
Copy link
Contributor

clwgg commented Dec 1, 2021

sure, I'm happy to work on it! what is the approx. merge window time line for 0.4.0/1.0.0 at this point? (just to see if I can get it done in time)

@benjeffery
Copy link
Member

Were hoping about a week.

@jeromekelleher
Copy link
Member Author

I'll try to get #1944 merged tomorrow @clwgg

@jeromekelleher
Copy link
Member Author

The big update for references sequences is merged @clwgg, so the way is clear if you'd like to pick this one up!

@bhaller
Copy link

bhaller commented Dec 5, 2021

@clwgg it would be great if this happened soon; we're getting down to the wire on getting this stuff in before SLiM 3.7 needs to ship. Just FYI, if you are able to get to it. Thanks!

@jeromekelleher
Copy link
Member Author

There's no pressure on you to do this @clwgg, but could you let us know if you'll be able to get to it in the next day or two? @benjeffery or I would be happy to pick it up instead, as we're very keen to tag 0.4.0 (and a C 0.99 release) so that we can unblock some downstream stuff.

@clwgg
Copy link
Contributor

clwgg commented Dec 5, 2021

yup, on it today!

@bhaller
Copy link

bhaller commented Dec 7, 2021

Just an update: the tskit_one_point_oh branch has been updated to tskit 0.99.15, and appropriate fixes have been put in to match the changes done on the tskit side with respect to reading/writing the reference sequence. Changes pushed to GitHub. Seems good so far; running the full test suite now, which takes several hours.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants