-
Notifications
You must be signed in to change notification settings - Fork 76
Finalise alignment export #1894
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1894 +/- ##
==========================================
+ Coverage 93.15% 93.16% +0.01%
==========================================
Files 27 27
Lines 25061 25099 +38
Branches 1104 1109 +5
==========================================
+ Hits 23345 23383 +38
Misses 1682 1682
Partials 34 34
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
b17940d
to
63a77f7
Compare
37cca5e
to
737fe42
Compare
This is ready for a look, can you have a look please @hyanwong @benjeffery? |
68795fa
to
35b6719
Compare
Bumping this - this is a major new piece of functionality so it would be good to get some eyes on it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks fine to me, modulo some minor comments (especially about how we detect missing data).
One more general point is that we should perhaps think about how in the future we would deal with indels, so that we don't paint ourselves into a corner now. However , it's hard to imagine how to align sequences with indels, unless the reference allele is the longest one (e.g. ATC) and any mutations are the same length or shorter. There would also be a weird interaction between e.g. a site at position 10 with ancestral_state "ATC" and a site at position 11 with ancestral state "G". Nevertheless, will will have to deal with indels at some point, so it might be worth thinking about how other libraries do it.
Note that I think this also closes #353 doesn't it? Unless we want to leave an issue open to deal with indels? |
Yep - it's in the commit messages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, % @hyanwong's questions.
35b6719
to
f230916
Compare
Thanks for the comments @hyanwong - I've made another pass. We're not going to get this perfect for missing data on this pass, but it'll be good enough to cover 99% of use cases. I'm fairly sure it's currently conservative, so that we may throw an error where not strictly necessary, but we'll never output something that's actually wrong. |
Re indels, I agree we'll have to tackle this at some point. The only thing we'll need to worry about here I think in terms of forward compatibility is whether we insist that the reference sequence is equal to the sequence length when storing the reference data (#146). I don't think it affects the code here. |
Right. I was wondering how people store multiple sequences in a FASTA file if there are indels? Do they just allow each sequence to be a different length, and to hell with the alignment & reference sequence? Or do they pad out all the deletions with "-", and potentially have more characters in each sequence than in the reference? It might be helpful to talk to someone who knows about this sort of thing, although I agree it doesn't seem like it should affect this PR. |
Closes tskit-dev#1897 Closes tskit-dev#1818 Also fix incorrect documentation on genotype_matrix and variants wrt to missing data
f230916
to
cc0a9e7
Compare
OK, merging this! Now the big question is whether we go ahead and release 0.4.0 like this or if we implement reference sequences first... |
Add support for outputting alignments in FASTA and nexus formats, and fixes #1893.