Finalise alignment export #1894

Merged
merged 5 commits into from
Nov 11, 2021

Conversation

jeromekelleher
Member

Adds support for outputting alignments in FASTA and Nexus formats, and fixes #1893.
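
For readers unfamiliar with the target format, here is a minimal sketch of FASTA serialisation in plain Python. It is not the code added in this PR: the function name `format_fasta`, the `n0`/`n1` labels, and the 60-character default wrap width are all assumptions made for illustration.

```python
def format_fasta(alignments, wrap_width=60):
    """Return a FASTA string for a mapping of label -> sequence.

    Hypothetical helper for illustration only; not tskit's implementation.
    A wrap_width of 0 writes each sequence on a single line.
    """
    lines = []
    for label, sequence in alignments.items():
        # Each record starts with a header line beginning with ">".
        lines.append(f">{label}")
        if wrap_width > 0:
            # Slice manually so gap or missing-data characters are never
            # treated as word boundaries.
            for start in range(0, len(sequence), wrap_width):
                lines.append(sequence[start:start + wrap_width])
        else:
            lines.append(sequence)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_fasta({"n0": "ACGTAC", "n1": "ACNTAC"}, wrap_width=4), end="")
```

Slicing by index rather than using a text-wrapping utility keeps every line exactly `wrap_width` characters long (except the last), whatever characters the sequence contains.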

@jeromekelleher jeromekelleher mentioned this pull request Nov 7, 2021
@codecov

codecov bot commented Nov 7, 2021

Codecov Report

Merging #1894 (cc0a9e7) into main (6f3f942) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1894      +/-   ##
==========================================
+ Coverage   93.15%   93.16%   +0.01%     
==========================================
  Files          27       27              
  Lines       25061    25099      +38     
  Branches     1104     1109       +5     
==========================================
+ Hits        23345    23383      +38     
  Misses       1682     1682              
  Partials       34       34              
Flag             Coverage Δ
c-tests          92.21% <ø> (ø)
lwt-tests        89.14% <ø> (ø)
python-c-tests   68.30% <18.91%> (-0.14%) ⬇️
python-tests     98.76% <100.00%> (+<0.01%) ⬆️

Impacted Files                 Coverage Δ
python/tskit/cli.py            95.93% <ø> (ø)
python/tskit/text_formats.py   100.00% <100.00%> (ø)
python/tskit/trees.py          97.79% <100.00%> (-0.01%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jeromekelleher
Member Author

This is ready for review. Can you have a look please, @hyanwong @benjeffery?

@jeromekelleher jeromekelleher force-pushed the nexus-data branch 3 times, most recently from 68795fa to 35b6719 Compare November 9, 2021 13:41
@jeromekelleher
Member Author

Bumping this - this is a major new piece of functionality so it would be good to get some eyes on it.

Member

@hyanwong hyanwong left a comment

This all looks fine to me, modulo some minor comments (especially about how we detect missing data).

One more general point is that we should perhaps think about how in the future we would deal with indels, so that we don't paint ourselves into a corner now. However, it's hard to imagine how to align sequences with indels, unless the reference allele is the longest one (e.g. ATC) and any mutations are the same length or shorter. There would also be a weird interaction between e.g. a site at position 10 with ancestral_state "ATC" and a site at position 11 with ancestral_state "G". Nevertheless, we will have to deal with indels at some point, so it might be worth thinking about how other libraries do it.
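
To make the interaction between multi-character sites concrete, here is a small sketch in plain Python that flags sites whose ancestral state spans the position of a later site. The function name `find_overlapping_sites` and the `(position, ancestral_state)` tuple representation are assumptions for illustration; nothing like this is part of the PR.

```python
def find_overlapping_sites(sites):
    """Return pairs of sites whose ancestral-state spans overlap.

    `sites` is an iterable of (position, ancestral_state) tuples. A site is
    assumed to occupy positions [position, position + len(ancestral_state)).
    Hypothetical helper for illustration only.
    """
    ordered = sorted(sites)
    overlaps = []
    for i, (pos_a, state_a) in enumerate(ordered):
        end_a = pos_a + len(state_a)
        for pos_b, state_b in ordered[i + 1:]:
            if pos_b >= end_a:
                # Sites are sorted, so no later site can overlap this one.
                break
            overlaps.append(((pos_a, state_a), (pos_b, state_b)))
    return overlaps


if __name__ == "__main__":
    # The example from the comment above: "ATC" at 10 covers position 11.
    print(find_overlapping_sites([(10, "ATC"), (11, "G"), (13, "T")]))
```

With single-character ancestral states, as the export code in this PR assumes, the function always returns an empty list.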

@hyanwong
Member

Note that I think this also closes #353 doesn't it? Unless we want to leave an issue open to deal with indels?

@jeromekelleher
Member Author

> Note that I think this also closes #353 doesn't it? Unless we want to leave an issue open to deal with indels?

Yep - it's in the commit messages

Member

@benjeffery benjeffery left a comment

LGTM, % @hyanwong's questions.

@jeromekelleher
Member Author

Thanks for the comments @hyanwong - I've made another pass. We're not going to get this perfect for missing data on this pass, but it'll be good enough to cover 99% of use cases. I'm fairly sure it's currently conservative, so that we may throw an error where not strictly necessary, but we'll never output something that's actually wrong.
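
As an illustration of what "conservative" could mean here, the sketch below builds one alignment row and refuses to proceed whenever the missing-data character also appears among a site's alleles, even if no sample is actually missing at that site. This is an assumed reading, not the PR's code: the function `build_alignment_row`, the `MISSING = -1` sentinel, and the default `"N"` character are all hypothetical choices for this example.

```python
MISSING = -1  # assumed sentinel for a missing genotype in this sketch


def build_alignment_row(reference, positions, alleles, genotypes,
                        missing_data_character="N"):
    """Return one sample's sequence as a string.

    reference: string giving the base at every position.
    positions: integer site positions (indexes into `reference`).
    alleles:   one tuple of single-character alleles per site.
    genotypes: one allele index per site, or MISSING.
    Hypothetical helper for illustration only.
    """
    row = list(reference)
    for pos, site_alleles, genotype in zip(positions, alleles, genotypes):
        # Conservative check: fail if the output could be ambiguous, even
        # when this particular genotype is not missing.
        if missing_data_character in site_alleles:
            raise ValueError(
                f"missing data character {missing_data_character!r} "
                f"is also an allele at position {pos}"
            )
        if genotype == MISSING:
            row[pos] = missing_data_character
        else:
            row[pos] = site_alleles[genotype]
    return "".join(row)
```

The check can raise where the output would in fact have been unambiguous, but it never lets a real allele be mistaken for missing data, which is the trade-off described in the comment above.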

@jeromekelleher
Member Author

jeromekelleher commented Nov 11, 2021

Re indels, I agree we'll have to tackle this at some point. The only thing we'll need to worry about here I think in terms of forward compatibility is whether we insist that the reference sequence is equal to the sequence length when storing the reference data (#146). I don't think it affects the code here.

@hyanwong
Member

> Re indels, I agree we'll have to tackle this at some point. The only thing we'll need to worry about here I think in terms of forward compatibility is whether we insist that the reference sequence is equal to the sequence length when storing the reference data (#146). I don't think it affects the code here.

Right. I was wondering how people store multiple sequences in a FASTA file if there are indels? Do they just allow each sequence to be a different length, and to hell with the alignment & reference sequence? Or do they pad out all the deletions with "-", and potentially have more characters in each sequence than in the reference? It might be helpful to talk to someone who knows about this sort of thing, although I agree it doesn't seem like it should affect this PR.

Closes tskit-dev#1897
Closes tskit-dev#1818

Also fix incorrect documentation on genotype_matrix and variants with
respect to missing data
@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 11, 2021
@jeromekelleher
Member Author

OK, merging this! Now the big question is whether we go ahead and release 0.4.0 like this or if we implement reference sequences first...

@mergify mergify bot merged commit 1fcb3f6 into tskit-dev:main Nov 11, 2021
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Nov 11, 2021
@jeromekelleher jeromekelleher deleted the nexus-data branch November 11, 2021 16:59
Development

Successfully merging this pull request may close these issues.

Default missing data character should not be "-"
3 participants