Finalise alignment export #1894

jeromekelleher · 2021-11-07T13:51:54Z

Adds support for outputting alignments in FASTA and nexus formats, and fixes #1893.
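For readers unfamiliar with the two formats, here is a minimal pure-Python sketch of what FASTA and nexus DATA output look like for a set of equal-length aligned sequences. This is not the tskit implementation: the helper names `write_fasta` and `as_nexus_data`, the `wrap_width` and `missing` parameters, and the sample names `n0` and `n1` are all assumptions made for illustration.

```python
import io


def write_fasta(alignments, out, wrap_width=60):
    """Write a non-empty {name: sequence} mapping to `out` in FASTA format.

    Hypothetical helper for illustration only. Sequences are wrapped every
    `wrap_width` characters; wrap_width <= 0 disables wrapping.
    """
    for name, seq in alignments.items():
        out.write(f">{name}\n")
        if wrap_width <= 0:
            out.write(seq + "\n")
            continue
        for start in range(0, len(seq), wrap_width):
            out.write(seq[start:start + wrap_width] + "\n")


def as_nexus_data(alignments, missing="?"):
    """Return a minimal nexus DATA block for a non-empty {name: sequence} mapping.

    Hypothetical helper for illustration only. The `missing` character is an
    assumption; all sequences must have the same length.
    """
    names = list(alignments)
    nchar = len(alignments[names[0]])
    if any(len(seq) != nchar for seq in alignments.values()):
        raise ValueError("all sequences must have the same length")
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(names)} NCHAR={nchar};",
        f"  FORMAT DATATYPE=DNA MISSING={missing};",
        "  MATRIX",
    ]
    lines += [f"    {name} {alignments[name]}" for name in names]
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


example = {"n0": "ACGTACGTAC", "n1": "ACGTTCGTAC"}

buf = io.StringIO()
write_fasta(example, buf, wrap_width=4)
fasta_text = buf.getvalue()

nexus_text = as_nexus_data(example)
```

Both writers take the same mapping of sample name to sequence, so the export formats differ only in how the finished alignments are serialised.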

codecov · 2021-11-07T14:03:56Z

Codecov Report

Merging #1894 (cc0a9e7) into main (6f3f942) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1894      +/-   ##
==========================================
+ Coverage   93.15%   93.16%   +0.01%     
==========================================
  Files          27       27              
  Lines       25061    25099      +38     
  Branches     1104     1109       +5     
==========================================
+ Hits        23345    23383      +38     
  Misses       1682     1682              
  Partials       34       34

Flag	Coverage Δ
c-tests	`92.21% <ø> (ø)`
lwt-tests	`89.14% <ø> (ø)`
python-c-tests	`68.30% <18.91%> (-0.14%)`	⬇️
python-tests	`98.76% <100.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
python/tskit/cli.py	`95.93% <ø> (ø)`
python/tskit/text_formats.py	`100.00% <100.00%> (ø)`
python/tskit/trees.py	`97.79% <100.00%> (-0.01%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jeromekelleher · 2021-11-08T15:28:07Z

This is ready for review; could you have a look please, @hyanwong @benjeffery?

jeromekelleher · 2021-11-10T16:07:29Z

Bumping this - this is a major new piece of functionality so it would be good to get some eyes on it.

hyanwong

This all looks fine to me, modulo some minor comments (especially about how we detect missing data).

One more general point is that we should perhaps think about how we would deal with indels in the future, so that we don't paint ourselves into a corner now. However, it's hard to imagine how to align sequences with indels, unless the reference allele is the longest one (e.g. ATC) and any mutations are the same length or shorter. There would also be a weird interaction between e.g. a site at position 10 with ancestral_state "ATC" and a site at position 11 with ancestral_state "G". Nevertheless, we will have to deal with indels at some point, so it might be worth thinking about how other libraries do it.
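To make the two concerns above concrete, here is a small pure-Python sketch. It assumes the restricted case described in the comment (the longest allele at a site defines the column width, and shorter alleles are right-padded with a gap character), and it flags adjacent sites whose multi-base ancestral state runs into the next site's position. The helpers `pad_alleles` and `find_overlapping_sites`, and the "-" gap character, are hypothetical and do not reflect anything tskit implements.

```python
def pad_alleles(alleles, gap="-"):
    """Right-pad every allele with `gap` to the length of the longest allele.

    Hypothetical helper: only sensible if the longest allele defines the
    alignment columns for the site.
    """
    width = max(len(allele) for allele in alleles)
    return [allele.ljust(width, gap) for allele in alleles]


def find_overlapping_sites(sites):
    """Return (position, next_position) pairs where a site spills into the next.

    `sites` is a list of (position, ancestral_state) tuples sorted by
    position. Only adjacent sites are compared, which is enough to show the
    problem but is not a complete overlap check.
    """
    overlaps = []
    for (pos_a, state_a), (pos_b, _) in zip(sites, sites[1:]):
        # A state of length k at position p occupies columns p .. p + k - 1.
        if pos_a + len(state_a) > pos_b:
            overlaps.append((pos_a, pos_b))
    return overlaps


padded = pad_alleles(["ATC", "A", "AT"])
overlaps = find_overlapping_sites([(10, "ATC"), (11, "G"), (20, "A")])
```

Here `padded` is `["ATC", "A--", "AT-"]`, and `overlaps` is `[(10, 11)]`: the site at position 10 occupies columns 10 to 12, so it collides with the site at position 11.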

python/tskit/trees.py

hyanwong · 2021-11-10T16:59:02Z

Note that I think this also closes #353 doesn't it? Unless we want to leave an issue open to deal with indels?

jeromekelleher · 2021-11-10T19:15:56Z

> Note that I think this also closes #353 doesn't it? Unless we want to leave an issue open to deal with indels?

Yep - it's in the commit messages

benjeffery

LGTM, modulo @hyanwong's questions.

jeromekelleher · 2021-11-11T14:46:00Z

Thanks for the comments @hyanwong - I've made another pass. We're not going to get this perfect for missing data on this pass, but it'll be good enough to cover 99% of use cases. I'm fairly sure it's currently conservative, so that we may throw an error where not strictly necessary, but we'll never output something that's actually wrong.
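The "conservative" behaviour described here can be sketched roughly as follows: build each sample's sequence from a reference plus per-site alleles, and raise an error as soon as any genotype is missing rather than emit a possibly wrong character. This is a simplified illustration, not the code in this PR. The function `build_alignment` and its arguments are hypothetical, and `MISSING = -1` is chosen to mirror the value of `tskit.MISSING_DATA`.

```python
MISSING = -1  # mirrors tskit.MISSING_DATA; defined locally to stay self-contained


def build_alignment(reference, sites, genotypes):
    """Return one sample's sequence, refusing to guess at missing data.

    Hypothetical helper for illustration only.
    reference: str giving the base at every position.
    sites: list of (position, alleles) with single-character alleles.
    genotypes: one allele index per site for this sample, or MISSING.
    """
    seq = list(reference)
    for (position, alleles), genotype in zip(sites, genotypes):
        if genotype == MISSING:
            # Conservative choice: fail loudly instead of writing output
            # that might be wrong.
            raise ValueError(f"missing data at position {position}")
        seq[position] = alleles[genotype]
    return "".join(seq)


sites = [(1, ("A", "T")), (3, ("A", "G"))]
aligned = build_alignment("AAAA", sites, [1, 0])

try:
    build_alignment("AAAA", sites, [1, MISSING])
    raised = False
except ValueError:
    raised = True
```

With these inputs `aligned` is `"ATAA"`, and the second call raises `ValueError`, so `raised` is `True`.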

jeromekelleher · 2021-11-11T14:49:00Z

Re indels, I agree we'll have to tackle this at some point. The only thing we'll need to worry about here I think in terms of forward compatibility is whether we insist that the reference sequence is equal to the sequence length when storing the reference data (#146). I don't think it affects the code here.

hyanwong · 2021-11-11T15:31:05Z

> Re indels, I agree we'll have to tackle this at some point. The only thing we'll need to worry about here I think in terms of forward compatibility is whether we insist that the reference sequence is equal to the sequence length when storing the reference data (#146). I don't think it affects the code here.

Right. I was wondering how people store multiple sequences in a FASTA file if there are indels? Do they just allow each sequence to be a different length, and to hell with the alignment & reference sequence? Or do they pad out all the deletions with "-", and potentially have more characters in each sequence than in the reference? It might be helpful to talk to someone who knows about this sort of thing, although I agree it doesn't seem like it should affect this PR.

jeromekelleher · 2021-11-11T16:41:30Z

OK, merging this! Now the big question is whether we go ahead and release 0.4.0 like this or if we implement reference sequences first...

jeromekelleher mentioned this pull request Nov 7, 2021

Update fasta code #1889

Closed

jeromekelleher force-pushed the nexus-data branch from b17940d to 63a77f7 on November 7, 2021 16:20

jeromekelleher marked this pull request as ready for review November 8, 2021 15:27

jeromekelleher force-pushed the nexus-data branch from 37cca5e to 737fe42 on November 8, 2021 15:27

jeromekelleher requested review from hyanwong and benjeffery November 8, 2021 15:28

jeromekelleher force-pushed the nexus-data branch 3 times, most recently from 68795fa to 35b6719 on November 9, 2021 13:41

hyanwong reviewed Nov 10, 2021

benjeffery approved these changes Nov 11, 2021

jeromekelleher mentioned this pull request Nov 11, 2021

Implement Tree.has_isolated_samples #1908

Open

jeromekelleher added 4 commits November 11, 2021 14:33

Update fasta implementation

2c2ea27

Closes tskit-dev#1816 Closes tskit-dev#353

Implement nexus data section

bb3c76f

Closes tskit-dev#1840

Change default missing data char to "N"

791ad0c

Closes tskit-dev#1893

Move fasta tests into test_phylo_formats

2762f6a

jeromekelleher force-pushed the nexus-data branch from 35b6719 to f230916 on November 11, 2021 14:43

hyanwong approved these changes Nov 11, 2021

Remove support for missing data from alignments

cc0a9e7

Closes tskit-dev#1897 Closes tskit-dev#1818 Also fix incorrect documentation on genotype_matrix and variants wrt to missing data

jeromekelleher force-pushed the nexus-data branch from f230916 to cc0a9e7 on November 11, 2021 16:40

jeromekelleher added the AUTOMERGE-REQUESTED label ("Ask Mergify to merge this PR") Nov 11, 2021

mergify bot merged commit 1fcb3f6 into tskit-dev:main Nov 11, 2021

mergify bot removed the AUTOMERGE-REQUESTED label ("Ask Mergify to merge this PR") Nov 11, 2021

jeromekelleher deleted the nexus-data branch November 11, 2021 16:59
