Feature request: Utilize Diamond's contig features #10

Open
schorlton opened this issue Mar 21, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@schorlton

Thanks for the awesome software! If I understand the code correctly, Diamond is executed with largely default parameters. I'd suggest adding --range-culling --top 10 -F 15 (source), though this will likely require rewriting other parts of contigtax. With these parameters Diamond performs local alignment and retains the hits scoring within 10% of the top hit in each region of the query contig. contigtax would then have to skip its own bitscore filtering, and should rely on Diamond's e-value parameter instead of filtering on e-value itself. Just wanted to get the discussion started - there are probably a bunch of other design decisions that I'm missing.
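
To make the suggestion concrete, here is a minimal sketch of how such a call could be assembled from Python. The helper name `build_diamond_cmd`, the file names and the e-value are placeholders for illustration and are not taken from contigtax; only --range-culling, --top 10 and -F 15 come from the suggestion above.

```python
def build_diamond_cmd(query, db, out, mode="blastx", evalue=0.001,
                      top=10, frameshift=15, range_culling=True):
    """Assemble a Diamond command line as an argument list (hypothetical helper)."""
    cmd = [
        "diamond", mode,
        "--query", query,
        "--db", db,
        "--out", out,
        "--evalue", str(evalue),   # placeholder value, not a contigtax default
        "--top", str(top),         # keep hits within `top` percent of the best score
    ]
    if range_culling:
        # Range culling is combined with a frameshift penalty (-F),
        # as in the suggested parameters; blastx mode is assumed here.
        cmd += ["--range-culling", "-F", str(frameshift)]
    return cmd


# Placeholder file names, for illustration only.
cmd = build_diamond_cmd("contigs.fasta", "reference.dmnd", "hits.tsv")
print(" ".join(cmd))
```

The list could be handed to `subprocess.run(cmd, check=True)`; how contigtax actually invokes Diamond may differ.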

@johnne
Collaborator

johnne commented Apr 5, 2021

Thanks for using the software and for making the suggestion!

Yes, most of the call to Diamond uses default parameters; the only configurable options so far are --evalue, --top, --mode (blastx/blastp), --blocksize and --chunks. Note that you can use --top in both the search and assign steps of contigtax. In the search step the value is passed directly to Diamond, so the output is already filtered to whatever --top setting is used (10% by default). In the assign step you can supply --top again, and here the value is used to filter the results file once more before taxonomies are assigned (the default here is 5%). My idea was that since the search step is the most time-consuming, you can run it with slightly more relaxed bitscore filtering and then adjust the stringency at the assign step if need be.
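
As a rough illustration of the two-stage filtering, the sketch below applies a relaxed cutoff first and a stricter one afterwards. It assumes that "top N%" means keeping hits whose bitscore is at least (100 - N)% of the best bitscore for the same query; the function name and the toy hits are made up for the example and are not contigtax code.

```python
from collections import defaultdict


def filter_top(hits, top):
    """Keep hits whose bitscore is within `top` percent of the best hit per query.

    `hits` is a list of (query, subject, bitscore) tuples.
    """
    best = defaultdict(float)
    for query, _subject, bitscore in hits:
        best[query] = max(best[query], bitscore)
    return [h for h in hits if h[2] >= best[h[0]] * (1 - top / 100)]


# Toy data, for illustration only.
hits = [
    ("contig1", "protA", 200.0),
    ("contig1", "protB", 185.0),
    ("contig1", "protC", 150.0),
    ("contig2", "protD", 90.0),
]

search_hits = filter_top(hits, top=10)        # relaxed, as in the search step
assign_hits = filter_top(search_hits, top=5)  # stricter, as in the assign step
```

Because the second filter only ever removes hits, the assign step can be rerun with different --top values without repeating the search, as long as the assign value is not more relaxed than the one used for the search.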

I vaguely remember looking at the --range-culling feature of Diamond a long time ago, but never got around to testing it much. From what I understand, the idea is to allow several hits with lower scores than the best-scoring hit to be reported from the same contig. This may affect the output from contigtax, especially for long contigs, depending on how the assignment step is run.
I've noticed that contigtax seems to perform best on contigs <10 kbp in length. For very long contigs (close to complete bacterial chromosomes), hits to ribosomal sequences are likely to limit the resolution of the assignments, because these genes have high bitscores and are well conserved between lineages, which pushes the LCA up in the taxonomic hierarchy.
Using range-culling would probably at least make sure more hits are reported for long contigs. However, with the default rank_lca mode of assigning taxonomy I suspect the final output will be the same, because all reported hits are used to assign the LCA (including those with high bitscores). With the rank_vote assign mode the output may change, however, since here contigtax makes a decision (takes a 'vote') from the list of hit taxa, choosing the one that makes up at least vote_threshold (default = 0.5) of the taxa at the considered rank. For a contig queried with range-culling, the extra reported hits could then help push some taxa above the vote_threshold, leading them to be assigned to the contig. That may take some additional coding, however, perhaps to make contigtax take a vote on a per-region basis.
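
To illustrate the difference between the two modes, here is a simplified sketch. It is not the contigtax implementation: the function bodies, the three-rank lineage and the toy hits are assumptions made for the example, and only the mode names and the vote_threshold default of 0.5 come from the description above.

```python
from collections import Counter

RANKS = ["superkingdom", "phylum", "genus"]  # shortened rank list for the example


def rank_lca(lineages, ranks=RANKS):
    """Assign each rank only while all hits agree (lowest common ancestor)."""
    assigned = {}
    for rank in ranks:
        taxa = {lin[rank] for lin in lineages}
        if len(taxa) != 1:
            break
        assigned[rank] = taxa.pop()
    return assigned


def rank_vote(lineages, vote_threshold=0.5, ranks=RANKS):
    """Assign the most common taxon at each rank if it reaches the threshold share."""
    assigned = {}
    if not lineages:
        return assigned
    total = len(lineages)
    candidates = lineages
    for rank in ranks:
        taxon, count = Counter(lin[rank] for lin in candidates).most_common(1)[0]
        if count / total < vote_threshold:
            break
        assigned[rank] = taxon
        # Only hits consistent with the winner can vote at the next rank.
        candidates = [lin for lin in candidates if lin[rank] == taxon]
    return assigned


# Toy hits: three agree at genus level, one conserved hit points elsewhere.
escherichia = {"superkingdom": "Bacteria", "phylum": "Proteobacteria", "genus": "Escherichia"}
bacillus = {"superkingdom": "Bacteria", "phylum": "Firmicutes", "genus": "Bacillus"}
hit_lineages = [escherichia, escherichia, escherichia, bacillus]

lca_result = rank_lca(hit_lineages)
vote_result = rank_vote(hit_lineages)
```

In this toy case the single divergent hit stops rank_lca at superkingdom, while rank_vote still reaches genus because three of the four hits agree. Extra hits from range-culling would shift these vote shares, which is where the two modes could start to diverge.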

Again, thanks for making the suggestion. It's interesting to think about and discuss these things. Adding the range-culling feature as an option to contigtax should not be a problem, as it doesn't appear to affect the output format and thus shouldn't cause problems with downstream assignments. I noticed that the feature requires Diamond to be run with a frameshift penalty, which is mostly recommended for error-prone long-read sequence output and not the assembled short-read sequences I had in mind when designing contigtax, but it may nevertheless have its benefits.

I've found that the lca classify functionality of sourmash is a good complement to contigtax because it performs very well on the long contigs where contigtax struggles.

@johnne johnne added the enhancement New feature or request label Apr 6, 2021