mpileup takes 5h per sample with nanopore data #1

Open
duceppemo opened this issue May 20, 2022 · 1 comment

Comments

@duceppemo
Contributor

Hi Tod,

I've been trying the nanopore support of vSNP3 and I think it still needs to be optimized.

First, when installing vSNP3 with conda, it lacks 2 dependencies: vcftools and bcftools. To get a fully working pipeline (I only tested step 1 so far), I had to run:

conda create -y -n vsnp3 -c bioconda vsnp3=3.06
conda install -c bioconda vcftools bcftools
# I had a problem with some vcftools library that could be solved by creating a symbolic link
ln -s /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.1 /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.0.0
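
A quick way to confirm that the external tools are visible inside the activated environment is to look them up on `PATH`. The sketch below is only illustrative: the helper name `missing_tools` is made up, and the tool list is taken from the programs named in this report rather than from vSNP3's actual requirements. It checks presence only, so it would not catch a shared-library problem such as the libcrypto one above.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Illustrative list only (an assumption, not vSNP3's official requirements).
    required = ["vcftools", "bcftools", "samtools", "minimap2"]
    missing = missing_tools(required)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```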

A few warnings are printed in the terminal while step 1 runs. The command I used:

$ vsnp3_step1.py -n -r1 '/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz' -f /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta -b /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk

The terminal output:

vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)



Best Reference Finding with Sourmash 
2022-05-19 14:51:17

== This is sourmash version 4.4.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: /home/bioinfo/analyses/mbovis_... (k=31, DNA)
loaded 1 databases.

WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
11 matches; showing first 3:
similarity   match
----------   -----
  6.3%       NC_002945.4 Mycobacterium bovis AF2122/97 genome assembly...
  6.2%       NZ_CP041790.1 Mycobacterium tuberculosis strain SEA170200...
  6.2%       CP016401.1 Mycobacterium caprae strain Allgaeu genome

Sample: MBWGS036
Top Sourmash Finding: NC_002945.4 
Reference Set: Mycobacterium_AF2122 
Top reference that is automatically available: /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta

#############


Spoligotype 
2022-05-19 14:51:23

Align and make VCF file 
2022-05-19 14:52:36
[M::mm_idx_gen::0.136*1.01] collected minimizers
[M::mm_idx_gen::0.160*1.95] sorted minimizers
[M::main::0.160*1.95] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.184*1.83] mid_occ = 11
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.194*1.79] distinct minimizers: 770441 (96.15% are singletons); average occurrences: 1.053; average spacing: 5.362; total length: 4349904
[M::worker_pipeline::3.132*5.40] mapped 9607 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x map-ont -R @RG\tID:MBWGS036\tSM:MBWGS036\tPL:ILLUMINA\tPI:250 -t 8 -o MBWGS036.sam /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/MBWGS036.fastq.gz
[M::main] Real time: 3.149 sec; CPU: 16.941 sec; Peak RSS: 0.794 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[markdup] warning: unable to calculate estimated library size. Read pairs 0 should be greater than duplicate pairs 0, which should both be non zero.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf temp1.vcf
	--recode-INFO-all
	--out temp2
	--recode
	--remove-indels

Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
After filtering, kept 1 out of 1 Individuals
Outputting VCF file...
After filtering, kept 516 out of a possible 611 Sites
Run Time = 0.00 seconds

Zero Coverage 
2022-05-19 17:12:07
	Positions with no coverage: 12,953, 0.297777% of reference

MBWGS036 Poor FASTQ Usability
MBWGS036 Acceptable Reference Usability

As you can see, the top reference has a very low similarity value (6.3%). It still picks the right one, but this part of the pipeline is not optimized for Nanopore. Also, why is it still looking for the best reference if we already told it which one to use?
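
One possible contributor to the low similarity (an untested assumption, not something confirmed for this dataset) is that a single sequencing error anywhere in a k-mer makes it differ from the reference, and sourmash selected k=31 here. Assuming independent per-base errors, the fraction of error-free k-mers is (1 - e)^k. The sketch below just evaluates that formula; the function names are made up for illustration.

```python
def phred_to_error(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def kmer_intact_fraction(error_rate, k=31):
    """Probability that a k-mer has no errors, assuming independent errors."""
    return (1 - error_rate) ** k


if __name__ == "__main__":
    # Print the expected error-free fraction of 31-mers at a few quality levels.
    for q in (10, 15, 20, 30):
        e = phred_to_error(q)
        print(f"Q{q}: error rate {e:.4f}, intact 31-mers {kmer_intact_fraction(e):.1%}")
```

Other factors, such as how the read sketch is compared against a genome sketch, also affect the reported percentage, so this is a rough intuition rather than an explanation of the exact value.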

The log file (this one is from another sample, MBWGS009) looks like this:


vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='MBWGS009.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)

Call Summary:
SYSTEM CALL: minimap2 -a -x map-ont -R "@RG\tID:MBWGS009\tSM:MBWGS009\tPL:ILLUMINA\tPI:250" -t 8 /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/MBWGS009.fastq.gz -o MBWGS009.sam -- 2022-05-19_17:45:11
SYSTEM CALL: samtools fixmate -O bam,level=1 -m MBWGS009.sam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools sort -l 1 -@8 -o MBWGS009_pos_srt.bam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools markdup -f markduplicate_stats.txt -r -O bam,level=1 MBWGS009_pos_srt.bam MBWGS009_nodup.bam -- 2022-05-19_17:45:24
NOTE: Read stats gathered by markduplicate_stats.txt -- 2022-05-19_17:45:24
NOTE: Nanopore - bcftools mpileup used to call SNPs and make VCF files *** -- 2022-05-20_00:17:23
SYSTEM CALL: bcftools mpileup --threads 16 -Ou -f /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta MBWGS009_nodup.bam | bcftools call --threads 16 -mv -v -Ov -o MBWGS009_unfiltered_hapall.vcf -- 2022-05-20_00:17:23
SYSTEM CALL: vcffilter -f "QUAL > 20" MBWGS009_unfiltered_hapall.vcf > temp1.vcf -- 2022-05-20_00:17:23
NOTE: Nanopore QUAL values increased by 100 to obtain closer values seen with Illumina reads, and allowing VCF files from both platforms to be ran together. -- 2022-05-20_00:17:23
NOTE: Skipped unmapped read assembly -- 2022-05-20_00:17:23
IMPORT: VCF_Annotation(gbk_list=self.gbk, vcf_file=filtered_hapall) -- 2022-05-20_00:17:25
IMPORT: Zero_Coverage(FASTA=reference, bam=nodup_bamfile, vcf=filtered_hapall,) -- 2022-05-20_00:17:41
NOTE: Files moved to temp_dir and removed: *_unmapped*.fastq.gz, *_all.bam, *_fixmate.bam, *_pos_srt.bam, markduplicate_stats.txt, *.bai, *_filtered_hapall.vcf, *_mapfix_hapall.vcf, *_unfiltered_hapall.vcf, *_filtered_hapall_nanopore.vcf, *.sam, *.amb, *.ann, *.bwt, *.pac, *.fasta.sa, *_sorted.bam, *.dict, chrom_ranges.txt, *.fai, dup_metrics.csv -- 2022-05-20_00:17:41

Versions:
vSNP3: 3.06
Bio, 1.79
numpy, 1.22.3
pandas, 1.4.2
Minimap2: 2.24-r1122
Freebayes: v1.3.6
samtools 1.15
Using htslib 1.14
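
The timestamps at the end of each Call Summary entry show where the time goes: the `samtools markdup` entry is stamped 17:45:24, and the next entry is stamped 00:17:23 the following day, a gap of roughly six and a half hours. A minimal Python sketch for extracting such gaps from a log follows. It assumes every timed entry ends in `-- YYYY-MM-DD_HH:MM:SS` as in the log above; the function name `step_gaps` is made up for illustration, and the sample lines are abbreviated from the log.

```python
import re
from datetime import datetime

# Matches the trailing "-- 2022-05-19_17:45:24" stamp used in the log above.
STAMP = re.compile(r"-- (\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})\s*$")


def step_gaps(log_lines):
    """Return (seconds_since_previous_stamp, line) for each stamped line after the first."""
    previous = None
    gaps = []
    for line in log_lines:
        match = STAMP.search(line)
        if not match:
            continue  # skip lines without a timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d_%H:%M:%S")
        if previous is not None:
            gaps.append(((stamp - previous).total_seconds(), line.strip()))
        previous = stamp
    return gaps


if __name__ == "__main__":
    sample = [
        "SYSTEM CALL: samtools markdup ... -- 2022-05-19_17:45:24",
        "NOTE: Nanopore - bcftools mpileup used to call SNPs ... -- 2022-05-20_00:17:23",
    ]
    for seconds, line in step_gaps(sample):
        print(f"{seconds / 3600:.2f} h elapsed before: {line[:50]}")
```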

The main issue right now is that the mpileup step (using bcftools) takes about 5h per sample. I just can't rerun all my samples with vSNP3 if it takes that long!
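
As far as the bcftools manual describes it, `--threads` in `bcftools mpileup` only applies to compressing the output, so `--threads 16` may not speed up the pileup itself. A common workaround is to split the reference into regions with `-r`, run the regions concurrently, and merge the per-region files with `bcftools concat`. The sketch below only builds the region strings and command lines and does not run anything. Whether this helps with the slowdown reported here is untested. The contig name `NC_002945.4`, the chunk size, and the output file names are assumptions for illustration (check the FASTA header or `samtools idxstats` for the real contig name), and `-r` needs an indexed BAM.

```python
def split_regions(contig, length, chunk_size):
    """Return 1-based inclusive region strings covering contig:1-length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    regions = []
    start = 1
    while start <= length:
        end = min(start + chunk_size - 1, length)
        regions.append(f"{contig}:{start}-{end}")
        start = end + 1
    return regions


def mpileup_command(reference, bam, region, out_vcf):
    """Build the shell pipeline for one region as a string (not executed here)."""
    return (
        f"bcftools mpileup -Ou -r {region} -f {reference} {bam} "
        f"| bcftools call -mv -Ov -o {out_vcf}"
    )


if __name__ == "__main__":
    # Reference length 4,349,904 is taken from the stats file; the contig
    # name and the 500 kb chunk size are hypothetical.
    regions = split_regions("NC_002945.4", 4_349_904, 500_000)
    for i, region in enumerate(regions):
        print(mpileup_command("NC_002945v4.fasta", "MBWGS009_nodup.bam",
                              region, f"chunk_{i:02d}.vcf"))
```

Variants that sit on a chunk boundary may need extra care when the per-region files are merged.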

Here's the content of the Excel stats file:

Field	Value
sample	MBWGS009
date	2022-05-19_17-40-28
FASTA/s	NC_002945v4.fasta
Sourmash Sequence Similarity	3.9%:3b48a55512e8dedc2b8d6e33699893bd
Found_Reference_Set	Mycobacterium_AF2122
FASTQ_R1	MBWGS009.fastq.gz
R1 File Size	248.4 MB
R1 Read Count	74,874
R1 Length Sum	262,465,587
R1 Min Length	1
R1 Ave Length	3,505.4
R1 Max Length	36,224
R1 Passing Q20	65.27%
R1 Passing Q30	36.07%
R1 Read Quality Ave	13.8717
Spoligotype Spacer Counts	20:23:0:27:0:24:26:24:0:28:0:16:26:26:26:0:28:32:0:23:27:28:36:32:36:43:35:35:31:0:0:0:0:0:32:38:36:35:0:0:0:0:0
Spoligotype Binary Code	binary-1101011101011110110111111111100000111100000
Spoligotype Octal Code	octal-656573377603600
Spoligotype SB Number	SB1071
Groups	group file not provided
Aligner	Minimap2
Mapped Paired Reads	0
Mapped Single Reads	74,847
Unmapped Reads	2,725
Unmapped Percent	3.5%
Unmapped Assembled Contigs	skipped assembly
Duplicate Paired Reads	0
Duplicate Single Reads	442
Duplicate Percent of Mapped Reads	0.6%
BAM/Reference File	MBWGS009_nodup.bam made with NC_002945v4
Reference Length	4,349,904
Genome with Coverage	99.81%
Average Depth	59.1X
No Coverage Bases	8,295
Percent Ref with Zero Coverage	0.190694%
Quality SNPs	596

Are there any plans to improve Nanopore support? I haven't tested vSNP3 on paired-end data yet, so I don't know whether the speed problem is specific to Nanopore. Let me know if you need more info.

Thanks!
Marco

@stuber
Contributor

stuber commented May 23, 2022

Thanks for checking out vsnp3 and reporting the issues you've seen.

I've had inconsistent results with vcftools and bcftools. I typically see bcftools installed via the freebayes requirement, so I have left it out of the explicit requirement list. The same goes for vcftools, which comes in via the vcflib requirement. I've fought with conda installing bcftools as a Python 2 tool when asking for Python 3 after specifying the install explicitly. I've had the best results leaving them out of the explicit requirements and letting them be installed as requirements of freebayes and vcflib. The same applies to libcrypto (and other libraries).

Other than having comments like this here to help other users, I am convinced that, because everyone's environment is slightly different, conda may require troubleshooting either to "fix" a user's environment or to fix something being overlooked by conda. That being said, I should look at replacing these tools, since they're often problematic. I did this for pysam/samtools: these tools would often (but not always) cause conflicting libraries, so pysam was removed from vsnp3. I will be working soon to provide vsnp3 as a container. Hopefully this will ease installation, or at least provide another option.

Nanopore support is beta at best, especially since the technology is steadily changing. Can you share the FASTQ file you're using? If so, I would like to troubleshoot.

Sourmash runs quickly, and I like seeing the "best reference" even when one is specified. I should change my wording so there isn't confusion. It should still be using the reference you specified. I'm going to update the wording.

I would like to improve Nanopore support. This has been a first test at seeing how it may work, but the datasets tried so far have been few. This input is good to get.

@stuber stuber closed this as completed Jul 27, 2022
@stuber stuber reopened this Jul 27, 2022