I've been trying the nanopore support of vSNP3 and I think it still needs to be optimized.
First, when installing vSNP3 with conda, it lacks 2 dependencies: vcftools and bcftools. To get a fully working pipeline (I only tested step 1 so far), I had to run:
conda create -y -n vsnp3 -c bioconda vsnp3=3.06
conda install -c bioconda vcftools bcftools
# I had a problem with some vcftools library that could be solved by creating a symbolic link
ln -s /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.1 /home/bioinfo/miniconda3/envs/vsnp3/lib/libcrypto.so.1.0.0
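A quick way to see which command-line tools are missing from the activated environment before running step 1 is to look them up on the PATH. This is only a minimal sketch: the tool list is an assumption, collected from the executables that show up in the terminal output and log in this report, and is not taken from vSNP3's own requirements.

```python
import shutil

# Executables that appear in the step 1 output and log of this report.
# ASSUMPTION: this list is illustrative; it is not vSNP3's official requirements list.
REQUIRED_TOOLS = [
    "minimap2",
    "samtools",
    "bcftools",
    "vcftools",
    "vcffilter",
    "freebayes",
    "sourmash",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Run it inside the activated vsnp3 environment; anything it lists would need to be installed separately, as I did for vcftools and bcftools.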
A few warnings are printed in the terminal while step 1 runs. The terminal output of the command I used:
vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='/home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS036.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)
Best Reference Finding with Sourmash
2022-05-19 14:51:17
== This is sourmash version 4.4.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
select query k=31 automatically.
loaded query: /home/bioinfo/analyses/mbovis_... (k=31, DNA)
loaded 1 databases.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
WARNING: Cannot estimate ANI because size estimation for at least one of these sketches may be inaccurate.
11 matches; showing first 3:
similarity match
---------- -----
6.3% NC_002945.4 Mycobacterium bovis AF2122/97 genome assembly...
6.2% NZ_CP041790.1 Mycobacterium tuberculosis strain SEA170200...
6.2% CP016401.1 Mycobacterium caprae strain Allgaeu genome
Sample: MBWGS036
Top Sourmash Finding: NC_002945.4
Reference Set: Mycobacterium_AF2122
Top reference that is automatically available: /home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta
#############
Spoligotype
2022-05-19 14:51:23
Align and make VCF file
2022-05-19 14:52:36
[M::mm_idx_gen::0.136*1.01] collected minimizers
[M::mm_idx_gen::0.160*1.95] sorted minimizers
[M::main::0.160*1.95] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.184*1.83] mid_occ = 11
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.194*1.79] distinct minimizers: 770441 (96.15% are singletons); average occurrences: 1.053; average spacing: 5.362; total length: 4349904
[M::worker_pipeline::3.132*5.40] mapped 9607 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x map-ont -R @RG\tID:MBWGS036\tSM:MBWGS036\tPL:ILLUMINA\tPI:250 -t 8 -o MBWGS036.sam /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/step1/MBWGS036.fastq.gz
[M::main] Real time: 3.149 sec; CPU: 16.941 sec; Peak RSS: 0.794 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[markdup] warning: unable to calculate estimated library size. Read pairs 0 should be greater than duplicate pairs 0, which should both be non zero.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf temp1.vcf
--recode-INFO-all
--out temp2
--recode
--remove-indels
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
Warning: Expected at least 2 parts in INFO entry: ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
After filtering, kept 1 out of 1 Individuals
Outputting VCF file...
After filtering, kept 516 out of a possible 611 Sites
Run Time = 0.00 seconds
Zero Coverage
2022-05-19 17:12:07
Positions with no coverage: 12,953, 0.297777% of reference
MBWGS036 Poor FASTQ Usability
MBWGS036 Acceptable Reference Usability
As you can see, the top reference has a very low similarity value (6.3%). It still picks the right one, but this part of the pipeline is not optimized for Nanopore. Also, why is it still looking for the best reference if we already told it which one to use?
The log file looks like this:
vsnp3_step1.py SET ARGUMENTS:
Namespace(FASTQ_R1='MBWGS009.fastq.gz', FASTQ_R2=None, FASTA=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.fasta'], gbk=['/home/bioinfo/vsnp3_test_dataset/vsnp_dependencies/Mycobacterium_AF2122/NC_002945v4.gbk'], reference_type=None, nanopore=True, assemble_unmap=False, debug=False)
Call Summary:
SYSTEM CALL: minimap2 -a -x map-ont -R "@RG\tID:MBWGS009\tSM:MBWGS009\tPL:ILLUMINA\tPI:250" -t 8 /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/MBWGS009.fastq.gz -o MBWGS009.sam -- 2022-05-19_17:45:11
SYSTEM CALL: samtools fixmate -O bam,level=1 -m MBWGS009.sam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools sort -l 1 -@8 -o MBWGS009_pos_srt.bam MBWGS009_fixmate.bam -- 2022-05-19_17:45:24
SYSTEM CALL: samtools markdup -f markduplicate_stats.txt -r -O bam,level=1 MBWGS009_pos_srt.bam MBWGS009_nodup.bam -- 2022-05-19_17:45:24
NOTE: Read stats gathered by markduplicate_stats.txt -- 2022-05-19_17:45:24
NOTE: Nanopore - bcftools mpileup used to call SNPs and make VCF files *** -- 2022-05-20_00:17:23
SYSTEM CALL: bcftools mpileup --threads 16 -Ou -f /home/bioinfo/analyses/mbovis_nanopore_vsnp3/fastq/MBWGS009/NC_002945v4.fasta MBWGS009_nodup.bam | bcftools call --threads 16 -mv -v -Ov -o MBWGS009_unfiltered_hapall.vcf -- 2022-05-20_00:17:23
SYSTEM CALL: vcffilter -f "QUAL > 20" MBWGS009_unfiltered_hapall.vcf > temp1.vcf -- 2022-05-20_00:17:23
NOTE: Nanopore QUAL values increased by 100 to obtain closer values seen with Illumina reads, and allowing VCF files from both platforms to be ran together. -- 2022-05-20_00:17:23
NOTE: Skipped unmapped read assembly -- 2022-05-20_00:17:23
IMPORT: VCF_Annotation(gbk_list=self.gbk, vcf_file=filtered_hapall) -- 2022-05-20_00:17:25
IMPORT: Zero_Coverage(FASTA=reference, bam=nodup_bamfile, vcf=filtered_hapall,) -- 2022-05-20_00:17:41
NOTE: Files moved to temp_dir and removed: *_unmapped*.fastq.gz, *_all.bam, *_fixmate.bam, *_pos_srt.bam, markduplicate_stats.txt, *.bai, *_filtered_hapall.vcf, *_mapfix_hapall.vcf, *_unfiltered_hapall.vcf, *_filtered_hapall_nanopore.vcf, *.sam, *.amb, *.ann, *.bwt, *.pac, *.fasta.sa, *_sorted.bam, *.dict, chrom_ranges.txt, *.fai, dup_metrics.csv -- 2022-05-20_00:17:41
Versions:
vSNP3: 3.06
Bio, 1.79
numpy, 1.22.3
pandas, 1.4.2
Minimap2: 2.24-r1122
Freebayes: v1.3.6
samtools 1.15
Using htslib 1.14
The main issue right now is that the mpileup step (using bcftools) takes about 5h per sample. I just can't rerun all my samples with vSNP3 if it takes that long!
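To show where the time goes, the gaps between consecutive timestamps in the log can be computed. The sketch below assumes every timed log line ends with `-- YYYY-MM-DD_HH:MM:SS`, as in the log above, and that the stamp is written when the message is logged (an assumption; I have not checked how vSNP3 writes them). The function name `step_gaps` is only for illustration.

```python
import re
from datetime import datetime

# Matches the trailing "-- 2022-05-19_17:45:24" stamp seen in the vSNP3 log above.
STAMP = re.compile(r"-- (\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})\s*$")


def step_gaps(log_lines):
    """Return (seconds since the previous stamped line, line) for each stamped line after the first."""
    gaps = []
    previous = None
    for line in log_lines:
        match = STAMP.search(line)
        if not match:
            continue  # lines without a stamp (headers, versions) are skipped
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d_%H:%M:%S")
        if previous is not None:
            gaps.append(((stamp - previous).total_seconds(), line.strip()))
        previous = stamp
    return gaps


# Three lines copied from the MBWGS009 log in this report.
example_log = [
    "NOTE: Read stats gathered by markduplicate_stats.txt -- 2022-05-19_17:45:24",
    "NOTE: Nanopore - bcftools mpileup used to call SNPs and make VCF files *** -- 2022-05-20_00:17:23",
    "NOTE: Skipped unmapped read assembly -- 2022-05-20_00:17:23",
]

if __name__ == "__main__":
    seconds, line = max(step_gaps(example_log))
    print(f"Largest gap: {seconds / 3600:.2f} h before: {line}")
```

For the MBWGS009 log, the gap between the markdup stats note and the mpileup note is 23,519 seconds, roughly 6.5 hours.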
Here's the content of the Excel stats file:
sample: MBWGS009
date: 2022-05-19_17-40-28
FASTA/s: NC_002945v4.fasta
Sourmash Sequence Similarity: 3.9%:3b48a55512e8dedc2b8d6e33699893bd
Found_Reference_Set: Mycobacterium_AF2122
FASTQ_R1: MBWGS009.fastq.gz
R1 File Size: 248.4 MB
R1 Read Count: 74,874
R1 Length Sum: 262,465,587
R1 Min Length: 1
R1 Ave Length: 3,505.4
R1 Max Length: 36,224
R1 Passing Q20: 65.27%
R1 Passing Q30: 36.07%
R1 Read Quality Ave: 13.8717
Spoligotype Spacer Counts: 20:23:0:27:0:24:26:24:0:28:0:16:26:26:26:0:28:32:0:23:27:28:36:32:36:43:35:35:31:0:0:0:0:0:32:38:36:35:0:0:0:0:0
Spoligotype Binary Code: binary-1101011101011110110111111111100000111100000
Spoligotype Octal Code: octal-656573377603600
Spoligotype SB Number: SB1071
Groups: group file not provided
Aligner: Minimap2
Mapped Paired Reads: 0
Mapped Single Reads: 74,847
Unmapped Reads: 2,725
Unmapped Percent: 3.5%
Unmapped Assembled Contigs: skipped assembly
Duplicate Paired Reads: 0
Duplicate Single Reads: 442
Duplicate Percent of Mapped Reads: 0.6%
BAM/Reference File: MBWGS009_nodup.bam made with NC_002945v4
Reference Length: 4,349,904
Genome with Coverage: 99.81%
Average Depth: 59.1X
No Coverage Bases: 8,295
Percent Ref with Zero Coverage: 0.190694%
Quality SNPs: 596
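As a sanity check on the spoligotype fields in the stats, the binary and octal codes can be recomputed from the spacer counts. This is a sketch under assumptions: the `min_count` threshold for calling a spacer present is hypothetical (the rule vSNP3 actually applies is not stated in this thread), and the octal conversion follows the usual spoligotype convention of 14 triplets plus the final spacer as a single digit.

```python
def spacers_to_binary(spacer_counts, min_count=1):
    """Turn a colon-separated spacer count string into a presence/absence string.

    ASSUMPTION: a spacer is called present when its count is >= min_count;
    this threshold is illustrative, not vSNP3's documented rule.
    """
    counts = [int(value) for value in spacer_counts.split(":")]
    return "".join("1" if count >= min_count else "0" for count in counts)


def binary_to_octal(binary):
    """Convert a 43-spacer binary code to the 15-digit octal spoligotype code."""
    if len(binary) != 43 or set(binary) - {"0", "1"}:
        raise ValueError("expected a string of 43 characters, each 0 or 1")
    # The first 42 spacers are read as 14 triplets; the 43rd is kept as its own digit.
    triplets = [binary[i:i + 3] for i in range(0, 42, 3)]
    return "".join(str(int(triplet, 2)) for triplet in triplets) + binary[42]


# Spacer counts copied from the MBWGS009 stats in this report.
counts = ("20:23:0:27:0:24:26:24:0:28:0:16:26:26:26:0:28:32:0:23:27:28:"
          "36:32:36:43:35:35:31:0:0:0:0:0:32:38:36:35:0:0:0:0:0")

if __name__ == "__main__":
    binary = spacers_to_binary(counts)
    print("binary-" + binary)
    print("octal-" + binary_to_octal(binary))
```

With the MBWGS009 counts, this reproduces the binary and octal codes reported in the stats file.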
So, any plans on improving support for Nanopore? I actually haven't tested vSNP3 on paired-end data yet, so I don't know whether the speed problem is specific to Nanopore. Let me know if you need more info.
Thanks!
Marco
Thanks for checking out vsnp3 and reporting the issues you've seen.
I've had inconsistent results with vcftools and bcftools. I typically see bcftools installed via the freebayes requirement, so I have left it out of the explicit requirements list. The same goes for vcftools, which comes in via vcflib. When I specified bcftools explicitly, I fought with conda installing it as a Python 2 tool even when asking for Python 3. I've had the best results leaving both out of the explicit requirements and letting them be installed as requirements of freebayes and vcflib. The same applies to libcrypto (and other libraries).
Beyond having comments like this here to help other users, I am convinced that, because everyone's environment is slightly different, conda may require troubleshooting, either to fix a user's environment or to fix something conda is overlooking. That being said, I should look at replacing these tools since they're often problematic. I did this for pysam/samtools: these tools would often (but not always) cause conflicting libraries, so pysam was removed from vsnp3. I will soon be working on providing vsnp3 as a container. Hopefully this will ease installation, or at least provide another option.
Nanopore support is beta at best, especially since the technology is steadily changing. Can you share the FASTQ file you're using? If so, I would like to troubleshoot.
Sourmash runs quickly, and I like seeing the "best reference" even when one is specified. It should still be using the reference you specified. I'm going to update the wording so there isn't confusion.
I would like to improve Nanopore support. This has been a first test to see how it may work, but only a few datasets have been tried so far. This input is good to get.