Problem about different results with different threads #121

Closed
liugui opened this issue Apr 24, 2017 · 1 comment

Comments

liugui commented Apr 24, 2017

I am using BWA (version 0.7.15-r1140) with the BWA-MEM algorithm. Here is my command:

bwa mem -t 2 reference_file fastq1.fq.gz fastq2.fq.gz > result1.sam

This works well. Then I used this command to increase the number of threads:

bwa mem -t 10 reference_file fastq1.fq.gz fastq2.fq.gz > result2.sam

This also works well, but when I compare result1.sam with result2.sam, they are different! I also tested with -t 6 and -t 16, and all the results differ. However, when I run with the same number of threads twice, the results are identical. So BWA-MEM appears to produce different results with different thread counts.

Then I read the source code and found this:

kt_for(opt->n_threads, worker1, &w, (opt->flag&MEM_F_PE)? n>>1 : n); // find mapping positions
for (i = 0; i < opt->n_threads; ++i) smem_aux_destroy(w.aux[i]);
free(w.aux);
if (opt->flag&MEM_F_PE) { // infer insert sizes if not provided
    if (pes0) memcpy(pes, pes0, 4 * sizeof(mem_pestat_t)); // if pes0 != NULL, set the insert-size distribution as pes0
    else mem_pestat(opt, bns->l_pac, n, w.regs, pes); // otherwise, infer the insert size distribution from data
}
kt_for(opt->n_threads, worker2, &w, (opt->flag&MEM_F_PE)? n>>1 : n);

That is, BWA-MEM uses n_threads threads (for example, -t 6 gives n_threads = 6) to find mapping positions, but calls mem_pestat only once, on a single thread, over everything those threads processed. mem_pestat calculates avg (the average insert size) and std (the standard deviation of the insert size), which are important for finding pair information. According to BWA, every thread will process around 10000000bp, so:

If I use -t 2, BWA will calculate avg and std with 2 x 10000000bp
If I use -t 10, BWA will calculate avg and std with 10 x 10000000bp
If I use -t 16, BWA will calculate avg and std with 16 x 10000000bp
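
The reasoning above can be illustrated with a toy sketch. It is not BWA's code: `per_batch_stats` and the insert sizes are hypothetical, and the real mem_pestat is more involved than a plain mean and standard deviation. The only point is that statistics estimated per batch depend on where the batch boundaries fall.

```python
from statistics import mean, pstdev


def per_batch_stats(insert_sizes, batch_size):
    """Split insert sizes into consecutive batches; return (mean, std) per batch."""
    stats = []
    for start in range(0, len(insert_sizes), batch_size):
        batch = insert_sizes[start:start + batch_size]
        stats.append((mean(batch), pstdev(batch)))
    return stats


# Hypothetical insert sizes (bp) for 8 read pairs, in input order.
insert_sizes = [300, 310, 290, 305, 400, 410, 390, 405]

# A small batch stands in for a run with few threads,
# a large batch for a run with many threads.
small_batches = per_batch_stats(insert_sizes, batch_size=4)
large_batches = per_batch_stats(insert_sizes, batch_size=8)

print("batch_size=4:", small_batches)
print("batch_size=8:", large_batches)
```

With the smaller batch, the first four pairs are judged against statistics from their own batch only; with the larger batch, the same pairs are judged against statistics from all eight. Any pairing decision that depends on avg and std can therefore change.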

This would explain why the results differ with different thread counts.

Is there anything wrong with my reasoning? If it is correct, how can I evaluate the difference? Which fields of the SAM record (such as RNAME or POS) will change? If it is wrong, what is the real reason for the difference?
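
One way to see which SAM fields change is to compare the two outputs record by record. The sketch below is a minimal illustration, not part of BWA: the function names and the two example records are hypothetical. It assumes both files list the same records in the same order, which may not hold if the number of supplementary alignments differs between runs. Header lines (starting with `@`) are skipped, since they may record the command line and so differ anyway.

```python
from collections import Counter

# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def differing_fields(line_a, line_b):
    """Return the names of the SAM columns that differ between two records."""
    a = line_a.rstrip("\n").split("\t")
    b = line_b.rstrip("\n").split("\t")
    diffs = [name for i, name in enumerate(SAM_FIELDS) if a[i] != b[i]]
    if a[11:] != b[11:]:  # optional tags such as NM:i or AS:i
        diffs.append("TAGS")
    return diffs


def count_field_differences(lines_a, lines_b):
    """Count, per column, how many aligned record pairs differ."""
    records_a = [line for line in lines_a if not line.startswith("@")]
    records_b = [line for line in lines_b if not line.startswith("@")]
    counts = Counter()
    for rec_a, rec_b in zip(records_a, records_b):
        counts.update(differing_fields(rec_a, rec_b))
    return counts


# Hypothetical records for the same read from two runs.
rec_a = "read1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*\tNM:i:0"
rec_b = "read1\t97\tchr1\t100\t60\t50M\t=\t900\t850\t*\t*\tNM:i:0"

print(differing_fields(rec_a, rec_b))
```

For real files, pass the open file objects for result1.sam and result2.sam to `count_field_differences`; the resulting counter shows which columns are affected most often.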

Any reply will be much appreciated!

Owner

lh3 commented Apr 25, 2017

You haven't done anything wrong. The bwa-mem result does change with the number of threads. Use a large -K like -K 10000000 if you prefer stable results regardless of the number of threads in use.
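
Why a fixed -K gives stable results can be sketched with a small model. This is an illustration, not BWA's code: `batch_bases` is a hypothetical helper, and it assumes, as described in the question, that the default batch is about 10000000 bases per thread, while -K sets one batch size for any thread count.

```python
DEFAULT_CHUNK_BASES = 10_000_000  # per-thread figure quoted in the question


def batch_bases(n_threads, fixed_k=None, chunk_bases=DEFAULT_CHUNK_BASES):
    """Model the number of input bases per batch (an assumption, not BWA's API)."""
    if fixed_k is not None and fixed_k > 0:
        return fixed_k  # -K given: batch size ignores the thread count
    return n_threads * chunk_bases  # default: batch grows with the thread count


for threads in (2, 10, 16):
    print(threads, batch_bases(threads), batch_bases(threads, fixed_k=10_000_000))
```

Under this model the batch, and so the set of read pairs used to estimate the insert-size distribution, is the same for every -t value once -K is set.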
