IntelDeflater intermittently fails to properly compress outputs with GKL 0.8.8 #177

Closed
droazen opened this issue Jan 30, 2023 · 3 comments · Fixed by #178

droazen commented Jan 30, 2023

As reported by @kachulis in broadinstitute/gatk#8141, we are finding that the IntelDeflater in GKL version 0.8.8 seems to intermittently fail to properly compress outputs:

I run PrintReads over and over again on the same input data, not doing anything, just read in, write out, i.e. gatk PrintReads -I input.bam -O output.bam. Mostly, I just get an identical 9GB bam over and over again (as confirmed by md5). However, sometimes (~10% of the time, it seems), I get a MUCH larger “bam”, more like ~45GB. In runs where I get these larger output files, they are not always the same size: sometimes 45GB, sometimes 47GB (still always with the same input file, same command line, same wdl task, etc.). The runs that produce these larger bams also take much longer, with a slower “reads per minute” rate. They report exactly the same number of reads processed in the logs as the “normal” runs.

Looking inside the large output “bams” with gsutil cat, I see the header suddenly transitioning from compressed-looking gibberish to a plaintext header, and then after a bit back to compressed-looking gibberish again. Additionally, if I run these large bams through samtools view to get samtools to write them as a bam (i.e. samtools view big.bam -o samtools_out.bam), the resulting bam is much smaller, ~6GB. It kind of seems like sometimes gatk will just stop compressing the output, and then start back up again, seemingly randomly??

This does not occur with GKL 0.8.6, and seems to have been introduced by the upgrade from GKL 0.8.6 to GKL 0.8.8 in https://github.com/broadinstitute/gatk/pull/7203/files

Additional data points:

  • Reproducible on very small files (at about the same rate of ~10%)
  • Appears to be related to the IntelDeflater: when running with the JDK deflater (--use-jdk-deflater), all 100/100 runs result in the same-sized bam
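
The repeated-run check described above (same input, many runs, compare sizes and md5) can be sketched in Python as follows. This is only an illustrative helper, not part of GATK or GKL; the function names `fingerprint` and `find_outliers` are made up for this example, and it assumes the output files from each run have already been written to local disk.

```python
import hashlib
from collections import Counter
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return (size in bytes, hex MD5) for one output file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-GB BAMs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return Path(path).stat().st_size, digest.hexdigest()


def find_outliers(paths):
    """Return the paths whose (size, md5) differs from the most common one."""
    prints = {str(p): fingerprint(p) for p in paths}
    if not prints:
        return []
    # Treat the most frequent fingerprint as the "normal" output.
    expected, _count = Counter(prints.values()).most_common(1)[0]
    return sorted(p for p, fp in prints.items() if fp != expected)
```

Calling `find_outliers` on the outputs of, say, 100 identical runs would list the runs whose output differs from the majority; with the behavior reported here, roughly 10% of the runs would be expected to show up.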

Any help would be much appreciated! This is actually a rather serious issue for us that might force us to temporarily revert to the JDK deflater or the older GKL release if it looks like it might be difficult to diagnose / fix.

(CC @lbergelson)

droazen changed the title from "IntelDeflater intermittently fail to properly compress outputs with GKL 0.8.8" to "IntelDeflater intermittently fails to properly compress outputs with GKL 0.8.8" on Jan 30, 2023

kdhanala (Contributor) commented

Thank you for reaching out. I noticed that between GKL 0.8.6 and 0.8.8, ISA-L was upgraded from 2.21 to 2.30, but since the failure is intermittent, we will first try to reproduce it on our end using a small BAM file (~9 GB) like the one mentioned in the original issue. We will use a 7 GB file from our old long-reads list (PAE09121_dae79b.bam.raw) and run it ~100 times to check for any anomalies in output sizes.

mateuszsnowak (Contributor) commented

Between versions 2.21 and 2.30, ISA-L added a new configuration field (isal_zstream->hist_bits) that was not initialized to its default value by the isal_deflate_stateless_init call. This was fixed by a recent commit in ISA-L (intel/isa-l@9f2b68f).

When there were multiple simultaneous memory allocations using malloc, it could return previously used memory that contained a non-default value at the location of the hist_bits field. Values of hist_bits between 1 and 14 (especially the lower ones) could reduce compression efficiency for that particular compression session. ISA-L replaces any other value (including 0) with 15, which is the default and most efficient setting.
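
To get a feel for why a small history setting hurts so much, here is a rough Python sketch using the standard `zlib` module rather than ISA-L. It assumes that zlib's `wbits` parameter (log2 of the DEFLATE history window) is a fair analogue of `hist_bits`; zlib only accepts 9 to 15 for raw streams, so the very low values mentioned above cannot be reproduced exactly. The helper name `deflate_size` and the synthetic data are made up for the example.

```python
import random
import zlib


def deflate_size(data, window_bits):
    """Compress `data` as raw DEFLATE with a 2**window_bits history window."""
    # Negative wbits selects a raw DEFLATE stream; zlib accepts 9..15 here.
    compressor = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-window_bits)
    return len(compressor.compress(data) + compressor.flush())


# Synthetic data: one 4 KiB random block repeated many times, so a repeat
# can only be found if the window reaches back at least 4096 bytes.
rng = random.Random(0)
block = bytes(rng.randrange(256) for _ in range(4096))
data = block * 64

small_window = deflate_size(data, 9)   # 512-byte history
full_window = deflate_size(data, 15)   # 32 KiB history, the DEFLATE maximum
print(small_window, full_window)
```

With the small window the compressor cannot see the earlier copies of the block, so the output stays close to the input size, while the full window compresses it to a small fraction of that. Real BAM data is not this extreme, but the direction of the effect matches the much larger outputs reported in this issue.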

Using calloc instead of malloc to allocate and zero-fill the isal_zstream struct also fixes this issue and prevents similar issues from happening in the future.
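
The malloc-versus-calloc difference can be modeled in a few lines of Python with `ctypes`. This is a toy model under stated assumptions, not ISA-L or GKL code: `ToyStream` is a hypothetical, heavily simplified stand-in for isal_zstream, `buggy_init` stands in for an init routine that forgets hist_bits, and stale memory is simulated by writing a leftover value into the field before init.

```python
import ctypes

DEFAULT_HIST_BITS = 15


class ToyStream(ctypes.Structure):
    """Hypothetical stand-in for a compression stream struct."""
    _fields_ = [
        ("level", ctypes.c_uint32),
        ("hist_bits", ctypes.c_uint16),
    ]


def buggy_init(stream):
    """Mimics an init routine that sets some fields but never touches hist_bits."""
    stream.level = 0


def effective_hist_bits(stream):
    """Rule described above: 1..14 is used as-is, any other value becomes 15."""
    return stream.hist_bits if 1 <= stream.hist_bits <= 14 else DEFAULT_HIST_BITS


def malloc_like(stale_hist_bits):
    """Simulate malloc returning reused memory with a leftover field value."""
    stream = ToyStream()
    stream.hist_bits = stale_hist_bits  # pretend an earlier owner left this behind
    return stream


def calloc_like():
    """Simulate calloc: ctypes zero-fills a freshly created structure."""
    return ToyStream()


reused = malloc_like(stale_hist_bits=3)
buggy_init(reused)
zeroed = calloc_like()
buggy_init(zeroed)
print(effective_hist_bits(reused), effective_hist_bits(zeroed))
```

In this model the reused memory ends up with an effective setting of 3, while the zero-filled struct falls back to the default of 15, which is why zero-filling masks the missing initialization regardless of what the memory held before.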

lbergelson (Contributor) commented

@mateuszsnowak Yay!
