Crashes in unitigging, out of memory (?) #1355

Closed
maximilianpress opened this issue May 8, 2019 · 4 comments

Comments

@maximilianpress

I am having a job crash in bogart on what I suspect is a memory error, but I don't know for sure.

I've gone through many iterations of this, including these:

canu-1.7/Linux-amd64/bin/canu -assemble maxMemory=900g maxThreads=4 utgovlMemory=225g utgovlThreads=4 -d canu_data/tetra_canu -p tetra_canu genomeSize=1.3g batMemory=225g -pacbio-corrected canu_data/tetra_canu/trimmedReads.fasta.gz

canu-1.7/Linux-amd64/bin/canu -assemble maxMemory=300g maxThreads=16 utgovlMemory=200g utgovlThreads=8 -d canu_data/tetra_canu -p tetra_canu genomeSize=1.3g -pacbio-corrected canu_data/tetra_canu/trimmedReads.fasta.gz

All of the various parameter combinations I've tried result in the exact same output (other than differences from CLI parameters, e.g. thread number), even after removing the 4-unitigger/ dir.

I am running this on Amazon Linux.

I am attaching the unitigger.err file; this is as much as I can tell about the error message:

==> MERGE ORPHANS.

computeErrorProfiles()-- Computing error profiles for 145380 tigs, with 16 threads.
ERROR: stdDev is full; can't insert() new value.

This looks to me like some sort of allocation error, but I'm not enough of an expert to get much from going through the source code.

unitigger.log just has these lines, which I think might be a downstream phenomenon:

Running job 1 based on command line options.
./unitigger.sh: line 82: ../tetra_canu.ctgStore/seqDB.v001.sizes.txt: No such file or directory

../tetra_canu.ctgStore/ does not exist.

tetra_canu.005.mergeOrphans.thr004.num000.log, the only mergeOrphans log file in evidence, has a bunch of innocuous-looking lines:

WARNING:  tig 19049 length 9064 nReads 393906 has 4873434470 overlaps.
WARNING:    read 2955067    7992-0      
WARNING:    read 1498208     890-6341   
WARNING:    read 10125626     961-6908   
WARNING:    read 10277223    1039-7041   
WARNING:    read 7556593    7051-1064   
WARNING:    read 3855054    6880-1087   
# etc.

I have increased available memory quite substantially (using ~1TB) and restarted the run multiple times with different thread/memory arguments, but I think I'm misunderstanding something.

Do you have any suggestions?

Many thanks, max

@brianwalenz
Member

Wow!!

This line

WARNING:  tig 19049 length 9064 nReads 393906 has 4873434470 overlaps.

is saying you have a 10 Kbp contig with 393,906 reads in it, which seems suspicious.

Grab some reads from there and see if they make sense:

cd canu_data/tetra_canu
sqStoreDumpFASTQ -S tetra_canu.seqStore -fasta -o - -r 2955067 > 2955067.fasta
(etc)

If they look bogus, I'd suggest filtering them from trimmedReads.fasta then restarting a new assembly from there (canu ... -assemble -pacbio-corrected trimmedReads.filtered.fasta). It would also be possible to just drop the reads in this one contig, instead of filtering by identity (it'd take a little bit of scripting).
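
A rough sketch of that scripting, assuming the reads to drop are the ones flagged in the mergeOrphans log and that the numeric IDs there match the FASTA header IDs in trimmedReads.fasta.gz -- worth verifying against a few headers from the sqStoreDumpFASTQ dump above before trusting it:

# Collect the read IDs from the WARNING lines in the mergeOrphans log.
grep -oE 'read [0-9]+' tetra_canu.005.mergeOrphans.thr004.num000.log \
  | awk '{ print $2 }' | sort -u > bad_ids.txt

# Drop every FASTA record whose header ID (first word after '>') is listed in bad_ids.txt.
zcat trimmedReads.fasta.gz \
  | awk 'NR==FNR { bad[$1]; next }
         /^>/    { keep = !(substr($1, 2) in bad) }
         keep' bad_ids.txt - \
  | gzip > trimmedReads.filtered.fasta.gz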

Another option, non-intuitive as it sounds, is to decrease the memory limit. This will cause the lower quality and/or short repeat overlaps to be skipped. Something like 90 GB is a good first try. The start of unitigger.err has a table of how many overlaps per read it is loading:

OverlapCache()--               reads loading olaps          olaps               memory
OverlapCache()--   olaps/read       all      some          loaded                 free
OverlapCache()--   ----------   -------   -------     ----------- -------     --------
OverlapCache()--          714   16816257    806301       645898334   0.62%     182382 MB
OverlapCache()--        15538   16948647    673911      2549482021  10.74%      22264 MB
OverlapCache()--        17703   16956563    665995      4000067549  12.14%        130 MB
OverlapCache()--        17715   16956609    665949      4008059260  12.15%          8 MB

With 40x coverage, 700 overlaps per read is a decent value to target.
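
For the reduced-memory attempt, a sketch of the re-run, assuming batMemory is the limit to lower (same paths and options as your earlier commands, with 90g as the suggested first try):

canu-1.7/Linux-amd64/bin/canu -assemble \
  -d canu_data/tetra_canu -p tetra_canu genomeSize=1.3g \
  batMemory=90g \
  -pacbio-corrected canu_data/tetra_canu/trimmedReads.fasta.gz

Remove the existing 4-unitigger/ directory first (after saving a copy, as suggested below) so bogart actually re-runs with the new limit.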

Save the 4-unitigger directory from this run; we can compare contig sizes to decide if the reduced memory is impacting assembly. The various *.sizes files in there show contig sizes at various steps in the algorithm.
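
One way to eyeball that comparison once both runs exist, assuming the two 4-unitigger directories have been copied somewhere side by side (the directory names here are placeholders):

# Side-by-side diff of matching *.sizes files from the two runs, a few lines each.
for f in unitigger_default/*.sizes; do
  echo "== $(basename "$f")"
  diff -y --suppress-common-lines "$f" "unitigger_90g/$(basename "$f")" | head -n 20
done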

@maximilianpress
Author

Ah, ok. That line did give me pause, but then I remembered that I don't understand how genomics works and shrugged.

I might try your suggestion to reduce memory, and see if I can filter the contig rather than the reads. I suspect there will be other contigs like this in the assembly, as it's a bit pathological, and I'm actually somewhat low on coverage due to tetraploidy.

I'll see what happens and get back to you.

Thanks for the quick response!

@maximilianpress
Author

Weirdly, reducing the memory worked. canu finished and I have an assembly. I'm still investigating, but it looks similar to what wtdbg2 gave me in terms of size and contiguity.

I ran the 90g run initially but stupidly overwrote it with an experimental 150g run, which also worked (see unitigger.err in the archive).

I am also uploading the *.sizes files.

(I'll also quickly note that the genome size estimate in the previous unitigger.err was probably half what it should be due to ploidy issues; the current run is probably (more) accurate.)

@brianwalenz
Member

Based on that deep contig, I'd guess there is a 10 Kbp (tandem?) repeat in this genome that is preventing any better assembly. If so, the ends of contigs should have pieces of the repeat sequence, and the unitig graph should be very very connected.
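
A rough way to check the contig-end idea outside of canu, assuming the final contigs are in tetra_canu.contigs.fasta and that samtools (recent enough for faidx -r / --region-file) and minimap2 are on hand:

# Pull the last ~10 kb of every contig and self-align those ends; a shared ~10 kb repeat
# should show up as a pile of end-vs-end hits. (Repeat with the first 10 kb for the other ends.)
samtools faidx tetra_canu.contigs.fasta
awk '{ start = ($2 > 10000) ? $2 - 9999 : 1; print $1 ":" start "-" $2 }' \
    tetra_canu.contigs.fasta.fai > contig_ends.regions
samtools faidx -r contig_ends.regions tetra_canu.contigs.fasta > contig_ends.fasta
minimap2 -x ava-pb contig_ends.fasta contig_ends.fasta \
  | awk '$1 != $6' > contig_end_overlaps.paf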

The logs show 20x coverage in corrected reads, and I'd guess you have about 25x in raw reads. More data could possibly help, both by giving better corrected reads and by improving the chance of spanning the repeat.
