Graphtyper copying the same files multiple times if using "--region_file" #159

Open
sroener opened this issue Jan 7, 2025 · 0 comments
Hi,

thank you for writing and maintaining graphtyper.

I noticed that graphtyper copies the same CRAM files multiple times when I use the "--region_file" option. My log contains messages like the following:

[2024-11-22 00:29:36.836] SV genotyping region chr2:1010000-1221700
[2024-11-22 00:29:36.836] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:29:36.836] Running with up to 72 threads.
[2024-11-22 00:29:36.836] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:29:36.836] Temporary folder is /tmp/graphtyper_241122_002936_chr2_001010000.iWGl68
[2024-11-22 00:29:36.836] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:29:39.496] Genotype calling step starting.
[2024-11-22 00:29:39.497] Padded region is: chr2:1009000-1422700
[2024-11-22 00:29:39.497] Constructing graph.
[2024-11-22 00:29:39.520] Calculating contig offsets.
[2024-11-22 00:30:47.770] Finished calling. Thread work: 5/2/4/3/3/3/4/2/3/4/3/3/3/3/2/3/4/4/4/3/3/3/2/3/3/3/2/3/4/4/2/2/3/3/4/4/3/4/3/2/3/2/4/4/3/3/3/2/3/4/4/2/2/3/3/4/4/3/2/2/3/2/3/2/2/3/3/2/3/3/3/2
[2024-11-22 00:30:47.770] Merging output VCFs.
[2024-11-22 00:30:49.878] Cleaning up temporary files.
[2024-11-22 00:30:50.219] Finished! Output written at: batch1/chr2/001010000-001221700.vcf.gz

[2024-11-22 00:30:50.219] SV genotyping region chr2:1223900-1594700
[2024-11-22 00:30:50.219] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:30:50.219] Running with up to 72 threads.
[2024-11-22 00:30:50.219] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:30:50.219] Temporary folder is /tmp/graphtyper_241122_003050_chr2_001223900.wcbtZp
[2024-11-22 00:30:50.219] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:30:52.815] Genotype calling step starting.
[2024-11-22 00:30:52.815] Padded region is: chr2:1222900-1795700
[2024-11-22 00:30:52.815] Constructing graph.
[2024-11-22 00:30:52.853] Calculating contig offsets.
[2024-11-22 00:32:04.971] Finished calling. Thread work: 4/3/3/3/3/4/3/3/3/3/4/3/4/3/3/4/3/3/3/3/2/3/3/4/3/4/3/3/3/3/3/3/3/2/3/3/4/4/3/3/3/3/3/3/4/3/3/3/4/4/3/2/3/3/3/3/2/3/3/3/3/3/3/2/2/2/2/2/2/3/2/2
[2024-11-22 00:32:04.972] Merging output VCFs.
[2024-11-22 00:32:10.079] Cleaning up temporary files.

My reading of the logs is that the software iterates over the regions in the region_file and repeats the same steps for each one. These steps include copying the input CRAMs and the reference genome to a separate temporary directory and cleaning it up after calling. This creates a lot of IO overhead that does not seem necessary from my perspective, since the CRAM files and the reference genome should not change between regions.
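
To put a rough number on this, the copy phase can be estimated from the timestamps in the log. The following is a minimal Python sketch, assuming that the interval between the "Copying data" line and the "Genotype calling step starting" line approximates the time spent copying. The function names are hypothetical and not part of graphtyper, and the "Thread work" lists are trimmed from the excerpt.

```python
from datetime import datetime

# Excerpt of the log above, one block per region ("Thread work" lists trimmed).
LOG = """\
[2024-11-22 00:29:36.836] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:29:39.496] Genotype calling step starting.
[2024-11-22 00:30:47.770] Finished calling.
[2024-11-22 00:30:50.219] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:30:52.815] Genotype calling step starting.
[2024-11-22 00:32:04.971] Finished calling.
"""


def parse_time(line):
    """Extract the timestamp between the leading square brackets."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def phase_seconds(log):
    """Return one (copy_seconds, calling_seconds) pair per region."""
    results = []
    copy_start = call_start = None
    for line in log.splitlines():
        if "Copying data from" in line:
            copy_start = parse_time(line)
        elif "Genotype calling step starting" in line:
            call_start = parse_time(line)
        elif "Finished calling" in line:
            end = parse_time(line)
            results.append((
                round((call_start - copy_start).total_seconds(), 3),
                round((end - call_start).total_seconds(), 3),
            ))
    return results


for copy_s, call_s in phase_seconds(LOG):
    print(f"copy: {copy_s:.3f} s, calling: {call_s:.3f} s")
```

For the two regions quoted above, this gives about 2.7 s and 2.6 s for the copy phase, against roughly 68 s and 72 s for calling. This only measures elapsed time between log lines, so it does not show how much data was actually copied.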

The main question here is: does graphtyper internally copy the whole CRAM file, or only the part covering the current region? If it is the latter, would it be possible to load the data for all regions first and then start processing?

My suggestion would be to move the copying and cleanup of the temporary directory outside of the "loop" over regions. This could save a lot of IO and would probably make calling multiple regions faster.
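
The idea can be sketched as follows. This is an illustration in Python, not graphtyper's actual C++ code: `copy_inputs`, `genotype_region` and `genotype_all` are hypothetical stand-ins, and the sketch assumes that the copied inputs are identical for every region, which may not hold if only region-specific slices are copied.

```python
import tempfile
from pathlib import Path


def copy_inputs(tmp_dir):
    """Hypothetical stand-in for copying the CRAMs and the reference FASTA."""
    marker = Path(tmp_dir) / "inputs_copied"
    marker.write_text("ok")
    return marker


def genotype_region(region, tmp_dir):
    """Hypothetical stand-in for per-region calling; it reuses the shared copy."""
    assert (Path(tmp_dir) / "inputs_copied").exists()
    return f"{region}: done"


def genotype_all(regions):
    """Create the temporary folder once, reuse it for every region, clean up once."""
    results = []
    with tempfile.TemporaryDirectory(prefix="graphtyper_sketch_") as tmp_dir:
        copy_inputs(tmp_dir)  # done once, before the loop
        for region in regions:
            results.append(genotype_region(region, tmp_dir))
    # The temporary folder is removed here, after the last region.
    return results


print(genotype_all(["chr2:1010000-1221700", "chr2:1223900-1594700"]))
```

The context manager ties the lifetime of the temporary folder to the whole run rather than to a single region, which is the change proposed here.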

I assume the changes would have to be done in genotype_sv.cpp.

Please let me know if my suggestions are feasible.
