Hi,
thank you for writing and maintaining graphtyper.
I noticed that graphtyper copies the same CRAM files multiple times when I use the "--region_file" option. My log contains messages like the following:
[2024-11-22 00:29:36.836] SV genotyping region chr2:1010000-1221700
[2024-11-22 00:29:36.836] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:29:36.836] Running with up to 72 threads.
[2024-11-22 00:29:36.836] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:29:36.836] Temporary folder is /tmp/graphtyper_241122_002936_chr2_001010000.iWGl68
[2024-11-22 00:29:36.836] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:29:39.496] Genotype calling step starting.
[2024-11-22 00:29:39.497] Padded region is: chr2:1009000-1422700
[2024-11-22 00:29:39.497] Constructing graph.
[2024-11-22 00:29:39.520] Calculating contig offsets.
[2024-11-22 00:30:47.770] Finished calling. Thread work: 5/2/4/3/3/3/4/2/3/4/3/3/3/3/2/3/4/4/4/3/3/3/2/3/3/3/2/3/4/4/2/2/3/3/4/4/3/4/3/2/3/2/4/4/3/3/3/2/3/4/4/2/2/3/3/4/4/3/2/2/3/2/3/2/2/3/3/2/3/3/3/2
[2024-11-22 00:30:47.770] Merging output VCFs.
[2024-11-22 00:30:49.878] Cleaning up temporary files.
[2024-11-22 00:30:50.219] Finished! Output written at: batch1/chr2/001010000-001221700.vcf.gz
[2024-11-22 00:30:50.219] SV genotyping region chr2:1223900-1594700
[2024-11-22 00:30:50.219] Path to genome is 'GRCh38_full_analysis_set_plus_decoy_hla.fa'
[2024-11-22 00:30:50.219] Running with up to 72 threads.
[2024-11-22 00:30:50.219] Copying data from 288 input SAM/BAM/CRAMs to local disk.
[2024-11-22 00:30:50.219] Temporary folder is /tmp/graphtyper_241122_003050_chr2_001223900.wcbtZp
[2024-11-22 00:30:50.219] Copying reference genome FASTA and its index to temporary folder.
[2024-11-22 00:30:52.815] Genotype calling step starting.
[2024-11-22 00:30:52.815] Padded region is: chr2:1222900-1795700
[2024-11-22 00:30:52.815] Constructing graph.
[2024-11-22 00:30:52.853] Calculating contig offsets.
[2024-11-22 00:32:04.971] Finished calling. Thread work: 4/3/3/3/3/4/3/3/3/3/4/3/4/3/3/4/3/3/3/3/2/3/3/4/3/4/3/3/3/3/3/3/3/2/3/3/4/4/3/3/3/3/3/3/4/3/3/3/4/4/3/2/3/3/3/3/2/3/3/3/3/3/3/2/2/2/2/2/2/3/2/2
[2024-11-22 00:32:04.972] Merging output VCFs.
[2024-11-22 00:32:10.079] Cleaning up temporary files.
My reading of the log is that the software iterates over the regions in the region_file and repeats the same steps for each one. These steps include copying the input CRAMs and the reference genome to a distinct temporary directory and cleaning it up after calling. This creates a lot of IO overhead that does not seem necessary from my perspective, since the CRAM files and the reference genome should not change between regions.
The main question is: does graphtyper internally copy the whole CRAM file, or only parts of it? In the latter case, would it be possible to first load the data for all regions and then start processing?
My suggestion would be to move the copying and cleaning of the temporary directory outside of the "loop". This could save a lot of IO and would probably make calling multiple regions faster.
I assume the changes would have to be done in genotype_sv.cpp.
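To make the suggestion concrete, here is a minimal sketch of the restructuring I have in mind. It is not graphtyper's code: `Staging`, `call_region`, `run_staging_per_region` and `run_staging_once` are hypothetical names I made up for illustration, and the copy and clean-up steps are reduced to counters so the difference between the two loop shapes is visible.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the staging work (copying inputs and the
// reference to a temporary folder). Not graphtyper's real code.
struct Staging {
    std::size_t copies_made = 0;    // how many times inputs were staged
    std::size_t cleanups_made = 0;  // how many times the folder was removed

    void copy_inputs() { ++copies_made; }
    void clean_up() { ++cleanups_made; }
};

// Hypothetical placeholder for genotyping a single region.
inline void call_region(const std::string& region,
                        std::vector<std::string>& done) {
    done.push_back(region);
}

// Behaviour as I read it from the log: stage and clean up for every region.
inline Staging run_staging_per_region(const std::vector<std::string>& regions,
                                      std::vector<std::string>& done) {
    Staging s;
    for (const auto& r : regions) {
        s.copy_inputs();
        call_region(r, done);
        s.clean_up();
    }
    return s;
}

// Suggested behaviour: stage once, call every region, clean up once.
inline Staging run_staging_once(const std::vector<std::string>& regions,
                                std::vector<std::string>& done) {
    Staging s;
    s.copy_inputs();
    for (const auto& r : regions) {
        call_region(r, done);
    }
    s.clean_up();
    return s;
}
```

With N regions the first variant stages the inputs N times and the second stages them once, while the regions are called in the same order. This only pays off if the staged data is valid for every region. If graphtyper copies only the reads overlapping the current region, the single copy step would have to cover all regions in the file.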
Please let me know if my suggestions are feasible.