An efficient way to stage a long list of files as input using Fusion. #5852
-
Fusion may improve your speed by fetching the files in the background while the process is running. You do not need to change your paths; just enable Fusion and Nextflow takes care of them. Still, 30K objects means at least 30K HTTP requests to Google Cloud Storage. It would be faster if you could reduce the number of VCFs in some way, even if that means merging them into a single very big one. The best way to know is by testing, so I'd create a dummy pipeline that emulates what you are doing and check which setup works best for you.
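For reference, a minimal `nextflow.config` sketch for turning Fusion on with Google Batch (the project, location and bucket names are placeholders):

```groovy
// Minimal sketch: run on Google Batch with Fusion enabled.
// Project, location and bucket names are placeholders.
process.executor = 'google-batch'
google.project   = 'my-project'
google.location  = 'us-central1'

workDir = 'gs://my-bucket/work'   // Fusion needs an object-storage work dir

wave.enabled   = true             // Fusion is delivered through the Wave service
fusion.enabled = true
```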
-
Hi,
I have a process that needs to ingest ~30K files as input, and the jobs will be executed with Google Batch.
Example:
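Something along these lines, as a minimal sketch (the bucket, glob and process name below are placeholders, not my real ones):

```groovy
// Minimal sketch with placeholder names: one task ingesting all VCFs.
params.vcfs = 'gs://my-bucket/cohort/*.vcf.gz'

process PROCESS_VCFS {
    input:
    path vcf_files          // ~30K files collected into a single task

    script:
    """
    ls | wc -l              # just count the staged/mounted inputs
    """
}

workflow {
    // collect() bundles every matched file into one invocation
    PROCESS_VCFS( Channel.fromPath(params.vcfs).collect() )
}
```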
I want to use Fusion so that I don't have to copy ~30K files. However, I realized that Nextflow takes a long time (>30 min) to check the files; if I understand correctly, that is why the workflow took so long to launch the process. Additionally, it took more than 1 hr for Nextflow to stage the files, which is not cost-efficient since the VM is up the whole time.
Since all the files are located in the same Google bucket, I was wondering if I can just mount the entire bucket and provide the list with the expected paths, e.g. /fusion/gs/xxx/xxx/xxx.vcf, to save the time spent staging the inputs?
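For what it's worth, one way I could imagine expressing that idea is to stage only a small manifest file and let the task read the VCFs through the Fusion mount, assuming Fusion exposes the bucket under /fusion/gs/<bucket>/ inside the task container (all names here are placeholders):

```groovy
// Sketch: stage only a small manifest; read the VCFs via the Fusion mount.
// Assumes Fusion exposes gs://<bucket>/... as /fusion/gs/<bucket>/... in the task.
params.vcf_list = 'vcf_urls.txt'   // one gs:// object URL per line (hypothetical)

process USE_VCFS {
    input:
    path manifest   // small text file of /fusion/gs/... paths, staged normally

    script:
    """
    # Each line is a Fusion path like /fusion/gs/my-bucket/cohort/sample1.vcf
    while read -r p; do
        head -c 100 "\$p" > /dev/null   # touch each file to prove it is readable
    done < ${manifest}
    """
}

workflow {
    // Rewrite gs:// URLs to Fusion mount paths and collect them into a manifest
    Channel.fromPath(params.vcf_list)
        .splitText()
        .map { it.trim().replaceFirst('^gs://', '/fusion/gs/') }
        .collectFile(name: 'fusion_paths.txt', newLine: true)
        | USE_VCFS
}
```

Though I suppose this would bypass Nextflow's input tracking and caching for those files, since they are never declared as `path` inputs.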
Or can someone advise on the best way to proceed with such a process?
Thanks!