An efficient way to stage a long list of files as input using Fusion. #5852
-
Fusion may improve your speed by fetching the files in the background while the process is running. You do not need to change your paths; just enable Fusion and Nextflow takes care of them. Still, 30K objects means at least 30K HTTP requests to Google Cloud Storage. It would be faster if you could reduce the number of VCFs in some way, even if that means merging them into a single very big one. The best way to know is by testing, so I'd create a dummy pipeline that emulates what you are doing and check which setup works best for you.
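For reference, a minimal `nextflow.config` sketch for turning Fusion on with Google Batch (the project, location and bucket names are placeholders):

```groovy
// Minimal sketch: run on Google Batch with Fusion enabled.
// Project, location and bucket names are placeholders.
process.executor = 'google-batch'
google.project   = 'my-project'
google.location  = 'us-central1'

workDir = 'gs://my-bucket/work'   // Fusion needs an object-storage work dir

wave.enabled   = true             // Fusion is delivered through the Wave service
fusion.enabled = true
```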
-
Hi,
I have a process that needs to ingest ~30K files as input, and the jobs will be executed with Google Batch.
Example:
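Something along these lines, as a minimal sketch (the bucket, glob and process name below are placeholders, not my real ones):

```groovy
// Minimal sketch with placeholder names: one task ingesting all VCFs.
params.vcfs = 'gs://my-bucket/cohort/*.vcf.gz'

process PROCESS_VCFS {
    input:
    path vcf_files          // ~30K files collected into a single task

    script:
    """
    ls | wc -l              # just count the staged/mounted inputs
    """
}

workflow {
    // collect() bundles every matched file into one invocation
    PROCESS_VCFS( Channel.fromPath(params.vcfs).collect() )
}
```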
I want to use Fusion so that I don't have to copy ~30K files. However, I realized that Nextflow takes a long time (>30 min) to check the files; if I understand correctly, that is why the workflow took so long to launch the process. Additionally, it took more than 1 hr for Nextflow to stage the files, which is not cost-efficient since the VM is up the whole time.
Since all the files are located in the same Google bucket, I was wondering if I can just mount the entire bucket and provide the list with the expected paths, e.g. /fusion/gs/xxx/xxx/xxx.vcf, to save the time spent staging the inputs?
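For what it's worth, one way I could imagine expressing that idea is to stage only a small manifest file and let the task read the VCFs through the Fusion mount, assuming Fusion exposes the bucket under /fusion/gs/<bucket>/ inside the task container (all names here are placeholders):

```groovy
// Sketch: stage only a small manifest; read the VCFs via the Fusion mount.
// Assumes Fusion exposes gs://<bucket>/... as /fusion/gs/<bucket>/... in the task.
params.vcf_list = 'vcf_urls.txt'   // one gs:// object URL per line (hypothetical)

process USE_VCFS {
    input:
    path manifest   // small text file of /fusion/gs/... paths, staged normally

    script:
    """
    # Each line is a Fusion path like /fusion/gs/my-bucket/cohort/sample1.vcf
    while read -r p; do
        head -c 100 "\$p" > /dev/null   # touch each file to prove it is readable
    done < ${manifest}
    """
}

workflow {
    // Rewrite gs:// URLs to Fusion mount paths and collect them into a manifest
    Channel.fromPath(params.vcf_list)
        .splitText()
        .map { it.trim().replaceFirst('^gs://', '/fusion/gs/') }
        .collectFile(name: 'fusion_paths.txt', newLine: true)
        | USE_VCFS
}
```

Though I suppose this would bypass Nextflow's input tracking and caching for those files, since they are never declared as `path` inputs.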
Or can someone advise on the best way to proceed with such a process?
Thanks!