Can't associate a script task with the files it produces #106
Comments
PS. I'm loving redun; you have built an awesome workflow framework. Thank you!
Thanks for posting this issue. You are correct that the task option
Yes, if all you want to do is run a script locally, then your suggested approach would be the most straightforward. We show something like that in redun/examples/02_compile/make.py (lines 9 to 15 at 147c19d).
The more common, higher-level approach for script tasks is to use file staging, as shown in redun/examples/04_script/workflow.py (lines 59 to 99 at 147c19d).
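The pattern looks roughly like this (a simplified sketch, not the exact code from that example; the command and paths are placeholders). Because the produced file is declared via `outputs=`, redun records it as the script task's own output and will re-run the script if that file later changes or disappears:

```python
from redun import task, script, File

@task()
def make_big_file(output_path: str) -> File:
    output = File(output_path)
    return script(
        """
        my_tool --out result.dat
        """,
        # Stage the file the command writes: redun copies result.dat from
        # where the script ran to output_path and returns the File, so the
        # produced file is part of this task's recorded output.
        outputs=output.stage("result.dat"),
    )
```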
The main use case for that approach is running scripts on remote compute, such as AWS Batch; see redun/examples/05_aws_batch/workflow.py (lines 74 to 88 at 147c19d).
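Schematically, that is the same staging pattern plus an executor task option (again a sketch; the executor name "batch" and the S3 path assume an AWS Batch executor configured for the workflow, and the command is a placeholder):

```python
from redun import task, script, File

@task()
def make_file_on_batch(s3_output_path: str) -> File:
    output = File(s3_output_path)
    return script(
        """
        my_tool --out result.dat
        """,
        executor="batch",                    # run the command on AWS Batch
        outputs=output.stage("result.dat"),  # copy result.dat back to S3
    )
```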
Hopefully, these examples help you pick what's best for your use case.
Awesome! You're asking good questions. Keep them coming!
Thanks for that, really useful to know. I'm clearly on a learning curve here, and really enjoying it! I had skipped over the file staging section in the docs, as I'm working in an HPC environment with the same filesystems available on all nodes. I'll take some time to understand it properly.
When a task has a file output, it is re-triggered on subsequent runs if the file has changed or been deleted in the meantime. This is exactly what I need. However, this doesn't happen for script tasks, and I can't see how to fix that.
In my use case, I have script tasks which potentially generate very large files (MB or even GB), so I don't think it's reasonable for the script to produce that on stdout. Instead, it writes to a file, and outputs the path of the file written. And this is precisely what causes my problem.
I'll fully describe what I have done here, in case this isn't clear. Sorry, I am aware this is a long description!
I have a helper task, file_from_script_output, which turns a pathname on stdout into a File. In my example code below, I have a simple task which creates a file, and a script task which does the same. When I delete the simple file and rerun, the creating task gets triggered and the file is recreated, as expected. However, when I delete the file that was created by the script, the script itself doesn't get re-triggered, because its output as seen by redun has not changed, and redun doesn't know it's the task which actually creates the file. The helper task gets rerun, but uses the cached output from create_file_with_script.

I'm pasting in a lot below, hoping this provides clarity on what my problem is and what solution there may be.
Firstly, my workflow:
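(Reconstructed here in simplified form: the task names match the description above, but the bodies are illustrative, and I'm assuming script() returns the command's stdout.)

```python
from redun import task, script, File

@task()
def create_file(path: str) -> File:
    # Plain task: redun records the returned File (path + hash) as the
    # task's output, so deleting the file re-triggers this task.
    file = File(path)
    with file.open("w") as out:
        out.write("hello from a plain task\n")
    return file

@task()
def create_file_with_script(path: str):
    # Script task: the shell command writes the file and prints its path.
    # redun only sees the stdout as this task's output, not the file itself.
    return script(f"""
        echo "hello from a script task" > {path}
        echo {path}
    """)

@task()
def file_from_script_output(output) -> File:
    # Helper: turn the pathname printed on stdout into a File.
    path = output.decode() if isinstance(output, bytes) else output
    return File(path.strip())

@task()
def main() -> list:
    simple = create_file("simple.txt")
    scripted = file_from_script_output(create_file_with_script("scripted.txt"))
    return [simple, scripted]
```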
Here's a summary of the first run:
Now, deleting the simple file and rerunning behaves as expected, and the file is recreated:
And finally, deleting the file created by the script and rerunning. Observe that the helper task runs, but the actual task that creates the file is cached.
My conclusion is that in the case of files too large to send over stdout, I must stick with plain tasks, since script tasks can't do what I'm attempting here. So using subprocess.run from Python to run my script, along the lines of the sketch below, is the way to go. Does that seem right?
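(A minimal sketch of what I mean; the shell command is just a placeholder for my actual script:)

```python
import subprocess

from redun import task, File

@task()
def create_file_with_subprocess(path: str) -> File:
    # Run the external script; redun then records the returned File as this
    # task's output, so deleting or changing the file re-triggers the task.
    subprocess.run(
        ["bash", "-c", f'echo "hello from a subprocess" > {path}'],
        check=True,
    )
    return File(path)
```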