
Can't associate a script task with the files it produces #106

Open
tesujimath opened this issue Jan 15, 2025 · 3 comments

Comments

@tesujimath

When a task has a file output, it is retriggered on subsequent runs if anything has happened to the file in the meantime. This is exactly what I need. However, it doesn't happen for script tasks, and I can't see how to fix that.

In my use case, I have script tasks which potentially generate very large files (MBs or even GBs), so I don't think it's reasonable for the script to write that content to stdout. Instead, it writes to a file and outputs the path of the file written. And this is precisely what causes my problem.

I'll fully describe what I have done here, in case this isn't clear. Sorry, I am aware this is a long description!

I have a helper task file_from_script_output which turns a pathname on stdout into a File. In my example code below, I have a simple task which creates a file, and a script task which does the same.

When I delete the simple file and rerun, the creating task gets triggered, and the file recreated, as expected. However, when I delete the file that was created by the script, the script itself doesn't get triggered, because its output as seen by redun has not changed, and redun doesn't know it's the task which actually creates the file. The helper task gets rerun, but uses cached output from create_file_with_script.
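The effect can be sketched with a redun-free analogy (this is just an illustration of the caching behaviour, not redun's actual implementation): a result cache keyed on the task's inputs, which stores only the return value, cannot see side effects such as files written to disk.

```python
import os
import tempfile

# A toy result cache: keyed by task inputs, storing only the return value.
cache = {}

def run_cached(key, fn):
    """Return the cached result for `key`, running `fn` only on a miss."""
    if key not in cache:
        cache[key] = fn()
    return cache[key]

out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "freddy2")

def script_task():
    # Side effect: write the file. Return value: only the path string.
    with open(path, "w") as f:
        f.write("Hello Freddy 2\n")
    return path

first = run_cached("create", script_task)
os.remove(path)                      # delete the file between runs
second = run_cached("create", script_task)

assert first == second == path       # cache hit: the path string is unchanged
assert not os.path.exists(path)      # so the file is never recreated
```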

I'm pasting in a lot below, hoping this provides clarity on what my problem is and what solution there may be.

Firstly, my workflow:

from redun import task, File
from typing import List

redun_namespace = "redun.examples.script_rerun"


@task(script=True)
def create_file_with_script(out_path: str, content: str) -> str:
    return f"""
        echo -n "{content}" >"{out_path}"
        echo "{out_path}"
    """


@task()
def create_file(out_path: str, content: str) -> File:
    with open(out_path, "w") as out_f:
        out_f.write(content)
    return File(out_path)


@task()
def file_from_script_output(path_bytes: bytes) -> File:
    """
    Return a File for the single path printed by a redun script task.

    The script must print the absolute pathname on stdout.
    """
    assert isinstance(path_bytes, bytes), path_bytes
    path = path_bytes.decode("utf-8").strip()
    return File(path)


@task()
def main() -> List[File]:
    simple_file = create_file("out/freddy1", "Hello Freddy 1\n")
    script_file = file_from_script_output(
        create_file_with_script("out/freddy2", "Hello Freddy 2\n")
    )
    return [simple_file, script_file]

Here's a summary of first run:

[redun] | JOB STATUS 2025/01/15 15:43:53
[redun] | TASK                                                PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] |
[redun] | ALL                                                       0       0       0       0       4       4
[redun] | redun.examples.script_rerun.create_file                   0       0       0       0       1       1
[redun] | redun.examples.script_rerun.create_file_with_script       0       0       0       0       1       1
[redun] | redun.examples.script_rerun.file_from_script_output       0       0       0       0       1       1
[redun] | redun.examples.script_rerun.main                          0       0       0       0       1       1

[File(path=out/freddy1, hash=fca2d87a), File(path=out/freddy2, hash=f780510e)]

(redun) ls -l out
total 8
-rw-r--r-- 1 guestsi users 15 Jan 15 15:43 freddy1
-rw-r--r-- 1 guestsi users 15 Jan 15 15:43 freddy2

Now, deleting the simple file and rerunning behaves as expected, and the file is recreated:

(redun) rm out/freddy1

(redun) redun run script-rerun/main.py main

[redun] | JOB STATUS 2025/01/15 15:44:32
[redun] | TASK                                                PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] |
[redun] | ALL                                                       0       0       0       3       1       4
[redun] | redun.examples.script_rerun.create_file                   0       0       0       0       1       1
[redun] | redun.examples.script_rerun.create_file_with_script       0       0       0       1       0       1
[redun] | redun.examples.script_rerun.file_from_script_output       0       0       0       1       0       1
[redun] | redun.examples.script_rerun.main                          0       0       0       1       0       1

[File(path=out/freddy1, hash=fca2d87a), File(path=out/freddy2, hash=f780510e)]
(redun) ls -l out
total 8
-rw-r--r-- 1 guestsi users 15 Jan 15 15:44 freddy1
-rw-r--r-- 1 guestsi users 15 Jan 15 15:43 freddy2

And finally, deleting the file created by the script and rerunning: observe that the helper task runs, but the task that actually creates the file is cached.

(redun) rm out/freddy2

(redun) redun run script-rerun/main.py main

[redun] | JOB STATUS 2025/01/15 15:45:00
[redun] | TASK                                                PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] |
[redun] | ALL                                                       0       0       0       3       1       4
[redun] | redun.examples.script_rerun.create_file                   0       0       0       1       0       1
[redun] | redun.examples.script_rerun.create_file_with_script       0       0       0       1       0       1
[redun] | redun.examples.script_rerun.file_from_script_output       0       0       0       0       1       1
[redun] | redun.examples.script_rerun.main                          0       0       0       1       0       1

[File(path=out/freddy1, hash=fca2d87a), File(path=out/freddy2, hash=0c856ff4)]
(redun) ls -l out
total 4
-rw-r--r-- 1 guestsi users 15 Jan 15 15:44 freddy1

My conclusion is that in case of files too large to send over stdout, I must stick with plain tasks, since script tasks can't do what I'm attempting here. So using subprocess.run from Python to run my script is the way to go.
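Sketching that subprocess.run approach (shown as a plain function for illustration; in a real workflow this body would sit inside an ordinary @task() returning File(out_path) so that redun tracks the output file, and the function name here is just my placeholder):

```python
import subprocess

def create_file_with_subprocess(out_path: str, content: str) -> str:
    # Run the shell snippet via subprocess instead of script=True.
    # In redun, this body would live inside a plain @task() that returns
    # File(out_path), so redun re-runs it when the file changes or is deleted.
    subprocess.run(
        ["sh", "-c", f'printf %s "{content}" > "{out_path}"'],
        check=True,
    )
    return out_path
```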

Does that seem right?

@tesujimath
Author

PS. I'm loving redun, you have built an awesome workflow framework. Thank you!

@mattrasmus
Collaborator

Thanks for posting this issue. You are correct that the task option script=True cannot react to file outputs. It's a pretty low-level feature that most users will not work with directly. There is an example workflow in the tutorial that shows the more common high-level ways of defining script-based workflows.
https://github.com/insitro/redun/tree/147c19dedf762ae1ebd76805e6d4328726a4ee12/examples/04_script/

My conclusion is that in case of files too large to send over stdout, I must stick with plain tasks, since script tasks can't do what I'm attempting here. So using subprocess.run from Python to run my script is the way to go.

Yes, if all you want to do is run a script locally, then your suggested approach would be the most straightforward. We show something like that in the 02_compile example here:

@task()
def compile(c_file: File) -> File:
    """
    Compile one C file into an object file.
    """
    os.system("gcc -c {c_file}".format(c_file=c_file.path))
    return File(c_file.path.replace(".c", ".o"))

The more common, higher-level approach to running script tasks is to use script(), which provides a way to specify the output Files explicitly (so that redun can react to their changes). It uses script=True under the hood.

@task()
def count_colors_by_script(data: File, output_path: str) -> Dict[str, File]:
    """
    Count colors using a multi-line script.
    """
    # This example task uses a multi-line script.
    # We also show how to isolate the script from other scripts that might be running.
    # By using `tempdir=True`, we will run this script within a temporary directory.

    # Use absolute paths for input and output because the script will run in a new temp directory.
    data = File(os.path.abspath(data.path))
    output = File(os.path.abspath(output_path))
    log_file = File(os.path.abspath(output_path) + ".log")

    # Staging Files.
    # The `File.stage()` method pairs a local path (relative to the current working directory
    # in the script) with a remote path (url or project-level path). This pairing is used
    # to automatically copy files to and from the temporary directory.
    #
    #     staged_file: StagingFile = File(remote_path).stage(local_path)
    return script(
        """
        echo 'sorting colors...' >> log.txt
        cut -f3 data | sort > colors.sorted
        echo 'counting colors...' >> log.txt
        uniq -c colors.sorted | sort -nr > color-counts.txt
        """,
        # Use a temporary directory for running the script.
        tempdir=True,
        # Stage the input file to a local file called `data`.
        inputs=[data.stage("data")],
        # Unstage the output files to our project directory.
        # The final return value of script() takes the shape of outputs, but with each
        # StagingFile replaced by `File(remote_path)`.
        outputs={
            "colors-counts": output.stage("color-counts.txt"),
            "log": log_file.stage("log.txt"),
        },
    )

The main use case of script() is when you are running a batch job and you want the job itself to be entirely shell-script based, for example if you can't assume redun is installed in the job's docker image and therefore can't use a normal Python-based task. For jobs that run on a cloud batch system (like AWS Batch, Google Batch, or Azure Batch), you typically also need to copy input files from cloud object storage into the job's local storage and copy output files back out again. We call such copying file staging, and it is the main feature that script() plus File.stage() helps with. An example of using script() with AWS Batch is here:

return script(
    """
    echo 'sorting colors...' >> log.txt
    cut -f3 data | sort > colors.sorted
    echo 'counting colors...' >> log.txt
    uniq -c colors.sorted | sort -nr > color-counts.txt
    """,
    executor="batch",
    inputs=[data.stage("data")],
    outputs={
        "colors-counts": output.stage("color-counts.txt"),
        "log": log_file.stage("log.txt"),
    },
)

Hopefully, these examples help you pick what's best for your use case.

PS. I'm loving redun, you have built an awesome workflow framework. Thank you!

Awesome! You're asking good questions. Keep them coming!

@tesujimath
Author

tesujimath commented Jan 15, 2025

Thanks for that, really useful to know. I'm on a learning curve here clearly, and really enjoying it!

I had skipped over the file staging section in the docs as I'm working in an HPC environment with the same filesystems available on all nodes. I'll take some time to understand script() now.
