Error getting DrexelMetadata in episode 10 #10

Open
thompsonmj opened this issue Mar 16, 2023 · 5 comments

Comments

@thompsonmj
Contributor

(/fs/ess/PAS2136/Workshops/Snakemake/conda_env) [thompsonmj@o0647 SnakemakeWorkflow]$ snakemake -c1 --use-singularity DrexelMetadata/bj373514.json
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
generate_metadata        1              1              1
total                    1              1              1
Select jobs to execute...
[Thu Mar 16 14:26:09 2023]
rule generate_metadata:
    input: Images/bj373514.jpg
    output: DrexelMetadata/bj373514.json, Mask/bj373514_mask.png
    log: logs/generate_metadata_bj373514.log
    jobid: 0
    reason: Missing output files: DrexelMetadata/bj373514.json
    wildcards: image=bj373514
    resources: tmpdir=/tmp/slurmtmp.23961738
Activating singularity image /users/PAS2136/thompsonmj/SnakemakeWorkflow/.snakemake/singularity/48c2d571fde349f4656aa5ab95dccc30.simg
WARNING: Environment variable LD_PRELOAD already has value [], will not forward new value [/usr/local/xalt/xalt/lib64/libxalt_init.so] from parent process environment
Waiting at most 5 seconds for missing files.
MissingOutputException in rule generate_metadata in file https://raw.githubusercontent.com/hdr-bgnn/BGNN_Core_Workflow/1.0.0/workflow/Snakefile, line 19:
Job 0 completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
Mask/bj373514_mask.png
Removing output files of failed job generate_metadata since they might be corrupted:
DrexelMetadata/bj373514.json
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
@johnbradley
Collaborator

@thompsonmj Could you check the log mentioned above to see if there is anything helpful in logs/generate_metadata_bj373514.log?
The error is that gen_metadata.py only created DrexelMetadata/bj373514.json but not Mask/bj373514_mask.png.
What does your Snakefile look like right now?

@johnbradley
Collaborator

@thompsonmj Were you able to figure out what caused your problem? If not, you could also check Images/bj373514.jpg to see if it's a valid image.

@johnbradley
Collaborator

This problem could be caused by a typo in the download_image rule:

rule download_image:
    params: url=get_image_url
    output: "Images/bj373514.jpg"
    shell: "wget -O {output} {params.url}"

If you used a lowercase -o instead of the uppercase -O, wget would write the log of the download to the output file instead of the image itself.
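
Both suggestions above (checking that the image is valid, and ruling out the -o typo) come down to looking at what the .jpg file actually contains. A minimal sketch of that check, assuming only that the file should be a JPEG; the helper name `looks_like_jpeg` is made up for illustration and is not part of the workshop or BGNN_Core_Workflow:

```python
from pathlib import Path

# Every JPEG file begins with these three bytes (start-of-image marker
# followed by the first byte of the next marker).
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(path):
    """Return True if the file at `path` starts with the JPEG signature.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as handle:
        return handle.read(len(JPEG_MAGIC)) == JPEG_MAGIC


if __name__ == "__main__":
    image = Path("Images/bj373514.jpg")  # path taken from the log above
    if not image.exists():
        print(f"{image} does not exist")
    elif looks_like_jpeg(image):
        print(f"{image} starts with a JPEG signature")
    else:
        # A file written by `wget -o` would hold plain-text log lines instead.
        print(f"{image} is not a JPEG; it may contain wget log text")
```

This only inspects the first few bytes, so it can rule out a text log masquerading as an image but cannot prove the image is complete or undamaged.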

@thompsonmj
Contributor Author

I'll check on this today. I had accidentally overwritten my Snakefile by copying your solution, but I got it recovered, so I'll check whether it was that typo or something else. The solution does work fine, though.

@thompsonmj
Contributor Author

Here was the Snakefile I had built up after going through the episodes:

import pandas as pd

def get_image_url(wildcards):
        filename = config["filter_multimedia"]
        df = pd.read_csv(filename)
        row = df[df["arkID"] == wildcards.ark_id]
        url = row["accessURI"].item()
        return url

def get_image_filenames(wildcards):
	filename = config["filter_multimedia"]
	df = pd.read_csv(filename)
	ark_ids = df["arkID"].tolist()
	return expand("Images/{ark_id}.jpg", ark_id=ark_ids)

configfile: "config.yaml"

rule all:
	input: get_image_filenames

rule reduce:
	input: "multimedia.csv"
	params: rows="11"
	output: "reduce/multimedia.csv"
	resources:
		mem_mb=200
	shell: "head -n {params.rows} {input} > {output}"

rule download_image:
	input: config["filter_multimedia"]
	params: url=get_image_url
	output: "Images/{ark_id}.jpg"
	container: "docker://quay.io/biocontainers/gnu-wget:1.18--h60da905_7"
	shell: "wget -O {output} {params.url}"

checkpoint filter:
	input:
		script = "Scripts/FilterImages.R",
		fishes = config["reduce_multimedia"]
	output: config["filter_multimedia"]
	shell: "Rscript {input.script}"

module bgnn_core:
	snakefile:
		github("hdr-bgnn/BGNN_Core_Workflow", path="workflow/Snakefile", tag="1.0.0")

use rule generate_metadata from bgnn_core
use rule transform_metadata from bgnn_core
use rule crop_image from bgnn_core
use rule segment_image from bgnn_core

def get_summary_inputs(wildcards):
	filename = checkpoints.filter.get().output[0]
	df = pd.read_csv(filename)
	ark_ids = df["arkID"].tolist()
	return expand('Segmented/{arkID}_segmented.png', arkID=ark_ids)

rule summary:
	input:
		scripts="Scripts/SummaryReport.R",
		markdown="Scripts/Summary.Rmd",
		morphology=get_summary_inputs
	output: config["summary_report"]
	container: "docker://ghcr.io/rocker-org/tidyverse:4.2.2"
	shell: "Rscript {input.script}"

compared to the solution Snakefile:

import pandas as pd

configfile: "config.yaml"

rule all:
    input: config["summary_report"]

rule reduce:
    input: "multimedia.csv"
    params: rows="11"
    output: "reduce/multimedia.csv"
    shell: "head -n {params.rows} {input} > {output}"

checkpoint filter:
    input:
        script="Scripts/FilterImages.R",
        fishes=config["reduce_multimedia"]
    output: config["filter_multimedia"]
    shell: "Rscript {input.script}" 

def get_image_url(wildcards):
    filename = checkpoints.filter.get().output[0]
    df = pd.read_csv(filename)
    row = df[df["arkID"] == wildcards.ark_id]
    url = row["accessURI"].item()
    return url 

rule download_image:
    input: config["filter_multimedia"]
    params: url=get_image_url
    output: "Images/{ark_id}.jpg"
    container: "docker://quay.io/biocontainers/gnu-wget:1.18--h60da905_7"
    shell: "wget -O {output} {params.url}"


module bgnn_core:
    snakefile:
        github("hdr-bgnn/BGNN_Core_Workflow", path="workflow/Snakefile", tag="1.0.0")

use rule generate_metadata from bgnn_core
use rule transform_metadata from bgnn_core
use rule crop_image from bgnn_core
use rule segment_image from bgnn_core

def get_segmentation_files(wildcards):
    filename = checkpoints.filter.get().output[0]
    df = pd.read_csv(filename)
    ark_ids = df["arkID"].tolist()
    return expand("Segmented/{ark_id}_segmented.png", ark_id=ark_ids)

rule summary:
    input:
       script="Scripts/SummaryReport.R", 
       segmentation=get_segmentation_files
    output: config["summary_report"]
    container: "docker://ghcr.io/rocker-org/tidyverse:4.2.2"
    shell: "Rscript {input.script}"
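
One way to see where the two Snakefiles differ without the tab-versus-space indentation getting in the way is a whitespace-insensitive diff. The sketch below uses Python's standard difflib; the function names `normalize` and `snakefile_diff` are hypothetical, and the two short strings are just the `rule all` blocks from the listings above rather than the full files:

```python
import difflib


def normalize(text):
    """Strip indentation and trailing whitespace, and drop blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def snakefile_diff(mine, solution):
    """Return unified-diff lines between two Snakefile texts.

    Indentation is ignored, so a tabs-versus-spaces difference alone
    produces no output. Hypothetical helper for illustration only.
    """
    return list(
        difflib.unified_diff(
            normalize(mine),
            normalize(solution),
            fromfile="mine",
            tofile="solution",
            lineterm="",
            n=0,  # no context lines, only the changed lines
        )
    )


# The `rule all` blocks from the two listings above.
mine = "rule all:\n\tinput: get_image_filenames\n"
solution = 'rule all:\n    input: config["summary_report"]\n'

for line in snakefile_diff(mine, solution):
    print(line)
```

To compare the real files, the two strings could be replaced with the contents read from disk, for example with `pathlib.Path(...).read_text()`.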


2 participants