
Inside the Pancancer Launcher

Solomon Shorser edited this page Nov 18, 2015 · 4 revisions

##Inside the Pancancer Launcher

If you follow the directions above, you will find yourself dropped into the Docker container that has all of our launcher tools. The prompt will look something like this (the hostname, "f27e86874dfb" in this case, and possibly the version number, will be different):

[LAUNCHER L4A] ubuntu@f27e86874dfb:~/arch3$

###Running workflows

This Pancancer Launcher can run the BWA workflows on VMs in AWS. To run the workflow, you need an INI file. The INI file contains all of the configuration details about what you want to happen for a specific run of a workflow, such as the names and URLs for input and output files.

The most basic INI file for BWA will look like this:

useGNOS=false
input_reference=${workflow_bundle_dir}/Workflow_Bundle_BWA/2.6.6/data/reference/bwa-0.6.2/genome.fa.gz

# Comma-separated list of S3 URLs to directories of BAMs. Leave unchanged to use test data.
input_file_urls=s3://bwa.test.download/4fb18a5a-9504-11e3-8d90-d1f1d69ccc24,s3://bwa.test.download/9c414428-9446-11e3-86c1-ab5c73f0e08b
# Comma-separated list of BAM files. Leave unchanged to use test data.
input_bam_paths=4fb18a5a-9504-11e3-8d90-d1f1d69ccc24/hg19.chr22.5x.normal2.bam,9c414428-9446-11e3-86c1-ab5c73f0e08b/hg19.chr22.5x.normal.bam
# S3 URL to output directory or bucket. YOU NEED TO CONFIGURE THIS!
output_file_url=s3://bwa.test.download/results/<MY RESULTS BUCKET>/

Here is a summary of the settings you will need to configure:

  • input_file_urls: A comma-separated list of S3 URLs that refer to directories (not files) within buckets. The workflow downloads the contents of each directory into a local directory with the same name as the final path segment of the URL. For example, specifying s3://bwa.test.download/4fb18a5a-9504-11e3-8d90-d1f1d69ccc24 will cause the workflow to create its own directory named 4fb18a5a-9504-11e3-8d90-d1f1d69ccc24 containing the contents of the S3 directory.
  • input_bam_paths: A comma-separated list of relative paths to BAM files. The BAMs in this list must appear in the same order as their parent directories appear in input_file_urls.
  • output_file_url: The S3 URL to which your results will be uploaded. Choose an S3 bucket on which you have write permission. When the workflow finishes executing, check this bucket for the uploaded results.
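Since the two input lists must stay aligned, a quick shell check can catch ordering mistakes before anything is launched. This is only a sketch, not part of the launcher tooling: it writes the test data from above into a hypothetical /tmp path and compares the two lists entry by entry.

```shell
# Sketch: write the example input lists (test data from above) to a
# throwaway INI, then verify that the directory component of each BAM
# path matches the final path segment of the corresponding S3 URL.
cat > /tmp/bwa-test.ini <<'EOF'
useGNOS=false
input_file_urls=s3://bwa.test.download/4fb18a5a-9504-11e3-8d90-d1f1d69ccc24,s3://bwa.test.download/9c414428-9446-11e3-86c1-ab5c73f0e08b
input_bam_paths=4fb18a5a-9504-11e3-8d90-d1f1d69ccc24/hg19.chr22.5x.normal2.bam,9c414428-9446-11e3-86c1-ab5c73f0e08b/hg19.chr22.5x.normal.bam
EOF

# Last path segment of each URL, one per line.
urls=$(grep '^input_file_urls=' /tmp/bwa-test.ini | cut -d= -f2- | tr ',' '\n' | awk -F/ '{print $NF}')
# Leading directory of each BAM path, one per line.
bams=$(grep '^input_bam_paths=' /tmp/bwa-test.ini | cut -d= -f2- | tr ',' '\n' | cut -d/ -f1)

[ "$urls" = "$bams" ] && echo "input lists are aligned"
```

If the two lists are out of order, the comparison fails and nothing is printed, which is your cue to fix the INI before scheduling jobs.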

For more information about the BWA INI files, you can read about them here.

NOTE: You can use the default values for input_file_urls and input_bam_paths as shown above, if you want to run the workflow with existing test data.

####Generating an INI file

To generate an INI file, run:

$ pancancer ini-gen

A new INI file will be generated in ~/ini-dir.

You will want to edit this file before generating job requests. Please make any workflow-specific changes now, before continuing.
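For example, a single sed substitution can set output_file_url in a generated INI. This is a sketch: the /tmp INI path and the results bucket below are placeholders, not real resources, so substitute your own file and bucket.

```shell
# Sketch: rewrite output_file_url in an INI file in place.
# The INI path and bucket name are hypothetical placeholders.
ini=/tmp/workflow.ini
printf 'useGNOS=false\noutput_file_url=s3://bwa.test.download/results/CHANGE_ME/\n' > "$ini"

# Replace the whole output_file_url line with your own bucket URL.
sed -i 's|^output_file_url=.*|output_file_url=s3://my-results-bucket/bwa-output/|' "$ini"
grep '^output_file_url=' "$ini"
```

Using `|` as the sed delimiter avoids having to escape the slashes in the S3 URL.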

####Running the worker

NOTE: The workers launched by the Pancancer Launcher will be on-demand instances by default. On-demand instances are more reliable and launch faster, but cost more than spot instances. If you wish to use spot instances, read this section before proceeding.

To begin the process of provisioning a worker VM that will run your workflow, run this command:

$ pancancer run-workers

This command will cause the Pancancer Launcher to begin the process of provisioning one VM for every INI file in ~/ini-dir, up to the limit you specified when you started the launcher.

The process that provisions VMs should detect this request within a couple of minutes and begin provisioning a new VM. Provisioning a new VM may take a while because we set up various infrastructure on these VMs using Ansible. The process was designed for the PanCancer workflows, which can run for days or weeks, so the startup time of the worker VMs has not yet been optimized.

####Monitoring Progress

There are a few ways to monitor progress. You can watch the provisioner log with this command:

$ tail -f ~/arch3/logs/provisioner.out

Type Ctrl-C to terminate tail.

You can also monitor progress from the AWS EC2 console. Look for the new instance that is starting up, named instance_managed_by_<YOUR_FLEET_NAME>.

Once provisioning is complete, you should see output in ~/arch3/logs/provisioner.out that looks similar to this (the exact numbers for "ok" and "changed" may vary, but everything is fine as long as "unreachable" and "failed" are 0):

PLAY RECAP ********************************************************************
i-fb797f50                 : ok=125  changed=86   unreachable=0    failed=0

[2015/09/02 18:06:16] | Finishing configuring i-fb797f50
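If you would rather check the recap mechanically than by eye, something like the following works. This is a sketch that tests a sample recap line inline; in practice you would pipe ~/arch3/logs/provisioner.out through the same grep.

```shell
# Sketch: check an Ansible PLAY RECAP line for failures.
# A sample recap line is embedded here; against a live launcher you
# would grep ~/arch3/logs/provisioner.out instead.
recap='i-fb797f50                 : ok=125  changed=86   unreachable=0    failed=0'

if printf '%s\n' "$recap" | grep -Eq 'unreachable=0[[:space:]].*failed=0'; then
    echo "provisioning looks healthy"
else
    echo "check the log: a host was unreachable or a task failed"
fi
```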

At this point, the job should begin executing on the new VM. You can check the status of all jobs using the pancancer status jobs command:

$ pancancer status jobs
 status  | job_id |               job_uuid               |  workflow  |      create_timestamp      |      update_timestamp
---------+--------+--------------------------------------+------------+----------------------------+----------------------------
 RUNNING |      1 | a3a4da7b-2136-4431-a117-e903590c05d8 | BWA_2.6.6  | 2015-09-02 19:45:26.023313 | 2015-09-02 19:45:26.023313

When the job has completed successfully, you should see a status result that looks like this:

$ pancancer status jobs
 status  | job_id |               job_uuid               |  workflow  |      create_timestamp      |      update_timestamp
---------+--------+--------------------------------------+------------+----------------------------+----------------------------
 SUCCESS |      1 | a3a4da7b-2136-4431-a117-e903590c05d8 | HelloWorld | 2015-09-02 19:45:26.023313 | 2015-09-02 20:04:27.033118

To write the full results of a worker to a file, use the status job_results command. It has this form:

$ cd ~/arch3
$ pancancer status job_results --type stdout  --job_id 1
Job results (stdout) have been written to /home/ubuntu/arch3/job_1.stdout
$ pancancer status job_results --type stderr  --job_id 1
Job results (stderr) have been written to /home/ubuntu/arch3/job_1.stderr

Worker VMs report the stdout and stderr from SeqWare back to your launcher's database. The command above extracts this data and writes it to a text file, making it easier to inspect the details of the workflow's execution.
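Once a few job_<id>.stdout and job_<id>.stderr files have been written, a grep over them is a quick way to spot trouble. This is a sketch: the file contents below are stand-ins, and the assumption that failures print lines containing "ERROR" is ours, not something the launcher guarantees.

```shell
# Sketch: scan extracted stderr files for error lines. File names follow
# the job_<id>.stderr pattern shown above; the contents written here are
# stand-ins, and the "ERROR" marker is an assumed convention.
cd /tmp
printf 'step 1 ok\nstep 2 ok\n' > job_1.stderr
printf 'step 1 ok\nERROR: upload failed\n' > job_2.stderr

# List only the files that contain an error line.
grep -l 'ERROR' job_*.stderr
```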

At this point, you have successfully installed the Pancancer Launcher, and used it to schedule and execute a workflow!

When looking at your AWS EC2 console, you will notice that, when a workflow finishes successfully, the VM it was running on is automatically terminated. This is done to save on computing resources. The Pancancer workflows write their data to a repository before their VM is terminated. The VM that is serving as your launcher will not be terminated until you choose to do so.

You can now verify that your workflow results have been uploaded to the URL that you put in your INI file for output_file_url.

If a workflow fails, you will see that its status is "FAILED". To see the output from the VM, you can use the pancancer status job_results command like this:

$ pancancer status job_results --job_id <THE JOB_ID OF THE JOB THAT FAILED> --type stdout 

This will write data to a file containing the standard output that SeqWare captured while running the workflow. You can also get the standard error messages by running the above command with stderr instead of stdout.

If this is not enough information to properly debug the failure, you can try using the --keep_failed option when running the pancancer run-workers command, as explained in the Troubleshooting section.

###What's Next?

Your next step, now that you have successfully run one workflow on one VM, could be to create several INI files and then execute them in a larger fleet. See here for instructions on how to reconfigure your launcher to set a larger value for the maximum fleet size.
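One simple way to stage several INI files is to copy a template, changing only the file name and then editing the per-run values. This sketch writes into /tmp as a stand-in for ~/ini-dir on the launcher; the template contents and run names are placeholders.

```shell
# Sketch: stage one INI per intended worker by copying a template.
# /tmp paths stand in for ~/ini-dir; on the launcher you would write
# into ~/ini-dir directly and then run "pancancer run-workers".
rm -rf /tmp/ini-dir && mkdir -p /tmp/ini-dir
printf 'useGNOS=false\n' > /tmp/template.ini

for n in 1 2 3; do
    cp /tmp/template.ini "/tmp/ini-dir/run-$n.ini"
done

ls /tmp/ini-dir
```

Remember that the launcher only provisions up to the fleet-size limit you configured, no matter how many INI files are present.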

For useful tips and troubleshooting help, see the Troubleshooting page.
