SeqUnwinder

SeqUnwinder is a framework for characterizing class-discriminative motifs in a collection of genomic loci that have several (overlapping) annotation labels.

Downloading Executables

The following webpage will maintain executable JAR files for major versions: http://mahonylab.org/software/sequnwinder

Building from Source

If you want to build the code yourself, you will need to first download and build the seqcode-core library (https://github.com/seqcode/seqcode-core) and add its build/classes and lib directories to your CLASSPATH.

Dependencies:

  1. SeqUnwinder requires Java 8+.
  2. SeqUnwinder implements a multi-threaded version of ADMM to train the model. Hence, when using large datasets (tens of thousands of genomic sites), it is advisable to run on a system that allows multiprocessing.
  3. SeqUnwinder depends on MEME (tested with MEME version 4.10.2).

Citation:

Kakumanu, Akshay, et al. "Deconvolving sequence features that discriminate between overlapping regulatory annotations." bioRxiv (2017): 100511.

Running SeqUnwinder

On a typical dataset (~20,000 sites and ~8 annotation labels) SeqUnwinder takes a couple of hours to run.

Running from a jar file:

java -Xmx20G -jar sequnwinder.jar <options - see below>

In the above, the “-Xmx20G” argument tells Java to use up to 20GB of memory. If you built SeqUnwinder from the GitHub source and all classes are in your CLASSPATH, you can run it as follows:

java -Xmx20G org.seqcode.projects.sequnwinder.SeqUnwinder <options - see below>

Options (Required/important options are in bold.)

  1. General:
  • --out <prefix>: Output file prefix. All output will be put into a directory with the prefix name.
  • --threads <n>: Use n threads to train SeqUnwinder model. Default is 5 threads.
  • --debug: Flag to run in debug mode; prints extra output.
  • --memepath <path>: path to the meme bin dir (default: meme is in $PATH).
  2. Specifying the Genome:
  • --geninfo <genome info file>: This file should list the lengths of all chromosomes on separate lines using the format chrName<tab>chrLength. You can generate a suitable file from UCSC 2bit format genomes using the UCSC utility “twoBitInfo”. The chromosome names should be exactly the same as those used in your input list of genomic regions.

    The genome info files for some UCSC genome versions:
    | hg18 | hg19 | hg38 | mm8 | mm9 | mm10 | rn4 | rn5 | danRer6 | ce10 | dm3 | sacCer2 | sacCer3 |

  • --seq <path> : A directory containing fasta format files corresponding to every named chromosome is required.

  3. Input Genomic Regions:
  • --genregs <file>: Genomic regions with annotations filename OR --genseqs <file>: Sequences with annotations filename. A tab-delimited file listing genomic points/sequences and their corresponding annotations/labels. A simple example:
    GenRegs file:
    chr10:100076604	enhancer;shared
    chr6:100316177	promoter;celltypeA
    
    GenSeqs file:
    ATTGC....TTA	enhancer;shared
    CGTAA....GGT	promoter;celltypeA
    
  • --win <int>: Size of the genomic regions in bp. Default = 150.
  • --makerandregs: Flag to make random genomic regions as an extra outgroup class in classification (only applicable when genome is provided).
  4. SeqUnwinder Model Options:
  • --mink <int>: Minimum length of K-mer to consider. Default = 4.

  • --maxk <int>: Maximum length of K-mer to consider. Default = 5.

    For most SeqUnwinder analyses described in the manuscript, K-mers of lengths 4 and 5 showed optimal performance. However, with larger datasets (with more data instances for training), maxk can be increased to 6 or 7.

  • --r <value>: Regularization coefficient in the model. For most SeqUnwinder applications, with ~20k genomic sites, ~6 labels, and K-mers of lengths 4 and 5, a value of 10.0 has been very effective. However, the optimal value may vary between datasets. You may want to try a range of values and choose the one that performs best (in terms of test accuracy).

  • --x <int>: Number of folds for cross validation. Default = 3.

  • --minsubclass <int>: Minimum number of sites needed to constitute a subclass. Default = 200.

  • --mergelow: Flag to merge subclasses with fewer than "minsubclass" sites with other relevant classes. By default, all subclasses with fewer than 200 sites are removed.

  5. Other SeqUnwinder options (using the default options is highly recommended):
  • --minscanlen <value>: Minimum length of the window to scan K-mer models. Default=8.
  • --maxscanlen <value>: Maximum length of the window to scan K-mer models. Default=14.
  • --hillsthresh <value>: Scoring threshold to identify hills. Default=0.1.
  • --mememinw <value>: minw arg for MEME. Default=6.
  • --mememaxw <value>: maxw arg for MEME. Default=13. This value should always be less than the value of --maxscanlen.
  • --memenmotifs <int>: Number of motifs MEME should find in each condition. Default=3.
  • --memeargs <args> : Additional args for MEME (default: -dna -mod zoops -revcomp -nostatus)
  • --memesearchwin <value>: Window around hills to search for discriminative motifs. Default=16
  • --a <int>: Maximum number of allowed ADMM iterations. Default=400.
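
The input requirements listed above (tab-delimited columns, and chromosome names in the --geninfo file matching those in the --genregs file exactly) can be checked before starting a run that may take hours. The following is a minimal Python sketch, not part of SeqUnwinder: the function names `read_geninfo` and `check_genregs` are hypothetical, and it assumes that the --win window is centered on each point and that coordinates are 1-based. It also counts sites per distinct annotation string, which may help when choosing --minsubclass; whether SeqUnwinder defines subclasses in exactly this way is an assumption.

```python
from collections import Counter


def read_geninfo(lines):
    """Parse 'chrName<tab>chrLength' lines into {chrName: chrLength}."""
    lengths = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, length = line.split("\t")
        lengths[name] = int(length)
    return lengths


def check_genregs(lines, chrom_lengths, win=150):
    """Check 'chr:pos<tab>label;label' lines; return (problems, label_counts)."""
    problems = []
    label_counts = Counter()
    half = win // 2  # assumption: the window is centered on the point
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(f"line {lineno}: expected 2 tab-delimited fields")
            continue
        point, labels = fields
        chrom, _, pos = point.partition(":")
        if chrom not in chrom_lengths:
            problems.append(f"line {lineno}: {chrom!r} is not in the genome info file")
            continue
        if not pos.isdigit():
            problems.append(f"line {lineno}: bad coordinate in {point!r}")
            continue
        center = int(pos)
        if center - half < 1 or center + half > chrom_lengths[chrom]:
            problems.append(f"line {lineno}: window around {point} leaves the chromosome")
        # Sites are counted per full annotation string, e.g. 'enhancer;shared'.
        label_counts[labels] += 1
    return problems, label_counts


if __name__ == "__main__":
    # Chromosome lengths here are placeholders, not real genome values.
    info = read_geninfo(["chr10\t200000000", "chr6\t200000000"])
    regs = ["chr10:100076604\tenhancer;shared", "chr6:100316177\tpromoter;celltypeA"]
    problems, counts = check_genregs(regs, info)
    print(problems)
    print(counts)
```

In practice the two line lists would come from open file handles for the genome info file and the GenRegs file. An empty `problems` list only means these basic checks passed.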

Example

This example runs SeqUnwinder v0.1.2 on a simulated dataset. Simulated sequences to run this example can be found here.

Command:

java -Xmx20G -jar sequnwinder.jar --out example --threads 10 --debug --memepath path-to-meme --geninfo mm10.info --seq path-to-genomes/mm10/ --genseqs simulateOverlap.seqs --win 150 --mink 4 --maxk 5 --r 10 --x 3 --maxscanlen 15

Results can be found here
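
The description of --r suggests trying a range of regularization values and keeping the one with the best test accuracy. One way to script such a sweep is sketched below in Python, under stated assumptions: the helper names `build_command` and `sweep_commands` are hypothetical, the values 1, 10 and 100 are only illustrative, and each run is written to its own output prefix. The sketch uses only options documented above and also enforces the rule that --mememaxw must be less than --maxscanlen. It only prints the commands; how test accuracy is reported in the output is not covered here and has to be read from the SeqUnwinder results themselves.

```python
import shlex


def build_command(out_prefix, r_value, jar="sequnwinder.jar", heap="20G",
                  mememaxw=13, maxscanlen=14, extra_args=()):
    """Build one SeqUnwinder command line as a list of arguments."""
    # The options list states that mememaxw should be less than maxscanlen.
    if mememaxw >= maxscanlen:
        raise ValueError("--mememaxw should be less than --maxscanlen")
    return [
        "java", f"-Xmx{heap}", "-jar", jar,
        "--out", f"{out_prefix}_r{r_value}",  # one output directory per run
        "--r", str(r_value),
        "--mememaxw", str(mememaxw),
        "--maxscanlen", str(maxscanlen),
        *extra_args,
    ]


def sweep_commands(out_prefix, r_values, **kwargs):
    """Return one command per regularization value."""
    return [build_command(out_prefix, r, **kwargs) for r in r_values]


if __name__ == "__main__":
    # Shared arguments copied from the example command above.
    shared = ["--genseqs", "simulateOverlap.seqs", "--win", "150",
              "--mink", "4", "--maxk", "5", "--x", "3"]
    for cmd in sweep_commands("example", [1, 10, 100], extra_args=shared):
        print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run` as is. Printing the commands first makes it easy to review them or submit them to a cluster scheduler.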

Contact

For queries, please contact Akshay (auk262@psu.edu) or Shaun Mahony (mahony@psu.edu).

Major History:

Version 0.1.5 (2021-08-03): Fixing bug in --genseqs sequence loading. Also added --minsubclass option.

Version 0.1.4 (2020-10-29): Fixing problems with mislabeling in the motif heatmap in the final report.

Version 0.1.3 (2018-02-03): Fixing an issue with the discriminative performance evaluation of MEME-derived interpretable motifs. The effect of this issue was that some discriminative motifs would not have been reported to the user in the final SeqUnwinder results.

Version 0.1.2 (2017-05-08): Several minor updates over the previous version. SeqUnwinder now automatically estimates "k" for k-means clustering of hills. Additional option provided to deal with sub-classes with very few training instances (see --mergelow). Several option names have been reformatted for consistency.

Version 0.1 (2016-12-09): Initial release to support manuscript submission.