Mason M Lai edited this page Dec 20, 2016 · 2 revisions

Overview

This repository contains a simple Java framework for processing bioinformatics data. Currently, the main components of the Bio framework are

  • classes for handling various file formats: sequence (FASTA and FASTQ), annotation (BED and BEDPE) and alignment (BAM)
  • parsers for reading (and writing, in the case of BAM files) the above file formats
  • tree data structures for efficient storage and retrieval of annotation and alignment objects

You'll need the htsjdk library. The classes for handling BAM files and BAM records are simple wrapper classes around corresponding htsjdk classes.

Note: This isn't a complete guide. It's a short introduction to bring lab members who are familiar with previous codebases up to speed.

Packages

io

This package contains parsers for the various file formats. Each parser can be used as either an iterator or a stream. Examples in later sections demonstrate both use cases.
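
As a rough illustration of how one object can serve as both an iterator and a stream, here is a minimal sketch. LineParser and LineParserDemo are hypothetical names invented for this example, not classes from this framework, and the framework's parsers may be implemented differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical stand-in for a parser; yields one line per record.
class LineParser implements Iterator<String>, AutoCloseable {
    private final BufferedReader reader;
    private String nextLine; // null once the input is exhausted

    LineParser(BufferedReader reader) {
        this.reader = reader;
        advance();
    }

    // Read ahead by one line so hasNext() can answer without blocking later.
    private void advance() {
        try {
            nextLine = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public String next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        String current = nextLine;
        advance();
        return current;
    }

    // Wrap this iterator in a sequential stream of unknown size.
    Stream<String> stream() {
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(
                this, Spliterator.ORDERED | Spliterator.NONNULL),
            false);
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class LineParserDemo {
    // Iterator style: an explicit hasNext()/next() loop.
    static long countWithIterator(String text) {
        long count = 0;
        try (LineParser parser = new LineParser(new BufferedReader(new StringReader(text)))) {
            while (parser.hasNext()) {
                parser.next();
                count++;
            }
        }
        return count;
    }

    // Stream style: the same parser consumed through stream().
    static long countWithStream(String text) {
        try (LineParser parser = new LineParser(new BufferedReader(new StringReader(text)))) {
            return parser.stream().count();
        }
    }
}
```

Either style consumes the parser, so a given instance should be read through only one of them.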

datastructures

Implementations of interval trees. To store Annotated objects, use only GenomeTree. The other classes are basic implementations that do not take the reference into account.

  • Interval: General interface for things that have a defined start and end.
  • IntervalTree: Used for storing Intervals. Addition of duplicate values is ignored.
  • IntervalSetTree: Used for storing Intervals. Each node is a List which can hold duplicate values.
  • GenomeTree: In actuality, not one tree but a collection of IntervalSetTrees. A new IntervalSetTree is created whenever an object with an unseen reference is inserted. This should happen transparently.
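
The idea of keeping one container per reference and creating it on first insertion can be sketched as follows. All names here (Region, PerReferenceStore) are hypothetical, a plain list stands in for each per-reference IntervalSetTree, and half-open coordinates are assumed; this is not the GenomeTree implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interval with a reference name; coordinates assumed half-open [start, end).
class Region {
    final String reference;
    final int start;
    final int end;

    Region(String reference, int start, int end) {
        this.reference = reference;
        this.start = start;
        this.end = end;
    }

    boolean overlaps(Region other) {
        return reference.equals(other.reference)
            && start < other.end
            && other.start < end;
    }
}

// Hypothetical store: one container per reference, created on demand.
class PerReferenceStore {
    private final Map<String, List<Region>> byReference = new HashMap<>();

    // A container for an unseen reference is created transparently on insert.
    void insert(Region region) {
        byReference.computeIfAbsent(region.reference, k -> new ArrayList<>())
                   .add(region);
    }

    // Only the container for the query's reference is searched.
    // A real interval tree would avoid this linear scan.
    List<Region> overlappers(Region query) {
        List<Region> hits = new ArrayList<>();
        for (Region r : byReference.getOrDefault(query.reference, Collections.emptyList())) {
            if (r.overlaps(query)) {
                hits.add(r);
            }
        }
        return hits;
    }

    int referenceCount() {
        return byReference.size();
    }
}
```

Because each lookup first selects the container by reference, intervals on different references never need to be compared against each other.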

sequence

This package contains classes to represent unaligned sequence data. The only two classes are FastaSequence and FastqSequence, both of which implement Sequence. The following reads a FASTQ file, using the parser as an iterator.

Path fastqPath = Paths.get("/path/to/file.fastq");
try (FastqParser parser = new FastqParser(fastqPath, PhredEncoding.SANGER)) {
    while (parser.hasNext()) {
        FastqSequence sequence = parser.next();
        doSomething(sequence);
    }
}

annotation

This package contains classes to represent genomic regions. The structure of this package is very similar to that in guttmanlab-core. The main differences are that

  • SingleInterval is now Block for brevity
  • the main classes implement Annotated rather than Annotation; Annotation is now the name of the abstract class that underlies the main classes
  • many of the constructors have been removed in favor of builder objects

The following constructs a two-block annotation with a builder, then filters a BED file for all overlapping records.

BlockedAnnotation overlappingAnnot = (new BlockedBuilder())
    .addBlock(new Block("chr2", 1300, 1350, Strand.POSITIVE))
    .addBlock(new Block("chr2", 1400, 1450, Strand.POSITIVE))
    .build();

Path bedPath = Paths.get("/path/to/bedfile.bed");

try (BedParser parser = new BedParser(bedPath)) {
    parser.stream()
          .filter(x -> overlappingAnnot.overlaps(x))
          .map(BedFileRecord::toFormattedString)
          .forEach(x -> System.out.println(x));
}

The builder classes are static nested classes of the class they build. For example, GeneBuilder is in Gene. Eclipse doesn't know how to import them automatically. Import GeneBuilder and create an instance with

import edu.caltech.lncrna.bio.annotation.Gene.GeneBuilder;
GeneBuilder gb = new GeneBuilder();

Alternatively, you can import Gene, which Eclipse will do for you, and get the builder with

import edu.caltech.lncrna.bio.annotation.Gene;
Gene.GeneBuilder gb = new Gene.GeneBuilder();
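
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a builder written as a static nested class. Transcript, TranscriptBuilder and BuilderDemo are hypothetical names invented for this example; the framework's own builders may expose different methods.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class for illustration; it is not part of this framework.
class Transcript {
    private final String name;
    private final List<int[]> blocks;

    // Private: instances are created only through the builder.
    private Transcript(TranscriptBuilder builder) {
        this.name = builder.name;
        this.blocks = Collections.unmodifiableList(new ArrayList<>(builder.blocks));
    }

    String getName() {
        return name;
    }

    int getBlockCount() {
        return blocks.size();
    }

    // Static nested class: its qualified name is Transcript.TranscriptBuilder.
    static class TranscriptBuilder {
        private String name = "";
        private final List<int[]> blocks = new ArrayList<>();

        TranscriptBuilder setName(String name) {
            this.name = name;
            return this; // returning this allows calls to be chained
        }

        TranscriptBuilder addBlock(int start, int end) {
            blocks.add(new int[] {start, end});
            return this;
        }

        Transcript build() {
            return new Transcript(this);
        }
    }
}

class BuilderDemo {
    static Transcript example() {
        // Refer to the builder through its enclosing class.
        return new Transcript.TranscriptBuilder()
            .setName("example")
            .addBlock(1300, 1350)
            .addBlock(1400, 1450)
            .build();
    }
}
```

Since the builder is nested, code in another file can either import the nested class directly or refer to it through the enclosing class, as in the two snippets above.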

alignment

This package contains classes for dealing with alignments. Two main interfaces need explaining:

  • Aligned: Reads which potentially have a valid alignment.
  • Alignment: Reads which have a valid alignment. This interface extends the Annotated interface.

(The other interfaces, like SamRecord, just provide methods for checking CIGAR strings, flags, etc.)

SingleRead and ReadPair implement Aligned. These are what you get when you parse a single-read or paired-end BAM file, respectively. They may be unmapped, and consequently, they don't necessarily have coordinates and can't be used in any sort of interval tree or interval operation. From an Aligned, one can get an Alignment. SingleRead#getAlignment will return an Optional<SingleReadAlignment>, and ReadPair#getAlignment will return an Optional<PairedEndAlignment>.
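
The relationship between a possibly unmapped read and its optional alignment can be sketched with plain Java types. MaybeMappedRead and MappedRead are hypothetical stand-ins for the Aligned and Alignment roles described above, not classes from this framework.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical stand-in for the Alignment role: always has coordinates.
class MappedRead {
    final String name;
    final String reference;
    final int start;

    MappedRead(String name, String reference, int start) {
        this.name = name;
        this.reference = reference;
        this.start = start;
    }
}

// Hypothetical stand-in for the Aligned role: may or may not be mapped.
class MaybeMappedRead {
    private final String name;
    private final String reference; // null means the read is unmapped
    private final int start;

    MaybeMappedRead(String name, String reference, int start) {
        this.name = name;
        this.reference = reference;
        this.start = start;
    }

    // Empty for unmapped reads, so callers must handle the missing case explicitly.
    Optional<MappedRead> getAlignment() {
        return reference == null
            ? Optional.empty()
            : Optional.of(new MappedRead(name, reference, start));
    }
}

class OptionalDemo {
    // Keep only the reads that have an alignment, unwrapping each Optional.
    static List<MappedRead> mappedOnly(List<MaybeMappedRead> reads) {
        return reads.stream()
            .map(MaybeMappedRead::getAlignment)
            .filter(Optional::isPresent)
            .map(Optional::get)
            .collect(Collectors.toList());
    }
}
```

Only the unwrapped objects carry coordinates, which is why only they are suitable for interval trees and interval operations.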

Reading a BAM file is similar to reading any of the other formats this framework handles. Writing is a bit different. Other formats come with a toFormattedString method that returns a string which can be sent to STDOUT or written to a file in the standard manner. BAM files need a valid header, and they contain binary data. A BAM header can be extracted from an existing BAM file as a CoordinateSpace, or a readymade header such as CoordinateSpace.MM9 can be used. Alignments can be written to a BAM file with the BamFileWriter in the io package.

The code below reads a paired-end BAM file, filters the reads, and outputs the data into a new BAM file.

Path inputBamPath = Paths.get("/input/bam/file.bam");
Path outputBamPath = Paths.get("/output/bam/file.bam");

CoordinateSpace header = new CoordinateSpace(inputBamPath);

try (PairedEndBamParser reader = new PairedEndBamParser(inputBamPath);
     BamWriter writer = new BamWriter(outputBamPath, header)) {
    reader.stream()
          .map(ReadPair::getAlignment)                       // Get the alignment as an Optional<PairedEndAlignment>
          .filter(Optional::isPresent)                       // Only get the ones that have an alignment
          .map(Optional::get)                                // Unwrap the Optional to a PairedEndAlignment
          .filter(x -> x.getFirstReadInPair().getStrand()    // Apply some filters
                        .equals(Strand.POSITIVE))
          .filter(x -> x.getInsertSize() < 100)
          .forEach(x -> writer.addAlignment(x));             // Write to output BAM file.
}

testing

This package contains a test suite. The IO tests currently fail on machines other than mine because they reference local files. This will be fixed.
