Skip to content

Latest commit

 

History

History
82 lines (57 loc) · 3.25 KB

augustus.md

File metadata and controls

82 lines (57 loc) · 3.25 KB
layout title parent
default
Augustus
2. Program guides

Augustus

Augustus is a program that predicts genes in eukaryotic genomic sequences. It can be run online, with a server for smaller files and one for larger files, or locally. The local version of Augustus can be installed through conda. This project includes an example augustus conda environment.

Predicting genes in a eukaryotic FASTA nucleic acid file using augustus

augustus can be used to predict genes as follows:

$ augustus --species=species_name input_file.fna > output_file.gff

In this command...

  1. --species is used to specify the target species for gene predictions (species_name).
  2. input_file.fna is the input FASTA nucleic acid file (.fna).
  3. output_file.gff is the general feature format (GFF) genome annotation output file. Lines beginning with # are Augustus comments: these lines do not follow the GFF structure.

The following command gives the list of valid species names for use with Augustus:

$ augustus --species=help

Extracting the FASTA amino acid sequences of predicted genes from an Augustus annotation

The genome annotation file produced by augustus (output_file.gff) contains the amino acid sequences of predicted genes in comment lines. These amino acid sequences can be extracted to a FASTA file with the following command:

$ getAnnoFasta.pl output_file.gff

The amino acid sequences will be written to output_file.aa. This is a FASTA amino acid (.faa). The extension of this file can be changed from ".aa" to ".faa" with the following command:

$ mv output_file.aa output_file.faa

Removing comments from Augustus annotations

Genome annotations produced by Augustus follow the Generic Feature Format, with the addition of comment lines for amino acid sequences. These are the same FASTA amino acid sequences that are extracted using getAnnoFasta.pl. These lines begin with the character #, and removing them results a standard GFF file.

Here is one method for removing these amino acid lines, using grep -v to select lines which do not contain the # character:

$ grep -v "#" augustus_annotation.gff > clean_augustus_annotation.gff

Demonstration

In this video, augustus is used to predict genes in example_nucleotide_sequence.fasta. This results in a genome annotation file: augustus_example.gff. The script getAnnoFasta.pl is used to extract the amino acid sequences in this genome annotation file to a new FASTA amino acid file: augustus_example.aa. The mv command is used to change the extension of this file from ".aa" to ".faa".

asciicast

See also

References