
02 Get started

Martha Kandziora edited this page May 2, 2021 · 7 revisions

1. Write your analysis file

To get started, you need a Python analysis file in which you define your input; example scripts are in the example_setups folder. Settings for the alignment building process and the filtering options are made in the configuration file - see 03 Settings options. For more examples, see 04 advanced analyses.

A basic analysis adds sequences that are similar to the input. Sequences are retrieved from the database using a BLAST search and are filtered before they are added to the alignment.

To write your own analysis file, check out the different examples provided. Generally, the following input needs to be provided:

  • seqaln: path to your single sequence or alignment file; it must be a single-gene alignment
  • mattype: file format of your alignment - currently supported: “fasta”, “newick”, “nexus”, “nexml”, “phylip”
  • trfn (optional): path to the file containing the corresponding phylogeny; all tips must also be present in the alignment file.
  • schema_trf: file format of your phylogeny file - currently supported: “fasta”, “newick”, “nexus”, “nexml”, “phylip”
  • id_to_spn: path to a comma-delimited file in which tip labels are matched to species names; an example file can be found in tests/data/tiny_test_example/test_nicespl.csv. Species names must use '_' instead of whitespace; this also applies to species epithets that contain a '-'.
  • workdir: path to your working directory, the folder where intermediate and result files shall be stored.
  • configfi: path to your configuration file, configuration options are explained in "03 Settings options".
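
The formatting rule for species names in the id_to_spn file can be sketched as below. This is only an illustration, not part of PhylUp: the function names, the tip labels, and the column order (tip label first, species name second, no header) are assumptions, so compare the result with test_nicespl.csv. Replacing '-' with '_' is one reading of the note on epithets and can be switched off.

```python
import csv
import io
import re


def to_species_label(name, replace_hyphen=True):
    """Return a species name with whitespace (and optionally '-') written as '_'."""
    label = re.sub(r"\s+", "_", name.strip())
    if replace_hyphen:
        # Assumption: hyphens in epithets are also written as '_'.
        label = label.replace("-", "_")
    return label


def write_id_to_spn(rows, handle):
    """Write (tip_label, species_name) pairs as comma-delimited lines."""
    writer = csv.writer(handle, lineterminator="\n")
    for tip_label, species in rows:
        writer.writerow([tip_label, to_species_label(species)])


# Hypothetical tip labels; the species names are only illustrative.
buffer = io.StringIO()
write_id_to_spn([("tip_1", "Senecio vulgaris"), ("tip_2", "Senecio  doria")], buffer)
print(buffer.getvalue())
```

To write a real file, pass an opened file handle instead of the in-memory buffer.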

Note: Paths must either be absolute, starting from your root directory (e.g. /home/USER/PhylUp/path/to/file), or relative to the PhylUp main folder (./path/to/file).
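
As a rough sketch, the input definitions at the top of an analysis file could look like the following. The variable names come from the list above, but every path is a placeholder, and the helper missing_inputs is a hypothetical check that is not part of PhylUp. How these variables are then handed to PhylUp is shown in the scripts in example_setups and is not reproduced here.

```python
import os

# Placeholder values: replace every path with your own files.
inputs = {
    "seqaln": "./path/to/alignment.fasta",
    "trfn": "./path/to/tree.tre",          # optional, may be None
    "id_to_spn": "./path/to/tip_to_species.csv",
    "configfi": "./path/to/config_file",
}
mattype = "fasta"
schema_trf = "newick"
workdir = "./path/to/workdir"  # not checked below, it may not exist yet


def missing_inputs(paths, base_dir="."):
    """Return the names whose file does not exist.

    Relative paths are resolved against base_dir (the PhylUp main folder);
    entries set to None are skipped.
    """
    missing = []
    for name, path in paths.items():
        if path is None:
            continue
        full = path if os.path.isabs(path) else os.path.join(base_dir, path)
        if not os.path.exists(full):
            missing.append(name)
    return missing


# With the placeholder paths above, every entry is reported as missing.
print(missing_inputs(inputs))
```

Running such a check before a long BLAST run catches path typos early.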

Besides these standard inputs, the following additional options are currently supported:

  • ignore_acc_list: a list of sequences that shall not be added; if a listed sequence is part of the input, it will be removed. It must be formatted as a Python list containing the GenBank identifiers (e.g. [accession number, accession number]).
  • status_end: tells the program how many rounds of BLAST searches to run. If it is set to 0, the input sequences are blasted once and the sequences found are added to the alignment. If it is set to 1, the sequences newly retrieved in the initial BLAST search are blasted as well and those results are also added to the alignment, so that the input plus the first round of newly found sequences is blasted. If it is set to 2, one more round is added, and so on.
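
The effect of status_end and ignore_acc_list can be illustrated with a toy model. This is not PhylUp's implementation: the function blast_rounds, the search callback, and the sequence names are made up for illustration, and a dictionary lookup stands in for the BLAST search.

```python
def blast_rounds(input_ids, search, status_end, ignore_acc_list=()):
    """Toy model of repeated search rounds.

    Round 0 searches with the input sequences; every further round searches
    with the sequences that were newly found in the previous round.
    """
    ignored = set(ignore_acc_list)
    # Input sequences on the ignore list are dropped.
    kept = [acc for acc in input_ids if acc not in ignored]
    known = set(kept)
    to_blast = list(kept)
    for _ in range(status_end + 1):
        new_hits = []
        for acc in to_blast:
            for hit in search(acc):
                if hit not in known and hit not in ignored:
                    known.add(hit)
                    new_hits.append(hit)
        if not new_hits:
            break  # nothing new was found, further rounds add nothing
        to_blast = new_hits
    return known


# Made-up hit table standing in for a BLAST search.
toy_hits = {"seq_a": ["seq_b"], "seq_b": ["seq_c"], "seq_c": ["seq_d"]}


def toy_search(acc):
    return toy_hits.get(acc, [])


for rounds in (0, 1, 2):
    print(rounds, sorted(blast_rounds(["seq_a"], toy_search, rounds)))
```

In this toy table each extra round reaches one more sequence, which is why higher status_end values increase both sampling and run time.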

2. Start updating your alignment:

Start the analyses from your PhylUp main folder:

python3 ./path/to/file/analysis-file.py

Unfortunately, blasting takes quite some time, especially for larger alignments (for speed-up options, see the configuration options). An alternative is to run the analyses on a cluster.
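
Because a run can take a long time, it can be convenient to start it from a small wrapper that keeps the screen output in a file. The following is a minimal sketch using only the Python standard library; run_analysis and its arguments are hypothetical names and not part of PhylUp.

```python
import subprocess
import sys


def run_analysis(script, phylup_dir, log_path):
    """Run an analysis script from phylup_dir and save its output to log_path.

    script is resolved relative to phylup_dir (the PhylUp main folder);
    log_path is resolved relative to the directory of the calling process.
    Returns the exit code of the script.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(
            [sys.executable, script],
            cwd=phylup_dir,
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return result.returncode


# Example call with placeholder paths:
# run_analysis("./path/to/file/analysis-file.py", "/home/USER/PhylUp", "run_output.txt")
```

A non-zero return value indicates that the script stopped with an error; the saved output then shows where.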

3. Information about the examples

To get started I provide several example files in the folder example_setups. They are ordered by the complexity of the analyses.

The provided example files are intended to show how to set up your own analysis and configuration files; they do not necessarily generate meaningful or completely sampled alignments of the clade of interest. Instead, they are designed to demonstrate their purpose quickly. The required data files are provided in data.

The analysis files themselves are simple; the differences are in the configuration files.

  • The file 01_example_aln_simple.py updates an alignment, to include a maximum of 5 samples per species within the genus Senecio.
  • The file 01_example_singleseq_simple.py generates an alignment based on a single seed sequence, to include a maximum of 5 samples per species within the genus Senecio.
  • The file 02_example_aln_multiple_mrca.py updates an alignment with sequences belonging to different mrca, as defined in the configuration file.
  • The file 02_example_aln_notree.py updates an alignment without providing a phylogeny.
  • The file 03_example_aln_addunpublishedsequences.py updates an alignment with sequences from a user-supplied database - see data/unpublished_seqs for an example of how it needs to look.
  • The file 03_example_aln_downto_genus.py updates an alignment, including 10 samples per genus for the tribe Senecioneae.
  • The file 04_example_different_rank_sampling.py updates an alignment to include different numbers of samples per defined taxonomic rank: 1 sample per species within Senecio, 2 samples per genus within Senecioneae, and 2 samples per tribe for Asteroideae. This kind of setup is often used, for example, to generate alignments for molecular dating that include fossils from different branches of the phylogeny.
  • The file 04_example_multiple_loci.py updates different loci for a given clade of interest using the preferred taxon option to increase sampling across loci. The different settings of the two 04_example files can also be combined.

4. Navigating the output:

During a PhylUp run, several files are written to the working directory. Here is a short introduction to what they are:

  • all_new_seqs.updated: csv file with all information available about the newly added sequences
  • fulltree.raxml.: all the files generated by RAxML-NG if the tree updating was enabled
  • logfile: short summary of how many sequences were added/filtered during a PhylUp run
  • table.updated: all sequences that were considered for addition, some of which were removed by the filtering steps. It does not include sequences that were not added because they are not part of the mrca or because they were too short; for those, see wrong_mrca.csv and wrong_seq_length.csv
  • updt_aln.fasta: updated alignment; there is also a relabeled version with tipnames as a combination of taxon name and accession number
  • updt_tre.tre: updated tree or unresolved tree if no phylogeny was provided as input; there is also a relabeled version with tipnames as a combination of taxon name and accession number
  • orig_inputaln.fasta: data as supplied in the input
  • orig_tre.tre: data as supplied in the input
  • blast: folder that holds all blast results
  • tmp: folder that holds taxonomic information, complete sequences and internally used files.
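
For a quick look at a finished run, a sketch like the following checks which of the outputs listed above are present and counts the sequences in a FASTA file. The helper names are hypothetical and not part of PhylUp, and it assumes that the files in your working directory are named exactly as in the list.

```python
import os

# Names taken from the list above; adjust them if your files differ.
EXPECTED_OUTPUTS = [
    "all_new_seqs.updated",
    "logfile",
    "table.updated",
    "updt_aln.fasta",
    "updt_tre.tre",
    "orig_inputaln.fasta",
    "orig_tre.tre",
    "blast",
    "tmp",
]


def list_outputs(workdir):
    """Map each expected output name to True if it exists in workdir."""
    return {
        name: os.path.exists(os.path.join(workdir, name))
        for name in EXPECTED_OUTPUTS
    }


def count_fasta_records(path):
    """Count the sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Comparing count_fasta_records for orig_inputaln.fasta and updt_aln.fasta in the same working directory gives the net change in the number of sequences.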

5. Concatenate different single-gene PhylUp runs:

As an example: after generating multiple single-gene PhylUp runs, the data can be combined using phylogenetic concatenation. See example_concat_phylup.py in the repository for an example. Before executing the file, you need to update the paths to the single-gene working directories and set a name for the concatenated working directory.
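
The general idea of concatenation can be sketched independently of PhylUp. The function below is a generic illustration, not the code in example_concat_phylup.py: it joins single-gene alignments by taxon name and pads taxa that are missing from a locus with gap characters. The taxon names, the sequences, and the choice of '-' as filler are assumptions.

```python
def concatenate(alignments, missing_char="-"):
    """Concatenate single-gene alignments by taxon name.

    alignments is a list with one dict per locus, mapping taxon name to its
    aligned sequence. Taxa absent from a locus are padded with missing_char.
    """
    lengths = []
    for aln in alignments:
        seq_lengths = {len(seq) for seq in aln.values()}
        if len(seq_lengths) != 1:
            raise ValueError("all sequences within a locus must have equal length")
        lengths.append(seq_lengths.pop())
    taxa = sorted(set().union(*alignments))
    return {
        taxon: "".join(
            aln.get(taxon, missing_char * length)
            for aln, length in zip(alignments, lengths)
        )
        for taxon in taxa
    }


# Made-up toy data for two loci.
locus_1 = {"taxon_A": "ACGT", "taxon_B": "AC-T"}
locus_2 = {"taxon_A": "GG"}
print(concatenate([locus_1, locus_2]))
```

Here taxon_B has no sequence for the second locus, so its concatenated row is filled with two gap characters.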