02 Get started

1. Write your analysis file

To get started, you need a Python analysis file in which you define your input; example scripts can be found in the main folder. Further settings for the alignment-building process and the filtering options are made in the configuration file (see 03 Settings options). For more examples, see 04 advanced analyses.

The standard analysis adds sequences to the input that are similar and long enough, as long as they are not subsequences of an already existing sequence, until the taxon has reached the threshold. The sequences retrieved from the BLAST search during the update are filtered before a new BLAST round begins.
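The filtering idea described above can be illustrated with a toy function. This is only a sketch and not PhylUp's implementation: the function name, the `min_len` and `threshold` parameters, and the `(taxon, sequence)` tuples are assumptions made for illustration, and the similarity check is left to the BLAST search itself.

```python
def filter_candidates(existing, candidates, min_len, threshold):
    """Toy filter; existing and candidates are lists of (taxon, sequence) tuples.

    Returns the candidates that are long enough, are not a subsequence of a
    sequence already present, and whose taxon is still below the threshold.
    """
    kept = []
    # Count how many sequences each taxon already has.
    per_taxon = {}
    for taxon, _ in existing:
        per_taxon[taxon] = per_taxon.get(taxon, 0) + 1
    present = [seq for _, seq in existing]

    for taxon, seq in candidates:
        if len(seq) < min_len:
            continue  # too short
        if any(seq in other for other in present):
            continue  # identical to or contained in a sequence already present
        if per_taxon.get(taxon, 0) >= threshold:
            continue  # taxon has already reached the threshold
        kept.append((taxon, seq))
        present.append(seq)
        per_taxon[taxon] = per_taxon.get(taxon, 0) + 1
    return kept
```

The real thresholds and filters are controlled through the configuration file, not through a function like this one.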

There is an example file, example_analysis.py; it comes with a tiny sample dataset in tests/data/tiny_example.

seqaln: path to your single sequence or alignment file; it must be a single-gene alignment
mattype: file format of your alignment - currently supported: “fasta”, “newick”, “nexus”, “nexml”, “phylip”
trfn: optional; path to the file containing the corresponding phylogeny. All tips must be represented in the alignment file as well.
schema_trf: file format of your phylogeny file - currently supported: “fasta”, “newick”, “nexus”, “nexml”, “phylip”
id_to_spn: path to a comma-delimited file in which tip labels are matched to species names; an example file can be found in tests/data/tiny_test_example/test_nicespl.csv. Species names must be written with '_' instead of whitespace; this also applies to species epithets that contain a '-'.
workdir: path to your working directory, the folder where intermediate and result files will be stored.
configfi: path to your configuration file, configuration options are explained in "03 Settings options".
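As a rough illustration of the id_to_spn file, the sketch below writes a comma-delimited table of tip labels and species names. The helper names, the column order (tip label first), the absence of a header line, and the reading that a '-' is replaced by '_' as well are all assumptions; compare with test_nicespl.csv before relying on it.

```python
import csv
import re


def to_underscore_name(species_name):
    """Replace runs of whitespace and '-' with a single '_' (assumed rule)."""
    return re.sub(r"[\s-]+", "_", species_name.strip())


def write_id_to_spn(rows, path):
    """Write (tip_label, species_name) pairs as a comma-delimited file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for tip_label, species_name in rows:
            writer.writerow([tip_label, to_underscore_name(species_name)])
```

For example, `write_id_to_spn([("tip1", "Genus species")], "my_table.csv")` would write the single line `tip1,Genus_species`; the tip label and file name here are made up.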

Note: Specified paths must either be absolute, starting from your root directory (e.g. /home/USER/PhylUp/path/to/file), or be relative to the PhylUp main folder (./path/to/file).
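The path rule can be sketched as follows. The variable names match the input options listed above, but every path is a placeholder, and `resolve_input` is a hypothetical helper that is not part of PhylUp.

```python
from pathlib import Path

# Placeholder values only; replace them with your own files.
main_folder = "/home/USER/PhylUp"
settings = {
    "seqaln": "./path/to/alignment.fasta",
    "trfn": "./path/to/tree.tre",
    "id_to_spn": "./path/to/tip_to_species.csv",
    "workdir": "/home/USER/PhylUp/path/to/workdir",
    "configfi": "./path/to/settings.config",
}


def resolve_input(path, main_folder):
    """Keep absolute paths; interpret relative ones from the PhylUp main folder."""
    p = Path(path)
    return p if p.is_absolute() else Path(main_folder) / p


# Absolute versions of all inputs, e.g. for checking that the files exist.
resolved = {name: resolve_input(path, main_folder) for name, path in settings.items()}
```

Calling `resolved["seqaln"].exists()` before starting a long run is a cheap way to catch a mistyped path.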

Besides the standard definition, there are more input options. Currently supported are:

ignore_acc_list: a list of sequences that shall not be added; if a listed sequence is part of the input, it will be removed. This needs to be formatted as a Python list containing the GenBank identifiers (e.g. [accession number, accession number]).
status_end: tells the program how many rounds of BLAST searches to run. If it is set to 1, the input sequences are blasted once and the sequences found are added to the alignment. If it is set to 2, the sequences newly retrieved in the initial BLAST search are blasted as well, and those results are also added to the alignment; that is, the input plus the first round of newly found sequences is blasted. If it is set to 3, one more round is added, and so on.
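The following toy model shows how the two options interact. It is a sketch under assumptions, not PhylUp code: `run_rounds` and `search` are hypothetical, `search` stands in for a BLAST query and returns the identifiers it finds, and the identifiers in the usage note are made up.

```python
def run_rounds(input_ids, status_end, search, ignore_acc_list=()):
    """Toy model of the BLAST rounds: returns the ids in the final alignment."""
    ignored = set(ignore_acc_list)
    # Ignored sequences are removed even if they were part of the input.
    alignment = [seq_id for seq_id in input_ids if seq_id not in ignored]
    to_blast = list(alignment)
    for _ in range(status_end):
        new = []
        for query in to_blast:
            for hit in search(query):
                if hit not in ignored and hit not in alignment and hit not in new:
                    new.append(hit)
        alignment.extend(new)
        if not new:
            break  # nothing new was found, so further rounds add nothing
        to_blast = new  # the next round blasts only the newly added sequences
    return alignment
```

With `hits = {"in1": ["X1", "X2"], "X1": ["X3"]}` and `search = lambda q: hits.get(q, [])`, `run_rounds(["in1"], 1, search)` gives `["in1", "X1", "X2"]`, while `status_end=2` also adds `"X3"`. Passing `ignore_acc_list=["X1"]` keeps `"X1"` out, and with it everything that only `"X1"` would have retrieved.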

2. Start to update your alignment:

Start the analyses from your PhylUp main folder:

python3 ./path/to/file/analysis-file.py

Unfortunately, the blasting takes quite some time, especially for larger alignments (for speed-up options, see the configuration options). One option is to run the analyses on a cluster.

3. Concatenate different single-gene PhylUp runs:

After the single-gene PhylUp runs have been updated, the data can be combined using phylogenetic concatenation; see, for example, example_concat.py.
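In general terms, concatenation joins the single-gene alignments taxon by taxon and fills in gaps where a taxon is missing for a gene. The sketch below shows only that general idea; it is not what example_concat.py does internally, and the `concatenate` function and its dict-based input are assumptions.

```python
def concatenate(alignments, gap="-"):
    """Join per-gene alignments (dicts of taxon -> aligned sequence) by taxon.

    A taxon that is missing from a gene gets a run of gap characters of that
    gene's alignment length.
    """
    lengths = []
    for aln in alignments:
        seq_lengths = {len(seq) for seq in aln.values()}
        if len(seq_lengths) != 1:
            raise ValueError("each alignment needs sequences of one equal length")
        lengths.append(seq_lengths.pop())
    taxa = sorted(set().union(*alignments))
    return {
        taxon: "".join(
            aln.get(taxon, gap * length) for aln, length in zip(alignments, lengths)
        )
        for taxon in taxa
    }
```

How the sequences of different genes are matched to one another when a taxon has several sequences is a separate decision that this sketch does not cover.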

4. Navigating the output:

During a PhylUp run, several files are written out. Here is a short introduction to what they are:

all_new_seqs.updated: CSV file with all available information about the newly added sequences
fulltree.raxml.: all the files generated by RAxML-NG if the tree updating was enabled
logfile: short summary of how many sequences were added/filtered during a PhylUp run
table.updated: all sequences that were considered for addition, some of which were removed by the filtering steps. It does not include sequences that were not added because they are not part of the mrca or because they were too short; for those, see wrong_mrca.csv and wrong_seq_length.csv
updt_aln.fasta: updated alignment; there is also a relabeled version with tip names that combine taxon name and accession number
updt_tre.tre: updated tree, or an unresolved tree if no phylogeny was provided as input; there is also a relabeled version with tip names that combine taxon name and accession number
orig_inputaln.fasta: data as supplied in the input
orig_tre.tre: data as supplied in the input
blast: folder that holds all blast results
tmp: folder that holds taxonomic information, complete sequences and internally used files.
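A quick check of a finished run could look like the sketch below. The file names are the ones listed above; `summarize_workdir` and `count_fasta_records` are hypothetical helpers and assume plain FASTA files in which each sequence starts with a '>' header line.

```python
from pathlib import Path

# File and folder names as listed above.
EXPECTED = [
    "all_new_seqs.updated", "logfile", "table.updated",
    "updt_aln.fasta", "updt_tre.tre",
    "orig_inputaln.fasta", "orig_tre.tre",
    "blast", "tmp",
]


def count_fasta_records(path):
    """Count the sequences in a FASTA file by counting its header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_workdir(workdir):
    """Report which expected outputs are missing and how many sequences the
    input and updated alignments contain."""
    workdir = Path(workdir)
    summary = {"missing": [n for n in EXPECTED if not (workdir / n).exists()]}
    original = workdir / "orig_inputaln.fasta"
    updated = workdir / "updt_aln.fasta"
    if original.is_file() and updated.is_file():
        summary["n_input"] = count_fasta_records(original)
        summary["n_updated"] = count_fasta_records(updated)
    return summary
```

Some outputs can be absent in a normal run, for example the tree files when tree updating is disabled, so a non-empty "missing" list is a hint rather than an error.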
