Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Align assemblies autothreading #72

Merged
merged 20 commits into from
Dec 15, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
20 commits
Select commit Hold shift + click to select a range
1c59673
intial changes, not at all finished
lynnjo Dec 6, 2023
6e11f9c
code addded that calculated number of threads/runs
lynnjo Dec 7, 2023
0dbf8ff
more details to the calculate threads/run function
lynnjo Dec 7, 2023
333ef25
Merge branch 'main' into align_assemblies_autothreading
lynnjo Dec 7, 2023
c701ef1
futher attempts at determining available mem, plus debug
lynnjo Dec 8, 2023
2197f20
Merge branch 'main' into align_assemblies_autothreading
lynnjo Dec 8, 2023
d85eff6
add correct call to get system memory
lynnjo Dec 13, 2023
185ada1
Merge branch 'main' into align_assemblies_autothreading
lynnjo Dec 13, 2023
a83b7ef
udpate junit tests
lynnjo Dec 13, 2023
e23158d
Merge branch 'main' into align_assemblies_autothreading
lynnjo Dec 14, 2023
90a03de
cleanup before pull request
lynnjo Dec 14, 2023
5f1b83b
Merge branch 'main' into align_assemblies_autothreading
lynnjo Dec 14, 2023
157a96b
added edge case junits to up code coverage
lynnjo Dec 14, 2023
8bb6e2c
adding documentation updates
lynnjo Dec 14, 2023
6222348
adding detailed info on annotate-fastas, some typo corrections elsewhere
lynnjo Dec 14, 2023
b2f92a7
additional update to build_and_load.md
lynnjo Dec 14, 2023
ede4723
Merge branch 'main' into align_assemblies_autothreading
lynnjo Dec 14, 2023
672574c
additional updates to annotate-fastas documentation.
lynnjo Dec 14, 2023
8292319
minor edits per pull request comments on the documentation
lynnjo Dec 15, 2023
c4d1e26
fixed annotate fasta in documentation
lynnjo Dec 15, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
210 changes: 136 additions & 74 deletions docs/build_and_load.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,10 +31,15 @@ In this document, we will discuss the steps needed to:
--gff anchors.gff \
--reference-file /my/ref.fasta \
--assemblies assemblies_list.txt \
--total-threads 20 \
--in-parallel 4 \
-o /path/for/generated_files
```
* Update FASTA headers with sample information:
```shell
phg annotate-fastas \
--keyfile /path/to/keyfile \
--output-dir /path/to/annotated/fastas \
--threads 10
```
* Compress FASTA files
```shell
phg agc-compress \
Expand Down Expand Up @@ -325,6 +330,12 @@ This command uses several parameters:
`LineB`. Since these are located in a subdirectory called `data`
relative to my working directory, I will also add that to the path.

* `-o` - The name of the directory for the alignment outputs.


In addition to the above parameters, there are two optional parameters. When values are not specified for these parameters,
default values are calculated by the software based on the system processor and memory configuration.

* `--total-threads` - How many threads would you like to allocate for
the alignment step? More information about this step and the
`--in-parallel` step can be found in the following **Details -
Expand All @@ -333,7 +344,7 @@ This command uses several parameters:
More information about this step and the `--total-threads` step can
be found in the following **Details - threads and parallelization**
section.
* `-o` - The name of the directory for the alignment outputs.


> [!WARNING]
> The directory that you specify in the output (`-o`) section must
Expand Down Expand Up @@ -407,7 +418,52 @@ or review the following code blocks:
> prior `-R` and `-Q` parameters.


[//]: # (#### Details - threads and parallelization)
#### Details - threads and parallelization
Aligning with anchorwave is memory intensive and can be slow. Processing speed may be increased by
using multiple threads for each alignment, and/or by running multiple alignments in parallel. The amount of memory
each thread takes is dependent on the processor type. The table below shows the memory usage for a single
alignment based on processor type:

| Processor | peak memory (GB) | wall time |
|-----------|------------------|-----------|
| SSE2 | 20.1 | 26:47:17 |
| SSE4.1 | 20.6 | 24:05:07 |
| AVX2 | 20.1 | 21:40:00 |
| AVX512 | 20.1 | 18:31:39 |
| ARM | 34.2 | 18:08:57 |

The `--total-threads` parameter indicates the total number of threads available for system use. The `--in-parallel`
parameter controls the number of alignments run in parallel. When these values are not specified, the software will
compute the optimal values based on the system processor and memory configuration.

The number of threads that may be run in parallel is limited by the amount of memory available. The
system is queried for memory and processor information. The number of threads that may be run in parallel
is determined by "system memory" / "peak memory" from the table above. To generalize the calculation, we divide
memory available (GiB) by 21 and round down to the nearest integer.

For example, if the system has 512 GB of memory, 80 processors and 5 assemblies that need aligning,
the maximum number of threads that could be run in parallel is 24 (512/21). The number of potential parallel
alignments with the threads allocated for each is shown in the table below


| Alignments in parallel | Threads per alignment |
|------------------------|-----------------------|
| 5 | 4 |
| 4 | 6 |
| 3 | 8 |
| 2 | 12 |
| 1 | 24 |

The software will select a pairing that maximizes the number of alignments run in parallel while utilizing multiple threads,
opting for a value in the middle. In the case above with 5 assemblies and a possibility of 24 concurrent threads,
the system will choose to run 3 alignments in parallel with 8 threads each. The total number of threads used
will be 24 (3 * 8).

User defined values for in-parallel and total-threads are considered along with the number of assemblies to
align and system capacities, when determining the anchorwave setup.




[//]: # ()
[//]: # (***WIP*** - revisit once we can agree on naming and parameter conventions)
Expand Down Expand Up @@ -442,64 +498,14 @@ or review the following code blocks:

[//]: # (| ARM | 34.2 | 18:08:57 |)


### Compress FASTA files
For optimal storage of sequence information, we can convert our
"plain-text" FASTA files into a more compact representation using
compression. For PHGv2, we can use a command called `agc-compress`,
which is a wrapper for the
[Assembled Genomes Compressor](https://github.com/refresh-bio/agc)
(AGC). AGC provides performant and efficient compression ratios for
our assembly genomes. Like AnchorWave, AGC is also installed during
the Conda environment setup phase, so there is no need to install
this manually.

To run the compression step, we can call the `align-assemblies`
command:

```shell
./phg agc-compress \
--db-path vcf_dbs \
--fasta-list data/assemblies_list.txt \
--reference-file data/Ref.fa
```

This command takes in 3 parameters:
* `--db-path` - path to directory storing the TileDB instances. The
AGC compressed genomes will be placed here on completion.

> [!NOTE]
> The directory specified here should be the same directory used to
> initialize the TileDB instances in the database initialization
> (`initdb`) step.

* `--fasta-list` - List of assembly FASTA genomes to compress.

> [!NOTE]
> The list specified in `--fasta-list` _can_ be the same list used
> in the alignment (`align-assemblies`) step, but make sure header
> information in each FASTA contains sample IDs. See the
> [**"Important information regarding `agc` compression"**](#warning-important-information-regarding-agc-compression-warning)
> section for further information and how to remedy this.

* `--reference-file` - Reference FASTA genome.

After compression is finished, we can navigate to the directory
containing the TileDB instances. In my case, this would be the
subdirectory, `vcf_dbs`. Here, you will see a new file created:
`assemblies.agc`. This is the compressed AGC file containing our
assemblies. This file will be used later to query for haplotype
sequence regions and composite genome creation.

#### :warning: Important information regarding `agc` compression :warning:

### Annotate Fastas
As of the current date of this document, the return methods of
AGC will not keep track of sample IDs when returning
sequence information from the compressed file **unless you explicitly
state the sample information in the header lines of the FASTA files
you wish to compress for the PHGv2 databases**. To explain this
further, let's imagine that we have two FASTA files: `LineA.fa` and
`LineB.fa`. These files contain information for only chromosome 1 and
AGC will not keep track of sample IDs when returning
sequence information from the compressed file **unless you explicitly
state the sample information in the header lines of the FASTA files
you wish to compress for the PHGv2 databases**. To explain this
further, let's imagine that we have two FASTA files: `LineA.fa` and
`LineB.fa`. These files contain information for only chromosome 1 and
have a simple header that denotes this sequence name (e.g. `chr1`):

```
Expand All @@ -512,12 +518,12 @@ $ head LineB.fa
ATGCGTTCGCCTTCCG
```

After we compress this information using the `agc-compress` command,
After we compress this information using the `agc-compress` command,
we will have a file called `assemblies.agc`. PHGv2 leverages this
compressed sequence file when creating VCF data (_see
the [**"Create VCF files"**](#create-vcf-files) section for further
the [**"Create VCF files"**](#create-vcf-files) section for further
details_). The issue arrives when we query the compressed data using
[AGC's `getctg`](https://github.com/refresh-bio/agc#extract-contigs-from-the-archive)
[AGC's `getctg`](https://github.com/refresh-bio/agc#extract-contigs-from-the-archive)
command. When we query this hypothetical archive, we will get the
following information:

Expand All @@ -531,7 +537,7 @@ ATGCGTACGCGCACCG
ATGCGTTCGCCTTCCG
```

As you can see from the above hypothetical output, we now have no
As you can see from the above hypothetical output, we now have no
means to efficiently track where each sequence is coming from due
to the lack of header information from our prior FASTA files. We
can remedy this by adding sample information to the headers:
Expand Down Expand Up @@ -559,7 +565,7 @@ ATGCGTACGCGCACCG
ATGCGTTCGCCTTCCG
```

While we can either manually modify the header lines of our FASTA
While we can either manually modify the header lines of our FASTA
file, this can become tedious and prone to a new set of downstream
errors. To automate this, PHGv2 provides an optional command to
append sample information to the headers of each FASTA file called
Expand All @@ -574,35 +580,37 @@ phg annotate-fastas \

This command takes 3 parameters:

* `--keyfile` - A [tab-delimited](https://en.wikipedia.org/wiki/Tab-separated_values)
* `--keyfile` - A [tab-delimited](https://en.wikipedia.org/wiki/Tab-separated_values)
keyfile containing two columns:

| Column | Value |
|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1 | Path to FASTA file you would like annotated (this is similar to the text files used to point to the FASTA file paths in the [`agc-compress`](#compress-fasta-files) and [`align-assemblies`](#align-assemblies) commands). |
* | 2 | Name of the sample that will be appended to each header line |
| Column | Value |
|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1 | Path to FASTA file you would like annotated (this is similar to the text files used to point to the FASTA file paths in the [`agc-compress`](#compress-fasta-files) and [`align-assemblies`](#align-assemblies) commands). |
| 2 | Name of the sample that will be appended to each header line |

+ My example `annotation_keyfile.txt` file would look like this:

```
data/LineA.fa LineA
data/LineB.fa LineB
```
* `--threads` - Optional number of threads to annotate multiple
FASTA files in parallel. _Defaults to `1`_.
* `-o` - Output directory for the newly annotated FASTA files
* `-o` - Output directory for the newly annotated FASTA files

> [!WARNING]
> This step must be performed before the `agc-compress` step.

> [!WARNING]
> Sample IDs in the keyfile must match with what is found in the
> Sample IDs in the keyfile must match with what is found in the
> TileDB instances.

> [!NOTE]
> FASTA files can be either uncompressed or compressed. If
> compressed, the extension should be `*.gz`.

Once finished, this command will produce FASTA files with the name
of the sample from the keyfile appended to each header line. For
of the sample from the keyfile appended to each header line. For
example, in our hypothetical FASTA files, our headers go from this:

```
Expand Down Expand Up @@ -635,6 +643,60 @@ ATGCGTACGCGCACCG
>ATGCGTACGCGCACCG
>```





### Compress FASTA files
For optimal storage of sequence information, we can convert our
"plain-text" FASTA files into a more compact representation using
compression. For PHGv2, we can use a command called `agc-compress`,
which is a wrapper for the
[Assembled Genomes Compressor](https://github.com/refresh-bio/agc)
(AGC). AGC provides performant and efficient compression ratios for
our assembly genomes. Like AnchorWave, AGC is also installed during
the Conda environment setup phase, so there is no need to install
this manually.

To run the compression step, we can call the `align-assemblies`
command:

```shell
./phg agc-compress \
--db-path vcf_dbs \
--fasta-list data/assemblies_list.txt \
--reference-file data/Ref.fa
```

This command takes in 3 parameters:
* `--db-path` - path to directory storing the TileDB instances. The
AGC compressed genomes will be placed here on completion.

> [!NOTE]
> The directory specified here should be the same directory used to
> initialize the TileDB instances in the database initialization
> (`initdb`) step.

* `--fasta-list` - List of assembly FASTA genomes to compress.

> [!NOTE]
> The list specified in `--fasta-list` _can_ be the same list used
> in the alignment (`align-assemblies`) step, but make sure header
> information in each FASTA contains sample IDs. See the
> [**"Important information regarding `agc` compression"**](#warning-important-information-regarding-agc-compression-warning)
> section for further information and how to remedy this.

* `--reference-file` - Reference FASTA genome.

After compression is finished, we can navigate to the directory
containing the TileDB instances. In my case, this would be the
subdirectory, `vcf_dbs`. Here, you will see a new file created:
`assemblies.agc`. This is the compressed AGC file containing our
assemblies. This file will be used later to query for haplotype
sequence regions and composite genome creation.



### Create VCF files
Now that we have (1) created alignments of our assemblies against a
single reference genome and (2) created compressed representations
Expand Down
Loading