Update GTDB download and formatting #366

Merged
merged 13 commits into from
Dec 10, 2024

chasemc
Member

@chasemc chasemc commented Oct 29, 2024

Moves the download and setup code to an entirely Python-based implementation.
Removes the documentation about autometa-setup-gtdb, which can be revisited later.

That pathway makes it hard to ensure reproducibility, since the files as downloaded carry no version information. For now, to ensure reproducibility, only Autometa-downloaded files are accepted.
@chasemc chasemc requested a review from jason-c-kwan October 29, 2024 12:47
@chasemc chasemc added the bug, enhancement, and python labels Oct 29, 2024
@chasemc chasemc removed the request for review from jason-c-kwan October 29, 2024 13:03
@chasemc chasemc marked this pull request as draft October 29, 2024 13:03
@chasemc chasemc requested a review from jason-c-kwan October 29, 2024 13:19
@chasemc chasemc marked this pull request as ready for review October 29, 2024 13:19
@jason-c-kwan
Collaborator

When running:

autometa-config \
    --section databases --option gtdb \
    --value <path/to/your/gtdb/database/directory>

Currently this does not create the directory if it doesn't already exist, which leads to an error when autometa-update-databases is run. Can you add some handling for this?
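
A minimal sketch of the kind of handling being requested, assuming the configured value is a plain filesystem path. The helper name `ensure_database_dir` is hypothetical and is not part of Autometa; it only illustrates creating the directory up front so later steps do not fail on a missing path.

```python
from pathlib import Path


def ensure_database_dir(path_value: str) -> Path:
    """Return the resolved directory, creating it (and any parents) if missing.

    Hypothetical helper for illustration; not Autometa's actual code.
    """
    db_dir = Path(path_value).expanduser().resolve()
    # exist_ok=True makes this a no-op when the directory already exists
    db_dir.mkdir(parents=True, exist_ok=True)
    return db_dir
```

Calling something like this when the `gtdb` option is set, or at the start of the update step, would avoid the missing-directory error in either order of operations.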

@jason-c-kwan
Collaborator

Also, would it not be faster to download from data.ace.uq.edu.au rather than data.gtdb.ecogenomic.org? In my tests at least from home it seems to be at least 10x faster.

@chasemc
Member Author

chasemc commented Dec 6, 2024

> Also, would it not be faster to download from data.ace.uq.edu.au rather than data.gtdb.ecogenomic.org? In my tests at least from home it seems to be at least 10x faster.

The day I tried, they were both downloading at exactly the same rate, so I left it. I think the world mirror is still in Australia.
I changed the default to the rest-of-world mirror, data.ace.uq.edu.au, though it should be noted that data.gtdb.ecogenomic.org was what was used before this PR.

@chasemc
Member Author

chasemc commented Dec 6, 2024

The URL structure is different, give me a minute

@chasemc
Member Author

chasemc commented Dec 6, 2024

fixed

@jason-c-kwan
Collaborator

You still have a mistake in the URL. Got the error:

[12/06/2024 10:56:21 AM ERROR] autometa.taxonomy.download_gtdb_files: Failed to fetch MD5SUM.txt: 404 Client Error: Not Found for url: https://data.ace.uq.edu.au/public/gtdb/data/public/gtdb/data/releases/release207/207.0/MD5SUM.txt

The mistake is that the second "public" directory doesn't exist. It should be https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/MD5SUM.txt
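
For illustration, here is one way a duplicated path segment like this can arise and be avoided, sketched with the standard library's `urljoin`. The base URL and path layout are taken from the corrected URL above; the constant `ACE_BASE` and the function `release_file_url` are hypothetical names, and this is not necessarily how Autometa builds its URLs.

```python
from urllib.parse import urljoin

# Base taken from the corrected URL above; it already ends in "public/gtdb/data/".
ACE_BASE = "https://data.ace.uq.edu.au/public/gtdb/data/"


def release_file_url(release: str, filename: str, base: str = ACE_BASE) -> str:
    """Build the URL of a file in a GTDB release directory (hypothetical helper)."""
    major = release.split(".")[0]
    # The relative part must NOT repeat "public/gtdb/data/", since the base has it.
    relative = f"releases/release{major}/{release}/{filename}"
    return urljoin(base, relative)


# Joining a relative path that repeats the prefix reproduces the failing URL:
duplicated = urljoin(ACE_BASE, "public/gtdb/data/releases/release207/207.0/MD5SUM.txt")
```

Here `release_file_url("207.0", "MD5SUM.txt")` yields the corrected URL, while `duplicated` contains `public/gtdb/data/` twice, matching the 404 in the log.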

@jason-c-kwan
Collaborator

Something is wrong with the creation of the autometa-formatted version of faa.gz. I get this error:

[12/06/2024 01:21:58 PM DEBUG] autometa.common.external.diamond: diamond makedb --in /home/jason/Downloads/databases/autometa_formatted_gtdb-version-207.0.faa.gz --db /home/jason/Downloads/databases/autometa_formatted_gtdb-version-207.0.dmnd -p 120
Traceback (most recent call last):
  File "/home/jason/mambaforge/envs/autometa/bin/autometa-update-databases", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/jason/mambaforge/envs/autometa/lib/python3.12/site-packages/autometa/config/databases.py", line 860, in main
    diamond.makedatabase(
  File "/home/jason/mambaforge/envs/autometa/lib/python3.12/site-packages/autometa/common/external/diamond.py", line 51, in makedatabase
    subprocess.run(
  File "/home/jason/mambaforge/envs/autometa/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['diamond', 'makedb', '--in', '/home/jason/Downloads/databases/autometa_formatted_gtdb-version-207.0.faa.gz', '--db', '/home/jason/Downloads/databases/autometa_formatted_gtdb-version-207.0.dmnd', '-p', '120']' returned non-zero exit status 1.

When trying to run the diamond command in the terminal, I get:

diamond v2.1.10.164 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 120
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /home/jason/Downloads/databases/autometa_formatted_gtdb-version-207.0.faa.gz
Opening the database file... Error: Error detecting input file format. First line seems to be blank.

And indeed, zcat shows that the file autometa_formatted_gtdb-version-207.0.faa.gz is blank.
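
A guard along the following lines could surface this earlier, before diamond is invoked. This is a sketch only: the function name `gzipped_fasta_has_records` is hypothetical, and it assumes that a valid formatted file has a first non-blank line starting with `>`.

```python
import gzip


def gzipped_fasta_has_records(path: str) -> bool:
    """Return True if the first non-blank line of a gzipped FASTA starts with '>'.

    Hypothetical pre-flight check; not part of Autometa or diamond.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    # No non-blank line was found, so the file is effectively empty
    return False
```

Raising a descriptive error when this returns `False` would point at the formatting step rather than at the `diamond makedb` failure.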

@chasemc
Member Author

chasemc commented Dec 8, 2024

It worked from scratch for me. My guess is a partial download from a previous attempt caused an error. Try clearing the files and attempting again.

If that results in the error again, can you post the commands used?
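
One way to rule out a partial download is to compare each file against the release's MD5SUM.txt. The sketch below assumes that file uses the usual `md5sum` output format (`<digest>  <name>`, possibly with a `./` prefix), which is an assumption and not confirmed in this thread; all function names are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum(text: str) -> dict[str, str]:
    """Map file name -> expected digest from md5sum-style lines (assumed format)."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        name = name.strip().lstrip("*")  # '*' marks binary mode in md5sum output
        if name.startswith("./"):
            name = name[2:]
        expected[name] = digest.lower()
    return expected


def download_is_complete(path: str, name: str, md5sum_text: str) -> bool:
    """Return True only if the file's digest matches the listed one."""
    expected = parse_md5sum(md5sum_text).get(name)
    return expected is not None and md5_of_file(path) == expected
```

A file that fails this check could be deleted and fetched again instead of being passed on to the formatting step.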

@jason-c-kwan
Collaborator

Just deleted my mamba environment and installed again from scratch and got the same result. Can you try again with gtdb version 207, which is what I was downloading? The commands I ran were as follows:

# After pulling git repo etc.
make create_environment
mamba activate autometa
make install
autometa-config --section databases --option gtdb --value ~/Downloads/databases
autometa-config --section gtdb --option release --value 207
autometa-update-databases --update-gtdb

@chasemc
Member Author

chasemc commented Dec 9, 2024

Running your exact commands on the server now, but the download will take time.

Did you clear the files from ~/Downloads/databases before retrying?

@jason-c-kwan
Collaborator

Yes, I did.

@chasemc
Member Author

chasemc commented Dec 9, 2024

The code and documentation were written for v220 or higher, because GTDB changed the file contents somewhere between 207 and 220.
The code can only download the latest version (214 might work as well), but I have added a check that only allows versions 220 or greater.
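
For readers following along, a check of this kind might look roughly like the sketch below. The names `MIN_SUPPORTED_RELEASE` and `validate_gtdb_release` are hypothetical, and the actual check added in this PR may differ; the only detail taken from the thread is the 220 threshold.

```python
MIN_SUPPORTED_RELEASE = 220  # threshold stated in the comment above


def validate_gtdb_release(release: str) -> int:
    """Return the major release number, or raise ValueError if unsupported.

    Hypothetical sketch; accepts values such as "220" or "220.0".
    """
    major_text = str(release).split(".")[0]
    if not major_text.isdigit():
        raise ValueError(f"Unrecognized GTDB release: {release!r}")
    major = int(major_text)
    if major < MIN_SUPPORTED_RELEASE:
        raise ValueError(
            f"GTDB release {release} is not supported; "
            f"use {MIN_SUPPORTED_RELEASE} or greater."
        )
    return major
```

Failing fast like this, before anything is downloaded, avoids the situation earlier in the thread where release 207 downloaded fully and then produced an empty formatted file.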

@jason-c-kwan
Copy link
Collaborator

OK, it works now. However, because we are limited to the latest release I can't test the ability to update in-place. We will have to test that later.

@jason-c-kwan jason-c-kwan merged commit ee499fe into dev Dec 10, 2024
2 of 3 checks passed
@chasemc chasemc deleted the gtdb branch December 10, 2024 17:06