Download does not work #444

Open
ChristophKnapp opened this issue Jan 15, 2025 · 9 comments

Comments

@ChristophKnapp

ChristophKnapp commented Jan 15, 2025

I am trying to download all genomes of a taxon, but I end up with empty output and a lot of warnings.

pyani download -o tax326423 --email christoph.knapp01@gmail.com -t 326423 -v -l pyANI_download.log

...

pyANI_download.log

Both the GenBank and RefSeq downloads fail.

I'm on Ubuntu 24.04.

Regards

Christoph

@peterjc
Collaborator

peterjc commented Jan 15, 2025

Sadly the log has no clues like an error message:

[INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, citation=False, classfname='classes.txt', debug=False, disable_tqdm=False, dryrun=False, email='christoph.knapp01@gmail.com', force=False, func=<function subcmd_download at 0x71137dc88160>, kraken=False, labelfname='labels.txt', logfile=PosixPath('pyANI_download.log'), noclobber=False, outdir=PosixPath('tax326423'), retries=20, taxon='326423', timeout=10, verbose=True, version=False)
[INFO] [pyani.scripts.pyani_script]: command-line: /opt/miniforge3/envs/pyani_env/bin/pyani download -o tax326423 --email christoph.knapp01@gmail.com -t 326423 -v -l pyANI_download.log
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[INFO] [pyani.scripts.pyani_script]: CITATION INFO
[INFO] [pyani.scripts.pyani_script]: If you use pyani in your work, please cite the following publication:
[INFO] [pyani.scripts.pyani_script]: 	Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G.,
[INFO] [pyani.scripts.pyani_script]: 	& Toth, I.K. (2016) 'Genomics and taxonomy in diagnostics for
[INFO] [pyani.scripts.pyani_script]: 	food security: soft-rotting enterobacterial plant pathogens.'
[INFO] [pyani.scripts.pyani_script]: 	Analytical Methods, 8(1), 12–24. http://doi.org/10.1039/C5AY02550H
[INFO] [pyani.scripts.pyani_script]: DEPENDENCIES
[INFO] [pyani.scripts.pyani_script]: The authors of pyani gratefully acknowledge its dependence on
[INFO] [pyani.scripts.pyani_script]: the following bioinformatics software:
[INFO] [pyani.scripts.pyani_script]: 	MUMmer3: S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway,
[INFO] [pyani.scripts.pyani_script]: 	C. Antonescu, and S.L. Salzberg (2004), 'Versatile and open software
[INFO] [pyani.scripts.pyani_script]: 	for comparing large genomes' Genome Biology 5:R12
[INFO] [pyani.scripts.pyani_script]: 	BLAST+: Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J.,
[INFO] [pyani.scripts.pyani_script]: 	Bealer K., & Madden T.L. (2008) 'BLAST+: architecture and applications.'
[INFO] [pyani.scripts.pyani_script]: 	BMC Bioinformatics 10:421.
[INFO] [pyani.scripts.pyani_script]: 	BLAST: Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J.,
[INFO] [pyani.scripts.pyani_script]: 	Zhang, Z., Miller, W. & Lipman, D.J. (1997) 'Gapped BLAST and PSI-BLAST:
[INFO] [pyani.scripts.pyani_script]: 	a new generation of protein database search programs.' Nucleic Acids Res.
[INFO] [pyani.scripts.pyani_script]: 	25:3389-3402
[INFO] [pyani.scripts.pyani_script]: 	Biopython: Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A,
[INFO] [pyani.scripts.pyani_script]: 	Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL
[INFO] [pyani.scripts.pyani_script]: 	(2009) Biopython: freely available Python tools for computational
[INFO] [pyani.scripts.pyani_script]: 	molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
[INFO] [pyani.scripts.pyani_script]: 	fastANI: Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis K, and
[INFO] [pyani.scripts.pyani_script]: 	Aluru S (2018) 'High throughput ANI analysis of 90K prokaryotic
[INFO] [pyani.scripts.pyani_script]: 	genomes reveals clear species boundaries.' Nature Communications 9, 5114
[INFO] [pyani.scripts.subcommands.subcmd_download]: Downloading genomes from NCBI
[INFO] [pyani.scripts]: Creating output directory tax326423

You said there was an error - was there an error message, or did the tool seem to hang with no sign of any progress?

@ChristophKnapp
Author

Sorry, I did not realize that the log file does not contain the full output. It did not hang; it just said that it was ignoring all 4 genomes, as RefSeq or GenBank. I have attached the full output.

pyANI_complete_out.txt

Thanks for the fast reply

Regards

Christoph

@ChristophKnapp
Author

I used the FTP URLs with wget and there was no problem downloading the file, so it's not a firewall issue.

@ChristophKnapp
Author

I also tried a different taxon ID, but the problem remains.

@peterjc
Collaborator

peterjc commented Jan 15, 2025

The key part of that looks to me to be: urllib.error.URLError: <urlopen error ftp error: ConnectionRefusedError(111, \'Connection refused\')>\n - but you imply that you can access the FTP URL fine from the same machine with another tool?

The latest version of the code (looking at the master branch, since you didn't say which version you were using) does the FTP download with urllib.request.urlopen(...) from the Python standard library, but does not appear to catch urllib.error.URLError to give a graceful failure message.

In the meantime, perhaps https://github.com/kblin/ncbi-genome-download or similar would get you going?
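To illustrate the point about the uncaught exception, here is a minimal sketch of what catching urllib.error.URLError around urllib.request.urlopen(...) could look like. This is not pyani's actual code: the helper name fetch_url, its retries and opener parameters, and the printed message are all hypothetical and only there for illustration.

```python
import urllib.error
import urllib.request


def fetch_url(url, timeout=10, retries=3, opener=urllib.request.urlopen):
    """Return the bytes at url, or None if every attempt fails.

    Hypothetical helper, not part of pyani. The opener argument exists
    only so the function can be exercised without network access.
    """
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.URLError as err:
            # Only URLError (and its subclasses) is handled here; the
            # failure is reported instead of surfacing as a traceback.
            print(f"Attempt {attempt}/{retries} failed for {url}: {err.reason}")
    return None
```

A caller could then check for None and log a single clear warning per genome rather than a raw traceback.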

@ChristophKnapp
Author

The version is
pyani version: 0.3.0-alpha

> The key part of that looks to me: urllib.error.URLError: <urlopen error ftp error: ConnectionRefusedError(111, \'Connection refused\')>\n - you imply that you can access the FTP URL from the same machine with another tool fine though?

Yes, I used wget with the URL and it downloaded the file without problems.

> The latest version of the code (looking at the master branch since you didn't say which version you were using) does the FTP download with urllib.request.urlopen(...) from the Python standard library, but does not look like it catches urllib.error.URLError to give a graceful failure message.
>
> In the meantime, perhaps https://github.com/kblin/ncbi-genome-download or similar would get you going?

@widdowquinn
Owner

widdowquinn commented Jan 16, 2025

Hi both,

I've just run the exact same command from the train into work. The download seems to have completed fine, for me. This is from the log file:

[INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(version=False, citation=False, logfile=PosixPath('pyANI_download.log'), verbose=True, debug=False, disable_tqdm=False, outdir=PosixPath('tax326423'), taxon='326423', email='christoph.knapp01@gmail.com', api_keypath=PosixPath('~/.ncbi/api_key'), retries=20, batchsize=10000, timeout=10, force=False, noclobber=False, labelfname='labels.txt', classfname='classes.txt', kraken=False, dryrun=False, func=<function subcmd_download at 0x172208860>)
[INFO] [pyani.scripts.pyani_script]: command-line: /Users/lpritc/opt/anaconda3/envs/pyani_py311/bin/pyani download -o tax326423 --email christoph.knapp01@gmail.com -t 326423 -v -l pyANI_download.log
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[...]
[INFO] [pyani.scripts.subcommands.subcmd_download]: Downloading genomes from NCBI
[INFO] [pyani.scripts]: Creating output directory tax326423
[INFO] [pyani.scripts.subcommands.subcmd_download]: Setting Entrez email address: christoph.knapp01@gmail.com
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/lpritc/.ncbi/api_key not a valid file. Not using API key.
[INFO] [pyani.scripts.subcommands.subcmd_download]: Taxon IDs received: ['326423']
[INFO] [pyani.scripts.subcommands.subcmd_download]: Downloading contigs for Taxon ID ['4693281', '36888']
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving eSummary information for UID 4693281
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving URLs for GCF_000015785.2_ASM1578v2
[INFO] [pyani.scripts.subcommands.subcmd_download]: MD5 hash check passed
[INFO] [pyani.scripts.subcommands.subcmd_download]: Label and class file entries
	Label: 02e4e33060a6b8c42b3c63600c354115	GCF_000015785.2_ASM1578v2_genomic	B. velezensis BGSC:10A6; DSM:23117
	Class: 02e4e33060a6b8c42b3c63600c354115	GCF_000015785.2_ASM1578v2_genomic	Bacillus velezensis
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving eSummary information for UID 36888
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving URLs for GCF_000015785.1_ASM1578v1
[INFO] [pyani.scripts.subcommands.subcmd_download]: MD5 hash check passed
[INFO] [pyani.scripts.subcommands.subcmd_download]: Label and class file entries
	Label: 3b7d1157eb13be26d8c67410b61a7188	GCF_000015785.1_ASM1578v1_genomic	B. velezensis FZB42
	Class: 3b7d1157eb13be26d8c67410b61a7188	GCF_000015785.1_ASM1578v1_genomic	Bacillus velezensis
[INFO] [pyani.scripts.subcommands.subcmd_download]: Writing classes file to tax326423/classes.txt
[INFO] [pyani.scripts.subcommands.subcmd_download]: Writing labels file to tax326423/labels.txt
[INFO] [pyani.scripts.pyani_script]: Completed. Time taken: 231.308

However, the first time I ran the command we went into a tunnel and I lost connection. This gave a URLError as might be expected for an interrupted connection.

Is the error you see persistent and reproducible, @ChristophKnapp? If it is, then as the code and command appear to work as intended on other machines and in other situations, I would suspect a local issue on that machine, but I don't have enough information to diagnose it, I'm afraid.

Cheers,

L.

@ChristophKnapp
Author

Thank you for your efforts. It is consistent in that it fails. This morning I tried the exact example from this repository and it partially downloaded the expected results. It also downloaded some genomes that are not part of the expected output, though; I assume this is because the taxon has changed over time. Nevertheless, I don't have MD5 files, FASTA files, or hash files for most of them. The classes and labels files contain only 2 entries. Somehow the connection seems flaky, cutting in and out. I think I did not notice that before because the taxa I tried earlier were a lot smaller.

I think it has something to do with my setup here.

@peterjc
Collaborator

peterjc commented Jan 16, 2025

That is unfortunate. That it works sometimes, or at least for smaller genomes, does point to the connection being weak. I'm not sure there is much we can do; the only thing I can think of trying is switching from FTP to HTTPS, which many of the NCBI resources support.
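
As a rough sketch of the FTP-to-HTTPS idea, the scheme of each download URL could be swapped before fetching. This assumes the server exposes the same host and path over HTTPS, which would need checking for each resource. The function name ftp_to_https and the example path are hypothetical, and this is not something pyani currently does as far as this thread shows.

```python
from urllib.parse import urlsplit, urlunsplit


def ftp_to_https(url):
    """Return url with an ftp:// scheme swapped for https://.

    Hypothetical helper: it assumes the same host and path are also
    served over HTTPS. URLs with any other scheme are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))


# Placeholder path for illustration only, not a real genome location.
print(ftp_to_https("ftp://ftp.ncbi.nlm.nih.gov/some/placeholder/file.fna.gz"))
```

Rewriting only the scheme keeps the change small, so it could sit in front of whatever download call is already in use.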
