Download does not work #444

ChristophKnapp · 2025-01-15T11:59:21Z

I am trying to download all genomes of a taxon, but I end up with empty output and a lot of warnings.

pyani download -o tax326423 --email christoph.knapp01@gmail.com -t 326423 -v -l pyANI_download.log

...

pyANI_download.log

Both the GenBank and RefSeq downloads fail.

I'm on Ubuntu 24.04.

Regards

Christoph

peterjc · 2025-01-15T12:26:11Z

Sadly the log has no clues like an error message:

[INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, citation=False, classfname='classes.txt', debug=False, disable_tqdm=False, dryrun=False, email='christoph.knapp01@gmail.com', force=False, func=<function subcmd_download at 0x71137dc88160>, kraken=False, labelfname='labels.txt', logfile=PosixPath('pyANI_download.log'), noclobber=False, outdir=PosixPath('tax326423'), retries=20, taxon='326423', timeout=10, verbose=True, version=False)
[INFO] [pyani.scripts.pyani_script]: command-line: /opt/miniforge3/envs/pyani_env/bin/pyani download -o tax326423 --email christoph.knapp01@gmail.com -t 326423 -v -l pyANI_download.log
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[INFO] [pyani.scripts.pyani_script]: CITATION INFO
[INFO] [pyani.scripts.pyani_script]: If you use pyani in your work, please cite the following publication:
[INFO] [pyani.scripts.pyani_script]: 	Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G.,
[INFO] [pyani.scripts.pyani_script]: 	& Toth, I.K. (2016) 'Genomics and taxonomy in diagnostics for
[INFO] [pyani.scripts.pyani_script]: 	food security: soft-rotting enterobacterial plant pathogens.'
[INFO] [pyani.scripts.pyani_script]: 	Analytical Methods, 8(1), 12–24. http://doi.org/10.1039/C5AY02550H
[INFO] [pyani.scripts.pyani_script]: DEPENDENCIES
[INFO] [pyani.scripts.pyani_script]: The authors of pyani gratefully acknowledge its dependence on
[INFO] [pyani.scripts.pyani_script]: the following bioinformatics software:
[INFO] [pyani.scripts.pyani_script]: 	MUMmer3: S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway,
[INFO] [pyani.scripts.pyani_script]: 	C. Antonescu, and S.L. Salzberg (2004), 'Versatile and open software
[INFO] [pyani.scripts.pyani_script]: 	for comparing large genomes' Genome Biology 5:R12
[INFO] [pyani.scripts.pyani_script]: 	BLAST+: Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J.,
[INFO] [pyani.scripts.pyani_script]: 	Bealer K., & Madden T.L. (2008) 'BLAST+: architecture and applications.'
[INFO] [pyani.scripts.pyani_script]: 	BMC Bioinformatics 10:421.
[INFO] [pyani.scripts.pyani_script]: 	BLAST: Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J.,
[INFO] [pyani.scripts.pyani_script]: 	Zhang, Z., Miller, W. & Lipman, D.J. (1997) 'Gapped BLAST and PSI-BLAST:
[INFO] [pyani.scripts.pyani_script]: 	a new generation of protein database search programs.' Nucleic Acids Res.
[INFO] [pyani.scripts.pyani_script]: 	25:3389-3402
[INFO] [pyani.scripts.pyani_script]: 	Biopython: Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A,
[INFO] [pyani.scripts.pyani_script]: 	Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL
[INFO] [pyani.scripts.pyani_script]: 	(2009) Biopython: freely available Python tools for computational
[INFO] [pyani.scripts.pyani_script]: 	molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423
[INFO] [pyani.scripts.pyani_script]: 	fastANI: Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis K, and
[INFO] [pyani.scripts.pyani_script]: 	Aluru S (2018) 'High throughput ANI analysis of 90K prokaryotic
[INFO] [pyani.scripts.pyani_script]: 	genomes reveals clear species boundaries.' Nature Communications 9, 5114
[INFO] [pyani.scripts.subcommands.subcmd_download]: Downloading genomes from NCBI
[INFO] [pyani.scripts]: Creating output directory tax326423

You said there was an error - was there an error message, or did the tool seem to hang with no sign of any progress?

ChristophKnapp · 2025-01-15T12:32:44Z

Sorry, I did not realize that the log file does not contain the full output. It did not hang; it just says that it ignores all 4 genomes as RefSeq or GenBank. I have attached the full output.

pyANI_complete_out.txt

Thanks for the fast reply

Regards

Christoph

ChristophKnapp · 2025-01-15T12:42:35Z

I used the FTP URLs with wget and there was no problem downloading the files, so it's not a firewall issue.

ChristophKnapp · 2025-01-15T12:45:47Z

I also tried a different taxon ID, but the problem remains.

peterjc · 2025-01-15T13:03:54Z

The key part of that looks to me to be: urllib.error.URLError: <urlopen error ftp error: ConnectionRefusedError(111, 'Connection refused')> - you imply that you can access the FTP URL from the same machine with another tool fine though?

The latest version of the code (looking at the master branch since you didn't say which version you were using) does the FTP download with urllib.request.urlopen(...) from the Python standard library, but it does not look like it catches urllib.error.URLError to give a graceful failure message.
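As a rough illustration of what catching that exception could look like, here is a minimal sketch. It is not pyani's actual download code: the function name `fetch_with_retries` and its `opener` parameter are hypothetical, introduced only so the retry logic can be exercised without a network connection.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, retries=3, timeout=10, delay=1.0,
                       opener=urllib.request.urlopen):
    """Return the bytes at `url`, retrying when urlopen raises URLError.

    Hypothetical helper, not part of pyani. Raises RuntimeError with a
    readable message once every attempt has failed.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.URLError as exc:
            # URLError wraps lower-level failures such as "Connection refused"
            last_error = exc
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc.reason}")
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(
        f"Could not download {url} after {retries} attempts: {last_error}"
    )
```

Only `urllib.error.URLError` is handled here; a timeout that occurs while the response body is being read may surface as a different exception and would need its own handling.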

In the meantime, perhaps https://github.com/kblin/ncbi-genome-download or similar would get you going?

ChristophKnapp · 2025-01-16T07:46:20Z

The version is
pyani version: 0.3.0-alpha

> The key part of that looks to me: urllib.error.URLError: <urlopen error ftp error: ConnectionRefusedError(111, 'Connection refused')> - you imply that you can access the FTP URL from the same machine with another tool fine though?

Yes, I used wget with the url and it downloaded the file without problems.

widdowquinn · 2025-01-16T08:28:50Z

Hi both,

I've just run the exact same command from the train into work. The download seems to have completed fine, for me. This is from the log file:

[INFO] [pyani.scripts.pyani_script]: Processed arguments: Namespace(version=False, citation=False, logfile=PosixPath('pyANI_download.log'), verbose=True, debug=False, disable_tqdm=False, outdir=PosixPath('tax326423'), taxon='326423', email='christoph.knapp01@gmail.com', api_keypath=PosixPath('~/.ncbi/api_key'), retries=20, batchsize=10000, timeout=10, force=False, noclobber=False, labelfname='labels.txt', classfname='classes.txt', kraken=False, dryrun=False, func=<function subcmd_download at 0x172208860>)
[INFO] [pyani.scripts.pyani_script]: command-line: /Users/lpritc/opt/anaconda3/envs/pyani_py311/bin/pyani download -o tax326423 --email christoph.knapp01@gmail.com -t 326423 -v -l pyANI_download.log
[INFO] [pyani.scripts.pyani_script]: pyani version: 0.3.0-alpha
[...]
[INFO] [pyani.scripts.subcommands.subcmd_download]: Downloading genomes from NCBI
[INFO] [pyani.scripts]: Creating output directory tax326423
[INFO] [pyani.scripts.subcommands.subcmd_download]: Setting Entrez email address: christoph.knapp01@gmail.com
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/lpritc/.ncbi/api_key not a valid file. Not using API key.
[INFO] [pyani.scripts.subcommands.subcmd_download]: Taxon IDs received: ['326423']
[INFO] [pyani.scripts.subcommands.subcmd_download]: Downloading contigs for Taxon ID ['4693281', '36888']
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving eSummary information for UID 4693281
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving URLs for GCF_000015785.2_ASM1578v2
[INFO] [pyani.scripts.subcommands.subcmd_download]: MD5 hash check passed
[INFO] [pyani.scripts.subcommands.subcmd_download]: Label and class file entries
	Label: 02e4e33060a6b8c42b3c63600c354115	GCF_000015785.2_ASM1578v2_genomic	B. velezensis BGSC:10A6; DSM:23117
	Class: 02e4e33060a6b8c42b3c63600c354115	GCF_000015785.2_ASM1578v2_genomic	Bacillus velezensis
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving eSummary information for UID 36888
[INFO] [pyani.scripts.subcommands.subcmd_download]: Retrieving URLs for GCF_000015785.1_ASM1578v1
[INFO] [pyani.scripts.subcommands.subcmd_download]: MD5 hash check passed
[INFO] [pyani.scripts.subcommands.subcmd_download]: Label and class file entries
	Label: 3b7d1157eb13be26d8c67410b61a7188	GCF_000015785.1_ASM1578v1_genomic	B. velezensis FZB42
	Class: 3b7d1157eb13be26d8c67410b61a7188	GCF_000015785.1_ASM1578v1_genomic	Bacillus velezensis
[INFO] [pyani.scripts.subcommands.subcmd_download]: Writing classes file to tax326423/classes.txt
[INFO] [pyani.scripts.subcommands.subcmd_download]: Writing labels file to tax326423/labels.txt
[INFO] [pyani.scripts.pyani_script]: Completed. Time taken: 231.308

However, the first time I ran the command we went into a tunnel and I lost connection. This gave a URLError as might be expected for an interrupted connection.

Is the error you see persistent and reproducible, @ChristophKnapp? If it is, then as the code and command appear to work as intended on other machines and in other situations, I would suspect a local issue on that machine, but I don't have enough information to diagnose it, I'm afraid.

Cheers,

L.

ChristophKnapp · 2025-01-16T09:35:45Z

Thank you for your efforts. It fails consistently. This morning I tried the exact example from this repository, and it only partially downloaded the expected results. It also downloaded some genomes that are not part of the expected output; I assume this is because the taxon has changed over time. Nevertheless, I don't have MD5 files, FASTA files, or hash files for most of them, and the classes and labels files contain only 2 entries. The connection seems flaky, cutting in and out. I think I did not notice this before because the taxa I tried earlier were a lot smaller.

I think it has something to do with my setup here.
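With a partial download like the one described above, one way to tell complete files from truncated ones is to compare each file's MD5 digest against the expected checksum. The sketch below uses only the Python standard library; the helper names `md5_of_file` and `is_complete` are hypothetical and are not part of pyani, and where the expected checksum comes from is left to the caller.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_complete(path, expected_md5):
    """True if `path` exists and its MD5 matches `expected_md5` (hypothetical helper)."""
    path = Path(path)
    return path.is_file() and md5_of_file(path) == expected_md5.lower()
```

A file that is missing or was cut off mid-transfer makes `is_complete` return False, so it can be queued for another download attempt.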

peterjc · 2025-01-16T11:19:17Z

That is unfortunate. That it works sometimes, or at least for smaller genomes, does point to the connection being weak. I'm not sure there is much we can do - the only thing I can think of trying is switching from FTP to HTTPS, which many of the NCBI resources support.
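The FTP-to-HTTPS idea could be tried outside pyani by rewriting the URL scheme before downloading. This is only a sketch under the assumption that the same host and path are served over HTTPS; `ftp_to_https` is a hypothetical helper, and the path in the usage line is a made-up placeholder rather than a real file.

```python
from urllib.parse import urlsplit, urlunsplit


def ftp_to_https(url):
    """Return `url` with an ftp:// scheme swapped for https://.

    Hypothetical helper; URLs with any other scheme are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))


# Placeholder path, for illustration only
print(ftp_to_https("ftp://ftp.ncbi.nlm.nih.gov/genomes/all/example_genomic.fna.gz"))
```

Whether HTTPS is actually more reliable on a flaky connection would need testing on the affected machine.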

ChristophKnapp mentioned this issue Jan 21, 2025

At least one NUCmer comparison failed. Please investigate (exiting) #445

Closed
