pyani download is blocked if downloaded file cannot be uncompressed. #383
Comments
I think this is different to #70.
I believe this is also causing tests to fail, specifically these:
tests/test_subcmd_01_download.py::test_download_dry_run FAILED [ 70%]
tests/test_subcmd_01_download.py::test_download_c_blochmannia FAILED [ 71%]
tests/test_subcmd_01_download.py::test_download_kraken FAILED [ 72%]
dryrun_namespace = Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, classfname='classes.txt', disable_tqdm=True, dryr.../tmp/pytest-of-baileythegreen/pytest-17/test_download_dry_run0/C_blochmannia'), retries=20, taxon='203804', timeout=10)
def test_download_dry_run(dryrun_namespace):
"""Dry run of C. blochmannia download."""
> subcommands.subcmd_download(dryrun_namespace)
tests/test_subcmd_01_download.py:128:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyani/scripts/subcommands/subcmd_download.py:372: in subcmd_download
classes, labels, skippedlist = download_data(args, api_key, asm_dict)
pyani/scripts/subcommands/subcmd_download.py:124: in download_data
esummary, filestem = download.get_ncbi_esummary(
pyani/download.py:355: in get_ncbi_esummary
summary = entrez_esummary(
pyani/download.py:237: in wrapper
return Entrez.read(output, validate=False)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/__init__.py:508: in read
record = handler.read(handle)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:335: in read
self.parser.ParseFile(handle)
/opt/concourse/worker/volumes/live/b884be86-9a72-40c1-600c-116a7b9e8bbe/volume/python_1621446997202/work/Modules/pyexpat.c:407: in StartElement
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Bio.Entrez.Parser.DataHandler object at 0x127a6b280>, tag = 'eSummaryResult', attrs = {}
def handleMissingDocumentDefinition(self, tag, attrs):
"""Raise an Exception if neither a DTD nor an XML Schema is found."""
> raise ValueError(
"As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree."
)
E ValueError: As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree.
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:448: ValueError
--------------------------------------------------------- Captured stderr call ----------------------------------------------------------
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
----------------------------------------------------------- Captured log call -----------------------------------------------------------
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:356 Downloading genomes from NCBI
WARNING pyani.scripts.subcommands.subcmd_download:subcmd_download.py:360 Dry run only: will not overwrite or download
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:76 Setting Entrez email address: my.email@my.domain
WARNING pyani.scripts.subcommands.subcmd_download:subcmd_download.py:339 API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:316 Taxon IDs received: ['203804']
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:319 Taxon ID summary
Query: 203804
asm count: 9
UIDs: ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:117 Downloading contigs for Taxon ID ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:120 Retrieving eSummary information for UID 8228891
______________________________________________________ test_download_c_blochmannia ______________________________________________________
base_download_namespace = Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, classfname='classes.txt', disable_tqdm=True, dryr...ytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia'), retries=20, taxon='203804', timeout=10)
def test_download_c_blochmannia(base_download_namespace):
"""Test C. blochmannia download."""
> subcommands.subcmd_download(base_download_namespace)
tests/test_subcmd_01_download.py:133:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyani/scripts/subcommands/subcmd_download.py:372: in subcmd_download
classes, labels, skippedlist = download_data(args, api_key, asm_dict)
pyani/scripts/subcommands/subcmd_download.py:124: in download_data
esummary, filestem = download.get_ncbi_esummary(
pyani/download.py:355: in get_ncbi_esummary
summary = entrez_esummary(
pyani/download.py:237: in wrapper
return Entrez.read(output, validate=False)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/__init__.py:508: in read
record = handler.read(handle)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:335: in read
self.parser.ParseFile(handle)
/opt/concourse/worker/volumes/live/b884be86-9a72-40c1-600c-116a7b9e8bbe/volume/python_1621446997202/work/Modules/pyexpat.c:407: in StartElement
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Bio.Entrez.Parser.DataHandler object at 0x1275e7160>, tag = 'eSummaryResult', attrs = {}
def handleMissingDocumentDefinition(self, tag, attrs):
"""Raise an Exception if neither a DTD nor an XML Schema is found."""
> raise ValueError(
"As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree."
)
E ValueError: As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree.
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:448: ValueError
--------------------------------------------------------- Captured stderr call ----------------------------------------------------------
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
----------------------------------------------------------- Captured log call -----------------------------------------------------------
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:356 Downloading genomes from NCBI
INFO pyani.scripts:__init__.py:39 Creating output directory /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia
WARNING pyani.scripts:__init__.py:42 Output directory overwrite forced
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:76 Setting Entrez email address: my.email@my.domain
WARNING pyani.scripts.subcommands.subcmd_download:subcmd_download.py:339 API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:316 Taxon IDs received: ['203804']
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:319 Taxon ID summary
Query: 203804
asm count: 9
UIDs: ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:117 Downloading contigs for Taxon ID ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:120 Retrieving eSummary information for UID 8228891
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:139 eSummary information (GCF_014857065.1_ASM1485706v1):
Species Taxid: 2681987
TaxID: 2681987
Accession: GCF_014857065.1
Name: ASM1485706v1
Organism: Blochmannia endosymbiont of Colobopsis nipponica
Genus: Blochmannia
Species: endosymbiont of Colobopsis nipponica
Strain:
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:239 Retrieving URLs for GCF_014857065.1_ASM1485706v1
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:292 Downloaded from URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/857/065/GCF_014857065.1_ASM1485706v1/GCF_014857065.1_ASM1485706v1_genomic.fna.gz
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:293 Wrote assembly to: /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna.gz
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:294 Wrote MD5 hashes to: /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_hashes.txt
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:298 Local MD5 hash: fbd87dfdbb889fad197db147c90790f8
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:299 NCBI MD5 hash: fbd87dfdbb889fad197db147c90790f8
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:301 MD5 hash check passed
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:184 Extracting archive /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna.gz to /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:211 Creating local MD5 hash for /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:214 Writing hash to /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.md5
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:161 Label and class file entries
Label: fb08eedc0cf49e1cf44a95539ae4fd7c GCF_014857065.1_ASM1485706v1_genomic B. endosymbiont of Colobopsis nipponica
Class: fb08eedc0cf49e1cf44a95539ae4fd7c GCF_014857065.1_ASM1485706v1_genomic Blochmannia endosymbiont of Colobopsis nipponica
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:120 Retrieving eSummary information for UID 5431901
_________________________________________________________ test_download_kraken __________________________________________________________
kraken_namespace = Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, classfname='classes.txt', disable_tqdm=True, dryr.../private/tmp/pytest-of-baileythegreen/pytest-17/test_download_kraken0/kraken'), retries=20, taxon='203804', timeout=10)
def test_download_kraken(kraken_namespace):
"""C. blochmannia download in Kraken format."""
> subcommands.subcmd_download(kraken_namespace)
tests/test_subcmd_01_download.py:138:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyani/scripts/subcommands/subcmd_download.py:372: in subcmd_download
classes, labels, skippedlist = download_data(args, api_key, asm_dict)
pyani/scripts/subcommands/subcmd_download.py:124: in download_data
esummary, filestem = download.get_ncbi_esummary(
pyani/download.py:355: in get_ncbi_esummary
summary = entrez_esummary(
pyani/download.py:237: in wrapper
return Entrez.read(output, validate=False)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/__init__.py:508: in read
record = handler.read(handle)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:335: in read
self.parser.ParseFile(handle)
/opt/concourse/worker/volumes/live/b884be86-9a72-40c1-600c-116a7b9e8bbe/volume/python_1621446997202/work/Modules/pyexpat.c:407: in StartElement
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Bio.Entrez.Parser.DataHandler object at 0x1039aeb80>, tag = 'eSummaryResult', attrs = {}
def handleMissingDocumentDefinition(self, tag, attrs):
"""Raise an Exception if neither a DTD nor an XML Schema is found."""
> raise ValueError(
"As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree."
)
E ValueError: As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree.
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:448: ValueError
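The ValueError above is raised on the root tag 'eSummaryResult' because the response carried no DTD, and the message itself suggests ElementTree. A minimal sketch of how the raw response could be inspected that way, assuming the response text can be captured before it reaches Entrez.read. The helper name summarise_esummary, the example payload and the ERROR element name are illustrative assumptions; the real response from the failing run was not captured.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Union


def summarise_esummary(xml_text: str) -> Dict[str, Union[str, List[str]]]:
    """Report the root tag and any ERROR messages in an eSummary-style payload.

    Hypothetical helper for debugging only; it is not part of pyani or Biopython.
    """
    root = ET.fromstring(xml_text)
    # Assumption: error text, if any, sits in elements named ERROR.
    errors = [element.text or "" for element in root.iter("ERROR")]
    return {"root": root.tag, "errors": errors}


# Hypothetical payload standing in for a DTD-less response.
payload = "<eSummaryResult><ERROR>example error text</ERROR></eSummaryResult>"
print(summarise_esummary(payload))
```

Logging this summary when Entrez.read fails would show whether NCBI returned an error document rather than a normal eSummary record.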
I think we need to investigate why these tests are now failing due to the uncompression, when they were previously working. Is there something about the download that has changed?
If you are able to reproduce the failures, you are welcome to try. I no longer seem to be able to. I did find this GitHub issue, part of which seemed to indicate something like this could be caused by a temporary issue, but I can't say if that's what happened here. The traceback I copied above is the only example I have of those tests failing locally.
The DTD file issue is different (I've encountered it before). With the issue I originally raised, several |
On testing the above command again today (2022-03-15), the downloads proceed without error. I'm calling this a transitory issue, possibly a fault at NCBI's end, and closing the issue.
Summary:
pyani downloads are blocked if a downloaded file cannot be uncompressed.
Description:
pyani download sometimes retrieves corrupt compressed files from NCBI. If one of these throws an error with gunzip, the whole download halts. What should happen is that the error is noted and pyani continues with the remaining downloads.
Reproducible Steps:
Three attempts, same error:
Current Output:
Expected Output:
The equivalent of the below, for the downloaded genome:
pyani Version:
v0.3-alpha
Python Version:
3.9
Operating System:
macOS 12.2.1
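The behaviour requested under Description (note the error, then carry on with the remaining downloads) could look roughly like the sketch below. This is not pyani's code: extract_all is a hypothetical helper, it uses Python's gzip module rather than calling gunzip, and it assumes every archive name ends in .gz.

```python
import gzip
import logging
import shutil
import zlib
from pathlib import Path
from typing import Iterable, List, Tuple

logger = logging.getLogger(__name__)


def extract_all(archives: Iterable[Path]) -> Tuple[List[Path], List[Path]]:
    """Extract each .gz archive, skipping corrupt files instead of aborting.

    Returns (extracted_paths, skipped_archives).
    """
    extracted: List[Path] = []
    skipped: List[Path] = []
    for archive in archives:
        target = archive.with_suffix("")  # drop the trailing .gz
        try:
            with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
        except (OSError, EOFError, zlib.error) as exc:
            # gzip.BadGzipFile subclasses OSError; EOFError covers truncated
            # archives and zlib.error covers a damaged compressed stream.
            logger.warning("Could not extract %s (%s); skipping", archive, exc)
            target.unlink(missing_ok=True)  # remove any partial output (Python 3.8+)
            skipped.append(archive)
            continue
        extracted.append(target)
    return extracted, skipped
```

The skipped list gives the caller something to report at the end of the run, so a single bad archive is recorded rather than stopping every other download.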