load_train_test(): UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte #20

Closed
pbenner opened this issue Apr 30, 2023 · 1 comment · Fixed by #21
Labels
bug Something isn't working

Comments

Collaborator

pbenner commented Apr 30, 2023

Running the following script fails:

>>> from matbench_discovery.data import load_train_test
>>> load_train_test('mp_computed_structure_entries')
Downloading 'mp_computed_structure_entries' from https://figshare.com/ndownloader/files/40344436
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pbenner/Source/tmp/matbench-discovery/matbench_discovery/data.py", line 95, in load_train_test
    df = reader(url)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 733, in read_json
    json_reader = JsonReader(
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 819, in __init__
    self.data = self._preprocess_data(data)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/pandas/io/json/_json.py", line 831, in _preprocess_data
    data = data.read()
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Downloading works only for the 'wbm_summary' and 'mp_energies' files.

janosh added the bug label Apr 30, 2023
Owner

janosh commented Apr 30, 2023

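Ah, that error occurs because pandas can't infer that the file is compressed JSON: we only pass it a Figshare URL (https://figshare.com/ndownloader/files/40344436), which has no file extension to infer the compression from. pandas then tries to decode the compressed bytes as UTF-8, and byte 0x8b at position 1 matches the gzip magic number (0x1f 0x8b). Should be easy to fix.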
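For anyone hitting the same error, here is a minimal stdlib-only sketch of the failure and one possible workaround. It is not the fix that went into the repo. The helper name `load_json_bytes` and the sample payload (`material_id`, `mp-1`, `mp-2`) are made up for illustration, and the bytes are generated locally rather than downloaded from Figshare.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def load_json_bytes(raw: bytes):
    """Parse JSON from raw bytes, decompressing first if they look gzipped.

    Hypothetical helper for illustration, not part of matbench_discovery.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# Stand-in for the bytes a download would return (made-up data, no network).
payload = gzip.compress(json.dumps({"material_id": ["mp-1", "mp-2"]}).encode("utf-8"))

# Decoding the compressed bytes directly reproduces the reported error:
# 0x1f is valid ASCII, but 0x8b cannot start a UTF-8 sequence.
try:
    payload.decode("utf-8")
    error_message = ""
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Sniffing the magic bytes first lets both compressed and plain JSON load.
data = load_json_bytes(payload)
plain = load_json_bytes(b'{"a": 1}')
```

When reading with pandas instead, passing the compression explicitly, e.g. `pd.read_json(url, compression="gzip")`, should avoid relying on the URL's file extension, assuming the file really is gzip-compressed.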

janosh closed this as completed in 5d7c620 Apr 30, 2023
janosh added a commit that referenced this issue Jun 20, 2023
* fix load_train_test() for compressed figshare data (closes #20)

* load_train_test() only accept answer 'y' or 'n' (as orig intended) (close #17)

* add test covering load_train_test() with compressed JSON file from URL

* mv run-scripts.yml test-scripts.yml

* add slow-tests.yml for running slow tests only on PR merges (to save CI budget)