
Not able to pre-tokenize the input #15

Open
makeshn opened this issue May 13, 2020 · 1 comment

makeshn commented May 13, 2020

After downloading the OpenWebText corpus, I extracted it using the tar xvf openwebtext.tar.gz command. When I try running python -m lmtuners.utils.tokenize_and_cache_data data/ data_tokenized_128/ --tokenizer_path bert-base-uncased-vocab.txt --max_length=64, I get an error saying

skipping urlsf_subset16-730_data.xz
0 tokens, 0 examples: 12% 2423/20610 [00:01<00:09, 1893.57it/s]'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte

for every file. Could you please help me overcome this issue? @shoarora

shoarora (Owner) commented

@makeshn I'm not able to reproduce this right now, but I believe the OpenWebText corpus was compressed twice. I see you're running the script over the *.xz files. I think you should be able to decompress each *.xz file into a bunch of .txt files first.
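
A minimal sketch of that decompression step, assuming each urlsf_subset*_data.xz under data/ is an xz-compressed tar archive of .txt files (the 0xfd byte in the error is consistent with this, since it is the first byte of the xz magic number). The data/ and data_txt/ paths below are placeholders:

```python
# Sketch: unpack each *.xz archive in data/ into plain .txt files,
# assuming each .xz is an xz-compressed tar archive (i.e. the corpus was compressed twice).
import tarfile
from pathlib import Path

src_dir = Path("data")       # directory holding the urlsf_subset*_data.xz files
out_dir = Path("data_txt")   # destination for the extracted .txt files
out_dir.mkdir(exist_ok=True)

for xz_path in sorted(src_dir.glob("*.xz")):
    # mode "r:xz" lets tarfile handle the xz decompression itself
    with tarfile.open(xz_path, "r:xz") as tar:
        tar.extractall(out_dir)
```

Then point tokenize_and_cache_data at the directory of extracted .txt files (data_txt/ in this sketch) instead of data/.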
