
Not able to pre-tokenize the input #15

Open
makeshn opened this issue May 13, 2020 · 1 comment

makeshn commented May 13, 2020

After downloading the OpenWebText corpus, I extracted it using the tar xvf openwebtext.tar.gz command. When I try running python -m lmtuners.utils.tokenize_and_cache_data data/ data_tokenized_128/ --tokenizer_path bert-base-uncased-vocab.txt --max_length=64, I get an error saying

skipping urlsf_subset16-730_data.xz
0 tokens, 0 examples: 12% 2423/20610 [00:01<00:09, 1893.57it/s]'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte

for every file. Could you please help me overcome this issue? @shoarora

shoarora (Owner) commented

@makeshn I'm not able to reproduce this right now, but I believe the OpenWebText corpus was compressed twice. I see you're running the script over the *.xz files. I think you should be able to decompress each *.xz file into a bunch of .txt files first.
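
A minimal sketch of that decompression step, assuming each urlsf_subset*_data.xz under data/ is an xz-compressed tar archive of .txt files (the 0xfd byte in the error is consistent with this, since it is the first byte of the xz magic number). The data/ and data_txt/ paths below are placeholders:

```python
# Sketch: unpack each *.xz archive in data/ into plain .txt files,
# assuming each .xz is an xz-compressed tar archive (i.e. the corpus was compressed twice).
import tarfile
from pathlib import Path

src_dir = Path("data")       # directory holding the urlsf_subset*_data.xz files
out_dir = Path("data_txt")   # destination for the extracted .txt files
out_dir.mkdir(exist_ok=True)

for xz_path in sorted(src_dir.glob("*.xz")):
    # mode "r:xz" lets tarfile handle the xz decompression itself
    with tarfile.open(xz_path, "r:xz") as tar:
        tar.extractall(out_dir)
```

Then point tokenize_and_cache_data at the directory of extracted .txt files (data_txt/ in this sketch) instead of data/.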
