Example .txt file decoding issue #88
Comments
Update: I managed to resolve the issue by opening the file with UTF-8 encoding explicitly. Here's the updated code snippet:

# memgpt/utils.py, chunk_file (line 133)
# Open the file with 'utf-8' encoding
file = open("your_file.txt", "r", encoding="utf-8")
For PDF files, you would generally need to use a PDF parsing library like PyPDF2 to extract text content, and then you could try to guess the encoding, but it's a more complex process.
Untested pseudocode, but maybe something along those lines per file type.
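A rough illustration of that per-type idea (my own sketch, not the original pseudocode; it assumes PyPDF2 3.x is installed, and the helper name `load_document` is hypothetical):

```python
from PyPDF2 import PdfReader  # PDF parsing library mentioned above

def load_document(path: str) -> str:
    """Hypothetical helper: dispatch on file type before decoding."""
    if path.lower().endswith(".pdf"):
        # PDFs are binary containers; extract text page by page instead of
        # decoding the raw bytes directly.
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    # Plain text: open explicitly as UTF-8 rather than the platform default.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```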
FYI we are starting to migrate towards Llama Index for data ingestion (#146), so we can maybe use their PDF connector to resolve this (assuming it doesn't have the same issue).
Honestly, the encoding thing has been a thorn in the side of every language processing project I've worked on. The inherent complexity of the actual edge cases (see https://github.com/karpathy/llama2.c/blob/d9862069e7ef665fe6309e3c17398ded2f121bf5/run.c#L487) is why the chardet option is so ubiquitous, and why the pattern I posted is "a best option given all the other possible horrific options" ;) Most likely a chardet pass or similar will end up in there somewhere, upstream or down.
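For reference, the usual chardet fallback pattern looks roughly like this (a sketch under the assumption that chardet is installed, not code from the repo):

```python
import chardet

def read_text_with_fallback(path: str) -> str:
    """Try UTF-8 first; fall back to whatever encoding chardet guesses."""
    raw = open(path, "rb").read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # chardet.detect returns e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
        guess = chardet.detect(raw).get("encoding") or "latin-1"
        return raw.decode(guess, errors="replace")
```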
Can we close this now that we've moved to Llama Index? @shihabsalah try the new CLI workflow for loading data here: |
UnicodeDecodeError when reading file in CP1252 encoding
Bug
When running the `main.py` script with the `--archival_storage_files_compute_embeddings` flag, I encountered a `UnicodeDecodeError`. The error message indicates that the 'charmap' codec can't decode byte 0x90 in position 422, which maps to an undefined character.

To Reproduce
Steps to reproduce the behavior:
1. Run `main.py` with the following command: `python main.py --archival_storage_files_compute_embeddings="memgpt/personas/examples/preload_archival/uber.txt" --persona=memgpt_doc --human=basic`
Error
Environment:
Additional context
The file I'm trying to process is encoded in UTF-8, but it seems like the script is trying to read it in CP1252 encoding. I've tried deleting and reinstalling the Conda environment with a different Python version, but the issue persists.
Any help would be appreciated!
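A likely explanation for the mismatch, not stated explicitly in the thread: in text mode, Python's `open()` without an explicit `encoding` falls back to the locale's preferred encoding, which on Windows is typically cp1252, and byte 0x90 has no assigned character in cp1252, which produces exactly the 'charmap' error above. A minimal, illustrative demonstration:

```python
# Illustrative only (not from the issue): UTF-8 bytes containing 0x90 cannot be
# decoded as cp1252, because 0x90 is unassigned in that code page.
utf8_bytes = "Архив".encode("utf-8")  # Cyrillic 'А' (U+0410) encodes as 0xD0 0x90

try:
    utf8_bytes.decode("cp1252")
except UnicodeDecodeError as err:
    print(err)  # 'charmap' codec can't decode byte 0x90 in position 1: ...

print(utf8_bytes.decode("utf-8"))  # decoding with the correct codec succeeds
```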