
Example .txt file decoding issue #88

Closed
shihabsalah opened this issue Oct 22, 2023 · 7 comments
Comments


shihabsalah commented Oct 22, 2023

UnicodeDecodeError when reading file in CP1252 encoding

Bug
When running the main.py script with the --archival_storage_files_compute_embeddings flag, I encountered a UnicodeDecodeError. The error message indicates that the 'charmap' codec can't decode byte 0x90 in position 422, which maps to an undefined character.

To Reproduce
Steps to reproduce the behavior:

  1. Run main.py with the following command:
    python main.py --archival_storage_files_compute_embeddings="memgpt/personas/examples/preload_archival/uber.txt" --persona=memgpt_doc --human=basic
  2. Answer 'y' when asked to compute embeddings.
  3. See error
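
For illustration, the failure can be reproduced entirely outside MemGPT (a standalone sketch; the file name and contents are made up): byte 0x90 is one of the five bytes cp1252 leaves undefined, and it occurs routinely inside UTF-8 multi-byte sequences.

```python
import os
import tempfile

# U+10348 encodes in UTF-8 as bytes f0 90 8d 88 -- 0x90 is undefined in cp1252.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write("Gothic letter: \U00010348".encode("utf-8"))

try:
    with open(path, "r", encoding="cp1252") as f:
        f.read()
except UnicodeDecodeError as e:
    print(e)  # 'charmap' codec can't decode byte 0x90 ...

with open(path, "r", encoding="utf-8") as f:
    text = f.read()  # decodes fine once the right codec is given
```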

Error

(memgpt) <project_path>\MemGPT>python main.py --archival_storage_files_compute_embeddings="<project_path>/memgpt/personas/examples/preload_archival/*.txt" --persona=memgpt_doc --human=basic
<project_path>\MemGPT\main.py:328: DeprecationWarning: There is no current event loop
  loop = asyncio.get_event_loop()
Running... [exit by typing '/exit']
Computing embeddings over 1 files. This will cost ~$0.03. Continue? [y/n] y
Traceback (most recent call last):
  File "<project_path>\MemGPT\main.py", line 331, in <module>
    app.run(run)
  File "<conda_env_path>\lib\site-packages\absl\app.py", line 308, in run
    _run_main(main, args)
  File "<conda_env_path>\lib\site-packages\absl\app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "<project_path>\MemGPT\main.py", line 329, in run
    loop.run_until_complete(main())
  File "<conda_env_path>\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "<project_path>\MemGPT\main.py", line 142, in main
    faiss_save_dir = await utils.prepare_archival_index_from_files_compute_embeddings(FLAGS.archival_storage_files_compute_embeddings)
  File "<project_path>\MemGPT\memgpt\utils.py", line 234, in prepare_archival_index_from_files_compute_embeddings
    archival_database = chunk_files(files, tkns_per_chunk, model)
  File "<project_path>\MemGPT\memgpt\utils.py", line 177, in chunk_files
    chunks = [c for c in chunk_file(file, tkns_per_chunk, model)]
  File "<project_path>\MemGPT\memgpt\utils.py", line 177, in <listcomp>
    chunks = [c for c in chunk_file(file, tkns_per_chunk, model)]
  File "<project_path>\MemGPT\memgpt\utils.py", line 141, in chunk_file
    lines = [l for l in read_in_chunks(f, tkns_per_chunk*4)]
  File "<project_path>\MemGPT\memgpt\utils.py", line 141, in <listcomp>
    lines = [l for l in read_in_chunks(f, tkns_per_chunk*4)]
  File "<project_path>\MemGPT\memgpt\utils.py", line 98, in read_in_chunks
    data = file_object.read(chunk_size)
  File "<conda_env_path>\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 422: character maps to <undefined>

Environment:

  • OS: Windows 11
  • Python version: 3.10.9
  • Conda version: 23.7.4

Additional context
The file I'm trying to process is encoded in UTF-8, but it seems like the script is trying to read it in CP1252 encoding. I've tried deleting and reinstalling the Conda environment with a different Python version, but the issue persists.

Any help would be appreciated!


shihabsalah commented Oct 22, 2023

Update:

I managed to resolve the UnicodeDecodeError by specifying `encoding="utf-8"` when opening the file, but I'm not sure this is the best way to fix the issue.

Here's the updated code snippet:

# memgpt/utils.py, chunk_file, around line 133
# Open the file with an explicit 'utf-8' encoding instead of the
# platform default (cp1252 on Windows):
file = open("your_file.txt", "r", encoding="utf-8")
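
A slightly more defensive variant (my own sketch, not MemGPT's actual code; `open_text` and the encoding list are made-up choices) would try UTF-8 first and only then fall back, so a legitimately cp1252-encoded file still loads:

```python
def open_text(path, try_encodings=("utf-8", "cp1252")):
    """Read a text file, trying each encoding in turn."""
    for enc in try_encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 but substitute U+FFFD for bad bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()
```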

@shihabsalah shihabsalah changed the title Example .txt file decoding Example .txt file decoding issue Oct 22, 2023

twobob commented Oct 27, 2023

Yup, same issue here.

Ideally the header

# -*- coding: utf-8 -*-

would be honoured.
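
For Python source files specifically, the standard library does honour that header: `tokenize.open()` reads the PEP 263 coding cookie and decodes accordingly. A standalone demo (note this only helps for `.py` files, not the `.txt` inputs at issue here):

```python
import os
import tempfile
import tokenize

# Write a .py file whose coding cookie declares cp1252, containing
# curly quotes that only round-trip under that encoding.
src_text = '# -*- coding: cp1252 -*-\ns = "\u201chi\u201d"\n'
path = os.path.join(tempfile.mkdtemp(), "demo.py")
with open(path, "wb") as f:
    f.write(src_text.encode("cp1252"))

# tokenize.open() detects the cookie and returns a correctly decoded stream.
with tokenize.open(path) as f:
    text = f.read()
```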

NOTE:

My ugly hack is now at line 152 (untested on PDF and CSV, YMMV):
with open(file, "r", encoding="utf-8") as f:



twobob commented Oct 27, 2023

import chardet

def detect_file_encoding(file_path):
    try:
        with open(file_path, 'rb') as f:
            result = chardet.detect(f.read())
            return result['encoding']
    except Exception as e:
        return f"Error: {str(e)}"

# Test it
print(detect_file_encoding("some_file.py"))
print(detect_file_encoding("some_file.csv"))

For PDF files, you would generally need to use a PDF parsing library like PyPDF2 to extract text content, and then you could try to guess the encoding, but it's a more complex process.

from PyPDF2 import PdfReader  # PdfFileReader/extractText are the legacy 1.x names
import chardet

def detect_pdf_encoding(pdf_path):
    try:
        reader = PdfReader(pdf_path)

        # Extract text from each page (extract_text can return None)
        text_content = "".join(page.extract_text() or "" for page in reader.pages)

        # Convert text to bytes, then detect encoding
        result = chardet.detect(text_content.encode())
        return result["encoding"]
    except Exception as e:
        return f"Error: {e}"

Untested pseudocode, but maybe something along those lines per file type.

@sarahwooders
Collaborator

FYI we are starting to migrate towards Llama Index for data ingestion (#146), so we can maybe use their PDF connector to resolve this (assuming it doesn't have the same issue).


twobob commented Oct 27, 2023

A quick eyeball says it might. I'd have to read around it.


twobob commented Oct 27, 2023

Honestly, encoding has been a thorn in the side of every language-processing project I've worked with. Given the inherent complexity of the actual edge cases (see https://github.com/karpathy/llama2.c/blob/d9862069e7ef665fe6309e3c17398ded2f121bf5/run.c#L487), the chardet approach is ubiquitous, and the pattern I posted is "the best option given all the other possible horrific options" ;)

Most likely a chardet pass or similar will end up in there somewhere, upstream or down.


vivi commented Nov 3, 2023

Can we close this now that we've moved to Llama Index? @shihabsalah try the new CLI workflow for loading data here:

https://github.com/cpacker/MemGPT#loading-data
