
Example .txt file decoding issue #88

Closed
shihabsalah opened this issue Oct 22, 2023 · 7 comments
Comments


shihabsalah commented Oct 22, 2023

UnicodeDecodeError when reading file in CP1252 encoding

Bug
When running the main.py script with the --archival_storage_files_compute_embeddings flag, I encountered a UnicodeDecodeError. The error message indicates that the 'charmap' codec can't decode byte 0x90 in position 422, which maps to an undefined character.

To Reproduce
Steps to reproduce the behavior:

  1. Run main.py with the following command:
    python main.py --archival_storage_files_compute_embeddings="memgpt/personas/examples/preload_archival/uber.txt" --persona=memgpt_doc --human=basic
  2. Answer 'y' when asked to compute embeddings.
  3. See error
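
For illustration, the failure can be reproduced entirely outside MemGPT (a standalone sketch; the file name and contents are made up): byte 0x90 is one of the five bytes cp1252 leaves undefined, and it occurs routinely inside UTF-8 multi-byte sequences.

```python
import os
import tempfile

# U+10348 encodes in UTF-8 as bytes f0 90 8d 88 -- 0x90 is undefined in cp1252.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write("Gothic letter: \U00010348".encode("utf-8"))

try:
    with open(path, "r", encoding="cp1252") as f:
        f.read()
except UnicodeDecodeError as e:
    print(e)  # 'charmap' codec can't decode byte 0x90 ...

with open(path, "r", encoding="utf-8") as f:
    text = f.read()  # decodes fine once the right codec is given
```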

Error

(memgpt) <project_path>\MemGPT>python main.py --archival_storage_files_compute_embeddings="<project_path>/memgpt/personas/examples/preload_archival/*.txt" --persona=memgpt_doc --human=basic
<project_path>\MemGPT\main.py:328: DeprecationWarning: There is no current event loop
  loop = asyncio.get_event_loop()
Running... [exit by typing '/exit']
Computing embeddings over 1 files. This will cost ~$0.03. Continue? [y/n] y
Traceback (most recent call last):
  File "<project_path>\MemGPT\main.py", line 331, in <module>
    app.run(run)
  File "<conda_env_path>\lib\site-packages\absl\app.py", line 308, in run
    _run_main(main, args)
  File "<conda_env_path>\lib\site-packages\absl\app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "<project_path>\MemGPT\main.py", line 329, in run
    loop.run_until_complete(main())
  File "<conda_env_path>\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "<project_path>\MemGPT\main.py", line 142, in main
    faiss_save_dir = await utils.prepare_archival_index_from_files_compute_embeddings(FLAGS.archival_storage_files_compute_embeddings)
  File "<project_path>\MemGPT\memgpt\utils.py", line 234, in prepare_archival_index_from_files_compute_embeddings
    archival_database = chunk_files(files, tkns_per_chunk, model)
  File "<project_path>\MemGPT\memgpt\utils.py", line 177, in chunk_files
    chunks = [c for c in chunk_file(file, tkns_per_chunk, model)]
  File "<project_path>\MemGPT\memgpt\utils.py", line 177, in <listcomp>
    chunks = [c for c in chunk_file(file, tkns_per_chunk, model)]
  File "<project_path>\MemGPT\memgpt\utils.py", line 141, in chunk_file
    lines = [l for l in read_in_chunks(f, tkns_per_chunk*4)]
  File "<project_path>\MemGPT\memgpt\utils.py", line 141, in <listcomp>
    lines = [l for l in read_in_chunks(f, tkns_per_chunk*4)]
  File "<project_path>\MemGPT\memgpt\utils.py", line 98, in read_in_chunks
    data = file_object.read(chunk_size)
  File "<conda_env_path>\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 422: character maps to <undefined>

Environment:

  • OS: Windows 11
  • Python version: 3.10.9
  • Conda version: 23.7.4

Additional context
The file I'm trying to process is encoded in UTF-8, but it seems like the script is trying to read it in CP1252 encoding. I've tried deleting and reinstalling the Conda environment with a different Python version, but the issue persists.

Any help would be appreciated!


shihabsalah commented Oct 22, 2023

Update:

I managed to resolve the UnicodeDecodeError by specifying `encoding="utf-8"` when opening the file, but I'm not sure this is the best way to fix the issue.

Here's the updated code snippet:

# memgpt/utils.py, chunk_file, around line 133
# Open the file with an explicit 'utf-8' encoding instead of the
# platform default (cp1252 on Windows):
file = open("your_file.txt", "r", encoding="utf-8")
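
A slightly more defensive variant (my own sketch, not MemGPT's actual code; `open_text` and the encoding list are made-up choices) would try UTF-8 first and only then fall back, so a legitimately cp1252-encoded file still loads:

```python
def open_text(path, try_encodings=("utf-8", "cp1252")):
    """Read a text file, trying each encoding in turn."""
    for enc in try_encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 but substitute U+FFFD for bad bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()
```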

@shihabsalah shihabsalah changed the title Example .txt file decoding Example .txt file decoding issue Oct 22, 2023

twobob commented Oct 27, 2023

Yup, same issue here.

Ideally the header

# -*- coding: utf-8 -*-

would be honoured.
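
For Python source files specifically, the standard library does honour that header: `tokenize.open()` reads the PEP 263 coding cookie and decodes accordingly. A standalone demo (note this only helps for `.py` files, not the `.txt` inputs at issue here):

```python
import os
import tempfile
import tokenize

# Write a .py file whose coding cookie declares cp1252, containing
# curly quotes that only round-trip under that encoding.
src_text = '# -*- coding: cp1252 -*-\ns = "\u201chi\u201d"\n'
path = os.path.join(tempfile.mkdtemp(), "demo.py")
with open(path, "wb") as f:
    f.write(src_text.encode("cp1252"))

# tokenize.open() detects the cookie and returns a correctly decoded stream.
with tokenize.open(path) as f:
    text = f.read()
```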

NOTE:

My ugly hack is now at line 152 (untested on PDF and CSV, YMMV):
with open(file, "r", encoding="utf-8") as f:



twobob commented Oct 27, 2023

import chardet

def detect_file_encoding(file_path):
    try:
        with open(file_path, 'rb') as f:
            result = chardet.detect(f.read())
            return result['encoding']
    except Exception as e:
        return f"Error: {str(e)}"

# Test it
print(detect_file_encoding("some_file.py"))
print(detect_file_encoding("some_file.csv"))

For PDF files, you would generally need to use a PDF parsing library like PyPDF2 to extract text content, and then you could try to guess the encoding, but it's a more complex process.

from PyPDF2 import PdfReader  # PdfFileReader/extractText are the legacy 1.x names
import chardet

def detect_pdf_encoding(pdf_path):
    try:
        reader = PdfReader(pdf_path)

        # Extract text from each page (extract_text can return None)
        text_content = "".join(page.extract_text() or "" for page in reader.pages)

        # Convert text to bytes, then detect encoding
        result = chardet.detect(text_content.encode())
        return result["encoding"]
    except Exception as e:
        return f"Error: {e}"

Untested pseudocode, but maybe something along those lines per file type.

@sarahwooders
Collaborator

FYI we are starting to migrate towards Llama Index for data ingestion (#146), so we can maybe use their PDF connector to resolve this (assuming it doesn't have the same issue).


twobob commented Oct 27, 2023

A quick eyeball says it might. I'd have to read around it.


twobob commented Oct 27, 2023

Honestly, encoding has been a thorn in the side of every language-processing project I've worked with. Given the inherent complexity of the actual edge cases (see https://github.com/karpathy/llama2.c/blob/d9862069e7ef665fe6309e3c17398ded2f121bf5/run.c#L487), the chardet approach is ubiquitous, and the pattern I posted is "the best option given all the other possible horrific options" ;)

Most likely a chardet pass or similar will end up in there somewhere, upstream or down.


vivi commented Nov 3, 2023

Can we close this now that we've moved to Llama Index? @shihabsalah try the new CLI workflow for loading data here:

https://github.com/cpacker/MemGPT#loading-data
