Note: Start with the Tutorial Notebooks in the Tutorials folder here.
Multimodal Document Analysis with RAG and Code Execution: using Text, Images and Data Tables with GPT4-V, TaskWeaver, and Assistants API:
- The work focuses on processing multi-modal analytical documents by extracting text, images, and data tables to maximize data representation and information extraction, utilizing formats like Python code, Markdown, and Mermaid script for compatibility with GPT-4 models.
- Text is programmatically extracted from documents, processed to improve structure and tag extraction for better searchability, and numerical data is captured through generated Python code for later use.
- Images and data tables are processed to generate multiple text-based representations (including detailed text descriptions, Mermaid, and Python code for images, and various formats for tables) to ensure information is searchable and usable for calculations, forecasts, and applying machine learning models using Code Interpreter capabilities.
- With conventional techniques today, searching a knowledge base with RAG requires text to be extracted from documents, chunked, and stored in a vector database
- This process is currently concerned purely with text:
  - If the documents have any images, graphs or tables, these elements are usually either ignored or extracted as messy unstructured text
- Retrieving unstructured table data through RAG leads to answers with very low accuracy
- LLMs are usually very bad with numbers. If the query requires any sort of calculation, LLMs usually hallucinate or make basic math mistakes
- Ingest and interact with multi-modal analytics documents with lots of graphs, numbers and tables
- Extract structured information from document elements where this wasn't possible before:
  - Images
  - Graphs
  - Tables
- Use the Code Interpreter to formulate answers where calculations are needed based on search results
- Analyze Investment opportunity documents for Private Equity deals
- Analyze tables from tax documents for audit purposes
- Analyze financial statements and perform initial computations
- Analyze and interact with multi-modal Manufacturing documents
- Process academic and research papers
- Ingest and interact with textbooks, manuals and guides
- Analyze traffic and city planning documents
- GPT-4-Turbo is a great help with its large 128k token window
- GPT-4-Turbo with Vision is great at extracting tables from unstructured document formats
- GPT-4 models can understand a wide variety of formats (Python, Markdown, Mermaid, GraphViz DOT, etc.), which was essential in maximizing information extraction
- A new approach to vector index searching based on tags was needed because the Generation Prompts were very lengthy compared to the usual user queries
- The TaskWeaver and Assistants API Code Interpreters were introduced to answer open-ended analytics questions
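
The points above about tables and calculations can be illustrated with a minimal sketch. It assumes a table has already been extracted as Markdown (for example by GPT-4-Turbo with Vision), turns it into Python records, and computes a growth rate in code instead of asking the model to do the arithmetic. The table contents, the column names, and the helper `markdown_table_to_records` are hypothetical and not taken from this repo.

```python
def markdown_table_to_records(markdown: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    records = []
    for line in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        records.append(dict(zip(header, cells)))
    return records


# Hypothetical extracted table; the figures are made up for illustration.
EXTRACTED_TABLE = """
| Year | Revenue (USD m) |
|------|-----------------|
| 2021 | 120.0           |
| 2022 | 150.0           |
"""

records = markdown_table_to_records(EXTRACTED_TABLE)
revenue = {row["Year"]: float(row["Revenue (USD m)"]) for row in records}

# The calculation runs as code, so the result does not depend on LLM arithmetic.
growth_pct = (revenue["2022"] - revenue["2021"]) / revenue["2021"] * 100
print(f"Revenue growth 2021 -> 2022: {growth_pct:.1f}%")
```

In the actual solution this kind of code would be generated during ingestion and executed by one of the Code Interpreters; the sketch only shows why a code representation of a table is easier to compute with than free text.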
Please check our Enterprise Deployment guide for how to deploy this in a secure manner to a client's tenant. For local development or testing the solution, please use the tutorial notebooks or the Chainlit app described below.
Please start with the Tutorial notebooks here. These notebooks illustrate a series of concepts that have been used in this repo.
To run the web app locally, execute the following in your conda environment:
# cd into the app folder
cd app
# run the chainlit app
chainlit run test-app.py
- Configure your `.env` file properly. Refer to the `.env.sample` file included in this solution.
- Use `cmd index` to set the index name and the ingestion directory.
- Use `cmd upload` to upload the documents you need ingested. As of today, this solution works ONLY with PDF files.
- If the documents are large, you can try multi-threading by using `cmd threads`. This will use multiple Azure OpenAI resources in multiple regions to speed up the ingestion of the documents.
- Use `cmd ingest` to start the ingestion process. Please wait until the process is complete and confirmation that the document has been ingested is printed.
- Try different settings. For example, if this is a clean digital PDF (e.g. MS Word document saved as PDF), then for `text_processing` and `image_detection`, it is ok to leave their values as `PDF`. However, if this is a PDF of a PowerPoint presentation with lots of vector graphics in it, it's recommended that both of these settings are set to `GPT`, along with setting `OCR` to `True`.
- Then type any query in the input field to search the index. Choose your Code Interpreter, either `Taskweaver` or `AssistantsAPI`.
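
The settings guidance in the steps above can be summarized as a small lookup. This is only a sketch: the setting names and values (`text_processing`, `image_detection`, `OCR`, `PDF`, `GPT`, `True`) come from the text, while the function `suggest_ingestion_settings` and its `vector_heavy` flag are hypothetical. How the settings are actually passed to the app is not shown here.

```python
def suggest_ingestion_settings(vector_heavy: bool) -> dict:
    """Return suggested ingestion settings following the guidance above.

    vector_heavy=False: clean digital PDF (e.g. a Word document saved as PDF).
    vector_heavy=True: PDF with lots of vector graphics (e.g. exported slides).
    """
    if vector_heavy:
        return {"text_processing": "GPT", "image_detection": "GPT", "OCR": True}
    # For clean digital PDFs the guidance only covers these two settings.
    return {"text_processing": "PDF", "image_detection": "PDF"}


print(suggest_ingestion_settings(vector_heavy=True))
```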
Code Interpreters Available in this Solution:
- Assistants API: the default code interpreter. The OpenAI Assistants API is supported for now; the Azure version will follow once it is released.
- TaskWeaver: optional to install and use, and fully supported.
TaskWeaver requires Python >= 3.10. It can be installed by running the following commands from the project root folder. Please follow the commands below very carefully and start by creating a new conda environment:
# create the conda environment
conda create -n mmdoc python=3.10
# activate the conda environment
conda activate mmdoc
# install the project requirements
pip install -r requirements.txt
# clone the repository
git clone https://github.com/microsoft/TaskWeaver.git
# cd into Taskweaver
cd TaskWeaver
# install the Taskweaver requirements
pip install -r requirements.txt
# copy the Taskweaver project directory into the root folder and name it 'test_project'
cp -r project ../test_project/
Note: Inside the `test_project` directory, there's a file called `taskweaver_config.json` which needs to be populated. Please refer to the `taskweaver_config.sample.json` file in the root folder of this repo, fill in the Azure OpenAI model values for GPT-4-Turbo, rename it to `taskweaver_config.json`, and then copy it inside `test_project` (or overwrite the existing file).
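
The fill-in, rename, and copy step can also be scripted. The following is a minimal sketch, not part of this repo: the helper `write_taskweaver_config` is hypothetical, and the key names in the demo (`llm.api_key`, `llm.model`) are stand-ins, so use whatever keys the real `taskweaver_config.sample.json` contains. The demo runs against a temporary directory so that it does not touch any real files.

```python
import json
import tempfile
from pathlib import Path


def write_taskweaver_config(sample_path, project_dir, values):
    """Load the sample config, fill in `values`, and save it in the project folder."""
    config = json.loads(Path(sample_path).read_text(encoding="utf-8"))
    config.update(values)  # overwrite placeholder entries with real values
    target = Path(project_dir) / "taskweaver_config.json"
    target.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return target


# Demo in a temporary directory with a made-up sample file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "taskweaver_config.sample.json"
    sample.write_text(
        json.dumps({"llm.api_key": "<your-key>", "llm.model": "<your-deployment>"}),
        encoding="utf-8",
    )
    project = Path(tmp) / "test_project"
    project.mkdir()
    target = write_taskweaver_config(sample, project, {"llm.api_key": "example-key"})
    written = json.loads(target.read_text(encoding="utf-8"))

print(written)
```

For real use, the same function would be called with `taskweaver_config.sample.json` and `test_project` from the repo root, with the actual Azure OpenAI values.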
Note: Similarly, a number of test notebooks in this solution use Autogen. To experiment with Autogen, the file `OAI_CONFIG_LIST` in the `code` folder needs to be configured. Please refer to `OAI_CONFIG_LIST.sample`, populate it with the right values, and then rename it to `OAI_CONFIG_LIST`.
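
Before opening the Autogen notebooks, a quick sanity check of the populated file can catch obvious mistakes. The sketch below assumes the file follows the usual Autogen layout of a JSON list of objects that each carry a `model` field; check `OAI_CONFIG_LIST.sample` for the fields this repo actually expects. The function `check_config_list` and the example entries are hypothetical.

```python
import json


def check_config_list(text: str) -> list[str]:
    """Return a list of problems found in OAI_CONFIG_LIST-style JSON text."""
    try:
        entries = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(entries, list):
        return ["top level must be a JSON list"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i} is not an object")
        elif "model" not in entry:
            problems.append(f"entry {i} has no 'model' field")
    return problems


# Made-up example: the second entry is missing its model name.
example = '[{"model": "gpt-4", "api_key": "<your-key>"}, {"api_key": "<your-key>"}]'
print(check_config_list(example))
```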