diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 41670d73..6cc8ce0c 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -110,6 +110,8 @@ title: Smol Multimodal RAG, Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU - local: fine_tuning_vlm_dpo_smolvlm_instruct title: Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU + - local: structured_generation_vision_language_models + title: Structured Generation from Images or Documents Using Vision Language Models - title: Search Recipes isExpanded: false diff --git a/notebooks/en/index.md b/notebooks/en/index.md index c5f9a624..96a206b5 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -8,11 +8,10 @@ applications and solving various machine learning tasks using open-source tools Check out the recently added notebooks: - [Post-training an LLM using GRPO with TRL](fine_tuning_llm_grpo_trl) +- [Structured Generation from Images or Documents Using Vision Language Models](structured_generation_vision_language_models) - [Vector Search on Hugging Face with the Hub as Backend](vector_search_with_hub_as_backend) - [Multi-Agent Order Management System with MongoDB](mongodb_smolagents_multi_micro_agents) - [Scaling Test-Time Compute for Longer Thinking in LLMs](search_and_learn) -- [Signature-Aware Model Serving from MLflow with Ray Serve](mlflow_ray_serve) - You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). diff --git a/notebooks/en/structured_generation_vision_language_models.ipynb b/notebooks/en/structured_generation_vision_language_models.ipynb new file mode 100644 index 00000000..1548ba9a --- /dev/null +++ b/notebooks/en/structured_generation_vision_language_models.ipynb @@ -0,0 +1,402 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Structured Generation from Images or Documents Using Vision Language Models\n", + "\n", + "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents. We will run the VLM using the Hugging Face Transformers library and the [Outlines library](https://github.com/dottxt-ai/outlines), which facilitates structured generation by constraining token sampling probabilities. \n", + "\n", + "> This approach is based on an [Outlines tutorial](https://dottxt-ai.github.io/outlines/latest/cookbook/atomic_caption/).\n", + "\n", + "## Dependencies and imports\n", + "\n", + "First, let's install the necessary libraries." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install accelerate outlines transformers torch flash-attn datasets sentencepiece" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's continue with importing the necessary libraries." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import outlines\n", + "import torch\n", + "\n", + "from datasets import load_dataset\n", + "from outlines.models.transformers_vision import transformers_vision\n", + "from transformers import AutoModelForImageTextToText, AutoProcessor\n", + "from pydantic import BaseModel" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialising our model\n", + "\n", + "We will start by initialising our model from [HuggingFaceTB/SmolVLM-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct). 
Outlines expects us to pass in a model class and processor class, so we will make this example a bit more generic by creating a function that returns those. Alternatively, you could look at the model and tokenizer config within the [Hub repo files](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/tree/main), and import those classes directly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n", + "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n" + ] + } + ], + "source": [ + "model_name = \"HuggingFaceTB/SmolVLM-Instruct\"\n", + "\n", + "\n", + "def get_model_and_processor_class(model_name: str):\n", + "    model = AutoModelForImageTextToText.from_pretrained(model_name)\n", + "    processor = AutoProcessor.from_pretrained(model_name)\n", + "    classes = model.__class__, processor.__class__\n", + "    del model, processor\n", + "    return classes\n", + "\n", + "\n", + "model_class, processor_class = get_model_and_processor_class(model_name)\n", + "\n", + "if torch.cuda.is_available():\n", + "    device = \"cuda\"\n", + "elif torch.backends.mps.is_available():\n", + "    device = \"mps\"\n", + "else:\n", + "    device = \"cpu\"\n", + "\n", + "model = transformers_vision(\n", + "    model_name,\n", + "    model_class=model_class,\n", + "    device=device,\n", + "    model_kwargs={\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"},\n", + "    processor_kwargs={\"device\": device},\n", + "    processor_class=processor_class,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Structured Generation\n", + "\n", + "Now, we are going to define how the output of our model should be structured. We will be using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), which contains a set of images along with questions and their chosen and rejected responses. This is an okay dataset, but we want to create additional image-text-to-text data on top of the images to get our own structured dataset, and potentially fine-tune our model on it. We will use the model to generate a caption, a question and a simple quality tag for the image. " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "class ImageData(BaseModel):\n", + "    quality: str\n", + "    description: str\n", + "    question: str\n", + "\n", + "structured_generator = outlines.generate.json(model, ImageData)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's come up with an extraction prompt." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "prompt = \"\"\"\n", + "You are an image analysis assistant.\n", + "\n", + "Provide a quality tag, a description and a question.\n", + "\n", + "The quality can either be \"good\", \"okay\" or \"bad\".\n", + "The question should be concise and objective.\n", + "\n", + "Return your response as a valid JSON object.\n", + "\"\"\".strip()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's load our image dataset.\n",
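+    "\n",
+    "Before mapping over a whole dataset, it can be useful to sanity-check the structured generator on a single image. The snippet below is a minimal sketch rather than part of the original pipeline: it assumes `image` is any `PIL.Image` you already have, and that the Outlines model object exposes the underlying processor as `model.processor`.\n",
+    "\n",
+    "```python\n",
+    "# Minimal smoke test for the structured generator (assumes `image` is a PIL.Image).\n",
+    "messages = [\n",
+    "    {\"role\": \"user\", \"content\": [{\"type\": \"image\"}, {\"type\": \"text\", \"text\": prompt}]},\n",
+    "]\n",
+    "\n",
+    "# Format the prompt with the processor's chat template, then pass it to the generator together with the image.\n",
+    "formatted_prompt = model.processor.apply_chat_template(messages, add_generation_prompt=True)\n",
+    "result = structured_generator(formatted_prompt, [image])\n",
+    "print(result)  # an ImageData instance with quality, description and question fields\n",
+    "```"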
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + "    features: ['ds_name', 'image', 'question', 'chosen', 'rejected', 'origin_dataset', 'origin_split', 'idx', 'image_path'],\n", + "    num_rows: 10\n", + "})" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset = load_dataset(\"openbmb/RLAIF-V-Dataset\", split=\"train[:10]\")\n", + "dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's define a function that will extract the structured information from the image. We will format the prompt using the `apply_chat_template` method and then pass it to the model along with the image." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/davidberenstein/Documents/programming/huggingface/cookbook/.venv/lib/python3.11/site-packages/dill/_dill.py:414: PicklingWarning: Cannot locate reference to .\n", + " StockPickler.save(self, obj, save_persistent_id)\n", + "/Users/davidberenstein/Documents/programming/huggingface/cookbook/.venv/lib/python3.11/site-packages/dill/_dill.py:414: PicklingWarning: Cannot pickle : __main__.ImageData has recursive self-references that trigger a RecursionError.\n", + " StockPickler.save(self, obj, save_persistent_id)\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "e1d431b922334b0297195415a11cf68a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Map: 0%| | 0/10 [00:00" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The results are not perfect, but they are a good starting point to continue exploring with different models and prompts!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "We've seen how to extract structured information from images using a vision language model. We can apply the same extractive approach to documents, using something like `pdf2image` to convert each page of a PDF to an image and running information extraction on each page image.\n", + "\n", + "```python\n", + "from pdf2image import convert_from_path\n", + "\n", + "pdf_path = \"path/to/your/pdf/file.pdf\"\n", + "pages = convert_from_path(pdf_path)\n", + "extracted = [extract_objects(page, prompt) for page in pages]\n", + "```\n", + "\n", + "## Next Steps\n", + "\n", + "- Take a look at the [Outlines](https://github.com/dottxt-ai/outlines) library for more information on how to use it, and explore its different generation methods and parameters.\n", + "- Explore extraction on your own use case with your own model.\n", + "- Use a different method of extracting structured information from documents." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}