From 70c7f74cc8fac8bd39a5bee7dee0ac2619109da8 Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Fri, 24 Jan 2025 12:41:54 +0100 Subject: [PATCH 1/7] Enhance documentation and add new notebook for structured generation using Vision Language Models - Updated the table of contents to include a new section for "Structured Generation from Documents Using Vision Language Models". - Added a new Jupyter notebook that demonstrates how to extract structured information from documents using the SmolVLM-500M-Instruct model, including installation instructions, model initialization, and example usage. --- notebooks/en/_toctree.yml | 4 +- ...eneration-vision-languag-models-copy.ipynb | 244 ++++++++++++++++++ 2 files changed, 247 insertions(+), 1 deletion(-) create mode 100644 notebooks/en/structured-generation-vision-languag-models-copy.ipynb diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 9ea0410f..33181565 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -72,7 +72,7 @@ title: Phoenix Observability Dashboard on HF Spaces - local: search_and_learn title: Scaling Test-Time Compute for Longer Thinking in LLMs - + - title: Computer Vision Recipes isExpanded: false sections: @@ -108,6 +108,8 @@ title: Smol Multimodal RAG, Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU - local: fine_tuning_vlm_dpo_smolvlm_instruct title: Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU + - local: structured_generation_vision_languag_models + title: Structured Generation from Documents Using Vision Language Models - title: Search Recipes isExpanded: false diff --git a/notebooks/en/structured-generation-vision-languag-models-copy.ipynb b/notebooks/en/structured-generation-vision-languag-models-copy.ipynb new file mode 100644 index 00000000..2188e86b --- /dev/null +++ b/notebooks/en/structured-generation-vision-languag-models-copy.ipynb @@ -0,0 +1,244 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Structured Generation from Documents Using Vision Language Models\n", + "\n", + "We will be using the SmolVLM-500M-Instruct model from HuggingFaceTB to extract structured information from documents. We will do so using the HuggingFace Transformers library and the Outlines library, which facilitates structured generation based on limiting token sampling probabilities. We will also use the Gradio library to create a simple UI for uploading and extracting structured information from documents.\n", + "\n", + "## Dependencies and imports\n", + "\n", + "First, let's install the necessary libraries." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install outlines transformers torch flash-attn outlines datasets sentencepiece gradio" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's continue with importing the necessary libraries." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import outlines\n", + "import torch\n", + "\n", + "from io import BytesIO\n", + "from urllib.request import urlopen\n", + "from PIL import Image\n", + "from outlines.models.transformers_vision import transformers_vision\n", + "from transformers import AutoModelForImageTextToText, AutoProcessor\n", + "from pydantic import BaseModel, Field\n", + "from typing import List\n", + "from enum import StrEnum" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialising our model\n", + "\n", + "We will start by initialising our model from [HuggingFaceTB/SmolVLM-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct). Outlines expects us to pass in a model class and processor class, so we will make this example a bit more generic by creating a function that returns those. Alternatively, you could look at the model and tokenizer config within the [Hub repo files](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/tree/main), and import those classes directly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "model_name = \"HuggingFaceTB/SmolVLM-Instruct\" # original magnet model is able to be loaded without issue\n", + "\n", + "\n", + "def get_model_and_processor_class(model_name: str):\n", + " model = AutoModelForImageTextToText.from_pretrained(model_name)\n", + " processor = AutoProcessor.from_pretrained(model_name)\n", + " classes = model.__class__, processor.__class__\n", + " del model, processor\n", + " return classes\n", + "\n", + "\n", + "model_class, processor_class = get_model_and_processor_class(model_name)\n", + "\n", + "if torch.cuda.is_available():\n", + " device = \"cuda\"\n", + "elif torch.backends.mps.is_available():\n", + " device = \"mps\"\n", + "else:\n", + " device = \"cpu\"\n", + "\n", + "model = transformers_vision(\n", + " model_name,\n", + " model_class=model_class,\n", + " device=device,\n", + " model_kwargs={\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"},\n", + " processor_kwargs={\"device\": device},\n", + " processor_class=processor_class,\n", + ")\n", + "model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we are going to define a function that will define how the output of our model will be structured. We will want to extract Tags for object in the image along with a string and a confidence score." + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "metadata": {}, + "outputs": [], + "source": [ + "class TagType(StrEnum):\n", + " ENTITY = \"Entity\"\n", + " RELATIONSHIP = \"Relationship\"\n", + " STYLE = \"Style\"\n", + " ATTRIBUTE = \"Attribute\"\n", + " COMPOSITION = \"Composition\"\n", + " CONTEXTUAL = \"Contextual\"\n", + " TECHNICAL = \"Technical\"\n", + " SEMANTIC = \"Semantic\"\n", + "\n", + "class ImageTag(BaseModel):\n", + " tag_name: str\n", + " tag_description: str\n", + " tag_type: TagType\n", + " confidence_score: float\n", + "\n", + "\n", + "class ImageData(BaseModel):\n", + " tags_list: List[ImageTag] = Field(min_items=1)\n", + " short_caption: str\n", + "\n", + "\n", + "image_objects_generator = outlines.generate.json(model, ImageData)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's come up with an extraction prompt. 
We want to extract tags for the objects in the image (each with a name, description, type, and confidence score) and give the model some guidance about the different tag types and the expected structure."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 96,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "prompt = \"\"\"\n",
+    "You are a structured image analysis assistant. Generate a comprehensive tag list for an image classification system. Use at least 1 tag per type. Return the results as a valid JSON object.\n",
+    "\"\"\".strip()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 95,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "ImageData(tags_list=[ImageTag(tag_name='spacecraft', tag_description='You are an EVA astronaut standing on the moon', tag_type=, confidence_score=0.9471130702150571), ImageTag(tag_name='tire track', tag_description='You think tike this used to lead your way here', tag_type=, confidence_score=1.0), ImageTag(tag_name='space helmet', tag_description='Ozone spacesuit with white metal visor', tag_type=, confidence_score=0.9737292349276361), ImageTag(tag_name='space suit', tag_description='White Astronaut', tag_type=, confidence_score=0.9749979480665247), ImageTag(tag_name='astronaut', tag_description='Astronaut', tag_type=, confidence_score=0.8412833526756263)], short_caption=\"An astronaut from space sits on the lunar surface at around 200 feet below him, over a tan lunar ground with bays leading to his original path and some rocks oncrete having a shiny armor. Both left and right have a sphere that is used for eyes and protection. Left is wearing a baseball with playing field across, and other articles, the heavy one having a shiny metal visor drum on top. 
We can apply the same approach to full documents, using something like `pdf2image` to convert each page of a PDF to an image and running the information extraction on each page image.\n",
+    "\n",
+    "```python\n",
+    "from pdf2image import convert_from_path\n",
+    "\n",
+    "pdf_path = \"path/to/your/pdf/file.pdf\"\n",
+    "pages = convert_from_path(pdf_path)\n",
+    "for page in pages:\n",
+    "    page_objects = extract_objects(page, prompt)\n",
+    "```\n",
+    "\n",
+    "## Next Steps\n",
+    "\n",
+    "- Take a look at the [Outlines](https://github.com/outlines-ai/outlines) library for more information on how to use it. Explore the different methods and parameters.\n",
+    "- Explore extraction on your own usecase.\n",
+    "- Use a different method of extracting structured information from documents."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

From 8d35eddc7f84a5606f16088984581a1f1c32f07f Mon Sep 17 00:00:00 2001
From: davidberenstein1957
Date: Fri, 24 Jan 2025 12:42:06 +0100
Subject: [PATCH 2/7] Add new Jupyter notebook for structured generation using Vision Language Models

- Introduced a new notebook demonstrating the extraction of structured information from documents using the SmolVLM-500M-Instruct model.
- Included installation instructions, model initialization, and example usage with a focus on generating structured tags and confidence scores from images.
- Added detailed markdown explanations for each step of the process.
---
 ....ipynb => structured_generation_vision_languag_models.ipynb} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
 rename notebooks/en/{structured-generation-vision-languag-models-copy.ipynb => structured_generation_vision_languag_models.ipynb} (98%)

diff --git a/notebooks/en/structured-generation-vision-languag-models-copy.ipynb b/notebooks/en/structured_generation_vision_languag_models.ipynb
similarity index 98%
rename from notebooks/en/structured-generation-vision-languag-models-copy.ipynb
rename to notebooks/en/structured_generation_vision_languag_models.ipynb
index 2188e86b..bd4a0589 100644
--- a/notebooks/en/structured-generation-vision-languag-models-copy.ipynb
+++ b/notebooks/en/structured_generation_vision_languag_models.ipynb
@@ -19,7 +19,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install outlines transformers torch flash-attn outlines datasets sentencepiece gradio"
+    "%pip install accelerate outlines transformers torch flash-attn outlines datasets sentencepiece gradio"
    ]
   },
   {

From b2402e2abbba377deac93360756720aa409c3327 Mon Sep 17 00:00:00 2001
From: davidberenstein1957
Date: Fri, 24 Jan 2025 12:47:05 +0100
Subject: [PATCH 3/7] Refactor Jupyter notebook for structured generation using Vision Language Models

- Updated the description to clarify the use of the SmolVLM-Instruct model and its integration with the HuggingFace Transformers and Outlines libraries.
- Added a reference to an outlines tutorial for better guidance.
- Modified the installation command to remove the Gradio library, streamlining the dependencies.
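For reference, here is a slightly fuller sketch of the page-by-page extraction loop from the notebook's conclusion above. It assumes `pdf2image` (and its system-level poppler dependency) is installed, and it reuses the notebook's own `extract_objects` function and `prompt` string; the path and the `dpi` value are placeholders.

```python
from pdf2image import convert_from_path

pdf_path = "path/to/your/pdf/file.pdf"  # placeholder path
pages = convert_from_path(pdf_path, dpi=200)  # one PIL.Image per page

# Collect one structured result per page instead of overwriting the function name.
page_results = []
for page in pages:
    page_results.append(extract_objects(page, prompt))

print(f"Extracted structured data for {len(page_results)} pages")
```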
--- .../en/structured_generation_vision_languag_models.ipynb | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/notebooks/en/structured_generation_vision_languag_models.ipynb b/notebooks/en/structured_generation_vision_languag_models.ipynb index bd4a0589..3e3b331d 100644 --- a/notebooks/en/structured_generation_vision_languag_models.ipynb +++ b/notebooks/en/structured_generation_vision_languag_models.ipynb @@ -6,7 +6,8 @@ "source": [ "# Structured Generation from Documents Using Vision Language Models\n", "\n", - "We will be using the SmolVLM-500M-Instruct model from HuggingFaceTB to extract structured information from documents. We will do so using the HuggingFace Transformers library and the Outlines library, which facilitates structured generation based on limiting token sampling probabilities. We will also use the Gradio library to create a simple UI for uploading and extracting structured information from documents.\n", + "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents We will run the VLM using the HuggingFace Transformers library and the Outlines library, which facilitates structured generation based on limiting token sampling probabilities. \n", + "This approach is based on a [outlines tutorial](https://dottxt-ai.github.io/outlines/latest/cookbook/atomic_caption/) library.\n", "\n", "## Dependencies and imports\n", "\n", @@ -19,7 +20,7 @@ "metadata": {}, "outputs": [], "source": [ - "%pip install accelerate outlines transformers torch flash-attn outlines datasets sentencepiece gradio" + "%pip install accelerate outlines transformers torch flash-attn outlines datasets sentencepiece" ] }, { From 1a4a376eeb1286b82e8b0bab9c987dc35a36f861 Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Wed, 29 Jan 2025 09:58:40 +0100 Subject: [PATCH 4/7] Enhance structured generation notebook with dataset generation and synthetic data extraction - Updated notebook to use the RLAIF-V-Dataset for structured information extraction - Implemented a function to generate synthetic questions, descriptions, and quality tags for images - Added code to push the augmented dataset to the Hugging Face Hub - Simplified the notebook's imports and removed unused code - Updated markdown sections to provide clearer context and explanation --- ...red_generation_vision_languag_models.ipynb | 263 ++++++++++++++---- 1 file changed, 205 insertions(+), 58 deletions(-) diff --git a/notebooks/en/structured_generation_vision_languag_models.ipynb b/notebooks/en/structured_generation_vision_languag_models.ipynb index 3e3b331d..18a95213 100644 --- a/notebooks/en/structured_generation_vision_languag_models.ipynb +++ b/notebooks/en/structured_generation_vision_languag_models.ipynb @@ -6,8 +6,9 @@ "source": [ "# Structured Generation from Documents Using Vision Language Models\n", "\n", - "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents We will run the VLM using the HuggingFace Transformers library and the Outlines library, which facilitates structured generation based on limiting token sampling probabilities. 
\n", - "This approach is based on a [outlines tutorial](https://dottxt-ai.github.io/outlines/latest/cookbook/atomic_caption/) library.\n", + "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents We will run the VLM using the HuggingFace Transformers library and the [Outlines library](https://github.com/dottxt-ai/outlines), which facilitates structured generation based on limiting token sampling probabilities. \n", + "\n", + "> This approach is based on a [Outlines tutorial](https://dottxt-ai.github.io/outlines/latest/cookbook/atomic_caption/).\n", "\n", "## Dependencies and imports\n", "\n", @@ -20,7 +21,7 @@ "metadata": {}, "outputs": [], "source": [ - "%pip install accelerate outlines transformers torch flash-attn outlines datasets sentencepiece" + "%pip install accelerate outlines transformers torch flash-attn datasets sentencepiece" ] }, { @@ -39,14 +40,10 @@ "import outlines\n", "import torch\n", "\n", - "from io import BytesIO\n", - "from urllib.request import urlopen\n", - "from PIL import Image\n", + "from datasets import load_dataset\n", "from outlines.models.transformers_vision import transformers_vision\n", "from transformers import AutoModelForImageTextToText, AutoProcessor\n", - "from pydantic import BaseModel, Field\n", - "from typing import List\n", - "from enum import StrEnum" + "from pydantic import BaseModel" ] }, { @@ -62,9 +59,18 @@ "cell_type": "code", "execution_count": null, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n", + "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n" + ] + } + ], "source": [ - "model_name = \"HuggingFaceTB/SmolVLM-Instruct\" # original magnet model is able to be loaded without issue\n", + "model_name = \"HuggingFaceTB/SmolVLM-Instruct\"\n", "\n", "\n", "def get_model_and_processor_class(model_name: str):\n", @@ -91,95 +97,130 @@ " model_kwargs={\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"},\n", " processor_kwargs={\"device\": device},\n", " processor_class=processor_class,\n", - ")\n", - "model" + ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now, we are going to define a function that will define how the output of our model will be structured. We will want to extract Tags for object in the image along with a string and a confidence score." + "## Structured Generation\n", + "\n", + "Now, we are going to define a function that will define how the output of our model will be structured. We will be using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), which contains a set of images along with questions and their chosen and rejected reponses. This is an okay dataset but we want to create additional text-image-to-text data on top of the images to get our own structured dataset, and potentially fine-tune our model on it. We will use the model to generate a caption, a question and a simple quality tag for the image. 
" ] }, { "cell_type": "code", - "execution_count": 93, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ - "class TagType(StrEnum):\n", - " ENTITY = \"Entity\"\n", - " RELATIONSHIP = \"Relationship\"\n", - " STYLE = \"Style\"\n", - " ATTRIBUTE = \"Attribute\"\n", - " COMPOSITION = \"Composition\"\n", - " CONTEXTUAL = \"Contextual\"\n", - " TECHNICAL = \"Technical\"\n", - " SEMANTIC = \"Semantic\"\n", - "\n", - "class ImageTag(BaseModel):\n", - " tag_name: str\n", - " tag_description: str\n", - " tag_type: TagType\n", - " confidence_score: float\n", - "\n", - "\n", "class ImageData(BaseModel):\n", - " tags_list: List[ImageTag] = Field(min_items=1)\n", - " short_caption: str\n", - "\n", + " quality: str\n", + " description: str\n", + " question: str\n", "\n", - "image_objects_generator = outlines.generate.json(model, ImageData)" + "structured_generator = outlines.generate.json(model, ImageData)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now, let's come up with an extraction prompt. We will want to extract Tags for object in the image along with a string and a confidence score and provide some guidance to the model about the different tags and structrue." + "Now, let's come up with an extraction prompt." ] }, { "cell_type": "code", - "execution_count": 96, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "prompt = \"\"\"\n", - "You are a structured image analysis assitant. Generate comprehensive tag list for an image classification system. Use at least 1 tag per type. Return the results as a valid JSON object.\n", + "You are an image analysis assisant.\n", + "\n", + "Provide a quality tag, a description and a question.\n", + "\n", + "The quality can either be \"good\", \"okay\" or \"bad\".\n", + "The question should be concise and objective.\n", + "\n", + "Return your response as a valid JSON object.\n", "\"\"\".strip()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's load our image dataset." + ] + }, { "cell_type": "code", - "execution_count": 95, + "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "ImageData(tags_list=[ImageTag(tag_name='spacecraft', tag_description='You are an EVA astronaut standing on the moon', tag_type=, confidence_score=0.9471130702150571), ImageTag(tag_name='tire track', tag_description='You think tike this used to lead your way here', tag_type=, confidence_score=1.0), ImageTag(tag_name='space helmet', tag_description='Ozone spacesuit with white metal visor', tag_type=, confidence_score=0.9737292349276361), ImageTag(tag_name='space suit', tag_description='White Astronaut', tag_type=, confidence_score=0.9749979480665247), ImageTag(tag_name='astronaut', tag_description='Astronaut', tag_type=, confidence_score=0.8412833526756263)], short_caption=\"An astronaut from space sits on the lunar surface at around 200 feet below him, over a tan lunar ground with bays leading to his original path and some rocks oncrete having a shiny armor. Both left and right have a sphere that is used for eyes and protection. Left is wearing a baseball with playing field across, and other articles, the heavy one having a shiny metal visor drum on top. 
The astronaut's grin can be seen over the helmet as he comes out with his right arm out of the sat gadget and leaves it as leaving the shining metal bars as he is from the center of the image.\")" + "Dataset({\n", + " features: ['ds_name', 'image', 'question', 'chosen', 'rejected', 'origin_dataset', 'origin_split', 'idx', 'image_path'],\n", + " num_rows: 10\n", + "})" ] }, - "execution_count": 95, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "def img_from_url(url):\n", - " img_byte_stream = BytesIO(urlopen(url).read())\n", - " return Image.open(img_byte_stream).convert(\"RGB\")\n", - "\n", - "\n", - "image_url = (\n", - " \"https://upload.wikimedia.org/wikipedia/commons/9/98/Aldrin_Apollo_11_original.jpg\"\n", - ")\n", - "image = img_from_url(image_url)\n", - "\n", - "\n", - "def extract_objects(image, prompt):\n", + "dataset = load_dataset(\"openbmb/RLAIF-V-Dataset\", split=\"train[:10]\")\n", + "dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's define a function that will extract the structured information from the image. We will format the prompt using the `apply_chat_template` method and pass it to the model along with the image after that." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "1caa96c32bc7416ea43c192c0cd88c20", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Map: 0%| | 0/10 [00:00" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The results are not perfect, but they are a good starting point to continue exploring with different models and prompts!" ] }, { @@ -216,7 +363,7 @@ "## Next Steps\n", "\n", "- Take a look at the [Outlines](https://github.com/outlines-ai/outlines) library for more information on how to use it. Explore the different methods and parameters.\n", - "- Explore extraction on your own usecase.\n", + "- Explore extraction on your own usecase with your own model.\n", "- Use a different method of extracting structured information from documents." 
] } From 9cb4d5f5d56811a73f6dc6916a46a6c00eff4571 Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Wed, 29 Jan 2025 10:00:53 +0100 Subject: [PATCH 5/7] Fix typo in notebook filename and update index - Corrected the filename from `structured_generation_vision_languag_models.ipynb` to `structured_generation_vision_language_models.ipynb` - Updated the index.md to reflect the corrected notebook title and link - Updated the _toctree.yml to use the corrected notebook filename --- notebooks/en/_toctree.yml | 2 +- notebooks/en/index.md | 3 +- ...d_generation_vision_language_models.ipynb} | 34 ++++++++++++------- 3 files changed, 24 insertions(+), 15 deletions(-) rename notebooks/en/{structured_generation_vision_languag_models.ipynb => structured_generation_vision_language_models.ipynb} (89%) diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 33181565..4b365756 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -108,7 +108,7 @@ title: Smol Multimodal RAG, Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU - local: fine_tuning_vlm_dpo_smolvlm_instruct title: Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU - - local: structured_generation_vision_languag_models + - local: structured_generation_vision_language_models title: Structured Generation from Documents Using Vision Language Models - title: Search Recipes diff --git a/notebooks/en/index.md b/notebooks/en/index.md index 0552dbbd..f7a8d475 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -7,12 +7,11 @@ applications and solving various machine learning tasks using open-source tools Check out the recently added notebooks: +- [Structured Generation from Images or Documents Using Vision Language Models](structured_generation_vision_language_models) - [Multi-Agent Order Management System with MongoDB](mongodb_smolagents_multi_micro_agents) - [Scaling Test-Time Compute for Longer Thinking in LLMs](search_and_learn) - [Signature-Aware Model Serving from MLflow with Ray Serve](mlflow_ray_serve) - [Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU](fine_tuning_vlm_dpo_smolvlm_instruct) -- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm) - You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). diff --git a/notebooks/en/structured_generation_vision_languag_models.ipynb b/notebooks/en/structured_generation_vision_language_models.ipynb similarity index 89% rename from notebooks/en/structured_generation_vision_languag_models.ipynb rename to notebooks/en/structured_generation_vision_language_models.ipynb index 18a95213..805ca0ab 100644 --- a/notebooks/en/structured_generation_vision_languag_models.ipynb +++ b/notebooks/en/structured_generation_vision_language_models.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Structured Generation from Documents Using Vision Language Models\n", + "# Structured Generation from Images or Documents Using Vision Language Models\n", "\n", "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents We will run the VLM using the HuggingFace Transformers library and the [Outlines library](https://github.com/dottxt-ai/outlines), which facilitates structured generation based on limiting token sampling probabilities. 
\n", "\n", @@ -132,7 +132,7 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 19, "metadata": {}, "outputs": [], "source": [ @@ -188,13 +188,23 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 20, "metadata": {}, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/davidberenstein/Documents/programming/huggingface/cookbook/.venv/lib/python3.11/site-packages/dill/_dill.py:414: PicklingWarning: Cannot locate reference to .\n", + " StockPickler.save(self, obj, save_persistent_id)\n", + "/Users/davidberenstein/Documents/programming/huggingface/cookbook/.venv/lib/python3.11/site-packages/dill/_dill.py:414: PicklingWarning: Cannot pickle : __main__.ImageData has recursive self-references that trigger a RecursionError.\n", + " StockPickler.save(self, obj, save_persistent_id)\n" + ] + }, { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "1caa96c32bc7416ea43c192c0cd88c20", + "model_id": "e1d431b922334b0297195415a11cf68a", "version_major": 2, "version_minor": 0 }, @@ -214,7 +224,7 @@ "})" ] }, - "execution_count": 17, + "execution_count": 20, "metadata": {}, "output_type": "execute_result" } @@ -252,13 +262,13 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "843b9c88cab54402812f1b936a2dc6e0", + "model_id": "ab88b1b3bb1441498788bdc2c2b4cf30", "version_major": 2, "version_minor": 0 }, @@ -272,7 +282,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "57e5dea4ae504866b2d93863bcfa4408", + "model_id": "e5e359d02ede43959e92a9e5626f9ffd", "version_major": 2, "version_minor": 0 }, @@ -286,7 +296,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "b811febb7c044100bb74bf67016f0d0d", + "model_id": "9f7f07dad09f47c5a8dfdeba403845f6", "version_major": 2, "version_minor": 0 }, @@ -300,7 +310,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "1fa44296ea00459b8cbb22e56739117c", + "model_id": "e47600c765b64b55aa6f93e9cf5d077e", "version_major": 2, "version_minor": 0 }, @@ -314,10 +324,10 @@ { "data": { "text/plain": [ - "CommitInfo(commit_url='https://huggingface.co/datasets/davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset/commit/f72002df2d9aef403afeaf6e27f4407ddd82c89c', commit_message='Upload dataset', commit_description='', oid='f72002df2d9aef403afeaf6e27f4407ddd82c89c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset'), pr_revision=None, pr_num=None)" + "CommitInfo(commit_url='https://huggingface.co/datasets/davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset/commit/373d6a25e8301077773fc6a37899b1598cf6f8cd', commit_message='Upload dataset', commit_description='', oid='373d6a25e8301077773fc6a37899b1598cf6f8cd', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset'), pr_revision=None, 
pr_num=None)" ] }, - "execution_count": 18, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } From 0f5be22182d2cc06c1e26f0078976511eea335cc Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Thu, 30 Jan 2025 18:01:02 +0100 Subject: [PATCH 6/7] docs: Minor text correction in VLM structured generation notebook - Fixed a small punctuation error in the introduction paragraph - Corrected "HuggingFaceTB" to "Hugging Face" --- notebooks/en/structured_generation_vision_language_models.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/en/structured_generation_vision_language_models.ipynb b/notebooks/en/structured_generation_vision_language_models.ipynb index 805ca0ab..1548ba9a 100644 --- a/notebooks/en/structured_generation_vision_language_models.ipynb +++ b/notebooks/en/structured_generation_vision_language_models.ipynb @@ -6,7 +6,7 @@ "source": [ "# Structured Generation from Images or Documents Using Vision Language Models\n", "\n", - "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents We will run the VLM using the HuggingFace Transformers library and the [Outlines library](https://github.com/dottxt-ai/outlines), which facilitates structured generation based on limiting token sampling probabilities. \n", + "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents. We will run the VLM using the Hugging Face Transformers library and the [Outlines library](https://github.com/dottxt-ai/outlines), which facilitates structured generation based on limiting token sampling probabilities. \n", "\n", "> This approach is based on a [Outlines tutorial](https://dottxt-ai.github.io/outlines/latest/cookbook/atomic_caption/).\n", "\n", From 0f3d86bb0c3f74eb539d3043fd3e22bf6941a43b Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Fri, 31 Jan 2025 09:24:25 +0100 Subject: [PATCH 7/7] docs: Update VLM structured generation notebook title - Expanded notebook title to clarify generation from both images and documents - Minor enhancement to improve clarity of the notebook's scope --- notebooks/en/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 545878b8..c637c5bd 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -109,7 +109,7 @@ - local: fine_tuning_vlm_dpo_smolvlm_instruct title: Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU - local: structured_generation_vision_language_models - title: Structured Generation from Documents Using Vision Language Models + title: Structured Generation from Images or Documents Using Vision Language Models - title: Search Recipes isExpanded: false
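As a small follow-up, here is a hedged sketch of how the augmented dataset could be pulled back from the Hub and filtered on the generated quality tag before any fine-tuning. The repo id is the one shown in the `push_to_hub` output above, and the sketch assumes the map step stored the generated fields as `quality`, `description` and `question` columns.

```python
from datasets import load_dataset

# Repo id taken from the push_to_hub output above; swap in your own namespace.
repo_id = "davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset"
dataset = load_dataset(repo_id, split="train")

# Keep only the rows the model itself tagged as "good" before fine-tuning on them.
good_examples = dataset.filter(lambda row: row["quality"] == "good")
print(good_examples)
```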