diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 41670d73..6cc8ce0c 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -110,6 +110,8 @@ title: Smol Multimodal RAG, Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU - local: fine_tuning_vlm_dpo_smolvlm_instruct title: Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU + - local: structured_generation_vision_language_models + title: Structured Generation from Images or Documents Using Vision Language Models - title: Search Recipes isExpanded: false diff --git a/notebooks/en/index.md b/notebooks/en/index.md index c5f9a624..96a206b5 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -8,11 +8,10 @@ applications and solving various machine learning tasks using open-source tools Check out the recently added notebooks: - [Post-training an LLM using GRPO with TRL](fine_tuning_llm_grpo_trl) +- [Structured Generation from Images or Documents Using Vision Language Models](structured_generation_vision_language_models) - [Vector Search on Hugging Face with the Hub as Backend](vector_search_with_hub_as_backend) - [Multi-Agent Order Management System with MongoDB](mongodb_smolagents_multi_micro_agents) - [Scaling Test-Time Compute for Longer Thinking in LLMs](search_and_learn) -- [Signature-Aware Model Serving from MLflow with Ray Serve](mlflow_ray_serve) - You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). diff --git a/notebooks/en/structured_generation_vision_language_models.ipynb b/notebooks/en/structured_generation_vision_language_models.ipynb new file mode 100644 index 00000000..1548ba9a --- /dev/null +++ b/notebooks/en/structured_generation_vision_language_models.ipynb @@ -0,0 +1,402 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Structured Generation from Images or Documents Using Vision Language Models\n", + "\n", + "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents. We will run the VLM using the Hugging Face Transformers library and the [Outlines library](https://github.com/dottxt-ai/outlines), which facilitates structured generation by constraining token sampling probabilities. \n", + "\n", + "> This approach is based on an [Outlines tutorial](https://dottxt-ai.github.io/outlines/latest/cookbook/atomic_caption/).\n", + "\n", + "## Dependencies and imports\n", + "\n", + "First, let's install the necessary libraries." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install accelerate outlines transformers torch flash-attn datasets sentencepiece" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's continue with importing the necessary libraries." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import outlines\n", + "import torch\n", + "\n", + "from datasets import load_dataset\n", + "from outlines.models.transformers_vision import transformers_vision\n", + "from transformers import AutoModelForImageTextToText, AutoProcessor\n", + "from pydantic import BaseModel" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialising our model\n", + "\n", + "We will start by initialising our model from [HuggingFaceTB/SmolVLM-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct). 
Outlines expects us to pass in a model class and processor class, so we will make this example a bit more generic by creating a function that returns those. Alternatively, you could look at the model and tokenizer config within the [Hub repo files](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/tree/main), and import those classes directly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n", + "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n" + ] + } + ], + "source": [ + "model_name = \"HuggingFaceTB/SmolVLM-Instruct\"\n", + "\n", + "\n", + "def get_model_and_processor_class(model_name: str):\n", + "    model = AutoModelForImageTextToText.from_pretrained(model_name)\n", + "    processor = AutoProcessor.from_pretrained(model_name)\n", + "    classes = model.__class__, processor.__class__\n", + "    del model, processor\n", + "    return classes\n", + "\n", + "\n", + "model_class, processor_class = get_model_and_processor_class(model_name)\n", + "\n", + "if torch.cuda.is_available():\n", + "    device = \"cuda\"\n", + "elif torch.backends.mps.is_available():\n", + "    device = \"mps\"\n", + "else:\n", + "    device = \"cpu\"\n", + "\n", + "model = transformers_vision(\n", + "    model_name,\n", + "    model_class=model_class,\n", + "    device=device,\n", + "    model_kwargs={\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"},\n", + "    processor_kwargs={\"device\": device},\n", + "    processor_class=processor_class,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Structured Generation\n", + "\n", + "Now, we are going to define how the output of our model should be structured. We will be using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), which contains a set of images along with questions and their chosen and rejected responses. This is an okay dataset, but we want to create additional image-text-to-text data on top of the images to get our own structured dataset, and potentially fine-tune our model on it. We will use the model to generate a caption, a question and a simple quality tag for the image. " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "class ImageData(BaseModel):\n", + "    quality: str\n", + "    description: str\n", + "    question: str\n", + "\n", + "structured_generator = outlines.generate.json(model, ImageData)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's come up with an extraction prompt." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "prompt = \"\"\"\n", + "You are an image analysis assistant.\n", + "\n", + "Provide a quality tag, a description and a question.\n", + "\n", + "The quality can either be \"good\", \"okay\" or \"bad\".\n", + "The question should be concise and objective.\n", + "\n", + "Return your response as a valid JSON object.\n", + "\"\"\".strip()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's load our image dataset.\n",
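+    "\n",
+    "Before mapping over a whole dataset, it can be useful to sanity-check the structured generator on a single image. The snippet below is a minimal sketch rather than part of the original pipeline: it assumes `image` is any `PIL.Image` you already have, and that the Outlines model object exposes the underlying processor as `model.processor`.\n",
+    "\n",
+    "```python\n",
+    "# Minimal smoke test for the structured generator (assumes `image` is a PIL.Image).\n",
+    "messages = [\n",
+    "    {\"role\": \"user\", \"content\": [{\"type\": \"image\"}, {\"type\": \"text\", \"text\": prompt}]},\n",
+    "]\n",
+    "\n",
+    "# Format the prompt with the processor's chat template, then pass it to the generator together with the image.\n",
+    "formatted_prompt = model.processor.apply_chat_template(messages, add_generation_prompt=True)\n",
+    "result = structured_generator(formatted_prompt, [image])\n",
+    "print(result)  # an ImageData instance with quality, description and question fields\n",
+    "```"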
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + "    features: ['ds_name', 'image', 'question', 'chosen', 'rejected', 'origin_dataset', 'origin_split', 'idx', 'image_path'],\n", + "    num_rows: 10\n", + "})" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset = load_dataset(\"openbmb/RLAIF-V-Dataset\", split=\"train[:10]\")\n", + "dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's define a function that will extract the structured information from the image. We will format the prompt using the `apply_chat_template` method and then pass it to the model along with the image." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/davidberenstein/Documents/programming/huggingface/cookbook/.venv/lib/python3.11/site-packages/dill/_dill.py:414: PicklingWarning: Cannot locate reference to .\n", + " StockPickler.save(self, obj, save_persistent_id)\n", + "/Users/davidberenstein/Documents/programming/huggingface/cookbook/.venv/lib/python3.11/site-packages/dill/_dill.py:414: PicklingWarning: Cannot pickle : __main__.ImageData has recursive self-references that trigger a RecursionError.\n", + " StockPickler.save(self, obj, save_persistent_id)\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "e1d431b922334b0297195415a11cf68a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Map: 0%| | 0/10 [00:00" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The results are not perfect, but they are a good starting point to continue exploring with different models and prompts!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "We've seen how to extract structured information from images using a vision language model. We can apply the same extractive approach to documents, using something like `pdf2image` to convert each page of a PDF to an image and running information extraction on each page image.\n", + "\n", + "```python\n", + "from pdf2image import convert_from_path\n", + "\n", + "pdf_path = \"path/to/your/pdf/file.pdf\"\n", + "pages = convert_from_path(pdf_path)\n", + "extracted = [extract_objects(page, prompt) for page in pages]\n", + "```\n", + "\n", + "## Next Steps\n", + "\n", + "- Take a look at the [Outlines](https://github.com/dottxt-ai/outlines) library for more information on how to use it, and explore its different generation methods and parameters.\n", + "- Explore extraction on your own use case with your own model.\n", + "- Use a different method of extracting structured information from documents." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}