
llm extraction
fscelliott committed Feb 28, 2025
1 parent 6b5169a commit 9ddb747
Showing 8 changed files with 72 additions and 26 deletions.
9 changes: 9 additions & 0 deletions readme-sync/assets/v0/diagrams_mermaid/llm_extraction.txt
@@ -0,0 +1,9 @@
graph TD;
style A fill:#fafaf8,stroke:#000,stroke-width:1px;
style B fill:#fafaf8,stroke:#000,stroke-width:1px;
style C fill:#fafaf8,stroke:#000,stroke-width:1px;
style D fill:#fafaf8,stroke:#000,stroke-width:1px;

A["'What's the largest checking transaction?'"] -->|source_ids| B["Look in an extracted field"]
A -->|searchBySummarization| C["Locate context using page summaries"]
A -->|default| D["Locate context using page chunks"]
7 changes: 7 additions & 0 deletions readme-sync/assets/v0/diagrams_mermaid/readme.txt
@@ -0,0 +1,7 @@
- I create these in https://mermaid.live/
- theme: corporate

fill color for boxes: #fafaf8 (Sensible's light gray branding color)

my screen's at 100% zoom
@@ -5,7 +5,7 @@ hidden: false

Extract free text from unstructured documents using large language model (LLM)-based SenseML methods. For example, extract information from legal paragraphs in contracts and leases, or results from research papers.

- The following LLM-based methods are alternatives to [layout-based methods](doc:layout-based-methods) for structured documents, for example, tax documents or insurance forms.
+ The following LLM-based methods are alternatives to [layout-based methods](doc:layout-based-methods).

| Method | Example use case | Chained-prompt example<sup>1</sup> | Notes |
| ------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
@@ -1,32 +1,19 @@
---
hidden: true
- title: "Configure/troubleshoot LLMs"
+ title: "How LLM-based extraction works"
---

- see also: troubleshoot-llms

- prompt tips stuff in each LLM topic
- prompt tips in /prompt
- other stuff?
- BLOG post on chaining prompts?



TODOs: search by summarization is NOT global; NLP tables don't support it.

- ## HOW IT WORKS
+ ## How LLM-based extraction works

The following is an overview of how Sensible's LLM-based methods work. Use this overview to understand your configuration and troubleshooting options.

- ### Overview
+ ### Overview: understanding prompt *context*

Sensible supports LLM-based methods for extracting data from documents. For example, for an insurance declaration document, you can submit the prompt `when does the insurance coverage start?`, and the LLM returns `08-14-24`.

- LLMs' input token limits are important constraints in this scenario. Because of these limits, Sensible must generally submit an excerpt of the document rather than the whole document to the LLM. This relevant excerpt is called *context*. For example, for the prompt `when does the insurance coverage start?`, the abbreviated context can be something like:
+ LLMs' input token limits are important constraints for document data extraction. Because of these limits, Sensible must generally submit an excerpt of the document rather than the whole document to the LLM. This relevant excerpt is called *context*.
+ 
+ For example, for the prompt `when does the insurance coverage start?`, the context can look like:

````txt
Tel: 1-800-851-2000 Declarations Page
@@ -42,7 +29,7 @@ Note that context doesn't have to be limited to contiguous pages in the document

### Example: Full prompt with context

- See the following image for an example of a *full prompt* that Sensible inputs to an LLM for the [Query Group](doc:query-group) method using the default embeddings scoring approach (TODO link to that section). When you write a prompt using an LLM-based method, Sensible creates a full prompt using the following:
+ See the following image for an example of a *full prompt* that Sensible inputs to an LLM for the [Query Group](doc:query-group) method. When you write a prompt using an LLM-based method, Sensible by default (using the embeddings scoring approach; TODO link to that section) creates a full prompt using the following:

- A prompt introduction
- "Context", made up of chunks excerpted from the document and of page metadata.
@@ -59,10 +46,12 @@ See the following image for an example of a *full prompt* that Sensible inputs to an LLM

Sensible provides configuration options for ensuring correct and complete contexts. For more information, see the following section.
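As a rough, hypothetical sketch of how the full-prompt parts listed above can fit together (the template, function, and field names here are illustrative only, not Sensible's actual prompt format):

```python
def build_full_prompt(introduction, chunks, question):
    # Assemble a full prompt: a prompt introduction, then context chunks
    # tagged with page metadata, then the user's question.
    context = "\n\n".join(f"[page {c['page']}]\n{c['text']}" for c in chunks)
    return f"{introduction}\n\n==Context==\n{context}\n\n==Question==\n{question}"

prompt = build_full_prompt(
    "Answer using only the excerpts below.",
    [{"page": 1, "text": "Tel: 1-800-851-2000 Declarations Page"},
     {"page": 2, "text": "Coverage effective date: 08-14-24"}],
    "when does the insurance coverage start?",
)
print(prompt)
```

The key design point is that only the selected chunks, not the whole document, travel to the LLM, which is what keeps the full prompt under the model's input token limit.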

- // TODO: add a mermaid chart here??? https://docs.readme.com/main/docs/creating-mermaid-diagrams
+ // TODO: add a mermaid chart here for the different options??? https://docs.readme.com/main/docs/creating-mermaid-diagrams

## Options for locating context

![Click to enlarge](https://raw.githubusercontent.com/sensible-hq/sensible-docs/main/readme-sync/assets/v0/images/final/llm_extraction.png)
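By default, context is located by scoring document chunks against the prompt. As a toy sketch of how that kind of scoring can work (illustrative only, not Sensible's implementation; the bag-of-words `embed` stands in for a real embedding model):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system uses a learned embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(prompt, chunks, k=2):
    # Score each document chunk against the prompt; keep the best k as context.
    q = embed(prompt)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Tel: 1-800-851-2000 Declarations Page",
    "Coverage effective date: 08-14-24",
    "Mailing address: PO Box 123",
]
print(top_chunks("when does the insurance coverage start?", chunks, k=1))
```

Because only the top-scoring chunks become context, a prompt that shares no vocabulary (or, with real embeddings, no semantics) with the relevant passage can miss it, which is why the options below for steering context location matter.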

#### (Most deterministic) Use other extracted fields as context

You can prompt an LLM to answer questions about other [fields](doc:field-query-object)' extracted data. In this case, the context is predetermined: it's the output from the other fields. For example, say you use the Text Table method to extract the following data in a `snacks_rank` table:
@@ -0,0 +1,43 @@
---
hidden: true
title: "LLM features overview"
---

- *see also: troubleshoot-llms*

- *prompt tips stuff in each LLM topic*
- *prompt tips in /prompt*
- *other stuff?*
- *BLOG post on chaining prompts?*

*TODOs: search by summarization is NOT global; NLP tables don't support it. Needs fixing for Query Group, but List is already fixed.*

Sensible supports large language model (LLM)-based document automation workflows. In particular, Sensible supports:

- Document data extraction, within the document types you define in your Sensible account. For more information, see LLM-based methods TODO link.
- Classification of documents by the document types you define in your Sensible account. For more information, see Classifying documents by type TODO link.

#### Document extraction features

Here are some of the things you can do with document data extraction:

- Extract document primitives like tables, lists, and short facts. TODO: short little image of a doc showing these things?? TODO add those same images to the LLM-based methods index page?
- Extract multimodal data (TODO copy from \author)
- Chained prompts (TODO)
- Search by summarization for TODO WHY?
- Configurable LLM engine: choose from Anthropic or OpenAI. (that's it right?) for TODO WHY?
- Confidence signals for qualifying LLM accuracy (TODO link)
- Advanced configuration for things like:
  - context completeness
  - context location
  - troubleshooting prompts

#### Document classification features

Here are some of the things you can do with document classification:

- Classify any document that belongs to one of the types you've defined in your account. You can configure LLM classification with a document type description, or leave it as-is.
- Segment a portfolio document into multiple document types, complete with page ranges, and then treat each document in the portfolio separately.
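The portfolio-segmentation idea above can be sketched as follows (a hypothetical helper, assuming you already have a per-page type label from an LLM classifier; the type names are made up):

```python
def segment_portfolio(page_types):
    # Group consecutive pages classified as the same document type into
    # (type, start_page, end_page) ranges (1-indexed, inclusive).
    segments = []
    for i, doc_type in enumerate(page_types, start=1):
        if segments and segments[-1][0] == doc_type:
            # Same type as the previous page: extend the current range.
            segments[-1] = (doc_type, segments[-1][1], i)
        else:
            # Type changed: start a new document segment.
            segments.append((doc_type, i, i))
    return segments

# e.g. per-page classifications for a 5-page portfolio
print(segment_portfolio(["acord_125", "acord_125", "loss_run", "loss_run", "loss_run"]))
# → [('acord_125', 1, 2), ('loss_run', 3, 5)]
```

Each resulting page range can then be extracted separately with the SenseML config for its document type.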



2 changes: 0 additions & 2 deletions readme-sync/v0/welcome/1000 - overview.md
@@ -3,8 +3,6 @@ title: "Overview"
hidden: false
---



Welcome! Sensible is a developer-first platform for extracting structured data from documents, for example, business forms in PDF format. Use Sensible to build document-automation features into your vertical SaaS products.

With Sensible's SenseML language, you can write extraction queries for any type of document:
4 changes: 2 additions & 2 deletions readme-sync/v0/welcome/6000 - author.md
@@ -25,9 +25,9 @@ See the following table for an overview of the pros and cons of LLMs versus layout
| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Technical expertise required | For nontechnical users. Describe what you want to extract in a prompt to an LLM. For example, "the policy period" or "total amount invoiced". | Offers powerful extraction configuration for technical users based on spatial layout. For example, grab the text in a rectangular region relative to the word "Addendums" |
| Workflow automation | Suited to workflows that include [human review](doc:human-review) or that are fault-tolerant. | Suited to automated workflows that require predictable results and validation. |
- | Document variability | Suited to documents that are unstructured or that have a large number of layout variations or revisions. | Suited to structured documents with a finite number of variations, where you know the layout of the document in advance. |
+ | [Document variability](doc:document-variations) | Suited to documents that are unstructured or that have a large number of layout variations or revisions. | Suited to structured documents with a finite number of variations, where you know the layout of the document in advance. |
| Deterministic | No | Yes. Find the information in the document using anchoring text and layout data. |
| Handles repeating layouts | Use [List](doc:list) method. | Use [sections](doc:sections) for highly complex repeating substructures, for example, [loss runs](doc:sections). |
| Handles non-text images (photos, illustrations, charts, etc.) | To extract data about images (`"is the building in this picture multistory?"`), use the [Query Group](doc:query-group) method with the Multimodal Engine parameter configured. | No |
| Performance | Data extraction takes a few seconds for each LLM-based method. | Offers faster performance in general. For more information, see [Optimizing extraction performance](doc:performance). |
