
llm extraction
fscelliott committed Feb 28, 2025
1 parent 6b5169a commit 9ddb747
Showing 8 changed files with 72 additions and 26 deletions.
9 changes: 9 additions & 0 deletions readme-sync/assets/v0/diagrams_mermaid/llm_extraction.txt
@@ -0,0 +1,9 @@
graph TD;
style A fill:#fafaf8,stroke:#000,stroke-width:1px;
style B fill:#fafaf8,stroke:#000,stroke-width:1px;
style C fill:#fafaf8,stroke:#000,stroke-width:1px;
style D fill:#fafaf8,stroke:#000,stroke-width:1px;

A["'What's the largest checking transaction?'"] -->|source_ids| B["Look in an extracted field"]
A -->|searchBySummarization| C["Locate context using page summaries"]
A -->|default| D["Locate context using page chunks"]
7 changes: 7 additions & 0 deletions readme-sync/assets/v0/diagrams_mermaid/readme.txt
@@ -0,0 +1,7 @@
- I create these in https://mermaid.live/
- theme: corporate

fill color for boxes: #fafaf8 (Sensible's light gray branding color)

my screen's at 100% zoom
@@ -5,7 +5,7 @@ hidden: false

Extract free text from unstructured documents using large language model (LLM)-based SenseML methods. For example, extract information from legal paragraphs in contracts and leases, or results from research papers.

- The following LLM-based methods are alternatives to [layout-based methods](doc:layout-based-methods) for structured documents, for example, tax documents or insurance forms.
+ The following LLM-based methods are alternatives to [layout-based methods](doc:layout-based-methods).

| Method | Example use case | Chained-prompt example<sup>1</sup> | Notes |
| ------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
@@ -1,32 +1,19 @@
---
hidden: true
- title: "Configure/troubleshoot LLMs"
+ title: "How LLM-based extraction works"
---

- see also: troubleshoot-llms

- prompt tips stuff in each LLM topic
- prompt tips in /prompt
- other stuff?
- BLOG post on chaining prompts?



TODOs: search by summarization is NOT global; NLP tables don't support it.

- ## HOW IT WORKS
+ ## How LLM-based extraction works

The following is an overview of how Sensible's LLM-based methods work. Use this overview to understand your configuration and troubleshooting options.

- ### Overview
+ ### Overview: understanding prompt *context*

Sensible supports LLM-based methods for extracting data from documents. For example, for an insurance declaration document, you can submit the prompt `when does the insurance coverage start?`, and the LLM returns `08-14-24`.

- LLMs' input token limits are important constraints in this scenario. Because of these limits, Sensible must generally submit an excerpt of the document rather than the whole document to the LLM. This relevant excerpt is called *context*. For example, for the prompt `when does the insurance coverage start?`, the abbreviated context can be something like:
+ LLMs' input token limits are important constraints for document data extraction. Because of these limits, Sensible must generally submit an excerpt of the document rather than the whole document to the LLM. This relevant excerpt is called *context*.
+ 
+ For example, for the prompt `when does the insurance coverage start?`, the context can look like:

````txt
Tel: 1-800-851-2000 Declarations Page
@@ -42,7 +29,7 @@ Note that context doesn't have to be limited to contiguous pages in the document

### Example: Full prompt with context

- See the following image for an example of a *full prompt* that Sensible inputs to an LLM for the [Query Group](doc:query-group) method using the default embeddings scoring approach (TODO link to that section). When you write a prompt using an LLM-based method, Sensible creates a full prompt using the following:
+ See the following image for an example of a *full prompt* that Sensible inputs to an LLM for the [Query Group](doc:query-group) method. When you write a prompt using an LLM-based method, Sensible by default (using the embeddings scoring approach; TODO link to that section) creates a full prompt using the following:

- A prompt introduction
- "Context", made up of chunks excerpted from the document and of page metadata.
@@ -59,10 +46,12 @@ See the following image for an example of a *full prompt* that Sensible inputs to an LLM

Sensible provides configuration options for ensuring correct and complete contexts. For more information, see the following section.
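As a rough, hypothetical sketch of how the full-prompt parts listed above can fit together (the template, function, and field names here are illustrative only, not Sensible's actual prompt format):

```python
def build_full_prompt(introduction, chunks, question):
    # Assemble a full prompt: a prompt introduction, then context chunks
    # tagged with page metadata, then the user's question.
    context = "\n\n".join(f"[page {c['page']}]\n{c['text']}" for c in chunks)
    return f"{introduction}\n\n==Context==\n{context}\n\n==Question==\n{question}"

prompt = build_full_prompt(
    "Answer using only the excerpts below.",
    [{"page": 1, "text": "Tel: 1-800-851-2000 Declarations Page"},
     {"page": 2, "text": "Coverage effective date: 08-14-24"}],
    "when does the insurance coverage start?",
)
print(prompt)
```

The key design point is that only the selected chunks, not the whole document, travel to the LLM, which is what keeps the full prompt under the model's input token limit.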

- // TODO: add a mermaid chart here??? https://docs.readme.com/main/docs/creating-mermaid-diagrams
+ // TODO: add a mermaid chart here for the different options??? https://docs.readme.com/main/docs/creating-mermaid-diagrams

## Options for locating context

![Click to enlarge](https://raw.githubusercontent.com/sensible-hq/sensible-docs/main/readme-sync/assets/v0/images/final/llm_extraction.png)
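By default, context is located by scoring document chunks against the prompt. As a toy sketch of how that kind of scoring can work (illustrative only, not Sensible's implementation; the bag-of-words `embed` stands in for a real embedding model):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system uses a learned embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(prompt, chunks, k=2):
    # Score each document chunk against the prompt; keep the best k as context.
    q = embed(prompt)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Tel: 1-800-851-2000 Declarations Page",
    "Coverage effective date: 08-14-24",
    "Mailing address: PO Box 123",
]
print(top_chunks("when does the insurance coverage start?", chunks, k=1))
```

Because only the top-scoring chunks become context, a prompt that shares no vocabulary (or, with real embeddings, no semantics) with the relevant passage can miss it, which is why the options below for steering context location matter.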

#### (Most deterministic) Use other extracted fields as context

You can prompt an LLM to answer questions about other [fields](doc:field-query-object)' extracted data. In this case, the context is predetermined: it's the output from the other fields. For example, say you use the Text Table method to extract the following data in a `snacks_rank` table:
@@ -0,0 +1,43 @@
---
hidden: true
title: "LLM features overview"
---

- *see also: troubleshoot-llms*

- *prompt tips stuff in each LLM topic*
- *prompt tips in /prompt*
- *other stuff?*
- *BLOG post on chaining prompts?*

*TODOs: search by summarization is NOT global; NLP tables don't support it. Needs fixing for Query Group, but List is already fixed.*

Sensible supports large language model (LLM)-based document automation workflows. In particular, Sensible supports:

- Document data extraction, within the document types you define in your Sensible account. For more information, see LLM-based methods TODO link.
- Classification of documents by the document types you define in your Sensible account. For more information, see Classifying documents by type TODO link.

#### Document extraction features

Here are some of the things you can do with document data extraction:

- Extract document primitives like tables, lists, and short facts. TODO: short little image of a doc showing these things?? TODO add those same images to the LLM-based methods index page?
- Extract multimodal data (TODO copy from \author)
- Chained prompts (TODO)
- Search by summarization for TODO WHY?
- Configurable LLM engine: choose from Anthropic or OpenAI. (that's it right?) for TODO WHY?
- Confidence signals for qualifying LLM accuracy (TODO link)
- Advanced configuration for things like:
  - context completeness
  - context location
  - troubleshooting prompts

#### Document classification features

Here are some of the things you can do with document classification:

- Classify any document that belongs to one of the types you've defined in your account. You can configure LLM classification with a document type description, or leave it as-is.
- Segment a portfolio document into multiple document types, complete with page ranges, and then treat each document in the portfolio separately.
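The portfolio-segmentation idea above can be sketched as follows (a hypothetical helper, assuming you already have a per-page type label from an LLM classifier; the type names are made up):

```python
def segment_portfolio(page_types):
    # Group consecutive pages classified as the same document type into
    # (type, start_page, end_page) ranges (1-indexed, inclusive).
    segments = []
    for i, doc_type in enumerate(page_types, start=1):
        if segments and segments[-1][0] == doc_type:
            # Same type as the previous page: extend the current range.
            segments[-1] = (doc_type, segments[-1][1], i)
        else:
            # Type changed: start a new document segment.
            segments.append((doc_type, i, i))
    return segments

# e.g. per-page classifications for a 5-page portfolio
print(segment_portfolio(["acord_125", "acord_125", "loss_run", "loss_run", "loss_run"]))
# → [('acord_125', 1, 2), ('loss_run', 3, 5)]
```

Each resulting page range can then be extracted separately with the SenseML config for its document type.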



2 changes: 0 additions & 2 deletions readme-sync/v0/welcome/1000 - overview.md
@@ -3,8 +3,6 @@ title: "Overview"
hidden: false
---



Welcome! Sensible is a developer-first platform for extracting structured data from documents, for example, business forms in PDF format. Use Sensible to build document-automation features into your vertical SaaS products.

With Sensible's SenseML language, you can write extraction queries for any type of document:
4 changes: 2 additions & 2 deletions readme-sync/v0/welcome/6000 - author.md
@@ -25,9 +25,9 @@ See the following table for an overview of the pros and cons of LLMs versus layout
| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Technical expertise required | For nontechnical users. Describe what you want to extract in a prompt to an LLM. For example, "the policy period" or "total amount invoiced". | Offers powerful extraction configuration for technical users based on spatial layout. For example, grab the text in a rectangular region relative to the word "Addendums" |
| Workflow automation | Suited to workflows that include [human review](doc:human-review) or that are fault-tolerant. | Suited to automated workflows that require predictable results and validation. |
- | Document variability | Suited to documents that are unstructured or that have a large number of layout variations or revisions. | Suited to structured documents with a finite number of variations, where you know the layout of the document in advance. |
+ | [Document variability](doc:document-variations) | Suited to documents that are unstructured or that have a large number of layout variations or revisions. | Suited to structured documents with a finite number of variations, where you know the layout of the document in advance. |
| Deterministic | No | Yes. Find the information in the document using anchoring text and layout data. |
| Handles repeating layouts | Use [List](doc:list) method. | Use [sections](doc:sections) for highly complex repeating substructures, for example, [loss runs](doc:sections). |
| Handles non-text images (photos, illustrations, charts, etc.) | To extract data about images (`"is the building in this picture multistory?"`), use the [Query Group](doc:query-group) method with the Multimodal Engine parameter configured. | No |
| Performance | Data extraction takes a few seconds for each LLM-based method. | Offers faster performance in general. For more information, see [Optimizing extraction performance](doc:performance). |
