portfolio fingerprints
fscelliott committed Feb 15, 2024
1 parent d3a0235 commit ad9f664
Showing 3 changed files with 58 additions and 9 deletions.
@@ -24,7 +24,12 @@ Standalone documents
Parameters
---

A fingerprint consists of an array of tests, where each test is a string, a Match object, or array of Match objects. For more information, see [Match object](doc:match).
A fingerprint consists of an array of tests, where each test is a string, a Match object, or array of Match objects. For more information, see [Match object](doc:match):

```json
```



Behind the scenes, Sensible automatically expands this simple syntax to syntax for portfolio fingerprints using `"page" : "any"`.

@@ -58,13 +63,50 @@ Portfolios
Parameters
---

A fingerprint consists of an array of tests. The following table shows parameters for each test:
A fingerprint consists of an array of tests, where each test contains a Page parameter and a Match parameter:

```json
"fingerprint": {
  "tests": [
    {
      "page": "first",
      "match": [
        {
          "text": "this text always shows up on the first page of the document",
          "type": "startsWith"
        }
      ]
    },
    {
      "page": "last",
      "match": [
        {
          "text": "this text always shows up on the last page of the document",
          "type": "includes"
        }
      ]
    }
  ]
}
```

The following table shows parameters for each test:

| key | value | description |
| -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| match (**required**) | a string, a [Match object](doc:match), or array of Match objects. | Specifies the text to match for the test. |
| offset | integer | Specifies where to start or end the document segment, offset in pages relative to the first or last page defined by the Match parameter. For example, if you specify that the page that contains the phrase "A summary of your rights" is the first page of a segment, and Sensible finds a match for the first page on the zero-indexed page 3 of a portfolio:<br/>- specifying `"offset": -1` starts the document segment on page 2 of the portfolio.<br/>- specifying `"offset": 1` starts the document segment on page 4 of the portfolio. |
| page | `first`, `last`, `every`, `any` | For portfolios (multiple documents combined into one file, such as an invoice, a contract, and a tax form), tests for document starts and ends to segment the portfolio into documents. <br/>- Sensible discards orphaned `last` matches. In other words, if you specify `last`, then Sensible must find at least one other fingerprint of a different page type preceding the `last` match in order to recognize the document. For more information see [Multi-document extraction](doc:portfolio). <br/>- If you reuse the same config between portfolios and standalone documents, then for standalone document extractions, Sensible ignores the configured value of this parameter and treats it as `"page" : "any"`. This way, Sensible avoids strictly matching to extraneous front or back matter (for example, a fax cover page) in single documents. |
| page | `first`, `last`, `every`, `any` | A portfolio contains multiple documents combined into one file, such as an invoice, a contract, and a tax form. Sensible uses fingerprints to segment a portfolio into documents. Configure this parameter with one of the following values:<br/>`first` - The first page of a document must meet the match criteria. <br/>`last` - The last page of a document must meet the match criteria. If you specify `last`, then Sensible must find at least one other fingerprint of a different page type preceding the `last` match in order to segment the document. <br/>`every` - Every page in the document must meet the match criteria. Sensible segments the document by searching for consecutive pages that each meet the criteria. <br/>`any` - Any page in the document can meet the criteria. If you define a match array in this test, each match must be present on the same page.<br/>**Notes:** For an example, see [Multi-document extraction](doc:portfolio). <br/>If you reuse the same config between portfolios and standalone documents, then for standalone document extractions, Sensible ignores the configured value of this parameter and treats it as `"page" : "any"`. This way, Sensible avoids matching to extraneous front or back matter (for example, a fax cover page) in single-file documents. |
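
The Offset arithmetic described in the table can be sketched as follows. This is a minimal illustration rather than Sensible's implementation: the function name `segment_boundary_page` is hypothetical, and pages are zero-indexed as in the Offset example.

```python
def segment_boundary_page(match_page: int, offset: int = 0) -> int:
    """Return the zero-indexed page where a document segment starts or ends.

    match_page: zero-indexed page on which the Match parameter was found.
    offset: value of the Offset parameter, in pages (may be negative).
    """
    return match_page + offset


# The match for the first page is found on zero-indexed page 3 of the portfolio.
print(segment_boundary_page(3, offset=-1))  # 2: the segment starts one page earlier
print(segment_boundary_page(3, offset=1))   # 4: the segment starts one page later
```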

## Tips

Use the following tips when you define fingerprints for portfolios:

- Only use `"page": "first"` and `"page": "last"` if you're confident that these pages will never be omitted from the document.
- If you use `"page": "first"`, always pair it with another test type, such as `"page": "every"` or `"page": "last"`.
- Avoid `"page": "any"` unless all other page types fail to segment the document.



Examples
---
@@ -5,18 +5,25 @@ hidden: false

A [fingerprint](doc:fingerprint) for standalone documents changes Sensible's default behavior of running *all* the configs in a single document type. For example, if you extract company A and company B quotes, by default Sensible runs both the company A and the company B configs for a given document, then returns the extraction with the highest score.

The following table shows how this default behavior changes when you configure the following levels of strictness for a document type's fingerprints. You can configure strictness in the Sensible app in the document type settings tab:
The following tables show how this default behavior changes when you configure the following levels of strictness for a document type's fingerprints. You can configure strictness in the Sensible app in the document type settings tab:

## Single-document file fingerprints

| Strictness level | Description | If more than one config's tests pass over 50% | If no config's tests pass over 50%, or if no configs contain a fingerprint |
| ---------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| standard | If any of the configs in the document type contain a fingerprint, then Sensible runs extractions using any configs that pass over 50% of the fingerprint tests. | Sensible chooses the output from the passing config that has the highest score. | Sensible falls back to the default behavior of running extractions for the document using *all* configurations, and returns the one that has the highest score. |
| strict | The doc type must have at least one config containing a fingerprint. | Sensible chooses the output from the passing config that has the highest score. | Sensible returns a 400 error. |
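
The strictness rules in the table above can be sketched as a small selection function. This is an illustrative sketch only, not Sensible's code: the names `configs_to_run`, `passes_fingerprint`, and `NoMatchingConfigError` are hypothetical, and each config is assumed to be represented by a list of per-test pass/fail results (or `None` if it has no fingerprint).

```python
class NoMatchingConfigError(Exception):
    """Hypothetical stand-in for the 400 error returned in strict mode."""


def passes_fingerprint(test_results):
    """A config passes if over 50% of its fingerprint tests pass."""
    return bool(test_results) and sum(test_results) / len(test_results) > 0.5


def configs_to_run(configs, strictness="standard"):
    """Return the names of the configs to run extractions with.

    configs: dict mapping a config name to a list of booleans (one per
             fingerprint test), or to None if the config has no fingerprint.
    """
    passing = [
        name
        for name, results in configs.items()
        if results is not None and passes_fingerprint(results)
    ]
    if passing:
        # The extraction with the highest score among these is returned.
        return passing
    if strictness == "strict":
        raise NoMatchingConfigError("no config passed over 50% of its tests")
    # Standard mode falls back to running all configs.
    return list(configs)


configs = {
    "company_a": [True, True, False],   # 2 of 3 tests pass
    "company_b": [True, False, False],  # 1 of 3 tests pass
    "company_c": None,                  # no fingerprint
}
print(configs_to_run(configs, "standard"))  # ['company_a']
```

In either mode, if more than one config passes, the output with the highest score is chosen from among the passing configs.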

In the preceding table, a score is calculated as:
In the preceding table, Sensible calculates a score as follows:

`classification score` = `num of non-null fields` - `penalties for validation errors or warnings`, where penalties are as follows:

- `validation error penalty` = 1 * num fields with validation errors
- `validation warning penalty` = 0.5 * num of fields with validation warnings
- `validation error penalty` = 1 * `num fields with validation errors`
- `validation warning penalty` = 0.5 * `num of fields with validation warnings`
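
As a worked example of this formula, the following sketch computes the score from field counts. The function name `classification_score` and its parameter names are hypothetical and chosen only for illustration.

```python
def classification_score(num_non_null_fields, num_error_fields=0, num_warning_fields=0):
    """Score = non-null fields minus validation penalties.

    Each field with a validation error costs 1 point;
    each field with a validation warning costs 0.5 points.
    """
    error_penalty = 1 * num_error_fields
    warning_penalty = 0.5 * num_warning_fields
    return num_non_null_fields - error_penalty - warning_penalty


# 10 non-null fields, 2 with validation errors, 1 with a validation warning
print(classification_score(10, num_error_fields=2, num_warning_fields=1))  # 7.5
```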

The classification score is for comparing extractions within a single document type. To compare scores across document types, see [Accuracy measures](doc:accuracy-measures).

## Portfolio fingerprints

When using fingerprints for segmenting portfolio files into documents, Sensible ignores the document types' fingerprint mode setting. Sensible must find 100% of all matches in all tests to segment a document.

The classification score is for comparing extractions within a single document type. To compare scores across document types, see [Accuracy measures](doc:accuracy-measures).
@@ -19,6 +19,6 @@ You can choose from among the following options for extracting tables:
| method | multiple pages | merged cells | variable column formatting | checkboxes in cells | Tables in tables, labeled rows, and other complex formatting |
| ----------- | ------------------------------------------------------------ | ------------------------------------------------------------ | -------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Table | ✅<br/>Can extract tables that span multiple pages if the column headings repeat on each page. | ✅ <br />If you specify the Stop parameter, Sensible populates "empty" spanned cells with the spanned value. For an example, see [Merged cell example](doc:table#example-merged-cells). || ✅ <br />If you specify the Stop parameter, Sensible returns the selection status for checkboxes in table cells as `"[true]"` or `"[false]"`. | ❌<br/>Use Sections as an alternative |
| Fixed Table | ✅<br />Ignores repeating column headings. | ✅<br /> If you specify the Stop parameter, same behavior as for Table method. || ✅ <br />If you specify the Stop parameter, same behavior as for Table method. | ❌<br/>Use Sections as an alternative |
| Fixed Table | ✅<br />Ignores repeating column headings. | ✅<br /> If you specify the Stop parameter, Sensible populates "empty" spanned cells with the spanned value. For an example, see [Merged cell example](doc:table#example-merged-cells). || ✅ <br />If you specify the Stop parameter, Sensible returns the selection status for checkboxes in table cells as `"[true]"` or `"[false]"`. | ❌<br/>Use Sections as an alternative |
| Text Table | ✅<br />Supported if you specify the Stop parameter | ❌<br/>Sensible returns the first merged cell's value, and returns subsequent spanned cells as nulls. ||| ❌<br/>Use Sections as an alternative |
| NLP Table | ✅ <br />To troubleshoot intervening non-table text, use the Page Span Threshold parameter. | Indeterminate. Usually supported without additional prompting. || Indeterminate. | Indeterminate.<br/>Use Sections as an alternative. |
