Version 6 Changes #993

Balearica · 2025-01-07T14:42:03Z

Summary

Version 6 includes important under-the-hood improvements, along with a couple of breaking changes.

Major Improvements

Fixed memory leaks (Fix memory leaks #977)
- This version fixed a long-standing issue where memory would rise over time, eventually leading to a crash.
- Despite this fix, workers should still be refreshed periodically; see the note titled "Refreshing Workers" below.

Minor Improvements

Reduced runtime and memory usage for most users by updating default formats (Disable non-text output formats by default #916).
Fixed compatibility with Electron main process (createWorker throws exception with option.langPath set in electron #925).
Fixed bug where user-provided parameters were overwritten by defaults (Parameters set using createWorker config argument overwritten by default arguments #975).

Breaking Changes

All output formats other than text are now disabled by default.
- To re-enable the hocr output (for example), set the following: worker.recognize(image, {}, { hocr: true })
  - See here for a list of possible output formats.
The JavaScript object output format (blocks) was tweaked.
- Only the array of blocks (blocks) is returned.
  - Previous versions would automatically generate lists of every unit of text (words, symbols, etc.).
    - If needed, these should now be generated by the user.
- Only text-based blocks are reported.
  - Previous versions reported non-text blocks when detected by Tesseract (e.g. line segments).
- The shape of some objects was changed.
  - See the type declarations for reference on properties.
  - The main properties--text and bbox--are unchanged.
Various functions and options previously marked as deprecated have been removed.
- This includes worker.initialize and worker.loadLanguage, along with several deprecated options from v2.
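
As a rough illustration of how the opt-in output formats and the removal of worker.initialize / worker.loadLanguage affect calling code, here is a minimal sketch. `recognizeWithFormats` is a hypothetical helper, not part of Tesseract.js. `createWorker` is passed in as an argument and is assumed to be the Tesseract.js function, called with a language code such as 'eng'.

```javascript
// Sketch only. `createWorker` is injected so the snippet does not depend on how
// Tesseract.js is loaded, e.g. `const { createWorker } = require('tesseract.js');`
async function recognizeWithFormats(createWorker, image, formats = ['text']) {
  // The language is given to createWorker directly; there are no separate
  // loadLanguage/initialize calls.
  const worker = await createWorker('eng');
  try {
    // Turn ['text', 'hocr'] into { text: true, hocr: true }.
    const output = Object.fromEntries(formats.map((name) => [name, true]));
    const { data } = await worker.recognize(image, {}, output);
    return data;
  } finally {
    // Always release the worker, even if recognition fails.
    await worker.terminate();
  }
}

// Example call (assumes an image path or buffer in `image`):
// const data = await recognizeWithFormats(createWorker, image, ['text', 'hocr']);
// console.log(data.hocr);
```

Creating and terminating a worker on every call keeps the example short; code that processes many images would normally reuse a worker across calls.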

Discussion

Note: Refreshing Workers

The memory leak present in previous versions forced users running Tesseract.js on a server to periodically create fresh workers. Even though this should no longer be necessary to avoid a memory-related crash, server users should still occasionally create new workers.

This is because Tesseract workers "learn" over time by default. While this learning generally improves results, it assumes that (1) previous results are generally correct and (2) the image that is being recognized closely resembles previous images. As a result, if the same worker is used with hundreds of different documents from different users, it is common for Tesseract to "learn" something incorrect or inapplicable, making results worse than had a fresh worker been used.
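
One possible way to refresh workers on a server is to wrap the worker in a small object that replaces it after a fixed number of jobs. The sketch below is illustrative only: `createRefreshingRecognizer` and `makeWorker` are hypothetical names, and the default of 50 jobs is an arbitrary placeholder, not a recommendation from this release. `makeWorker` is assumed to be a function such as `() => createWorker('eng')`.

```javascript
// Hypothetical wrapper: replaces the underlying worker every `maxJobs` recognitions.
function createRefreshingRecognizer(makeWorker, maxJobs = 50) {
  let worker = null;
  let jobsDone = 0;
  // Jobs run one at a time so a worker is never terminated mid-recognition.
  let queue = Promise.resolve();

  async function run(image, options, output) {
    if (worker === null || jobsDone >= maxJobs) {
      if (worker !== null) {
        await worker.terminate();
        worker = null;
      }
      worker = await makeWorker();
      jobsDone = 0;
    }
    jobsDone += 1;
    return worker.recognize(image, options, output);
  }

  return {
    recognize(image, options = {}, output = {}) {
      const result = queue.then(() => run(image, options, output));
      // Keep the queue usable even if this job rejects.
      queue = result.catch(() => {});
      return result;
    },
    async terminate() {
      await queue;
      if (worker !== null) await worker.terminate();
      worker = null;
    },
  };
}

// Example (assumes `createWorker` from Tesseract.js):
// const ocr = createRefreshingRecognizer(() => createWorker('eng'), 100);
// const { data } = await ocr.recognize(image);
```

Refreshing by job count is only one policy; refreshing per user or per document batch would follow the same pattern with a different condition.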

Why were non-text output formats disabled by default?

Previous versions of Tesseract.js exported results to 4 different formats by default. After analyzing performance, we found that the unnecessary formats were routinely adding 0.25-0.50s per page to runtime when recognizing documents, in addition to increasing memory usage. Therefore, requiring users to manually specify any non-text format(s) they need should result in non-trivial performance gains.

Why was the JavaScript object output (`blocks`) changed?

While there are several reasons why minor changes were made to this format, the primary goal was fixing memory leaks present in the previous implementation. The new version is free of leaks, runs faster, and should be easier to maintain. Note: if any blocks output property that you rely on was cut in this release, please leave a comment and we can consider adding it back.

Sahil-Maiyani · 2025-01-10T18:48:14Z

Hi,

We use the paragraphs > lines > words structure in our existing project, and we need support for bbox data. Could you add support for it back, or suggest an alternative approach? Thanks.

Balearica · 2025-01-17T08:22:15Z

@Sahil-Maiyani Results are still reported as JavaScript objects if you enable the blocks output option, which still include bbox data. The only differences in v6 are that (1) you need to manually enable the blocks output and (2) if you want an array of paragraphs/lines/words/symbols you now need to calculate that yourself using blocks.

A code snippet showing how to get a list of paragraphs/lines/words/symbols using v6 is below.

// Run recognition with `blocks` (JavaScript object output) enabled
const ret = await worker.recognize(img, {}, { blocks: true });
// Array of paragraphs
const paragraphs = ret.data.blocks.map((block) => block.paragraphs).flat();
// Array of lines
const lines = ret.data.blocks.map((block) => block.paragraphs.map((paragraph) => paragraph.lines)).flat(2);
// Array of words
const words = ret.data.blocks.map((block) => block.paragraphs.map((paragraph) => paragraph.lines.map((line) => line.words))).flat(3);
// Array of symbols
const symbols = ret.data.blocks.map((block) => block.paragraphs.map((paragraph) => paragraph.lines.map((line) => line.words.map((word) => word.symbols)))).flat(4);
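
If these lists are needed in several places, the same traversal could be wrapped in a single helper that walks the tree once. This is only a sketch: `collectUnits` is a hypothetical name, and the fallback to empty arrays assumes that `blocks` (or a nested array) may be missing, for example when the blocks output was not enabled.

```javascript
// Hypothetical helper: flattens the `blocks` tree into per-level arrays in one pass.
function collectUnits(blocks) {
  const units = { paragraphs: [], lines: [], words: [], symbols: [] };
  // `?? []` treats a missing level as empty instead of throwing.
  for (const block of blocks ?? []) {
    for (const paragraph of block.paragraphs ?? []) {
      units.paragraphs.push(paragraph);
      for (const line of paragraph.lines ?? []) {
        units.lines.push(line);
        for (const word of line.words ?? []) {
          units.words.push(word);
          units.symbols.push(...(word.symbols ?? []));
        }
      }
    }
  }
  return units;
}

// Example (assumes `worker` and `img` as in the snippet above):
// const ret = await worker.recognize(img, {}, { blocks: true });
// const { words } = collectUnits(ret.data.blocks);
// words.forEach((word) => console.log(word.text, word.bbox));
```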

Balearica pinned this issue Jan 7, 2025
