Version 6 Changes #993
Comments
Hi, we are using the paragraphs > lines > words structure in our existing project, and we need support for bbox data. Could you add support for it back, or suggest an alternative approach? Thanks.
@Sahil-Maiyani Results are still reported as JavaScript objects if you enable the `blocks` output. A code snippet showing how to get a list of paragraphs, lines, words, and symbols is below.
// Run recognition with `blocks` (JavaScript object output) enabled
const ret = await worker.recognize(img, {}, { blocks: true });
// Array of paragraphs
const paragraphs = ret.data.blocks.map((block) => block.paragraphs).flat();
// Array of lines
const lines = ret.data.blocks.map((block) => block.paragraphs.map((paragraph) => paragraph.lines)).flat(2);
// Array of words
const words = ret.data.blocks.map((block) => block.paragraphs.map((paragraph) => paragraph.lines.map((line) => line.words))).flat(3);
// Array of symbols
const symbols = ret.data.blocks.map((block) => block.paragraphs.map((paragraph) => paragraph.lines.map((line) => line.words.map((word) => word.symbols)))).flat(4);
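For the bbox use case raised above, the same traversal can be wrapped in a small helper. This is only a sketch: `collectWords` is a hypothetical name, and it assumes each word object carries the `text` and `bbox` properties (the `{ x0, y0, x1, y1 }` shape used in the comment is an assumption, not something confirmed in this thread).

```javascript
// Hypothetical helper: flatten the `blocks` hierarchy into a flat list of
// words, keeping only the text and bounding box of each word.
// Assumption: bbox looks like { x0, y0, x1, y1 }; it is passed through as-is.
function collectWords(blocks) {
  return (blocks || []).flatMap((block) =>
    (block.paragraphs || []).flatMap((paragraph) =>
      (paragraph.lines || []).flatMap((line) =>
        (line.words || []).map((word) => ({ text: word.text, bbox: word.bbox })))));
}

// Usage, assuming `worker` and `img` already exist:
// const ret = await worker.recognize(img, {}, { blocks: true });
// const words = collectWords(ret.data.blocks);
```

Missing levels are treated as empty arrays, so a page with no recognized text should yield an empty list rather than throw.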
Summary
Version 6 includes important under-the-hood improvements, along with a couple of breaking changes.
Major Improvements
Minor Improvements
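- Fixed the `createWorker` `config` argument being overwritten by default arguments (#975).
Breaking Changes
- Output formats other than `text` are now disabled by default.
  - To enable `hocr` output (for example), set the following: `worker.recognize(image, {}, { hocr: true })`
- The JavaScript object output (`blocks`) was tweaked.
  - Only the top-level object (`blocks`) is returned. Lower levels (`words`, `symbols`, etc.) are reached by navigating through it.
  - Some properties were removed. The most commonly used properties (`text` and `bbox`) are unchanged.
- Removed `worker.initialize` and `worker.loadLanguage`, along with several deprecated options from v2.
Discussion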
Note: Refreshing Workers
The memory leak present in previous versions forced users running Tesseract.js on a server to periodically create fresh workers. Even though this should no longer be necessary to avoid a memory-related crash, server users should still occasionally create new workers.
This is because Tesseract workers "learn" over time by default. While this learning generally improves results, it assumes that (1) previous results are generally correct and (2) the image being recognized closely resembles previous images. As a result, if the same worker is used with hundreds of different documents from different users, it is common for Tesseract to "learn" something incorrect or inapplicable, making results worse than if a fresh worker had been used.
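One way to follow this advice is to replace the worker after a fixed number of jobs. The sketch below is illustrative only: `createRefreshingRecognizer`, `makeWorker`, and `maxJobs` are hypothetical names, the threshold of 50 is arbitrary, and the code assumes jobs are submitted one at a time (it is not safe for concurrent calls).

```javascript
// Hypothetical wrapper: recreate the worker after `maxJobs` recognitions.
// `makeWorker` is any async factory, e.g. () => createWorker('eng').
// Assumption: calls are made sequentially, not concurrently.
function createRefreshingRecognizer(makeWorker, maxJobs = 50) {
  let worker = null;
  let jobs = 0;

  return async function recognize(image, options = {}, output = {}) {
    // Retire the current worker once it has handled `maxJobs` images.
    if (worker && jobs >= maxJobs) {
      await worker.terminate();
      worker = null;
    }
    // Lazily create a fresh worker when none is active.
    if (!worker) {
      worker = await makeWorker();
      jobs = 0;
    }
    jobs += 1;
    return worker.recognize(image, options, output);
  };
}

// Usage, assuming Tesseract.js is installed:
// const { createWorker } = require('tesseract.js');
// const recognize = createRefreshingRecognizer(() => createWorker('eng'), 50);
// const ret = await recognize(img);
```

Injecting the factory keeps the wrapper independent of how workers are configured. The wrapper does not terminate the last worker on shutdown, so a real server would need to handle that separately.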
Why were non-text output formats disabled by default?
Previous versions of Tesseract.js exported results to 4 different formats by default. After analyzing performance, we found that the unnecessary formats were routinely adding 0.25-0.50s per page to runtime when recognizing documents, in addition to increasing memory usage. Therefore, requiring users to manually specify any non-text format(s) they need should result in non-trivial performance gains.
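In practice this means passing the third `recognize` argument only for the formats actually needed. The following is a minimal sketch; `buildOutput` and `recognizeWithFormats` are hypothetical helpers, and the only format names used are `hocr` and `blocks`, which appear elsewhere in this thread.

```javascript
// Hypothetical helper: turn a list of format names into the output object
// passed as the third argument to worker.recognize.
function buildOutput(formats) {
  const output = {};
  for (const format of formats) {
    output[format] = true;
  }
  return output;
}

// Hypothetical helper: run recognition requesting only the listed formats.
async function recognizeWithFormats(worker, image, formats = []) {
  return worker.recognize(image, {}, buildOutput(formats));
}

// Usage, assuming `worker` and `img` already exist:
// const ret = await recognizeWithFormats(worker, img, ['hocr', 'blocks']);
```

Requesting formats per call, rather than enabling everything up front, keeps the runtime cost described above limited to the pages that need the extra output.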
Why was the JavaScript object output (`blocks`) changed?
While there are several reasons why minor changes were made to this format, the primary goal was fixing memory leaks present in the previous implementation. The new version is free of leaks, runs faster, and should be easier to maintain. Note: if any `blocks` output property that you rely on was cut in this release, please leave a comment and we can consider adding it back.