mackurzawa
Contributor

Description

Handled the PageCountExceededError raised by the unstructured open-source library. Added the UNSTRUCTURED_MAX_PDF_PAGES environment variable, which sets the maximum number of pages allowed in a PDF file when the hi_res strategy is used, whether it is requested directly by the API consumer or resolved from the auto strategy.
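The behavior described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `PageCountExceededError` and `HTTPError` classes are local stand-ins for the library error and the web framework's HTTP exception, the helper names (`max_pdf_pages`, `check_page_count`, `handle_request`) are hypothetical, and the 422 status code is an assumption. In the real service the page-count check presumably happens inside the library, and the API only catches the error.

```python
import os

DEFAULT_MAX_PDF_PAGES = 300  # default value stated in the PR description


class PageCountExceededError(ValueError):
    """Local stand-in for the error raised by the unstructured library."""


class HTTPError(Exception):
    """Local stand-in for a web framework's HTTP exception."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def max_pdf_pages() -> int:
    """Read the limit from the environment, falling back to the default."""
    return int(os.environ.get("UNSTRUCTURED_MAX_PDF_PAGES", DEFAULT_MAX_PDF_PAGES))


def check_page_count(page_count: int, effective_strategy: str) -> None:
    """Reject the document only when hi_res is the strategy actually used.

    `effective_strategy` is the strategy after `auto` has been resolved.
    """
    limit = max_pdf_pages()
    if effective_strategy == "hi_res" and page_count > limit:
        raise PageCountExceededError(
            f"PDF has {page_count} pages, which exceeds the limit of {limit}."
        )


def handle_request(page_count: int, effective_strategy: str) -> str:
    """Translate the library error into an HTTP error for the API consumer."""
    try:
        check_page_count(page_count, effective_strategy)
    except PageCountExceededError as exc:
        # The status code is an assumption made for this sketch.
        raise HTTPError(422, str(exc)) from exc
    return "ok"
```

With `UNSTRUCTURED_MAX_PDF_PAGES=2`, a call such as `handle_request(3, "hi_res")` raises the stand-in `HTTPError`, while `handle_request(3, "fast")` passes, which mirrors the three scenarios listed under Testing.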

Testing

# works: the number of pages does not exceed UNSTRUCTURED_MAX_PDF_PAGES (300 by default); the auto strategy transforms to hi_res
curl -X 'POST'   'http://localhost:8000/general/v0/general'   -H 'accept: application/json'   -H 'Content-Type: multipart/form-data'   -F 'files=@sample-docs/DA-1p-with-duplicate-pages.pdf'   -F 'output_format="text/csv"'

# on server
export UNSTRUCTURED_MAX_PDF_PAGES=2
# throws HTTPException: the number of pages exceeds UNSTRUCTURED_MAX_PDF_PAGES and the auto strategy transforms to hi_res
curl -X 'POST'   'http://localhost:8000/general/v0/general'   -H 'accept: application/json'   -H 'Content-Type: multipart/form-data'   -F 'files=@sample-docs/DA-1p-with-duplicate-pages.pdf'   -F 'output_format="text/csv"'

# on server
export UNSTRUCTURED_MAX_PDF_PAGES=2
# works: the number of pages exceeds UNSTRUCTURED_MAX_PDF_PAGES, but the fast strategy is chosen
curl -X 'POST'   'http://localhost:8000/general/v0/general'   -H 'accept: application/json'   -H 'Content-Type: multipart/form-data'   -F 'files=@sample-docs/DA-1p-with-duplicate-pages.pdf'   -F 'output_format="text/csv"' -F 'strategy=fast'

README.md Outdated
@@ -373,6 +373,7 @@ As mentioned above, processing a pdf using `hi_res` is currently a slow operatio
* `UNSTRUCTURED_PARALLEL_MODE_THREADS` - the number of threads making requests at once, default is `3`.
* `UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE` - the number of pages to be processed in one request, default is `1`.
* `UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS` - the number of retry attempts on a retryable error, default is `2`. (i.e. 3 attempts are made in total)
* `UNSTRUCTURED_MAX_PDF_PAGES` - the maximum number of pages in pdf file that will not be rejected in `hi_res` strategy, default is `300`.
Contributor

Let's name it more precisely: `UNSTRUCTURED_PDF_HI_RES_MAX_PAGES`
