A FastAPI application to extract text from PDF documents.
The PDF Text Extraction service is available as a Docker image.
docker pull ghcr.io/data-house/pdf-text-extractor:main
A sample docker-compose.yaml file is available within the repository.
Please refer to Releases and Packages for the available tags.
The PDF Text Extract service exposes a web application. The available API receives a PDF file via a URL and returns the extracted text as a JSON response.
The exposed service is unauthenticated, so consider exposing it only within a trusted network. If you plan to make it publicly available, consider adding a reverse proxy with authentication in front of it.
The service exposes only one endpoint, /extract-text, which accepts a POST request with the following input as a JSON body:
- url: the URL of the PDF file to process.
- mime_type: the MIME type of the file (it is expected to be application/pdf).
- driver: the extraction backend to use. Two drivers are currently implemented: pymupdf and pdfact.
Warning: the processing is performed synchronously.
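As an illustration of the request described above, the following minimal sketch builds and sends the POST request using only the Python standard library. The service address (http://localhost:8000) and the helper names build_extract_request and extract_text are assumptions made for this example; adjust them to your deployment.

```python
import json
import urllib.request

# Assumption: the service is reachable at this address; the host and port
# depend on how you run the container or the development server.
SERVICE_URL = "http://localhost:8000/extract-text"


def build_extract_request(pdf_url, driver="pymupdf", service_url=SERVICE_URL):
    """Build the POST request for /extract-text without sending it."""
    payload = {
        "url": pdf_url,
        "mime_type": "application/pdf",
        "driver": driver,  # "pymupdf" or "pdfact"
    }
    return urllib.request.Request(
        service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(pdf_url, driver="pymupdf", timeout=120):
    """Send the request and return the decoded JSON response."""
    request = build_extract_request(pdf_url, driver)
    # Processing is synchronous, so this call blocks until the service
    # has finished extracting the text (or the timeout expires).
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Because the call blocks for the whole extraction, a generous timeout is a reasonable choice for large documents.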
The response is a JSON with the extracted text organized into typed nodes, making it easy to navigate and understand the different components of a document. In particular, the structure is as follows:
- category: A string specifying the node category, which is doc.
- content: A list of page nodes representing the pages within the document.
Each page node contains the following information:
- category: A string specifying the node category, which is page.
- attributes: A list containing attributes of the page. Currently, it includes only page, the page number of the node.
- content: A list of chunks, each representing a segment of text extracted from the page.
In particular, each chunk in content contains the following information:
- role: The role of the chunk in the document (e.g., heading, body, etc.).
- text: The text extracted from the chunk.
- marks: A list of marks that characterize the text extracted from the chunk.
- attributes: A list containing attributes of the chunk, currently including:
  - A list of bounding_box attributes that contain the text. Each bounding box is identified by 4 coordinates, min_x, min_y, max_x, and max_y, plus page, which is the page number where the bounding box is located.
Each mark of a chunk contains:
- category: the type of the mark, which can be bold, italic, textStyle, or link.
If the mark category is textStyle, it includes additional attributes:
- font: An object representing the font of the text chunk. Each font is represented by name, id, and size. Available only when using the pdfact driver.
- color: The color of the text chunk. Each color is represented by r, g, b, and id. Available only when using the pdfact driver.
If the mark category is link, it provides the url of the link.
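The node structure described above can be walked with a couple of small helpers. This is a sketch under stated assumptions: the sample response is hypothetical and trimmed to the fields it uses, the helper names iter_chunks and collect_links are invented for illustration, and the page position is taken from list order rather than from the attributes field, whose exact shape is not assumed here.

```python
def iter_chunks(document):
    """Yield (page_position, role, text) for every chunk in a doc node."""
    if document.get("category") != "doc":
        raise ValueError("expected a node with category 'doc'")
    # Pages are numbered by their position in the content list (1-based).
    for position, page in enumerate(document.get("content", []), start=1):
        for chunk in page.get("content", []):
            yield position, chunk.get("role"), chunk.get("text", "")


def collect_links(document):
    """Return the url of every mark whose category is 'link'."""
    return [
        mark["url"]
        for page in document.get("content", [])
        for chunk in page.get("content", [])
        for mark in chunk.get("marks", [])
        if mark.get("category") == "link" and "url" in mark
    ]


# Hypothetical response, trimmed to the fields used by the helpers.
sample = {
    "category": "doc",
    "content": [
        {
            "category": "page",
            "content": [
                {
                    "role": "heading",
                    "text": "Introduction",
                    "marks": [{"category": "bold"}],
                },
                {
                    "role": "body",
                    "text": "See the project site.",
                    "marks": [{"category": "link", "url": "https://example.com"}],
                },
            ],
        }
    ],
}

for position, role, text in iter_chunks(sample):
    print(position, role, text)
```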
The service can return the following errors:
code | message | description |
---|---|---|
422 | No url found in request | In case the url field in the request is missing |
422 | No mime_type found in request | In case the mime_type field in the request is missing |
422 | Unsupported file type | In case the file is not a PDF |
500 | Error while saving file | In case it was not possible to download the file from the specified URL |
500 | Error while parsing file | In case it was not possible to open the file after download |
The body of the response can contain a JSON with the following fields:
- code: the error code
- message: the error description
- type: the type of the error
{
"code": 500,
"message": "Error while parsing file",
"type": "Internal Server Error"
}
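A client may want to turn such an error body into a typed value. The following is a minimal sketch, not part of the service: the ServiceError class and the parse_error helper are hypothetical names, and since the body can (but might not) contain these fields, the helper returns None when they are missing or the body is not valid JSON.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceError:
    code: int
    message: str
    type: str


def parse_error(body: str) -> Optional[ServiceError]:
    """Parse an error body; return None if it lacks the expected fields."""
    try:
        data = json.loads(body)
        return ServiceError(
            code=int(data["code"]),
            message=str(data["message"]),
            type=str(data["type"]),
        )
    except (ValueError, KeyError, TypeError):
        # Not JSON, not an object, or one of the fields is missing.
        return None


error = parse_error(
    '{"code": 500, "message": "Error while parsing file", '
    '"type": "Internal Server Error"}'
)
if error is not None and error.code >= 500:
    print(f"Server-side failure: {error.message}")
```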
The PDF text extract service is built using FastAPI and Python 3.9.
Given the selected stack, development requires:
- Python 3.9 with PIP
- Docker (optional) to test the build
Install all the required dependencies:
pip install -r requirements.txt
Run the local development application using:
fastapi dev text_extractor_api/main.py
to be documented
Thank you for considering contributing to the PDF text extract service! The contribution guide can be found in the CONTRIBUTING.md file.
The project is provided and supported by OneOff-Tech (UG).
If you discover a security vulnerability within PDF Text Extract, please send an e-mail to the OneOff-Tech team via security@oneofftech.xyz. All security vulnerabilities will be promptly addressed.