PDF Text Extraction Service

A FastAPI application to extract text from PDF documents.

Getting started

The PDF Text Extraction service is available as a Docker image.

docker pull ghcr.io/data-house/pdf-text-extractor:main

A sample docker-compose.yaml file is available within the repository.

Please refer to Releases and Packages for the available tags.

Usage

The PDF Text Extraction service exposes a web application. Its API receives a PDF file via a URL and returns the extracted text as a JSON response.

The exposed service is unauthenticated, so consider exposing it only within a trusted network. If you plan to make it publicly available, consider adding a reverse proxy with authentication in front of it.

Text extraction endpoint

The service exposes a single endpoint, /extract-text, that accepts a POST request with the following fields in a JSON body:

url: the URL of the PDF file to process.
mime_type: the MIME type of the file (expected to be application/pdf).
driver: the extraction backend to use. Two drivers are currently implemented: pymupdf and pdfact.

Warning: the processing is performed synchronously, so the request does not return until the extraction has completed.
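As a rough illustration of the request described above, the following sketch builds the POST request with Python's standard library. The base URL `http://localhost:5000` and the helper names `build_extract_request` and `extract_text` are assumptions made for this example and are not defined by the service; adjust them to your deployment. Because processing is synchronous, the sketch uses a generous timeout.

```python
import json
import urllib.request

# Assumption: the address where the container is reachable in your setup.
SERVICE_URL = "http://localhost:5000"


def build_extract_request(pdf_url, driver, service_url=SERVICE_URL):
    """Build the POST request for the /extract-text endpoint."""
    payload = {
        "url": pdf_url,
        "mime_type": "application/pdf",
        "driver": driver,  # "pymupdf" or "pdfact"
    }
    return urllib.request.Request(
        f"{service_url}/extract-text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(pdf_url, driver, timeout=120):
    """Send the request and return the decoded JSON response."""
    request = build_extract_request(pdf_url, driver)
    # The call blocks until the extraction finishes, hence the long timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `extract_text("https://example.com/file.pdf", "pymupdf")` would then return the parsed document, provided the service can reach that URL.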

The response is a JSON document with the extracted text organized into typed nodes, which makes it easy to navigate the different components of a document. The root node has the following structure:

category: A string specifying the node category, which is doc.
content: A list of page nodes representing the pages within the document.

Each page node contains the following information:

category: A string specifying the node category, which is page.
attributes: A list containing attributes of the page. Currently, it includes only page, the page number.
content: A list of chunks, each representing a segment of text extracted from the page.

Each chunk contains the following information:

role: The role of the chunk in the document (e.g., heading, body).
text: The text extracted from the chunk.
marks: A list of marks that characterize the text extracted from the chunk.
attributes: A list containing attributes of the chunk, currently including:
- bounding_box: A list of bounding boxes that contain the text. Each bounding box is identified by four coordinates (min_x, min_y, max_x, max_y) and by page, the number of the page where the bounding box is located.
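The doc, page, and chunk nesting described above can be walked with two small helpers. This is a minimal sketch: the function names `iter_chunks` and `plain_text` are hypothetical, the sample is hand-written to follow the description rather than real service output, and the attributes fields are left out of the sample because their exact JSON layout is not shown here. The page index is taken from the position in the list, not from the page attribute.

```python
def iter_chunks(document):
    """Yield (page_index, chunk) pairs from a doc node."""
    for page_index, page in enumerate(document.get("content", []), start=1):
        for chunk in page.get("content", []):
            yield page_index, chunk


def plain_text(document, roles=None):
    """Join chunk texts, optionally keeping only chunks with the given roles."""
    parts = [
        chunk.get("text", "")
        for _, chunk in iter_chunks(document)
        if roles is None or chunk.get("role") in roles
    ]
    return "\n".join(parts)


# Hand-written sample following the described structure (not real output).
sample = {
    "category": "doc",
    "content": [
        {
            "category": "page",
            "content": [
                {"role": "heading", "text": "Introduction", "marks": []},
                {"role": "body", "text": "First paragraph.", "marks": []},
            ],
        },
        {
            "category": "page",
            "content": [{"role": "body", "text": "Second page.", "marks": []}],
        },
    ],
}

print(plain_text(sample))
print(plain_text(sample, roles={"heading"}))
```

Filtering by role is one way to pull out only headings or only body text without re-parsing the PDF.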

Each mark contains:

category: The type of the mark, which can be bold, italic, textStyle, or link.

If the mark type is textStyle, it includes additional attributes:

font: An object representing the font of the text chunk. Each font is represented by name, id, and size. Available only with the pdfact driver.
color: An object representing the color of the text chunk. Each color is represented by r, g, b, and id. Available only with the pdfact driver.

If the mark category is link, it provides the url of the link.
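To show how marks might be consumed, here is a small sketch that collects link targets and checks for bold text. The helper names `collect_links` and `is_bold` are hypothetical, the sample chunks are hand-written, and the `url` key on link marks is assumed from the description above.

```python
def collect_links(chunks):
    """Return (text, url) pairs for every chunk carrying a link mark."""
    links = []
    for chunk in chunks:
        for mark in chunk.get("marks", []):
            # Assumption: link marks expose their target under the "url" key.
            if mark.get("category") == "link" and mark.get("url"):
                links.append((chunk.get("text", ""), mark["url"]))
    return links


def is_bold(chunk):
    """Tell whether a chunk carries at least one bold mark."""
    return any(mark.get("category") == "bold" for mark in chunk.get("marks", []))


# Hand-written sample chunks (not real service output).
chunks = [
    {"role": "heading", "text": "Overview", "marks": [{"category": "bold"}]},
    {
        "role": "body",
        "text": "Project page",
        "marks": [{"category": "link", "url": "https://example.com"}],
    },
    {"role": "body", "text": "Plain text.", "marks": []},
]

print(collect_links(chunks))
print([is_bold(chunk) for chunk in chunks])
```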

Error handling

The service can return the following errors:

code	message	description
`422`	No url found in request	In case the `url` field in the request is missing
`422`	No mime_type found in request	In case the `mime_type` field in the request is missing
`422`	Unsupported file type	In case the file is not a PDF
`500`	Error while saving file	In case it was not possible to download the file from the specified URL
`500`	Error while parsing file	In case it was not possible to open the file after download

The body of the response can contain a JSON object with the following fields:

code: the error code
message: the error description
type: the type of the error

{
  "code": 500,
  "message": "Error while parsing file",
  "type": "Internal Server Error"
}
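A client may want to turn such a response into an exception. The sketch below is one possible approach: `ExtractionError` and `parse_error` are hypothetical names, and the fallback values are assumptions to cover the case where the body is not the JSON object described above.

```python
import json


class ExtractionError(Exception):
    """Error reported by the extraction service."""

    def __init__(self, code, message, error_type):
        super().__init__(f"{code} {error_type}: {message}")
        self.code = code
        self.message = message
        self.error_type = error_type

    @property
    def is_client_error(self):
        # 4xx codes (e.g. 422) point at the request, 5xx at the service.
        return 400 <= self.code < 500


def parse_error(status, body):
    """Build an ExtractionError from an HTTP status and a raw response body."""
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        data = {}
    if not isinstance(data, dict):
        data = {}
    return ExtractionError(
        code=data.get("code", status),
        message=data.get("message", "Unknown error"),
        error_type=data.get("type", "Unknown"),
    )


error = parse_error(
    500,
    '{"code": 500, "message": "Error while parsing file", "type": "Internal Server Error"}',
)
print(error)
```

Separating 4xx from 5xx lets a caller decide whether to fix the request or retry later.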

Development

The PDF Text Extraction service is built using FastAPI and Python 3.9.

Given the selected stack, development requires:

Python 3.9 with pip
Docker (optional) to test the build

Install all the required dependencies:

pip install -r requirements.txt

Run the local development application using:

fastapi dev text_extractor_api/main.py

Testing

To be documented.

Contributing

Thank you for considering contributing to the PDF Text Extraction service! The contribution guide can be found in the CONTRIBUTING.md file.

Supporters

The project is provided and supported by OneOff-Tech (UG).

Security Vulnerabilities

If you discover a security vulnerability within the PDF Text Extraction service, please send an e-mail to the OneOff-Tech team via security@oneofftech.xyz. All security vulnerabilities will be promptly addressed.

