This project extracts responses from manually completed surveys that have been scanned to PDF. It uses image processing techniques to identify and grade multiple-choice answers from questionnaires.
The system processes scanned PDF surveys through the following steps:
- PDF to image conversion
- Image cropping and alignment
- Detection of answer grids
- Response extraction
- Grading and result compilation
This automated approach allows for efficient processing of large volumes of paper-based surveys, converting them into digital data for analysis.
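As a rough illustration of the response extraction and grading steps, the sketch below picks the most heavily marked option in each row of an answer grid and compares it with an answer key. It is not the project's actual implementation: the names `fill_ratio`, `extract_responses`, and `grade`, the thresholds, and the idea of grading against an answer key are all assumptions made for this example, and a real grid would come from the cropped and aligned page images rather than hand-built lists.

```python
from typing import List, Optional

# Hypothetical representation: a cell is a 2D block of grayscale pixels
# (0 = black ink, 255 = white paper).
Cell = List[List[int]]


def fill_ratio(cell: Cell, dark_threshold: int = 128) -> float:
    """Return the fraction of pixels in a cell darker than dark_threshold."""
    pixels = [p for row in cell for p in row]
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if p < dark_threshold) / len(pixels)


def extract_responses(
    grid: List[List[Cell]], min_fill: float = 0.3
) -> List[Optional[int]]:
    """For each question (one row of cells), return the index of the most
    filled option, or None when no option reaches min_fill."""
    responses: List[Optional[int]] = []
    for row in grid:
        if not row:
            responses.append(None)
            continue
        ratios = [fill_ratio(cell) for cell in row]
        best = max(range(len(ratios)), key=ratios.__getitem__)
        responses.append(best if ratios[best] >= min_fill else None)
    return responses


def grade(responses: List[Optional[int]], answer_key: List[int]) -> int:
    """Count the responses that match the (assumed) answer key."""
    return sum(
        1 for r, k in zip(responses, answer_key) if r is not None and r == k
    )


# Synthetic 4x4 cells: one fully white, one fully inked.
blank = [[255] * 4 for _ in range(4)]
marked = [[0] * 4 for _ in range(4)]
grid = [
    [blank, marked, blank],   # option 1 marked
    [marked, blank, blank],   # option 0 marked
    [blank, blank, blank],    # left unanswered
]
responses = extract_responses(grid)
print(responses)
print(grade(responses, answer_key=[1, 2, 0]))
```

Returning `None` for an unanswered question keeps blank rows distinct from wrong answers, which matters when the results are compiled for analysis.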
- Clone the repository:
  git clone https://github.com/Hadrien-Cornier/ocr-pdf.git
  cd ocr-pdf
- Create and activate a virtual environment:
  For macOS and Linux:
  python -m venv .venv
  source .venv/bin/activate
  For Windows:
  python -m venv .venv
  .venv\Scripts\activate
- Install the requirements:
  pip install -r requirements.txt
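Before installing packages, it can help to confirm that the virtual environment is actually active. The snippet below is an optional check and not part of the repository; `in_virtualenv` is a hypothetical helper name. It relies on the fact that inside an environment created by `python -m venv`, `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```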
- Place your PDF files in the data/input/ directory.
- Run the pipeline:
  python src/run_pipeline.py
  This will execute the following steps:
  - Crop the PDFs (output in data/cropped/)
  - Align the images (output in data/aligned/)
  - Perform OCR and grade extraction (output in data/output/)
- Check the results in the data/output/ directory.
- Debug images for each step can be found in the respective subdirectories of data/debug/.
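The directory layout used in the steps above can be prepared and inspected with a few lines of Python. This is a minimal sketch, not code from the repository: `prepare_data_dirs` and `STAGE_DIRS` are hypothetical names, and whether the pipeline creates these folders on its own is not stated here, so creating them up front is only an assumption. The demo runs in a temporary directory so it does not touch a real checkout.

```python
import tempfile
from pathlib import Path
from typing import List

# Stage folders named in the usage steps (all under data/).
STAGE_DIRS = ("cropped", "aligned", "output", "debug")


def prepare_data_dirs(root: Path) -> List[Path]:
    """Create the data/ stage folders under root and return the input PDFs."""
    data = root / "data"
    input_dir = data / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    for name in STAGE_DIRS:
        (data / name).mkdir(parents=True, exist_ok=True)
    # Match the extension case-insensitively so "scan.PDF" is also found.
    return sorted(p for p in input_dir.iterdir() if p.suffix.lower() == ".pdf")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "data" / "input").mkdir(parents=True)
        (root / "data" / "input" / "survey_01.pdf").touch()
        print([p.name for p in prepare_data_dirs(root)])
```

Listing the input files before a run is a quick way to catch an empty or misplaced data/input/ folder.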
Adjust the settings in config/config.ini to customize the pipeline behavior.
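An INI file such as config/config.ini can be read with Python's standard `configparser` module. The example below is only a sketch of that pattern: the section names (`alignment`, `grading`), the keys (`dpi`, `fill_threshold`, `save_debug_images`), and the fallback values are invented for illustration and may not match the options the project actually defines, so check the real file for the supported settings.

```python
import configparser

# Hypothetical INI content; the real config/config.ini may differ.
SAMPLE = """
[alignment]
dpi = 300

[grading]
fill_threshold = 0.3
save_debug_images = yes
"""


def load_settings(text: str) -> dict:
    """Parse INI text and return typed settings, using fallbacks when absent."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "dpi": parser.getint("alignment", "dpi", fallback=200),
        "fill_threshold": parser.getfloat(
            "grading", "fill_threshold", fallback=0.5
        ),
        "save_debug_images": parser.getboolean(
            "grading", "save_debug_images", fallback=False
        ),
    }


if __name__ == "__main__":
    print(load_settings(SAMPLE))
```

To read from disk instead of a string, `parser.read("config/config.ini")` can replace the `read_string` call.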
See requirements.txt for the list of Python packages required.