Updated for Kokoro v1.0!
Setup is now easier: install the required Python dependencies (including the updated Kokoro package) and run the app. No more manual downloads or moving model files into specific folders.
PDF Narrator (Kokoro Edition) converts PDF documents into audiobooks using intelligent text extraction and Kokoro TTS.
-
Audio Sample
Listen to a short sample of the generated audiobook:
Audio Sample
-
Intelligent PDF Text Extraction
- Skips headers, footers, and page numbers.
- Optionally splits based on Table of Contents (TOC) or extracts the entire document.
-
Kokoro TTS Integration
- Generate natural-sounding audiobooks with the updated Kokoro v1.0 model.
- Easily select or swap out different .pt voicepacks.
-
User-Friendly GUI
- Modern interface built with ttkbootstrap (theme selector, scrolled logs, progress bars).
- Pause/resume and cancel your audiobook generation anytime.
-
Configurable for Low-VRAM Systems
- Choose the chunk size for text to accommodate limited GPU resources.
- Switch to CPU if no GPU is available.
- Python 3.8+
- FFmpeg (for audio-related tasks on some systems)
- Torch (PyTorch for the Kokoro TTS model)
- Other dependencies as listed in requirements.txt
-
Clone the Repository
git clone https://github.com/mateogon/pdf-narrator.git
cd pdf-narrator
-
Create and Activate a Virtual Environment
python -m venv venv
# On Linux/macOS:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
-
Install Python Dependencies
pip install --upgrade pip
pip install -r requirements.txt
-
Install Kokoro v1.0
The updated Kokoro package is now available on PyPI. Simply install it:
pip install "kokoro>=1.0.0"
-
Install FFmpeg (if required)
- Ubuntu/Debian:
sudo apt-get install ffmpeg
- macOS:
brew install ffmpeg
- Windows:
Download from the FFmpeg official site and follow the installation instructions.
For Windows users, some libraries may require extra steps:
-
Python 3.12.7
Download and install Python 3.12.7. Ensure python and pip are added to your system's PATH.
-
CUDA 12.4 (for GPU acceleration)
Install the CUDA 12.4 Toolkit if you plan to use GPU acceleration.
eSpeak NG is required for phoneme-based operations.
-
Download the Installer
eSpeak NG X64 Installer
-
Run the Installer
Follow the on-screen instructions.
-
Set Environment Variables
Add the following environment variables:
PHONEMIZER_ESPEAK_LIBRARY → C:\Program Files\eSpeak NG\libespeak-ng.dll
PHONEMIZER_ESPEAK_PATH → C:\Program Files (x86)\eSpeak\command_line\espeak.exe
(Right-click "This PC" → Properties → Advanced system settings → Environment Variables)
-
Verify Installation
Open Command Prompt and run:
espeak-ng --version
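The environment variables from the previous step can also be checked from Python before launching the app. The following is a minimal sketch and not part of the project: the helper name `check_espeak_env` is hypothetical, and it only tests that each variable is set and points to an existing file.

```python
import os
from pathlib import Path

# Variable names taken from the "Set Environment Variables" step.
ESPEAK_VARS = ("PHONEMIZER_ESPEAK_LIBRARY", "PHONEMIZER_ESPEAK_PATH")


def check_espeak_env(environ=None):
    """Return a dict mapping each problematic variable to a short reason."""
    environ = os.environ if environ is None else environ
    problems = {}
    for name in ESPEAK_VARS:
        value = environ.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).is_file():
            problems[name] = f"file not found: {value}"
    return problems


if __name__ == "__main__":
    issues = check_espeak_env()
    for name, reason in issues.items():
        print(f"{name}: {reason}")
    if not issues:
        print("eSpeak NG environment variables look fine.")
```

An empty result means both variables are set and their targets exist; it does not prove that the library itself loads correctly.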
-
Download Wheels
- DeepSpeed (for Python 3.12.7, CUDA 12.4): DeepSpeed Wheel
- lxml (for Python 3.12): lxml Release
-
Install the Wheels
Activate your virtual environment and run:
pip install path\to\deepspeed-0.11.2+cuda124-cp312-cp312-win_amd64.whl
pip install path\to\lxml-5.3.0-cp312-cp312-win_amd64.whl
-
Verify Installation
deepspeed --version
pip show lxml
espeak-ng --version
-
Launch the App
python main.py
-
Select a Mode
- Single PDF: Choose a specific PDF file and extract its text.
- Batch PDFs: Select a folder with multiple PDFs (the app processes all PDFs, preserving folder structure).
- Skip Extraction: Use pre-extracted text files organized in folders.
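To illustrate what "preserving folder structure" in batch mode can look like, here is a small sketch that maps each PDF under an input folder to a mirrored output folder. It is an assumption about the layout, not the app's actual code; `plan_batch_outputs` is a hypothetical helper.

```python
from pathlib import Path


def plan_batch_outputs(input_root, output_root):
    """Map every PDF under input_root to an output folder mirroring its relative path."""
    input_root = Path(input_root)
    output_root = Path(output_root)
    plan = {}
    # rglob walks subfolders; the "*.pdf" pattern is case-sensitive on Linux.
    for pdf_path in sorted(input_root.rglob("*.pdf")):
        relative = pdf_path.relative_to(input_root)
        # e.g. <input_root>/scifi/dune.pdf -> <output_root>/scifi/dune
        plan[pdf_path] = output_root / relative.with_suffix("")
    return plan
```

Each PDF gets its own folder named after the file, so extracted chapters and audio for different books never collide.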
-
Extract Text (for Single/Batch Modes)
- The app will split the text into chapters if a Table of Contents (TOC) is available; otherwise, it extracts the entire document.
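The chapter split can be pictured as turning TOC entries into page ranges. The sketch below assumes entries shaped like the `[level, title, page]` lists that PyMuPDF's `Document.get_toc()` returns (1-based pages); the function name and the "Full Document" fallback label are hypothetical, and the app's real logic may differ.

```python
def chapter_page_ranges(toc, page_count):
    """Turn TOC entries into (title, first_page, last_page) tuples, 1-based inclusive."""
    # Keep only top-level entries so each chapter maps to one range.
    chapters = [(title, page) for level, title, page, *_ in toc if level == 1]
    if not chapters:
        # No usable TOC: fall back to the entire document.
        return [("Full Document", 1, page_count)]
    ranges = []
    for index, (title, start) in enumerate(chapters):
        if index + 1 < len(chapters):
            # A chapter ends on the page before the next one starts.
            end = max(start, chapters[index + 1][1] - 1)
        else:
            end = page_count
        ranges.append((title, start, end))
    return ranges
```

Note that in this sketch any pages before the first top-level entry (front matter) are left out.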
-
Configure Kokoro TTS Settings
- Select your Kokoro model (the updated package handles this automatically).
- Choose a .pt voicepack (e.g., voices/af_sarah.pt). The app automatically derives the language code from the first letter of the voice name.
- Adjust the chunk size for your system’s VRAM.
- Choose your desired output format (.wav or .mp3).
-
Generate Audiobook
- Click Start Process.
- Monitor progress via logs, progress bars, and estimated time.
- Pause/Resume or Cancel the process as needed.
-
Enjoy Your Audiobook
- Open the output folder to find your generated audio files.
- Built on PyMuPDF for efficient text parsing.
- Cleans headers, footers, page numbers, and unwanted elements.
- Splits text based on chapters (if TOC is available) or extracts the entire document.
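As a rough illustration of header, footer, and page-number cleanup, the sketch below works on plain page strings rather than on PyMuPDF objects. It is a simplified heuristic chosen for this example, not the project's implementation: it drops lines that are bare page numbers and any line that appears as the first or last line on most pages. The name `strip_running_lines` and the `min_share` threshold are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "12" or "Page 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)


def strip_running_lines(pages, min_share=0.6):
    """Drop page numbers and first/last lines that repeat on most pages."""
    edge_counts = Counter()
    split_pages = [page.splitlines() for page in pages]
    for lines in split_pages:
        non_empty = [line.strip() for line in lines if line.strip()]
        if non_empty:
            # Count the first and last non-empty line once per page.
            edge_counts.update({non_empty[0], non_empty[-1]})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, count in edge_counts.items() if count >= threshold}
    cleaned = []
    for lines in split_pages:
        kept = [
            line for line in lines
            if line.strip() not in repeated and not PAGE_NUMBER.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

A heuristic like this also removes body lines that consist only of a number, which is one reason a real extractor would typically use layout information as well.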
- Text Normalization & Phonemization
- Advanced handling of dates, times, currency, etc.
- Token-Based Splitting
- Splits text into chunks (<510 tokens) to meet model constraints and joins chunked audio into the final output.
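The chunking step can be sketched as greedy sentence packing under the 510-token limit. This is an illustration under a loud assumption: whitespace-separated words stand in for the model's real tokens (which are counted after phonemization), so the numbers will not match Kokoro's own count. The function names are hypothetical.

```python
import re

MAX_TOKENS = 510  # limit quoted above; every chunk must stay below it


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for the model's tokens.
    return len(text.split())


def split_into_chunks(text, max_tokens=MAX_TOKENS):
    """Greedily pack whole sentences into chunks of fewer than max_tokens tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    window = max_tokens - 1  # largest piece that still fits in one chunk
    for sentence in sentences:
        words = sentence.split()
        # A single over-long sentence is broken into fixed-size word windows.
        pieces = [words[i:i + window] for i in range(0, len(words), window)]
        for piece in pieces:
            if current and current_len + len(piece) >= max_tokens:
                chunks.append(" ".join(current))
                current, current_len = [], 0
            current.extend(piece)
            current_len += len(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting at sentence boundaries keeps pauses natural when the per-chunk audio is later joined into the final output.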
- Voicepacks (.pt)
- Each voicepack provides a reference embedding for a given voice.
- The app derives the language code from the first letter of the voice identifier (e.g., "af_sarah" → a).
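The first-letter rule is simple enough to show directly. This is a minimal sketch of the rule as described above; `language_code_from_voice` is a hypothetical name, not a function from the app or from the Kokoro package.

```python
from pathlib import Path


def language_code_from_voice(voice):
    """Return the first letter of the voice name, e.g. 'voices/af_sarah.pt' -> 'a'."""
    name = Path(voice).stem  # drops the folder and the .pt suffix
    if not name:
        raise ValueError(f"cannot derive a language code from {voice!r}")
    return name[0].lower()
```

It accepts either a bare identifier such as "af_sarah" or a path such as voices/af_sarah.pt.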
- Chunk Size
- Adjust according to your GPU’s memory.
- Device Selection
- Switch to CPU mode if a compatible GPU is unavailable.
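Device selection with a CPU fallback might look like the sketch below. It relies only on `torch.cuda.is_available()`, a standard PyTorch call; the `pick_device` helper and its `prefer_gpu` flag are hypothetical and not taken from the app.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

On a low-VRAM GPU, combining the "cuda" device with a smaller chunk size is usually worth trying before falling back to the CPU.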
We welcome contributions!
- Fork the repository, create a new branch, and submit a pull request.
- Report bugs or suggest features via Issues.
This project is released under the MIT License.
Enjoy converting your PDFs into immersive audiobooks powered by Kokoro v1.0 TTS!