A Python package for converting PDFs to Markdown while extracting images and tables, and for generating text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.

shoryasethia/markdrop

Markdrop

Features

  • PDF to Markdown conversion with formatting preservation using Docling
  • Automatic image extraction with quality preservation using XRef Id
  • Table detection using Microsoft's Table Transformer
  • PDF URL support for core functionalities
  • AI-powered image and table descriptions using multiple LLM providers
  • Interactive HTML output with downloadable Excel tables
  • Customizable image resolution and UI elements
  • Comprehensive logging system
  • Support for other files
  • Streamlit/web interface

Installation

pip install markdrop  

Python Package Index (PyPI) Page: https://pypi.org/project/markdrop

Quick Start

Basic PDF Processing

from markdrop import extract_images, make_markdown, extract_tables_from_pdf

source_pdf = 'url/or/path/to/pdf/file'    # Replace with your local PDF file path or a URL
output_dir = 'data/output'                 # Replace with desired output directory's path

make_markdown(source_pdf, output_dir)
extract_images(source_pdf, output_dir)
extract_tables_from_pdf(source_pdf, output_dir=output_dir)
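
The `source_pdf` argument above may be either a local path or a URL. As a minimal sketch, a caller could tell the two apart before invoking these functions as follows. `is_url` is a hypothetical helper, not part of markdrop, and markdrop's own detection logic may differ.

```python
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Return True if `source` looks like an http(s) URL (hypothetical helper)."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for candidate in ("https://example.com/paper.pdf", "data/input/paper.pdf"):
    kind = "URL" if is_url(candidate) else "local path"
    print(f"{candidate} -> {kind}")
```

A check like this lets a script fail early on a mistyped path instead of waiting for the conversion step to report it.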

Advanced PDF Processing with MarkDrop

from markdrop import markdrop, MarkDropConfig, add_downloadable_tables
from pathlib import Path
import logging

# Configure processing options
config = MarkDropConfig(
    image_resolution_scale=2.0,        # Scale factor for image resolution
    download_button_color='#444444',   # Color for download buttons in HTML
    log_level=logging.INFO,           # Logging detail level
    log_dir='logs',                   # Directory for log files
    excel_dir='markdropped-excel-tables'  # Directory for Excel table exports
)

# Process PDF document
input_doc_path = "path/to/input.pdf"
output_dir = Path('output_directory')

# Convert PDF and generate HTML with images and tables
html_path = markdrop(input_doc_path, output_dir, config)

# Add interactive table download functionality
downloadable_html = add_downloadable_tables(html_path, config)

AI-Powered Content Analysis

from markdrop import setup_keys, process_markdown, ProcessorConfig, AIProvider, logger
from pathlib import Path

# Set up API keys for AI providers
setup_keys(key='gemini')  # or setup_keys(key='openai')

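# Placeholder prompts so this example runs as written; they are assumptions,
# not markdrop's built-in defaults. Replace them with your own prompts.
DEFAULT_IMAGE_PROMPT = "Describe this image in detail."
DEFAULT_TABLE_PROMPT = "Summarize the contents of this table."

# Configure AI processing options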
config = ProcessorConfig(
    input_path="path/to/markdown/file.md",    # Input markdown file path
    output_dir=Path("output_directory"),      # Output directory
    ai_provider=AIProvider.GEMINI,            # AI provider (GEMINI or OPENAI)
    remove_images=False,                      # Keep or remove original images
    remove_tables=False,                      # Keep or remove original tables
    table_descriptions=True,                  # Generate table descriptions
    image_descriptions=True,                  # Generate image descriptions
    max_retries=3,                           # Number of API call retries
    retry_delay=2,                           # Delay between retries in seconds
    gemini_model_name="gemini-1.5-flash",    # Gemini model for images
    gemini_text_model_name="gemini-pro",     # Gemini model for text
    image_prompt=DEFAULT_IMAGE_PROMPT,        # Custom prompt for image analysis
    table_prompt=DEFAULT_TABLE_PROMPT         # Custom prompt for table analysis
)

# Process markdown with AI descriptions
output_path = process_markdown(config)
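
Before spending API calls, it can be useful to see which images a Markdown file actually references. The sketch below is an illustration only: `find_image_references` is a hypothetical helper, it handles inline `![alt](path)` syntax only, and it does not reflect how markdrop parses Markdown internally.

```python
import re

# Matches inline Markdown images: ![alt text](path "optional title")
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def find_image_references(markdown_text: str) -> list:
    """Return (alt_text, path) pairs for each inline image in the text."""
    return IMAGE_PATTERN.findall(markdown_text)


sample = (
    "# Report\n\n"
    "![Figure 1](images/fig1.png)\n\n"
    "Some text.\n\n"
    '![](images/fig2.png "Chart")\n'
)
for alt, path in find_image_references(sample):
    print(f"{path!r} (alt: {alt!r})")
```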

Image Description Generation

from markdrop import generate_descriptions

prompt = "Give textual highly detailed descriptions from this image ONLY, nothing else."
input_path = 'path/to/img_file/or/dir'
output_dir = 'data/output'
llm_clients = ['gemini', 'llama-vision']  # Available: ['qwen', 'gemini', 'openai', 'llama-vision', 'molmo', 'pixtral']

generate_descriptions(
    input_path=input_path,
    output_dir=output_dir,
    prompt=prompt,
    llm_client=llm_clients
)
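
Since the client names are plain strings, a typo would only surface once processing starts. A small sketch of validating them up front, using the names listed in the comment above; `validate_clients` is a hypothetical helper and not part of markdrop.

```python
# Client names copied from the "Available" comment in the example above.
AVAILABLE_CLIENTS = ["qwen", "gemini", "openai", "llama-vision", "molmo", "pixtral"]


def validate_clients(requested: list) -> list:
    """Return `requested` unchanged, or raise ValueError naming unknown clients."""
    unknown = [name for name in requested if name not in AVAILABLE_CLIENTS]
    if unknown:
        raise ValueError(f"Unknown LLM client(s): {', '.join(unknown)}")
    return requested


print(validate_clients(["gemini", "llama-vision"]))
```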

API Reference

Core Functions

markdrop(input_doc_path: str, output_dir: str, config: Optional[MarkDropConfig] = None) -> Path

Converts PDF to markdown and HTML with enhanced features.

Parameters:

  • input_doc_path (str): Path to input PDF file
  • output_dir (str): Output directory path
  • config (MarkDropConfig, optional): Configuration options for processing

add_downloadable_tables(html_path: Path, config: Optional[MarkDropConfig] = None) -> Path

Adds interactive table download functionality to HTML output.

Parameters:

  • html_path (Path): Path to HTML file
  • config (MarkDropConfig, optional): Configuration options
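
To illustrate the general idea of turning HTML tables into downloadable files, here is a standard-library sketch that reads `<table>` elements and serializes one as CSV. It is not markdrop's implementation: CSV is swapped in for Excel to avoid third-party dependencies, nested tables are not handled, and `TableExtractor` and `table_to_csv` are hypothetical names.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(rows):
    """Serialize a list of rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


html = "<table><tr><th>Name</th><th>Score</th></tr><tr><td>A</td><td>1</td></tr></table>"
extractor = TableExtractor()
extractor.feed(html)
print(table_to_csv(extractor.tables[0]))
```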

Configuration Classes

MarkDropConfig

Configuration for PDF processing:

  • image_resolution_scale (float): Scale factor for image resolution (default: 2.0)
  • download_button_color (str): HTML color code for download buttons (default: '#444444')
  • log_level (int): Logging level (default: logging.INFO)
  • log_dir (str): Directory for log files (default: 'logs')
  • excel_dir (str): Directory for Excel table exports (default: 'markdropped-excel-tables')
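
The fields and defaults listed above can be pictured as a simple dataclass. The class below is a hypothetical stand-in written only from that list; the real `MarkDropConfig` may be defined differently.

```python
import logging
from dataclasses import dataclass


@dataclass
class MarkDropConfigSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    image_resolution_scale: float = 2.0
    download_button_color: str = "#444444"
    log_level: int = logging.INFO
    log_dir: str = "logs"
    excel_dir: str = "markdropped-excel-tables"


# Override only the values that differ from the documented defaults.
config = MarkDropConfigSketch(image_resolution_scale=3.0, log_level=logging.DEBUG)
print(config)
```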

ProcessorConfig

Configuration for AI processing:

  • input_path (str): Path to markdown file
  • output_dir (str): Output directory path
  • ai_provider (AIProvider): AI provider selection (GEMINI or OPENAI)
  • remove_images (bool): Whether to remove original images
  • remove_tables (bool): Whether to remove original tables
  • table_descriptions (bool): Generate table descriptions
  • image_descriptions (bool): Generate image descriptions
  • max_retries (int): Maximum API call retries
  • retry_delay (int): Delay between retries in seconds
  • gemini_model_name (str): Gemini model for image processing
  • gemini_text_model_name (str): Gemini model for text processing
  • image_prompt (str): Custom prompt for image analysis
  • table_prompt (str): Custom prompt for table analysis
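
To make `max_retries` and `retry_delay` concrete, here is a generic retry loop. It assumes that `max_retries` counts the retries made after the first attempt and that the delay is fixed; markdrop's actual retry behavior is not documented here and may differ. `call_with_retries` and `flaky` are hypothetical names.

```python
import time


def call_with_retries(func, max_retries=3, retry_delay=2):
    """Call `func`, retrying up to `max_retries` times after a failure."""
    last_error = None
    for attempt in range(max_retries + 1):  # first try plus `max_retries` retries
        try:
            return func()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < max_retries:
                time.sleep(retry_delay)
    raise last_error


attempts = []


def flaky():
    """Fail twice, then succeed, to simulate a transient API error."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


print(call_with_retries(flaky, max_retries=3, retry_delay=0))
```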

Legacy Functions

make_markdown(source: str, output_dir: str, verbose: bool = False)

Legacy function for basic PDF to markdown conversion.

Parameters:

  • source (str): Path to input PDF or URL
  • output_dir (str): Output directory path
  • verbose (bool): Enable detailed logging

extract_images(source: str, output_dir: str, verbose: bool = False)

Legacy function for basic image extraction.

Parameters:

  • source (str): Path to input PDF or URL
  • output_dir (str): Output directory path
  • verbose (bool): Enable detailed logging

extract_tables_from_pdf(pdf_path: str, **kwargs)

Legacy function for basic table extraction.

Parameters:

  • pdf_path (str): Path to input PDF or URL
  • start_page (int, optional): Starting page number
  • end_page (int, optional): Ending page number
  • threshold (float, optional): Detection confidence threshold
  • output_dir (str): Output directory path
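
The `start_page`, `end_page`, and `threshold` parameters can be read as filters over detected tables. The sketch below shows one plausible interpretation, assuming 1-based inclusive page numbers and a confidence score between 0 and 1. `Detection`, `filter_detections`, and the default threshold of 0.5 are hypothetical and do not come from markdrop.

```python
from typing import List, NamedTuple, Optional


class Detection(NamedTuple):
    """Hypothetical record for one detected table."""

    page: int     # 1-based page number (assumed convention)
    score: float  # detector confidence between 0 and 1


def filter_detections(
    detections: List[Detection],
    threshold: float = 0.5,  # assumed default, for illustration only
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> List[Detection]:
    """Keep detections at or above `threshold` within the inclusive page range."""
    kept = []
    for det in detections:
        if det.score < threshold:
            continue
        if start_page is not None and det.page < start_page:
            continue
        if end_page is not None and det.page > end_page:
            continue
        kept.append(det)
    return kept


found = [Detection(1, 0.95), Detection(2, 0.40), Detection(5, 0.88)]
print(filter_detections(found, threshold=0.8, start_page=1, end_page=3))
```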

Quick Start for Legacy Functions

See run.py for an example that uses these functions.

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

  1. Clone the repository:
git clone https://github.com/shoryasethia/markdrop.git
cd markdrop
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install development dependencies:
pip install -r requirements.txt
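
If packages seem to land in the wrong place, a quick way to confirm that the virtual environment created above is active is to compare `sys.prefix` with `sys.base_prefix`, which differ inside an environment created by `venv`. `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when running inside a venv-created virtual environment."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```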

Project Structure

markdrop/  
├── LICENSE  
├── README.md  
├── CONTRIBUTING.md  
├── CHANGELOG.md  
├── requirements.txt  
├── setup.py  
└── markdrop/ 
    ├── __init__.py 
    ├── src/
    │   └── markdrop-logo.png
    ├── main.py
    ├── process.py
    ├── api_setup.py
    ├── parse.py
    ├── utils.py  
    ├── helper.py
    ├── ignore_warnings.py
    ├── run.py
    └── models/
        ├── __init__.py
        ├── .env
        ├── img_descriptions.py
        ├── logger.py
        ├── model_loader.py
        ├── responder.py
        └── setup_keys.py  

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

See CHANGELOG.md for version history.

Code of Conduct

Please note that this project follows our Code of Conduct.
