document-analysis

Here are 106 public repositories matching this topic...

opendatalab / MinerU

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDFs into Markdown and JSON formats.

python pdf parser ocr pdf-converter extract-data document-analysis pdf-parser layout-analysis ai4science pdf-extractor-rag pdf-extractor-llm pdf-extractor-pretrain

Updated Mar 27, 2025
Python

UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

pdf csharp pdfbox netstandard pdf-files pdf-document pdf-generation hocr document-analysis pdf-extractor alto-xml page-xml layout-analysis pdf-document-processor

Updated Mar 26, 2025
C#

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Updated Dec 27, 2024
C++

tstanislawek / awesome-document-understanding

A curated list of resources on the topic of Document Understanding (DU)

Updated Jun 2, 2023

DocumindHQ / documind

Open-source platform for extracting structured data from documents using AI.

open-source pdf parser ocr ai pdf-converter developer-tools extract-data document-analysis pdf-extractor document-extraction llms pdf-extractor-llm

Updated Feb 21, 2025
JavaScript

Yuliang-Liu / Curve-Text-Detector

This repository provides training and testing code, the dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

deep-learning object-detection document-analysis scene-text

Updated Jul 20, 2020
Jupyter Notebook

wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

document-analysis graph-convolutional-network graph-learning graph-neural-networks document-understanding key-information-extraction

Updated Jul 25, 2024
Python

jpWang / LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

nlp information-extraction document-analysis document-understanding multilingual-models document-ai multimodal-pre-trained-model

Updated Oct 31, 2022
Python

CybercentreCanada / assemblyline

AssemblyLine 4: File triage and malware analysis

framework incident-response malware python3 cybersecurity cert infosec malware-analyzer malware-analysis malware-research automation-framework cyber-security file-analysis document-analysis security-automation security-tools malware-detection assemblyline security-automation-framework

Updated Mar 28, 2025
Python

lazyFrogLOL / llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

nlp ocr chunking document-analysis pdf-parser pdfparser rag llm text-chunking

Updated Aug 6, 2024
Python

pandora-analysis / pandora

Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results

infosec document-analysis malware-detection document-analyzing

Updated Mar 27, 2025
Python

ispras / dedoc

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser; ...)

html pdf ocr table-of-contents excel html-parser docx documents doc scanned-documents txt document-analysis odt pdf-parser table-recognition docx-parser document-content-extraction logical-structure-extraction

Updated Feb 14, 2025
Python

masyagin1998 / robin

RObust document image BINarization

python opencv ocr computer-vision deep-learning keras neural-networks document-analysis u-net document-binarization

Updated Aug 2, 2024
Python

chriswolfvision / local_adaptive_binarization

Local adaptive image binarization

computer-vision document-analysis document-binarization

Updated Mar 5, 2023
C++

mirabdullahyaser / Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit

Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.

natural-language-processing artificial-intelligence question-answering chat-application document-analysis streamlit gpt-3 large-language-models generative-ai langchain openai-chatgpt retrieval-augmented-generation

Updated Jul 4, 2024
Python

anisha2102 / docvqa

Document Visual Question Answering

computer-vision deep-learning document-analysis visual-question-answering

Updated Jul 30, 2020
Python

ppaanngggg / yolo-doclaynet

YOLO models trained on DocLayNet - power your document intelligence with layout analysis

yolo document-analysis layout-analysis ultralytics yolov8 doclaynet

Updated Mar 12, 2025
Python

aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding

ocr document-analysis amazon-textract huggingface-transformers

Updated Dec 14, 2024
Python

monniert / docExtractor

(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper

pytorch segmentation historical-data document-analysis

Updated May 25, 2023
Python

Xyntopia / pydoxtools

Effortlessly extract information from unstructured data with this library, which uses advanced AI techniques. Compose AI into customizable pipelines over diverse sources for your projects.

python nlp pdf information-retrieval extraction document-analysis document-extraction llm chatgpt

Updated Sep 5, 2024
Python
