allenai/olmocr

olmocr

Toolkit for linearizing PDFs for LLM datasets/training

Builder

Allen AI

allenai • ai-lab

Stars

17,101

Using upstream star count

Forks

1,371

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Sep 17, 2024

Project creation date

README Summary

olmOCR is a Python toolkit that converts PDF documents into linearized text suitable for large language model training and dataset creation. It extracts and structures PDF content in a way that preserves semantic meaning while keeping the output consumable by LLMs.

AI Dev Skills

Unmapped

Document Layout Analysis, Optical Character Recognition, Text Preprocessing for LLMs, PDF Processing and Extraction, Data Pipeline Engineering, Language Model Training Data Preparation

Tags

Document Layout Analysis, Optical Character Recognition, Text Preprocessing for LLMs, PDF Processing and Extraction, Data Pipeline Engineering, Language Model Training Data Preparation, Self-hosted, Document Management, Text, Publishing, Large Language Model Training, Financial Services, Academic Paper Processing, Document AI, LLM Training Data Preparation, Image, On-premise, Legal Tech, Academic Research, Multimodal, PDF Text Extraction for AI Models, Legal Document Digitization, Multimodal Learning, Document Dataset Creation, Cloud API, Python

Taxonomy

Recent Activity

Updated 1 month ago

7 Days

0

30 Days

0

90 Days

0

Quality

Quality
medium
Maturity
research

Categories

MLOps & Infrastructure (Primary), Learning Resources, ML Platform & Infrastructure, Finance & Legal, Multimodal AI, Search & Knowledge, Other AI / ML, Model Training, Foundation Models

PM Skills

Scale & Reliability, Developer Platform

Languages

Python 100.0%

Timeline

Project created
Sep 17, 2024
Forked
Mar 16, 2026
Your last push
1 month ago
Upstream last push
19 days ago
Tracked since
Mar 14, 2026
