allenai/olmocr
olmocr
Toolkit for linearizing PDFs for LLM datasets/training
Builder

Allen AI
allenai • ai-lab
Stars
17,101
Using upstream star count
Forks
1,371
Using upstream fork count
Open Issues
0
Activity Score
0/100
0 commits in 30d
Created
Sep 17, 2024
Project creation date
README Summary
olmOCR is a Python toolkit that converts PDF documents into linearized text suitable for large language model training and dataset creation. It extracts and structures PDF content so that the semantic meaning is preserved and the result can be consumed by LLMs.
AI Dev Skills
Unmapped
Document Layout Analysis, Optical Character Recognition, Text Preprocessing for LLMs, PDF Processing and Extraction, Data Pipeline Engineering, Language Model Training Data Preparation
Tags
Document Layout Analysis, Optical Character Recognition, Text Preprocessing for LLMs, PDF Processing and Extraction, Data Pipeline Engineering, Language Model Training Data Preparation, Self-hosted, Document Management, Text, Publishing, Large Language Model Training, Financial Services, Academic Paper Processing, Document AI, LLM Training Data Preparation, Image, On-premise, Legal Tech, Academic Research, Multimodal, PDF Text Extraction for AI Models, Legal Document Digitization, Multimodal Learning, Document Dataset Creation, Cloud API, Python
Recent Activity
Updated 1 month ago
7 Days
0
30 Days
0
90 Days
0
Quality
- Quality
- medium
- Maturity
- research
Categories
MLOps & Infrastructure (Primary), Learning Resources, ML Platform & Infrastructure, Finance & Legal, Multimodal AI, Search & Knowledge, Other AI / ML, Model Training, Foundation Models
PM Skills
Scale & Reliability, Developer Platform
Languages
Python 100.0%
Timeline
- Project created
- Sep 17, 2024
- Forked
- Mar 16, 2026
- Your last push
- 1 month ago
- Upstream last push
- 19 days ago
- Tracked since
- Mar 14, 2026
Similar Repos
pgvector cosine similarity · $0