Reporium

allenai/olmocr

olmocr

Toolkit for linearizing PDFs for LLM datasets/training

View on GitHub • Upstream: allenai/olmocr

Builder

Allen AI

allenai • ai-lab

Stars

17,360

Using upstream star count

Forks

1,392

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Sep 17, 2024

Project creation date

README Summary

README header (logo and badges only): project logo, GitHub License badge, GitHub release badge (https://github.com/allenai/olmocr/releases).

AI Dev Skills

Unmapped

Data Pipeline Engineering • Document Layout Analysis • Language Model Training Data Preparation • Optical Character Recognition • PDF Processing and Extraction • Text Preprocessing for LLMs

Tags

Data Pipeline Engineering • Document Layout Analysis • Language Model Training Data Preparation • Optical Character Recognition • PDF Processing and Extraction • Text Preprocessing for LLMs • AWS • Benchmarking • DeepSeek • Docker • Evals • Fine-Tuning • Forked • GPU / CUDA • GRPO • HuggingFace • Inference • KV Cache • LLM Serving • Mistral • Multimodal AI • OpenAI • PyTorch • Python • Qwen • Research / Papers • SGLang • Statistics • Synthetic Data • vLLM

Taxonomy

AI Trends

Large Language Model Training • Document AI • Multimodal Learning

category

Inference & Serving • Foundation Models • Model Training • Evals & Benchmarking • MLOps & Infrastructure • Cloud & Platforms • Learning Resources • Data Science & Analytics

Deployment Context

Self-hosted • Cloud API • On-premise

Industries

Legal Tech • Academic Research • Document Management • Publishing • Financial Services

Modalities

Text • Image • Multimodal

Skill Areas

Document Layout Analysis • Optical Character Recognition • Text Preprocessing for LLMs • PDF Processing and Extraction • Data Pipeline Engineering • Language Model Training Data Preparation

tag

AWS • Benchmarking • DeepSeek • Docker • Evals • Fine-Tuning • Forked • GPU / CUDA • GRPO • HuggingFace • Inference • KV Cache • LLM Serving • Mistral • Multimodal AI • OpenAI • PyTorch • Python • Qwen • Research / Papers • SGLang • Statistics • Synthetic Data • vLLM

Use Cases

LLM Training Data Preparation • Document Dataset Creation • PDF Text Extraction for AI Models • Academic Paper Processing • Legal Document Digitization

Recent Activity

Updated 2 months ago

7 Days

0

30 Days

0

90 Days

11

Bump version to v0.4.27 for release

Jake Poznanski • Mar 12, 2026

1e139a5

Version bump

Jake Poznanski • Mar 12, 2026

3c0ff52

Formatting fixes

Jake Poznanski • Mar 12, 2026

19a1b90

Quality

Quality: medium
Maturity: research

Categories

Inference & Serving (Primary) • Evals & Benchmarking • MLOps & Infrastructure • Cloud & Platforms • Learning Resources • Data Science & Analytics • Foundation Models • Model Training • Multimodal AI • Search & Knowledge • Other AI / ML

PM Skills

Cost & Efficiency • User Experience • Scale & Reliability • Data & Evaluation

Languages

Python 100.0%

Timeline

Project created: Sep 17, 2024
Forked: Mar 16, 2026
Your last push: 2 months ago
Upstream last push: 2 months ago
Tracked since: Mar 14, 2026

Similar Repos

pgvector cosine similarity · $0
