Reporium

EleutherAI/lm-evaluation-harness

lm-evaluation-harness

A framework for few-shot evaluation of language models.

Builder

EleutherAI • ai-lab

Stars

12,748

Using upstream star count

Forks

3,301

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Aug 28, 2020

Project creation date

README Summary

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10256836.svg)](https://doi.org/10.5281/zenodo.10256836)

AI Dev Skills

Unmapped

Benchmark Design, Few-shot Learning, Language Model Evaluation, Model Comparison, Model Performance Assessment, Natural Language Processing, Prompt Engineering, Standardized Testing Frameworks, Statistical Analysis

Tags

Benchmark Design, Few-shot Learning, Language Model Evaluation, Model Comparison, Model Performance Assessment, Natural Language Processing, Prompt Engineering, Standardized Testing Frameworks, Statistical Analysis, Anthropic / Claude, Backend, Batching, Benchmarking, C++, CLI Tool, Caching, Claude, Data Science, DeepSpeed, Docker, Evals, FSDP, Forked, GPT, GPT4All, GPU / CUDA, HuggingFace, Inference, Jupyter, KV Cache, LLM Serving, LM Eval Harness, Large Language Models, LoRA / PEFT, MMLU, Mistral, Model Optimization, Multimodal AI, ONNX, OpenAI, Pandas, Planning / CoT, PyTorch, Python, Quantization, Qwen, Research / Papers, SGLang, Transformers, Tutorial, Weights & Biases, llama.cpp, vLLM

Taxonomy

AI Trends

Language Model Evaluation, AI Safety, Responsible AI, Model Interpretability, Benchmark Standardization

category

Foundation Models, AI Agents, Model Training, Evals & Benchmarking, Observability & Monitoring, Inference & Serving, MLOps & Infrastructure, Dev Tools & Automation, Learning Resources, Data Science & Analytics

Deployment Context

Self-hosted, Cloud API, Research Computing

Modalities

Text

Skill Areas

Language Model Evaluation, Few-shot Learning, Benchmark Design, Model Performance Assessment, Statistical Analysis, Natural Language Processing, Prompt Engineering, Model Comparison, Standardized Testing Frameworks

tag

Anthropic / Claude, Backend, Batching, Benchmarking, C++, CLI Tool, Caching, Claude, Data Science, DeepSpeed, Docker, Evals, FSDP, Forked, GPT, GPT4All, GPU / CUDA, HuggingFace, Inference, Jupyter, KV Cache, LLM Serving, LM Eval Harness, Large Language Models, LoRA / PEFT, MMLU, Mistral, Model Optimization, Multimodal AI, ONNX, OpenAI, Pandas, Planning / CoT, Prompt Engineering, PyTorch, Python, Quantization, Qwen, Research / Papers, SGLang, Transformers, Tutorial, Weights & Biases, llama.cpp, vLLM

Use Cases

Language Model Benchmarking, Model Performance Comparison, Few-shot Task Evaluation, Research Validation, Model Selection, Capability Assessment

Recent Activity

Updated 2 months ago

7 Days

0

30 Days

0

90 Days

8

Fix correctness issues in Arabic normalization and prompt loading (#3589)

Rin • Mar 16, 2026

7507703

Skip caching None responses in async generation path (#3633)

Joshua Swanson • Mar 16, 2026

d47ed3e

replace all CohereForAI with CohereLabs (#3631)

Júlia Falcão • Mar 16, 2026

6e23116

Quality

Quality: high
Maturity: production

Categories

Other AI / ML (Primary), Foundation Models, AI Agents, Model Training, Evals & Benchmarking, Observability & Monitoring, Inference & Serving, Safety & Alignment, Data Science & Analytics, Multimodal AI, Edge & Mobile AI, Search & Knowledge, MLOps & Infrastructure, Dev Tools & Automation, Learning Resources

PM Skills

Cost & Efficiency, User Experience, Scale & Reliability, Data & Evaluation, Developer Platform, AI-Native Architecture

Languages

Python 100.0%

Timeline

Project created: Aug 28, 2020
Forked: Mar 13, 2026
Your last push: 2 months ago
Upstream last push: 23 days ago
Tracked since: Mar 17, 2026
