openai/evals

evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Builder

OpenAI

openai • ai-lab

Stars

18,100

Using upstream star count

Forks

2,909

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Jan 23, 2023

Project creation date

README Summary

Evals is OpenAI's framework for evaluating Large Language Models (LLMs) and LLM systems through standardized benchmarks and metrics. It provides an open-source registry of evaluation benchmarks that can be used to assess model performance across various tasks. The framework enables researchers and developers to create, run, and share evaluations for language models in a consistent and reproducible manner.

AI Dev Skills

Unmapped

Large Language Model Evaluation, Benchmark Design and Implementation, Model Performance Assessment, LLM System Testing, Evaluation Metrics and Scoring, Prompt Engineering, Model Comparison and Analysis, AI Safety Evaluation, Capability Assessment

Tags

Large Language Model Evaluation, Benchmark Design and Implementation, Model Performance Assessment, LLM System Testing, Evaluation Metrics and Scoring, Prompt Engineering, Model Comparison and Analysis, AI Safety Evaluation, Capability Assessment, Text, LLM Performance Benchmarking, AI Safety, Statistical Analysis of AI Systems, Model Comparison Methodologies, Model Capability Assessment, Evaluation Framework Design, Capability Gap Analysis, Evaluation Dataset Creation, Cloud API, Model Selection and Ranking, Responsible AI, AI Benchmarking, Research Performance Comparison, Benchmark Development, Model Governance, Self-hosted, AI System Validation, LLM Evaluation, Python

Taxonomy

Recent Activity

Updated 5 months ago

7 Days

0

30 Days

0

90 Days

0

Quality

Quality: high
Maturity: production

Categories

Dev Tools & Automation (Primary), Learning Resources, Evals & Benchmarking, Safety & Alignment, Data Science & Analytics, Search & Knowledge, Other AI / ML, Foundation Models, AI Agents, Model Training

PM Skills

Developer Platform

Languages

Python 100.0%

Timeline

Project created: Jan 23, 2023
Forked: Mar 12, 2026
Your last push: 5 months ago
Upstream last push: 7 days ago
Tracked since: Nov 3, 2025
