Reporium

openai/evals

evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Upstream: openai/evals

Builder

OpenAI

openai • ai-lab

Stars

18,561

Using upstream star count

Forks

2,971

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Jan 23, 2023

Project creation date

README Summary

> You can now configure and run Evals directly in the OpenAI Dashboard. [Get started →](https://platform.openai.com/docs/guides/evals)


AI Dev Skills

Unmapped

AI Safety Evaluation, Benchmark Design and Implementation, Capability Assessment, Evaluation Metrics and Scoring, Large Language Model Evaluation, LLM System Testing, Model Comparison and Analysis, Model Performance Assessment, Prompt Engineering

Tags

AI Safety Evaluation, Benchmark Design and Implementation, Capability Assessment, Evaluation Metrics and Scoring, Large Language Model Evaluation, LLM System Testing, Model Comparison and Analysis, Model Performance Assessment, Prompt Engineering, Data Science, Database, Evals, Forked, OpenAI, Python, Tutorial, Weights & Biases

Taxonomy

AI Trends

AI Safety, Large Language Models, Model Evaluation and Testing, Responsible AI Development, AI Benchmarking

Category

Foundation Models, Evals & Benchmarking, Observability & Monitoring, Dev Tools & Automation, Learning Resources, Data Science & Analytics

Deployment Context

Cloud API, Self-hosted, Research Environment

Industries

Developer Tools, Research and Academia, AI/ML Platform Services

Modalities

Text, Code, Multimodal

Skill Areas

Large Language Model Evaluation, Benchmark Design and Implementation, Model Performance Assessment, LLM System Testing, Evaluation Metrics and Scoring, Prompt Engineering, Model Comparison and Analysis, AI Safety Evaluation, Capability Assessment

Tag

Data Science, Database, Evals, Forked, OpenAI, Python, Tutorial, Weights & Biases

Use Cases

LLM Performance Benchmarking, Model Selection and Comparison, AI System Quality Assurance, Research Evaluation Standards, Custom Evaluation Protocol Development, Model Capability Assessment, AI Safety Testing

Recent Activity

Updated 7 months ago

7 Days

0

30 Days

0

90 Days

0

Quality

Quality: high
Maturity: production

Categories

Evals & Benchmarking (Primary), Observability & Monitoring, Dev Tools & Automation, Learning Resources, Data Science & Analytics, Foundation Models, Safety & Alignment, Other AI / ML

PM Skills

Data & Evaluation

Languages

Python 100.0%

Timeline

Project created: Jan 23, 2023
Forked: Mar 12, 2026
Your last push: 7 months ago
Upstream last push: 1 month ago
Tracked since: Nov 3, 2025

Similar Repos

pgvector cosine similarity · $0
