stanford-crfm/helm

helm

Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.

Builder

Stanford

stanford-crfm • research

Stars

2,735

Using upstream star count

Forks

369

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Nov 29, 2021

Project creation date

README Summary

HELM (Holistic Evaluation of Language Models) is Stanford CRFM's comprehensive Python framework for evaluating foundation models including LLMs and multimodal models. The framework emphasizes holistic assessment across multiple dimensions, reproducibility through standardized benchmarks, and transparency in evaluation methodologies. It provides researchers and practitioners with systematic tools to assess model performance, capabilities, and limitations across diverse tasks and scenarios.

AI Dev Skills

Unmapped

Foundation Model Evaluation, Large Language Model Assessment, Multimodal Model Benchmarking, Model Performance Analysis, Evaluation Methodology Design, Statistical Model Comparison, Prompt Engineering Evaluation, Model Bias Assessment, Robustness Testing, Benchmark Suite Development

Tags

Foundation Model Evaluation, Large Language Model Assessment, Multimodal Model Benchmarking, Model Performance Analysis, Evaluation Methodology Design, Statistical Model Comparison, Prompt Engineering Evaluation, Model Bias Assessment, Robustness Testing, Benchmark Suite Development, Model Development Progress Tracking, Multimodal AI, Foundation Models, Foundation Model Performance Evaluation, Responsible AI, Text, Research Environment, LLM Capability Assessment, Bias and Fairness Analysis, Large Language Models, Multimodal Model Assessment, Model Evaluation, Model Comparison and Selection, AI Safety, Cloud, Multimodal, Academic Research Benchmarking, Self-hosted, Python

Taxonomy

Recent Activity

Updated 24 days ago

7 Days

0

30 Days

0

90 Days

0

Quality

Quality: high
Maturity: research

Categories

Foundation Models (Primary), Data Science & Analytics, AI Agents, Safety & Alignment, Evals & Benchmarking, Learning Resources, Multimodal AI, Search & Knowledge, Other AI / ML

PM Skills

Product Discovery

Languages

Python 100.0%

Timeline

Project created
Nov 29, 2021
Forked
Mar 22, 2026
Your last push
24 days ago
Upstream last push
7 days ago
Tracked since
Mar 20, 2026
