Reporium

stanford-crfm/helm

helm

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.

Upstream: stanford-crfm/helm

Builder

Stanford

stanford-crfm • research

Stars

2,805

Using upstream star count

Forks

392

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Nov 29, 2021

Project creation date



AI Dev Skills

Unmapped

Benchmark Suite Development, Evaluation Methodology Design, Foundation Model Evaluation, Large Language Model Assessment, Model Bias Assessment, Model Performance Analysis, Multimodal Model Benchmarking, Prompt Engineering Evaluation, Robustness Testing, Statistical Model Comparison

Tags

Benchmark Suite Development, Evaluation Methodology Design, Foundation Model Evaluation, Large Language Model Assessment, Model Bias Assessment, Model Performance Analysis, Multimodal Model Benchmarking, Prompt Engineering Evaluation, Robustness Testing, Statistical Model Comparison, Anthropic / Claude, Benchmarking, Claude, Evals, Forked, Google AI, Large Language Models, Machine Learning, MMLU, Multimodal AI, OpenAI, Open Source, Python, Research / Papers

Taxonomy

AI Trends

Foundation Models, Large Language Models, Multimodal AI, AI Safety, Model Evaluation, Responsible AI

Category

Foundation Models, Evals & Benchmarking, Cloud & Platforms, Learning Resources

Deployment Context

Self-hosted, Cloud, Research Environment

Modalities

Text, Multimodal

Skill Areas

Foundation Model Evaluation, Large Language Model Assessment, Multimodal Model Benchmarking, Model Performance Analysis, Evaluation Methodology Design, Statistical Model Comparison, Prompt Engineering Evaluation, Model Bias Assessment, Robustness Testing, Benchmark Suite Development

Tag

Anthropic / Claude, Benchmarking, Claude, Evals, Forked, Google AI, Large Language Models, MMLU, Machine Learning, Multimodal AI, Open Source, OpenAI, Python, Research / Papers

Use Cases

Foundation Model Performance Evaluation, LLM Capability Assessment, Model Comparison and Selection, Academic Research Benchmarking, Model Development Progress Tracking, Bias and Fairness Analysis, Robustness Testing, Multimodal Model Assessment

Recent Activity

Updated 2 months ago

7 Days

0

30 Days

0

90 Days

20

Add metric for Arabic legal scenarios (#4123)

Yifan Mai • Mar 19, 2026

7c2f30b

Add Arabic legal scenario (#4122)

Yifan Mai • Mar 19, 2026

eefc96b

Bump @types/node from 25.4.0 to 25.5.0 in /helm-frontend in the npm group (#4121)

dependabot[bot] • Mar 19, 2026

579f617

Quality

Quality: high
Maturity: research

Categories

Evals & Benchmarking (Primary), Cloud & Platforms, Learning Resources, Foundation Models, Multimodal AI, Search & Knowledge, Other AI / ML

PM Skills

Data & Evaluation, User Experience

Languages

Python 100.0%

Timeline

Project created: Nov 29, 2021
Forked: Mar 22, 2026
Your last push: 2 months ago
Upstream last push: 20 days ago
Tracked since: Mar 20, 2026

Similar Repos

pgvector cosine similarity · $0
