Reporium

THUDM/AgentBench

AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

Builder

THUDM • individual

Stars

3,458

Using upstream star count

Forks

257

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Jul 28, 2023

Project creation date

README Summary

🌐 Leaderboard (new): https://docs.google.com/spreadsheets/d/e/2PACX-1vRR3Wl7wsCgHpwUw1_eUXW_fptAPLL3FkhnW_rua0O1Ji_GIVrpTjY5LaKAhwO-WeARjnY_KNw0SYNJ/pubhtml | 🐦 Twitter: https://twitter.com/thukeg | ✉️ Google Group: agentbench@googlegroups.com | 📃 Paper: https://arxiv.org/abs/2308.03688

AI Dev Skills

Unmapped

Agentic AI Systems, Agent Performance Metrics, Autonomous Agent Development, Large Language Model Evaluation, LLM Benchmarking Methodologies, LLM Reasoning Assessment, Multi-Environment Agent Testing, Multi-Task Agent Evaluation

Tags

Agentic AI Systems, Agent Performance Metrics, Autonomous Agent Development, Large Language Model Evaluation, LLM Benchmarking Methodologies, LLM Reasoning Assessment, Multi-Environment Agent Testing, Multi-Task Agent Evaluation, AI Agents, Benchmarking, Caching, Database, Docker, Evals, Forked, Knowledge Graph, Large Language Models, Mobile, Multi-Agent, NumPy, OpenAI, Python, Research / Papers, Robot Learning, Statistics, Tool Use, Tutorial

Taxonomy

AI Trends

Agentic AI, LLM Evaluation, Agent Benchmarking, Autonomous AI Systems

Category

AI Agents, Foundation Models, RAG & Retrieval, Evals & Benchmarking, Inference & Serving, Robotics, MLOps & Infrastructure, Dev Tools & Automation, Learning Resources, Data Science & Analytics

Deployment Context

Self-hosted, Cloud API, Research Environment

Industries

AI Research, Academic Research, Machine Learning Operations

Modalities

Text

Skill Areas

Large Language Model Evaluation, Agentic AI Systems, Multi-Environment Agent Testing, LLM Benchmarking Methodologies, Agent Performance Metrics, Autonomous Agent Development, LLM Reasoning Assessment, Multi-Task Agent Evaluation

Tag

AI Agents, Benchmarking, Caching, Database, Docker, Evals, Forked, Knowledge Graph, Large Language Models, Mobile, Multi-Agent, NumPy, OpenAI, Python, Research / Papers, Robot Learning, Statistics, Tool Use, Tutorial

Use Cases

LLM Agent Performance Benchmarking, Comparative Agent Model Evaluation, Research Paper Experimentation, Agent System Development Testing, LLM Capability Assessment

Recent Activity

Updated 3 months ago

7 Days

0

30 Days

0

90 Days

0

Merge pull request #213 from mkimhi/agentbench-lite-suite

Shaw • Feb 8, 2026

d1e4a10

Docs: clarify Python 3.9 recommended for dependency install

Moshe Kimhi • Feb 8, 2026

a3cc91a

Add CI smoke test for lite preset YAML configs

Moshe Kimhi • Feb 8, 2026

d3571d7

Quality

Quality: high
Maturity: research

Categories

Foundation Models (Primary), AI Agents, RAG & Retrieval, Evals & Benchmarking, Inference & Serving, Robotics, Data Science & Analytics, Edge & Mobile AI, Search & Knowledge, Other AI / ML, MLOps & Infrastructure, Dev Tools & Automation, Learning Resources

PM Skills

Cost & Efficiency, Scale & Reliability, Data & Evaluation, Product Discovery, Developer Platform, AI-Native Architecture

Languages

Python 100.0%

Timeline

Project created: Jul 28, 2023
Forked: Mar 22, 2026
Your last push: 3 months ago
Upstream last push: 3 months ago
Tracked since: Feb 8, 2026

Similar Repos

pgvector cosine similarity · $0