THUDM/AgentBench

AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

Builder

THUDM • individual

Stars

3,295

Using upstream star count

Forks

242

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Jul 28, 2023

Project creation date

README Summary

AgentBench is a benchmark for evaluating Large Language Models (LLMs) as agents across multiple environments and tasks. It provides a systematic framework for assessing how well LLMs make autonomous decisions and interact with different environments. The accompanying paper was accepted at ICLR 2024, and the benchmark offers standardized evaluation protocols for agent capabilities.

AI Dev Skills

Unmapped

Large Language Model Evaluation, Agentic AI Systems, Multi-Environment Agent Testing, LLM Benchmarking Methodologies, Agent Performance Metrics, Autonomous Agent Development, LLM Reasoning Assessment, Multi-Task Agent Evaluation

Tags

Large Language Model Evaluation, Agentic AI Systems, Multi-Environment Agent Testing, LLM Benchmarking Methodologies, Agent Performance Metrics, Autonomous Agent Development, LLM Reasoning Assessment, Multi-Task Agent Evaluation, Multi-Environment Testing, Research Environment, AI Agent Performance Analysis, Autonomous Agents, Academic Research, Agentic AI, Agent Capability Assessment, AI Research, Autonomous Agent Systems, Large Language Model Assessment, Model Selection for Agent Tasks, Text, Benchmark Design, AI Benchmarking, AI Development, LLM Agent Benchmarking, LLM Evaluation, Agent Evaluation Metrics, Self-hosted, Agent Performance Comparison, LLM Agent Development, Research Publication Validation, Python

Taxonomy

Recent Activity

Updated 2 months ago

7 Days

0

30 Days

0

90 Days

0

Quality

Quality
high
Maturity
research

Categories

Dev Tools & Automation (Primary), Learning Resources, Evals & Benchmarking, Search & Knowledge, Other AI / ML, Foundation Models, AI Agents

PM Skills

Developer Platform

Languages

Python 100.0%

Timeline

Project created
Jul 28, 2023
Forked
Mar 22, 2026
Your last push
2 months ago
Upstream last push
2 months ago
Tracked since
Feb 8, 2026
