Reporium

pinchbench/skill

skill

PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai

Upstream: pinchbench/skill

Builder

pinchbench • individual

Stars

1,209

Using upstream star count

Forks

133

Using upstream fork count

Open Issues

0

Activity Score

0/100

0 commits in 30d

Created

Feb 11, 2026

Project creation date

README Summary

[![Leaderboard](https://img.shields.io/badge/leaderboard-pinchbench.com-blue)](https://pinchbench.com) [![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)

AI Dev Skills

Unmapped

Agent-Based Architecture, Agentic AI Systems, Benchmark Design and Methodology, Code Generation Assessment, Code Generation Benchmarking, Code Generation Models, Coding Task Evaluation, Large Language Model Evaluation, LLM-as-Agent Architecture, LLM Evaluation and Benchmarking, Model Comparison and Analysis, Model Performance Measurement, Model Performance Metrics, Prompt Engineering, Software Engineering Agents, Software Engineering AI

Tags

Agent-Based Architecture, Agentic AI Systems, Benchmark Design and Methodology, Code Generation Assessment, Code Generation Benchmarking, Code Generation Models, Coding Task Evaluation, Large Language Model Evaluation, LLM-as-Agent Architecture, LLM Evaluation and Benchmarking, Model Comparison and Analysis, Model Performance Measurement, Model Performance Metrics, Prompt Engineering, Software Engineering Agents, Software Engineering AI, AI Agents, Anthropic / Claude, Benchmarking, Claude, Evals, Forked, Large Language Models, OpenAI, Python

Taxonomy

AI Trends

Agentic AI, LLM as Agents, Code Generation, Model Evaluation, LLM Evaluation and Benchmarking, AI Agent Frameworks, Coding AI Systems, LLM Evaluation, Code-Generating Agents, Model Benchmarking

Category

Foundation Models, AI Agents, Evals & Benchmarking

Deployment Context

Self-hosted, Cloud API

Industries

Developer Tools, AI/ML Infrastructure, AI Research

Modalities

Code, Text

Skill Areas

LLM Evaluation and Benchmarking, Agentic AI Systems, Code Generation Models, Model Performance Metrics, Agent-Based Architecture, Software Engineering AI, Large Language Model Evaluation, Code Generation Benchmarking, LLM-as-Agent Architecture, Coding Task Evaluation, Model Performance Measurement, Benchmark Design and Methodology, Code Generation Assessment, Prompt Engineering, Model Comparison and Analysis, Software Engineering Agents

Tag

AI Agents, Active, Anthropic / Claude, Benchmarking, Claude, Evals, Forked, Large Language Models, OpenAI, Python

Use Cases

LLM Coding Agent Benchmarking, Autonomous Code Generation Evaluation, Model Comparison and Selection, Agent Performance Measurement, LLM coding agent performance evaluation, Model comparison and ranking, Code generation quality assessment, Automated programming task benchmarking, LLM Model Evaluation, Coding Agent Performance Comparison, Code Generation Quality Assessment, Agent Capability Benchmarking

Recent Activity

Updated 2 months ago

7 Days

0

30 Days

0

90 Days

20

Merge pull request #73 from luccathescientist/fix-judge-total-normalization

Brendan O'Leary • Mar 24, 2026

1e2ba6b

Normalize judge totals to 0-1 scale

Lucca • Mar 21, 2026

4359719

Merge pull request #71 from pinchbench/lint-and-complie

Brendan O'Leary • Mar 19, 2026

e8e833b

Quality

Quality
medium
Maturity
prototype

Categories

Evals & Benchmarking (Primary), Foundation Models, AI Agents, Other AI / ML

PM Skills

Data & Evaluation, AI-Native Architecture

Languages

Python 100.0%

Timeline

Project created
Feb 11, 2026
Forked
Mar 28, 2026
Your last push
2 months ago
Upstream last push
20 days ago
Tracked since
Mar 24, 2026

Similar Repos

pgvector cosine similarity · $0