Open LLM Leaderboard
Track, rank and evaluate open LLMs and chatbots
Embedding Leaderboard
Explore LLM performance across hardware configurations
Explore ASR model performance across languages and datasets
Explore and submit code model evaluations on a leaderboard
View the latest LMArena model leaderboard
Request evaluation for a new model
Display leaderboard of language models
Submit model evaluation results to leaderboard
Display a web page
Browse and compare AI model evaluations
View and submit LLM evaluations
Submit and view model evaluation results in a leaderboard format
Compare model answers to questions in French
Explore and filter LLM benchmark results
Upload video model evaluation data to update the VBench leaderboard
Launch a Streamlit web app interface
Evaluate LLMs' cybersecurity risks and capabilities
Submit and evaluate models for contextual understanding tasks
Search for model performance across languages and benchmarks
Explore and submit LLM benchmarks
VLMEvalKit Evaluation Results Collection
Explore RewardBench model rankings and scores
Attempt to jailbreak LLM safety and privacy guardrails
Filter data on contamination in datasets and models
Track, rank and evaluate open Arabic LLMs and chatbots
Explore and compare QA and long doc benchmarks
Submit and evaluate model results on MM-UPD benchmarks
Explore code-generation model leaderboards and task details
Evaluate open LLMs in the languages of LATAM and Spain
Compare financial LLMs on benchmark leaderboard