K2 V3 Error Analysis Scorecard 🧭 Explore model benchmarks and error analysis with interactive charts
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization Paper • 2605.19436 • Published May 19 • 14
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device Paper • 2602.20161 • Published Feb 23 • 23
Ebisu: Benchmarking Large Language Models in Japanese Finance Paper • 2602.01479 • Published Feb 1 • 17