SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models
Abstract
The SketchJudge benchmark evaluates the ability of multimodal large language models to grade hand-drawn STEM diagrams, revealing significant limitations in their visual understanding relative to human performance.
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where a model must not only solve the underlying problem but also diagnose errors in a hand-drawn diagram. Such diagnostic capability depends on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge comprises 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind humans, validating the benchmark's effectiveness in exposing the fragility of current vision-language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at https://github.com/yuhangsu82/SketchJudge.
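To make the evaluation setup concrete, the sketch below shows what a minimal grading-accuracy loop over such a benchmark could look like. The file name (`annotations.jsonl`), field names (`domain`, `image`, `problem`, `gold_verdict`), and the `predict_fn` interface are assumptions for illustration only; they are not the benchmark's actual data schema or API, which is defined in the repository linked above.

```python
# Hypothetical sketch of a grading-accuracy evaluation loop.
# File names, JSON fields, and the model interface are assumptions,
# not SketchJudge's actual schema or API.
import json
from pathlib import Path
from collections import defaultdict


def load_samples(root: str):
    """Load benchmark items from a hypothetical JSONL annotation file."""
    samples = []
    with open(Path(root) / "annotations.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            # Each record is assumed to carry the sketch image path, the problem
            # statement, the domain, and the gold grading verdict.
            samples.append(json.loads(line))
    return samples


def grading_accuracy(samples, predict_fn):
    """Compare a model's grading verdicts against gold labels, overall and per domain."""
    correct, per_domain = 0, defaultdict(lambda: [0, 0])
    for s in samples:
        # predict_fn is assumed to return a verdict label, e.g. "correct" or an error type.
        pred = predict_fn(s["image"], s["problem"])
        hit = int(pred == s["gold_verdict"])
        correct += hit
        per_domain[s["domain"]][0] += hit
        per_domain[s["domain"]][1] += 1
    overall = correct / len(samples)
    by_domain = {d: c / n for d, (c, n) in per_domain.items()}
    return overall, by_domain
```

Reporting accuracy per domain (geometry, physics, charts, flowcharts) in addition to the overall score mirrors how benchmarks of this kind typically expose where a model's diagnostic reasoning breaks down.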
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models (2025)
- PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding (2025)
- MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models (2026)
- AECV-Bench: Benchmarking Multimodal Models on Architectural and Engineering Drawings Understanding (2026)
- CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution (2025)
- PENDULUM: A Benchmark for Assessing Sycophancy in Multimodal Large Language Models (2025)
- Evaluating large language models on multimodal chemistry olympiad exams (2025)