Do MLLMs Really Understand Space? A Mathematical Spatial Reasoning Evaluation
MathSpatial is a large-scale, open dataset ecosystem dedicated to mathematical spatial reasoning in MLLMs. It provides 10,000 problems with 26,000+ geometric diagrams, covering tasks from multi-view recognition to geometric property calculation.
Left: Humans achieve over 95% accuracy on MathSpatial-Bench while most MLLMs remain below 60%. Right: Three core challenges and how MathSpatial addresses them.
2,000 problems with 5,837 images for rigorous diagnostic evaluation with calibrated difficulty across 3 categories and 11 subtypes.
8,000 problems with 20,308+ images for training with verified solutions. Sourced from authentic educational materials spanning K-12 levels.
10,000 structured reasoning traces following the Correlate → Constrain → Infer paradigm for interpretable intermediate supervision.
3 main categories × 11 subtypes covering representative tasks in mathematical spatial reasoning education.
MathSpatial-Bench: 3 main categories × 11 subtypes. Holistic Recognition (518), Generative Inference (636), and Abstract Deduction (846).
(a) Category distribution across Bench & Corpus. (b) Subcategory breakdown. (c) 58% multiple-choice, 42% fill-in-blank for Bench. (d) Balanced A/B/C/D answer distribution.
Figure: Bench vs. Corpus comparison across categories.
Figure: Image count per problem and question length distributions.
MathSpatial vs. existing benchmarks in mathematical and spatial reasoning.
| Dataset | #Tasks | #Samples | Bilingual | Spatial Focus | Train Set |
|---|---|---|---|---|---|
| EmbSpatial-Bench | 6 | 3,640 | ✗ | ✗ | ✓ |
| Space3D-Bench | 6 | 211 | ✗ | ✗ | ✗ |
| SpatialRGPT-Bench | 12 | 1,406 | ✗ | ✗ | ✓ |
| BLINK-Spatial | 14 | 286 | ✗ | ✗ | ✗ |
| SpatialVLM | 2 | 546 | ✗ | ✗ | ✓ |
| GeoEval | 3 | 5,050 | ✗ | ✗ | ✗ |
| 3DSRBench | 4 | 6,942 | ✗ | ✗ | ✗ |
| SOLIDGEO | 8 | 3,113 | ✓ | ✓ | ✗ |
| SpatialBot-Bench | 5 | 200 | ✗ | ✗ | ✓ |
| VSI-Bench | 8 | 288 | ✗ | ✗ | ✗ |
| OmniSpatial | 50 | 1,387 | ✗ | ✗ | ✗ |
| MathSpatial (Ours) | 11 | 10,000 | ✓ | ✓ | ✓ |
Examples from three main categories: Holistic Recognition, Generative Inference, and Abstract Deduction.
Holistic Recognition: matching 3D objects to views. Generative Inference: completing missing views. Abstract Deduction: calculating geometric properties.
Representative examples from each of the 11 subcategories, showing the diversity of spatial reasoning tasks across Holistic Recognition, Generative Inference, and Abstract Deduction.
Every problem in MathSpatial-Corpus is annotated with structured reasoning traces decomposing spatial reasoning into three atomic operations: Correlate, Constrain, and Infer.
Establish cross-view correspondences between geometric features.
Apply geometric rules, projection constraints, and consistency conditions.
Deduce hidden attributes, measurements, or final answers from evidence.
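The three atomic operations can be pictured as a typed sequence of steps. The sketch below is illustrative only: the dataclass fields and the example step texts are assumptions, not the actual MathSpatial-Corpus schema.

```python
from dataclasses import dataclass

# Illustrative schema for a structured reasoning trace; the real
# corpus field names may differ.
@dataclass
class Step:
    op: str    # one of "Correlate", "Constrain", "Infer"
    text: str  # natural-language content of the step

# A hypothetical trace for a multi-view volume problem.
trace = [
    Step("Correlate", "Match the front-view rectangle to face ABCD of the prism."),
    Step("Constrain", "Orthographic projection preserves the height h = 4 in the side view."),
    Step("Infer", "Volume = base area x height = 6 x 4 = 24."),
]

# A well-formed trace uses only the three atomic operations.
assert all(s.op in {"Correlate", "Constrain", "Infer"} for s in trace)
print(len(trace))  # prints 3
```

Representing traces this way makes the intermediate supervision checkable: a verifier can reject any step whose operation falls outside the three-operation vocabulary.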
(a) Atomic operation distribution: Infer ~42% > Correlate ~33% > Constrain ~25%. (b) Average trace length: 5.1 steps.
Overall accuracy on MathSpatial-Bench. Humans achieve 96.3% while the best-performing MLLM (GPT-5) reaches 58.5%.
| # | Model | Type | Overall | Best Subtype | Worst Subtype | Avg Tokens |
|---|---|---|---|---|---|---|
| — | Human | Human | 96.3% | C3M (100%) | 3VM (89.4%) | — |
| 🥇 1 | GPT-5 | Closed | 58.5% | PRA (71.7%) | C3M (28.6%) | 676 |
| 🥈 2 | Gemini-2.5-flash | Closed | 48.5% | MVC (58.5%) | FPBI (25.0%) | 1,115 |
| 🥉 3 | Gemini-2.5-pro | Closed | 44.9% | MVC (58.5%) | CR (26.7%) | 914 |
| 4 | Claude Sonnet 4 | Closed | 26.7% | PRA (60.7%) | GPC (0.2%) | 1,006 |
| 5 | Llama-4-Maverick-17B | Open | 23.8% | PRA (47.6%) | VT (4.5%) | 845 |
| 6 | MathSpatial-InternVL3-8B ⭐ | Ours | 22.6% | PRA (56.6%) | C3M (2.0%) | 318 |
| 7 | GPT-4.1 | Closed | 22.6% | PRA (55.2%) | VT (0.0%) | 676 |
| 8 | Claude 3.5 Sonnet | Closed | 22.3% | PRA (57.2%) | GPC (0.0%) | 859 |
| 9 | MathSpatial-Qwen-7B ⭐ | Ours | 22.1% | IVI (46.2%) | GPC (0.0%) | 352 |
| 10 | Claude 3.7 Sonnet | Closed | 21.5% | PRA (57.9%) | GPC (0.0%) | 886 |
| 11 | GLM-4.5V | Open | 21.0% | PRA (57.9%) | GPC (0.0%) | 1,391 |
| 12 | MathSpatial-Llama3-8B ⭐ | Ours | 20.3% | CR (41.1%) | PPCD (7.7%) | 397 |
| 13 | Qwen2.5-VL-72B | Open | 19.7% | PRA (47.6%) | VT (0.0%) | 498 |
| 14 | GPT-4o | Closed | 19.6% | PRA (45.5%) | GPC (0.0%) | 677 |
| 15 | Qwen2.5-VL-7B | Open | 17.8% | PRA (41.4%) | C3M (0.0%) | 465 |
| 16 | InternVL3-8B | Open | 17.4% | PRA (40.0%) | C3M (0.0%) | 474 |
| 17 | Llama3-8B | Open | 15.0% | PRA (23.4%) | PPCD (7.7%) | 785 |
Full results for all 17+ evaluated models are available in the paper. Our fine-tuned models achieve competitive accuracy compared to larger closed-source systems while using 50–60% fewer tokens. Want to submit your model? Open an issue on GitHub.
MathSpatial reveals notable limitations of current MLLMs in spatial reasoning.
Humans achieve 96.3% accuracy; the best-performing MLLM (GPT-5) reaches 58.5%, indicating a 37.8 percentage-point gap in spatial reasoning.
Most models score below 5% on Geometric Property Calculation (GPC), the most abstract subtype requiring multi-step numerical reasoning.
Qwen2.5-VL-72B (19.7%) barely improves over its 7B variant (17.8%), suggesting that spatial reasoning may not be easily addressed by scaling alone.
Fine-tuned models show consistent accuracy gains while reducing token usage by 20–30% through structured reasoning-trace supervision.
Overall accuracy comparison across all 17+ evaluated models on MathSpatial-Bench.
Fine-grained error taxonomy reveals distinct failure patterns across model families, highlighting open challenges in spatial reasoning.
(a) Error frequency distribution for baselines across 6 subcategories. (b) Overall error rate breakdown by failure mode.
The dominant failure mode: models produce incomplete or inconsistent chains of thought. Fine-tuning on MathSpatial-Corpus reduces this error type.
Outputs break orthographic projection rules or visibility constraints. Claude models are particularly prone to this error type.
Models misinterpret top, side, or front views. Gemini-2.5-flash suffers more from this error category than other models.
Feature errors (10.5%): omitting components. Scale errors (7.2%): failing to preserve sizes. Deduction failures (2.3%): inability to synthesize multiple cues.
The two core challenges — enforcing low-level geometric consistency and sustaining coherent multi-step reasoning — remain largely unsolved by current MLLMs. The fine-grained error taxonomy provides a diagnostic foundation for targeted improvements.
MathSpatial is fully open-source. Access the dataset and code through the links below.
Full dataset with all 10K problems, images, and structured reasoning traces.
🤗 HuggingFace — Released under the CC BY-NC-SA 4.0 license.
If you use MathSpatial in your research, please cite our paper:
@article{lu2026mllms,
title = {Do MLLMs Really Understand Space? A Mathematical Spatial Reasoning Evaluation},
author = {Lu, Shuo and Cheng, Jianjie and Xu, Yinuo and
Yu, Yongcan and Sheng, Lijun and Wang, Peijie and
Jiang, Siru and Hu, Yongguan and Ling, Run and
Shao, Yihua and Ma, Ao and Feng, Wei and
He, Lingxiao and Wang, Meng and Xie, Qianlong and
Wang, Xingxing and Sebe, Nicu and He, Ran and
Liang, Jian},
journal = {arXiv preprint arXiv:2602.11635},
year = {2026}
}
Additional details from the paper, including atomic annotation examples and ablation studies.
A complete atomic annotation example showing the Correlate → Constrain → Infer paradigm applied to a multi-view spatial reasoning problem.
Ablation study: complete solution with all three atomic operations versus degraded outputs when each operation (Correlate, Constrain, or Infer) is removed in turn. Removing any single operation leads to distinct failure modes.
Sources: Baidu Wenku, Zujuan, provincial exam archives. Filtering thresholds: ≥0.90 (remove), 0.85–0.90 (manual review), <0.85 (retain). 1K items flagged for human-in-the-loop review; 60% corrected, 40% removed.
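The three-band filtering rule above can be sketched as a simple triage function. This is a minimal illustration: the function name and the similarity metric it consumes are assumptions, not the paper's actual pipeline code.

```python
def triage(similarity: float) -> str:
    """Route a candidate problem by its similarity score to existing items.

    Bands follow the stated thresholds: >= 0.90 remove,
    0.85-0.90 manual review, < 0.85 retain.
    """
    if similarity >= 0.90:
        return "remove"
    if similarity >= 0.85:
        return "manual_review"
    return "retain"

# Boundary behavior: 0.90 falls in the remove band, 0.85 in review.
print(triage(0.95), triage(0.87), triage(0.50))  # remove manual_review retain
```

Keeping the thresholds in one function makes the boundary cases (exactly 0.90 or 0.85) explicit and easy to audit.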
Participants: 80 middle/high school students · Redundancy: 20× per problem · Total responses: 40,000
Conditions: Closed-book, no calculators/software, pen and paper only
Result: Micro-averaged accuracy = 96.3%
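Micro-averaging pools all 40,000 responses before dividing, so every response carries equal weight regardless of subtype size. A minimal sketch, with illustrative counts (not the study's actual per-subtype numbers):

```python
def micro_accuracy(correct_per_subtype, total_per_subtype):
    """Micro-averaged accuracy: total correct over total responses,
    pooled across all subtypes."""
    return sum(correct_per_subtype) / sum(total_per_subtype)

# Hypothetical counts for three subtypes of different sizes.
correct = [950, 1980, 970]
total = [1000, 2000, 1000]
print(round(micro_accuracy(correct, total), 3))  # 0.975
```

Unlike macro-averaging (the mean of per-subtype accuracies), this weights larger subtypes proportionally, which is why a single large, easy subtype can dominate the headline number.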