📜 Submitted to ACM Multimedia 2026 — Dataset Track

MathSpatial

Do MLLMs Really Understand Space? A Mathematical Spatial Reasoning Evaluation

10,000 Problems · 26,145 Diagrams · 3 × 11 Categories · 16+ Models · CC BY-NC-SA 4.0

MathSpatial Team

Overview

MathSpatial is a large-scale, open dataset ecosystem dedicated to mathematical spatial reasoning in MLLMs. It provides 10,000 problems with 26,000+ geometric diagrams, covering tasks from multi-view recognition to geometric property calculation.

MathSpatial Overview

Left: Humans achieve over 95% accuracy on MathSpatial-Bench while most MLLMs remain below 60%. Right: Three core challenges and how MathSpatial addresses them.

10,000 Total Problems · 26,145 Geometric Diagrams · 2.6 Avg Images / Problem · 96.3% Human Accuracy

Evaluation

MathSpatial-Bench

2,000 problems with 5,837 images for rigorous diagnostic evaluation with calibrated difficulty across 3 categories and 11 subtypes.

Training

MathSpatial-Corpus

8,000 problems with 20,308 images and verified solutions for training, sourced from authentic educational materials spanning K-12 levels.

Annotations

Structured Reasoning Traces

10,000 structured reasoning traces following the Correlate → Constrain → Infer paradigm for interpretable intermediate supervision.

Dataset Statistics

3 main categories × 11 subtypes covering representative tasks in mathematical spatial reasoning education.

Taxonomy

MathSpatial-Bench: 3 main categories × 11 subtypes. Holistic Recognition (518), Generative Inference (636), and Abstract Deduction (846).

Combined Statistics

(a) Category distribution across Bench & Corpus. (b) Subcategory breakdown. (c) 58% multiple-choice, 42% fill-in-blank for Bench. (d) Balanced A/B/C/D answer distribution.

Bench vs Corpus

Figure: Bench vs. Corpus comparison across categories.

Image and Text Stats

Figure: Image count per problem and question length distributions.

Benchmark Comparison

MathSpatial vs. existing benchmarks in mathematical and spatial reasoning.

Dataset | #Tasks | #Samples
EmbSpatial-Bench | 6 | 3,640
Space3D-Bench | 6 | 211
SpatialRGPT-Bench | 12 | 1,406
BLINK-Spatial | 14 | 286
SpatialVLM | 2 | 546
GeoEval | 3 | 5,050
3DSRBench | 4 | 6,942
SOLIDGEO | 8 | 3,113
SpatialBot-Bench | 5 | 200
VSI-Bench | 8 | 288
OmniSpatial | 50 | 1,387
MathSpatial (Ours) | 11 | 10,000

Task Examples

Examples from three main categories: Holistic Recognition, Generative Inference, and Abstract Deduction.

Task Examples

Holistic Recognition: matching 3D objects to views. Generative Inference: completing missing views. Abstract Deduction: calculating geometric properties.

View Detailed Case Studies (All Subcategories)
Case Studies

Representative examples from each of the 11 subcategories, showing the diversity of spatial reasoning tasks across Holistic Recognition, Generative Inference, and Abstract Deduction.

Atomic Annotations

Every problem in MathSpatial-Corpus is annotated with structured reasoning traces decomposing spatial reasoning into three atomic operations: Correlate, Constrain, and Infer.

Correlate

Establish cross-view correspondences between geometric features.

Constrain

Apply geometric rules, projection constraints, and consistency conditions.

Infer

Deduce hidden attributes, measurements, or final answers from evidence.

Atomic Operation Analysis

(a) Atomic operation distribution: Infer ~42% > Correlate ~33% > Constrain ~25%. (b) Average trace length: 5.1 steps.
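
These distribution figures can be recomputed directly from the traces. A minimal sketch, assuming each trace is stored as a list of steps carrying an "op" field; `trace_stats` is our own illustrative helper, not part of the released toolkit:

```python
from collections import Counter

def trace_stats(traces):
    """Tally atomic-operation frequencies and the average number of
    steps per trace across a collection of reasoning traces."""
    ops = Counter(step["op"] for trace in traces for step in trace)
    total = sum(ops.values())
    dist = {op: count / total for op, count in ops.items()}
    avg_len = total / len(traces)
    return dist, avg_len

# toy input: two traces of three steps each
traces = [
    [{"op": "Correlate"}, {"op": "Constrain"}, {"op": "Infer"}],
    [{"op": "Correlate"}, {"op": "Infer"}, {"op": "Infer"}],
]
dist, avg_len = trace_stats(traces)
print(dist, avg_len)  # Infer accounts for half the steps; average length 3.0
```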

Trace Format Example

{
  "reasoning": [
    { "op": "Correlate", "step": "The front and left views are identical isosceles triangles." },
    { "op": "Constrain", "step": "The top view is a circle, constraining the base to be circular." },
    { "op": "Infer", "step": "A solid with triangular projections and circular top is a cone." }
  ],
  "final_answer": "C"
}
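
A record in this format can be checked programmatically. A minimal sketch in Python: the schema (ops limited to Correlate/Constrain/Infer, plus a "final_answer" field) follows the example above, while `validate_trace` and its specific checks are our own illustration:

```python
import json

ATOMIC_OPS = {"Correlate", "Constrain", "Infer"}

def validate_trace(raw: str) -> dict:
    """Parse one reasoning-trace record and check it against the
    Correlate / Constrain / Infer schema."""
    record = json.loads(raw)
    assert "reasoning" in record and "final_answer" in record
    for step in record["reasoning"]:
        assert step["op"] in ATOMIC_OPS, f"unknown op: {step['op']}"
        assert isinstance(step["step"], str) and step["step"]
    return record

example = '''{
  "reasoning": [
    {"op": "Correlate", "step": "Front and left views are identical isosceles triangles."},
    {"op": "Constrain", "step": "The top view is a circle, so the base is circular."},
    {"op": "Infer", "step": "Triangular projections with a circular top view imply a cone."}
  ],
  "final_answer": "C"
}'''

trace = validate_trace(example)
print(trace["final_answer"])  # C
```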

🏆 Leaderboard

Overall accuracy on MathSpatial-Bench. Humans achieve 96.3% while the best-performing MLLM (GPT-5) reaches 58.5%.

Figure: Radar chart of model performance.
# | Model | Type | Overall | Best Subtype | Worst Subtype | Avg Tokens
- | Human | Human | 96.3% | C3M (100%) | 3VM (89.4%) | -
🥇 1 | GPT-5 | Closed | 58.5% | PRA (71.7%) | C3M (28.6%) | 676
🥈 2 | Gemini-2.5-flash | Closed | 48.5% | MVC (58.5%) | FPBI (25.0%) | 1,115
🥉 3 | Gemini-2.5-pro | Closed | 44.9% | MVC (58.5%) | CR (26.7%) | 914
4 | Claude Sonnet 4 | Closed | 26.7% | PRA (60.7%) | GPC (0.2%) | 1,006
5 | Llama-4-Maverick-17B | Open | 23.8% | PRA (47.6%) | VT (4.5%) | 845
6 | MathSpatial-InternVL3-8B ⭐ | Ours | 22.6% | PRA (56.6%) | C3M (2.0%) | 318
7 | GPT-4.1 | Closed | 22.6% | PRA (55.2%) | VT (0.0%) | 676
8 | Claude 3.5 Sonnet | Closed | 22.3% | PRA (57.2%) | GPC (0.0%) | 859
9 | MathSpatial-Qwen-7B ⭐ | Ours | 22.1% | IVI (46.2%) | GPC (0.0%) | 352
10 | Claude 3.7 Sonnet | Closed | 21.5% | PRA (57.9%) | GPC (0.0%) | 886
11 | GLM-4.5V | Open | 21.0% | PRA (57.9%) | GPC (0.0%) | 1,391
12 | MathSpatial-Llama3-8B ⭐ | Ours | 20.3% | CR (41.1%) | PPCD (7.7%) | 397
13 | Qwen2.5-VL-72B | Open | 19.7% | PRA (47.6%) | VT (0.0%) | 498
14 | GPT-4o | Closed | 19.6% | PRA (45.5%) | GPC (0.0%) | 677
15 | Qwen2.5-VL-7B | Open | 17.8% | PRA (41.4%) | C3M (0.0%) | 465
16 | InternVL3-8B | Open | 17.4% | PRA (40.0%) | C3M (0.0%) | 474
17 | Llama3-8B | Open | 15.0% | PRA (23.4%) | PPCD (7.7%) | 785

Full results for 16+ models available in the paper. Our fine-tuned models achieve competitive accuracy compared to larger closed-source systems while using 50–60% fewer tokens. Want to submit your model? Open an issue on GitHub.

Key Findings

MathSpatial reveals notable limitations of current MLLMs in spatial reasoning.

Large Human–Model Gap

Humans achieve 96.3% accuracy; the best-performing MLLM (GPT-5) reaches 58.5%, indicating a 37.8 percentage-point gap in spatial reasoning.

Abstract Deduction is Hardest

Most models score below 5% on Geometric Property Calculation (GPC), the most abstract subtype requiring multi-step numerical reasoning.

Scaling Alone Shows Limited Gains

Qwen2.5-VL-72B (19.7%) barely improves over its 7B variant (17.8%), suggesting that spatial reasoning may not be easily addressed by scaling alone.

Training on MathSpatial Helps

Fine-tuned models show consistent accuracy improvements while reducing token usage by 20–30% through structured reasoning trace supervision.

View Detailed Accuracy Chart (All Models)
Overall Accuracy

Overall accuracy comparison across all 17+ evaluated models on MathSpatial-Bench.

Error Analysis

Fine-grained error taxonomy reveals distinct failure patterns across model families, highlighting open challenges in spatial reasoning.

Error Analysis

(a) Error frequency distribution for baselines across 6 subcategories. (b) Overall error rate breakdown by failure mode.

Reasoning Gaps (34.4%)

The dominant failure mode. Models produce incomplete or inconsistent chains of thought. Fine-tuning on MathSpatial-Corpus reduces this error type.

Geometry Violations (33.0%)

Outputs break orthographic projection rules or visibility constraints. Claude models are particularly prone to this error type.

Projection Errors (12.6%)

Models misinterpret top, side, or front views. Gemini-2.5-flash suffers more from this error category than other models.

Feature / Scale / Deduction

Feature errors (10.5%): omitting components. Scale errors (7.2%): failing to preserve sizes. Deduction failures (2.3%): inability to synthesize multiple cues.

Key Insight

The two core challenges — enforcing low-level geometric consistency and sustaining coherent multi-step reasoning — remain largely unsolved by current MLLMs. The fine-grained error taxonomy provides a diagnostic foundation for targeted improvements.

Download & Access

MathSpatial is fully open-source. Access the dataset and code through the links below.

GitHub Repository

Source code, benchmark data, and documentation.

View on GitHub

HuggingFace Dataset

Full dataset with all 10K problems, images, and structured reasoning traces.

🤗 HuggingFace

License

Released under the CC BY-NC-SA 4.0 license.

Permitted: academic research use; derivative works (under the same license)
Required: attribution
Not permitted: commercial use

Citation

If you use MathSpatial in your research, please cite our paper:

@article{lu2026mllms,
  title   = {Do MLLMs Really Understand Space? A Mathematical Spatial Reasoning Evaluation},
  author  = {Lu, Shuo and Cheng, Jianjie and Xu, Yinuo and
             Yu, Yongcan and Sheng, Lijun and Wang, Peijie and
             Jiang, Siru and Hu, Yongguan and Ling, Run and
             Shao, Yihua and Ma, Ao and Feng, Wei and
             He, Lingxiao and Wang, Meng and Xie, Qianlong and
             Wang, Xingxing and Sebe, Nicu and He, Ran and
             Liang, Jian},
  journal = {arXiv preprint arXiv:2602.11635},
  year    = {2026}
}

Supplementary Material

Additional details from the paper, including atomic annotation examples and ablation studies. Click to expand.

Atomic Annotation — Complete Worked Example
SRT Worked Example

A complete atomic annotation example showing the Correlate → Constrain → Infer paradigm applied to a multi-view spatial reasoning problem.

Ablation — Impact of Removing Individual Atomic Operations
SRT Ablation

Ablation study: complete solution with all three atomic operations versus degraded outputs when each operation (Correlate, Constrain, or Infer) is removed in turn. Removing any single operation leads to distinct failure modes.

Subcategory Definitions — 11 Fine-Grained Task Types

Holistic Recognition (4 subtypes)

  • C3M — Complex 3-View to 3D Matching
  • 3VM — 3D Model to 3-View Matching
  • IVI — Incorrect View Identification
  • CR — Component Recognition

Generative Inference (3 subtypes)

  • MVC — Missing View Completion
  • VT — Viewpoint Transformation
  • PRA — Projection Rule Application

Abstract Deduction (4 subtypes)

  • GPC — Geometric Property Calculation
  • FPBI — Function & Physical Behavior Inference
  • SCP — Structural Change Prediction
  • PPCD — Path Planning & Collision Detection
Data Construction Pipeline — Multi-Stage Quality Control
35,428 raw candidates → Manual filtering (61.1%) → Deduplication (MD5 + SBERT + GPT-4.1 vision) → Geometric consistency check → 10,000 final problems

Sources: Baidu Wenku, Zujuan, provincial exam archives. Deduplication similarity thresholds: ≥0.90 (remove), 0.85–0.90 (manual review), <0.85 (retain). 1K items were flagged for human-in-the-loop review; 60% were corrected and 40% removed.
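
The similarity thresholds above amount to a three-way routing rule. A minimal sketch; the function name is ours, and we assume the score is a pairwise similarity in [0, 1]:

```python
def route_by_similarity(score: float) -> str:
    """Map a duplicate-similarity score to a pipeline action:
    >= 0.90 remove, 0.85-0.90 manual review, < 0.85 retain."""
    if score >= 0.90:
        return "remove"
    if score >= 0.85:
        return "manual_review"
    return "retain"

for s in (0.95, 0.87, 0.60):
    print(s, route_by_similarity(s))
```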

Human Evaluation Protocol

Participants: 80 middle/high school students · Redundancy: 20× per problem · Total responses: 40,000

Conditions: Closed-book, no calculators/software, pen and paper only

Result: Micro-averaged accuracy = 96.3%
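
Micro-averaging, as used for the human result, pools every individual response (here, 20 redundant answers per problem) rather than averaging per-problem scores first. A minimal sketch, assuming responses are stored as (problem_id, is_correct) pairs; this layout is our own illustration:

```python
def micro_accuracy(responses):
    """Micro-averaged accuracy: total correct answers divided by
    the total number of pooled responses."""
    correct = sum(1 for _, ok in responses if ok)
    return correct / len(responses)

# toy example: 2 problems x 3 redundant responses each
responses = [("p1", True), ("p1", True), ("p1", False),
             ("p2", True), ("p2", True), ("p2", True)]
print(micro_accuracy(responses))  # 5/6 ≈ 0.833
```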