Do MLLMs Really Understand Space? A Mathematical Spatial Reasoning Evaluation
MathSpatial is a large-scale, open dataset ecosystem dedicated to mathematical spatial reasoning in MLLMs. It provides 10,000 problems with 26,000+ geometric diagrams, covering tasks from multi-view recognition to geometric property calculation.
Left: Humans achieve over 95% accuracy on MathSpatial-Bench while most MLLMs remain below 60%. Right: Three core challenges and how MathSpatial addresses them.
2,000 problems with 5,837 images for rigorous diagnostic evaluation with calibrated difficulty across 3 categories and 11 subtypes.
8,000 problems with 20,308+ images for training with verified solutions. Sourced from authentic educational materials spanning K-12 levels.
10,000 structured reasoning traces following the Correlate → Constrain → Infer paradigm for interpretable intermediate supervision.
3 main categories × 11 subtypes covering representative tasks in mathematical spatial reasoning education.
MathSpatial-Bench: 3 main categories × 11 subtypes. Holistic Recognition (518), Generative Inference (636), and Abstract Deduction (846).
(a) Category distribution across Bench & Corpus. (b) Subcategory breakdown. (c) 58% multiple-choice, 42% fill-in-blank for Bench. (d) Balanced A/B/C/D answer distribution.
Figure: Bench vs. Corpus comparison across categories.
Figure: Image count per problem and question length distributions.
MathSpatial vs. existing benchmarks in mathematical and spatial reasoning.
| Dataset | #Tasks | #Samples | Bilingual | Spatial Focus | Train Set |
|---|---|---|---|---|---|
| EmbSpatial-Bench | 6 | 3,640 | ✗ | ✗ | ✓ |
| Space3D-Bench | 6 | 211 | ✗ | ✗ | ✗ |
| SpatialRGPT-Bench | 12 | 1,406 | ✗ | ✗ | ✓ |
| BLINK-Spatial | 14 | 286 | ✗ | ✗ | ✗ |
| SpatialVLM | 2 | 546 | ✗ | ✗ | ✓ |
| GeoEval | 3 | 5,050 | ✗ | ✗ | ✗ |
| 3DSRBench | 4 | 6,942 | ✗ | ✗ | ✗ |
| SOLIDGEO | 8 | 3,113 | ✓ | ✓ | ✗ |
| SpatialBot-Bench | 5 | 200 | ✗ | ✗ | ✓ |
| VSI-Bench | 8 | 288 | ✗ | ✗ | ✗ |
| OmniSpatial | 50 | 1,387 | ✗ | ✗ | ✗ |
| MathSpatial (Ours) | 11 | 10,000 | ✓ | ✓ | ✓ |
Examples from three main categories: Holistic Recognition, Generative Inference, and Abstract Deduction.
Holistic Recognition: matching 3D objects to views. Generative Inference: completing missing views. Abstract Deduction: calculating geometric properties.
Representative examples from each of the 11 subcategories, showing the diversity of spatial reasoning tasks across Holistic Recognition, Generative Inference, and Abstract Deduction.
Every problem in MathSpatial-Corpus is annotated with structured reasoning traces decomposing spatial reasoning into three atomic operations: Correlate, Constrain, and Infer.
Establish cross-view correspondences between geometric features.
Apply geometric rules, projection constraints, and consistency conditions.
Deduce hidden attributes, measurements, or final answers from evidence.
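The three atomic operations can be pictured as a typed sequence of steps. The sketch below is illustrative only: the dataclass fields and the example step texts are assumptions, not the actual MathSpatial-Corpus schema.

```python
from dataclasses import dataclass

# Illustrative schema for a structured reasoning trace; the real
# corpus field names may differ.
@dataclass
class Step:
    op: str    # one of "Correlate", "Constrain", "Infer"
    text: str  # natural-language content of the step

# A hypothetical trace for a multi-view volume problem.
trace = [
    Step("Correlate", "Match the front-view rectangle to face ABCD of the prism."),
    Step("Constrain", "Orthographic projection preserves the height h = 4 in the side view."),
    Step("Infer", "Volume = base area x height = 6 x 4 = 24."),
]

# A well-formed trace uses only the three atomic operations.
assert all(s.op in {"Correlate", "Constrain", "Infer"} for s in trace)
print(len(trace))  # prints 3
```

Representing traces this way makes the intermediate supervision checkable: a verifier can reject any step whose operation falls outside the three-operation vocabulary.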
(a) Atomic operation distribution: Infer ~42% > Correlate ~33% > Constrain ~25%. (b) Average trace length: 5.1 steps.
Overall accuracy on MathSpatial-Bench. Humans achieve 96.3% while the best-performing MLLM (GPT-5) reaches 58.5%.
| # | Model | Type | Overall | Best Subtype | Worst Subtype | Avg Tokens |
|---|---|---|---|---|---|---|
| — | Human | Human | 96.3% | C3M (100%) | 3VM (89.4%) | — |
| 🥇 1 | GPT-5 | Closed | 58.5% | PRA (71.7%) | C3M (28.6%) | 676 |
| 🥈 2 | Gemini-2.5-flash | Closed | 48.5% | MVC (58.5%) | FPBI (25.0%) | 1,115 |
| 🥉 3 | Gemini-2.5-pro | Closed | 44.9% | MVC (58.5%) | CR (26.7%) | 914 |
| 4 | Claude Sonnet 4 | Closed | 26.7% | PRA (60.7%) | GPC (0.2%) | 1,006 |
| 5 | Llama-4-Maverick-17B | Open | 23.8% | PRA (47.6%) | VT (4.5%) | 845 |
| 6 | MathSpatial-InternVL3-8B ⭐ | Ours | 22.6% | PRA (56.6%) | C3M (2.0%) | 318 |
| 7 | GPT-4.1 | Closed | 22.6% | PRA (55.2%) | VT (0.0%) | 676 |
| 8 | Claude 3.5 Sonnet | Closed | 22.3% | PRA (57.2%) | GPC (0.0%) | 859 |
| 9 | MathSpatial-Qwen-7B ⭐ | Ours | 22.1% | IVI (46.2%) | GPC (0.0%) | 352 |
| 10 | Claude 3.7 Sonnet | Closed | 21.5% | PRA (57.9%) | GPC (0.0%) | 886 |
| 11 | GLM-4.5V | Open | 21.0% | PRA (57.9%) | GPC (0.0%) | 1,391 |
| 12 | MathSpatial-Llama3-8B ⭐ | Ours | 20.3% | CR (41.1%) | PPCD (7.7%) | 397 |
| 13 | Qwen2.5-VL-72B | Open | 19.7% | PRA (47.6%) | VT (0.0%) | 498 |
| 14 | GPT-4o | Closed | 19.6% | PRA (45.5%) | GPC (0.0%) | 677 |
| 15 | Qwen2.5-VL-7B | Open | 17.8% | PRA (41.4%) | C3M (0.0%) | 465 |
| 16 | InternVL3-8B | Open | 17.4% | PRA (40.0%) | C3M (0.0%) | 474 |
| 17 | Llama3-8B | Open | 15.0% | PRA (23.4%) | PPCD (7.7%) | 785 |
Full results for all 17+ evaluated models are available in the paper. Our fine-tuned models achieve competitive accuracy compared to larger closed-source systems while using 50–60% fewer tokens. Want to submit your model? Open an issue on GitHub.
MathSpatial reveals notable limitations of current MLLMs in spatial reasoning.
Humans achieve 96.3% accuracy; the best-performing MLLM (GPT-5) reaches 58.5%, indicating a 37.8 percentage-point gap in spatial reasoning.
Most models score below 5% on Geometric Property Calculation (GPC), the most abstract subtype requiring multi-step numerical reasoning.
Qwen2.5-VL-72B (19.7%) barely improves over its 7B variant (17.8%), suggesting that spatial reasoning may not be easily addressed by scaling alone.
Fine-tuned models show consistent accuracy gains while reducing token usage by 20–30% through structured reasoning-trace supervision.
Overall accuracy comparison across all 17+ evaluated models on MathSpatial-Bench.
Fine-grained error taxonomy reveals distinct failure patterns across model families, highlighting open challenges in spatial reasoning.
(a) Error frequency distribution for baselines across 6 subcategories. (b) Overall error rate breakdown by failure mode.
The dominant failure mode: models produce incomplete or inconsistent chains of thought. Fine-tuning on MathSpatial-Corpus reduces this error type.
Outputs break orthographic projection rules or visibility constraints. Claude models are particularly prone to this error type.
Models misinterpret top, side, or front views. Gemini-2.5-flash suffers more from this error category than other models.
Feature errors (10.5%): omitting components. Scale errors (7.2%): failing to preserve sizes. Deduction failures (2.3%): inability to synthesize multiple cues.
The two core challenges — enforcing low-level geometric consistency and sustaining coherent multi-step reasoning — remain largely unsolved by current MLLMs. The fine-grained error taxonomy provides a diagnostic foundation for targeted improvements.
MathSpatial is fully open-source. Access the dataset and code through the links below.
Full dataset with all 10K problems, images, and structured reasoning traces.
🤗 HuggingFace — Released under the CC BY-NC-SA 4.0 license.
If you use MathSpatial in your research, please cite our paper:
@article{lu2026mllms,
title = {Do MLLMs Really Understand Space? A Mathematical Spatial Reasoning Evaluation},
author = {Lu, Shuo and Cheng, Jianjie and Xu, Yinuo and
Yu, Yongcan and Sheng, Lijun and Wang, Peijie and
Jiang, Siru and Hu, Yongguan and Ling, Run and
Shao, Yihua and Ma, Ao and Feng, Wei and
He, Lingxiao and Wang, Meng and Xie, Qianlong and
Wang, Xingxing and Sebe, Nicu and He, Ran and
Liang, Jian},
journal = {arXiv preprint arXiv:2602.11635},
year = {2026}
}
Additional details from the paper, including atomic annotation examples and ablation studies.
A complete atomic annotation example showing the Correlate → Constrain → Infer paradigm applied to a multi-view spatial reasoning problem.
Ablation study: complete solution with all three atomic operations versus degraded outputs when each operation (Correlate, Constrain, or Infer) is removed in turn. Removing any single operation leads to distinct failure modes.
Sources: Baidu Wenku, Zujuan, provincial exam archives. Filtering thresholds: ≥0.90 (remove), 0.85–0.90 (manual review), <0.85 (retain). 1K items flagged for human-in-the-loop review; 60% corrected, 40% removed.
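The three-band filtering rule above can be sketched as a simple triage function. This is a minimal illustration: the function name and the similarity metric it consumes are assumptions, not the paper's actual pipeline code.

```python
def triage(similarity: float) -> str:
    """Route a candidate problem by its similarity score to existing items.

    Bands follow the stated thresholds: >= 0.90 remove,
    0.85-0.90 manual review, < 0.85 retain.
    """
    if similarity >= 0.90:
        return "remove"
    if similarity >= 0.85:
        return "manual_review"
    return "retain"

# Boundary behavior: 0.90 falls in the remove band, 0.85 in review.
print(triage(0.95), triage(0.87), triage(0.50))  # remove manual_review retain
```

Keeping the thresholds in one function makes the boundary cases (exactly 0.90 or 0.85) explicit and easy to audit.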
Participants: 80 middle/high school students · Redundancy: 20× per problem · Total responses: 40,000
Conditions: Closed-book, no calculators/software, pen and paper only
Result: Micro-averaged accuracy = 96.3%
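Micro-averaging pools all 40,000 responses before dividing, so every response carries equal weight regardless of subtype size. A minimal sketch, with illustrative counts (not the study's actual per-subtype numbers):

```python
def micro_accuracy(correct_per_subtype, total_per_subtype):
    """Micro-averaged accuracy: total correct over total responses,
    pooled across all subtypes."""
    return sum(correct_per_subtype) / sum(total_per_subtype)

# Hypothetical counts for three subtypes of different sizes.
correct = [950, 1980, 970]
total = [1000, 2000, 1000]
print(round(micro_accuracy(correct, total), 3))  # 0.975
```

Unlike macro-averaging (the mean of per-subtype accuracies), this weights larger subtypes proportionally, which is why a single large, easy subtype can dominate the headline number.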