🔍 RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

Dayu Xia ^1,2,*

Yue Shi ^1,*,†

Yao Mu ^1,3,†

Huiting Ji ^5

Chaofan Ma ^3

Yingjie Zhou ^3

Hua Chen ^2

Yang Liu ^4

Jiezhang Cao ^3

Guangtao Zhai ^1,3,†

^1 Shanghai AI Laboratory, ^2 Zhejiang University, ^3 Shanghai Jiao Tong University, ^4 Tsinghua University, ^5 China University of Mining and Technology, ^* Equal contribution, ^† Corresponding authors


RoboProcessBench evaluates whether VLMs understand how robotic manipulation unfolds, not only whether it succeeds.

12 diagnostic task families (Static Monitoring + Dynamic Reasoning)

~58k process-aware QA pairs

260 manipulation tasks

🔍 Overview

Process-aware evaluation

Evaluate contact, motion, progress, temporal order, and primitive-level process cues across 12 diagnostic families.

ProcessData-58k

A physically grounded QA corpus built from 260 manipulation execution traces.

Trainable evaluators

Supervised fine-tuning (SFT) on ProcessData turns benchmark supervision into VLM-based process evaluators.

⚖️ Why process-aware?

Outcome-only evaluation

Did the task succeed?

Final-state judgment. Sparse signal.

Process-aware evaluation

Is the execution unfolding correctly?

Contact · motion · progress · temporal · primitive cues. Dense diagnostic signal.

🧩 Task taxonomy

The 12 tasks cover current-state monitoring, temporal reasoning, and primitive-aware extensions.

ID	Task	Input
Static Monitoring
T1	Phase Recognition	Single frame
T2	Contact Detection	Single frame
T4	Bimanual Coordination State	Ordered clip
T10^†	Current Primitive Recognition	Ordered clip
Dynamic Reasoning
T3	Motion Direction Prediction	Ordered clip
T5	Primitive-local Progress	Ordered clip
T6	Motion State Recognition	Ordered clip
T7	Operation Outcome Prediction	Ordered clip
T8	Temporal Ordering	Shuffled frames
T9	Temporal Priority Prediction	Pairwise frames
T11^†	Next Primitive Prediction	Ordered clip
T12^†	Primitive Chain Restoration	Ordered clip

^† Primitive-aware extension.
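The taxonomy table can be captured as a small lookup structure, which is handy when slicing results by family or input type. The following is a minimal sketch, not the released data format: the `Task` class, its field names, and the lowercase identifiers such as `ordered_clip` are hypothetical; only the task IDs, names, families, input types, and † markers come from the table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Field names are illustrative assumptions, not the released schema.
    task_id: str
    name: str
    family: str            # "static" (Static Monitoring) or "dynamic" (Dynamic Reasoning)
    input_type: str        # single_frame, ordered_clip, shuffled_frames, pairwise_frames
    primitive_aware: bool = False   # True for tasks marked with the dagger


# Rows transcribed from the taxonomy table, in table order.
TASKS = [
    Task("T1", "Phase Recognition", "static", "single_frame"),
    Task("T2", "Contact Detection", "static", "single_frame"),
    Task("T4", "Bimanual Coordination State", "static", "ordered_clip"),
    Task("T10", "Current Primitive Recognition", "static", "ordered_clip", True),
    Task("T3", "Motion Direction Prediction", "dynamic", "ordered_clip"),
    Task("T5", "Primitive-local Progress", "dynamic", "ordered_clip"),
    Task("T6", "Motion State Recognition", "dynamic", "ordered_clip"),
    Task("T7", "Operation Outcome Prediction", "dynamic", "ordered_clip"),
    Task("T8", "Temporal Ordering", "dynamic", "shuffled_frames"),
    Task("T9", "Temporal Priority Prediction", "dynamic", "pairwise_frames"),
    Task("T11", "Next Primitive Prediction", "dynamic", "ordered_clip", True),
    Task("T12", "Primitive Chain Restoration", "dynamic", "ordered_clip", True),
]


def group_by(tasks, key):
    """Map each value of attribute `key` to the task IDs that carry it."""
    groups = defaultdict(list)
    for task in tasks:
        groups[getattr(task, key)].append(task.task_id)
    return dict(groups)


by_family = group_by(TASKS, "family")
by_input = group_by(TASKS, "input_type")
primitive_aware = [t.task_id for t in TASKS if t.primitive_aware]

print(by_family)
print(by_input)
print(primitive_aware)
```

Grouping this way makes it easy to see that the two families are uneven in size (4 static, 8 dynamic) and that most tasks take an ordered clip as input.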

📊 Results

💡 Key findings

Current VLMs are fragmented: they can recognize some local states, but struggle with primitive-local progress and temporal reasoning.

Strengths are fragmented

No single VLM excels across all 12 tasks; performance clusters by task family rather than model scale.

State easier than progress

VLMs recognize static states (contact, phase) but fail on within-primitive progress estimation (T5).

Temporal near chance

Temporal ordering (T8) and temporal priority prediction (T9) remain the hardest tasks for every model, including the post-trained ones.
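To make "near chance" concrete, the sketch below computes random-guess baselines under two assumptions that this page does not confirm: that T9 is a two-way choice between a pair of frames, and that T8 is scored by exact match on the full order of k shuffled frames. The function names and the values of k are illustrative only.

```python
from math import factorial


def choice_chance(num_options: int) -> float:
    """Probability of a uniform random guess being right among num_options."""
    return 1.0 / num_options


def ordering_chance(num_frames: int) -> float:
    """Probability of guessing the exact order of num_frames distinct frames."""
    return 1.0 / factorial(num_frames)


# Assumed two-way choice for the pairwise task (T9), in percent.
t9_chance = 100 * choice_chance(2)

# Exact-match ordering baselines for a few hypothetical clip lengths, in percent.
t8_chance = {k: round(100 * ordering_chance(k), 1) for k in (3, 4, 5)}

print(t9_chance)   # 50.0
print(t8_chance)   # {3: 16.7, 4: 4.2, 5: 0.8}
```

For comparison, the reported T9 scores range from 47.1 to 55.4 and the T8 scores from 15.3 to 22.3. The T8 range sits close to the three-frame value, but the frame count and scoring rule would need to be checked against the evaluation protocol once it is released.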

📋 Comprehensive evaluation on RoboProcessBench

Model	Static Monitoring				Dynamic Reasoning
Model	T1	T2	T4	T10	T3	T5	T6	T7	T8	T9	T11	T12
Open-Source VLMs
Qwen2.5-VL-7B	26.6	41.9	36.4	33.1	35.5	32.2	58.0	54.6	17.9	50.9	33.1	80.4
Qwen3-VL-8B	34.1	42.3	37.5	32.9	30.1	32.7	64.7	60.7	15.9	52.2	59.0	84.8
Qwen3-VL-32B	28.4	53.5	41.9	28.1	50.7	32.4	69.2	50.6	19.7	53.4	47.0	76.1
InternVL-3-8B	31.3	45.6	22.9	31.2	31.1	34.1	54.0	61.9	17.3	48.7	46.5	91.3
InternVL-3.5-8B	37.4	44.3	26.5	36.8	27.7	34.3	53.8	61.9	17.5	50.7	53.5	80.4
InternVL-3-38B	24.6	44.0	25.1	26.2	33.2	34.4	58.1	55.9	17.1	50.6	63.5	82.6
InternVL-3.5-38B	22.4	46.6	29.5	32.6	33.1	34.1	53.1	49.6	15.3	51.0	61.5	80.4
RoboBrain-2.0-7B	29.8	44.0	26.6	32.6	44.4	33.8	51.2	62.5	15.8	49.4	32.6	91.3
GLM-5.1	21.1	37.0	26.0	34.0	42.7	24.0	48.9	53.0	16.5	49.3	60.0	76.7
Closed-Source VLMs
Gemini-3.1-Flash	31.4	47.8	37.3	38.4	33.0	30.5	63.5	48.3	21.8	53.2	67.5	84.8
GPT-4o	29.0	46.1	40.0	32.0	26.1	32.1	68.4	41.1	20.4	49.5	44.5	76.1
GPT-5.4-mini	30.9	49.9	38.1	32.1	46.2	33.0	67.0	56.1	18.9	51.6	45.5	87.0
Claude-Haiku-4.5	24.8	40.7	39.5	23.4	27.1	29.6	61.3	46.2	22.3	52.7	56.0	67.4
Claude-Sonnet-4.6	31.9	52.6	36.3	27.3	54.5	30.8	68.1	49.7	20.2	47.1	44.8	65.2
Post-trained VLMs
ProcessData-SFT-Qwen	58.5	82.7	75.0	92.5	87.7	45.4	92.4	63.4	17.0	51.0	96.5	97.8
ProcessData-SFT-Intern	56.8	81.9	77.3	92.8	88.2	45.3	91.5	66.3	16.3	55.4	97.0	95.7

Accuracy (%) on ProcessData-Eval. Static Monitoring: T1, T2, T4, T10. Dynamic Reasoning: T3, T5–T9, T11, T12.
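Per-task gains from post-training can be read off the table with a few lines of arithmetic. The sketch below compares ProcessData-SFT-Qwen against Qwen2.5-VL-7B; pairing these two rows is an assumption, since the page lists only base sizes (7B · 8B) and does not name the exact base checkpoint. The unweighted family means are likewise an illustrative summary, not a metric reported in the table.

```python
# Column order as it appears in the results table.
COLUMNS = ["T1", "T2", "T4", "T10", "T3", "T5", "T6", "T7", "T8", "T9", "T11", "T12"]
STATIC = ["T1", "T2", "T4", "T10"]
DYNAMIC = [t for t in COLUMNS if t not in STATIC]

# Rows copied from the table. Treating Qwen2.5-VL-7B as the base of the
# SFT model is an assumption made for this example.
base_row = [26.6, 41.9, 36.4, 33.1, 35.5, 32.2, 58.0, 54.6, 17.9, 50.9, 33.1, 80.4]
sft_row = [58.5, 82.7, 75.0, 92.5, 87.7, 45.4, 92.4, 63.4, 17.0, 51.0, 96.5, 97.8]

base = dict(zip(COLUMNS, base_row))
sft = dict(zip(COLUMNS, sft_row))


def family_mean(scores, tasks):
    """Unweighted mean accuracy over the given task IDs."""
    return sum(scores[t] for t in tasks) / len(tasks)


# Per-task change in accuracy (percentage points) after fine-tuning.
delta = {t: round(sft[t] - base[t], 1) for t in COLUMNS}

# Tasks whose accuracy moved by less than one point.
flat = [t for t in COLUMNS if abs(delta[t]) < 1.0]

print(delta)
print(flat)
print(round(family_mean(base, STATIC), 1), round(family_mean(sft, STATIC), 1))
print(round(family_mean(base, DYNAMIC), 1), round(family_mean(sft, DYNAMIC), 1))
```

Under this pairing, the largest gains fall on T11 and T10, while T8 and T9 move by less than one point.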

🔄 From benchmark to supervision

ProcessData also enables SFT to build dedicated process evaluators. In the results table, fine-tuning on ProcessData-SFT yields large gains on contact (T2), motion direction (T3), and next primitive (T11), and a smaller gain on progress (T5); temporal ordering (T8) stays near 17% for both post-trained models.

Base VLMs: Qwen, Intern (7B · 8B)
→ Training: ProcessData-SFT (12 families · ~58k QA)
→ Output: process evaluator (process-aware VLM)
→ Held-out evaluation: ProcessData-Eval
→ Gains: ↑↑ T1–T4, T6–T7, T10–T11; ↑ T5, T12

🚀 Release

💾 Dataset

ProcessData-SFT & ProcessData-Eval, metadata, splits, annotation details.

Coming soon

🧪 Evaluation

Prompt templates, scoring scripts, per-task evaluation protocol.

Coming soon

🧠 Models

ProcessData-SFT-Qwen, ProcessData-SFT-Intern checkpoints.

Coming soon

📖 Documentation

Construction details, license, reproducibility guide.

Coming soon

📝 Citation

If you use RoboProcessBench, ProcessData, or the evaluation suite, please cite:

@article{xia2026roboprocessbench,
  title     = {RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language
               Robotic Manipulation},
  author    = {Xia, Dayu and Shi, Yue and Mu, Yao and Ji, Huiting and Ma, Chaofan and Zhou, Yingjie and
               Chen, Hua and Liu, Yang and Cao, Jiezhang and Zhai, Guangtao},
  journal   = {arXiv preprint arXiv:2606.13040},
  year      = {2026},
}