RoboProcessBench evaluates whether VLMs understand how robotic manipulation unfolds, not only whether it succeeds.
Static Monitoring + Dynamic Reasoning
Overview
Process-aware evaluation
Evaluate contact, motion, progress, temporal order, and primitive-level process cues across 12 diagnostic families.
ProcessData-58k
A physically grounded QA corpus built from 260 manipulation execution traces.
Trainable evaluators
Supervised fine-tuning (SFT) on ProcessData turns benchmark supervision into VLM-based process evaluators.
Why process-aware?
Outcome-only evaluation
Did the task succeed?
Final-state judgment. Sparse signal.
Process-aware evaluation
Is the execution unfolding correctly?
Contact · motion · progress · temporal · primitive cues. Dense diagnostic signal.
Task taxonomy
The 12 tasks cover current-state monitoring, temporal reasoning, and primitive-aware extensions.
| ID | Task | Input |
|---|---|---|
| Static Monitoring | | |
| T1 | Phase Recognition | Single frame |
| T2 | Contact Detection | Single frame |
| T4 | Bimanual Coordination State | Ordered clip |
| T10† | Current Primitive Recognition | Ordered clip |
| Dynamic Reasoning | | |
| T3 | Motion Direction Prediction | Ordered clip |
| T5 | Primitive-local Progress | Ordered clip |
| T6 | Motion State Recognition | Ordered clip |
| T7 | Operation Outcome Prediction | Ordered clip |
| T8 | Temporal Ordering | Shuffled frames |
| T9 | Temporal Priority Prediction | Pairwise frames |
| T11† | Next Primitive Prediction | Ordered clip |
| T12† | Primitive Chain Restoration | Ordered clip |

† Primitive-aware extension. Input types: Single · Ordered · Shuffled · Pairwise.
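For readers who want to work with the taxonomy programmatically, the table above can be written down as a small lookup structure. This is only an illustrative sketch: the `Task` class, its field names, and the `group_by` helper are hypothetical and are not part of any released RoboProcessBench code. The task IDs, names, families, and input types are copied from the table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One row of the taxonomy table (field names are hypothetical)."""
    task_id: str
    name: str
    family: str            # "static" or "dynamic"
    input_type: str        # value of the Input column
    primitive_aware: bool = False  # rows marked with a dagger


TASKS = [
    Task("T1", "Phase Recognition", "static", "Single frame"),
    Task("T2", "Contact Detection", "static", "Single frame"),
    Task("T4", "Bimanual Coordination State", "static", "Ordered clip"),
    Task("T10", "Current Primitive Recognition", "static", "Ordered clip", True),
    Task("T3", "Motion Direction Prediction", "dynamic", "Ordered clip"),
    Task("T5", "Primitive-local Progress", "dynamic", "Ordered clip"),
    Task("T6", "Motion State Recognition", "dynamic", "Ordered clip"),
    Task("T7", "Operation Outcome Prediction", "dynamic", "Ordered clip"),
    Task("T8", "Temporal Ordering", "dynamic", "Shuffled frames"),
    Task("T9", "Temporal Priority Prediction", "dynamic", "Pairwise frames"),
    Task("T11", "Next Primitive Prediction", "dynamic", "Ordered clip", True),
    Task("T12", "Primitive Chain Restoration", "dynamic", "Ordered clip", True),
]


def group_by(tasks, key):
    """Group task IDs by one attribute, e.g. 'family' or 'input_type'."""
    groups = defaultdict(list)
    for task in tasks:
        groups[getattr(task, key)].append(task.task_id)
    return dict(groups)


if __name__ == "__main__":
    print(group_by(TASKS, "family"))
    print(group_by(TASKS, "input_type"))
```

Grouping by `family` reproduces the Static Monitoring / Dynamic Reasoning split, and grouping by `input_type` shows which tasks share an input format.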
ProcessData-58k
ProcessData is built from four physically grounded manipulation datasets, each contributing distinct process-relevant signals.
GM-100
Long-tail goal-conditioned manipulation trajectories.
RH20T
Contact-rich multimodal demonstration traces.
REASSEMBLE
Primitive-level action chains with fine-grained segmentation.
AIST-Bimanual
Bimanual motion and coordination sequences.
Multi-source coverage across seven key process signal dimensions.
Results
Key findings
Current VLMs show fragmented process understanding: they recognize some local states but struggle with primitive-local progress and temporal reasoning.
Finding 1: Zero-shot strengths are fragmented. No single VLM excels across all 12 tasks; strengths cluster by task family.
Finding 2: Local process states are easier than primitive-local progress. VLMs recognize static states (contact, phase) but fail on within-primitive progress estimation (T5).
Finding 3: Temporal reconstruction remains near chance. Temporal ordering (T8) and temporal priority prediction (T9) are the hardest tasks across all models.
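To make "near chance" concrete, the sketch below computes uniform-guess baselines for the two temporal tasks. It rests on assumptions this page does not confirm: that temporal ordering is scored by exact match on the full frame order, and that the pairwise task is a two-way choice. The number of shuffled frames is a free parameter here, not a value taken from the benchmark, and both function names are hypothetical.

```python
from math import factorial


def ordering_chance(n_frames: int) -> float:
    """Probability that a uniform random guess recovers the exact order
    of n distinct shuffled frames (assumes exact-match scoring)."""
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    return 1.0 / factorial(n_frames)


def pairwise_chance() -> float:
    """Probability of guessing which of two frames came first
    (assumes a binary choice with no ties)."""
    return 0.5


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} frames: {100 * ordering_chance(n):.1f}% chance")
    print(f"pairwise: {100 * pairwise_chance():.1f}% chance")
```

Comparing a model's reported accuracy against the baseline that matches the actual protocol, once released, shows how much of the score reflects temporal understanding rather than guessing.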
Comprehensive evaluation on RoboProcessBench

| Model | T1 | T2 | T4 | T10 | T3 | T5 | T6 | T7 | T8 | T9 | T11 | T12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source VLMs | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 26.6 | 41.9 | 36.4 | 33.1 | 35.5 | 32.2 | 58.0 | 54.6 | 17.9 | 50.9 | 33.1 | 80.4 |
| Qwen3-VL-8B | 34.1 | 42.3 | 37.5 | 32.9 | 30.1 | 32.7 | 64.7 | 60.7 | 15.9 | 52.2 | 59.0 | 84.8 |
| Qwen3-VL-32B | 28.4 | 53.5 | 41.9 | 28.1 | 50.7 | 32.4 | 69.2 | 50.6 | 19.7 | 53.4 | 47.0 | 76.1 |
| InternVL-3-8B | 31.3 | 45.6 | 22.9 | 31.2 | 31.1 | 34.1 | 54.0 | 61.9 | 17.3 | 48.7 | 46.5 | 91.3 |
| InternVL-3.5-8B | 37.4 | 44.3 | 26.5 | 36.8 | 27.7 | 34.3 | 53.8 | 61.9 | 17.5 | 50.7 | 53.5 | 80.4 |
| InternVL-3-38B | 24.6 | 44.0 | 25.1 | 26.2 | 33.2 | 34.4 | 58.1 | 55.9 | 17.1 | 50.6 | 63.5 | 82.6 |
| InternVL-3.5-38B | 22.4 | 46.6 | 29.5 | 32.6 | 33.1 | 34.1 | 53.1 | 49.6 | 15.3 | 51.0 | 61.5 | 80.4 |
| RoboBrain-2.0-7B | 29.8 | 44.0 | 26.6 | 32.6 | 44.4 | 33.8 | 51.2 | 62.5 | 15.8 | 49.4 | 32.6 | 91.3 |
| GLM-5.1 | 21.1 | 37.0 | 26.0 | 34.0 | 42.7 | 24.0 | 48.9 | 53.0 | 16.5 | 49.3 | 60.0 | 76.7 |
| Closed-Source VLMs | | | | | | | | | | | | |
| Gemini-3.1-Flash | 31.4 | 47.8 | 37.3 | 38.4 | 33.0 | 30.5 | 63.5 | 48.3 | 21.8 | 53.2 | 67.5 | 84.8 |
| GPT-4o | 29.0 | 46.1 | 40.0 | 32.0 | 26.1 | 32.1 | 68.4 | 41.1 | 20.4 | 49.5 | 44.5 | 76.1 |
| GPT-5.4-mini | 30.9 | 49.9 | 38.1 | 32.1 | 46.2 | 33.0 | 67.0 | 56.1 | 18.9 | 51.6 | 45.5 | 87.0 |
| Claude-Haiku-4.5 | 24.8 | 40.7 | 39.5 | 23.4 | 27.1 | 29.6 | 61.3 | 46.2 | 22.3 | 52.7 | 56.0 | 67.4 |
| Claude-Sonnet-4.6 | 31.9 | 52.6 | 36.3 | 27.3 | 54.5 | 30.8 | 68.1 | 49.7 | 20.2 | 47.1 | 44.8 | 65.2 |
| Post-trained VLMs | | | | | | | | | | | | |
| ProcessData-SFT-Qwen | 58.5 | 82.7 | 75.0 | 92.5 | 87.7 | 45.4 | 92.4 | 63.4 | 17.0 | 51.0 | 96.5 | 97.8 |
| ProcessData-SFT-Intern | 56.8 | 81.9 | 77.3 | 92.8 | 88.2 | 45.3 | 91.5 | 66.3 | 16.3 | 55.4 | 97.0 | 95.7 |
Accuracy (%) on ProcessData-Eval. Best per task among zero-shot models. Post-trained scores in orange bold. Static Monitoring: T1, T2, T4, T10. Dynamic Reasoning: T3, T5–T9, T11, T12.
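One way to summarize a row of the table is an unweighted mean per task family. The sketch below does this for two rows copied from the table. The unweighted mean is an assumption made for illustration (the page does not report family-level aggregates or say how tasks would be weighted), and `family_means` is a hypothetical helper.

```python
from statistics import mean

# Task-to-family assignment as stated in the table caption.
STATIC = ("T1", "T2", "T4", "T10")
DYNAMIC = ("T3", "T5", "T6", "T7", "T8", "T9", "T11", "T12")

# Per-task accuracy (%) copied from two rows of the results table.
SCORES = {
    "GPT-4o": {
        "T1": 29.0, "T2": 46.1, "T4": 40.0, "T10": 32.0,
        "T3": 26.1, "T5": 32.1, "T6": 68.4, "T7": 41.1,
        "T8": 20.4, "T9": 49.5, "T11": 44.5, "T12": 76.1,
    },
    "ProcessData-SFT-Qwen": {
        "T1": 58.5, "T2": 82.7, "T4": 75.0, "T10": 92.5,
        "T3": 87.7, "T5": 45.4, "T6": 92.4, "T7": 63.4,
        "T8": 17.0, "T9": 51.0, "T11": 96.5, "T12": 97.8,
    },
}


def family_means(task_scores):
    """Unweighted mean accuracy over the tasks of each family."""
    return {
        "static": mean(task_scores[t] for t in STATIC),
        "dynamic": mean(task_scores[t] for t in DYNAMIC),
    }


if __name__ == "__main__":
    for model, task_scores in SCORES.items():
        means = family_means(task_scores)
        print(f"{model}: static={means['static']:.1f}, dynamic={means['dynamic']:.1f}")
```

A family-level mean hides per-task variation, for example the low T8 scores inside Dynamic Reasoning, so it is best read alongside the full table.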
From benchmark to supervision
ProcessData also supports supervised fine-tuning (SFT) of dedicated process evaluators. In the results table, fine-tuning on ProcessData-SFT yields large gains over the zero-shot models on contact (T2), motion direction (T3), and next primitive (T11) and a smaller gain on progress (T5), while temporal ordering (T8) stays within the zero-shot range.
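The size of the effect on these five tasks can be read off the results table. The sketch below compares ProcessData-SFT-Qwen with the best zero-shot score on each task. Using the best zero-shot model as the reference is an assumption made for illustration, since this page does not state which base checkpoint each post-trained model starts from. All numbers are copied from the table, and `gains` is a hypothetical helper.

```python
# Best zero-shot accuracy (%) per task, taken as the column maximum
# over the open-source and closed-source rows of the results table.
BEST_ZERO_SHOT = {"T2": 53.5, "T3": 54.5, "T5": 34.4, "T8": 22.3, "T11": 67.5}

# ProcessData-SFT-Qwen accuracy (%) on the same tasks.
SFT_QWEN = {"T2": 82.7, "T3": 87.7, "T5": 45.4, "T8": 17.0, "T11": 96.5}


def gains(post_trained, reference):
    """Accuracy difference in percentage points, rounded to one decimal."""
    return {task: round(post_trained[task] - reference[task], 1) for task in reference}


if __name__ == "__main__":
    for task, delta in gains(SFT_QWEN, BEST_ZERO_SHOT).items():
        print(f"{task}: {delta:+.1f} points vs. best zero-shot")
```

Under this comparison the difference is positive on T2, T3, T5, and T11 and negative on T8, which matches Finding 3 that temporal ordering remains difficult.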
Release
Dataset
ProcessData-SFT & ProcessData-Eval, metadata, splits, annotation details.
Coming soon
Evaluation
Prompt templates, scoring scripts, per-task evaluation protocol.
Coming soon
Models
ProcessData-SFT-Qwen, ProcessData-SFT-Intern checkpoints.
Coming soon
Documentation
Construction details, license, reproducibility guide.
Coming soon
Citation
If you use RoboProcessBench, ProcessData, or the evaluation suite, please cite:
@inproceedings{xia2026roboprocessbench,
  title  = {RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation},
  author = {Xia, Dayu and Shi, Yue and Mu, Yao and Ji, Huiting and Ma, Chaofan and Zhou, Yingjie and Chen, Hua and Liu, Yang and Cao, Jiezhang and Zhai, Guangtao},
  year   = {2026},
}