RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

1Zhejiang University, 2Shanghai Jiao Tong University, 3Shanghai AI Laboratory, 4Tsinghua University, 5China University of Mining and Technology, *Equal contribution, †Corresponding authors

RoboProcessBench evaluates whether VLMs understand how robotic manipulation unfolds, not only whether it succeeds.

12 diagnostic task families (Static Monitoring + Dynamic Reasoning)
~58k process-aware QA pairs
260 manipulation tasks

Overview

Process-aware evaluation

Evaluate contact, motion, progress, temporal order, and primitive-level process cues across 12 diagnostic families.

ProcessData-58k

A physically grounded QA corpus built from 260 manipulation execution traces.

Trainable evaluators

Supervised fine-tuning (SFT) on ProcessData turns benchmark supervision into VLM-based process evaluators.

Why process-aware?

Outcome-only evaluation

Did the task succeed?

Final-state judgment. Sparse signal.

Process-aware evaluation

Is the execution unfolding correctly?

Contact · motion · progress · temporal · primitive cues. Dense diagnostic signal.

Task taxonomy

The 12 tasks cover current-state monitoring, temporal reasoning, and primitive-aware extensions.

ID | Task | Input
Static Monitoring
T1 | Phase Recognition | Single frame
T2 | Contact Detection | Single frame
T4 | Bimanual Coordination State | Ordered clip
T10† | Current Primitive Recognition | Ordered clip
Dynamic Reasoning
T3 | Motion Direction Prediction | Ordered clip
T5 | Primitive-local Progress | Ordered clip
T6 | Motion State Recognition | Ordered clip
T7 | Operation Outcome Prediction | Ordered clip
T8 | Temporal Ordering | Shuffled frames
T9 | Temporal Priority Prediction | Pairwise frames
T11† | Next Primitive Prediction | Ordered clip
T12† | Primitive Chain Restoration | Ordered clip

† Primitive-aware extension. Input types: Single · Ordered · Shuffled · Pairwise
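For readers who want to script against the taxonomy, the table can be written down as a small lookup structure. This is only a minimal sketch: the `Task` class, its field names, and the `tasks_by_family` helper are illustrative assumptions, not part of any released RoboProcessBench code. The IDs, names, families, and input types are copied from the table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One row of the taxonomy table (field names are hypothetical)."""
    task_id: str
    name: str
    family: str                    # "Static Monitoring" or "Dynamic Reasoning"
    input_type: str                # as listed in the Input column
    primitive_aware: bool = False  # True for rows marked with a dagger


TASKS = [
    Task("T1", "Phase Recognition", "Static Monitoring", "Single frame"),
    Task("T2", "Contact Detection", "Static Monitoring", "Single frame"),
    Task("T4", "Bimanual Coordination State", "Static Monitoring", "Ordered clip"),
    Task("T10", "Current Primitive Recognition", "Static Monitoring", "Ordered clip", True),
    Task("T3", "Motion Direction Prediction", "Dynamic Reasoning", "Ordered clip"),
    Task("T5", "Primitive-local Progress", "Dynamic Reasoning", "Ordered clip"),
    Task("T6", "Motion State Recognition", "Dynamic Reasoning", "Ordered clip"),
    Task("T7", "Operation Outcome Prediction", "Dynamic Reasoning", "Ordered clip"),
    Task("T8", "Temporal Ordering", "Dynamic Reasoning", "Shuffled frames"),
    Task("T9", "Temporal Priority Prediction", "Dynamic Reasoning", "Pairwise frames"),
    Task("T11", "Next Primitive Prediction", "Dynamic Reasoning", "Ordered clip", True),
    Task("T12", "Primitive Chain Restoration", "Dynamic Reasoning", "Ordered clip", True),
]


def tasks_by_family(tasks):
    """Group task IDs by family, preserving table order."""
    grouped = defaultdict(list)
    for task in tasks:
        grouped[task.family].append(task.task_id)
    return dict(grouped)


if __name__ == "__main__":
    for family, ids in tasks_by_family(TASKS).items():
        print(family, ids)
```

Keeping the family and input type on each record makes it straightforward to slice results by family or by input modality later on.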

ProcessData-58k

ProcessData is built from four physically grounded manipulation datasets, each contributing distinct process-relevant signals.

GM-100

Long-tail goal-conditioned manipulation trajectories.

RH20T

Contact-rich multimodal demonstration traces.

REASSEMBLE

Primitive-level action chains with fine-grained segmentation.

AIST-Bimanual

Bimanual motion and coordination sequences.

Phase · Contact · Motion · Progress · Temporal · Primitive · Bimanual

Multi-source coverage across seven key process signal dimensions.

Results

Key findings

Current VLMs show fragmented process understanding: they can recognize some local states, but struggle with primitive-local progress and temporal reasoning.

Finding 1: Zero-shot strengths are fragmented. No single VLM excels across all 12 tasks; strengths cluster by task family.

Finding 2: Local process states are easier than primitive-local progress. VLMs recognize static states (contact, phase) but fail on within-primitive progress estimation (T5).

Finding 3: Temporal reconstruction remains near chance. Temporal ordering (T8) and temporal priority prediction (T9), which asks which of two frames comes earlier, are the hardest tasks across all models.
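To make "near chance" concrete, it helps to compute what a random guesser would score. The sketch below rests on assumptions the page does not confirm: that T8 is scored by exact match on the full frame order, that T9 is a two-way choice, and that the frame counts shown are hypothetical. The function names are illustrative only.

```python
from math import factorial


def ordering_chance(num_frames: int) -> float:
    """Chance of guessing the exact order of `num_frames` shuffled frames.

    Assumes exact-match scoring: a uniformly random permutation is
    correct with probability 1 / n!.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return 1.0 / factorial(num_frames)


def pairwise_chance() -> float:
    """Chance of picking the earlier of two frames by guessing (assumed binary choice)."""
    return 0.5


if __name__ == "__main__":
    # Hypothetical frame counts, for illustration only.
    for n in (3, 4, 5):
        print(f"{n} shuffled frames: chance = {ordering_chance(n):.4f}")
    print(f"pairwise: chance = {pairwise_chance():.2f}")
```

Under these assumptions the chance baseline for ordering drops quickly as frames are added, so the same raw accuracy means different things on T8 and T9.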

Comprehensive evaluation on RoboProcessBench

From benchmark to supervision

ProcessData also enables SFT to build dedicated process evaluators. Fine-tuning on ProcessData-SFT yields task-family-level diagnostic gains on contact (T2), motion direction (T3), progress (T5), temporal ordering (T8), and next primitive (T11).

Base VLMs: Qwen2.5-VL-7B, InternVL-3-8B
↓
Supervised Fine-Tuning: ProcessData-SFT (12 task families · ~58k QA pairs)
↓
Output: Process Evaluator (VLM fine-tuned for process-aware diagnosis)
↓
Held-out Evaluation: ProcessData-Eval
↓
Result: Improved local state recognition · Enhanced motion understanding · Primitive-aware reasoning
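One way to read the pipeline's final step is as a per-task comparison between a base model and its fine-tuned counterpart on the held-out split. The following is a minimal sketch, not the released scoring script: the record keys (`task_id`, `prediction`, `answer`), the case-insensitive exact-match rule, and the helper names are all assumptions, and the toy records are made up for illustration and are not results.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task_id: accuracy} using case-insensitive exact match (assumed metric)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task_id"]] += 1
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[rec["task_id"]] += 1
    return {task: correct[task] / total[task] for task in total}


def accuracy_gain(base_acc, tuned_acc):
    """Per-task accuracy difference (tuned minus base) for tasks present in both."""
    return {task: tuned_acc[task] - base_acc[task] for task in base_acc if task in tuned_acc}


# Toy records, invented purely to exercise the functions.
base_records = [
    {"task_id": "T2", "prediction": "no contact", "answer": "contact"},
    {"task_id": "T2", "prediction": "contact", "answer": "contact"},
    {"task_id": "T8", "prediction": "B,A,C", "answer": "A,B,C"},
]
tuned_records = [
    {"task_id": "T2", "prediction": "contact", "answer": "contact"},
    {"task_id": "T2", "prediction": "Contact", "answer": "contact"},
    {"task_id": "T8", "prediction": "A,B,C", "answer": "A,B,C"},
]

if __name__ == "__main__":
    gains = accuracy_gain(per_task_accuracy(base_records), per_task_accuracy(tuned_records))
    for task, gain in sorted(gains.items()):
        print(task, gain)
```

Reporting gains per task rather than as one pooled number keeps the diagnostic character of the benchmark: an improvement on contact detection is not allowed to mask a flat result on temporal ordering.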

Release

Dataset

ProcessData-SFT & ProcessData-Eval, metadata, splits, annotation details.

Coming soon

Evaluation

Prompt templates, scoring scripts, per-task evaluation protocol.

Coming soon

Models

ProcessData-SFT-Qwen, ProcessData-SFT-Intern checkpoints.

Coming soon

Documentation

Construction details, license, reproducibility guide.

Coming soon

Citation

If you use RoboProcessBench, ProcessData, or the evaluation suite, please cite:

@inproceedings{xia2026roboprocessbench,
  title  = {RoboProcessBench: Benchmarking Process-Aware Understanding in
            Vision-Language Robotic Manipulation},
  author = {Xia, Dayu and Shi, Yue and Mu, Yao and Ji, Huiting and
            Ma, Chaofan and Zhou, Yingjie and Chen, Hua and
            Liu, Yang and Cao, Jiezhang and Zhai, Guangtao},
  year   = {2026},
}