MindLabel — Auto-Labeled VLA Co-Training Data

Shared annotation pipeline for the Mind-Omni family.

MindLabel is an auto-labeling pipeline that runs on WOD-E2E frames with two parallel branches: Scene Understanding VQA (Common · Spatial · Temporal · Motion · Object-Centric) and Action Dreaming (intent-conditioned trajectory synthesis with affordance / exemplar guidance, plus trajectory-evaluation QA). Both branches feed a unified answer-generation module. The main results use only the basic VQA and the 3-class GT intent; the richer outputs are released for the community.

MindLabel pipeline overview
Pipeline. Scene Understanding QA generation and Action Dreaming run in parallel; a unified answer module handles category-specific generation.
MindLabel example annotations
Example. One frame, multiple QA categories plus four intent-conditioned dreamed trajectory variants (Better / Alternative / Conservative / Worse).

Scene Gallery

Scene A — Daytime urban

Scene A: front-camera panorama with dreamed trajectories
Scene A: BEV view with dreamed trajectories

Scene B — Nighttime intersection

Scene B: front-camera panorama with dreamed trajectories
Scene B: BEV view
Scene B: object-centric annotations

Scene C — Rolling past video (full annotation stack)

Past-video input (8 frames, 2 Hz)
Lat. meta: steer_left
Long. meta: decelerate
Intent: turning_left
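The meta-actions and intent above are described as rule-labelled, but the page does not give the rules. The following is a minimal sketch of what such rules could look like, assuming the labels are derived from the future ego trajectory. The thresholds, the function name `label_meta_actions`, and every label other than `steer_left`, `decelerate` and `turning_left` are hypothetical and not taken from MindLabel.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]  # (x forward, y left), in metres

# Hypothetical thresholds: the source does not state the actual rule values.
STEER_THRESH_RAD = math.radians(5.0)   # heading change that counts as steering
TURN_THRESH_RAD = math.radians(30.0)   # heading change that counts as a turn
SPEED_THRESH_MPS = 0.5                 # speed change that counts as accel/decel


def _heading(p: Point, q: Point) -> float:
    """Heading of the segment p -> q, in radians."""
    return math.atan2(q[1] - p[1], q[0] - p[0])


def _wrap(angle: float) -> float:
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def label_meta_actions(waypoints: Sequence[Point], dt: float) -> Dict[str, str]:
    """Label a trajectory whose waypoints are spaced dt seconds apart."""
    if len(waypoints) < 3:
        raise ValueError("need at least three waypoints")
    # Compare the first and last segments of the trajectory.
    dyaw = _wrap(_heading(waypoints[-2], waypoints[-1])
                 - _heading(waypoints[0], waypoints[1]))
    dv = (math.dist(waypoints[-2], waypoints[-1])
          - math.dist(waypoints[0], waypoints[1])) / dt

    if dyaw > STEER_THRESH_RAD:
        lateral = "steer_left"
    elif dyaw < -STEER_THRESH_RAD:
        lateral = "steer_right"        # hypothetical label
    else:
        lateral = "keep_straight"      # hypothetical label

    if dv > SPEED_THRESH_MPS:
        longitudinal = "accelerate"    # hypothetical label
    elif dv < -SPEED_THRESH_MPS:
        longitudinal = "decelerate"
    else:
        longitudinal = "keep_speed"    # hypothetical label

    # Three intent classes, matching the "3-class" wording of the page.
    if dyaw > TURN_THRESH_RAD:
        intent = "turning_left"
    elif dyaw < -TURN_THRESH_RAD:
        intent = "turning_right"       # hypothetical label
    else:
        intent = "going_straight"      # hypothetical label

    return {"lateral": lateral, "longitudinal": longitudinal, "intent": intent}
```

For a trajectory that slows down while curving left, for instance `label_meta_actions([(0, 0), (4, 0), (7, 1), (9, 3), (10, 5)], dt=0.5)`, this sketch returns the same three labels as the Scene C panel.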
4-step chain-of-thought
① Perceive
The ego vehicle is approaching a T-intersection in a residential area. A white sedan is parked on the left side of the road, approximately 15 m ahead, partially narrowing the available lane width. The road ahead curves to the left with parked cars on both sides of the street.
② Predict
The white sedan remains stationary. The road geometry requires a leftward trajectory change to navigate the curve and intersection safely. No oncoming traffic is immediately visible in the intersection.
③ Judge
The path is partially blocked by the parked car and the upcoming curve. A leftward steering maneuver is necessary to maintain safe clearance and follow the road curvature. Risk level: Low to Medium.
④ Plan
(1) Decelerate to reduce speed for safe maneuvering around the parked vehicle and curve. (2) Steer left to navigate the road curvature and avoid the parked sedan. (3) Monitor the intersection for any emerging cross-traffic or pedestrians.
Resulting planned trajectory — the ego vehicle decelerates and turns left at the intersection.
Sampled QA pairs (one per category)
Common
Q: What is the text written on the street sign at the intersection corner?
A: 'AVON' on the top line and 'SHAWNEE' on the bottom, clearly legible in the FRONT_RIGHT view, near panorama coordinates (742, 478).
Spatial
Q: Is the ego vehicle positioned in the left lane or the right lane of the two-lane road?
A: The right lane: centred within the rightmost lane, with a clear lane boundary on the right and parked cars along the left curb.
Temporal
Q: Did the relative distance between the ego vehicle and the white sedan ahead change over time?
A: No. The white sedan is parked on the right side and remains stationary throughout the 4-second clip, so its distance change mirrors the ego's own forward motion.
Motion
Q: Based on the current state of the vehicle directly ahead, is it likely to remain stationary or begin moving in the next few seconds?
A: Stationary. Across all 8 frames the silver sedan on the left and the red sedan behind it show no sign of movement, indicating they are parked.
Object-Centric
Q: Where is the corner house with stucco and red tile roof located in the scene?
A: Far-left of the stitched panorama, visible in the SIDE_LEFT view; its centre is at approximately (85, 450) in global stitched coordinates.
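Several answers above cite positions in global stitched-panorama coordinates together with a camera view name. The following is a minimal sketch of how a per-view pixel position could map to a stitched coordinate and back, assuming the panorama is a simple left-to-right concatenation of views. The camera order, the 200-pixel view width, and the names `PANORAMA_LAYOUT`, `to_global` and `to_local` are all hypothetical; the width is chosen only so that the two coordinates quoted above land in the views the answers name.

```python
from typing import Dict, List, Tuple

# Hypothetical left-to-right layout: (view name, width in panorama pixels).
# The real MindLabel stitching layout is not given in the source.
PANORAMA_LAYOUT: List[Tuple[str, int]] = [
    ("SIDE_LEFT", 200),
    ("FRONT_LEFT", 200),
    ("FRONT", 200),
    ("FRONT_RIGHT", 200),
    ("SIDE_RIGHT", 200),
]


def _offsets() -> Dict[str, Tuple[int, int]]:
    """Map each view name to (x offset in the panorama, view width)."""
    offsets: Dict[str, Tuple[int, int]] = {}
    x0 = 0
    for name, width in PANORAMA_LAYOUT:
        offsets[name] = (x0, width)
        x0 += width
    return offsets


def to_global(camera: str, x: int, y: int) -> Tuple[int, int]:
    """Convert a per-view pixel position to stitched-panorama coordinates."""
    x0, width = _offsets()[camera]
    if not 0 <= x < width:
        raise ValueError(f"x={x} is outside the {camera} view")
    return (x0 + x, y)


def to_local(gx: int, gy: int) -> Tuple[str, int, int]:
    """Convert stitched-panorama coordinates to (view name, x, y) in that view."""
    for name, (x0, width) in _offsets().items():
        if x0 <= gx < x0 + width:
            return (name, gx - x0, gy)
    raise ValueError(f"gx={gx} is outside the panorama")
```

Under these assumed widths, `to_local(742, 478)` resolves to the FRONT_RIGHT view and `to_local(85, 450)` to the SIDE_LEFT view, in line with the Common and Object-Centric answers.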

MindLabel annotations on real WOD-E2E frames.
Scene A (daytime urban): front-camera panorama with dreamed trajectories overlaid (four AFF candidates A–D together with the GT future, color-coded by RFS quality) and the corresponding BEV view.
Scene B (nighttime intersection): the same trajectory overlay plus the Object-Centric annotation pass (25 bounding boxes, 11 foreground / 14 background, each grounded by a natural-language descriptor, e.g. “the car in front”, “intersection right corner”).
Scene C (rolling past-video input, 8 frames, 2 Hz), shown with its full annotation stack: rule-labelled lateral/longitudinal meta-actions and the derived driving intent, the 4-step Perceive/Predict/Judge/Plan CoT, and one sampled QA pair per Scene-Understanding category.
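One way to picture the per-frame annotation stack described above is as a single record that holds the trajectory candidates and the object-centric boxes. The sketch below is illustrative only: the class and field names are hypothetical, the released data format may differ, and it assumes that a higher RFS value means a better trajectory.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectBox:
    descriptor: str                               # e.g. "the car in front"
    box_xyxy: Tuple[float, float, float, float]   # panorama pixels (assumed)
    is_foreground: bool


@dataclass
class DreamedTrajectory:
    candidate_id: str                             # "A" to "D", or "GT"
    waypoints: List[Tuple[float, float]]
    rfs: float                                    # assumed: higher is better


@dataclass
class FrameAnnotation:
    frame_id: str
    intent: str
    trajectories: List[DreamedTrajectory] = field(default_factory=list)
    objects: List[ObjectBox] = field(default_factory=list)

    def foreground_background_counts(self) -> Tuple[int, int]:
        """Return (number of foreground boxes, number of background boxes)."""
        fg = sum(1 for obj in self.objects if obj.is_foreground)
        return fg, len(self.objects) - fg

    def best_trajectory(self) -> DreamedTrajectory:
        """Return the candidate with the highest RFS value."""
        return max(self.trajectories, key=lambda t: t.rfs)
```

A record for Scene B built this way would report `(11, 14)` from `foreground_background_counts()` once its 25 boxes are loaded.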