MindLabel · Driving Data Engine

MindLabel is an auto-labeling pipeline run on WOD-E2E frames. It has two parallel branches: Scene Understanding VQA (Common · Spatial · Temporal · Motion · Object-Centric) and Action Dreaming (intent-conditioned trajectory synthesis with affordance / exemplar guidance, plus trajectory-evaluation QA). Both branches feed a unified answer-generation module. The main results consume only the basic VQA and the 3-class GT intent; the richer outputs are released for the community.

Pipeline. Scene Understanding QA generation and Action Dreaming run in parallel; a unified answer module handles category-specific generation.
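The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the function names, the record fields, and the "Trajectory-Evaluation" category label are all hypothetical placeholders, and the model calls are replaced by stub strings. Only the category and variant names come from this page.

```python
from concurrent.futures import ThreadPoolExecutor

# Names taken from this page; everything else below is a hypothetical stub.
VQA_CATEGORIES = ("Common", "Spatial", "Temporal", "Motion", "Object-Centric")
DREAM_VARIANTS = ("Better", "Alternative", "Conservative", "Worse")


def scene_understanding_branch(frame_id):
    """Placeholder: emit one question stub per Scene Understanding category."""
    return [{"frame": frame_id, "category": c, "question": f"<{c} question>"}
            for c in VQA_CATEGORIES]


def action_dreaming_branch(frame_id, intent):
    """Placeholder: emit one trajectory-evaluation stub per dreamed variant."""
    return [{"frame": frame_id, "category": "Trajectory-Evaluation",
             "variant": v, "intent": intent,
             "question": f"<evaluate {v} trajectory>"}
            for v in DREAM_VARIANTS]


def unified_answer_module(item):
    """Placeholder: a real system would dispatch on category and call a model."""
    answered = dict(item)
    answered["answer"] = f"<answer for {item['category']}>"
    return answered


def label_frame(frame_id, intent):
    # The two branches do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        vqa = pool.submit(scene_understanding_branch, frame_id)
        dream = pool.submit(action_dreaming_branch, frame_id, intent)
        items = vqa.result() + dream.result()
    # Both branches feed the same answer-generation step.
    return [unified_answer_module(item) for item in items]


records = label_frame("frame_0001", "turning_left")
print(len(records))  # five VQA stubs plus four dreamed-trajectory stubs
```

The point of the sketch is only the shape: two independent producers and a single answer step that sees the category of each item.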

Example. One frame, multiple QA categories plus four intent-conditioned dreamed trajectory variants (Better / Alternative / Conservative / Worse).

Scene Gallery

Scene A — Daytime urban

Scene A — front-camera panorama with dreamed trajectories

Scene A — BEV view with dreamed trajectories

Scene B — Nighttime intersection

Scene B — front-camera panorama with dreamed trajectories

Scene C — Rolling past video (full annotation stack)

Past-video input (8 frames, 2 Hz)

Lat. meta: steer_left

Long. meta: decelerate

Intent: turning_left
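The page does not state the labeling rules, so the following is only one plausible sketch of how rule-based meta-actions and a derived intent could be computed from the ego motion over the past window. The thresholds, the heading sign convention, the assumption that intent follows from the lateral label, and every label other than steer_left, decelerate, and turning_left are hypothetical.

```python
import math

# Hypothetical mapping; only "steer_left" -> "turning_left" appears on this page.
INTENT_FROM_LATERAL = {
    "steer_left": "turning_left",
    "steer_right": "turning_right",
    "keep_straight": "going_straight",
}


def label_meta_actions(headings_rad, speeds_mps,
                       yaw_thresh_rad=math.radians(5.0),
                       speed_thresh_mps=0.5):
    """Label a window from its first and last heading and speed samples.

    Assumed convention: heading increases counter-clockwise, so a positive
    change means the vehicle turned left. Both thresholds are illustrative.
    """
    d_yaw = headings_rad[-1] - headings_rad[0]
    # Wrap the heading change into [-pi, pi) so crossing +/-pi is handled.
    d_yaw = (d_yaw + math.pi) % (2 * math.pi) - math.pi
    d_speed = speeds_mps[-1] - speeds_mps[0]

    if d_yaw > yaw_thresh_rad:
        lateral = "steer_left"
    elif d_yaw < -yaw_thresh_rad:
        lateral = "steer_right"
    else:
        lateral = "keep_straight"

    if d_speed > speed_thresh_mps:
        longitudinal = "accelerate"
    elif d_speed < -speed_thresh_mps:
        longitudinal = "decelerate"
    else:
        longitudinal = "keep_speed"

    return lateral, longitudinal


# Eight samples at 2 Hz: heading swings about 20 degrees left while speed drops.
headings = [math.radians(20.0) * i / 7 for i in range(8)]
speeds = [8.0 - 0.5 * i for i in range(8)]

lateral, longitudinal = label_meta_actions(headings, speeds)
intent = INTENT_FROM_LATERAL[lateral]
print(lateral, longitudinal, intent)
```

With these made-up inputs the sketch reproduces the three labels shown for Scene C, which says nothing about whether the actual rules look like this.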

4-step chain-of-thought

① Perceive

The ego vehicle is approaching a T-intersection in a residential area. A white sedan is parked on the left side of the road, approximately 15 m ahead, partially narrowing the available lane width. The road ahead curves to the left with parked cars on both sides of the street.

② Predict

The white sedan remains stationary. The road geometry requires a leftward trajectory change to navigate the curve and intersection safely. No oncoming traffic is immediately visible in the intersection.

③ Judge

The path is partially blocked by the parked car and the upcoming curve. A leftward steering maneuver is necessary to maintain safe clearance and follow the road curvature. Risk level: Low to Medium.

④ Plan

(1) Decelerate to reduce speed for safe maneuvering around the parked vehicle and curve. (2) Steer left to navigate the road curvature and avoid the parked sedan. (3) Monitor the intersection for any emerging cross-traffic or pedestrians.

Resulting planned trajectory — the ego vehicle decelerates and turns left at the intersection.

Sampled QA pairs (one per category)

Common

What is the text written on the street sign at the intersection corner?

'AVON' on the top line and 'SHAWNEE' on the bottom — clearly legible in the FRONT_RIGHT view, near panorama coordinates (742, 478).

Spatial

Is the ego vehicle positioned in the left lane or the right lane of the two-lane road?

The right lane — centred within the rightmost lane, with a clear lane boundary on the right and parked cars along the left curb.

Temporal

Did the relative distance between the ego vehicle and the white sedan ahead change over time?

No. The white sedan is parked on the right side and remains stationary throughout the 4-second clip, so its distance change mirrors the ego's own forward motion.

Motion

Based on the current state of the vehicle directly ahead, is it likely to remain stationary or begin moving in the next few seconds?

Stationary. Across all 8 frames the silver sedan on the left and the red sedan behind it show no sign of movement, indicating they are parked.

Object-Centric

Where is the corner house with stucco and red tile roof located in the scene?

Far-left of the stitched panorama, visible in the SIDE_LEFT view; its centre is at approximately (85, 450) in global stitched coordinates.
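Several answers above cite a camera view together with global stitched coordinates. The conversion between the two can be sketched as below, assuming the panorama is built from equal-width tiles placed left to right. The view order and the 200-pixel tile width are assumptions chosen for illustration only, not the actual panorama geometry.

```python
# Hypothetical layout: five equal-width tiles stitched left to right.
STITCH_ORDER = ("SIDE_LEFT", "FRONT_LEFT", "FRONT", "FRONT_RIGHT", "SIDE_RIGHT")
TILE_WIDTH = 200  # pixels per view after resizing; an assumption


def view_to_global(view, x_local, y):
    """Map a pixel in one camera tile to stitched-panorama coordinates."""
    offset = STITCH_ORDER.index(view) * TILE_WIDTH
    return offset + x_local, y


def global_to_view(x_global, y):
    """Map a stitched-panorama pixel back to (view, x_local, y)."""
    idx, x_local = divmod(x_global, TILE_WIDTH)
    if not 0 <= idx < len(STITCH_ORDER):
        raise ValueError(f"x={x_global} is outside the stitched panorama")
    return STITCH_ORDER[idx], x_local, y


# The two coordinates quoted in the sampled answers, under the assumed layout.
house = global_to_view(85, 450)
sign = global_to_view(742, 478)
print(house)
print(sign)
```

Under this assumed layout the two quoted points fall in the SIDE_LEFT and FRONT_RIGHT tiles respectively, matching the views named in the answers; a different tile width or view order would change that.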

MindLabel annotations on real WOD-E2E frames. Scene A (daytime urban): front-camera panorama overlaid with the dreamed trajectories (four AFF candidates A–D plus the GT future, color-coded by RFS quality) and the corresponding BEV view. Scene B (nighttime intersection): the same trajectory overlay plus the Object-Centric annotation pass (25 bounding boxes, 11 foreground and 14 background, each grounded by a natural-language descriptor, e.g. “the car in front” or “intersection right corner”). Scene C (rolling past-video input, 8 frames at 2 Hz), shown with its full annotation stack: rule-labelled lateral and longitudinal meta-actions with the derived driving intent, the 4-step Perceive/Predict/Judge/Plan CoT, and one sampled QA pair per Scene Understanding category.
