Action Emergence from Streaming Intent

¹Li Auto  ·  ²Tsinghua University  ·  ³CUHK
*Equal contribution  Corresponding author

The first driving VLA to achieve intent-faithful trajectory control — purely data-driven, with NO pre-sampled action banks and NO post-hoc selection modules.

Abstract

We formalize action emergence as the missing capability behind long-tail failures of end-to-end driving: a driving agent's ability to produce physically feasible, semantically appropriate, and safety-compliant actions in arbitrary scenes through on-the-fly reasoning, rather than by retrieving or interpolating learned scene–action mappings. Prior autoregressive (AR) decoders collapse the multimodal future; diffusion and flow-matching (FM) models express multimodality but are not steerable by reasoned intent; existing vision-language-action (VLA) designs leave language and action structurally disconnected. We propose Streaming Intent (SI), a single-backbone VLA in which an AR-decoded intent token drives classifier-free guidance on a shared flow-matching action head.

Figure: Trajectory diversity comparison across paradigms

AR models collapse to an averaged future; diffusion / FM models sample a narrow prior-dominated bundle; SI produces intent-faithful trajectories with distinct geometry and speed profiles.

Method

A single Qwen3-VL backbone jointly supports AR decoding of the chain-of-thought (CoT) and intent, and FM intent-guided trajectory denoising. The AR half emits a four-step CoT (Perceive → Predict → Judge → Plan) terminated by <INTENT>intent_name</INTENT>; an intent bridge parses intent_name into a 20-class index that conditions the FM half via classifier-free guidance. Streaming Intent runs along two axes. Semantic streaming: the intent is the causal continuation of the CoT. Temporal streaming: a compact prev-intent memory token is carried across clips so that commitments stay coherent along the driving horizon.
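The intent bridge described above can be illustrated with a minimal sketch. It assumes the decoded CoT is available as a plain string and that the 20 intent classes have known names; the source does not list them, so the placeholder vocabulary `INTENT_NAMES`, the function name `parse_intent`, and the choice to return `None` for a missing or unknown tag are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical vocabulary: the source states there are 20 intent classes but
# does not name them, so placeholder names are used here.
INTENT_NAMES = [f"intent_{i}" for i in range(20)]
INTENT_TO_INDEX = {name: i for i, name in enumerate(INTENT_NAMES)}

# Matches the <INTENT>...</INTENT> tag that terminates the chain-of-thought.
INTENT_TAG = re.compile(r"<INTENT>\s*(.*?)\s*</INTENT>", re.DOTALL)


def parse_intent(decoded_text: str) -> Optional[int]:
    """Map the last <INTENT> tag in the decoded text to a class index.

    Returns None when no tag is present or the name is not in the vocabulary
    (an assumption of this sketch, not a documented behaviour).
    """
    matches = INTENT_TAG.findall(decoded_text)
    if not matches:
        return None
    name = matches[-1].strip().lower()
    return INTENT_TO_INDEX.get(name)
```

In a sketch like this, a `None` result could be routed to the unconditional branch of classifier-free guidance so that a malformed CoT does not stop trajectory generation; whether SI handles this case that way is not stated in the source.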

Figure: SI architecture

Streaming Intent: a single backbone for CoT reasoning and intent-guided action generation.
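To make the intent-guided denoising concrete, the following sketch shows the standard classifier-free guidance combination applied to a flow-matching velocity field, integrated with plain Euler steps. It is an illustration under stated assumptions, not SI's implementation: the function names, the guidance scale, the step count, the time convention (noise at t = 0, trajectory at t = 1), and the toy velocity field are all hypothetical.

```python
from typing import Callable, List, Optional

Vector = List[float]
# velocity_fn(x, t, intent) -> velocity; intent=None means "unconditional".
VelocityFn = Callable[[Vector, float, Optional[int]], Vector]


def guided_velocity(velocity_fn: VelocityFn, x: Vector, t: float,
                    intent: Optional[int], scale: float) -> Vector:
    """Classifier-free guidance: v = v_uncond + scale * (v_cond - v_uncond)."""
    v_uncond = velocity_fn(x, t, None)
    v_cond = velocity_fn(x, t, intent)
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


def sample_trajectory(velocity_fn: VelocityFn, x0: Vector, intent: Optional[int],
                      scale: float = 2.0, steps: int = 10) -> Vector:
    """Euler-integrate dx/dt = guided velocity from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = guided_velocity(velocity_fn, x, k * dt, intent, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, intent: Optional[int]) -> Vector:
    """Toy stand-in for the FM head: a constant field set by the intent index."""
    target = 0.0 if intent is None else float(intent)
    return [target for _ in x]
```

With `scale = 1.0` the guided velocity equals the conditional one; larger values push samples further along the direction in which the intent-conditioned prediction differs from the unconditional one, which is the mechanism that lets a decoded intent steer the generated trajectory.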

Action Emergence on Long-Tail Scenes

We compare our action-emergence capability against the SOTA end-to-end driving model RAP. In each image, the colored trajectories are SI's multi-intent predictions, while the purple dashed trajectories are RAP's proposals. RAP's trajectories collapse into a single mode: a tight beam of proposals all centred on one direction (e.g. all turning left or all going straight). SI, by contrast, retains intent-faithful control across every commanded intent.

Figure: side-by-side comparison on two long-tail scenes. Columns: Ours (SI), Action Emergence; Previous Method (RAP), Mode Collapse; BEV Comparison.

High-Quality Multi-Intent Trajectories

On the Waymo E2E Rater Feedback Score (RFS) validation set (three human-rated alternatives per scene in addition to the ground truth, GT), SI's GT-intent trajectory matches the GT, while its other-intent trajectories align with the geometrically distinct RFS alternatives. This is evidence that the intent-driven diversity is high-quality coverage of the human-rated action repertoire rather than random variance.

Figure: RFS check on three scenes (scenes 1 to 3).

Streaming Intent Consistency in Long Horizon Inference

Three multi-clip episodes, each rendered as an auto-playing frame-by-frame sequence. Per-clip intent commitments evolve smoothly with the scene, driven by Streaming Intent. Each frame shows SI's full 4-step CoT and decoded intent on top of the front-3 view, with the predicted trajectory overlaid against GT. The highlighted line below each video announces the current intent as frames roll.
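The temporal-streaming behaviour shown in these episodes can be sketched as a loop that feeds each clip's committed intent into the next clip as memory. This is a hedged illustration only: `run_episode`, the `step_fn` interface, and the toy `sticky_step` policy are hypothetical, and in SI the memory is a token consumed by the backbone rather than a Python integer.

```python
from typing import Callable, Dict, Iterable, List, Optional

# step_fn(clip, prev_intent) -> intent index committed for this clip.
StepFn = Callable[[Dict, Optional[int]], int]


def run_episode(clips: Iterable[Dict], step_fn: StepFn,
                initial_intent: Optional[int] = None) -> List[int]:
    """Carry the previous clip's intent forward as memory for the next clip."""
    prev_intent = initial_intent
    history: List[int] = []
    for clip in clips:
        intent = step_fn(clip, prev_intent)
        history.append(intent)
        prev_intent = intent  # memory handed to the next clip
    return history


def sticky_step(clip: Dict, prev_intent: Optional[int]) -> int:
    """Toy policy: keep the previous intent unless the clip carries a new cue."""
    default = 0 if prev_intent is None else prev_intent
    return clip.get("cue", default)


# Example episode: a cue appears in clip 1 and another in clip 4.
episode = [{"cue": 2}, {}, {}, {"cue": 5}, {}]
intents = run_episode(episode, sticky_step)
```

The point of the design is that a clip without fresh evidence inherits the earlier commitment instead of re-deciding from scratch, which is what keeps per-clip intents coherent along the horizon.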

Case 1: Pedestrian-crossroad episode (1/5)
Case 2: Oncoming-traffic + open-door van episode (1/6)
Case 3: Open-door car + oncoming bus episode (1/4)

Planning Performance on Waymo E2E Benchmark

On the Waymo Open Dataset End-to-End (WOD-E2E) long-tail benchmark, SI ranks among the top entries on both splits. On the validation split it reaches RFS 7.96, +0.05 over the strongest prior baseline, RAP (7.91). On the test leaderboard it reaches RFS 7.74, with ADE of 1.24 m / 2.81 m at the 3 s / 5 s horizons. These results indicate that the qualitative action-emergence behaviour carries over to aggregate planning quality.

WOD-E2E Val Split

Method         RFS ↑    TR (%) ↑
WAM-Flow       5.17     14.0
Curious-VLA    5.81     30.0
AutoVLA        6.74     46.0
RecogDrive     7.40     58.0
RAP            7.91     70.7
SI (ours)      7.96     63.5

WOD-E2E Test Split

Method         RFS ↑    ADE 3s (m) ↓    ADE 5s (m) ↓
Open-LLaMA     7.43     1.31            3.22
NaiveEMMA      7.53     1.32            3.02
AutoVLA        7.56     1.35            2.96
dVLM-AD        7.63     1.29            3.02
HMVLM          7.74     1.33            3.07
SI (ours)      7.74     1.24            2.81
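For readers unfamiliar with the ADE columns, the sketch below computes average displacement error at a given horizon as the mean Euclidean distance between predicted and ground-truth waypoints up to that horizon. It is a generic illustration, not the benchmark's official evaluation code: the function name `ade`, the 4 Hz waypoint rate, and the toy trajectories are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def ade(pred: Sequence[Point], gt: Sequence[Point],
        horizon_s: float, hz: float = 4.0) -> float:
    """Mean Euclidean distance (m) over the first horizon_s seconds.

    hz is the assumed waypoint sampling rate; it is a parameter of this
    sketch, not a value taken from the source.
    """
    n = int(round(horizon_s * hz))
    if n <= 0 or len(pred) < n or len(gt) < n:
        raise ValueError("trajectories are shorter than the requested horizon")
    total = sum(math.dist(p, g) for p, g in zip(pred[:n], gt[:n]))
    return total / n


# Toy data: 5 s at 4 Hz. The prediction drifts 2 m sideways after 3 s.
gt_traj: List[Point] = [(float(i), 0.0) for i in range(20)]
pred_traj: List[Point] = [(float(i), 0.0 if i < 12 else 2.0) for i in range(20)]

ade_3s = ade(pred_traj, gt_traj, horizon_s=3.0)
ade_5s = ade(pred_traj, gt_traj, horizon_s=5.0)
```

Because the 5 s window contains the 3 s window, an error that appears late in the trajectory raises ADE 5s while leaving ADE 3s unchanged, which is why the two columns are reported separately.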

BibTeX

@article{jing2026streamingintent,
  title={Action Emergence from Streaming Intent},
  author={Pengfei Jing and Victor Shea-Jay Huang and Hengtong Lu and Jifeng Dai and Yan Xie and Benjin Zhu},
  journal={arXiv preprint arXiv:2605.12622},
  year={2026}
}