Action Emergence from Streaming Intent

Pengfei Jing¹,²*, Victor Shea-Jay Huang¹,³, Hengtong Lu¹,², Jifeng Dai², Yan Xie¹, Benjin Zhu¹,²*†

¹Li Auto · ²Tsinghua University · ³CUHK
*Equal contribution · †Corresponding author

The first driving VLA to achieve intent-faithful trajectory control: purely data-driven, with no pre-sampled action banks and no post-hoc selection modules.

Abstract

We formalize action emergence as the missing capability behind long-tail failures of end-to-end driving: a driving agent's ability to produce physically feasible, semantically appropriate, and safety-compliant actions in arbitrary scenes through on-the-fly reasoning, rather than by retrieving or interpolating learned scene–action mappings. Prior autoregressive (AR) decoders collapse the multimodal future; diffusion and flow-matching (FM) models express multimodality but are not steerable by reasoned intent; existing vision-language-action (VLA) designs leave language and action structurally disconnected. We propose Streaming Intent (SI), a single-backbone VLA in which an AR-decoded intent token drives classifier-free guidance on a shared flow-matching action head.

Trajectory diversity comparison across paradigms

AR models collapse to an averaged future; diffusion / FM models sample a narrow prior-dominated bundle; SI produces intent-faithful trajectories with distinct geometry and speed profiles.

Method

A single Qwen3-VL backbone jointly supports AR decoding of the chain-of-thought (CoT) and intent, and FM intent-guided trajectory denoising. The AR half emits a four-step CoT (Perceive → Predict → Judge → Plan) terminated by <INTENT>intent_name</INTENT>; an intent bridge parses intent_name into a 20-class index that conditions the FM half via classifier-free guidance. Streaming Intent runs along two axes. Semantic streaming: the intent is the causal continuation of the CoT. Temporal streaming: a compact prev-intent memory token is carried across clips so that commitments stay coherent along the driving horizon.
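As a rough illustration of the intent bridge described above, the sketch below extracts the name inside the closing <INTENT> tag and maps it to a class index. It is a minimal sketch, not the authors' implementation: the function name parse_intent, the regular expression, and the four-entry vocabulary are assumptions made for illustration (the page states there are 20 classes but does not list them).

```python
import re
from typing import Optional, Sequence

# Hypothetical vocabulary: the source mentions 20 intent classes without
# naming them, so these four entries are placeholders for illustration only.
INTENT_VOCAB = ("keep_lane", "turn_left", "turn_right", "stop")

# Matches <INTENT>name</INTENT>, tolerating whitespace around the name.
INTENT_TAG = re.compile(r"<INTENT>\s*([^<]+?)\s*</INTENT>")


def parse_intent(decoded_text: str,
                 vocab: Sequence[str] = INTENT_VOCAB) -> Optional[int]:
    """Return the class index of the last <INTENT> tag in the decoded text.

    Returns None when no tag is present or the name is not in the vocabulary,
    so the caller can decide how to handle a missing intent.
    """
    matches = INTENT_TAG.findall(decoded_text)
    if not matches:
        return None
    name = matches[-1].strip().lower()
    return vocab.index(name) if name in vocab else None


if __name__ == "__main__":
    cot = "Perceive: ... Predict: ... Judge: ... Plan: ... <INTENT>turn_left</INTENT>"
    print(parse_intent(cot))
```

Returning None for an unparseable intent is one possible design choice; a caller could, for example, fall back to the unconditional branch of classifier-free guidance in that case.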

Streaming Intent: a single backbone for CoT reasoning and intent-guided action generation.
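To make the intent-guided denoising step concrete, the sketch below applies the standard classifier-free guidance combination, v_uncond + w * (v_cond - v_uncond), to a flow-matching velocity and integrates it with Euler steps. The page does not give the guidance formula, the solver, or the guidance scale, so all of these are assumptions; toy_velocity and TARGETS are hypothetical stand-ins for the learned action head, chosen so the result can be checked by hand.

```python
from typing import Callable, List, Optional

Vec = List[float]
# velocity(x, t, intent) -> dx/dt; intent=None selects the unconditional branch.
VelocityFn = Callable[[Vec, float, Optional[int]], Vec]


def guided_velocity(v: VelocityFn, x: Vec, t: float, intent: int, w: float) -> Vec:
    """Classifier-free guidance: v_uncond + w * (v_cond - v_uncond)."""
    v_cond = v(x, t, intent)
    v_uncond = v(x, t, None)
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample(v: VelocityFn, x0: Vec, intent: int, w: float, steps: int = 10) -> Vec:
    """Euler-integrate the guided velocity from t=0 (noise) to t=1 (action)."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        g = guided_velocity(v, x, k * dt, intent, w)
        x = [xi + dt * gi for xi, gi in zip(x, g)]
    return x


# Toy stand-in for the learned head (hypothetical values): each intent pulls
# toward a fixed 2D endpoint; None is the intent-agnostic prior.
TARGETS = {None: [10.0, 0.0], 0: [10.0, 0.0], 1: [8.0, 6.0]}


def toy_velocity(x: Vec, t: float, intent: Optional[int]) -> Vec:
    """Straight-line flow toward the endpoint; t stays below 1 in sample()."""
    mu = TARGETS[intent]
    return [(m - xi) / (1.0 - t) for m, xi in zip(mu, x)]


if __name__ == "__main__":
    for w in (0.0, 1.0, 2.0):
        print(w, sample(toy_velocity, [0.0, 0.0], intent=1, w=w))
```

In this toy setting, w = 0 ignores the intent and lands on the prior endpoint, w = 1 lands on the intent's own endpoint, and w > 1 extrapolates past it, which is the steering behaviour classifier-free guidance is meant to provide.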

Action Emergence on Long-Tail Scenes

We compare SI's action-emergence capability against RAP, a state-of-the-art end-to-end driving model. In each image, the colored trajectories are SI's multi-intent predictions and the purple dashed trajectories are RAP's proposals. RAP's trajectories collapse into a single mode, a tight beam of proposals all centred on one direction (e.g., all turning left or all going straight), whereas SI retains intent-faithful control across every commanded intent.

Ours (SI) Action Emergence

Previous Method (RAP) Mode Collapse

BEV Comparison

High-Quality Multi-Intent Trajectories

On the Waymo E2E RFS validation set (three human-rated alternatives per scene in addition to the ground truth, GT), SI's GT-intent trajectory matches the GT, while its other-intent trajectories align with the geometrically distinct RFS alternatives. This indicates that the intent-driven diversity is high-quality coverage of the human-rated action repertoire rather than random variance.

Streaming Intent Consistency in Long-Horizon Inference

Three multi-clip episodes, each rendered as an auto-playing frame-by-frame sequence. Per-clip intent commitments evolve smoothly with the scene, driven by Streaming Intent. Each frame shows SI's full 4-step CoT and decoded intent on top of the front-3 view, with the predicted trajectory overlaid against GT. The highlighted line below each video announces the current intent as frames roll.

Planning Performance on Waymo E2E Benchmark

On the Waymo Open Dataset End-to-End (WOD-E2E) long-tail benchmark, SI ranks among the top methods on both splits. On the validation split it reaches RFS 7.96, +0.05 over the strongest prior baseline, RAP (7.91). On the test leaderboard it reaches RFS 7.74, matching HMVLM, with the lowest ADE among the listed methods: 1.24 m at the 3 s horizon and 2.81 m at the 5 s horizon. These results indicate that the qualitative action-emergence behaviour carries over to aggregate planning quality.

WOD-E2E Val Split

Method	RFS ↑	TR (%) ↑
WAM-Flow	5.17	14.0
Curious-VLA	5.81	30.0
AutoVLA	6.74	46.0
RecogDrive	7.40	58.0
RAP	7.91	70.7
SI (ours)	7.96	63.5

WOD-E2E Test Split

Method	RFS ↑	ADE 3s (m) ↓	ADE 5s (m) ↓
Open-LLaMA	7.43	1.31	3.22
NaiveEMMA	7.53	1.32	3.02
AutoVLA	7.56	1.35	2.96
dVLM-AD	7.63	1.29	3.02
HMVLM	7.74	1.33	3.07
SI (ours)	7.74	1.24	2.81
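For readers unfamiliar with the ADE columns, the sketch below computes average displacement error as the mean Euclidean distance between predicted and ground-truth waypoints up to a given horizon. It is an illustrative sketch rather than the benchmark's official scoring code: the function name ade and the 4 Hz waypoint rate are assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position in metres


def ade(pred: Sequence[Point], gt: Sequence[Point],
        horizon_s: float, hz: float = 4.0) -> float:
    """Mean Euclidean error in metres over waypoints up to horizon_s seconds.

    hz is the assumed waypoint sampling rate; 4 Hz is an assumption here.
    """
    n = int(round(horizon_s * hz))
    if n < 1 or n > min(len(pred), len(gt)):
        raise ValueError("horizon does not fit the trajectory length")
    return sum(math.dist(p, g) for p, g in zip(pred[:n], gt[:n])) / n


if __name__ == "__main__":
    # Synthetic example: the prediction drifts laterally by 0.1 m per waypoint.
    gt = [(float(i), 0.0) for i in range(1, 21)]
    pred = [(float(i), 0.1 * i) for i in range(1, 21)]
    print(ade(pred, gt, 3.0), ade(pred, gt, 5.0))
```

Because the error here grows with time, the 5 s value exceeds the 3 s value, the same pattern seen in every row of the test-split table.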

BibTeX

@article{jing2026action,
  title={Action Emergence from Streaming Intent},
  author={Jing, Pengfei and Huang, Victor Shea-Jay and Lu, Hengtong and Dai, Jifeng and Xie, Yan and Zhu, Benjin},
  journal={arXiv preprint arXiv:2605.12622},
  year={2026}
}