Driving Intents Amplify Planning-Oriented Reinforcement Learning

Hengtong Lu1,2,* Victor Shea-Jay Huang1,3,* Chengmin Yang1 Pengfei Jing1,2 Jifeng Dai2 Yan Xie1 Benjin Zhu1,2,*,†
1Li Auto  ·   2Tsinghua University  ·   3CUHK
*Equal contribution  ·  †Corresponding author

The first continuous-action driving policy to escape demonstration-mode collapse: it generates rater-preferred trajectories beyond the logged human demonstration, with no autoregressive action tokens and no deployment-time oracle selection.

Driving intents amplify planning-oriented reinforcement learning teaser

What Problem Are We Solving?

Driving logs show what one human driver did, not the only safe or best action the car could take. Standard imitation models learn to copy that single logged path and often collapse around it, missing other reasonable maneuvers such as yielding earlier, changing lanes, or braking more smoothly. DIAL first asks the policy to propose different intent-level maneuvers, then learns which proposals human raters prefer.

Logged human demo: 8.13. One recorded trajectory, used as the usual imitation target.
Best prior planner (RAP): 8.50. The previous best rater-preference ceiling, under best-of-64.
DIAL proposal ceiling: 9.14. +0.64 over RAP and +1.01 over the logged human demo.
Best candidate found: 9.14. The best-of-128 RFS: if the policy proposes 128 candidates, this is how good the best one is.
Deployment policy gain: +0.515. After preference learning, DIAL improves held-out RFS from 7.696 to 8.211.
How we create alternatives: 8 × 2. Eight driving intents, two samples per intent, giving 16 maneuver candidates per scene.
Glossary: RFS = human-rater preference score  ·  Best-of-K = quality of the best candidate among K proposals  ·  GRPO = the preference-learning update applied after proposal expansion.

Method

DIAL is built on MindVLA-U1. During imitation training, the action generator receives one of eight rule-derived intents: cruise, lane change left or right, turn left or right, U-turn, accelerate, or decelerate. Intent dropout in the classifier-free-guidance style teaches both the conditional and the unconditional action distribution. During RL, each scene contributes an intent-balanced proposal group, so group-relative advantages compare maneuver-level alternatives instead of small coordinate perturbations around a single logged path.
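
As a concrete illustration, here is a minimal sketch of the two conditioning paths, assuming a hypothetical flow-based action head. The model interface (model.velocity), the dropout rate P_DROP, the NULL_INTENT index, and the guidance scale w are placeholders, not DIAL's published values.

# Sketch of intent-conditioned classifier-free guidance (CFG) for a
# flow-based action head. Interfaces and constants here are assumptions;
# DIAL's actual architecture (MindVLA-U1) is not specified on this page.
import random
import torch
import torch.nn.functional as F

INTENTS = ["cruise", "lane_left", "lane_right", "turn_left",
           "turn_right", "u_turn", "accelerate", "decelerate"]
NULL_INTENT = len(INTENTS)   # extra embedding index meaning "no intent given"
P_DROP = 0.1                 # assumed intent-dropout rate during imitation

def imitation_step(model, scene_feats, intent_id, target_actions):
    """One SFT step: occasionally drop the intent so the model learns both
    the conditional and the unconditional action distribution."""
    if random.random() < P_DROP:
        intent_id = NULL_INTENT
    pred = model(scene_feats, intent_id)
    return F.mse_loss(pred, target_actions)

@torch.no_grad()
def cfg_velocity(model, x_t, t, scene_feats, intent_id, w=2.0):
    """CFG at sampling time: extrapolate the conditional flow field away
    from the unconditional one to reach the intent's maneuver basin."""
    v_cond = model.velocity(x_t, t, scene_feats, intent_id)
    v_uncond = model.velocity(x_t, t, scene_feats, NULL_INTENT)
    return v_uncond + w * (v_cond - v_uncond)

Sampling the same scene under every intent, with a couple of noise seeds each, is what later supplies the intent-balanced proposal group used in the RL stage.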

Overview of DIAL with intent-conditioned CFG imitation and multi-intent GRPO
Stage 1 expands support with intent-conditioned CFG. Stage 2 preserves that support by scoring and updating over a multi-intent GRPO group.

Intent-CFG Proposal Ceiling

The central diagnosis is that ordinary stochastic samples from single-demonstration SFT policies do not reach enough maneuver basins. Baseline VA and VLA policies saturate below the logged human trajectory even at K=128. Intent-conditioned proposals cross the logged-trajectory reference around K=8, and pooling across all eight intents reaches the strongest ceiling.
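
The best-of-K ceiling itself is straightforward to estimate once every candidate carries an RFS score. A toy sketch with synthetic stand-in scores (the score distribution is invented for illustration; the paper's scores come from human raters):

# Monte-Carlo estimate of the best-of-K ceiling: E[max RFS among K random
# candidates]. Scores below are synthetic stand-ins for real rater scores.
import numpy as np

rng = np.random.default_rng(0)

def best_of_k(scores, k, trials=200):
    """Expected maximum score when drawing K candidates without replacement."""
    scores = np.asarray(scores)
    return float(np.mean([rng.choice(scores, size=k, replace=False).max()
                          for _ in range(trials)]))

# Toy stand-in: 8 intents x 16 candidates each, then all-intent pooling.
per_intent = [rng.normal(loc=6.5 + 0.2 * i, scale=0.6, size=16)
              for i in range(8)]
pooled = np.concatenate(per_intent)   # 128 candidates in total

for k in (1, 8, 64, 128):
    print(f"best-of-{k}: {best_of_k(pooled, k):.3f}")

Pooling all intents before taking the max is what lifts the ceiling: each intent contributes candidates from a different maneuver basin.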

Best-of-K RFS proposal ceiling for baseline and intent-conditioned policies
Intent-conditioned proposals raise the best-of-K RFS ceiling beyond ordinary SFT sampling and above the logged demonstration reference.
Samples per intent ablation at eight intents
With eight intents fixed, two samples per intent gives the best held-out peak in the reported sweep.

Planning-Oriented RL Results

Under a controlled Waymo-only task-training protocol, DIAL obtains the strongest held-out improvement among the RL-trained systems reported in the paper. Each peak is compared against the step-0 evaluation from the same RL stage.

Model       | Action representation | Stage    | Held RFS ↑ | Δ RL   | TR ↑  | Full RFS ↑
------------|-----------------------|----------|------------|--------|-------|-----------
WAM-Flow    | discrete flow         | SFT init | 5.547      | --     | 24.0% | 5.757
WAM-Flow    | discrete flow         | RL peak  | 5.634      | +0.087 | 19.0% | 6.111
Curious-VLA | action token          | SFT init | 5.808      | --     | 30.0% | 5.827
Curious-VLA | action token          | RL peak  | 5.954      | +0.146 | 31.0% | 7.157
AutoVLA     | action token          | SFT init | 6.744      | --     | 46.0% | 6.809
AutoVLA     | action token          | RL peak  | 6.787      | +0.043 | 47.0% | 6.780
ReCogDrive  | DiT diffusion         | SFT init | 7.399      | --     | 58.0% | 7.244
ReCogDrive  | DiT diffusion         | RL peak  | 7.714      | +0.315 | 65.4% | 7.543
DIAL        | continuous flow       | SFT init | 7.696      | --     | 54.7% | 7.369
DIAL        | continuous flow       | RL peak  | 8.211      | +0.515 | 68.0% | 8.631

Multi-Intent vs Single-Intent GRPO

Holding the per-scene rollout budget fixed at K=16, the main DIAL configuration spans all eight intent classes. Single-intent variants keep the same sample count but condition all rollouts on one selected intent. The multi-intent group reaches the highest held-out RFS and avoids the sharper post-peak collapse seen in single-intent recipes.

Group composition              | C | S  | K  | Held peak ↑ | TR ↑  | Full RFS ↑
-------------------------------|---|----|----|-------------|-------|-----------
single (GT, geometric)         | 1 | 16 | 16 | 7.783       | 65.0% | 7.733
single (predicted, classifier) | 1 | 16 | 16 | 7.864       | 57.0% | 8.331
single (top-rater, leakage)    | 1 | 16 | 16 | 7.728       | 61.0% | 7.545
single (random, uniform)       | 1 | 16 | 16 | 7.992       | 64.0% | 8.035
multi-intent (DIAL main)       | 8 | 2  | 16 | 8.211       | 68.0% | 8.631
C intent classes × S samples per intent = K rollouts per scene; K is fixed at 16 throughout.
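
To make the group structure concrete, here is a minimal sketch of building an intent-balanced group and computing group-relative advantages, assuming the standard GRPO mean/std normalization; DIAL's exact reward shaping and update details may differ.

# Sketch of an intent-balanced proposal group (C intents x S samples = K
# rollouts per scene) and the group-relative advantage over its RFS scores.
import torch

INTENTS = ["cruise", "lane_left", "lane_right", "turn_left",
           "turn_right", "u_turn", "accelerate", "decelerate"]

def build_group(intents=INTENTS, samples_per_intent=2):
    """Multi-intent group (C=8, S=2, K=16). A single-intent variant is
    build_group([chosen_intent], samples_per_intent=16)."""
    return [(intent, seed) for intent in intents
            for seed in range(samples_per_intent)]

def group_relative_advantages(rfs_scores, eps=1e-6):
    """Advantage of each rollout relative to its scene's group statistics:
    with a multi-intent group this ranks whole maneuvers against each
    other rather than small perturbations of one logged path."""
    rfs = torch.as_tensor(rfs_scores, dtype=torch.float32)   # shape [K]
    return (rfs - rfs.mean()) / (rfs.std() + eps)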

Diversity Preservation

The diversity analysis shows why the multi-intent group matters. DIAL keeps the largest RFS spread across intent-conditioned trajectories after RL, meaning different intents continue to produce quality-differentiated proposals. The training curves also show that DIAL peaks higher and declines less than single-intent variants.

Sampling            | D1 (m) | D2   | D3@1  | D3@16 | Gap
--------------------|--------|------|-------|-------|------
SFT init (no RL)    | 6.43   | 0.52 | 4.387 | 6.617 | +2.23
DIAL (multi-intent) | 4.17   | 0.75 | 4.500 | 6.540 | +2.04
Single: random      | 2.40   | 0.43 | 4.381 | 6.231 | +1.85
Single: predicted   | 4.82   | 0.64 | 4.372 | 6.172 | +1.80
Single: top-rater   | 7.08   | 0.65 | 4.240 | 6.020 | +1.78
Single: GT intent   | 6.02   | 0.59 | 4.229 | 5.889 | +1.66
Gap = D3@16 - D3@1.
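
Two readouts here can be sketched under stated assumptions: the Gap column matches D3@16 minus D3@1 in every row, and the "RFS spread across intent-conditioned trajectories" is read below as the dispersion of per-intent mean RFS. The D-metric definitions are not spelled out on this page, so treat these as illustrative readings.

# Sketch of two diversity readouts, under the assumptions stated above.
import numpy as np

rng = np.random.default_rng(0)

def rfs_spread(per_intent_scores):
    """Std of per-intent mean RFS: near zero once all intents collapse to
    the same proposals, large while intents stay quality-differentiated."""
    return float(np.std([np.mean(s) for s in per_intent_scores]))

def best_of_gap(scores, k=16, trials=200):
    """Expected best-of-k RFS minus the expected single-draw RFS."""
    scores = np.asarray(scores)
    best_k = np.mean([rng.choice(scores, size=k, replace=False).max()
                      for _ in range(trials)])
    return float(best_k - scores.mean())
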
Held-out RFS throughout RL training for multi-intent and single-intent variants
Multi-intent GRPO reaches the highest held-out RFS and maintains a stronger level through training.

BibTeX

@article{lu2026drivingintents,
  title={Driving Intents Amplify Planning-Oriented Reinforcement Learning},
  author={Hengtong Lu and Victor Shea-Jay Huang and Chengmin Yang and Pengfei Jing and Jifeng Dai and Yan Xie and Benjin Zhu},
  journal={arXiv preprint arXiv:2605.12625},
  year={2026}
}