
Offline RL Reboot: Diffusion Planners, "Clean Slate" Baselines, and Offline-to-Online Bridging

How offline RL is experiencing a renaissance through diffusion models, standardized baselines, and practical offline-to-online transitions.

Why this is a 2025 RL focal point

Offline RL is experiencing a second wave: not just incremental algorithmic tweaks, but a reset around (a) stronger evaluation discipline and (b) generative sequence models (diffusion/transformers) as policy and planner backbones. NeurIPS 2025 includes multiple offline RL papers—including an oral titled “A Clean Slate for Offline Reinforcement Learning”—and a clear cluster around diffusion-guided planning and offline safe RL.

References: NeurIPS 2025, arXiv

The three converging threads

1) “Clean slate” evaluation. As offline RL matured, comparisons became noisy: reported gains often hinge on implementation details, tuning budgets, and inconsistent evaluation protocols rather than on the algorithms themselves. The emergence of “clean slate” baselines is a signal that the community is trying to standardize implementations and evaluation.

2) Diffusion as a policy/planner primitive. Diffusion is not only for images. In offline RL, diffusion models can represent multimodal action distributions and generate entire multi-step action sequences, which makes them a natural backbone for enforcing behavior priors and for long-horizon control (see the sampling sketch after this list).

3) Offline-to-online bridging. A pragmatic frontier is treating the offline dataset as a prior: pretrain on logged data, then fine-tune with a limited budget of online interaction (a minimal recipe is sketched after this list).
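
To make the second thread concrete, here is a minimal sketch of ancestral sampling from a diffusion policy: start from Gaussian noise and repeatedly apply a noise-prediction network to denoise an entire action sequence conditioned on the current observation. The toy network, dimensions, and schedule are illustrative placeholders, not the architecture of any paper cited above.

```python
# Sketch: sampling an action plan from a diffusion policy (DDPM-style).
# All shapes and the toy `denoiser` network are illustrative assumptions.
import torch
import torch.nn as nn

HORIZON, ACT_DIM, OBS_DIM, STEPS = 16, 6, 17, 50

# Toy noise-prediction network: given a noisy plan, the observation, and the
# diffusion step, predict the noise that was added to the plan.
denoiser = nn.Sequential(
    nn.Linear(HORIZON * ACT_DIM + OBS_DIM + 1, 256),
    nn.ReLU(),
    nn.Linear(256, HORIZON * ACT_DIM),
)

# Standard DDPM variance schedule.
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_plan(obs: torch.Tensor) -> torch.Tensor:
    """Reverse diffusion: denoise pure noise into a HORIZON-step action plan."""
    x = torch.randn(1, HORIZON * ACT_DIM)            # start from Gaussian noise
    for t in reversed(range(STEPS)):
        t_embed = torch.full((1, 1), t / STEPS)      # crude timestep embedding
        eps = denoiser(torch.cat([x, obs, t_embed], dim=-1))
        # Posterior mean of the less-noisy plan at step t-1.
        x = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x.view(HORIZON, ACT_DIM)

plan = sample_plan(torch.zeros(1, OBS_DIM))
print(plan.shape)  # torch.Size([16, 6]): execute the first action(s), then replan
```

In receding-horizon use, only the first action (or first few) of the sampled plan is executed before resampling, which is also where the inference-cost concern in the last section comes from.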
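
For the third thread, one common recipe (sketched here under assumed agent/env interfaces and an arbitrary 50/50 mixing ratio, not any specific paper's API) is to pretrain on the offline dataset and then fine-tune online while each update batch mixes offline transitions with freshly collected ones, so the dataset keeps acting as a prior.

```python
# Sketch: offline-to-online fine-tuning with mixed replay.
# `agent` and `env` are assumed interfaces; the mixing ratio is arbitrary.
import random

BATCH_SIZE = 256
OFFLINE_RATIO = 0.5   # fraction of each batch drawn from the offline dataset

def sample_mixed_batch(offline_data, online_buffer):
    """Blend the offline prior with online experience in a single batch."""
    n_off = int(BATCH_SIZE * OFFLINE_RATIO)
    batch = random.sample(offline_data, k=min(n_off, len(offline_data)))
    batch += random.sample(online_buffer, k=min(BATCH_SIZE - n_off, len(online_buffer)))
    return batch

def finetune(agent, env, offline_data, num_steps=10_000):
    """Assumes `agent` was already pretrained on `offline_data`."""
    online_buffer = []
    obs = env.reset()
    for _ in range(num_steps):
        action = agent.act(obs)                       # offline-pretrained policy
        next_obs, reward, done = env.step(action)     # simplified env signature
        online_buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
        if len(online_buffer) >= BATCH_SIZE:
            agent.update(sample_mixed_batch(offline_data, online_buffer))
```

How much offline data to keep in each batch is the main knob: more offline data means a stronger prior but slower adaptation.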

Safety is becoming inseparable from offline RL

Offline RL is attractive exactly where interaction is expensive or risky, which means safety constraints and distribution shift are not side issues: the learned policy has to respect cost limits while avoiding out-of-distribution actions that the dataset cannot vouch for.
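
One way this shows up in training objectives (an illustrative sketch, not a method from the papers referenced above) is an actor loss with three terms: improve the critic's return estimate, stay close to dataset actions to limit distribution shift, and pay a Lagrangian penalty when an estimated constraint cost exceeds a safety budget.

```python
# Sketch: a behavior-regularized, cost-constrained actor objective.
# All tensors are batched outputs of assumed critics/policy; names are illustrative.
import torch

def actor_loss(pi_action, data_action, q_value, cost_value,
               log_lagrange, cost_limit=0.1, bc_weight=1.0):
    lagrange = log_lagrange.exp()
    rl_term = -q_value.mean()                                          # improve return
    bc_term = bc_weight * ((pi_action - data_action) ** 2).mean()      # limit distribution shift
    safety_term = lagrange.detach() * (cost_value.mean() - cost_limit)  # respect the cost budget
    return rl_term + bc_term + safety_term

def lagrange_loss(log_lagrange, cost_value, cost_limit=0.1):
    """Raise the multiplier when expected cost exceeds the limit, relax it otherwise."""
    return -log_lagrange.exp() * (cost_value.mean().detach() - cost_limit)
```

Actor and multiplier are typically updated in alternation, so the safety penalty only tightens when the cost estimate actually exceeds the budget.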

What to watch next

  • Faster inference-time planners (iterative diffusion sampling is still expensive compared with a single policy forward pass)
  • Benchmarks that match real operational constraints (safety costs, limited online budgets, realistic data coverage)
  • Unified offline RL tooling, in the spirit of the “clean slate” push

Suggested reading