
Preference-Based RL Meets Foundation Models: Multimodal Feedback Without Reward Engineering

How foundation models are transforming preference-based RL by serving as scalable feedback sources.

Why this is a 2025 RL focal point

Preference-based reinforcement learning (PbRL) is resurging because it fits the realities of robotics and embodied systems: dense reward engineering is fragile, while preferences are easier to elicit or approximate. What’s new in 2025 is that foundation models (LLMs/VLMs) are increasingly used as scalable preference judges, critics, and synthetic feedback generators.
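
For context, the standard PbRL recipe trains a reward model on pairwise comparisons of trajectory segments using a Bradley-Terry style loss, and the agent then optimizes the learned reward instead of a hand-engineered one. The sketch below is a minimal, generic PyTorch illustration of that loss; the network shape, per-step features, and segment format are assumptions for the example, not any specific system's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Maps a per-step feature vector to a scalar reward."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def bradley_terry_loss(model: RewardModel, seg_a: torch.Tensor,
                       seg_b: torch.Tensor, pref_a: float) -> torch.Tensor:
    """Loss for one comparison; pref_a is 1.0 if segment A was preferred, else 0.0."""
    # Segment "return" = sum of predicted per-step rewards over the segment.
    r_a = model(seg_a).sum()
    r_b = model(seg_b).sum()
    # Bradley-Terry model: P(A preferred over B) = sigmoid(r_a - r_b).
    return F.binary_cross_entropy_with_logits(r_a - r_b, torch.tensor(pref_a))


# Toy usage: two random 20-step segments with 8-dimensional features each.
model = RewardModel(obs_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
seg_a, seg_b = torch.randn(20, 8), torch.randn(20, 8)
loss = bradley_terry_loss(model, seg_a, seg_b, pref_a=1.0)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```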

What is changing technically

1) From pairwise comparisons to richer query formats. Newer work explores “multiple options” and more structured preference elicitation to reduce ambiguity and improve sample efficiency; an illustrative listwise loss for ranking several options at once is sketched after this list.

2) Foundation models as “cheap” annotators, plus the reliability problem. Using an FM as a judge scales feedback, but it introduces bias, inconsistency, and shortcut reliance; a minimal order-swap consistency check is sketched below.

3) Reward model improvement and anti-shortcutting. As preference learning scales up, more work explicitly targets reward model training dynamics and guards against learned shortcut behaviors.
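
To make item 1 concrete: one common way to go beyond pairwise comparisons is to ask the annotator (human or FM) to rank k candidate segments and train the reward model with a Plackett-Luce listwise likelihood, which reduces to the Bradley-Terry pairwise loss when k = 2. This is an illustrative sketch only; the newer query formats alluded to above may differ.

```python
import torch


def plackett_luce_nll(scores_ranked: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of a full ranking under the Plackett-Luce model.

    scores_ranked: shape (k,), reward-model scores for k trajectory segments,
    ordered from most preferred to least preferred by the annotator.
    """
    nll = 0.0
    for i in range(scores_ranked.shape[0] - 1):
        # Probability that item i "wins" among the items still remaining.
        nll = nll - (scores_ranked[i] - torch.logsumexp(scores_ranked[i:], dim=0))
    return nll


# Toy usage: a judge ranked four segments; in practice the scores come from
# a reward model so the gradient flows back into its parameters.
scores = torch.tensor([2.1, 1.4, 0.3, -0.5], requires_grad=True)
loss = plackett_luce_nll(scores)
loss.backward()
```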

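For item 2, one cheap reliability guard is to query the FM judge twice with the presentation order swapped and drop labels where it contradicts itself, which screens out the most blatant position bias. The `Judge` callable below is a hypothetical stand-in for whatever LLM/VLM call and prompt template you actually use; only the consistency-check logic is meant literally.

```python
from typing import Callable, Optional

# Hypothetical judge hook: given a task description and two trajectory
# summaries, it returns "A" or "B". In practice this wraps your LLM/VLM call.
Judge = Callable[[str, str, str], str]


def labeled_preference(judge: Judge, task: str, summary_a: str,
                       summary_b: str) -> Optional[float]:
    """Query the judge twice with the presentation order swapped.

    Returns 1.0 if segment A is consistently preferred, 0.0 if B is, and
    None if the judge contradicts itself (a cheap screen for position bias).
    """
    first = judge(task, summary_a, summary_b)    # A shown first
    second = judge(task, summary_b, summary_a)   # B shown first
    prefers_a_first = first == "A"
    prefers_a_second = second == "B"             # "B" now refers to segment A
    if prefers_a_first and prefers_a_second:
        return 1.0
    if not prefers_a_first and not prefers_a_second:
        return 0.0
    return None  # inconsistent label: drop it or route to a human


# Toy usage with a stub judge that always picks the first option shown
# (pure position bias) -- every one of its labels gets filtered out.
stub_judge: Judge = lambda task, x, y: "A"
print(labeled_preference(stub_judge, "stack the blocks", "traj 1", "traj 2"))
```
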
Practical implications

  • Robotics: PbRL offers an engineering-friendly path when task success is subjective.
  • Agentic systems: Preferences can capture “style constraints” and “do-not-do” rules.
  • Evaluation: Labs should treat the preference pipeline (query design, judge reliability, reward model training) as a measurement system that needs validation, not just a data source.

Suggested reading