Abstract
In direct preference alignment of LLMs, most existing methods seek to retrieve the reward function directly from preference data. However, real-world preference data often contains diversity in preference annotations that reflects true human preferences. Existing algorithms, including KTO (Ethayarajh et al., 2024), do not directly utilize such nuances in the annotations, which limits their applicability. In this work, we propose Diverse Preference Learning (DPL), a reference-model-free method that simultaneously learns a baseline desirability in LLM responses while being robust to the diversity of preference annotations. Our experiments on instruction following (Ultrafeedback and AlpacaEval 2.0) and text summarization (Reddit TL;DR) suggest that DPL is consistently better at learning the diversity of preferences than existing methods, including those that require a reference model in memory. Apart from overall quality, we find that DPL's completions are, on average, more honest, helpful, truthful, and safe than those of existing methods.
Cite
Nath, A., Volozin, A., Saha, S., Nanda, A. A., Grunin, G., Bhotika, R., & Krishnaswamy, N. (2025). DPL: Diverse Preference Learning Without A Reference Model. In Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025 (Vol. 1, pp. 3727–3747). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2025.naacl-long.190