Research

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

TUR-DPO improves LLM alignment by weighting preference training signals with semantic faithfulness and reasoning quality, boosting calibration and judge win-rates on 7-8B models without costly online rollouts.

Monday, May 4, 2026, 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv CS.AI · BY sys://pipeline

Researchers propose TUR-DPO, an extension of Direct Preference Optimization that incorporates topology and uncertainty awareness to better align LLMs with human preferences. Unlike standard DPO, which treats preferences as flat binary signals, TUR-DPO also rewards how answers are derived, combining semantic faithfulness, utility, and reasoning-topology quality into a calibrated uncertainty signal. Empirical results show improvements in judge win-rates, faithfulness, and calibration across 7-8B models while preserving DPO's training simplicity and avoiding costly online rollouts.
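To make the idea concrete, the snippet below sketches a per-pair DPO loss scaled by a quality weight. The standard DPO margin is real; the weight that combines faithfulness, utility, and topology scores (here a simple geometric mean) is a hypothetical stand-in for TUR-DPO's calibrated uncertainty signal, since the article does not give the paper's exact formulation.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def weighted_dpo_loss(logp_w: float, logp_l: float,
                      ref_logp_w: float, ref_logp_l: float,
                      faithfulness: float, utility: float,
                      topology: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss scaled by a hypothetical quality weight.

    logp_* are policy log-probs of the chosen (w) and rejected (l)
    responses; ref_logp_* are the frozen reference model's log-probs.
    The three quality scores in [0, 1] are illustrative stand-ins for
    TUR-DPO's semantic-faithfulness, utility, and topology signals.
    """
    # Standard DPO margin: implicit reward of chosen minus rejected.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Hypothetical scalar weight: geometric mean of the quality scores,
    # so low-quality preference pairs contribute less gradient signal.
    weight = (faithfulness * utility * topology) ** (1.0 / 3.0)
    return -weight * math.log(sigmoid(margin))
```

A pair whose chosen answer is judged faithful and well-reasoned keeps nearly the full DPO loss, while a noisy or poorly derived pair is down-weighted, which is one plausible way a quality-aware signal could improve calibration without online rollouts.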

Tags
research