Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning
Direct Preference Optimization (DPO) is a method for fine-tuning Large Language Models (LLMs) on human preference data. Instead of training a separate reward model and running a reinforcement learning loop, as RLHF does, DPO trains the LLM directly on pairs of preferred and rejected responses, which makes it both simpler and more efficient.
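As a rough illustration, here is a minimal sketch of the DPO loss in PyTorch. This is not code from the video; the function and argument names are illustrative, and the inputs are assumed to be the summed log-probabilities of each response under the model being tuned and under a frozen reference model.

```python
# A minimal sketch of the DPO loss, assuming PyTorch. Not code from the
# video: function and argument names are illustrative. The inputs are
# assumed to be summed log-probabilities of each whole response under
# the model being tuned (the policy) and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen: torch.Tensor,
             policy_logps_rejected: torch.Tensor,
             ref_logps_chosen: torch.Tensor,
             ref_logps_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the tuned policy vs. the frozen reference model.
    chosen_logratio = policy_logps_chosen - ref_logps_chosen
    rejected_logratio = policy_logps_rejected - ref_logps_rejected
    # Bradley-Terry preference probability, maximized via the logistic
    # loss: -log sigmoid(beta * (chosen - rejected)). beta plays the
    # role of the KL-penalty strength in RLHF.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
lp_c, lp_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```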
Learn about it in this simple video!
This is the third of three videos dedicated to the reinforcement learning methods used for training LLMs, plus an optional introduction to deep reinforcement learning (Video 0).
Full Playlist: https://www.youtube.com/playli....st?list=PLs8w1Cdi-zv
Video 0 (Optional): Introduction to deep reinforcement learning https://www.youtube.com/watch?v=SgC6AZss478
Video 1: Proximal Policy Optimization https://www.youtube.com/watch?v=TjHH_--7l8g
Video 2: Reinforcement Learning with Human Feedback https://www.youtube.com/watch?v=Z_JUqJBpVOk
Video 3 (This one!): Direct Preference Optimization
00:00 Introduction
01:08 RLHF vs DPO
07:19 The Bradley-Terry Model
11:25 KL Divergence
16:32 The Loss Function
24:36 Conclusion
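For reference, the loss derived in the video's final chapters combines the Bradley-Terry preference model with a KL-divergence anchor to a reference policy. The standard form of the DPO objective (from the original DPO paper by Rafailov et al.) is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here y_w and y_l are the preferred and rejected responses to prompt x, pi_ref is the frozen reference model, and beta controls how strongly the tuned policy is kept close to the reference (the role played by the KL penalty in RLHF).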
Get the Grokking Machine Learning book!
https://manning.com/books/grok....king-machine-learnin
Discount code (40%): serranoyt
(Use the discount code at checkout)