๐ Step 9: AI/LLM#304 / 350
Direct Preference Optimization
Direct Preference Optimization (DPO)
๐One-line summary
A technique that optimizes a model directly from human preference data without a separate reward model.
๐กEasy explanation
A simpler alternative to RLHF. Instead of training a separate reward model, you directly train on 'answer A is better than B' data.
โจExample
RLHF๋ฅผ ๋จ์ํํ ๋ฐฉ๋ฒ
RLHF
๋ณด์ ๋ชจ๋ธ ํ์
๋ณต์กํ ํ์ดํ๋ผ์ธ
DPO โจ
๋ณด์ ๋ชจ๋ธ ๋ถํ์
์ ํธ ๋ฐ์ดํฐ๋ก ์ง์ ํ์ต