๐Ÿ“– Step 9: AI/LLM#304 / 350

Direct Preference Optimization

Direct Preference Optimization (DPO)

๐Ÿ“–One-line summary

A technique that optimizes a model directly from human preference data without a separate reward model.

๐Ÿ’กEasy explanation

A simpler alternative to RLHF. Instead of training a separate reward model, you directly train on 'answer A is better than B' data.

โœจExample

RLHF๋ฅผ ๋‹จ์ˆœํ™”ํ•œ ๋ฐฉ๋ฒ•

RLHF

๋ณด์ƒ ๋ชจ๋ธ ํ•„์š”

๋ณต์žกํ•œ ํŒŒ์ดํ”„๋ผ์ธ

DPO โœจ

๋ณด์ƒ ๋ชจ๋ธ ๋ถˆํ•„์š”

์„ ํ˜ธ ๋ฐ์ดํ„ฐ๋กœ ์ง์ ‘ ํ•™์Šต