Direct Preference Optimization

Direct Preference Optimization (DPO)

📖 One-line summary

A technique that optimizes a model directly from human preference data without a separate reward model.

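A simpler alternative to RLHF. Instead of first training a separate reward model and then optimizing against it with reinforcement learning, DPO trains the model directly on pairwise comparisons such as 'answer A is better than answer B'.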

A simplified approach to RLHF

RLHF

Requires a reward model

Complex pipeline

DPO ✨

No reward model needed

Trains directly on preference data
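
To make "trains directly on preference data" concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the log-probabilities of the preferred ("chosen") and dispreferred ("rejected") answers have already been computed under both the model being trained and a frozen reference model. The function and parameter names are hypothetical, and `beta=0.1` is only an illustrative default, not a recommended setting.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) answer pair.

    Each argument is the log-probability of a whole answer given the
    prompt (assumed precomputed). beta scales how strongly the policy
    is pushed away from the reference model.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Positive when the policy favors the chosen answer more than the
    # reference does, relative to the rejected answer.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


# Hypothetical log-probabilities, for illustration only.
baseline = dpo_loss(-11.0, -11.0, -11.0, -11.0)  # policy equals reference
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)  # policy favors the chosen answer
print(baseline > improved)  # True: preferring the chosen answer lowers the loss
```

No reward model appears anywhere in this computation: the preference pair itself supplies the training signal, and the reference model only keeps the trained model from drifting too far. In practice the loss would be averaged over a batch and minimized with a gradient-based optimizer in a framework such as PyTorch or JAX.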