Chapter 2.10 of the report discusses three reinforcement learning methods: RLHF, RLAIF, and DPO.
RLHF
RLHF (Reinforcement Learning from Human Feedback) is a well-known technique that improves a model by using human feedback on its outputs as the training signal: human preference judgments are used to train a reward model, and the model is then optimized against that reward. It has been widely applied in prominent models such as GPT-4, Llama 2, Claude 2, and Gemini; however, not every model uses RLHF. Mistral 7B, for example, does not adopt this method.
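To make the basic idea concrete, here is a minimal, self-contained sketch of the core RLHF ingredient: fitting a reward model to human preference pairs with a Bradley-Terry-style objective and then using that reward to rank candidate outputs. The feature vectors, the toy data, and the final selection step are hypothetical stand-ins; a real pipeline would use an LLM and a policy-gradient algorithm such as PPO instead.

```python
# Illustrative RLHF-style sketch (not a production recipe).
# Step 1: learn a reward model from human preference pairs.
# Step 2: use the learned reward to score candidate responses.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-dim feature vectors for (prompt, response) pairs.
# Each tuple is (features_of_preferred, features_of_rejected), i.e. a
# human annotator preferred the first response over the second.
preference_pairs = [(rng.normal(size=4) + 1.0, rng.normal(size=4)) for _ in range(200)]

# Linear reward model r(x) = w . x, trained by gradient ascent on the
# Bradley-Terry log-likelihood: log sigmoid(r(preferred) - r(rejected)).
w = np.zeros(4)
lr = 0.1
for _ in range(100):
    grad = np.zeros(4)
    for chosen, rejected in preference_pairs:
        diff = chosen - rejected
        margin = w @ diff
        grad += (1.0 - 1.0 / (1.0 + np.exp(-margin))) * diff
    w += lr * grad / len(preference_pairs)

# Policy-improvement stand-in: pick the candidate the reward model scores
# highest (real RLHF would instead update the policy, e.g. with PPO).
candidates = [rng.normal(size=4) for _ in range(5)]
best = max(candidates, key=lambda x: w @ x)
print("learned reward weights:", np.round(w, 3))
print("highest-reward candidate:", np.round(best, 3))
```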
DPO
Regarding DPO (Direct Preference Optimization), we have already discussed it in detail in a previous article; please refer to that earlier content.
RLAIF
Today we focus on RLAIF (Reinforcement Learning from AI Feedback). This is a newer approach that trains and optimizes models using feedback generated by an AI model rather than relying entirely on human feedback. It can reduce dependence on large-scale manual annotation while maintaining label quality, thereby improving training efficiency.
More details:
Although RLHF (Reinforcement Learning from Human Feedback) has long been considered the standard approach for aligning AI models, its reliance on large amounts of human feedback often becomes a limiting factor in terms of time and labor. As an alternative, a recent Google Research study proposed RLAIF (Reinforcement Learning from AI Feedback), which uses preference labels produced by a large language model in place of human labels, with the aim of aligning other AI models with human preferences at lower annotation cost.
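The sketch below illustrates the labeling step that distinguishes RLAIF: instead of asking humans which of two responses is better, an off-the-shelf LLM is prompted to choose, and its choices become the preference data fed into the same reward-model training as above. The prompt text and the function `query_labeler_llm` are hypothetical placeholders; in the Google Research setup this would be a call to a large "AI labeler" model, not the toy heuristic used here.

```python
# Hedged sketch of RLAIF preference labeling (assumed names and prompt).
LABELING_PROMPT = (
    "A good summary is concise and captures the key points of the text.\n"
    "Text: {text}\n"
    "Summary 1: {summary_a}\n"
    "Summary 2: {summary_b}\n"
    "Which summary is better? Answer 1 or 2."
)

def query_labeler_llm(prompt: str) -> str:
    """Hypothetical stand-in for the AI labeler: a toy heuristic
    (prefer the shorter summary) replaces a real LLM call."""
    lines = prompt.splitlines()
    s1 = next(l for l in lines if l.startswith("Summary 1:"))
    s2 = next(l for l in lines if l.startswith("Summary 2:"))
    return "1" if len(s1) <= len(s2) else "2"

def ai_preference(text: str, summary_a: str, summary_b: str) -> tuple[str, str]:
    """Return (preferred, rejected) according to the AI labeler."""
    prompt = LABELING_PROMPT.format(text=text, summary_a=summary_a, summary_b=summary_b)
    answer = query_labeler_llm(prompt)
    return (summary_a, summary_b) if answer.strip() == "1" else (summary_b, summary_a)

# These AI-chosen pairs then replace human pairs in reward-model training;
# the downstream reinforcement learning stage itself is unchanged.
chosen, rejected = ai_preference(
    "Long article about reinforcement learning ...",
    "RLAIF uses AI feedback to train a reward model.",
    "This article talks about many things related to machine learning and other topics.",
)
print("AI-preferred:", chosen)
```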
RLAIF vs. RLHF: Which is more efficient?
The study compared RLAIF and RLHF on summarization and helpfulness tasks, finding that human evaluators preferred both over a supervised fine-tuning (SFT) baseline, and that the difference between the win rates of RLHF and RLAIF was not statistically significant (see the figure below).
Notably, on the harmless dialogue generation task, where the goal is to produce the least harmful responses, RLAIF achieved a higher harmless rate (88%) than RLHF (76%) (see the figure below). This suggests that RLAIF may be a more resource-efficient and cost-effective way to align AI models.