Reinforcement learning from human feedback (Machine Learning)
noun phrase
Definition: A variant of reinforcement learning in which the reward signal is derived from human feedback (typically preference judgments) rather than from a hand-engineered reward function; in LLM research, it is widely used to align model behavior with human preferences [Kaufmann et al. 2023]. A minimal sketch of the reward-modeling step follows this entry.
Example in context: “Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.” [Lang et al. 2024]
Synonym: RLHF
Related terms: preference modeling, reward model, alignment, supervised fine-tuning, constitutional AI
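Illustrative sketch: the definition above distinguishes RLHF from reward engineering by its use of human feedback. A common instantiation is to fit a reward model on human preference pairs with a Bradley-Terry-style loss, and then use that learned reward for policy optimization. The snippet below is a minimal, hypothetical sketch of that reward-modeling step (PyTorch; the class names, dimensions, and the linear layer standing in for a pretrained LLM backbone are assumptions for illustration, not a reference implementation).

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a response representation with a scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # In practice this head sits on top of a pretrained LLM;
        # a plain linear layer stands in for that backbone here.
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: the human-preferred (chosen) response
    # should receive a higher reward than the rejected one.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random embeddings stand in for pooled LLM outputs.
model = RewardModel()
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
```

The learned reward then replaces an engineered reward function when the policy (the LLM) is optimized, e.g., with PPO or a related algorithm.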