Researchers develop a method to improve reward models with LLM-generated synthetic critiques
The approach aims to reduce the time and cost associated with human annotation.
Researchers from Cohere and the University of Oxford have introduced a method to enhance reward models (RMs) in reinforcement learning from human feedback (RLHF) by using large language models (LLMs) to generate synthetic critiques. The approach aims to reduce the extensive time and cost of human annotation, which is traditionally required to train RMs to predict scores that reflect human preferences.
In their paper, ‘Improving Reward Models with Synthetic Critiques’, the researchers detail how LLMs can generate critiques that evaluate the relationship between a prompt and a generated output, which the reward model then draws on when predicting a scalar reward. These synthetic critiques improved reward model performance on various benchmarks by providing additional feedback on aspects such as instruction following, correctness, and style, leading to better assessment and scoring of language models.
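To make the idea concrete, the sketch below illustrates the critique-then-score pattern described above: an LLM first writes a critique of a prompt-response pair, and the reward model then conditions on that critique when producing its scalar score. The function names, prompt template, and toy stand-ins are hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch of critique-conditioned reward scoring.
# All names (generate_critique, score_with_critique) and the critique
# template are assumptions for illustration, not the paper's code.

def generate_critique(llm, prompt: str, response: str) -> str:
    """Ask an LLM to critique a response on instruction following,
    correctness, and style (hypothetical helper)."""
    critique_prompt = (
        "Evaluate the response to the prompt below. Comment on instruction "
        "following, correctness, and style.\n\n"
        f"Prompt: {prompt}\nResponse: {response}\nCritique:"
    )
    return llm(critique_prompt)  # assumed: llm is a callable text generator


def score_with_critique(reward_model, prompt: str, response: str,
                        critique: str) -> float:
    """Condition the reward model on the critique as well as the
    prompt-response pair before predicting a scalar reward."""
    enriched_input = f"{prompt}\n{response}\nCritique: {critique}"
    return reward_model(enriched_input)  # assumed: returns a scalar score


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    toy_llm = lambda text: "The response follows the instruction and is factually correct."
    toy_rm = lambda text: float(len(text) % 10) / 10.0  # placeholder scorer

    prompt = "Summarise the benefits of unit tests in one sentence."
    response = "Unit tests catch regressions early and document intended behaviour."

    critique = generate_critique(toy_llm, prompt, response)
    reward = score_with_critique(toy_rm, prompt, response, critique)
    print(f"Critique: {critique}\nReward: {reward:.2f}")
```

In practice the reward model would be a trained scorer rather than a placeholder; the point of the sketch is simply that the critique becomes part of the input the scorer sees.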
The study highlighted that high-quality synthetic critiques significantly increased data efficiency, with one critique-enhanced preference pair proving as valuable as forty non-enhanced pairs. The approach makes the training process more cost-effective and has the potential to match or surpass traditional reward models, as demonstrated by GPT-4's performance on certain benchmarks.
As the field continues to explore alternatives to RLHF, including reinforcement learning from AI feedback (RLAIF), this research indicates a promising shift towards AI-based critiquing, potentially transforming how major AI players such as Google, OpenAI, and Meta align their large language models.