Fine-tuning GPT-2 for Positive Sentiment Generation using RLHF
This project provides a reference implementation for fine-tuning a pretrained GPT-2 model to generate sentences expressing positive sentiment using Reinforcement Learning from Human Feedback (RLHF). The process involves three steps:

1. Supervised Fine-Tuning (SFT): fine-tune GPT-2 on the stanfordnlp/sst2 dataset.
2. Reward Model Training: train a GPT-2 model with a reward head to predict sentiment.
3. Reinforcement Learning via Proximal Policy Optimization (PPO): optimize the SFT model to generate sentences that the reward model scores as positive.

Each step is implemented in its own Jupyter Notebook, allowing a step-by-step walkthrough. A Hugging Face access token is required to download the pretrained GPT-2 model. Minimal, illustrative sketches of each step follow below.
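Because the pretrained GPT-2 weights are pulled from the Hugging Face Hub, the notebooks need an authenticated session first. A minimal sketch, assuming authentication via `huggingface_hub.login` (the token string is a placeholder):

```python
# Authenticate with the Hugging Face Hub and download pretrained GPT-2.
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

login(token="hf_xxx")  # placeholder token; alternatively set the HF_TOKEN environment variable

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
```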
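Step 1 fine-tunes GPT-2 as a causal language model on SST-2 sentences. The sketch below shows one possible setup using the `transformers` Trainer; the hyperparameters, the 64-token truncation, and the output path `sft-gpt2` are illustrative assumptions, not the notebook's exact configuration:

```python
# Sketch of Supervised Fine-Tuning (SFT) on stanfordnlp/sst2.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = load_dataset("stanfordnlp/sst2", split="train")

def tokenize(batch):
    # SST-2 stores its text in the "sentence" column.
    return tokenizer(batch["sentence"], truncation=True, max_length=64)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-gpt2", num_train_epochs=1,
                           per_device_train_batch_size=8, report_to="none"),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("sft-gpt2")  # checkpoint reused by the PPO sketch below
```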
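Step 2 attaches a classification head to GPT-2 and trains it on the SST-2 sentiment labels so it can score generated text. Again a minimal sketch under assumed settings (a 2-label head and the output path `reward-gpt2`):

```python
# Sketch of reward-model training: GPT-2 with a sentiment-classification head.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id  # required for batched GPT-2 classification

dataset = load_dataset("stanfordnlp/sst2", split="train")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=64)

tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="reward-gpt2", num_train_epochs=1,
                           per_device_train_batch_size=16, report_to="none"),
    train_dataset=tokenized,
)
trainer.train()
trainer.save_model("reward-gpt2")  # checkpoint reused as the reward model below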
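Step 3 runs PPO so the SFT policy shifts toward generations the reward model scores as positive. The sketch assumes the pre-0.12 `trl` PPOTrainer interface and the checkpoint paths from the sketches above; the prompts, step count, and the `LABEL_1` = positive convention are illustrative assumptions:

```python
# Sketch of PPO fine-tuning with trl (pre-0.12 style API).
import torch
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Policy with a value head, a frozen reference copy, and the sentiment reward model.
policy = AutoModelForCausalLMWithValueHead.from_pretrained("sft-gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-gpt2")
reward_pipe = pipeline("text-classification", model="reward-gpt2", tokenizer=tokenizer)

config = PPOConfig(batch_size=8, mini_batch_size=8, learning_rate=1.41e-5)
ppo_trainer = PPOTrainer(config=config, model=policy, ref_model=ref_model, tokenizer=tokenizer)

prompts = ["The movie was", "I thought this film was"]  # illustrative prompts

for _ in range(20):  # a handful of PPO updates
    queries = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts * 4]
    responses = []
    for q in queries:
        out = ppo_trainer.generate(q, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
        responses.append(out.squeeze(0)[len(q):])  # keep only the generated continuation
    texts = [tokenizer.decode(torch.cat([q, r])) for q, r in zip(queries, responses)]
    # Reward = probability the reward model assigns to the positive class (assumed LABEL_1).
    rewards = []
    for result in reward_pipe(texts):
        score = result["score"] if result["label"] == "LABEL_1" else 1.0 - result["score"]
        rewards.append(torch.tensor(score))
    ppo_trainer.step(queries, responses, rewards)

policy.save_pretrained("ppo-gpt2")  # assumed output path for the final positive-sentiment model
```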