by Krishna Kumar Singh on August 9, 2023
Hugging Face has announced that it is doubling down on its efforts to democratize RLHF and reinforcement learning with TRL, the newest addition to the Hugging Face family.
The TRL library is designed with enough modularity for users to efficiently customize the training loop for their needs. Below are some examples of how you can apply and test different techniques.
TRL leverages accelerate to enable users to run their training on multiple GPUs or nodes. To do so, first create your accelerate config by running:
accelerate config
Make sure you select a multi-GPU / multi-node setup. You can then launch your training by running:
accelerate launch your_script.py
Refer to the examples page for more details.
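If you prefer to skip the interactive prompt, the same multi-GPU setup can also be expressed directly as flags on the launch command. This is a sketch assuming a single machine with two GPUs:
accelerate launch --multi_gpu --num_processes 2 your_script.py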
By default, the PPOTrainer creates a torch.optim.Adam optimizer. You can create a different optimizer and pass it to PPOTrainer:
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate': 1e-5}
config = PPOConfig(**ppo_config)

# 3. create optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)

# 4. initialize trainer
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
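To see the custom optimizer actually being used, here is a minimal single-step example in the style of the TRL quickstart. The query text and the constant reward of 1.0 are made-up stand-ins; in a real run the reward comes from a reward model:
from trl.core import respond_to_batch

# encode a toy query and generate a response with the model being trained
query_tensor = tokenizer.encode("This morning I went to the ", return_tensors="pt")
response_tensor = respond_to_batch(model, query_tensor)

# a placeholder constant reward; in practice this comes from a reward model
reward = [torch.tensor(1.0)]

# one PPO optimization step, driven by the SGD optimizer defined above
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)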
For more memory-efficient fine-tuning, you can also pass the Adam8bit optimizer from bitsandbytes:
import torch
import bitsandbytes as bnb
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate': 1e-5}
config = PPOConfig(**ppo_config)

# 3. create 8-bit optimizer
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=config.learning_rate)

# 4. initialize trainer
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
You can use the new LION optimizer from Google as well. First, take the source code of the optimizer definition here and copy it so that you can import the optimizer. Make sure to initialize the optimizer with the trainable parameters only, for more memory-efficient training:
optimizer = Lion(filter(lambda p: p.requires_grad, model.parameters()), lr=config.learning_rate)
...
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer)
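If you just want to experiment without pulling in the full reference code, here is a minimal sketch of the Lion update rule as a PyTorch optimizer. The class name Lion matches the snippet above, but this is a bare-bones re-implementation for illustration; prefer the reference implementation linked above for real runs:
import torch
from torch.optim.optimizer import Optimizer

class Lion(Optimizer):
    """Minimal sketch of the Lion update rule (sign of interpolated momentum)."""
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
        defaults = dict(lr=lr, betas=betas, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None if closure is None else closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["exp_avg"] = torch.zeros_like(p)
                exp_avg = state["exp_avg"]
                beta1, beta2 = group["betas"]
                # decoupled weight decay, as in AdamW
                p.mul_(1 - group["lr"] * group["weight_decay"])
                # update direction is the sign of the beta1-interpolated momentum
                update = exp_avg.mul(beta1).add(p.grad, alpha=1 - beta1).sign_()
                p.add_(update, alpha=-group["lr"])
                # the momentum buffer itself tracks the gradient with beta2
                exp_avg.mul_(beta2).add_(p.grad, alpha=1 - beta2)
        return loss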
We advise you to use the learning rate that you would use for Adam divided by 3, as pointed out here. We observed an improvement when using this optimizer compared to classic Adam (check the full logs here).
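For instance, if your Adam runs use a learning rate of 1e-5, a starting point for Lion following this rule of thumb would be (reusing the Lion class from above):
# Adam learning rate divided by 3, per the rule of thumb above
lion_lr = 1e-5 / 3  # roughly 3.3e-6
optimizer = Lion(filter(lambda p: p.requires_grad, model.parameters()), lr=lion_lr)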
You can also play with your training by adding a learning rate scheduler:
import torch
from transformers import GPT2Tokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 2. define config
ppo_config = {'batch_size': 1, 'learning_rate': 1e-5}
config = PPOConfig(**ppo_config)

# 3. create optimizer and scheduler
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

# 4. initialize trainer
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer, lr_scheduler=lr_scheduler)
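You are not limited to ExponentialLR; any scheduler following the torch.optim.lr_scheduler interface should plug in the same way. For example, here is a sketch using the linear warmup schedule from transformers, with a made-up step budget:
from transformers import get_linear_schedule_with_warmup

# hypothetical budget: warm up over the first 10 PPO steps, decay to zero by step 100
lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=10, num_training_steps=100)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer, optimizer=optimizer, lr_scheduler=lr_scheduler)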
Another way to make fine-tuning more memory-efficient is to share layers between the reference model and the model you want to train.
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
model_ref = create_reference_model(model, num_shared_layers=6)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')

# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
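If you want to double-check what create_reference_model gives you, a quick assertion can confirm that the reference copy is frozen (this assumes, as expected, that its parameters are detached from training):
# the reference model should stay fixed during PPO training
assert all(not p.requires_grad for p in model_ref.parameters())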
Since trl supports all keyword arguments when loading a model from transformers using from_pretrained, you can also leverage load_in_8bit from transformers for more memory-efficient fine-tuning. Read more about 8-bit model loading in transformers here.
# 0. imports
# pip install bitsandbytes
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# 1. load a pretrained model
model = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m')
model_ref = AutoModelForCausalLMWithValueHead.from_pretrained('bigscience/bloom-560m', device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')

# 2. initialize trainer
ppo_config = {'batch_size': 1}
config = PPOConfig(**ppo_config)
ppo_trainer = PPOTrainer(config, model, model_ref, tokenizer)
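Note that in the snippet above only the reference model is loaded in 8-bit: it is only ever used for forward passes, while the model being trained is kept in its default precision.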
When training large models, you can better manage the CUDA cache by clearing it iteratively. To do so, simply pass optimize_cuda_cache=True to PPOConfig:
config = PPOConfig(..., optimize_cuda_cache=True)
A small tweak needs to be added to your training script to use DeepSpeed stage 3 correctly: you need to initialize your reward model on the correct device using the zero3_init_context_manager context manager. Here is an example adapted for the gpt2-sentiment script:
ds_plugin = ppo_trainer.accelerator.state.deepspeed_plugin
if ds_plugin is not None and ds_plugin.is_zero3_init_enabled():
    with ds_plugin.zero3_init_context_manager(enable=False):
        sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)
else:
    sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)