Earlier this week, Meta released Llama 2, which includes both a foundational pre-trained model and a chat-specific fine-tuned version, each available in three sizes (7B, 13B, and 70B parameters). The base model already performs well out of the box, but its capabilities on specific language tasks are unlocked through further fine-tuning. That refinement process can be intricate, with challenges ranging from securing appropriate computational resources to correctly setting up training and inference scripts.
To simplify this process, Scale has introduced its new open-source LLM Engine repository. Let's look at how this tool can be used, with a few illustrative examples. With LLM Engine, kicking off a fine-tune takes only a few lines:
from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b",
    training_file="s3://my-bucket/path/to/training-file.csv",
)

print(response.json())
You can also follow the code examples in this notebook.
Dataset

For this example, we will use ScienceQA, a popular dataset consisting of a diverse set of multiple-choice science questions. Each question may include textual context and image context, and comes with a thorough explanation and lecture supporting the solution.
Currently, LLM Engine supports fine-tuning on prompt-completion pairs. Let's first convert this dataset into the supported format, a CSV with two columns: prompt and response.
Before you get started, install the required dependencies.
pip install datasets==2.13.1 smart_open[s3]==5.2.1 pandas==1.4.4
We can load the dataset from Hugging Face and inspect its features.
from datasets import load_dataset
from smart_open import smart_open
import pandas as pd

dataset = load_dataset('derek-thomas/ScienceQA')
dataset['train'].features
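As a quick sanity check, it also helps to print a single raw example; the conversion steps below rely on the question, choices, answer, and hint fields.

# Inspect one raw training example. The conversion code below uses the
# 'question', 'choices', 'answer', and 'hint' fields of each record.
print(dataset['train'][0])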
A commonly used format for feeding ScienceQA examples is:
Context: A baby wants to know what is inside of a cabinet. Her hand applies a force to the door, and the door opens.
Question: Which type of force from the baby's hand opens the cabinet door?
Options: (A) pull (B) push
Answer: A.
Since the options in the Hugging Face dataset are stored only as a list of possible answers, we need to convert this list into the example format above by adding the lettered enumeration prefixes.
choice_prefixes = [chr(ord('A') + i) for i in range(26)]  # A-Z

def format_options(options, choice_prefixes):
    return ' '.join([f'({c}) {o}' for c, o in zip(choice_prefixes, options)])
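As a quick check, calling this helper on the options from the earlier cabinet-door example reproduces the expected string:

# Sanity check: should print '(A) pull (B) push'.
print(format_options(['pull', 'push'], choice_prefixes))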
Now, let's write the formatting function to turn a single sample from this dataset into a prompt and response to feed into the model.
def format_prompt(r, choice_prefixes):
    options = format_options(r['choices'], choice_prefixes)
    return f'''Context: {r["hint"]}\nQuestion: {r["question"]}\nOptions:{options}\nAnswer:'''

def format_response(r, choice_prefixes):
    return choice_prefixes[r['answer']]
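To see these in action, here is what they produce on a small hand-built record that mirrors the ScienceQA fields (the record itself is made up for illustration):

# A hypothetical record with the same fields as a ScienceQA sample.
sample = {
    'hint': 'A baby wants to know what is inside of a cabinet. Her hand applies a force to the door, and the door opens.',
    'question': "Which type of force from the baby's hand opens the cabinet door?",
    'choices': ['pull', 'push'],
    'answer': 0,
}
print(format_prompt(sample, choice_prefixes))   # the Context/Question/Options/Answer prompt
print(format_response(sample, choice_prefixes))  # 'A'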
Finally, let's construct the dataset. Note that some samples in ScienceQA only have an image for context; we'll skip those in this example, since Llama 2 is purely a language model and cannot accept image inputs.
def convert_dataset(ds):
    prompts = [format_prompt(i, choice_prefixes) for i in ds if i['hint'] != '']
    labels = [format_response(i, choice_prefixes) for i in ds if i['hint'] != '']
    df = pd.DataFrame.from_dict({'prompt': prompts, 'response': labels})
    return df
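Before uploading anything, it is worth previewing a few converted rows to confirm the prompts look as intended (an optional check, not required by LLM Engine):

# Preview the first few prompt/response pairs after conversion.
convert_dataset(dataset['train']).head()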
LLM Engine supports training with both a training and a validation dataset. If you provide only a training dataset, LLM Engine will randomly split off 10% of it for validation. Holding out a validation set lets you detect when the model is overfitting your training data, which would result in poor generalization to live data it sees during inference.
These dataset files must be stored at a publicly accessible URL so that LLM Engine can read them. For this example, we will save the datasets to S3. We also have the preprocessed train dataset and validation dataset publicly available as GitHub Gists; you can directly replace train_url and val_url with those links.
train_url = 's3://...'
val_url = 's3://...'

df_train = convert_dataset(dataset['train'])
with smart_open(train_url, 'wb') as f:
    df_train.to_csv(f)

df_val = convert_dataset(dataset['validation'])
with smart_open(val_url, 'wb') as f:
    df_val.to_csv(f)
Now, we can start fine-tuning via the LLM Engine API.
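The call below mirrors the one-liner from the start of this post, now pointed at our uploaded CSVs; it assumes FineTune.create accepts a validation_file argument alongside training_file, per the LLM Engine documentation:

from llmengine import FineTune

# Kick off a fine-tune of Llama 2 7B on the prompt/response CSVs saved above.
# If validation_file is omitted, LLM Engine holds out 10% of the training
# data for validation, as noted earlier.
response = FineTune.create(
    model="llama-2-7b",
    training_file=train_url,
    validation_file=val_url,
)

print(response.json())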