# TinyLlama Colorist - Fine-Tuned with a Color Dataset
A project that recently caught my attention is TinyLlama, which aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. It adopts exactly the same architecture and tokenizer as Llama 2, and with only 1.1B parameters it is compact enough to serve a multitude of applications that demand a restricted computation and memory footprint. TinyLlama can also be plugged into many open-source projects built on Llama. Moreover, it is blazingly fast and supports the following features:
- Multi-GPU and multi-node distributed training with FSDP
- Flash Attention 2
- Fused LayerNorm
- Fused SwiGLU
- Fused cross-entropy loss
- Fused rotary positional embedding
The training was targeted to complete within a span of "just" 90 days using 16 A100-40G GPUs and started on 2023-09-01; the project publishes a live track of the cross-entropy loss.
## Training Details
Below are some details of our training setup:
| Setting | Description |
|---------------------------------|----------------------------------------------------------------|
| Parameters | 1.1B |
| Attention Variant | Grouped Query Attention |
| Model Size | Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size (Swiglu): 5632|
| Sequence Length | 2048 |
| Batch Size | 2 million tokens (2048 * 1024) |
| Learning Rate | 4e-4 |
| Learning Rate Schedule | Cosine with 2000 warmup steps. See [Issue 27](https://github.com/jzhang38/TinyLlama/issues/27) for a minor bug |
| Training Data | [Slimpajama](https://huggingface.co/datasets/cerebras/slimpajama-627b) & [Starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) |
| Data Preprocessing | Excluded GitHub subset of Slimpajama; Sampled all code from Starcoderdata |
| Combined Dataset Size | Around 950B tokens |
| Total Tokens During Training | 3 trillion (slightly more than 3 epochs/1430k steps) |
| Natural Language to Code Ratio | 7:3 |
| Hardware | 16 A100-40G GPUs |
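A quick sanity check on these numbers: at a batch size of 2048 × 1024 ≈ 2.1M tokens per step, 1430k steps work out to roughly 1,430,000 × 2,097,152 ≈ 3.0 × 10^12 tokens, and 3T tokens over a ~950B-token corpus is a little more than 3 epochs, which matches the table.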
The net result is faster training and inference, which makes TinyLlama worth exploring for fine-tuning. This fine-tuning experiment uses a color dataset to turn TinyLlama into a colorist expert.
## How it works
You give the model a description of a color, and it returns a single hex color code, like this:
User Input: Give me a sky blue color.
LLM response: #6092ff
One of the challenges in this fine-tuning process is limiting the model's response to a single color value.
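In the notebook this is handled mostly by the training data itself (the assistant turn is just a hex code terminated by `<|im_end|>`) together with a small `max_new_tokens` at inference time. As an extra guard, the hex code can also be pulled out of the raw generation with a regex. This is only a sketch, not part of the notebook; `extract_hex_color` is a hypothetical helper:

```python
import re
from typing import Optional

def extract_hex_color(generated_text: str) -> Optional[str]:
    """Return the first #RRGGBB code in the model output, or None if there is none."""
    match = re.search(r"#[0-9a-fA-F]{6}", generated_text)
    return match.group(0) if match else None

# Example on a raw generation that still contains the chat template:
raw = "<|im_start|>user\ngive me a sky blue color<|im_end|>\n<|im_start|>assistant: #6092ff<|im"
print(extract_hex_color(raw))  # -> #6092ff
```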
Notebook: `Finetune_TinyLlama_with_colorist.ipynb`
## Model Training
The training process is similar to that of a regular Llama 2 model, using a chat prompt format like this:
`<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}<|im_end|>\n`
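For example, the sky-blue sample from above would be rendered into a single training string like this (newlines expanded):

```
<|im_start|>user
Give me a sky blue color.<|im_end|>
<|im_start|>assistant
#6092ff<|im_end|>
```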
Here is the training code:
import torch
from datasets import load_dataset, Dataset
from peft import LoraConfig, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

dataset_id = "burkelibbey/colors"
base_model_id = "PY007/TinyLlama-1.1B-Chat-v0.3"
model_id_colorist_lora = "mychen76/tinyllama-colorist-lora"

def formatted_train(question, answer) -> str:
    # Render one sample in the chat template described above
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}<|im_end|>\n"

def prepare_train_data(data_id):
    # Load the color dataset and build a "text" column in the chat format
    data = load_dataset(data_id, split="train")
    data_df = data.to_pandas()
    data_df["text"] = data_df[["description", "color"]].apply(
        lambda x: "<|im_start|>user\n" + x["description"] + " <|im_end|>\n<|im_start|>assistant\n"
        + x["color"] + "<|im_end|>\n",
        axis=1,
    )
    data = Dataset.from_pandas(data_df)
    return data

def get_model_and_tokenizer(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    # Load the base model in 4-bit (QLoRA-style) to keep VRAM usage low
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16",
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    return model, tokenizer

def finetune_tinyllama(data_id, base_model_id, model_id_colorist_lora):
    data = prepare_train_data(data_id)
    model, tokenizer = get_model_and_tokenizer(base_model_id)
    # LoRA adapter configuration
    peft_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    )
    training_arguments = TrainingArguments(
        output_dir=model_id_colorist_lora,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=3,
        max_steps=200,          # max_steps overrides num_train_epochs
        fp16=True,
        push_to_hub=True,
    )
    trainer = SFTTrainer(
        model=model,
        train_dataset=data,
        peft_config=peft_config,
        dataset_text_field="text",
        args=training_arguments,
        tokenizer=tokenizer,
        packing=False,
        max_seq_length=1024,
    )
    trainer.train()
    trainer.push_to_hub()

if __name__ == "__main__":
    finetune_tinyllama(dataset_id, base_model_id, model_id_colorist_lora)
- View project at https://wandb.ai/mychen76/huggingface
- View run at https://wandb.ai/mychen76/huggingface/runs/etjkgg08
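The trainer above pushes only the LoRA adapter. To get a standalone checkpoint such as the `mychen76/tinyllama-colorist-v2` model used in the inference section, the adapter can be merged back into the base weights; this is what `AutoPeftModelForCausalLM` is typically used for. A minimal sketch, assuming the adapter repo produced by the training step and a local output directory of your choosing:

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "mychen76/tinyllama-colorist-lora"    # adapter pushed by the trainer above
merged_dir = "tinyllama-colorist-merged"           # hypothetical local output directory

# Load base model + LoRA adapter in one call, then fold the adapter into the base weights
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id, torch_dtype=torch.float16, device_map="auto"
)
merged_model = model.merge_and_unload()

# Save the merged model plus the base tokenizer so it loads with plain transformers
merged_model.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-Chat-v0.3").save_pretrained(merged_dir)
```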
## Model Inference
Using Hugging Face Transformers, we have a few inference options; see below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline

model_id_colorist_final = "mychen76/tinyllama-colorist-v2"

def print_color_space(hex_color):
    # Render the hex code as a colored block using an ANSI escape sequence
    def hex_to_rgb(hex_color):
        hex_color = hex_color.lstrip('#')
        return tuple(int(hex_color[i:i+2], 16) for i in (0, 2, 4))
    r, g, b = hex_to_rgb(hex_color)
    print(f'{hex_color}: \033[48;2;{r};{g};{b}m \033[0m')

def formatted_prompt(question) -> str:
    # Inference-time prompt: same chat template, but the assistant turn is left open
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant:"
### Run pipeline
tokenizer = AutoTokenizer.from_pretrained(model_id_colorist_final)
pipe = pipeline(
    "text-generation",
    model=model_id_colorist_final,
    torch_dtype=torch.float16,
    device_map="auto",
)

from time import perf_counter
start_time = perf_counter()

prompt = formatted_prompt('give me a pure brown color')
sequences = pipe(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=12,   # a hex code only needs a handful of tokens
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

output_time = perf_counter() - start_time
print(f"Time taken for inference: {round(output_time, 2)} seconds")
Input: `give me a pure brown color` >> Result: `#807070` in 0.21 seconds
Result: <|im_start|>user
give me a pure brown color<|im_end|>
<|im_start|>assistant: #807070<|im_end
Time taken for inference: 0.21 seconds
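To eyeball the result, you can pull the hex code out of `generated_text` and preview it with the `print_color_space` helper defined earlier (reusing the `extract_hex_color` sketch from above):

```python
# Extract the hex code from the pipeline output and render it as a colored block
color = extract_hex_color(sequences[0]["generated_text"])
if color:
    print_color_space(color)  # e.g. prints "#807070" next to a brown swatch
```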
### Run streaming
tokenizer = AutoTokenizer.from_pretrained(model_id_colorist_final)
model = AutoModelForCausalLM.from_pretrained(model_id_colorist_final)

prompt = formatted_prompt('give me a deep blue color')
inputs = tokenizer([prompt], return_tensors="pt")

# TextStreamer prints tokens to stdout as they are generated
streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, eos_token_id=[tokenizer.eos_token_id], streamer=streamer, max_new_tokens=10)
Response: `#000088`
<s><|im_start|>user
give me a deep blue color<|im_end|>
<|im_start|>assistant: #000088<|im
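`TextStreamer` prints tokens straight to stdout, which is fine in a notebook. If you need to consume tokens programmatically (for example in a web UI), Transformers also ships `TextIteratorStreamer`; here is a minimal sketch with generation running in a background thread:

```python
from threading import Thread
from transformers import TextIteratorStreamer

prompt = formatted_prompt('give me a deep blue color')
inputs = tokenizer([prompt], return_tensors="pt")

# Iterator-style streamer: generate() runs in a worker thread while tokens are consumed here
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
thread = Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=10,
                eos_token_id=tokenizer.eos_token_id),
)
thread.start()
for text_chunk in streamer:
    print(text_chunk, end="", flush=True)
thread.join()
```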
Lastly, let's measure the generation performance with regular (non-streaming) inference:
from transformers import GenerationConfig
from time import perf_counter

prompt = formatted_prompt('give me a sky blue color')

generation_config = GenerationConfig(
    penalty_alpha=0.6,
    do_sample=True,
    top_k=5,
    temperature=0.5,
    repetition_penalty=1.2,
    max_new_tokens=12,
    pad_token_id=tokenizer.eos_token_id,
)

start_time = perf_counter()
inputs = tokenizer(prompt, return_tensors="pt")  # tokenization is included in the measured time
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output_time = perf_counter() - start_time
print(f"Time taken for inference: {round(output_time, 2)} seconds")
Result: `#6092ff` in 2 seconds
<|im_start|>user
give me a sky blue color<|im_end|>
<|im_start|>assistant: #6092ff<|im_end|
Time taken for inference: 2.0 seconds
## Conclusion
Fine-tuning TinyLlama is faster than fine-tuning a Llama 2 model, and the inference speed is impressive: about 0.21 seconds with the pipeline and under 2 seconds with regular generation. For this colorist experiment, the quality is as good as a regular Llama 2 7B model. More importantly, VRAM usage stays below 6 GB. So it delivers what the model designers intended to achieve. Good job!
I hope you like this model as much as I do. Enjoy it!
Thanks for reading.
## Reference