TinyLlama Colorist - fine-tuned with Color dataset

Minyang Chen
4 min read · Oct 7, 2023


Photo by Pawel Czerwinski on Unsplash

A project that recently caught my attention is TinyLlama, which aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. It adopts exactly the same architecture and tokenizer as Llama 2. With only 1.1B parameters, this compactness allows it to cater to a multitude of applications that demand a restricted computation and memory footprint. Plus, TinyLlama can be plugged into many open-source projects built on Llama. It is also blazingly fast, with support for the following features:

  • multi-GPU and multi-node distributed training with FSDP.
  • flash attention 2.
  • fused layernorm.
  • fused swiglu.
  • fused cross entropy loss.
  • fused rotary positional embedding.

The training target is to complete within a span of "just" 90 days using 16 A100-40G GPUs. Training started on 2023-09-01, and the project publishes a live track of the cross-entropy loss.

## Training Details
Below are some details of the TinyLlama training setup:

| Setting | Description |
|---------------------------------|----------------------------------------------------------------|
| Parameters | 1.1B |
| Attention Variant | Grouped Query Attention |
| Model Size | Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size (Swiglu): 5632|
| Sequence Length | 2048 |
| Batch Size | 2 million tokens (2048 * 1024) |
| Learning Rate | 4e-4 |
| Learning Rate Schedule | Cosine with 2000 warmup steps. See [Issue 27](https://github.com/jzhang38/TinyLlama/issues/27) for a minor bug |
| Training Data | [Slimpajama](https://huggingface.co/datasets/cerebras/slimpajama-627b) & [Starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) |
| Data Preprocessing | Excluded GitHub subset of Slimpajama; Sampled all code from Starcoderdata |
| Combined Dataset Size | Around 950B tokens |
| Total Tokens During Training | 3 trillion (slightly more than 3 epochs/1430k steps) |
| Natural Language to Code Ratio | 7:3 |
| Hardware | 16 A100-40G GPUs |

The net result is faster training and inference, which makes TinyLlama worth exploring for fine-tuning. The fine-tuning experiment in this article uses a color dataset to turn TinyLlama into a colorist expert.

How does it work?

You input a description of a color, and the LLM returns a single color code value, like this:

  User Input: Give me a sky blue color.
LLM response: #6092ff

One of the challenges in this fine-tuning process is limiting the model's response to a single color value.

Notebook: Finetune_TinyLlama_with_colorist.ipynb
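
Before training, it helps to peek at the raw data. Here is a quick sketch for inspecting a sample (the dataset exposes description and color fields, as used in the training code below; the exact record printed depends on row order):

from datasets import load_dataset

# Inspect one raw record from the color dataset
data = load_dataset("burkelibbey/colors", split="train")
print(data)       # row count and column names
print(data[0])    # e.g. {'color': '#...', 'description': '...'}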

Model Training

The training process is similar to fine-tuning a regular Llama 2 model, using a chat prompt format like this:

<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}<|im_end|>\n
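
Concretely, using the sky blue example from earlier, a single training sample rendered with this template looks like:

<|im_start|>user
Give me a sky blue color.<|im_end|>
<|im_start|>assistant
#6092ff<|im_end|>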

Here is the training code:

import os

import torch
from datasets import load_dataset, Dataset
from peft import LoraConfig, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

dataset_id = "burkelibbey/colors"
base_model_id = "PY007/TinyLlama-1.1B-Chat-v0.3"
model_id_colorist_lora = "mychen76/tinyllama-colorist-lora"

def formatted_train(question, answer) -> str:
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}<|im_end|>\n"

def prepare_train_data(data_id):
    # Load the color dataset and build the chat-formatted "text" column
    data = load_dataset(data_id, split="train")
    data_df = data.to_pandas()
    data_df["text"] = data_df[["description", "color"]].apply(
        lambda x: "<|im_start|>user\n" + x["description"] + " <|im_end|>\n<|im_start|>assistant\n" + x["color"] + "<|im_end|>\n",
        axis=1,
    )
    data = Dataset.from_pandas(data_df)
    return data

def get_model_and_tokenizer(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    # 4-bit NF4 quantization so the 1.1B model fits in a small GPU memory budget
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16",
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    return model, tokenizer

def finetune_tinyllama(data_id, base_model_id, model_id_colorist_lora):
    data = prepare_train_data(data_id)
    model, tokenizer = get_model_and_tokenizer(base_model_id)

    # LoRA adapter configuration
    peft_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    )
    training_arguments = TrainingArguments(
        output_dir=model_id_colorist_lora,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=3,
        max_steps=200,
        fp16=True,
        push_to_hub=True,
    )
    trainer = SFTTrainer(
        model=model,
        train_dataset=data,
        peft_config=peft_config,
        dataset_text_field="text",
        args=training_arguments,
        tokenizer=tokenizer,
        packing=False,
        max_seq_length=1024,
    )
    trainer.train()
    trainer.push_to_hub()

if __name__ == "__main__":
    finetune_tinyllama(dataset_id, base_model_id, model_id_colorist_lora)
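
Note that push_to_hub uploads only the LoRA adapter weights, while the inference section below loads a standalone checkpoint (tinyllama-colorist-v2). A minimal merge sketch, assuming the adapter and tokenizer were pushed under the repo name from the training script, would look like this:

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load base model + LoRA adapter, then fold the adapter weights into the base model
merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "mychen76/tinyllama-colorist-lora", torch_dtype=torch.float16
).merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained("mychen76/tinyllama-colorist-lora")
merged_model.save_pretrained("tinyllama-colorist-merged")   # local folder name is arbitrary
tokenizer.save_pretrained("tinyllama-colorist-merged")

The merged folder can then be pushed to the Hub or loaded directly with AutoModelForCausalLM.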

Model Inference

Using Hugging Face Transformers, we have a few inference options; see below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from transformers import pipeline

model_id_colorist_final = "mychen76/tinyllama-colorist-v2"

def print_color_space(hex_color):
    # Convert the hex code to RGB and print a colored swatch using 24-bit ANSI escape codes
    def hex_to_rgb(hex_color):
        hex_color = hex_color.lstrip('#')
        return tuple(int(hex_color[i:i+2], 16) for i in (0, 2, 4))
    r, g, b = hex_to_rgb(hex_color)
    print(f'{hex_color}: \033[48;2;{r};{g};{b}m  \033[0m')

def formatted_prompt(question) -> str:
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant:"
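
The print_color_space helper is just a convenience for eyeballing results in a terminal that supports true-color ANSI codes. For example:

print_color_space("#6092ff")   # prints the hex code followed by a sky-blue swatch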

Run pipeline

tokenizer = AutoTokenizer.from_pretrained(model_id_colorist_final)
pipe = pipeline(
    "text-generation",
    model=model_id_colorist_final,
    torch_dtype=torch.float16,
    device_map="auto",
)

from time import perf_counter
start_time = perf_counter()

prompt = formatted_prompt('give me a pure brown color')
sequences = pipe(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=12,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

output_time = perf_counter() - start_time
print(f"Time taken for inference: {round(output_time,2)} seconds")

Input: ‘give me a pure brown color’ >> Result: `#807070` in 0.21 seconds

Result: <|im_start|>user
give me a pure brown color<|im_end|>
<|im_start|>assistant: #807070<|im_end
Time taken for inference: 0.21 seconds
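
Because the fine-tuning goal is a single color value, the generated text can be post-processed to strip the chat markup and keep only the hex code. This small helper is not part of the original notebook, just one way to do it:

import re

def extract_hex(generated_text):
    # Return the first #RRGGBB pattern found in the model output, or None
    match = re.search(r"#[0-9a-fA-F]{6}", generated_text)
    return match.group(0) if match else None

hex_code = extract_hex(sequences[0]["generated_text"])
print(hex_code)              # e.g. #807070
print_color_space(hex_code)  # render the swatch using the helper defined earlier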

Run streaming

tokenizer = AutoTokenizer.from_pretrained(model_id_colorist_final)
model = AutoModelForCausalLM.from_pretrained(model_id_colorist_final)

prompt = formatted_prompt('give me a deep blue color')
inputs = tokenizer([prompt], return_tensors="pt")
streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, eos_token_id=[tokenizer.eos_token_id], streamer=streamer, max_new_tokens=10)

Response: `#000088`

<s><|im_start|>user
give me a deep blue color<|im_end|>
<|im_start|>assistant: #000088<|im

Lastly, let's measure the generation performance with regular inference:

from transformers import GenerationConfig
from time import perf_counter

prompt = formatted_prompt('give me a sky blue color')

# Decoding settings for the timing test
generation_config = GenerationConfig(
    penalty_alpha=0.6,
    do_sample=True,
    top_k=5,
    temperature=0.5,
    repetition_penalty=1.2,
    max_new_tokens=12,
    pad_token_id=tokenizer.eos_token_id,
)

start_time = perf_counter()
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output_time = perf_counter() - start_time
print(f"Time taken for inference: {round(output_time,2)} seconds")

Result: `#6092ff` in 2 seconds

<|im_start|>user
give me a sky blue color<|im_end|>
<|im_start|>assistant: #6092ff<|im_end|
Time taken for inference: 2.0 seconds

Conclusion:

Fine-tuning TinyLlama is faster than fine-tuning a Llama 2 model, and the inference speed is impressive: about 0.21 seconds with the pipeline and under 2 seconds with regular generation. For this colorist experiment, the quality is as good as a regular Llama 2 7B model. More importantly, VRAM usage stays below 6 GB. So it delivers what the model designers intended to achieve. Good job!

I hope you like this model as much as I do. Enjoy it!

Thanks for reading.
