TinyLlama Colorist - fine-tuned with Color dataset

Minyang Chen
4 min read · Oct 7, 2023


Photo by Pawel Czerwinski on Unsplash

A project that recently caught my attention is TinyLlama, which aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens. It adopts exactly the same architecture and tokenizer as Llama 2. With only 1.1B parameters, this compactness allows it to cater to a multitude of applications that demand a restricted computation and memory footprint. Plus, TinyLlama can be plugged into many open-source projects built on Llama. It is also blazingly fast, with support for the following features:

  • multi-GPU and multi-node distributed training with FSDP.
  • flash attention 2.
  • fused layernorm.
  • fused swiglu.
  • fused cross entropy loss.
  • fused rotary positional embedding.

The training target is to complete within a span of "just" 90 days using 16 A100-40G GPUs. Training started on 2023-09-01, and the project publishes a live track of the cross-entropy loss.

## Training Details
Below are some details of the TinyLlama training setup:

| Setting | Description |
|---------------------------------|----------------------------------------------------------------|
| Parameters | 1.1B |
| Attention Variant | Grouped Query Attention |
| Model Size | Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size (Swiglu): 5632|
| Sequence Length | 2048 |
| Batch Size | 2 million tokens (2048 * 1024) |
| Learning Rate | 4e-4 |
| Learning Rate Schedule | Cosine with 2000 warmup steps. See [Issue 27](https://github.com/jzhang38/TinyLlama/issues/27) for a minor bug |
| Training Data | [Slimpajama](https://huggingface.co/datasets/cerebras/slimpajama-627b) & [Starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata) |
| Data Preprocessing | Excluded GitHub subset of Slimpajama; Sampled all code from Starcoderdata |
| Combined Dataset Size | Around 950B tokens |
| Total Tokens During Training | 3 trillion (slightly more than 3 epochs/1430k steps) |
| Natural Language to Code Ratio | 7:3 |
| Hardware | 16 A100-40G GPUs |

The net result is faster training and inference, which makes TinyLlama worth exploring for fine-tuning. The fine-tuning experiment in this article uses a color dataset to turn TinyLlama into a colorist expert.

How does it work?

You input a description of a color, and the LLM returns a single color code value, like this:

  User Input: Give me a sky blue color.
LLM response: #6092ff

One of the challenges in this fine-tuning process is limiting the model's response to a single color value.

Notebook: Finetune_TinyLlama_with_colorist.ipynb
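
Before training, it helps to peek at the raw data. Here is a quick sketch for inspecting a sample (the dataset exposes description and color fields, as used in the training code below; the exact record printed depends on row order):

from datasets import load_dataset

# Inspect one raw record from the color dataset
data = load_dataset("burkelibbey/colors", split="train")
print(data)       # row count and column names
print(data[0])    # e.g. {'color': '#...', 'description': '...'}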

Model Training

The training process is similar to fine-tuning a regular Llama 2 model, using a chat prompt format like this:

<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}<|im_end|>\n
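
Concretely, using the sky blue example from earlier, a single training sample rendered with this template looks like:

<|im_start|>user
Give me a sky blue color.<|im_end|>
<|im_start|>assistant
#6092ff<|im_end|>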

Here is the training code:

import os

import torch
from datasets import load_dataset, Dataset
from peft import LoraConfig, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

dataset_id = "burkelibbey/colors"
base_model_id = "PY007/TinyLlama-1.1B-Chat-v0.3"
model_id_colorist_lora = "mychen76/tinyllama-colorist-lora"

def formatted_train(question, answer) -> str:
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n{answer}<|im_end|>\n"

def prepare_train_data(data_id):
    # Load the color dataset and build the chat-formatted "text" column
    data = load_dataset(data_id, split="train")
    data_df = data.to_pandas()
    data_df["text"] = data_df[["description", "color"]].apply(
        lambda x: "<|im_start|>user\n" + x["description"] + " <|im_end|>\n<|im_start|>assistant\n" + x["color"] + "<|im_end|>\n",
        axis=1,
    )
    data = Dataset.from_pandas(data_df)
    return data

def get_model_and_tokenizer(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    # 4-bit NF4 quantization so the 1.1B model fits in a small GPU memory budget
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16",
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    return model, tokenizer

def finetune_tinyllama(data_id, base_model_id, model_id_colorist_lora):
    data = prepare_train_data(data_id)
    model, tokenizer = get_model_and_tokenizer(base_model_id)

    # LoRA adapter configuration
    peft_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    )
    training_arguments = TrainingArguments(
        output_dir=model_id_colorist_lora,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=3,
        max_steps=200,
        fp16=True,
        push_to_hub=True,
    )
    trainer = SFTTrainer(
        model=model,
        train_dataset=data,
        peft_config=peft_config,
        dataset_text_field="text",
        args=training_arguments,
        tokenizer=tokenizer,
        packing=False,
        max_seq_length=1024,
    )
    trainer.train()
    trainer.push_to_hub()

if __name__ == "__main__":
    finetune_tinyllama(dataset_id, base_model_id, model_id_colorist_lora)
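
Note that push_to_hub uploads only the LoRA adapter weights, while the inference section below loads a standalone checkpoint (tinyllama-colorist-v2). A minimal merge sketch, assuming the adapter and tokenizer were pushed under the repo name from the training script, would look like this:

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load base model + LoRA adapter, then fold the adapter weights into the base model
merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "mychen76/tinyllama-colorist-lora", torch_dtype=torch.float16
).merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained("mychen76/tinyllama-colorist-lora")
merged_model.save_pretrained("tinyllama-colorist-merged")   # local folder name is arbitrary
tokenizer.save_pretrained("tinyllama-colorist-merged")

The merged folder can then be pushed to the Hub or loaded directly with AutoModelForCausalLM.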

Model Inference

Using Hugging Face Transformers, we have a few inference options; see below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from transformers import pipeline

model_id_colorist_final = "mychen76/tinyllama-colorist-v2"

def print_color_space(hex_color):
    # Convert the hex code to RGB and print a colored swatch using 24-bit ANSI escape codes
    def hex_to_rgb(hex_color):
        hex_color = hex_color.lstrip('#')
        return tuple(int(hex_color[i:i+2], 16) for i in (0, 2, 4))
    r, g, b = hex_to_rgb(hex_color)
    print(f'{hex_color}: \033[48;2;{r};{g};{b}m  \033[0m')

def formatted_prompt(question) -> str:
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant:"
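
The print_color_space helper is just a convenience for eyeballing results in a terminal that supports true-color ANSI codes. For example:

print_color_space("#6092ff")   # prints the hex code followed by a sky-blue swatch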

Run pipeline

tokenizer = AutoTokenizer.from_pretrained(model_id_colorist_final)
pipe = pipeline(
    "text-generation",
    model=model_id_colorist_final,
    torch_dtype=torch.float16,
    device_map="auto",
)

from time import perf_counter
start_time = perf_counter()

prompt = formatted_prompt('give me a pure brown color')
sequences = pipe(
    prompt,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=12,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

output_time = perf_counter() - start_time
print(f"Time taken for inference: {round(output_time,2)} seconds")

Input: ‘give me a pure brown color’ >> Result: `#807070` in 0.21 seconds

Result: <|im_start|>user
give me a pure brown color<|im_end|>
<|im_start|>assistant: #807070<|im_end
Time taken for inference: 0.21 seconds
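
Because the fine-tuning goal is a single color value, the generated text can be post-processed to strip the chat markup and keep only the hex code. This small helper is not part of the original notebook, just one way to do it:

import re

def extract_hex(generated_text):
    # Return the first #RRGGBB pattern found in the model output, or None
    match = re.search(r"#[0-9a-fA-F]{6}", generated_text)
    return match.group(0) if match else None

hex_code = extract_hex(sequences[0]["generated_text"])
print(hex_code)              # e.g. #807070
print_color_space(hex_code)  # render the swatch using the helper defined earlier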

Run streaming

tokenizer = AutoTokenizer.from_pretrained(model_id_colorist_final)
model = AutoModelForCausalLM.from_pretrained(model_id_colorist_final)

prompt = formatted_prompt('give me a deep blue color')
inputs = tokenizer([prompt], return_tensors="pt")
streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, eos_token_id=[tokenizer.eos_token_id], streamer=streamer, max_new_tokens=10)

Response: `#000088`

<s><|im_start|>user
give me a deep blue color<|im_end|>
<|im_start|>assistant: #000088<|im

Lastly, let's measure the generation performance with regular inference:

from transformers import GenerationConfig
from time import perf_counter

prompt = formatted_prompt('give me a sky blue color')

# Decoding settings for the timing test
generation_config = GenerationConfig(
    penalty_alpha=0.6,
    do_sample=True,
    top_k=5,
    temperature=0.5,
    repetition_penalty=1.2,
    max_new_tokens=12,
    pad_token_id=tokenizer.eos_token_id,
)

start_time = perf_counter()
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output_time = perf_counter() - start_time
print(f"Time taken for inference: {round(output_time,2)} seconds")

Result: `#6092ff` in 2 seconds

<|im_start|>user
give me a sky blue color<|im_end|>
<|im_start|>assistant: #6092ff<|im_end|
Time taken for inference: 2.0 seconds

Conclusion:

Fine-tuning TinyLlama is faster than fine-tuning a Llama 2 model, and the inference speed is impressive: about 0.21 seconds with the pipeline and under 2 seconds with regular generation. For this colorist experiment, the quality is as good as a regular Llama 2 7B model. More importantly, VRAM usage stays below 6 GB. So it delivers what the model designers intended to achieve. Good job!

I hope you like this model as much as I do. Enjoy it!

Thanks for reading.
