Fine-tuning LLM Model (Llama2) for inverse information generation

Minyang Chen
7 min read · Aug 13, 2023



The hype around Large Language Models (LLMs) started with ChatGPT. These models are so good at understanding and generating text that they shocked people, myself included.

Training a Large Language Model from scratch is a long and very costly process that comes with a lot of hardware issues. The release of the Llama 2 pre-trained models offers a highly efficient base model along with a more permissive license, which means we can use these models in our own commercial applications.

However, a base LLM is pre-trained on a gigantic text corpus, frequently billions or even trillions of tokens. These are general-purpose models, and there are still many unknowns at this moment: problems such as bias, prompt injection, and toxicity in the pre-training dataset and in the base LLM.

The idea behind fine-tuning a model is to update and expand its internal knowledge and personalize it to specific needs. Fine-tuning (or instruction tuning) is set to become a standard procedure in the LLMOps workflow because of the potential for cost savings, the ability to process confidential data, and even the potential to develop models that exceed the performance of prominent models like ChatGPT and GPT-4 on certain specific tasks.

[LLM Base Model] → [Tuning with custom instructions] = [Optimized Model]

In this new paradigm, instruction datasets are the new gold, and the quality of your model heavily depends on the data on which it’s been fine-tuned. So, building high-quality datasets is essential.

The fine-tuning step is relatively cheap in terms of computation cost thanks to parameter-efficient fine-tuning (PEFT) techniques like LoRA and QLoRA. LoRA is a training method designed to speed up the training of large language models while reducing memory consumption. It adds pairs of rank-decomposition weight matrices, known as update matrices, to the existing weights and trains only these newly added weights. Flash Attention is a method that reorders the attention computation and leverages classical techniques (tiling, recomputation) to speed it up significantly (around 3x) and reduce memory usage from quadratic to linear in sequence length.
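To make the saving concrete, here is a small sketch of the parameter arithmetic behind LoRA. The 4096 × 4096 layer shape is an assumption chosen for illustration, and the rank of 64 is the value used later in this post. It only counts parameters for a single weight matrix; it is not a training implementation.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full_params, lora_params) for one weight matrix.

    Full fine-tuning updates the whole d_out x d_in matrix W.
    LoRA freezes W and trains two small matrices, B (d_out x rank)
    and A (rank x d_in), whose product B @ A is added to W.
    """
    full_params = d_out * d_in
    lora_params = d_out * rank + rank * d_in
    return full_params, lora_params


# Assumed layer shape for illustration: a 4096 x 4096 projection.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=64)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For this assumed shape, the trainable update matrices hold only a few percent of the parameters in the frozen weight matrix, which is where most of the memory saving during training comes from.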

This post covers supervised fine-tuning of an LLM: Llama 2 from Meta AI.

Here, I would like to share my experiment with the concept of LLM reverse-thinking: swapping the training dataset so that the responses become the input to the model. This way the model learns to filter a long description or conversation back down to its key points (knowledge condensation and distillation) using specific domain knowledge. See notebook code here.

This technique has practical use in the customer support space to speed up resolution time. For example, a customer gives you a long description of a problem over the phone. The technical support person needs to quickly narrow it down to the specific problem or investigation area in order to help. A potential AI solution could transcribe the customer's voice into text and feed it into the LLM to generate an answer about the specific problem or resolution, reducing the customer's wait time.

Let’s go through a simple example to get the point across:

Example-1: someone gives you this description:
"1. Mashing 2. Separation 3. Boiling 4. Fermentation. The ingredients are brought together through these 4 steps."

### Generated Response:
What are the 4 steps in the brewing process?

How the model generates the answer will depend on the training dataset applied to the model fine-tuning task.

Tell me more about Llama 2

Model Architecture: Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety.

Variations: Llama 2 comes in a range of parameter sizes (7B, 13B, and 70B) as well as pretrained and fine-tuned variations.

All models are trained with a global batch-size of 4M tokens. Bigger models — 70B — use Grouped-Query Attention (GQA) for improved inference scalability.

Model Dates: Llama 2 was trained between January 2023 and July 2023.

Intended Use Cases: Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.

How to fine-tune Llama 2

You can fine-tune a Llama 2 model with 7 billion parameters on a T4 GPU with high RAM using Google Colab (2.21 credits/hour). Note that a T4 only has 16 GB of VRAM, which is barely enough to store Llama 2-7b’s weights (7b × 2 bytes = 14 GB in FP16).
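The arithmetic above can be written as a short sketch. It estimates memory for the weights only (no activations, optimizer state, or framework overhead) and treats "7B" as exactly 7 × 10^9 parameters, which is an approximation. The 4-bit figure corresponds to the quantized loading discussed later in this post.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# "7B" taken at face value; the exact parameter count differs slightly.
N_PARAMS = 7e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(N_PARAMS, 4)   # half a byte per parameter
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the FP16 footprint, which leaves room on a 16 GB card for everything else training needs.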

The overall fine-tuning process can be summarised in the following steps:

1. Set up the training environment

2. Select a base LLM, e.g. Llama 2

3. Prepare the training dataset in the selected LLM's format

4. Fine-tune and optimize the model

5. Load and use the trained model

1. Setup Training Environment

There are a few choices available here. You can use your own home PC, or a cloud provider service (free or paid) such as Google Colab (free version) or Amazon SageMaker Studio (free version). Both cloud environments provide access to a T4 GPU with 16 GB of VRAM. One advantage Amazon SageMaker Studio has over Google Colab is a CPU core count of 4 versus 2 and a full version of JupyterLab. However, Google Colab is friendlier for sharing notebooks for collaboration purposes.

For this experiment, I will use my own home PC, which has an 8-core AMD CPU, 64 GB of RAM, and an RTX 3090, for faster training and better cost effectiveness.

2. Select a Base Model

The Llama model used for this experiment is "NousResearch/Llama-2-7b-hf".

3. Prepare a Training Dataset

There are several ways to create an instruction dataset: convert an existing dataset into an instruction dataset (e.g. FLAN), use existing LLMs to create synthetic instruction datasets (e.g. Alpaca), or have humans create instruction datasets (e.g. Dolly). Each method has its own advantages and disadvantages.

To keep it simple, I will use Dolly (databricks-dolly-15k), an open source dataset created by Databricks employees across several behavioral categories outlined in the [InstructGPT paper], including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

Essentially, I am swapping the roles of the two Dolly fields: the Dolly Response becomes the Input, and the Dolly Instruction becomes the Response, in a format like this:

### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.
### Input:
### Response:
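
A minimal sketch of this swap is shown below, assuming each Dolly record is a dictionary with `instruction` and `response` keys. The helper name `to_inverse_prompt` and the sample record are hypothetical, and Dolly's optional `context` field is ignored here for simplicity; the notebook may handle it differently.

```python
INVERSE_INSTRUCTION = (
    "Use the Input below to create an instruction, "
    "which could have been used to generate the input using an LLM."
)


def to_inverse_prompt(record: dict) -> str:
    """Format one Dolly-style record as an inverse training example.

    The original response becomes the Input, and the original
    instruction becomes the Response the model should learn to produce.
    """
    return (
        f"### Instruction:\n{INVERSE_INSTRUCTION}\n"
        f"### Input:\n{record['response']}\n"
        f"### Response:\n{record['instruction']}"
    )


# Hypothetical record built from the brewing example in this post.
sample = {
    "instruction": "What are the 4 steps in the brewing process?",
    "response": (
        "1. Mashing 2. Separation 3. Boiling 4. Fermentation. "
        "The ingredients are brought together through these 4 steps."
    ),
}
print(to_inverse_prompt(sample))
```

Keeping the instruction text fixed for every example means the model only has to learn one task: mapping a passage back to the question that could have produced it.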

4. Model Fine-tuning and Optimization

LoRA training requires specifying a few hyperparameters exclusive to LoRA. To drastically reduce VRAM usage, we must fine-tune the model in 4-bit precision, which is why we’ll use QLoRA here. QLoRA will use a rank of 64 with a scaling parameter of 16. We’ll load the Llama 2 model directly in 4-bit precision using the NF4 type and train. To get more information about the other parameters, check the TrainingArguments, PeftModel, and SFTTrainer documentation.
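
The settings described above could be expressed roughly as follows with the `transformers` and `peft` libraries. This is a sketch, not the exact notebook code: the rank, scaling parameter, NF4 type, and model name come from this post, while the dropout value, compute dtype, and `device_map` are assumptions. The heavy imports sit inside the functions so the file can be read or imported on a machine without a GPU stack installed.

```python
# Hyperparameters stated in this post.
LORA_R = 64       # rank of the update matrices
LORA_ALPHA = 16   # scaling parameter
BASE_MODEL = "NousResearch/Llama-2-7b-hf"

# Assumed value, not given in the post.
LORA_DROPOUT = 0.1


def build_qlora_configs():
    """Return (bnb_config, peft_config) for 4-bit NF4 QLoRA training."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # load weights in 4-bit precision
        bnb_4bit_quant_type="nf4",             # NF4 quantization type
        bnb_4bit_compute_dtype=torch.float16,  # assumed compute dtype
    )
    peft_config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return bnb_config, peft_config


def load_base_model_4bit():
    """Load the base model with the 4-bit quantization config applied."""
    from transformers import AutoModelForCausalLM

    bnb_config, _ = build_qlora_configs()
    return AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        quantization_config=bnb_config,
        device_map="auto",  # assumed placement strategy
    )
```

With a rank of 64 and a scaling parameter of 16, standard LoRA scales the update by 16 / 64 = 0.25 before adding it to the frozen weights.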

In my setup, training ran to completion. Training on the 15k Dolly dataset was relatively quick; see the result below:

{'train_runtime': 8208.2839, 'train_samples_per_second': 5.486, 'train_steps_per_second': 0.686, 'train_loss': 1.2315216935490556, 'epoch': 2.08}
"CPU times: user 1h 10min 14s, sys: 1h 6min 45s, total: 2h 16min 59s"

See notebook here for full code: Notebook-1: Training Model, Saved and Merged into a result model
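
To read the training log above more easily, here is a small sketch that converts the runtime to hours and minutes and estimates how many samples were processed per training step. The numbers are copied from the log; interpreting the ratio as the effective batch size is an inference from the two throughput figures, not a value taken from the notebook.

```python
# Values copied from the training log in this post.
stats = {
    "train_runtime": 8208.2839,
    "train_samples_per_second": 5.486,
    "train_steps_per_second": 0.686,
}

# Convert the runtime in seconds to hours, minutes, and seconds.
hours, rem = divmod(int(stats["train_runtime"]), 3600)
minutes, seconds = divmod(rem, 60)

# Samples processed per training step, i.e. roughly the effective batch size.
samples_per_step = (
    stats["train_samples_per_second"] / stats["train_steps_per_second"]
)

print(f"runtime: {hours} h {minutes} min {seconds} s")
print(f"samples per step: {samples_per_step:.1f}")
```

The runtime works out to a little over two and a quarter hours, which is consistent with the total CPU time reported alongside the log.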

5. Model Testing and Inference

For model loading and testing, I tried out several strategies:

  1. Use LoRA adapter to run test
  2. Use merged model to run test
  3. Use Huggingface pipeline to run test
  4. Use Langchain to run test

See notebook here for full code: Notebook-2: Model Testing
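
As a rough sketch of strategy 3, the helpers below build the inference prompt in the same format used for training and pull the generated instruction out of the model output. The function names and the `model_path` argument are hypothetical; `run_with_pipeline` assumes the merged model has been saved somewhere the Hugging Face `pipeline` can load it from.

```python
INVERSE_INSTRUCTION = (
    "Use the Input below to create an instruction, "
    "which could have been used to generate the input using an LLM."
)


def build_inference_prompt(text: str) -> str:
    """Training template with the Response left empty for the model to fill."""
    return (
        f"### Instruction:\n{INVERSE_INSTRUCTION}\n"
        f"### Input:\n{text}\n"
        f"### Response:\n"
    )


def extract_response(generated: str) -> str:
    """Return the text after the last '### Response:' marker, stripped."""
    return generated.rsplit("### Response:", 1)[-1].strip()


def run_with_pipeline(model_path: str, text: str, max_new_tokens: int = 64) -> str:
    """Generate an instruction for `text` using a text-generation pipeline.

    `model_path` is a hypothetical path or hub id of the merged model.
    """
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_path)
    output = generator(build_inference_prompt(text), max_new_tokens=max_new_tokens)
    # By default the pipeline returns the prompt followed by the completion.
    return extract_response(output[0]["generated_text"])
```

Using the same template at inference time as at training time matters here, because the model has only ever seen the task phrased in this exact format.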


For a quick test, I picked 3 input texts as a user prompt.

text_input1 = "1. Mashing 2. Separation 3. Boiling 4. Fermentation. The ingredients are brought together through these 4 steps."

Expected result: "What are the 4 main steps in making beer?"

text_input2 = "A polygon is a form in Geometry. It is a single dimensional plane made of connecting lines and any number of vertices. It is a closed chain of connected line segments or edges. The vertices of the polygon are formed where two edges meet. Examples of polygons are hexagons, pentagons, and octagons. Any plane that does not contain edges or vertices is not a polygon. An example of a non-polygon is a circle."

Expected result: "What is a polygon?"

text_input3 = "Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs."

Expected result: "What is Delta Lake?"

The results of the test show that the LLM generates the expected text. One interesting observation: when using the Huggingface pipeline to generate the text, text_input2 and text_input3 came back with slightly different answers. I have no good explanation for this behavior yet.


### Huggingface pipeline Response: What is geometry?

### expected result: “What is a polygon?”


### Huggingface pipeline Response: “What are ways you can make your data lake look more like a Delta?”

### expected result: “What is Delta Lake?”
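
One thing worth ruling out, as an untested assumption rather than a confirmed cause, is a difference in decoding settings between the test paths. If one path decodes greedily and another samples from the token distribution, the same model can return different text for the same prompt. The toy sketch below illustrates only that mechanism; the probabilities are made up and no real model is involved.

```python
import random

# Hypothetical probabilities for the continuation after "What is".
next_token_probs = {"a polygon": 0.55, "geometry": 0.35, "a circle": 0.10}


def greedy_pick(probs: dict) -> str:
    """Greedy decoding: always take the most likely continuation."""
    return max(probs, key=probs.get)


def sampled_pick(probs: dict, rng: random.Random) -> str:
    """Sampling: draw a continuation in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
print("greedy: ", greedy_pick(next_token_probs))
print("sampled:", [sampled_pick(next_token_probs, rng) for _ in range(5)])
```

With the Hugging Face pipeline, passing `do_sample=False` when calling the generator requests greedy decoding, which makes it easier to compare outputs across the different loading strategies.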


In short, we successfully fine-tuned the Llama 2 model using a custom dataset (the inverse of Dolly). We applied recent developments in model fine-tuning techniques to enable training on consumer-grade hardware, and covered important considerations related to instruction datasets. Moreover, take some care with the inference framework used for model testing, since it can introduce additional unknowns that may alter the result.

Click a like if you find it useful and helpful!

Thanks for reading… have a nice day!

Acknowledgement and Reference

Low-Rank Adaptation (LoRA)

Training language models to follow instructions with human feedback (InstructGPT)

Using LoRA for Efficient Stable Diffusion Fine-Tuning

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness


