
Multi-GPU Training for Llama 3.2 using DeepSpeed and the Zero Redundancy Optimizer (ZeRO)

Minyang Chen
15 min read · Oct 1, 2024


For inference tasks, it is preferable to load the entire model, with all of its parameters, onto a single GPU so it can run efficiently. Training, however, requires far more VRAM than inference.

Large language models (LLMs) require substantial GPU VRAM for training for two primary reasons: the model weights themselves and the optimizer states. For instance, the AdamW optimizer in mixed-precision training keeps an fp32 master copy of the weights plus momentum and variance states, about 12 bytes per parameter (4 bytes each) on top of the half-precision weights and gradients.
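To make this concrete, here is a small back-of-the-envelope sketch in Python; the 3-billion-parameter count is only an illustrative assumption (roughly the size of Llama 3.2 3B), and the byte figures follow the usual mixed-precision AdamW accounting rather than any exact measurement:

# Rough per-parameter memory for mixed-precision training with AdamW
params = 3_000_000_000          # assumed ~3B-parameter model (illustrative)

weights_gb = params * 2 / 1e9   # fp16/bf16 weights
grads_gb = params * 2 / 1e9     # fp16/bf16 gradients
optim_gb = params * 12 / 1e9    # fp32 master weights + momentum + variance (4 bytes each)

print(f"weights:   {weights_gb:.0f} GB")
print(f"gradients: {grads_gb:.0f} GB")
print(f"optimizer: {optim_gb:.0f} GB")
print(f"total (before activations): {weights_gb + grads_gb + optim_gb:.0f} GB")

Even before counting activations, that is roughly 48 GB for a 3B model, which is exactly the kind of footprint ZeRO is designed to shard across GPUs.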

When your model can't fit on one GPU for training, consider distributed training strategies: data parallelism with sharding (as in ZeRO, which partitions optimizer states, gradients, and parameters across GPUs or nodes), or model parallelism, which splits the model itself across devices by layers (pipeline parallelism) or tensors (tensor parallelism).

In this article, I will cover my experiments with distributed training using DeepSpeed and ZeRO stages 1, 2, and 3 on a single node with 2 GPUs, each with 16 GB of VRAM.

You can find my code here: distributed_train_finetune.
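For orientation, this is roughly what a minimal ZeRO stage 3 configuration looks like when passed to the HuggingFace Trainer as a Python dict; it is a sketch using standard DeepSpeed config keys, not necessarily the exact settings in my repository:

# Minimal DeepSpeed ZeRO-3 config; "auto" lets the HF Trainer fill values
# in from its own TrainingArguments.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,                              # 1: shard optimizer states, 2: + gradients, 3: + parameters
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload to save VRAM
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# TrainingArguments(..., deepspeed=ds_config) wires this into the Trainer,
# and the job is launched on both GPUs with the deepspeed or torchrun launcher.

Dropping the stage down to 2 or 1 shards progressively less state, saving less memory but also reducing communication overhead.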

Before jumping into the training process, let's determine how much memory a model requires for training. Luckily, the Hugging Face team provides an easy-to-use Python snippet to calculate memory requirements, shown below:

Training Model Memory Requirements

## pip install transformers (if required)

python -c 'from transformers…
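The one-liner above is cut off in the preview, but the memory-estimation helper described in the HuggingFace DeepSpeed integration docs looks roughly like the following; the Llama 3.2 model id and the 2-GPU/1-node numbers are assumptions matching this article's setup, so substitute your own checkpoint:

python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("meta-llama/Llama-3.2-3B"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=2, num_nodes=1)'

This prints the estimated per-GPU and per-CPU memory needed for the model states under ZeRO-3, with and without offloading, before any training is launched.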



Written by Minyang Chen

Enthusiastic about AI, Cloud, Big Data, and Software Engineering. Sharing insights from my own experiences.
