
Convert a Regular LLM Model into a Full-Fledged DeepSeek R1-Like Reasoning Model

Minyang Chen
13 min read · Feb 12, 2025


By now, we have all heard about the innovative approach of the open-source DeepSeek-R1 model to training reasoning models: it achieves state-of-the-art results at a fraction of the training cost reported for comparable closed-source models.

Figure 1: DeepSeek GRPO

In this article, I will discuss the models, components, and architectures of reinforcement learning (RL). Then, we will delve into GRPO (Group Relative Policy Optimization), the advanced training algorithm DeepSeek used to train its reasoning model R1.
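To make the core idea concrete up front, here is a minimal sketch (my own illustration, not DeepSeek's implementation) of the group-relative advantage that gives GRPO its name: instead of a learned value (critic) network as in PPO, the baseline is the mean reward of a group of completions sampled for the same prompt.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # For G completions sampled from the same prompt, each completion's
    # advantage is its reward standardized against the group's mean and
    # standard deviation; eps guards against a zero-variance group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: scalar rewards for 4 completions of one prompt.
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(group_relative_advantages(rewards))
# ≈ tensor([ 0.78, -1.31, -0.26,  0.78])
```

Completions that score above the group average get a positive advantage and are reinforced; below-average ones are suppressed, with no critic network to train.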

I will use examples to demonstrate how it works. Finally, I will provide a step-by-step guide, using a Jupyter notebook, for converting a regular large language model (LLM) into a DeepSeek-like reasoning model through GRPO fine-tuning. Please note: this is not fine-tuning DeepSeek's R1 distilled models, nor does it use DeepSeek distilled data.
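As a preview of what the notebook builds up to, here is a hedged sketch of one way to run GRPO fine-tuning, using Hugging Face TRL's GRPOTrainer (available in recent TRL releases); the base model, dataset, and reward function below are illustrative assumptions, not necessarily what the guide itself uses.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative dataset: any dataset with a "prompt" column works.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward: encourage completions that wrap reasoning in <think> tags.
# A real run would also reward answer correctness (e.g., on math problems).
def format_reward(completions, **kwargs):
    return [1.0 if "<think>" in c and "</think>" in c else 0.0
            for c in completions]

training_args = GRPOConfig(output_dir="grpo-demo", num_generations=4)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative small base model
    reward_funcs=format_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

The key moving parts are the number of completions sampled per prompt (num_generations, the "group" in GRPO) and the reward function, which replaces the human-preference reward model used in classic RLHF.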

Let’s quickly take a look at reinforcement learning (RL) models, components, and architectures.

Reinforcement Learning (RL)

The goal in reinforcement learning (RL) is to learn an optimal policy that maximizes cumulative reward.
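In standard notation, with discount factor $\gamma \in [0, 1)$ and reward $r_t$ at step $t$, this objective is the expected discounted return over trajectories $\tau$ generated by the policy $\pi$:

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],
\qquad
\pi^{*} = \arg\max_{\pi} J(\pi)
```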

An RL policy is a strategy that an agent uses to decide which action to take in a given state. The type of…
