
Convert a Regular LLM Model into a Full-Fledged DeepSeek R1-Like Reasoning Model

Minyang Chen
13 min read · Feb 12, 2025


By now, we have all heard about the innovative approach of the open-source DeepSeek-R1 model to training reasoning models: it achieves state-of-the-art results at a fraction of the training cost reported for comparable closed-source models.

Figure 1: DeepSeek GRPO

In this article, I will discuss the models, components, and architectures of reinforcement learning (RL). Then, we will delve into GRPO (Group Relative Policy Optimization), the advanced training algorithm DeepSeek used to train its reasoning model R1.
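To make the core idea concrete up front, here is a minimal sketch (my own illustration, not DeepSeek's implementation) of the group-relative advantage that gives GRPO its name: instead of a learned value (critic) network as in PPO, the baseline is the mean reward of a group of completions sampled for the same prompt.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # For G completions sampled from the same prompt, each completion's
    # advantage is its reward standardized against the group's mean and
    # standard deviation; eps guards against a zero-variance group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: scalar rewards for 4 completions of one prompt.
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(group_relative_advantages(rewards))
# ≈ tensor([ 0.78, -1.31, -0.26,  0.78])
```

Completions that score above the group average get a positive advantage and are reinforced; below-average ones are suppressed, with no critic network to train.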

I will use examples to demonstrate how it works. Finally, I will provide a step-by-step guide, using a Jupyter notebook, for converting a regular large language model (LLM) into a DeepSeek-like reasoning model through GRPO fine-tuning. Please note: this is not fine-tuning DeepSeek's R1 distilled models, nor does it use DeepSeek distilled data.
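As a preview of what the notebook builds up to, here is a hedged sketch of one way to run GRPO fine-tuning, using Hugging Face TRL's GRPOTrainer (available in recent TRL releases); the base model, dataset, and reward function below are illustrative assumptions, not necessarily what the guide itself uses.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative dataset: any dataset with a "prompt" column works.
dataset = load_dataset("trl-lib/tldr", split="train")

# Toy reward: encourage completions that wrap reasoning in <think> tags.
# A real run would also reward answer correctness (e.g., on math problems).
def format_reward(completions, **kwargs):
    return [1.0 if "<think>" in c and "</think>" in c else 0.0
            for c in completions]

training_args = GRPOConfig(output_dir="grpo-demo", num_generations=4)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative small base model
    reward_funcs=format_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

The key moving parts are the number of completions sampled per prompt (num_generations, the "group" in GRPO) and the reward function, which replaces the human-preference reward model used in classic RLHF.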

Let’s quickly take a look at reinforcement learning (RL) models, components, and architectures.

Reinforcement Learning (RL)

The goal in reinforcement learning (RL) is to learn an optimal policy that maximizes cumulative reward.
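In standard notation, with discount factor $\gamma \in [0, 1)$ and reward $r_t$ at step $t$, this objective is the expected discounted return over trajectories $\tau$ generated by the policy $\pi$:

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],
\qquad
\pi^{*} = \arg\max_{\pi} J(\pi)
```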

An RL policy is a strategy that an agent uses to decide which action to take in a given state. The type of…
