The Rise of CPU Inference: Powerful and Affordable LLMs

Minyang Chen
27 min read · Aug 10, 2024


GPUs are the default choice for LLM inference because of their speed advantage, but I believe we are approaching a wall in terms of cost and efficiency. The aim of this article is to help reduce the cost of LLM inference and to make the case that CPU inference is an efficient, low-cost option for some use cases.

1. The rapid increase in demand for large LLMs requires substantial GPU VRAM, which has become a bottleneck for organizations that want to run their own private instances. For example, serving a recently released state-of-the-art open-source model like Llama 3.1 405B (with GPT-4-class quality) on-premise or in a private cloud is not an easy task and can be very expensive to operate.

According to data released by Meta, the largest model, Llama 3.1 405B, has 405 billion parameters and requires roughly 800 GB of memory to serve in its original BF16 precision. That exceeds the total GPU memory of a single AWS P4 or P5 instance with 8 x 80 GB A100/H100 GPUs (640 GB in total).
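
To make the arithmetic concrete, here is a minimal sketch that reproduces the weights-only estimate above (the `weights_memory_gb` helper is purely illustrative, not from any library, and it ignores the KV cache and activation memory):

```python
# Back-of-the-envelope estimate of the memory needed to hold Llama 3.1 405B
# weights in BF16 (2 bytes per parameter). Weights only; the KV cache and
# activations would need additional memory on top of this.

def weights_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory in GB needed to hold the model weights."""
    return num_params * bytes_per_param / 1e9

llama_405b_gb = weights_memory_gb(405e9)   # ~810 GB in BF16
single_node_vram_gb = 8 * 80               # 8 x 80 GB A100/H100 = 640 GB

print(f"Llama 3.1 405B weights in BF16: ~{llama_405b_gb:.0f} GB")
print(f"Total VRAM on one 8-GPU node:   {single_node_vram_gb} GB")
print(f"Fits on a single node:          {llama_405b_gb <= single_node_vram_gb}")
```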

Clearly, a single fully loaded server equipped with the best GPUs (a maximum of eight) cannot serve the model on its own. The workaround is distributed inference, standing up a second server instance, which effectively doubles the cost.

2. Deploying LLMs on highly parallel CPUs is more cost-effective because CPU cores and system RAM are vastly cheaper than the GPU VRAM needed to run large models. Let's use an example calculation to demonstrate the value of GB per…
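
As a minimal sketch of what such a per-GB comparison could look like (the capacities and prices below are placeholder assumptions for illustration only, not quoted figures; substitute current list prices for a real comparison):

```python
# Illustrative $/GB comparison between server RAM and GPU VRAM.
# The capacities and prices are placeholder assumptions, not quoted figures.

assumed_components = {
    "64 GB DDR5 server DIMM (assumed price)": {"capacity_gb": 64, "price_usd": 300},
    "80 GB H100 GPU (assumed price)":         {"capacity_gb": 80, "price_usd": 30_000},
}

for name, spec in assumed_components.items():
    cost_per_gb = spec["price_usd"] / spec["capacity_gb"]
    print(f"{name}: ~${cost_per_gb:,.2f} per GB of memory")
```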



Written by Minyang Chen

Enthusiastic about AI, Cloud, Big Data and Software Engineering. Sharing insights from my own experiences.
