One year later… ChatGPT vs Open source LLMs

Minyang Chen
8 min read · Nov 8, 2023


Since ChatGPT launched on November 30, 2022, the race has been on to develop an open source model that can match the ChatGPT experience…

Almost a year later, have community-developed open source models caught up to ChatGPT in terms of user experience?

People like ChatGPT because it enables users to refine and steer a conversation towards a desired length, format, style, level of detail, and language. Now that many enterprises are considering integrating their processes, functions, and workflows with LLMs, they are looking for the right way to approach them.

What are the considerations and concerns of using ChatGPT? ChatGPT was trained largely on public data from the internet, making it susceptible to risks such as privacy risks, bias, customer data concerns, lack of trust, and limited observability.

How can we tell whether open source models are up to the task? One simple way to answer this question is to look at published benchmarks and user experience in these three areas:

  1. Model Capability
  2. Chatbot User experience
  3. API experience for building AI applications

Model Capability

Generally speaking, larger models tend to perform better than smaller models; even a 3-bit quantized 70B model can still outperform a regular 13B model in some cases. However, the cost of getting larger models up and running can be quite high, making them difficult to operate. Therefore, the preference is to create smaller LLMs that achieve the same results as larger models while incurring the lowest operating cost and running faster on consumer-grade hardware.
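To see why this matters, here is a rough back-of-envelope estimate (my own illustrative arithmetic, not a published benchmark) of the memory needed just to hold the model weights at different quantization levels:

# Rough memory estimate for model weights alone; real usage adds
# KV cache, activations, and runtime overhead on top of this.
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

print(f"70B @ 16-bit: {weight_memory_gb(70, 16):.0f} GB")  # ~140 GB
print(f"70B @ 3-bit:  {weight_memory_gb(70, 3):.1f} GB")   # ~26 GB
print(f"13B @ 16-bit: {weight_memory_gb(13, 16):.0f} GB")  # ~26 GB
print(f"7B @ 5-bit:   {weight_memory_gb(7, 5):.1f} GB")    # ~4.4 GB, laptop territory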

Here is some evidence that 7B models are getting smarter and faster…

OpenChat claims to be “The first 7B model that Achieves Comparable Results with ChatGPT (March)!”

OpenChat Metrics (source: https://github.com/imoneoi/openchat/blob/master/README.md)

Zephyr claims to be the highest-ranked 7B chat model on the MT-Bench and AlpacaEval benchmarks.

Zephyr Metrics (source: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)

Mistral-7B claims to outperform Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation.

source: https://mistral.ai/news/announcing-mistral-7b/

Chatbot Experience

In the open source space there are many great products that let us run LLMs locally on our desktops, packaged as consumer desktop applications that are incredibly easy to use. Here are a few products I have tried; they all work well.

LM Studio is an easy-to-use desktop app for experimenting with local and open-source Large Language Models (LLMs). It works out of the box, running quantized models at speed on a laptop with 16 GB of memory, with or without a GPU, and includes a built-in API server (based on llama.cpp). One limitation is that you can run it in server mode or in chat mode, but not both at once.

source: https://lmstudio.ai/
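As a quick sanity check, here is a minimal sketch of calling LM Studio's built-in API server from Python, assuming the server has been started inside LM Studio and listens on its default port 1234:

import requests

# Query LM Studio's OpenAI-compatible chat endpoint; the currently
# loaded model answers, so no model name is strictly required.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])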

ollama — Get up and running with large language models, locally. It deploys in a client and server mode, so it doesn’t have LM Studio’s limitation. It works as expected, running quantized models at speed on a laptop with 16 GB of memory, with or without a GPU.

source: https://github.com/jmorganca/ollama
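ollama exposes its own small REST API on port 11434. A minimal sketch, assuming you have already pulled a model (e.g. with `ollama pull mistral`):

import requests

# Call ollama's native generate endpoint; stream=False returns one JSON blob.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
print(resp.json()["response"])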

H2OGPT — Private Q&A and summarization of documents+images, or chat with a local GPT; 100% private, Apache 2.0. Supports LLaMa2, llama.cpp, and more. It’s slower than ollama and LM Studio when running on a laptop.

source: https://github.com/h2oai/h2ogpt

text-generation-webui — A Swiss army Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), and Llama models. This has almost become the default tool for testing new models, due to its flexible configuration and long list of supported quantization methods.

source: https://github.com/oobabooga/text-generation-webui

Chatbot UI — An open source ChatGPT UI. This is a UI-only product; however, it gives us the best ChatGPT-like user experience.

source: https://github.com/mckaywrigley/chatbot-ui

API Experience

The open source community provides a wide range of libraries and open source servers for inference. Many of them offer the OpenAI chat completion and embedding API calls and act as a drop-in replacement for OpenAI calls. This means that if you are already using the OpenAI API, you can easily switch to these libraries without having to make any code changes. However, the OpenAI spec has a lot more functionality than open source offers: https://platform.openai.com/docs/api-reference/chat/object

Here are some model serving frameworks that offer an OpenAI-compatible API.

LiteLLM — a library that attempts to unify 100+ LLMs behind the same input and output format for basic usage, making it a great fit for solutions that use many LLMs; its core offering is an OpenAI proxy.

For example, you can point the client at your own LLM server’s API:

import openai

# Point the (pre-1.0) openai client at a local OpenAI-compatible proxy
openai.api_base = "http://0.0.0.0:8000"
print(openai.ChatCompletion.create(
    model="test",
    messages=[{"role": "user", "content": "Hey!"}],
))
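You can also call LiteLLM as a library: the same completion() call shape works across providers, with only the model string changing. A sketch based on LiteLLM’s documented usage (the model names here are just examples):

from litellm import completion

messages = [{"role": "user", "content": "Hey!"}]

# Hosted model via OpenAI (needs OPENAI_API_KEY in the environment)
openai_reply = completion(model="gpt-3.5-turbo", messages=messages)

# Local model via ollama, same call shape
local_reply = completion(model="ollama/mistral", messages=messages,
                         api_base="http://localhost:11434")
print(local_reply["choices"][0]["message"]["content"])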

FastChat — A distributed multi-model serving system with web UI and OpenAI-compatible RESTful APIs.

# Launch the controller
python3 -m fastchat.serve.controller
# launch a model worker
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
# Launch the Gradio web server ( chatUI)
python3 -m fastchat.serve.gradio_web_server
# Finally, launch the RESTful API server
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
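Once the RESTful API server is up, you can talk to it with the standard openai client by pointing api_base at the FastChat server (pre-1.0 openai interface, matching the LiteLLM example above):

import openai

openai.api_key = "EMPTY"  # FastChat does not validate the key
openai.api_base = "http://localhost:8000/v1"

completion = openai.ChatCompletion.create(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
)
print(completion.choices[0].message.content)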

vLLM — Easy, fast, and cheap LLM serving with PagedAttention. To use vLLM for online serving, you can start an OpenAI API-compatible server.

python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3
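vLLM also has an offline Python API if you don’t need a server; a minimal sketch with the same model:

from vllm import LLM, SamplingParams

# Load the model once, then batch-generate without any HTTP server.
llm = LLM(model="lmsys/vicuna-7b-v1.3")
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)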

Llama.cpp — Port of Facebook’s LLaMA model in C/C++; the backend for ollama, LM Studio, and many other products that support GGUF.

# start the base server
./server -m models/7B/ggml-model.gguf -c 2048
# start the openai api server point back to the base server
python api_like_OAI.py --llama-api http://127.0.0.1:8080
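Alternatively, the same GGUF file can be loaded directly through the llama-cpp-python bindings, skipping the HTTP layer entirely; a minimal sketch:

from llama_cpp import Llama

# n_gpu_layers offloads layers to the GPU; set it to 0 for CPU-only machines.
llm = Llama(model_path="models/7B/ggml-model.gguf", n_ctx=2048, n_gpu_layers=35)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])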

My setup for a ChatGPT-like experience locally

The best way to gain experience is to try it yourself. For my own testing, I use the following setup:

  1. Select a model server to test (I picked OpenChat, Zephyr, and Mistral for their 7B capability).
  2. Start the Chatbot UI (a modified version that supports GGUF and Mistral).
  3. Launch the browser pointing to: http://localhost:3000

See the repo below for the full source code: chatgpt_like_experience_locally

Model Serving

For model serving, if you have a GPU with 24 GB of VRAM, I recommend trying a full deployment of the OpenChat-3.5 model; it’s incredibly fast, with accuracy close to the ChatGPT-3.5 (March) release, according to the author.

# create conda environment
conda create -n openchat python=3.11
conda activate openchat

# install
pip3 install torch torchvision torchaudio
pip3 install ochat
pip3 install openai

# run_openchat_full_server.sh
echo "run openchat server"
python -m ochat.serving.openai_api_server --model openchat/openchat_3.5 --engine-use-ray --worker-use-ray
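A quick smoke test against the OpenChat server (it listens on port 18888 by default, which is also the port the Chatbot UI config below points at):

import openai

openai.api_key = "EMPTY"  # the server does not validate the key
openai.api_base = "http://localhost:18888/v1"

resp = openai.ChatCompletion.create(
    model="openchat_3.5",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)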

Otherwise, adjust to your hardware using GGUF-based serving:

#!/bin/bash
# create environment
virtualenv venv --python=3.10
source venv/bin/activate

# Install Python Libraries with Nvidia GPU support
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DBUILD_SHARED_LIBS=ON" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DBUILD_SHARED_LIBS=ON" FORCE_CMAKE=1 pip install llama-cpp-python[server] --force-reinstall --upgrade --no-cache-dir

pip3 install torch torchvision torchaudio
pip3 install openai

echo "serving [openchat 3.5]"

export MODEL_FILE="./models/openchat_3.5.Q5_K_M.gguf"
export MODEL_ID="TheBloke/openchat_3.5.Q5_K_M.GGUF"
export OFFLOAD_GPU_LAYERS=35
export HOST=0.0.0.0
export PORT=8000
#export CHAT_FORMAT="llama-2"
export CHAT_FORMAT="vicuna"
export CONTEXT_SIZE=4096

## run server
python3 -m llama_cpp.server \
--n_gpu_layers $OFFLOAD_GPU_LAYERS \
--model $MODEL_FILE \
--model_alias $MODEL_ID \
--chat_format $CHAT_FORMAT \
--n_ctx $CONTEXT_SIZE \
--host $HOST \
--port $PORT \
--seed 123

Similarly, you can run Zephyr or Mistral:


#!/bin/bash
# create environment
virtualenv venv --python=3.10
source venv/bin/activate

# Install Python Libraries with Nvidia GPU support
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DBUILD_SHARED_LIBS=ON" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
CMAKE_ARGS="-DLLAMA_CUBLAS=on -DBUILD_SHARED_LIBS=ON" FORCE_CMAKE=1 pip install llama-cpp-python[server] --force-reinstall --upgrade --no-cache-dir

pip3 install torch torchvision torchaudio
pip3 install openai

echo "serving [Zerphyr] - multilingual model"

export MODEL_FILE="./models/zephyr-7b-beta.Q5_K_M.gguf"
export MODEL_ID="TheBloke/zephyr-7B-beta.Q5_K_M.gguf"
export OFFLOAD_GPU_LAYERS=35
export HOST=0.0.0.0
export PORT=8000
export CHAT_FORMAT="chatml"
export CONTEXT_SIZE=4096

## run server
python3 -m llama_cpp.server \
--n_gpu_layers $OFFLOAD_GPU_LAYERS \
--model $MODEL_FILE \
--model_alias $MODEL_ID \
--chat_format $CHAT_FORMAT \
--n_ctx $CONTEXT_SIZE \
--host $HOST \
--port $PORT \
--seed 123

Chatbot UI

The Chatbot UI is a Node.js application and requires Node.js to be installed.

echo "GGUF Chatbot UI"

export NEXT_PUBLIC_DEFAULT_SYSTEM_PROMPT="You are ChatGPT, a large language model trained by OpenAI. Follow the user's instructions carefully. Respond using markdown"
export NEXT_PUBLIC_DEFAULT_TEMPERATURE=0.5
export DEFAULT_MODEL=gpt-3.5-turbo
export OPENAI_API_KEY=EMPTY
export OPENAI_API_TYPE=openai

## openchat vllm server
#export OPENAI_API_HOST=http://localhost:18888
## llama_cpp_python GGUF file server
export OPENAI_API_HOST=http://127.0.0.1:8000
cd gguf-chatbot-ui
npm install
npm run dev

I really like the Chatbot UI; it feels well thought out, just like ChatGPT, except it has no multi-modal support yet. It is a very good interface for conversation: it allows you to save conversations and a set of prompts, making it easy to continue where you left off. See the screenshots below using the Zephyr model in a single GGUF file.

Chatbot UI (GGUF version)
Zephyr: an incredible multilingual model
Coding experience

How to pick a good open source LLM model?

It depends on your use case. For more details, see the nice work by Troyanovsky [Local-LLM-Comparison-Colab-UI] comparing the performance of different LLMs that can be deployed locally on consumer hardware.

Try it yourself in Google Colab; it’s free, including a GPU (T4, 16 GB VRAM).

Summary

I believe the gap between ChatGPT and open source LLMs is becoming narrower, and in some use cases building an AI application with an open source LLM is the better choice. However, ChatGPT is a unified product with more advanced capabilities added every day, such as API-based fine-tuning, better function call implementation, and visual understanding. So the race still goes on, and innovation happens every day.

Thanks again for reading.

I hope you find something useful and learn something new like I do.

Have a nice day!
