Practical Techniques to Constrain LLM Output to JSON Format

Minyang Chen
13 min read · Oct 28, 2023


JSON is one of the most widely used data interchange formats, and its interoperability makes it a natural choice for exchanging structured data. When building an AI-powered app, engineers inevitably need to integrate the output of a Large Language Model (LLM) into their codebase.

By instructing the LLM to follow a specific syntax or schema and emit exactly the structure the application needs, we can make the application's behavior much more predictable.


Why is it so hard to get an LLM to output JSON?

Language models excel at predicting the next token and generating free-form text, but they struggle to produce precisely structured output because they do not always follow instructions exactly.

For example, with the OpenAI API I want gpt-3.5-turbo to always respond in the following form:
(message_type) {message_content}
However, sometimes it responds with message_type: message_content.
Or message_type: "message_content".
Or Author (message_type): "message_content".
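
To see how brittle this is on the application side, here is a minimal sketch of the kind of defensive parsing you end up writing to cover those variations (my own illustration; the pattern list is hypothetical and would grow with every new format the model invents):

```
import re

# Hypothetical fallback patterns, one per reply shape observed above.
PATTERNS = [
    re.compile(r'^\((?P<type>\w+)\)\s*\{(?P<content>.*)\}$'),         # (message_type) {message_content}
    re.compile(r'^\w+\s*\((?P<type>\w+)\):\s*"?(?P<content>.*?)"?$'), # Author (message_type): "message_content"
    re.compile(r'^(?P<type>\w+):\s*"?(?P<content>.*?)"?$'),           # message_type: message_content
]

def parse_reply(reply: str) -> tuple[str, str]:
    reply = reply.strip()
    for pattern in PATTERNS:
        match = pattern.match(reply)
        if match:
            return match.group("type"), match.group("content").strip()
    raise ValueError(f"unrecognized reply format: {reply!r}")
```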

Prompt engineering approach

“Please provide the response in the form of a Python list. It should begin with “[“ and end with “]”.”

ChatGPT (GPT-4) supports system/user prompts (via the GPT-4 API) that ask it to format data as CSV, and this usually works flawlessly. While GPT-4 is good for prototyping a demo, it is quite pricey, so a local solution would be ideal.

There are many prompt-engineering frameworks that restrict output to JSON format; here is one example: A Strict JSON Framework for LLM Outputs.

## simple example provided by the author
res = strict_output(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful day',
                    output_format = {"Sentiment": "Type of Sentiment",
                                     "Tense": "Type of Tense"})
print(res)
## output
{'Sentiment': 'Positive', 'Tense': 'Present'}
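
Under the hood, frameworks like this typically combine a formatting instruction with validation and re-prompting. Here is a minimal sketch of that idea (my own illustration, not the actual strictjson implementation; `llm_call` stands in for any function that sends a prompt and returns the model's reply):

```
import json

def ask_for_json(llm_call, user_prompt: str, required_keys: list[str], max_retries: int = 3) -> dict:
    prompt = (
        f"{user_prompt}\n"
        f"Respond ONLY with a JSON object containing the keys: {', '.join(required_keys)}."
    )
    for _ in range(max_retries):
        reply = llm_call(prompt)
        try:
            data = json.loads(reply)
            if all(key in data for key in required_keys):
                return data
        except json.JSONDecodeError:
            pass
        # Feed the invalid reply back and ask the model to correct itself.
        prompt = f"Your previous reply was not valid JSON:\n{reply}\nTry again. {user_prompt}"
    raise ValueError("model never produced valid JSON")
```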

While prompt engineering can be effective for certain use cases, it has a limitation: any internal change to the LLM can lead to unexpected outputs. This has caused issues in production environments, as seen in stories online where AI apps relying on ChatGPT's API broke after silent background updates.

Approaches to Constraining LLM Output

There has been a significant amount of innovative work in this area, and I have had the opportunity to explore three frameworks that all address the issue from different perspectives. I am impressed by how each framework arrives at similar results, despite using distinct approaches.

A. GRAMMAR — use grammar rules to constrain model output; for example, you can force the model to output JSON only.

B. KOR — This is a half-baked prototype that “helps” you extract structured data from text using LLMs

C. LM-Format-Enforcer — Enforce the output format (JSON Schema, Regex etc) of a language model

D. Finetune LLM model — teach the model to output JSON based on the input data

A. Use Grammar Rules to force the model to output JSON only

In this approach, you run the model with llama.cpp and create a grammar file. GBNF (GGML BNF) is a format for defining formal grammars to constrain model outputs in llama.cpp.

Here's a simple grammar file I created for a basic test:

root ::= answer
answer ::= "{" ws "\"id\":" ws number "," ws "\"name\":" ws string "}"
answerlist ::= "[]" | "[" ws answer ("," ws answer)* "]"
string ::= "\"" ([^"]*) "\""
boolean ::= "true" | "false"
ws ::= [ \t\n]*
number ::= [0-9]+ "."? [0-9]*
stringlist ::= "[" ws "]" | "[" ws string ("," ws string)* ws "]"
numberlist ::= "[" ws "]" | "[" ws string ("," ws number)* ws "]"

The grammar syntax is harder to read, so it is easier to start from a schema definition like the one below.

interface answer {
  id: number;
  name: string;
}

Next, paste the schema into this online tool (https://grammar.intrinsiclabs.ai/) to generate the grammar file automatically, which saves a lot of headache.

Now we have a grammar file and are ready to plug it into llama.cpp. See the llama.cpp repo for details on setting it up to run locally on your machine.

## start with a prompt
./main -m ./models/Mistral-7B-Instruct-v0.1-Q8.gguf -n 256 --grammar-file grammars/answer.gbnf -p 'Q: Name the planets in the solar system? A:'
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 64.00 MB
llama_new_context_with_model: compute buffer total size = 79.13 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 73.00 MB (model: 0.00 MB, context: 73.00 MB)

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0

## response
Q: Name the planets in the solar system? A:{ "id": 1, "name": "Mercury"} [end of text]

llama_print_timings: load time = 845.86 ms
llama_print_timings: sample time = 157.01 ms / 16 runs ( 9.81 ms per token, 101.91 tokens per second)
llama_print_timings: prompt eval time = 649.35 ms / 13 tokens ( 49.95 ms per token, 20.02 tokens per second)
llama_print_timings: eval time = 3280.48 ms / 15 runs ( 218.70 ms per token, 4.57 tokens per second)
llama_print_timings: total time = 4104.05 ms
Log end

Voila! The result is exactly and only JSON: { "id": 1, "name": "Mercury"}

So the grammar is flexible enough to describe complex objects. Here's my second attempt: a receipt schema and its grammar file.

## Receipt Type Definitions using TypeScript.
```
interface RestaurantReceipt {
  restaurant: Restaurant;
  customer: Customer;
  order_date: string;
  total_price: number;
  tax_rate: number;
  tax_amount: number;
  discount_code: string;
  payment_method: string;
  card_type: string;
  card_number: string;
  expiration_month: number;
  expiration_year: number;
  cvv: string;
  shipping_address: string;
  items: Item[];
}

interface Restaurant {
  name: string;
  location: Location;
  year: number;
  phone_number: string;
  email: string;
}

interface Customer {
  first_name: string;
  last_name: string;
  email: string;
  phone_number: string;
}

interface Location {
  address: string;
  city: string;
  state: string;
  country: string;
}

interface Item {
  item_name: string;
  quantity: number;
  unit_price: number;
  description: string;
  item_total: number;
}
```

The receipt grammar file:

## Generated Grammar used during LLMs generation.

```
root ::= RestaurantReceipt
Item ::= "{" ws "\"item_name\":" ws string "," ws "\"quantity\":" ws number "," ws "\"unit_price\":" ws number "," ws "\"description\":" ws string "," ws "\"item_total\":" ws number "}"
Itemlist ::= "[]" | "[" ws Item ("," ws Item)* "]"
Location ::= "{" ws "\"address\":" ws string "," ws "\"city\":" ws string "," ws "\"state\":" ws string "," ws "\"country\":" ws string "}"
Locationlist ::= "[]" | "[" ws Location ("," ws Location)* "]"
Customer ::= "{" ws "\"first_name\":" ws string "," ws "\"last_name\":" ws string "," ws "\"email\":" ws string "," ws "\"phone_number\":" ws string "}"
Customerlist ::= "[]" | "[" ws Customer ("," ws Customer)* "]"
Restaurant ::= "{" ws "\"name\":" ws string "," ws "\"location\":" ws Location "," ws "\"year\":" ws number "," ws "\"phone_number\":" ws string "," ws "\"email\":" ws string "}"
Restaurantlist ::= "[]" | "[" ws Restaurant ("," ws Restaurant)* "]"
RestaurantReceipt ::= "{" ws "\"restaurant\":" ws Restaurant "," ws "\"customer\":" ws Customer "," ws "\"order_date\":" ws string "," ws "\"total_price\":" ws number "," ws "\"tax_rate\":" ws number "," ws "\"tax_amount\":" ws number "," ws "\"discount_code\":" ws string "," ws "\"payment_method\":" ws string "," ws "\"card_type\":" ws string "," ws "\"card_number\":" ws string "," ws "\"expiration_month\":" ws number "," ws "\"expiration_year\":" ws number "," ws "\"cvv\":" ws string "," ws "\"shipping_address\":" ws string "," ws "\"items\":" ws Itemlist "}"
RestaurantReceiptlist ::= "[]" | "[" ws RestaurantReceipt ("," ws RestaurantReceipt)* "]"
string ::= "\"" ([^"]*) "\""
boolean ::= "true" | "false"
ws ::= [ \t\n]*
number ::= [0-9]+ "."? [0-9]*
stringlist ::= "[" ws "]" | "[" ws string ("," ws string)* ws "]"
numberlist ::= "[" ws "]" | "[" ws string ("," ws number)* ws "]"
```

Then run it with llama.cpp:


## Constrained output with grammars
## llama.cpp supports grammars to constrain model output. For example, you can force the model to output JSON only:

./main -m ./models/Mistral-7B-Instruct-v0.1-Q8.gguf -n 256 --grammar-file grammars/json.gbnf -p 'give me a sample receipt:'

outputs

...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 64.00 MB
llama_new_context_with_model: compute buffer total size = 79.13 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 73.00 MB (model: 0.00 MB, context: 73.00 MB)

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0


give me a sample receipt:{"receiptNumber":"12345","customerName":"John Smith","date":
"2021-01-01 10:30:00.000000",
"items": [
{
"itemId": "1",
"productId": "ABC123",
"quantity": 1,
"unitPrice": 19.99
},
{
"itemId": "2",
"productId": "DEF456",
"quantity": 2,
"unitPrice": 29.99
}
],
"subTotal": 59.98,
"taxAmount": 2.37,
"total": 62.35
} [end of text]

llama_print_timings: load time = 842.78 ms
llama_print_timings: sample time = 2477.51 ms / 177 runs ( 14.00 ms per token, 71.44 tokens per second)
llama_print_timings: prompt eval time = 509.36 ms / 9 tokens ( 56.60 ms per token, 17.67 tokens per second)
llama_print_timings: eval time = 38122.00 ms / 176 runs ( 216.60 ms per token, 4.62 tokens per second)
llama_print_timings: total time = 41331.49 ms
Log end

So far, the grammar reliably forces the model to always generate JSON output, which looks like a promising solution. See my repo for the schema and grammar file I created for this test.
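
If you would rather stay in Python than call the `./main` CLI, llama-cpp-python exposes the same grammar mechanism. A minimal sketch, assuming the model and grammar files created above are on disk:

```
from llama_cpp import Llama, LlamaGrammar

# Paths assume the files used earlier in this post.
llm = Llama(model_path="./models/Mistral-7B-Instruct-v0.1-Q8.gguf", n_ctx=512)
grammar = LlamaGrammar.from_file("grammars/answer.gbnf")

output = llm(
    "Q: Name the planets in the solar system? A:",
    max_tokens=256,
    grammar=grammar,  # sampling is restricted to tokens the grammar allows
)
print(output["choices"][0]["text"])  # e.g. { "id": 1, "name": "Mercury"}
```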

B. KOR — structured data from text using LLMs

Some ideas of things that could be done with Kor:

  • Extract data from text that matches an extraction schema.
  • Power an AI assistant with skills by precisely understanding a user request.
  • Provide natural language access to an existing API.

See my repo for the test notebook I created for this test.

For this test I will use an open-source Llama-2 model, since we all love saving the cost of the ChatGPT API.

## download LLM model
from huggingface_hub import hf_hub_download
downloaded_model_path = hf_hub_download(repo_id="TheBloke/Llama-2-7b-Chat-GGUF", filename="llama-2-7b-chat.Q5_K_M.gguf")
from langchain.llms  import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from kor.extraction import create_extraction_chain

# get model chain
llm = LlamaCpp(model_path=downloaded_model_path,temperature=0.8,verbose=True,echo=True,n_ctx=512)

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
"""
def get_prompt(message: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{message} [/INST]'

Example-1: Schema and Chain — Output a Single JSON Object

#from langchain.chat_models import ChatOpenAI
from kor import create_extraction_chain, Object, Text
from kor.nodes import Object, Text, Number

schema = Object(
    id="player",
    description=(
        "User is controlling a music player to select songs, pause or start them or play"
        " music by a particular artist."
    ),
    attributes=[
        Text(
            id="song",
            description="User wants to play this song",
            examples=[],
            many=True,
        ),
        Text(
            id="album",
            description="User wants to play this album",
            examples=[],
            many=True,
        ),
        Text(
            id="artist",
            description="Music by the given artist",
            examples=[("Songs by paul simon", "paul simon")],
            many=True,
        ),
        Text(
            id="action",
            description="Action to take one of: `play`, `stop`, `next`, `previous`.",
            examples=[
                ("Please stop the music", "stop"),
                ("play something", "play"),
                ("play a song", "play"),
                ("next song", "next"),
            ],
        ),
    ],
    many=False,
)
## chain
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')
chain.run("play songs by paul simon and led zeppelin and the doors")['data']

## result
{'player': {'artist': ['paul simon', 'led zeppelin', 'the doors']}}

The result looks good and matches the schema definition for a single object. KOR also supports a pydantic schema definition, which is more popular. Here's a second example that produces a list of JSON objects.

Example-2: Pydantic Schema — Output a List of JSON Objects

from kor import from_pydantic
from typing import List, Optional
from pydantic import BaseModel, Field

## schema
class PlanetSchema(BaseModel):
    planet_name: str = Field(description="The name of the planet")

class PlanetList(BaseModel):
    planets: List[PlanetSchema]

schema, validator = from_pydantic(
    PlanetSchema,
    description="Planet Information",
    many=True,  # <-- Note Many = True
)

chain = create_extraction_chain(llm, schema, validator=validator)

result = chain.run(("list planets in our solar system."))
result

## output
{'data': {'planetschema': []},
'raw': '\n"planetname|name|\nMercury|4|244|0.387|\nVenus|10|210|0.936|\nEarth|5|127|1.000|\nMars|2|210|0.181|\nJupiter|15|890|4.35|\nSaturn|6|720|0.550|\nUranus|7|510|0.750|\nNeptune|8|490|1.778|"',
'errors': [],
'validated_data': []}

Hmm, the result didn't match what I expected for a list of JSON objects and needs more investigation, even though the raw output does contain the correct values.

C. LM-Format-Enforcer — Enforce the output format (JSON Schema, Regex, etc.) of a language model. This is a newer framework and looks very promising, possibly the best of the three. According to the documentation, the framework manipulates the token probabilities during generation so that the output follows the schema design.
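
To make that mechanism concrete, here is a conceptual sketch (my own illustration, not lm-format-enforcer's actual code) of what a logits processor does on each decoding step: it receives the tokens generated so far plus the raw scores for the next token, and masks out every token that would break the format.

```
import numpy as np

def constrained_logits_processor(input_ids: np.ndarray, scores: np.ndarray, allowed_token_ids: list[int]) -> np.ndarray:
    # Hypothetical: `allowed_token_ids` would come from a parser that tracks
    # which characters can legally appear next according to the JSON schema.
    mask = np.full_like(scores, -np.inf)
    mask[allowed_token_ids] = 0.0
    return scores + mask  # disallowed tokens end up with ~zero probability after softmax
```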

See my repo for the notebook I created for this test. As in the KOR test, I will continue using the open-source Llama-2 model, since it is supported by the framework.

## setup LLM model
from llama_cpp import Llama
from huggingface_hub import hf_hub_download
downloaded_model_path = hf_hub_download(repo_id="TheBloke/Llama-2-7b-Chat-GGUF", filename="llama-2-7b-chat.Q5_K_M.gguf")
llm = Llama(model_path=downloaded_model_path)


DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
"""
def get_prompt(message: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{message} [/INST]'

Because it manipulates token logits, the enforcer is tightly coupled to an LLM inference framework. For llama.cpp, it needs a LogitsProcessor; see the code below:

## LM Format Enforcer Logits Processor
from typing import Optional
from llama_cpp import LogitsProcessorList
from lmformatenforcer import CharacterLevelParser
from lmformatenforcer.integrations.llamacpp import build_llamacpp_logits_processor
from lmformatenforcer import JsonSchemaParser
from pydantic import BaseModel
from typing import List
from IPython.display import display, Markdown

def display_header(text):
    display(Markdown(f'**{text}**'))

def display_content(text):
    display(Markdown(f'```\n{text}\n```'))

def llamacpp_with_character_level_parser(llm: Llama, prompt: str, character_level_parser: Optional[CharacterLevelParser]) -> str:
    logits_processors: Optional[LogitsProcessorList] = None
    if character_level_parser:
        logits_processors = LogitsProcessorList([build_llamacpp_logits_processor(llm, character_level_parser)])

    output = llm(prompt, logits_processor=logits_processors)
    text: str = output['choices'][0]['text']
    return text

Now we are ready to run a simple test that returns a single JSON object.

class PlayerSchema(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int

question = 'Please give me information about Michael Jordan. You MUST answer using the following json schema: '
question_with_schema = f'{question}{PlayerSchema.schema_json()}'
prompt = get_prompt(question_with_schema)

display_header("Standard LLM Output:")
result = llamacpp_with_character_level_parser(llm, prompt, None)
display_content(result)
## result 
Of course! I'd be happy to provide information about Michael Jordan using the provided JSON schema.
{
"first_name": "Michael",
"last_name": "Jordan",
"year_of_birth": 1963,
"num_seasons_in_nba": 15
}
I hope this helps! Let me know if you have any other questions.

So the result is not bad: it contains a JSON object. However, for an application to consume this output, additional parsing work is still required to strip the unwanted text. That is exactly what this framework solves: it keeps the unwanted text out of the output and returns just a JSON object.

display_header("LLM Output with json schema enforcing:")
result = llamacpp_with_character_level_parser(llm, prompt, JsonSchemaParser(PlayerSchema.schema()))
display_content(result)
 { "first_name": "Michael", "last_name": "Jordan", "year_of_birth": 1963, "num_seasons_in_nba": 15 }

nice, well done!

Next, let's test the generation of a list of JSON objects, starting with the standard LLM output.

message="Q:please give me a list of planets in the solar system? A: "
prompt=get_prompt(message,DEFAULT_SYSTEM_PROMPT)
output = llm(prompt,max_tokens=512,stop=["Q:"])
text: str = output['choices'][0]['text']
display_header("LLM standard output")
print(text)

## LLM standard output

Of course! I'd be happy to help you with that. The eight planets in our solar system are:
1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn
7. Uranus
8. Neptune

Now, let's turn on LLM output enforcement along with a simple schema.

## llm
llm = Llama(model_path=downloaded_model_path, n_ctx=4096,n_threads=16,verbose=False)

from typing import List
from pydantic import BaseModel

## schema
class PlanetSchema(BaseModel):
    planet_name: str

class PlanetList(BaseModel):
    planets: List[PlanetSchema]

## question
question = 'please give me a list of planets in the solar system?. You MUST answer using the following json schema: '
question_with_schema = f'{question}{PlanetList.schema_json()}'
prompt = get_prompt(question_with_schema)
#display_content(prompt)

## response
display_header("LLM Output with json schema enforcing:")
result = llamacpp_with_character_level_parser(llm, prompt, JsonSchemaParser(PlanetList.schema()))
display_content(result)
## LLM Output with json schema enforcing:
{ "planets": [
{ "planet_name": "Mercury" },
{ "planet_name": "Venus" }, { "planet_name": "Earth" },
{ "planet_name": "Mars" }, { "planet_name": "Jupiter" },
{ "planet_name": "Saturn" }, { "planet_name": "Uranus" },
{ "planet_name": "Neptune" }
] }

Awesome, the result is a list of JSON objects exactly as we defined in the schema.
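
Because the output is guaranteed to match the schema, the application can load it straight into the pydantic model. A small sketch of that consumption step (pydantic v1 style, matching the notebook above):

```
## parse the enforced output back into typed objects
planets = PlanetList.parse_raw(result)
for planet in planets.planets:
    print(planet.planet_name)  # Mercury, Venus, Earth, ...
```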

D. Finetune LLM Approach

See my previous posts on my attempts at fine-tuning an LLM to output JSON, using OCR data as input as well as images as input; in both cases the results worked well.

New Approaches (Updated April 8, 2024)

E. response_format = json_object

Since this article is a few months old, there is now a better way: specify `response_format` to control the output format. See the sample below using llama-cpp-python:

from llama_cpp import Llama
from huggingface_hub import hf_hub_download

model_name = "TheBloke/OpenHermes-2.5-Mistral-7B-GGUF"
model_file = "openhermes-2.5-mistral-7b.Q4_K_S.gguf"
HF_TOKEN = "your token here"
model_path = hf_hub_download(model_name,
                             filename=model_file,
                             local_dir='./gguf',
                             token=HF_TOKEN)
llm = Llama(model_path=model_path,
            n_gpu_layers=-1)

llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who is the prime minister of Canada in 2022?"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)

Output

{'id': 'chatcmpl-0065348d-962c-4f5e-a72b-a4fb2cc69f67',
'object': 'chat.completion',
'created': 1712572334,
'model': './gguf/openhermes-2.5-mistral-7b.Q4_K_S.gguf',
'choices': [{'index': 0,
'message': {'role': 'assistant',
'content': '{\n"prime minister": "Justin Trudeau",\n"year": 2022}'},
'logprobs': None,
'finish_reason': 'length'}],
'usage': {'prompt_tokens': 42, 'completion_tokens': 470, 'total_tokens': 512}}
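
Note that `response_format={"type": "json_object"}` guarantees the message content is valid JSON, but it still arrives as a string inside the chat-completion response. The application parses it in the usual way (this sketch assumes the completion above was assigned to a variable named `response`):

```
import json

content = response["choices"][0]["message"]["content"]
data = json.loads(content)
print(data["prime minister"])  # Justin Trudeau
```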

F. Instructor (JSON / JSON Schema)

Install the instructor library:

#!pip install instructor

## notebook-1. run the llama-cpp-python OpenAI-compatible server
!CMAKE_ARGS="-DLLAMA_CUDA=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
!python3 -m llama_cpp.server --model gguf/openhermes-2.5-mistral-7b.Q4_K_S.gguf --n_gpu_layers 35

## notebook-2. patch the api client and make the call
import openai
import json

client = openai.OpenAI(
    api_key = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # can be anything
    base_url = "http://localhost:8000/v1"  # NOTE: replace with the IP address and port of your llama-cpp-python server
)

import instructor
from pydantic import BaseModel

# Enables `response_model`
client = instructor.patch(client=client)

class PrimeMinisterDetail(BaseModel):
    name: str
    year: int


prime_minister = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=PrimeMinisterDetail,
    messages=[
        {"role": "user", "content": "Who is the prime minister of Canada in 2021?"},
    ]
)

assert isinstance(prime_minister, PrimeMinisterDetail)
print(prime_minister)

output

name='Justin Trudeau' year=2021
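
Instructor also handles the list-of-objects case that tripped up KOR earlier, simply by making the response model a container. A hedged sketch reusing the instructor-patched `client` from above (results will of course depend on the local model):

```
from typing import List
from pydantic import BaseModel

class PlanetDetail(BaseModel):
    planet_name: str

class PlanetList(BaseModel):
    planets: List[PlanetDetail]

planet_list = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the local llama-cpp-python server serves its loaded GGUF model
    response_model=PlanetList,
    messages=[
        {"role": "user", "content": "List the planets in the solar system."},
    ],
)
print(planet_list.planets)
```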

Conclusion

While there is no one-size-fits-all solution, the search for the perfect approach continues. These amazing frameworks are tailored to specific use cases, and I find that imposing constraints on the output yields better results than prompt engineering.

Training my own local model gives me more control over the output, and it’s essential to test the model before using it, as each model’s output may vary and generating a list of JSON objects can be challenging for LLMs.

Thanks for reading, I hope you found something useful.

Have a nice day!

Credits and References

Thanks to the authors for their hard work building and sharing these amazing frameworks.

Llama.cpp: https://github.com/ggerganov/llama.cpp

lm-format-enforcer: https://github.com/noamgat/lm-format-enforcer/tree/main

Grammar builder: https://grammar.intrinsiclabs.ai/

StrictJson: https://github.com/tanchongmin/strictjson

KOR: https://github.com/eyurtsev/kor

llama-cpp-python: https://github.com/abetlen/llama-cpp-python

langchain: https://github.com/langchain-ai/langchain
