When working with LLMs, sometimes you want to know whether the response you’re getting is one the model itself is at least somewhat confident about. For example, I recently worked on classifying pull requests into categories like “feature”, “bugfix”, “infrastructure”, etc. with LLMs, and as part of the process we wanted to know how many categories we should assign to each PR. We were interested in assigning any number of categories that are relevant to the PR (a PR can be both a “bugfix” and “infrastructure”). It’s hard to get a proper confidence score from an LLM, but logprobs are probably the closest we can get. The problem is that in structured response generation (e.g. when you prompt the model to respond in JSON format), you’re only interested in the logprobs of the values, not everything else. In the example generation below, we only care about the logprobs of “bugfix”, “testing”, and “infrastructure”, but not “primary_category”, etc.:
{
  "primary_category": "bugfix",
  "other_relevant_categories": ["testing", "infrastructure"]
}
Today I learned there’s a Python library that helps get these logprobs from OpenAI’s structured responses in a convenient way: structured-logprobs. Here is how to use it:
import json
import math

from openai import OpenAI
from structured_logprobs.main import add_logprobs

## generate structured output with openai
client = OpenAI(api_key="your-api-key")
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                """Please output metadata about the following function:
                def say_hello():
                    print("Hello World")
                """
            ),
        }
    ],
    logprobs=True,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "answer",
            "schema": {
                "type": "object",
                "properties": {
                    "function_name": {"type": "string"},
                    "language": {"type": "string"},
                    "version": {"type": "string"},
                },
            },
        },
    },
)

## using the structured-logprobs library:
chat_completion = add_logprobs(completion)
print(json.loads(chat_completion.value.choices[0].message.content))
This gives me:
{"function_name": "say_hello", "language": "Python", "version": "3.x"}
As you can see, I’ve intentionally added version to the expected output, and I’m expecting the model to be less confident about its generation for version compared to function_name and language. Let’s examine the logprobs and probabilities:
print(chat_completion.log_probs[0])
Output:
{"function_name": -2.6073003596138733e-06,
"language": -0.16022545099258423,
"version": -0.26306185549037764}
Converting to probabilities makes the numbers more intuitive (a logprob is the natural log of the probability, so math.exp recovers it):
data = chat_completion.log_probs[0]
transformed_data = {
    key + "_prob": (
        [round(math.exp(log_prob), 2) for log_prob in value]
        if isinstance(value, list)
        else round(math.exp(value), 2)
    )
    for key, value in data.items()
}
print(transformed_data)
Outputs:
{"function_name_prob": 1.0, "language_prob": 0.85, "version_prob": 0.77}
As expected, we get the highest possible number (1.0) for function_name, since it’s already in our prompt; a very high number for language, since our function is obviously a Python function; and a lower (but still relatively high!) number for version. As I mentioned previously, these numbers don’t necessarily represent the absolute correctness of the values; rather, they are extra information to help us gauge the usefulness of the generated output.
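To close the loop with the original motivation: once the logprobs are turned into probabilities, one straightforward (hypothetical) way to use them is to flag fields that fall below a confidence threshold for manual review. A minimal sketch, building on transformed_data from above (the 0.8 cutoff is an arbitrary choice on my part, not something the library prescribes):
# Hypothetical follow-up: flag any field whose probability falls below a chosen
# threshold so it can be double-checked. The 0.8 cutoff here is arbitrary.
THRESHOLD = 0.8

low_confidence = {
    field: prob
    for field, prob in transformed_data.items()
    # list-valued entries (e.g. arrays of categories) are flagged if any item is low
    if (min(prob) if isinstance(prob, list) else prob) < THRESHOLD
}
print(low_confidence)  # with the numbers above: {'version_prob': 0.77}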