Core
LiteLLM
Deterministic outputs
LiteLLM ModelResponse(Stream) objects have id and created_at fields that are generated dynamically. Even when we use cachy to cache the LLM response, these dynamic fields create diffs, which makes code review more challenging. The patches below ensure that the id and created_at fields are fixed and won’t generate diffs.
patch_litellm
patch_litellm (seed=0)
Patch litellm.ModelResponseBase such that id and created are fixed.
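A minimal usage sketch (assuming patch_litellm is importable from lisette’s core module; the exact import path may differ):

```python
from lisette.core import patch_litellm  # assumed import path

patch_litellm(seed=0)  # fixes ModelResponse(Stream) `id` and `created` fields
# Re-running a notebook against cachy-cached responses now renders identical
# reprs (e.g. a stable 'chatcmpl-xxx' id), so there are no spurious diffs.
```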
Completion
LiteLLM provides a convenient unified interface for most big LLM providers, making it possible to switch providers with just one argument. We want to make it even easier to use by adding some more convenience functions and classes.
This is very similar to our other wrapper libraries for popular AI providers: claudette (Anthropic), gaspard (Gemini), cosette (OpenAI).
# litellm._turn_on_debug()
ms = ["gemini/gemini-2.5-flash", "claude-sonnet-4-5", "openai/gpt-4.1"]
msg = [{'role':'user','content':'Hey there!', 'cache_control': {'type': 'ephemeral'}}]
for m in ms:
    display(Markdown(f'**{m}:**'))
    display(completion(m,msg))
gemini/gemini-2.5-flash:
Hey there! How can I help you today?
- id:
chatcmpl-xxx - model:
gemini-2.5-flash - finish_reason:
stop - usage:
Usage(completion_tokens=153, prompt_tokens=4, total_tokens=157, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=143, rejected_prediction_tokens=None, text_tokens=10), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=4, image_tokens=None))
claude-sonnet-4-5:
Hello! How can I help you today?
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=12, prompt_tokens=10, total_tokens=22, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
openai/gpt-4.1:
Hello! How can I help you today? 😊
- id:
chatcmpl-xxx - model:
gpt-4.1-2025-04-14 - finish_reason:
stop - usage:
Usage(completion_tokens=10, prompt_tokens=10, total_tokens=20, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None))
Messages formatting
Let’s start with making it easier to pass messages into litellm’s completion function (including images and PDF files).
remove_cache_ckpts
remove_cache_ckpts (msg)
remove cache checkpoints and return msg.
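A quick sketch of what that means in practice (the exact internal content layout may differ slightly):

```python
# A message with an Anthropic cache checkpoint attached...
msg = {'role': 'user',
       'content': [{'type': 'text', 'text': 'hi',
                    'cache_control': {'type': 'ephemeral'}}]}
# ...has the checkpoint stripped again by remove_cache_ckpts
msg = remove_cache_ckpts(msg)
assert 'cache_control' not in str(msg)
```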
mk_msg
mk_msg (content, role='user', cache=False, ttl=None)
Create a LiteLLM compatible message.
| | Type | Default | Details |
|---|---|---|---|
| content | | | Content: str, bytes (image), list of mixed content, or dict w ‘role’ and ‘content’ fields |
| role | str | user | Message role if content isn’t already a dict/Message |
| cache | bool | False | Enable Anthropic caching |
| ttl | NoneType | None | Cache TTL: ‘5m’ (default) or ‘1h’ |
Now we can use mk_msg to create different types of messages.
Simple text:
msg = mk_msg("hey")
msg
{'role': 'user', 'content': 'hey'}
Which can be passed to litellm’s completion function like this:
model = ms[1]
res = completion(model, [msg])
res
Hey! How’s it going? What’s on your mind?
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=16, prompt_tokens=8, total_tokens=24, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
We’ll add a little shortcut to make examples and testing easier here:
def c(msgs, **kw):
    msgs = [msgs] if isinstance(msgs,dict) else listify(msgs)
    return completion(model, msgs, **kw)
c(msg)
Hey! How’s it going? What’s on your mind?
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=16, prompt_tokens=8, total_tokens=24, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
Lists with just one string element are flattened for conciseness:
test_eq(mk_msg("hey"), mk_msg(["hey"]))
(LiteLLM ignores these fields when sent to other providers.)
Text and images:
img_fn = Path('samples/puppy.jpg')
Image(filename=img_fn, width=200)
msg = mk_msg(['hey what in this image?',img_fn.read_bytes()])
print(json.dumps(msg,indent=1)[:200]+"...")
{
"role": "user",
"content": [
{
"type": "text",
"text": "hey what in this image?"
},
{
"type": "image_url",
"image_url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/4gxUSU...
c(msg)
This image shows an adorable Cavalier King Charles Spaniel puppy! The puppy has the breed’s characteristic features:
- Coloring: Brown (chestnut) and white coat
- Sweet expression: Large, dark eyes and a gentle face
- Setting: The puppy is lying on grass near some purple flowers (appear to be asters or similar blooms)
The puppy looks very young and has that irresistibly cute, innocent look that Cavalier puppies are famous for. The photo has a professional quality with nice lighting and composition, capturing the puppy’s endearing personality perfectly!
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=139, prompt_tokens=104, total_tokens=243, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
Let’s also demonstrate this for PDFs:
pdf_fn = Path('samples/solveit.pdf')
msg = mk_msg(['Who is the author of this pdf?', pdf_fn.read_bytes()])
c(msg)
The author of this PDF is Jeremy Howard from fast.ai. He explicitly introduces himself in the document with “Hi, I’m Jeremy Howard, from fast.ai” and goes on to describe his work co-founding fast.ai with Rachel Thomas eight years ago.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=59, prompt_tokens=1610, total_tokens=1669, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
Caching
Some providers such as Anthropic require manually opting into caching. Let’s try it:
def cpr(i): return f'{i} '*1024 + 'This is a caching test. Report back only what number you see repeated above.'
disable_cachy()
msg = mk_msg(cpr(1), cache=True)
res = c(msg)
res
1
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=5, prompt_tokens=3, total_tokens=8, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=2070, cache_read_input_tokens=0)
Anthropic has a maximum of 4 cache checkpoints, so we remove previous ones as we go:
res = c([remove_cache_ckpts(msg), mk_msg(res), mk_msg(cpr(2), cache=True)])
res
2
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=5, prompt_tokens=2073, total_tokens=2078, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=2070, text_tokens=None, image_tokens=None), cache_creation_input_tokens=2074, cache_read_input_tokens=2070)
We see that the first message was cached, and this extra message has been written to cache:
res.usage.prompt_tokens_details
PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=2070, text_tokens=None, image_tokens=None)
We can add a bunch of large messages in a loop to see how the number of cached tokens used grows.
We do this 25 times to ensure it still works for more than 20 content blocks, which is a known Anthropic issue.
The code below is commented out by default, because it’s slow. Please uncomment it when working on caching.
# h = []
# msg = mk_msg(cpr(1), cache=True)
# for o in range(2,25):
#     h += [remove_cache_ckpts(msg), mk_msg(res)]
#     msg = mk_msg(cpr(o), cache=True)
#     res = c(h+[msg])
#     detls = res.usage.prompt_tokens_details
#     print(o, detls.cached_tokens, detls.cache_creation_tokens, end='; ')
enable_cachy()
Reconstructing formatted outputs
Lisette can call multiple tools in a loop. Further down in this notebook, we’ll provide convenience functions for formatting such a sequence of tool calls and responses into one formatted output string.
For now, we’ll show an example and demonstrate how to transform such a formatted output string back into a valid LiteLLM history.
fmt_outp = '''
I'll solve this step-by-step, using parallel calls where possible.
<details class='tool-usage-details'>
```json
{
"id": "toolu_01KjnQH2Nsz2viQ7XYpLW3Ta",
"call": { "function": "simple_add", "arguments": { "a": 10, "b": 5 } },
"result": "15"
}
```
</details>
<details class='tool-usage-details'>
```json
{
"id": "toolu_01Koi2EZrGZsBbnQ13wuuvzY",
"call": { "function": "simple_add", "arguments": { "a": 2, "b": 1 } },
"result": "3"
}
```
</details>
Now I need to multiply 15 * 3 before I can do the final division:
<details class='tool-usage-details'>
```json
{
"id": "toolu_0141NRaWUjmGtwxZjWkyiq6C",
"call": { "function": "multiply", "arguments": { "a": 15, "b": 3 } },
"result": "45"
}
```
</details>
'''
We can split into chunks of (text,toolstr,json):
sp = re_tools.split(fmt_outp)
for o in list(chunked(sp, 3, pad=True)): print('- ', o)
- ["\nI'll solve this step-by-step, using parallel calls where possible.\n\n", '<details class=\'tool-usage-details\'>\n\n```json\n{\n "id": "toolu_01KjnQH2Nsz2viQ7XYpLW3Ta",\n "call": { "function": "simple_add", "arguments": { "a": 10, "b": 5 } },\n "result": "15"\n}\n```\n\n</details>', '{\n "id": "toolu_01KjnQH2Nsz2viQ7XYpLW3Ta",\n "call": { "function": "simple_add", "arguments": { "a": 10, "b": 5 } },\n "result": "15"\n}']
- ['\n\n', '<details class=\'tool-usage-details\'>\n\n```json\n{\n "id": "toolu_01Koi2EZrGZsBbnQ13wuuvzY",\n "call": { "function": "simple_add", "arguments": { "a": 2, "b": 1 } },\n "result": "3"\n}\n```\n\n</details>', '{\n "id": "toolu_01Koi2EZrGZsBbnQ13wuuvzY",\n "call": { "function": "simple_add", "arguments": { "a": 2, "b": 1 } },\n "result": "3"\n}']
- ['\n\nNow I need to multiply 15 * 3 before I can do the final division:\n\n', '<details class=\'tool-usage-details\'>\n\n```json\n{\n "id": "toolu_0141NRaWUjmGtwxZjWkyiq6C",\n "call": { "function": "multiply", "arguments": { "a": 15, "b": 3 } },\n "result": "45"\n}\n```\n\n</details>', '{\n "id": "toolu_0141NRaWUjmGtwxZjWkyiq6C",\n "call": { "function": "multiply", "arguments": { "a": 15, "b": 3 } },\n "result": "45"\n}']
- ['\n', None, None]
fmt2hist
fmt2hist (outp:str)
Transform a formatted output into a LiteLLM compatible history
See how we can turn that one formatted output string back into a list of Messages:
from pprint import pprint
h = fmt2hist(fmt_outp)
pprint(h)
[Message(content="I'll solve this step-by-step, using parallel calls where possible.", role='assistant', tool_calls=[ChatCompletionMessageToolCall(function=Function(arguments='{"a":10,"b":5}', name='simple_add'), id='toolu_01KjnQH2Nsz2viQ7XYpLW3Ta', type='function')], function_call=None, provider_specific_fields=None),
{'content': '15',
'name': 'simple_add',
'role': 'tool',
'tool_call_id': 'toolu_01KjnQH2Nsz2viQ7XYpLW3Ta'},
Message(content='', role='assistant', tool_calls=[ChatCompletionMessageToolCall(function=Function(arguments='{"a":2,"b":1}', name='simple_add'), id='toolu_01Koi2EZrGZsBbnQ13wuuvzY', type='function')], function_call=None, provider_specific_fields=None),
{'content': '3',
'name': 'simple_add',
'role': 'tool',
'tool_call_id': 'toolu_01Koi2EZrGZsBbnQ13wuuvzY'},
Message(content='Now I need to multiply 15 * 3 before I can do the final division:', role='assistant', tool_calls=[ChatCompletionMessageToolCall(function=Function(arguments='{"a":15,"b":3}', name='multiply'), id='toolu_0141NRaWUjmGtwxZjWkyiq6C', type='function')], function_call=None, provider_specific_fields=None),
{'content': '45',
'name': 'multiply',
'role': 'tool',
'tool_call_id': 'toolu_0141NRaWUjmGtwxZjWkyiq6C'},
Message(content='.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None)]
mk_msgs
We will skip tool use blocks and tool results during caching.
Now let’s make it easy to provide entire conversations:
mk_msgs
mk_msgs (msgs, cache=False, cache_idxs=[-1], ttl=None)
Create a list of LiteLLM compatible messages.
| | Type | Default | Details |
|---|---|---|---|
| msgs | | | List of messages (each: str, bytes, list, or dict w ‘role’ and ‘content’ fields) |
| cache | bool | False | Enable Anthropic caching |
| cache_idxs | list | [-1] | Cache breakpoint idxs |
| ttl | NoneType | None | Cache TTL: ‘5m’ (default) or ‘1h’ |
With mk_msgs you can easily provide a whole conversation:
msgs = mk_msgs(['Hey!',"Hi there!","How are you?","I'm doing fine and you?"])
msgs
[{'role': 'user', 'content': 'Hey!'},
{'role': 'assistant', 'content': 'Hi there!'},
{'role': 'user', 'content': 'How are you?'},
{'role': 'assistant', 'content': "I'm doing fine and you?"}]
By default the last message will be cached when cache=True:
msgs = mk_msgs(['Hey!',"Hi there!","How are you?","I'm doing fine and you?"], cache=True)
msgs
[{'role': 'user', 'content': 'Hey!'},
{'role': 'assistant', 'content': 'Hi there!'},
{'role': 'user', 'content': 'How are you?'},
{'role': 'assistant',
'content': [{'type': 'text',
'text': "I'm doing fine and you?",
'cache_control': {'type': 'ephemeral'}}]}]
test_eq('cache_control' in msgs[-1]['content'][0], True)
Alternatively, users can provide custom cache_idxs. Tool call blocks and results are skipped during caching:
msgs = mk_msgs(['Hello!','Hi! How can I help you?','Call some functions!',fmt_outp], cache=True, cache_idxs=[0,-2,-1])
msgs
[{'role': 'user',
'content': [{'type': 'text',
'text': 'Hello!',
'cache_control': {'type': 'ephemeral'}}]},
{'role': 'assistant', 'content': 'Hi! How can I help you?'},
{'role': 'user',
'content': [{'type': 'text',
'text': 'Call some functions!',
'cache_control': {'type': 'ephemeral'}}]},
Message(content="I'll solve this step-by-step, using parallel calls where possible.", role='assistant', tool_calls=[ChatCompletionMessageToolCall(function=Function(arguments='{"a":10,"b":5}', name='simple_add'), id='toolu_01KjnQH2Nsz2viQ7XYpLW3Ta', type='function')], function_call=None, provider_specific_fields=None),
{'role': 'tool',
'tool_call_id': 'toolu_01KjnQH2Nsz2viQ7XYpLW3Ta',
'name': 'simple_add',
'content': '15'},
Message(content='', role='assistant', tool_calls=[ChatCompletionMessageToolCall(function=Function(arguments='{"a":2,"b":1}', name='simple_add'), id='toolu_01Koi2EZrGZsBbnQ13wuuvzY', type='function')], function_call=None, provider_specific_fields=None),
{'role': 'tool',
'tool_call_id': 'toolu_01Koi2EZrGZsBbnQ13wuuvzY',
'name': 'simple_add',
'content': '3'},
Message(content='Now I need to multiply 15 * 3 before I can do the final division:', role='assistant', tool_calls=[ChatCompletionMessageToolCall(function=Function(arguments='{"a":15,"b":3}', name='multiply'), id='toolu_0141NRaWUjmGtwxZjWkyiq6C', type='function')], function_call=None, provider_specific_fields=None),
{'role': 'tool',
'tool_call_id': 'toolu_0141NRaWUjmGtwxZjWkyiq6C',
'name': 'multiply',
'content': '45'},
Message(content=[{'type': 'text', 'text': '.', 'cache_control': {'type': 'ephemeral'}}], role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None)]
test_eq('cache_control' in msgs[0]['content'][0], True)
test_eq('cache_control' in msgs[2]['content'][0], True) # shifted idxs to skip tools
test_eq('cache_control' in msgs[-1]['content'][0], True)
Who’s speaking when is automatically inferred, even when there are multiple tools being called in parallel (which LiteLLM supports!).
msgs = mk_msgs(['Tell me the weather in Paris and Rome',
'Assistant calls weather tool two times',
{'role':'tool','content':'Weather in Paris is ...'},
{'role':'tool','content':'Weather in Rome is ...'},
'Assistant returns weather',
'Thanks!'])
msgs
[{'role': 'user', 'content': 'Tell me the weather in Paris and Rome'},
{'role': 'assistant', 'content': 'Assistant calls weather tool two times'},
{'role': 'tool', 'content': 'Weather in Paris is ...'},
{'role': 'tool', 'content': 'Weather in Rome is ...'},
{'role': 'assistant', 'content': 'Assistant returns weather'},
{'role': 'user', 'content': 'Thanks!'}]
For ease of use, if msgs is not already in a list, it will automatically be wrapped inside one. This way you can pass a single prompt into mk_msgs and get back a LiteLLM compatible msg history.
msgs = mk_msgs("Hey")
msgs
[{'role': 'user', 'content': 'Hey'}]
msgs = mk_msgs(['Hey!',"Hi there!","How are you?","I'm fine, you?"])
msgs
[{'role': 'user', 'content': 'Hey!'},
{'role': 'assistant', 'content': 'Hi there!'},
{'role': 'user', 'content': 'How are you?'},
{'role': 'assistant', 'content': "I'm fine, you?"}]
However, beware that if you use mk_msgs for a single message consisting of multiple parts, you should be explicit and make sure to wrap those parts in two lists:
- One list to show that they belong together in one message (the inner list).
- Another, because mk_msgs expects a list of multiple messages (the outer list).
This is common when working with images for example:
msgs = mk_msgs([['Whats in this img?',img_fn.read_bytes()]])
print(json.dumps(msgs,indent=1)[:200]+"...")
[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Whats in this img?"
},
{
"type": "image_url",
"image_url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD...
Streaming
LiteLLM supports streaming responses. That’s really useful if you want to show intermediate results, instead of having to wait until the whole response is finished.
We create this helper function that returns the entire response at the end of the stream. This is useful when you want to store the whole response somewhere after having displayed the intermediate results.
stream_with_complete
stream_with_complete (gen, postproc=<function noop>)
Extend streaming response chunks with the complete response
r = c(mk_msgs("Hey!"), stream=True)
r2 = SaveReturn(stream_with_complete(r))
for o in r2:
    cts = o.choices[0].delta.content
    if cts: print(cts, end='')
Hey! How's it going? 😊 What can I help you with today?
r2.value
Hey! How’s it going? 😊 What can I help you with today?
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5 - finish_reason:
stop - usage:
Usage(completion_tokens=22, prompt_tokens=9, total_tokens=31, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=None), prompt_tokens_details=None)
Tools
lite_mk_func
lite_mk_func (f)
def simple_add(
    a: int,   # first operand
    b: int=0  # second operand
) -> int:
    "Add two numbers together"
    return a + b
toolsc = lite_mk_func(simple_add)
toolsc
{'type': 'function',
'function': {'name': 'simple_add',
'description': 'Add two numbers together\n\nReturns:\n- type: integer',
'parameters': {'type': 'object',
'properties': {'a': {'type': 'integer', 'description': 'first operand'},
'b': {'type': 'integer', 'description': 'second operand', 'default': 0}},
'required': ['a']}}}
tmsg = mk_msg("What is 5478954793+547982745? How about 5479749754+9875438979? Always use tools for calculations, and describe what you'll do before using a tool. Where multiple tool calls are required, do them in a single response where possible. ")
r = c(tmsg, tools=[toolsc])
display(r)
I’ll help you calculate both of those sums using the addition tool.
Let me break down what I’ll do: 1. First calculation: 5478954793 + 547982745 2. Second calculation: 5479749754 + 9875438979
Since these are independent calculations, I’ll perform both at the same time.
🔧 simple_add({“a”: 5478954793, “b”: 547982745})
🔧 simple_add({“a”: 5479749754, “b”: 9875438979})
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=211, prompt_tokens=659, total_tokens=870, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
A tool response can be a string or a list of tool blocks (e.g., an image url block). To allow users to specify that a response should not be immediately stringified, we provide the ToolResponse datatype that users can wrap their return value in.
ToolResponse
ToolResponse (content:list[str,str])
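For example, a tool that wants to return content blocks instead of a plain string could wrap them like this (a hedged sketch; the URL and block layout are just for illustration):

```python
def get_logo(company: str):
    "Return the company logo as an image block"
    url = f"https://example.com/logos/{company}.png"  # hypothetical URL
    # Wrapping in ToolResponse signals that the result should be passed through
    # as content blocks rather than being stringified.
    return ToolResponse([{'type': 'image_url', 'image_url': url}])
```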
tcs = [_lite_call_func(o, ns=globals()) for o in r.choices[0].message.tool_calls]
tcs
[{'tool_call_id': 'toolu_01KATe5b5tmd4tK5D9BUZE5S',
'role': 'tool',
'name': 'simple_add',
'content': '6026937538'},
{'tool_call_id': 'toolu_01E4WQj8RkQj8Z7QLJ6ireTe',
'role': 'tool',
'name': 'simple_add',
'content': '15355188733'}]
def delta_text(msg):
    "Extract printable content from streaming delta, return None if nothing to print"
    c = msg.choices[0]
    if not c: return c
    if not hasattr(c,'delta'): return None #f'{c}'
    delta = c.delta
    if delta.content: return delta.content
    if delta.tool_calls:
        res = ''.join(f"🔧 {tc.function.name}" for tc in delta.tool_calls if tc.id and tc.function.name)
        if res: return f'\n{res}\n'
    if hasattr(delta,'reasoning_content'): return '🧠' if delta.reasoning_content else '\n\n'
    return None
r = c(tmsg, stream=True, tools=[toolsc])
r2 = SaveReturn(stream_with_complete(r))
for o in r2: print(delta_text(o) or '', end='')
I'll help you calculate those two sums using the addition tool.
Let me break down what I need to do:
1. Calculate 5478954793 + 547982745
2. Calculate 5479749754 + 9875438979
Since these are independent calculations, I'll perform both additions at once.
🔧 simple_add
🔧 simple_add
r2.value
I’ll help you calculate those two sums using the addition tool.
Let me break down what I need to do: 1. Calculate 5478954793 + 547982745 2. Calculate 5479749754 + 9875438979
Since these are independent calculations, I’ll perform both additions at once.
🔧 simple_add({“a”: 5478954793, “b”: 547982745})
🔧 simple_add({“a”: 5479749754, “b”: 9875438979})
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=206, prompt_tokens=659, total_tokens=865, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=None), prompt_tokens_details=None)
msg = mk_msg("Solve this complex math problem: What is the derivative of x^3 + 2x^2 - 5x + 1?")
r = c(msg, stream=True, reasoning_effort="low")
r2 = SaveReturn(stream_with_complete(r))
for o in r2: print(delta_text(o) or '', end='')
🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠
# Derivative Solution
To find the derivative of **f(x) = x³ + 2x² - 5x + 1**, I'll apply the power rule to each term.
## Using the Power Rule: d/dx(xⁿ) = n·xⁿ⁻¹
**Term by term:**
- d/dx(x³) = 3x²
- d/dx(2x²) = 4x
- d/dx(-5x) = -5
- d/dx(1) = 0
## Answer:
**f'(x) = 3x² + 4x - 5**
r2.value
Derivative Solution
To find the derivative of f(x) = x³ + 2x² - 5x + 1, I’ll apply the power rule to each term.
Using the Power Rule: d/dx(xⁿ) = n·xⁿ⁻¹
Term by term: - d/dx(x³) = 3x² - d/dx(2x²) = 4x - d/dx(-5x) = -5 - d/dx(1) = 0
Answer:
f’(x) = 3x² + 4x - 5
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5 - finish_reason:
stop - usage:
Usage(completion_tokens=328, prompt_tokens=66, total_tokens=394, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=148, rejected_prediction_tokens=None, text_tokens=None), prompt_tokens_details=None)
Search
LiteLLM provides search, not via tools, but via the special web_search_options param.
Note: Not all models support web search. LiteLLM’s supports_web_search field should indicate this, but it’s unreliable for some models like claude-sonnet-4-20250514. Checking both supports_web_search and search_context_cost_per_query provides more accurate detection.
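A minimal sketch of that combined check (not necessarily the exact _has_search implementation; it assumes both fields are exposed via litellm.get_model_info):

```python
import litellm

def has_search(m):
    "Treat a model as search-capable if either field says so"
    try: info = litellm.get_model_info(m)
    except Exception: return False
    return bool(info.get('supports_web_search') or info.get('search_context_cost_per_query'))
```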
for m in ms: print(m, _has_search(m))
gemini/gemini-2.5-flash True
claude-sonnet-4-5 True
openai/gpt-4.1 False
When search is supported it can be used like this:
smsg = mk_msg("Search the web and tell me very briefly about otters")
r = c(smsg, web_search_options={"search_context_size": "low"}) # or 'medium' / 'high'
r
Otters are carnivorous mammals in the subfamily Lutrinae and members of the weasel family. The 14 extant otter species are all semiaquatic, both freshwater and marine. They’re found on every continent except Australia and Antarctica.
Otters are distinguished by their long, slim bodies, powerful webbed feet for swimming, and their dense fur, which keeps them warm and buoyant in water. In fact, otters have the densest fur of any animal—as many as a million hairs per square inch in places.
All otters are expert hunters that eat fish, crustaceans, and other critters. They’re known for being playful animals and sea otters famously use rocks as tools to crack open shellfish. When it’s time to nap, sea otters entangle themselves in kelp so they don’t float away.
Many otter species were historically hunted nearly to extinction for their fur but have since recovered in some areas, though several species remain threatened by pollution and habitat loss.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=382, prompt_tokens=18089, total_tokens=18471, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), server_tool_use=ServerToolUse(web_search_requests=1), cache_creation_input_tokens=0, cache_read_input_tokens=0)
Citations
Next, let’s handle Anthropic’s search citations.
When not using streaming, all citations are placed in a separate key in the response:
r.choices[0].message.provider_specific_fields['citations'][0]
[{'type': 'web_search_result_location',
'cited_text': 'Otters are carnivorous mammals in the subfamily Lutrinae. ',
'url': 'https://en.wikipedia.org/wiki/Otter',
'title': 'Otter - Wikipedia',
'encrypted_index': 'Eo8BCioICBgCIiQ4ODk4YTFkYy0yMTNkLTRhNmYtOTljYi03ZTBlNTUzZDc0NWISDMlacTT8THSDML7nuhoMyB3Xp2StEfWJOx72IjATEIYmZbwZDH+a0KRLuOHQx4nipGzmvy//B4ItZEaDN4t55aF0a+SnmlUY390IN18qE+y/CtqixJ/kgvGL2GCYkFhQRxMYBA=='}]
However, when streaming, the results are not captured this way. Instead, we provide this helper function that adds the citations to the content field in markdown format:
cite_footnotes
cite_footnotes (stream_list)
Add markdown footnote citations to stream deltas
cite_footnote
cite_footnote (msg)
r = list(c(smsg, stream=True, web_search_options={"search_context_size": "low"}))
cite_footnotes(r)
stream_chunk_builder(r)
Otters are * charismatic members of the weasel family, found on every continent except Australia and Antarctica. * * There are 13-14 species in total, ranging from the small-clawed otter to the giant otter.
These aquatic mammals are known for * their short ears and noses, elongated bodies, long tails, and soft, dense fur. In fact, * otters have the densest fur of any animal—as many as a million hairs per square inch, which keeps them warm in water since they lack blubber.
* All otters are expert hunters that eat fish, crustaceans, and other critters. * Sea otters will float on their backs, place a rock on their chests, then smash mollusks down on it until they break open. * River otters are especially playful, gamboling on land and splashing into rivers and streams. They’re highly adapted for water with webbed feet, and * can stay submerged for more than 5 minutes, with river otters able to hold their breath for up to 8 minutes.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5 - finish_reason:
stop - usage:
Usage(completion_tokens=431, prompt_tokens=15055, total_tokens=15486, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=None), prompt_tokens_details=None)
Chat
LiteLLM is pretty bare bones. It doesn’t keep track of the conversation history or which tools have been added to the conversation so far.
So let’s make a Claudette-style wrapper so we can do streaming, tool calling, and tool loops without problems.
When the tool uses are about to be exhausted, it is important to alert the AI so that it knows to use its final steps to communicate the current progress and next steps to the user.
Chat
Chat (model:str, sp='', temp=0, search=False, tools:list=None, hist:list=None, ns:Optional[dict]=None, cache=False, cache_idxs:list=[-1], ttl=None)
LiteLLM chat client.
| | Type | Default | Details |
|---|---|---|---|
| model | str | | LiteLLM compatible model name |
| sp | str | | System prompt |
| temp | int | 0 | Temperature |
| search | bool | False | Search (l,m,h), if model supports it |
| tools | list | None | Add tools |
| hist | list | None | Chat history |
| ns | Optional | None | Custom namespace for tool calling |
| cache | bool | False | Anthropic prompt caching |
| cache_idxs | list | [-1] | Anthropic cache breakpoint idxs, use 0 for sys prompt if provided |
| ttl | NoneType | None | Anthropic prompt caching ttl |
@patch(as_prop=True)
def cost(self: Chat):
    "Total cost of all responses in conversation history"
    return sum(getattr(r, '_hidden_params', {}).get('response_cost') or 0
               for r in self.h if hasattr(r, 'choices'))
Examples
History tracking
chat = Chat(model)
res = chat("Hey my name is Rens")
res
Hey Rens! Nice to meet you. How can I help you today?
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=20, prompt_tokens=14, total_tokens=34, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
chat("Whats my name")Your name is Rens! You told me that when you introduced yourself at the start of our conversation.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=25, prompt_tokens=42, total_tokens=67, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
See now we keep track of history!
History is stored in the hist attribute:
chat.hist
[{'role': 'user', 'content': 'Hey my name is Rens'},
Message(content='Hey Rens! Nice to meet you. How can I help you today?', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
{'role': 'user', 'content': 'Whats my name'},
Message(content='Your name is Rens! You told me that when you introduced yourself at the start of our conversation.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})]
You can also pass an old chat history into new Chat objects:
chat2 = Chat(model, hist=chat.hist)
chat2("What was my name again?")Your name is Rens! You’ve asked me a couple times now - just checking if I’m paying attention? 😊
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=30, prompt_tokens=76, total_tokens=106, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
Synthetic History Creation
Let’s build a chat history step by step. That way we can tweak anything we need to during testing.
pr = "What is 5 + 7? Use the tool to calculate it."
c = Chat(model, tools=[simple_add])
res = c(pr)
Chat.print_hist
Chat.print_hist ()
Print each message on a different line
Whereas normally without tools we would get one user input and one assistant response, here we get two extra messages in between:
- An assistant message requesting the tool calls with arguments.
- A tool response with the result of the tool call.
c.print_hist()
{'role': 'user', 'content': 'What is 5 + 7? Use the tool to calculate it.'}
Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"a": 5, "b": 7}', 'name': 'simple_add'}, 'id': 'toolu_012bi9eSyzhwaG3TgGpytJbc', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})
{'tool_call_id': 'toolu_012bi9eSyzhwaG3TgGpytJbc', 'role': 'tool', 'name': 'simple_add', 'content': '12'}
{'role': 'assistant', 'content': 'You have no more tool uses. Please summarize your findings. If you did not complete your goal please tell the user what further work needs to be done so they can choose how best to proceed.'}
Message(content='\n\nThe result of 5 + 7 is **12**.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})
Let’s try to build this up manually so we have full control over the inputs.
random_tool_id
random_tool_id ()
Generate a random tool ID with ‘toolu_’ prefix
random_tool_id()
'toolu_0UAqFzWsDK4FrUMp48Y3tT3QD'
A tool call request can contain one or more tool calls. Let’s make one.
mk_tc
mk_tc (func, args, tcid=None, idx=1)
tc = mk_tc(simple_add.__name__, json.dumps(dict(a=5, b=7)))
tc
{'index': 1,
'function': {'arguments': '{"a": 5, "b": 7}', 'name': 'simple_add'},
'id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7',
'type': 'function'}
This can then be packaged into the full Message object produced by the assistant.
def mk_tc_req(content, tcs): return Message(content=content, role='assistant', tool_calls=tcs, function_call=None)
tc_cts = "I'll use the simple_add tool to calculate 5 + 7 for you."
tcq = mk_tc_req(tc_cts, [tc])
tcqMessage(content="I'll use the simple_add tool to calculate 5 + 7 for you.", role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=1, function=Function(arguments='{"a": 5, "b": 7}', name='simple_add'), id='toolu_gAL47D1qXIaSyZPaE1pu1lJo7', type='function')], function_call=None, provider_specific_fields=None)
Notice how Message instantiation creates a list of ChatCompletionMessageToolCall objects by default. When the tools are executed, these are converted back to dictionaries, so for consistency we want to keep them as dictionaries from the beginning.
mk_tc_req
mk_tc_req (content, tcs)
tcq = mk_tc_req(tc_cts, [tc])
tcqMessage(content="I'll use the simple_add tool to calculate 5 + 7 for you.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"a": 5, "b": 7}', 'name': 'simple_add'}, 'id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7', 'type': 'function'}], function_call=None, provider_specific_fields=None)
c = Chat(model, tools=[simple_add], hist=[pr, tcq])
c.print_hist()
{'role': 'user', 'content': 'What is 5 + 7? Use the tool to calculate it.'}
Message(content="I'll use the simple_add tool to calculate 5 + 7 for you.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"a": 5, "b": 7}', 'name': 'simple_add'}, 'id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7', 'type': 'function'}], function_call=None, provider_specific_fields=None)
Looks good so far! Now we will want to provide the actual result!
mk_tc_result
mk_tc_result (tc, result)
Note that we might have more than one tool call if more than one was passed in; here we will just make one result.
tcq.tool_calls[0]
{'index': 1,
'function': {'arguments': '{"a": 5, "b": 7}', 'name': 'simple_add'},
'id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7',
'type': 'function'}
mk_tc_result(tcq.tool_calls[0], '12')
{'tool_call_id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7',
'role': 'tool',
'name': 'simple_add',
'content': '12'}
mk_tc_results
mk_tc_results (tcq, results)
Similarly here, tcq.tool_calls will match the number of results passed in the results list.
tcqMessage(content="I'll use the simple_add tool to calculate 5 + 7 for you.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"a": 5, "b": 7}', 'name': 'simple_add'}, 'id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7', 'type': 'function'}], function_call=None, provider_specific_fields=None)
tcr = mk_tc_results(tcq, ['12'])
tcr
[{'tool_call_id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7',
'role': 'tool',
'name': 'simple_add',
'content': '12'}]
Now we can call it with this synthetic data to see what the response is!
c(tcr[0])
The result of 5 + 7 is 12.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=17, prompt_tokens=720, total_tokens=737, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
c.print_hist()
{'role': 'user', 'content': 'What is 5 + 7? Use the tool to calculate it.'}
Message(content="I'll use the simple_add tool to calculate 5 + 7 for you.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"a": 5, "b": 7}', 'name': 'simple_add'}, 'id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7', 'type': 'function'}], function_call=None, provider_specific_fields=None)
{'tool_call_id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7', 'role': 'tool', 'name': 'simple_add', 'content': '12'}
Message(content='The result of 5 + 7 is **12**.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})
Let’s try this again, but let’s give it something that is clearly wrong, just for fun.
c = Chat(model, tools=[simple_add], hist=[pr, tcq])
tcr = mk_tc_results(tcq, ['13'])
tcr
[{'tool_call_id': 'toolu_gAL47D1qXIaSyZPaE1pu1lJo7',
'role': 'tool',
'name': 'simple_add',
'content': '13'}]
c(tcr[0])
The result of 5 + 7 is 12.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=17, prompt_tokens=720, total_tokens=737, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
Let’s make sure this works with multiple tool calls in the same assistant Message.
tcs = [
    mk_tc(simple_add.__name__, json.dumps({"a": 5, "b": 7})),
    mk_tc(simple_add.__name__, json.dumps({"a": 6, "b": 7})),
]
tcq = mk_tc_req("I will calculate these for you!", tcs)
tcq
Message(content='I will calculate these for you!', role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"a": 5, "b": 7}', 'name': 'simple_add'}, 'id': 'toolu_XBetF5gIRHYH7LKBKxJsllLOD', 'type': 'function'}, {'index': 1, 'function': {'arguments': '{"a": 6, "b": 7}', 'name': 'simple_add'}, 'id': 'toolu_fU25035HyRrY03K6JBO94XfLE', 'type': 'function'}], function_call=None, provider_specific_fields=None)
tcr = mk_tc_results(tcq, ['12', '13'])
c = Chat(model, tools=[simple_add], hist=[pr, tcq, tcr[0]])
c(tcr[1])
5 + 7 = 12
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=13, prompt_tokens=812, total_tokens=825, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
c.print_hist()
{'role': 'user', 'content': 'What is 5 + 7? Use the tool to calculate it.'}
Message(content='I will calculate these for you!', role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"a": 5, "b": 7}', 'name': 'simple_add'}, 'id': 'toolu_XBetF5gIRHYH7LKBKxJsllLOD', 'type': 'function'}, {'index': 1, 'function': {'arguments': '{"a": 6, "b": 7}', 'name': 'simple_add'}, 'id': 'toolu_fU25035HyRrY03K6JBO94XfLE', 'type': 'function'}], function_call=None, provider_specific_fields=None)
{'tool_call_id': 'toolu_XBetF5gIRHYH7LKBKxJsllLOD', 'role': 'tool', 'name': 'simple_add', 'content': '12'}
{'tool_call_id': 'toolu_fU25035HyRrY03K6JBO94XfLE', 'role': 'tool', 'name': 'simple_add', 'content': '13'}
Message(content='5 + 7 = **12**', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})
chat = Chat(ms[1], tools=[simple_add])
res = chat("What's 5 + 3? Use the `simple_add` tool.")
res
The result of 5 + 3 is 8.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=18, prompt_tokens=742, total_tokens=760, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
res = chat("Now, tell me a joke based on that result.")
res
Here’s a joke based on the number 8:
Why was 6 afraid of 7?
Because 7 8 (ate) 9!
But since we got 8 as our answer, here’s another one:
What do you call an 8 that’s been working out?
An “ate” with great figure! 💪
(Get it? Because 8 already has a great figure with those curves!)
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=100, prompt_tokens=774, total_tokens=874, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
chat.hist[{'role': 'user', 'content': "What's 5 + 3? Use the `simple_add` tool."},
Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"a": 5, "b": 3}', 'name': 'simple_add'}, 'id': 'toolu_016dgFwdeaQXSwLPnJzufcWq', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
{'tool_call_id': 'toolu_016dgFwdeaQXSwLPnJzufcWq',
'role': 'tool',
'name': 'simple_add',
'content': '8'},
{'role': 'assistant',
'content': 'You have no more tool uses. Please summarize your findings. If you did not complete your goal please tell the user what further work needs to be done so they can choose how best to proceed.'},
Message(content='\n\nThe result of 5 + 3 is **8**.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
{'role': 'user', 'content': 'Now, tell me a joke based on that result.'},
Message(content='Here\'s a joke based on the number 8:\n\nWhy was 6 afraid of 7?\n\nBecause 7 8 (ate) 9!\n\nBut since we got 8 as our answer, here\'s another one:\n\nWhat do you call an 8 that\'s been working out?\n\nAn "ate" with great figure! 💪\n\n(Get it? Because 8 already has a great figure with those curves!)', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})]
Images
chat = Chat(ms[1])
chat(['Whats in this img?',img_fn.read_bytes()])
Image Description
This adorable image shows a Cavalier King Charles Spaniel puppy with the classic Blenheim coloring (chestnut and white markings).
Key features visible:
- Puppy with expressive brown eyes looking directly at the camera
- Soft, fluffy coat with rich brown/chestnut patches on the ears and around the eyes
- White blaze down the center of the face
- Lying on grass in what appears to be a garden setting
- Purple flowers (possibly asters) visible in the background
- The puppy has a sweet, gentle expression typical of the breed
The photo has a warm, professional quality with nice depth of field that keeps the focus on the puppy’s endearing face while softly blurring the floral background.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=188, prompt_tokens=105, total_tokens=293, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
Prefill
Prefill works as expected:
chat = Chat(ms[1])
chat("Spell my name",prefill="Your name is R E")Your name is R E D A C T E D
I don’t actually know your name - you haven’t told me what it is yet! If you’d like me to spell your name, please let me know what it is first.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=47, prompt_tokens=16, total_tokens=63, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
And the entire message is stored in the history, not just the generated part:
chat.hist[-1]
Message(content="Your name is R E D A C T E D\n\nI don't actually know your name - you haven't told me what it is yet! If you'd like me to spell your name, please let me know what it is first.", role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})
Streaming
from time import sleep
chat = Chat(model)
stream_gen = chat("Count to 5", stream=True)
for chunk in stream_gen:
    if isinstance(chunk, ModelResponse): display(chunk)
    else: print(delta_text(chunk) or '',end='')
1, 2, 3, 4, 5
1, 2, 3, 4, 5
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5 - finish_reason:
stop - usage:
Usage(completion_tokens=17, prompt_tokens=11, total_tokens=28, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=None), prompt_tokens_details=None)
Let’s try prefill with streaming too:
stream_gen = chat("Continue counting to 10","Okay! 6, 7",stream=True)
for chunk in stream_gen:
    if isinstance(chunk, ModelResponse): display(chunk)
    else: print(delta_text(chunk) or '',end='')
Okay! 6, 7, 8, 9, 10
Okay! 6, 7, 8, 9, 10
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5 - finish_reason:
stop - usage:
Usage(completion_tokens=12, prompt_tokens=44, total_tokens=56, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=None), prompt_tokens_details=None)
Tool use
OK, now let’s test tool use.
for m in ms:
    display(Markdown(f'**{m}:**'))
    chat = Chat(m, tools=[simple_add])
    res = chat("What's 5 + 3? Use the `simple_add` tool. Explain.")
    display(res)
gemini/gemini-2.5-flash:
I used the simple_add tool with a=5 and b=3. The tool returned 8.
Therefore, 5 + 3 = 8.
- id:
chatcmpl-xxx - model:
gemini-2.5-flash - finish_reason:
stop - usage:
Usage(completion_tokens=118, prompt_tokens=159, total_tokens=277, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=79, rejected_prediction_tokens=None, text_tokens=39), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=159, image_tokens=None))
claude-sonnet-4-5:
Result: 5 + 3 = 8
Explanation: The simple_add function takes two parameters: - a (first operand): I provided 5 - b (second operand): I provided 3
The function added these two numbers together and returned 8, which is the correct sum of 5 and 3.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=89, prompt_tokens=764, total_tokens=853, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
openai/gpt-4.1:
The result of 5 + 3 is 8.
Explanation: I used the simple_add tool, which takes two numbers and adds them together. By inputting 5 and 3, the tool calculated the sum as 8.
- id:
chatcmpl-xxx - model:
gpt-4.1-2025-04-14 - finish_reason:
stop - usage:
Usage(completion_tokens=48, prompt_tokens=155, total_tokens=203, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None))
Thinking with tool use
chat = Chat(model, tools=[simple_add])
res = chat("What's 5 + 3?",think='l',return_all=True)
display(*res)
🔧 simple_add({“a”: 5, “b”: 3})
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=125, prompt_tokens=638, total_tokens=763, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=43, rejected_prediction_tokens=None, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
{'tool_call_id': 'toolu_01SY1R38L37vhWpgNgQz2B5h',
'role': 'tool',
'name': 'simple_add',
'content': '8'}
5 + 3 = 8
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=14, prompt_tokens=816, total_tokens=830, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
Search
chat = Chat(model)
res = chat("Search the web and tell me very briefly about otters", search='l', stream=True)
for o in res:
    if isinstance(o, ModelResponse): sleep(0.01); display(o)
    else: print(delta_text(o) or '',end='')
Otters are charismatic members of the weasel family found on every continent except Australia and Antarctica. There are 13 species in total, including sea otters and river otters.
These aquatic mammals have elongated bodies, long tails, and soft, dense fur. In fact, otters have the densest fur of any animal—as many as a million hairs per square inch. Webbed feet and powerful tails make otters strong swimmers.
All otters are expert hunters that eat fish, crustaceans, and other critters. Sea otters float on their backs, place a rock on their chest, then smash mollusks down on it until it breaks open. They're also known for being playful animals, engaging in activities like sliding into water on natural slides.
Otters are * charismatic members of the weasel family found on every continent except Australia and Antarctica. * There are 13 species in total, including sea otters and river otters.
These aquatic mammals have * elongated bodies, long tails, and soft, dense fur. In fact, * otters have the densest fur of any animal—as many as a million hairs per square inch. * Webbed feet and powerful tails make otters strong swimmers.
* All otters are expert hunters that eat fish, crustaceans, and other critters. * Sea otters float on their backs, place a rock on their chest, then smash mollusks down on it until it breaks open. They’re also known for being * playful animals, engaging in activities like sliding into water on natural slides.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5 - finish_reason:
stop - usage:
Usage(completion_tokens=362, prompt_tokens=15055, total_tokens=15417, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=None), prompt_tokens_details=None)
Multi tool calling
We can let the model call multiple tools in sequence using the max_steps parameter.
chat = Chat(model, tools=[simple_add])
res = chat("What's ((5 + 3)+7)+11? Work step by step", return_all=True, max_steps=5)
for r in res: display(r)
I’ll solve this step by step using the addition function.
Step 1: First, let me calculate 5 + 3
🔧 simple_add({“a”: 5, “b”: 3})
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=100, prompt_tokens=617, total_tokens=717, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
{'tool_call_id': 'toolu_01SykhkA2BGKXm9J56KCkz2B',
'role': 'tool',
'name': 'simple_add',
'content': '8'}
Step 2: Now I’ll add 7 to that result (8 + 7)
🔧 simple_add({“a”: 8, “b”: 7})
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=93, prompt_tokens=730, total_tokens=823, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
{'tool_call_id': 'toolu_013LrGqASqf9Bsk38scV5Pu7',
'role': 'tool',
'name': 'simple_add',
'content': '15'}
Step 3: Finally, I’ll add 11 to that result (15 + 11)
🔧 simple_add({“a”: 15, “b”: 11})
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=94, prompt_tokens=836, total_tokens=930, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
{'tool_call_id': 'toolu_01RtpzYFxji9ZbQJtTjKwaCi',
'role': 'tool',
'name': 'simple_add',
'content': '26'}
Answer: ((5 + 3) + 7) + 11 = 26
Here’s the breakdown: - 5 + 3 = 8 - 8 + 7 = 15 - 15 + 11 = 26
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=67, prompt_tokens=943, total_tokens=1010, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
Some models support parallel tool calling, i.e. sending multiple tool call requests in one conversation step.
def multiply(a: int, b: int) -> int:
    "Multiply two numbers"
    return a * b
chat = Chat('openai/gpt-4.1', tools=[simple_add, multiply])
res = chat("Calculate (5 + 3) * (7 + 2)", max_steps=5, return_all=True)
for r in res: display(r)
🔧 simple_add({“a”: 5, “b”: 3})
🔧 simple_add({“a”: 7, “b”: 2})
- id:
chatcmpl-xxx - model:
gpt-4.1-2025-04-14 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=52, prompt_tokens=110, total_tokens=162, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None))
{'tool_call_id': 'call_qJXSxYvc2ZVHmyIxqQ9OocWM',
'role': 'tool',
'name': 'simple_add',
'content': '8'}
{'tool_call_id': 'call_hCgeAPtd0RhmeADBRWRvY0sG',
'role': 'tool',
'name': 'simple_add',
'content': '9'}
🔧 multiply({“a”:8,“b”:9})
- id:
chatcmpl-xxx - model:
gpt-4.1-2025-04-14 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=17, prompt_tokens=178, total_tokens=195, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None))
{'tool_call_id': 'call_1nwxhn7RXLNl9FcsS8pfn6OZ',
'role': 'tool',
'name': 'multiply',
'content': '72'}
(5 + 3) = 8 and (7 + 2) = 9. Multiplying them gives: 8 × 9 = 72.
So, (5 + 3) × (7 + 2) = 72.
- id:
chatcmpl-xxx - model:
gpt-4.1-2025-04-14 - finish_reason:
stop - usage:
Usage(completion_tokens=55, prompt_tokens=203, total_tokens=258, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None))
See it did the additions in one go!
We don’t want the model to keep running tools indefinitely. Let’s showcase how we can force the model to stop after our specified number of tool-call rounds:
def divide(a: int, b: int) -> float:
    "Divide two numbers"
    return a / b
chat = Chat(model, tools=[simple_add, multiply, divide])
res = chat("Calculate ((10 + 5) * 3) / (2 + 1) step by step.",
max_steps=3, return_all=True,
final_prompt="Please wrap-up for now and summarize how far we got.")
for r in res: display(r)I’ll calculate this step by step, following the order of operations.
Step 1: Calculate the inner parentheses first
- (10 + 5) = ?
- (2 + 1) = ?
🔧 simple_add({“a”: 10, “b”: 5})
🔧 simple_add({“a”: 2, “b”: 1})
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=173, prompt_tokens=792, total_tokens=965, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
{'tool_call_id': 'toolu_01NZjJc2q4tMJZcS93T1WQHM',
'role': 'tool',
'name': 'simple_add',
'content': '15'}
{'tool_call_id': 'toolu_013qQVARNY8a6shg4zo2TpNr',
'role': 'tool',
'name': 'simple_add',
'content': '3'}
Step 2: Multiply 15 * 3
🔧 multiply({“a”: 15, “b”: 3})
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=82, prompt_tokens=1030, total_tokens=1112, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
{'tool_call_id': 'toolu_01Uf17eEfZPHcqFo1C3PYZ5E',
'role': 'tool',
'name': 'multiply',
'content': '45'}
Step 3: Divide 45 / 3
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=23, prompt_tokens=1139, total_tokens=1162, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
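Note that divide was never invoked: the run stopped once max_steps=3 rounds were used up. A quick check (a sketch, assuming the tool-result dicts shown above are stored verbatim in chat.hist):

```python
# Sketch: the tool results that actually ran before the step budget was hit.
tool_results = [m for m in chat.hist if isinstance(m, dict) and m.get('role') == 'tool']
print([t['name'] for t in tool_results])
# expected: ['simple_add', 'simple_add', 'multiply'] -- no divide call
```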
Tool call exhaustion
pr = "What is 1+2, and then the result of adding +2, and then +3 to it? Use tools to calculate!"
c = Chat(model, tools=[simple_add])
res = c(pr, max_steps=2)
resLet me continue with the next calculation. Now I’ll add 2 to the result (3+2):
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=33, prompt_tokens=777, total_tokens=810, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
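When the step budget runs out mid-task, the wrap-up prompt is injected into the history, which the assert below relies on. We can also just peek at it (a quick sketch; the exact _final_prompt wording comes from the library):

```python
# Sketch: the injected wrap-up prompt sits just before the final assistant reply.
print(c.hist[-2]['content'])
```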
assert c.hist[-2]['content'] == _final_prompt
Async
AsyncChat
If you want to use LiteLLM in a web app, you probably want to use its async function acompletion. To make that easier we implement AsyncChat to complement it, following the Chat implementation as closely as possible.
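For reference, here is what a raw async call looks like without any wrapper (a minimal sketch; litellm.acompletion mirrors completion but is awaitable, and mk_msg builds the message dict as before):

```python
import litellm

# Sketch: a raw async completion, using top-level await as elsewhere in this notebook.
res = await litellm.acompletion(model=model, messages=[mk_msg("Hey there!")])
print(res.choices[0].message.content)
```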
astream_with_complete
astream_with_complete (agen, postproc=<function noop>)
AsyncChat
AsyncChat (model:str, sp='', temp=0, search=False, tools:list=None, hist:list=None, ns:Optional[dict]=None, cache=False, cache_idxs:list=[-1], ttl=None)
LiteLLM chat client.
| Type | Default | Details | |
|---|---|---|---|
| model | str | LiteLLM compatible model name | |
| sp | str | System prompt | |
| temp | int | 0 | Temperature |
| search | bool | False | Search (l,m,h), if model supports it |
| tools | list | None | Add tools |
| hist | list | None | Chat history |
| ns | Optional | None | Custom namespace for tool calling |
| cache | bool | False | Anthropic prompt caching |
| cache_idxs | list | [-1] | Anthropic cache breakpoint idxs, use 0 for sys prompt if provided |
| ttl | NoneType | None | Anthropic prompt caching ttl |
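For instance, a system prompt and a pre-seeded history can be supplied at construction time (a minimal sketch; the list-of-turns history format is an assumption borrowed from the caching example further down):

```python
# Sketch: an async chat with a system prompt and an existing user/assistant exchange.
chat = AsyncChat(model, sp="You are a terse assistant.", hist=[['hi'], 'Hello!'])
```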
Examples
Basic example
chat = AsyncChat(model)
await chat("What is 2+2?")2+2 = 4
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=11, prompt_tokens=14, total_tokens=25, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
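Because the calls are awaitable, several independent chats can run concurrently, which is the main reason to reach for AsyncChat in a web app. A sketch using asyncio.gather and top-level await:

```python
import asyncio

# Sketch: three independent questions answered concurrently; each AsyncChat keeps its own history.
chats = [AsyncChat(model) for _ in range(3)]
answers = await asyncio.gather(*[c(f"What is {i}+{i}?") for i, c in enumerate(chats)])
[a.choices[0].message.content for a in answers]
```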
With tool calls
async def async_add(a: int, b: int) -> int:
"Add two numbers asynchronously"
await asyncio.sleep(0.1)
return a + b
chat_with_tools = AsyncChat(model, tools=[async_add])
res = await chat_with_tools("What is 5 + 7? Use the tool to calculate it.", return_all=True)
async for r in res: display(r)🔧 async_add({“a”: 5, “b”: 7})
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
tool_calls - usage:
Usage(completion_tokens=70, prompt_tokens=607, total_tokens=677, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
{'tool_call_id': 'toolu_01NHDNkcpwxW66XRuRFChLxe',
'role': 'tool',
'name': 'async_add',
'content': '12'}
The result of 5 + 7 is 12.
- id:
chatcmpl-xxx - model:
claude-sonnet-4-5-20250929 - finish_reason:
stop - usage:
Usage(completion_tokens=18, prompt_tokens=731, total_tokens=749, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)
chat.hist[{'role': 'user', 'content': 'What is 2+2?'},
Message(content='2+2 = 4', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})]
Async Streaming Display
This is what our outputs look like with streaming results:
chat_with_tools = AsyncChat(model, tools=[async_add])
res = await chat_with_tools("What is 5 + 7? Use the tool to calculate it.", stream=True)
async for o in res:
if isinstance(o,ModelResponseStream): print(delta_text(o) or '',end='')
elif isinstance(o,dict): print(o)
🔧 async_add
{'tool_call_id': 'toolu_011RxwEK3HSc3VQwwsBZnXnV', 'role': 'tool', 'name': 'async_add', 'content': '12'}
The result of 5 + 7 is **12**.
We use this streaming interface quite a bit, so we provide some utilities to better format these outputs:
Here’s a complete ModelResponse taken from the response stream:
resp = ModelResponse(id='chatcmpl-xxx', created=1000000000, model='claude-sonnet-4-5', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content="I'll calculate ((10 + 5) * 3) / (2 + 1) step by step:", role='assistant', tool_calls=[ChatCompletionMessageToolCall(function=Function(arguments='{"a": 10, "b": 5}', name='simple_add'), id='toolu_018BGyenjiRkDQFU1jWP6qRo', type='function'), ChatCompletionMessageToolCall(function=Function(arguments='{"a": 2, "b": 1}', name='simple_add'), id='toolu_01CWqrNQvoRjf1Q1GLpTUgQR', type='function')], function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=228, prompt_tokens=794, total_tokens=1022, prompt_tokens_details=None))
print(repr(resp))ModelResponse(id='chatcmpl-xxx', created=1000000000, model='claude-sonnet-4-5', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='tool_calls', index=0, message=Message(content="I'll calculate ((10 + 5) * 3) / (2 + 1) step by step:", role='assistant', tool_calls=[ChatCompletionMessageToolCall(function=Function(arguments='{"a": 10, "b": 5}', name='simple_add'), id='toolu_018BGyenjiRkDQFU1jWP6qRo', type='function'), ChatCompletionMessageToolCall(function=Function(arguments='{"a": 2, "b": 1}', name='simple_add'), id='toolu_01CWqrNQvoRjf1Q1GLpTUgQR', type='function')], function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=228, prompt_tokens=794, total_tokens=1022, completion_tokens_details=None, prompt_tokens_details=None))
tc=resp.choices[0].message.tool_calls[0]
tcChatCompletionMessageToolCall(function=Function(arguments='{"a": 10, "b": 5}', name='simple_add'), id='toolu_018BGyenjiRkDQFU1jWP6qRo', type='function')
tr={'tool_call_id': 'toolu_018BGyenjiRkDQFU1jWP6qRo', 'role': 'tool','name': 'simple_add',
'content': '15 is the answerrrr' +'r'*2000}
mk_tr_details
mk_tr_details (tr, tc, mx=2000)
Create <details> block for tool call as JSON
mk_tr_details(tr,tc)'\n\n<details class=\'tool-usage-details\'>\n\n```json\n{\n "id": "toolu_018BGyenjiRkDQFU1jWP6qRo",\n "call": {\n "function": "simple_add",\n "arguments": {\n "a": "10",\n "b": "5"\n }\n },\n "result": "15 is the answerrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr<TRUNCATED>"\n}\n```\n\n</details>\n\n'
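The mx argument appears to cap how much of the tool result gets embedded (note the <TRUNCATED> marker above). A sketch reusing tr and tc, under that assumption:

```python
# Sketch (assumption): cap the embedded tool result at 100 characters.
short = mk_tr_details(tr, tc, mx=100)
print(len(short))
```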
AsyncStreamFormatter
AsyncStreamFormatter (include_usage=False, mx=2000)
Initialize self. See help(type(self)) for accurate signature.
stream_msg = ModelResponseStream([StreamingChoices(delta=Delta(content="Hello world!"))])
print(repr(AsyncStreamFormatter().format_item(stream_msg)))'Hello world!'
reasoning_msg = ModelResponseStream([StreamingChoices(delta=Delta(reasoning_content="thinking..."))])
print(repr(AsyncStreamFormatter().format_item(reasoning_msg)))'🧠'
mock_tool_call = ChatCompletionMessageToolCall(
id="toolu_123abc456def", type="function",
function=Function( name="simple_add", arguments='{"a": 5, "b": 3}' )
)
mock_response = ModelResponse()
mock_response.choices = [type('Choice', (), {
'message': type('Message', (), {
'tool_calls': [mock_tool_call]
})()
})()]
mock_tool_result = {
'tool_call_id': 'toolu_123abc456def', 'role': 'tool',
'name': 'simple_add', 'content': '8'
}
fmt = AsyncStreamFormatter()
fmt.format_item(mock_response)
print(fmt.format_item(mock_tool_result))
<details class='tool-usage-details'>
```json
{
"id": "toolu_123abc456def",
"call": {
"function": "simple_add",
"arguments": {
"a": "5",
"b": "3"
}
},
"result": "8"
}
```
</details>
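The include_usage flag presumably folds the response's token usage into the formatted output as well; this is an assumption, shown here without its output:

```python
# Sketch (assumption): a formatter that also reports token usage for full responses.
fmt_u = AsyncStreamFormatter(include_usage=True, mx=2000)
fmt_u.format_item(mock_response)
```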
In Jupyter it’s nice to use this AsyncStreamFormatter in combination with the Markdown display:
adisplay_stream
adisplay_stream (rs)
Use IPython.display to markdown display the response stream.
Streaming examples
Now we can demonstrate AsyncChat with stream=True!
Tool call
chat = AsyncChat(model, tools=[async_add])
res = await chat("What is 5 + 7? Use the tool to calculate it.", stream=True)
fmt = await adisplay_stream(res){
"id": "toolu_011RxwEK3HSc3VQwwsBZnXnV",
"call": {
"function": "async_add",
"arguments": {
"a": "5",
"b": "7"
}
},
"result": "12"
}The result of 5 + 7 is 12.
Thinking tool call
chat = AsyncChat(model)
res = await chat("Briefly, what's the most efficient way to sort a list of 1000 random integers?",
think='l',stream=True)
_ = await adisplay_stream(res)🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠
Use your language’s built-in sort
For 1000 random integers, use your language’s built-in sort function (e.g., Python’s sorted(), Java’s Arrays.sort(), C++’s std::sort()).
These implementations use highly optimized algorithms like:
- Timsort (Python/Java)
- Introsort (C++)
- Dual-pivot Quicksort (Java primitives)
All are O(n log n) and will outperform hand-coded solutions for this dataset size.
If implementing yourself: Use Quicksort or Mergesort — both O(n log n) average case and efficient for this size.
Multiple tool calls
chat.hist[1]
chat.hist[2]
chat.hist[3]
chat.hist[4]
chat.hist[5]
Now to demonstrate that we can load the formatted output back into a new Chat object:
chat5 = Chat(model,hist=fmt2hist(fmt.outp),tools=[simple_add, multiply, divide])
chat5('what did we just do?')
Search
chat_stream_tools = AsyncChat(model, search='l')
res = await chat_stream_tools("Search the web and tell me very briefly about otters", stream=True)
_=await adisplay_stream(res)
Caching
a,b = random.randint(0,100), random.randint(0,100)
hist = [[f"What is {a}+{b}?\n" * 200], f"It's {a+b}", ['hi'], "Hello"]chat = AsyncChat(model, cache=True, hist=hist)
rs = await chat('hi again', stream=True, stream_options={"include_usage": True})async for o in rs:
if isinstance(o, ModelResponse): print(o.usage)In this first api call we will see cache creation until the last user msg:
cache_read_toks = o.usage.cache_creation_input_tokens
test_eq(cache_read_toks > 1000, True)
test_eq(o.usage.cache_read_input_tokens, 0)hist.extend([['hi again'], 'how may i help you?'])
chat = AsyncChat(model, cache=True, hist=hist)
rs = await chat('bye!', stream=True, stream_options={"include_usage": True})async for o in rs:
if isinstance(o, ModelResponse): print(o.usage)The subsequent call should re-use the existing cache:
test_eq(o.usage.cache_read_input_tokens, cache_read_toks)import nbdev; nbdev.nbdev_export()