After I got Gemma 4 E2B running on my Raspberry Pi 4B, the next thing I wanted was to give it tools — the ability to check system stats, run commands, tell the time. Not just answer questions from its training data, but actually do things.
I got it working with a Telegram bot, and along the way I captured every layer of the tool calling process — from the high-level API requests down to the raw tokens the model generates. Most explanations of tool calling stop at the API level. This post goes deeper.
This setup is not fast in terms of token generation, but it turned out to be one of the best ways to learn how an AI system works. Here’s what actually happens when a 2-billion parameter model on a $60 computer decides it needs to call a function.
Table of Contents
- The Setup
- How the Bot Sends Tools to the LLM
- What the LLM Actually Returns
- The Full Round Trip — Real Logs
- Going Deeper: What llama.cpp Does Under the Hood
- The Raw Tokens — What the Model Actually Generates
- The Complete Conversion Chain
The Setup
I’m running Gemma 4 E2B (Q4_K_M quantization, ~2.9 GB) on a Raspberry Pi 4B with 8 GB RAM. The inference server is llama.cpp’s llama-server, exposing an OpenAI-compatible API on port 8080. A Python Telegram bot sends user messages to this local API and returns the responses.
The bot has three tools defined:
- get_current_datetime — returns the Pi’s current date, time, and timezone
- get_system_stats — returns CPU temp, RAM, disk, uptime
- run_command — executes a whitelisted shell command on the Pi
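A minimal sketch of what a whitelisted run_command could look like. This is not the bot's actual code: the whitelist contents, the return shape, and the timeout are all assumptions made for illustration.

```python
import shlex
import subprocess

# Hypothetical whitelist. The real bot's list is not shown in this post.
ALLOWED_COMMANDS = {"uptime", "df", "free", "vcgencmd"}


def run_command(command, allowed=ALLOWED_COMMANDS, timeout=10.0):
    """Run a command only if its program name is on the whitelist."""
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # e.g. unbalanced quotes
        return {"error": f"could not parse command: {exc}"}
    if not argv:
        return {"error": "empty command"}
    if argv[0] not in allowed:
        return {"error": f"command not allowed: {argv[0]}"}
    try:
        # No shell=True: the arguments are passed as a list, so ';', '|'
        # and '&&' are never interpreted by a shell.
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return {"error": str(exc)}
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
    }
```

Checking only the program name is the simplest possible policy. A stricter version would probably also restrict the arguments each command may receive.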
The architecture is simple:
┌──────────────────────────┐
│ 📨 Telegram message      │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ 🐍 Python bot            │
│ python-telegram-bot      │
└────────────┬─────────────┘
             ▼  POST /v1/chat/completions + tool schemas
┌──────────────────────────┐
│ ⚙️ llama-server :8080    │
│ Gemma 4 E2B, Q4_K_M      │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ 📤 Response              │
│ text reply OR tool call  │
└──────────────────────────┘
How the Bot Sends Tools to the LLM
Every request to the LLM includes the tool definitions alongside the conversation messages. The bot doesn’t decide when to use tools — the model does. The bot just tells the model what tools exist.
Here’s what the bot sends when a user asks “How is the Pi doing? What time is it?”:
POST /v1/chat/completions
{
  "model": "gemma-4-E2B",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "How is the Pi doing? What time is it?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_datetime",
        "description": "Get current date/time/timezone.",
        "parameters": {"type": "object", "properties": {}, "required": []}
      }
    },
    {
      "type": "function",
      "function": {
        "name": "get_system_stats",
        "description": "Get Pi stats: CPU temp, RAM, disk, uptime.",
        "parameters": {"type": "object", "properties": {}, "required": []}
      }
    }
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
The key thing: those tool definitions are sent with every request. The model sees them every time and decides on its own whether any are needed.
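Here is a rough sketch of how a bot could assemble and send that request using only the standard library. The helper names build_payload and chat are made up for this example, and the choices[0] unwrapping assumes the usual OpenAI-style response envelope around the message.

```python
import json
import urllib.request

API_URL = "http://localhost:8080/v1/chat/completions"


def _tool(name, description):
    """Schema for a tool that takes no arguments."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    }


TOOLS = [
    _tool("get_current_datetime", "Get current date/time/timezone."),
    _tool("get_system_stats", "Get Pi stats: CPU temp, RAM, disk, uptime."),
]


def build_payload(messages, tools=TOOLS):
    """The same tool list is attached to every single request."""
    return {
        "model": "gemma-4-E2B",
        "messages": messages,
        "tools": tools,
        "temperature": 0.7,
        "max_tokens": 1024,
    }


def chat(messages):
    """POST the payload and return the assistant message (needs a running server)."""
    body = json.dumps(build_payload(messages)).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)["choices"][0]["message"]
```

Only build_payload is pure; chat needs llama-server listening on port 8080, and the generous timeout is a guess based on how slow generation is on the Pi.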
What the LLM Actually Returns
The LLM’s response always comes back as JSON from the API. But the shape of that JSON differs based on what the model decided to do.
When the model wants to call a tool, the response looks like this (captured from a real request):
{
  "finish_reason": "tool_calls",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "type": "function",
        "function": {
          "name": "get_system_stats",
          "arguments": "{}"
        },
        "id": "AfUuK1ViPpmOyHXsvDOUxnjkoYbeZCgr"
      }
    ]
  }
}
Notice: content is empty. The model isn’t talking to the user — it’s requesting an action. The finish_reason is "tool_calls" instead of the usual "stop".
When the model replies with text (no tools needed), it looks like this:
{
  "finish_reason": "stop",
  "message": {
    "role": "assistant",
    "content": "The Pi is doing great! CPU is at 42.8°C..."
  }
}
The bot’s detection logic is dead simple — one field check:
tool_calls = msg.get("tool_calls")
if not tool_calls:
    ...  # text reply → show to user
else:
    ...  # execute tools → send results back to LLM
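To make that branch concrete, here is a sketch of a dispatcher built around the same field check. The function name handle_message and the registry dictionary are hypothetical; the "tool" message shape follows the OpenAI chat format.

```python
import json


def handle_message(msg, registry):
    """Return ("text", reply) or ("tools", [tool result messages])."""
    tool_calls = msg.get("tool_calls")
    if not tool_calls:
        return "text", msg.get("content", "")

    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        # Arguments arrive as a JSON string, even when there are none.
        args = json.loads(call["function"].get("arguments") or "{}")
        func = registry.get(name)
        output = func(**args) if func else {"error": f"unknown tool: {name}"}
        results.append(
            {
                "role": "tool",
                "tool_call_id": call.get("id", ""),
                "content": json.dumps(output),
            }
        )
    return "tools", results
```

The caller appends the assistant message and the returned tool messages to the conversation, then sends the whole thing back to the LLM.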
The Full Round Trip — Real Logs
Here’s a real tool calling session captured from the bot’s logs. The user asked “How is the Pi doing? What time is it?” — a question that requires two different tools.
Round 1: The bot sends the user message. The LLM comes back requesting get_system_stats.
[Round 1] Sending 2 messages to LLM
  [system] You are a helpful assistant running on a Raspberry Pi 4B...
  [user] How is the pi doing
HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
[Round 1] LLM responded: finish_reason=tool_calls
[Round 1] LLM wants 1 tool call(s):
  → Calling get_system_stats({})
  ← Result: {"cpu_temp": "47.7'C", "cpu_usage_percent": "0.0", "ram_used": "582Mi/7.7Gi", "disk_used": "13G/229G (6%)", "uptime": "up 1 hour, 54 minutes"}
The bot executed get_system_stats locally on the Pi — it ran vcgencmd measure_temp, free -h, df -h, and uptime -p, packaged the results as JSON, and appended them to the conversation.
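For illustration, a get_system_stats along those lines might look like the sketch below. The parsing assumes the usual output layouts of these commands (for example temp=47.7'C from vcgencmd measure_temp), the injectable run parameter is my own addition for testing, and the CPU usage field is left out.

```python
import subprocess


def _run(argv):
    """Run a command and return its trimmed standard output."""
    return subprocess.run(argv, capture_output=True, text=True, timeout=10).stdout.strip()


def get_system_stats(run=_run):
    temp = run(["vcgencmd", "measure_temp"])            # "temp=47.7'C"
    mem = run(["free", "-h"]).splitlines()[1].split()   # Mem: total used ...
    disk = run(["df", "-h", "/"]).splitlines()[1].split()  # fs size used avail use% mount
    return {
        "cpu_temp": temp.partition("=")[2],
        "ram_used": f"{mem[2]}/{mem[1]}",
        "disk_used": f"{disk[2]}/{disk[1]} ({disk[4]})",
        "uptime": run(["uptime", "-p"]),                 # "up 1 hour, 54 minutes"
    }
```

Passing a fake run function makes it possible to try the parsing on a machine that is not a Pi.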
Round 2: The bot sends the tool result back. Now the LLM has the system stats but still needs the time, so it requests get_current_datetime.
[Round 2] Sending 4 messages to LLM
  [system] You are a helpful assistant...
  [user] How is the pi doing
  [assistant] (tool_calls: get_system_stats)
  [tool] {"cpu_temp": "47.7'C", ...}
[Round 2] LLM responded: finish_reason=tool_calls
  → Calling get_current_datetime({})
  ← Result: {"datetime": "2026-04-12 15:02:13", "timezone": "CEST", "weekday": "Sunday"}
Round 3: Both tool results are now in the conversation. The LLM has everything it needs and generates the final text response.
[Round 3] Sending 6 messages to LLM
  [system] You are a helpful assistant...
  [user] How is the Pi doing? What time is it?
  [assistant] (tool_calls: get_system_stats)
  [tool] {"cpu_temp": "43.3'C", ...}
  [assistant] (tool_calls: get_current_datetime)
  [tool] {"datetime": "2026-04-12 15:05:17", ...}
[Round 3] LLM responded: finish_reason=stop
Final answer: "The Pi's stats are: CPU temperature is 43.3°C,
RAM usage is 576Mi/7.7Gi, disk space is 13G/229G, and it has been
up for 1 hour and 48 minutes. The current time is Sunday,
April 12, 2026, at 15:05:17 CEST."
Three LLM round trips for one user message. The E2B model (2B parameters) calls tools sequentially — one per round — rather than batching both into a single response. Larger models would likely batch them. On the Pi 4B, each round takes about 20-30 seconds, so this whole exchange took roughly 76 seconds.
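The loop behind those three rounds can be sketched in a few lines. To keep the example runnable without a Pi, the LLM here is a scripted stand-in (scripted_llm) that requests one tool per round, the way E2B did in the logs; run_conversation, the registry, and the max_rounds limit are assumptions, not the bot's real code.

```python
import json


def run_conversation(messages, llm, registry, max_rounds=5):
    """Call the LLM until it answers with text; return (answer, rounds used)."""
    for round_no in range(1, max_rounds + 1):
        msg = llm(messages)
        calls = msg.get("tool_calls")
        if not calls:
            return msg.get("content", ""), round_no
        messages.append(msg)  # keep the assistant's tool request in the history
        for call in calls:
            fn = call["function"]
            result = registry[fn["name"]](**json.loads(fn["arguments"] or "{}"))
            messages.append(
                {"role": "tool", "tool_call_id": call.get("id", ""), "content": json.dumps(result)}
            )
    raise RuntimeError("model kept requesting tools; giving up")


def scripted_llm(messages):
    """Stand-in for the real model: asks for one tool per round, then answers."""
    order = ["get_system_stats", "get_current_datetime"]
    done = sum(1 for m in messages if m["role"] == "tool")
    if done < len(order):
        call = {"type": "function", "id": f"call_{done}",
                "function": {"name": order[done], "arguments": "{}"}}
        return {"role": "assistant", "content": "", "tool_calls": [call]}
    return {"role": "assistant", "content": "All done."}


registry = {
    "get_system_stats": lambda: {"cpu_temp": "43.3'C"},
    "get_current_datetime": lambda: {"datetime": "2026-04-12 15:05:17"},
}
```

Starting from a system and a user message, this stub reproduces the 2, 4, 6 message counts seen in the logs. The max_rounds cap is there so a confused model cannot loop forever.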
Going Deeper: What llama.cpp Does Under the Hood
Everything above is the API level. But the model doesn’t actually think in JSON. There’s a translation layer inside llama-server that converts between the structured API and the raw tokens the model understands.
I ran llama-server with --verbose to capture this layer. But before we look at what it does, let’s cover a few basics about how LLMs actually work. These concepts sound technical, but they’re simpler than you think.
First, Some Basics: How an LLM Generates Text
Every model ships with a fixed dictionary — baked in at training time, never changes at runtime. It maps words (and word-fragments) to numbers: “hello” might be 17534, “the” might be 623, “:” might be 236787. Text goes in → split into pieces → each piece looked up in the dictionary → out comes a sequence of numbers. That’s tokenization. The model never sees your actual text — only the numbers.
Tokens are just numbers. An LLM doesn’t read text the way we do. It works entirely with numbers. Even special markers like <|tool_call> are just numbers: in Gemma 4’s case, that’s token 48. Everything the model reads and writes is a sequence of these numbers.
The tokenizer is just a dictionary. Gemma 4’s has ~250,000 entries, one for each token the model was trained on. When you type a message, the tokenizer splits your text into pieces and looks up the number for each piece. When the model generates output, the tokenizer does the reverse: it takes the numbers and looks up the text. That’s it. No magic, just a dictionary.
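A toy version of that dictionary makes the idea tangible. The nine entries below use the token ids that show up in the verbose log later in this post; the greedy longest-match splitting is a simplification I chose for the sketch, not how Gemma's real tokenizer decides where to cut.

```python
# Tiny slice of the vocabulary: text piece -> token id.
VOCAB = {
    "<|tool_call>": 48,
    "<tool_call|>": 49,
    "call": 6639,
    ":": 236787,
    "get": 828,
    "_": 236779,
    "current": 4002,
    "datetime": 19361,
    "{}": 16454,
}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(text):
    """Split text into known pieces (longest match first) and look up their ids."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    ids, pos = [], 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {pos}")
    return ids


def decode(ids):
    """The reverse lookup: ids back to text."""
    return "".join(ID_TO_TEXT[token_id] for token_id in ids)
```

Calling encode on a tool call string and then decode on the result gives back the original string, which is the round trip described above.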
The model predicts one number at a time. This is the core of how every LLM works. Given all the numbers so far, the model predicts the next number. Then that number gets added to the sequence, and the model predicts the next one. And the next one. It’s like autocomplete on your phone — except instead of predicting the next word, it’s predicting the next token (number). One at a time, in a loop.
Sampling is how the next token gets picked. The model doesn’t output a single “answer” for the next token. It outputs a probability for every possible token — like saying “there’s a 40% chance the next token is ‘the’, a 15% chance it’s ‘a’, a 3% chance it’s ‘get’…” and so on for all 250,000+ tokens in its vocabulary. Something needs to actually pick one. That’s what sampling does. Parameters like temperature control how adventurous the pick is — low temperature means “almost always pick the highest probability”, high temperature means “be more random”. This is handled by llama.cpp, not the model itself.
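Temperature sampling fits in a few lines. This is the textbook softmax-with-temperature formulation, not llama.cpp's actual sampler chain, and the three candidate tokens and their scores are invented for the example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def sample(logits, temperature, rng):
    """Pick one token. Temperature 0 is treated as 'always take the top one'."""
    if temperature <= 0:
        return max(logits, key=logits.get)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Made-up scores for three candidate next tokens.
logits = {"the": 2.0, "a": 1.0, "get": -0.5}
```

With these scores, "the" gets roughly 69% of the probability at temperature 1.0 and nearly all of it at 0.1, which is the "less adventurous" behaviour described above.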
A PEG grammar is just a set of rules that acts as a referee. Normally, the model is free to generate any token it wants. But when we need structured output — like a valid tool call — we need guardrails. A PEG grammar is a set of rules that says “at this point in the output, only these tokens are allowed.” Think of it as a referee watching each token as it’s generated: “yes, that’s valid” or “no, pick again.” llama.cpp builds this grammar automatically from the tool schemas you provide. The model still predicts tokens normally, but the grammar filters out anything that wouldn’t form a valid tool call.
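The referee idea can be sketched as a filter that runs before the pick. This toy version stores the one valid tool call as a list of pieces and only lets through the piece that continues it. The real grammar engine in llama.cpp is far more general, and the candidate scores below are invented.

```python
# The only sequence this toy "grammar" accepts, split into token pieces.
VALID_CALLS = [
    ["<|tool_call>", "call", ":", "get", "_", "current", "_", "datetime", "{}", "<tool_call|>"],
]


def allowed_next(prefix):
    """Which pieces may legally follow what has been generated so far?"""
    options = set()
    for seq in VALID_CALLS:
        if seq[: len(prefix)] == prefix and len(seq) > len(prefix):
            options.add(seq[len(prefix)])
    return options


def pick_with_grammar(scores, allowed):
    """Drop every candidate the grammar forbids, then take the best survivor."""
    legal = {token: s for token, s in scores.items() if token in allowed}
    if not legal:
        raise ValueError("grammar allows none of the candidate tokens")
    return max(legal, key=legal.get)
```

If the model's favourite after "call:" were a made-up name like "delete", the filter would discard it and the lower-scored "get" would win.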
With that context, here’s what llama-server actually does with your API request.
Step 1: llama-server converts the tool schemas into a grammar.
When the bot (a Python script) sends tool definitions as JSON, llama-server doesn’t pass them directly to the model. It builds a PEG grammar (the rules we just talked about) that constrains the model’s output:
Grammar (tool_calls):
root ::= tool-call
tool-call ::= ("<|tool_call>call:" (tool-get-current-datetime) "<tool_call|>")?
tool-get-current-datetime ::= ("get_current_datetime" ) gemma4-dict
This grammar says: if the model decides to call a tool, it must output the exact pattern <|tool_call>call:FUNCTION_NAME{ARGS}<tool_call|>. No hallucinated function names. No malformed arguments. The referee won’t allow it.
Step 2: llama-server converts the conversation into the model’s prompt format.
Our JSON messages get transformed into Gemma 4’s native prompt format using a Jinja chat template. This is what the model actually sees as its input (captured from the verbose log):
<bos><|turn>system
You are a helpful assistant.<|tool>declaration:get_current_datetime{description:<|"|>Get current date and time.<|"|>,parameters:{type:<|"|>OBJECT<|"|>}}<tool|><turn|>
<|turn>user
What time is it?<turn|>
<|turn>model
Look at that. The tool definition got converted from JSON into Gemma 4’s own markup format: <|tool>declaration:...<tool|>. The model was trained to understand this syntax during fine-tuning. It knows that when it sees a <|tool>declaration:... block, that’s a tool it can call.
Step 3: llama-server sets up a grammar trigger.
Grammar lazy: true
Grammar trigger token: 48 (`<|tool_call>`)
The grammar is lazy — it only activates when the model emits token 48 (<|tool_call>). If the model starts generating normal text instead, the grammar stays dormant and the model responds freely. The grammar is a safety net, not a cage.
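As a sketch of the lazy behaviour, here is a tiny state machine that accepts anything until it sees the trigger token and only then starts enforcing a fixed sequence. The class LazyGrammar is hypothetical, the ids are the ones from the verbose log, and end-of-sequence handling is deliberately left out.

```python
TRIGGER_ID = 48  # <|tool_call>
# Ids that must follow the trigger: call : get _ current _ datetime {} <tool_call|>
BODY_IDS = [6639, 236787, 828, 236779, 4002, 236779, 19361, 16454, 49]


class LazyGrammar:
    def __init__(self, trigger_id, body_ids):
        self.trigger_id = trigger_id
        self.body_ids = list(body_ids)
        self.cursor = None  # None means the grammar is still dormant

    @property
    def active(self):
        return self.cursor is not None

    def accepts(self, token_id):
        if not self.active:
            return True  # dormant: the model may say anything
        # Active: only the next expected id passes. Once the body is
        # complete nothing more passes (EOS is not modelled here).
        return self.cursor < len(self.body_ids) and token_id == self.body_ids[self.cursor]

    def feed(self, token_id):
        if not self.accepts(token_id):
            raise ValueError(f"token {token_id} rejected by grammar")
        if self.active:
            self.cursor += 1
        elif token_id == self.trigger_id:
            self.cursor = 0  # wake up on the trigger token
```

A plain-text reply never contains token 48, so it passes through this object untouched, which is the "safety net, not a cage" behaviour.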
The Raw Tokens — What the Model Actually Generates
This is the interesting part. Here’s the actual token-by-token generation from the verbose log when the model decides to call get_current_datetime:
n_decoded = 1  → token 48     '<|tool_call>'   ← grammar activates here
n_decoded = 2  → token 6639   'call'
n_decoded = 3  → token 236787 ':'
n_decoded = 4  → token 828    'get'
n_decoded = 5  → token 236779 '_'
n_decoded = 6  → token 4002   'current'
n_decoded = 7  → token 236779 '_'
n_decoded = 8  → token 19361  'datetime'
n_decoded = 9  → token 16454  '{}'
n_decoded = 10 → token 49     '<tool_call|>'
n_decoded = 11 → token 50     ''               ← stopped by EOS
11 tokens. That’s it. The model generated the <|tool_call> special token (remember — just the number 48), then the function name piece by piece, then the closing token. Each of those “words” in the log is what the tokenizer dictionary looked up for the number the model predicted. Total generation time: 4.1 seconds at 2.65 tokens per second on the Pi 4B.
The model doesn’t know about JSON. It doesn’t know about OpenAI’s API format. It was trained to emit these specific token patterns when it wants to invoke a function. That’s the fundamental mechanism.
Then llama-server parses those tokens back into JSON:
Parsing chat message: <|tool_call>call:get_current_datetime{}<tool_call|>
Parsed message: {"role":"assistant","content":"","tool_calls":[{"type":"function","function":{"name":"get_current_datetime","arguments":"{}"}}]}
Raw tokens → special syntax (matched) → structured JSON. That’s the conversion.
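A rough imitation of that last step, using a regular expression instead of llama-server's real parser. It only handles calls with an empty argument dict, because Gemma 4's argument syntax is not plain JSON and converting it is beyond this sketch; the function name parse_tool_calls is made up.

```python
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:([A-Za-z_][A-Za-z0-9_]*)(\{.*?\})<tool_call\|>", re.DOTALL
)


def parse_tool_calls(raw):
    """Turn the model's raw output into an OpenAI-style assistant message."""
    calls = []
    for name, args in TOOL_CALL_RE.findall(raw):
        if args != "{}":
            # Non-empty arguments use Gemma's own dict syntax; not handled here.
            raise NotImplementedError("only empty argument dicts are supported")
        calls.append({"type": "function", "function": {"name": name, "arguments": "{}"}})

    # Whatever is left after removing the tool call markup is normal text.
    message = {"role": "assistant", "content": TOOL_CALL_RE.sub("", raw).strip()}
    if calls:
        message["tool_calls"] = calls
    return message
```

Fed the raw string from the log, this returns the same structure as the "Parsed message" line, minus the id that the server adds.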
The Complete Conversion Chain
Here’s the full picture — every layer, from bot to model and back:
BOT → llama-server (inbound)
─────────────────────────────
JSON tool schemas   → PEG grammar constraining output
JSON messages       → Gemma 4 prompt format with <|tool> tags
OpenAI API request  → Tokenized prompt + grammar trigger

MODEL (generation)
─────────────────────────────
Model emits token 48       → <|tool_call> (grammar activates)
Model emits function name  → get_current_datetime
Model emits arguments      → {}
Model emits token 49       → <tool_call|> (grammar closes)
Model emits token 50       → EOS (generation stops)

llama-server → BOT (outbound)
─────────────────────────────
Raw: <|tool_call>call:get_current_datetime{}<tool_call|>
  → Parsed into structured JSON
  → Wrapped in OpenAI-compatible response
  → finish_reason: "tool_calls"
The model doesn’t speak JSON. It speaks tokens. Everything else is translation.
Summary
| Layer | What Happens | Format |
|---|---|---|
| Bot | Sends tool schemas + messages | OpenAI JSON |
| llama-server (in) | Converts to prompt + grammar | Gemma 4 markup + PEG |
| Model | Predicts tokens one at a time, grammar filters invalid ones | <\|tool_call>...<tool_call\|> |
| llama-server (out) | Parses tokens back to JSON | OpenAI JSON |
| Bot | Detects tool_calls, executes, loops | Python |
The LLM basically does lots of calculation and assigns a probability to each token in its dictionary. For a given input, that calculation is deterministic. The only non-deterministic part of the AI system (model + inference engine) is the sampling step: the inference engine decides which token to pick based on its configuration (temperature etc.).
Tool calling isn’t magic. It’s just token prediction with guardrails.
Akash Gupta
Senior VoIP Engineer and AI Enthusiast

