How LLM Tool Calling Actually Works: A peek under the hood

After I got Gemma 4 E2B running on my Raspberry Pi 4B, the next thing I wanted was to give it tools — the ability to check system stats, run commands, tell the time. Not just answer questions from its training data, but actually do things.

I got it working with a Telegram bot, and along the way I captured every layer of the tool calling process — from the high-level API requests down to the raw tokens the model generates. Most explanations of tool calling stop at the API level. This post goes deeper.

This setup is not fast in terms of token generation, but it turned out to be one of the best ways to learn how an AI system works. Here’s what actually happens when a 2-billion parameter model on a $60 computer decides it needs to call a function.

The Setup
How the Bot Sends Tools to the LLM
What the LLM Actually Returns
The Full Round Trip — Real Logs
Going Deeper: What llama.cpp Does Under the Hood
- First, Some Basics: How an LLM Generates Text
The Raw Tokens — What the Model Actually Generates
The Complete Conversion Chain

The Setup

I’m running Gemma 4 E2B (Q4_K_M quantization, ~2.9 GB) on a Raspberry Pi 4B with 8 GB RAM. The inference server is llama.cpp’s llama-server, exposing an OpenAI-compatible API on port 8080. A Python Telegram bot sends user messages to this local API and returns the responses.

The bot has three tools defined:

get_current_datetime — returns the Pi’s current date, time, and timezone
get_system_stats — returns CPU temp, RAM, disk, uptime
run_command — executes a whitelisted shell command on the Pi
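To make the list concrete, here is a minimal sketch of how two of these tools might look in Python. The function bodies, the `ALLOWED_COMMANDS` whitelist, and the shape of the `run_command` result are assumptions for illustration, not the bot's actual source.

```python
import shlex
import subprocess
from datetime import datetime

# Hypothetical whitelist; the real bot's list is not shown in this post.
ALLOWED_COMMANDS = {"uptime", "df", "free", "vcgencmd"}


def get_current_datetime() -> dict:
    """Return the local date, time, timezone, and weekday."""
    now = datetime.now().astimezone()  # attach the system's local timezone
    return {
        "datetime": now.strftime("%Y-%m-%d %H:%M:%S"),
        "timezone": now.tzname(),
        "weekday": now.strftime("%A"),
    }


def run_command(command: str) -> dict:
    """Run a command only if its executable is on the whitelist."""
    parts = shlex.split(command)
    if not parts or parts[0] not in ALLOWED_COMMANDS:
        return {"error": f"command not allowed: {command!r}"}
    # No shell=True: the argument list goes straight to the executable.
    result = subprocess.run(parts, capture_output=True, text=True, timeout=10)
    return {"stdout": result.stdout.strip(), "returncode": result.returncode}
```

Checking only the executable name is the simplest possible whitelist. A stricter version would also validate the arguments.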

The architecture is simple:

			
┌──────────────────────────┐
│  📨  Telegram message    │
└────────────┬─────────────┘
             ▼  
┌──────────────────────────┐
│  🐍  Python bot          │ 
│  python-telegram-bot     │
└────────────┬─────────────┘
             ▼  POST /v1/chat/completions + tool schemas
┌──────────────────────────┐
│  ⚙️  llama-server :8080  │
│  Gemma 4 E2B, Q4_K_M     │
└────────────┬─────────────┘
             ▼ 
┌──────────────────────────┐
│  📤  Response            │
│  text reply OR tool call │
└──────────────────────────┘

		

How the Bot Sends Tools to the LLM

Every request to the LLM includes the tool definitions alongside the conversation messages. The bot doesn’t decide when to use tools — the model does. The bot just tells the model what tools exist.

Here’s what the bot sends when a user asks “How is the Pi doing? What time is it?”:

			
POST /v1/chat/completions
{
  "model": "gemma-4-E2B",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "How is the Pi doing? What time is it?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_datetime",
        "description": "Get current date/time/timezone.",
        "parameters": {"type": "object", "properties": {}, "required": []}
      }
    },
    {
      "type": "function",
      "function": {
        "name": "get_system_stats",
        "description": "Get Pi stats: CPU temp, RAM, disk, uptime.",
        "parameters": {"type": "object", "properties": {}, "required": []}
      }
    }
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}

		

The key thing: those tool definitions are sent with every request. The model sees them every time and decides on its own whether any are needed.
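A minimal sketch of how a bot could build and send that request with only the standard library. The helper names (`make_tool`, `build_payload`, `call_llm`) are hypothetical. The URL, model name, and parameter values are taken from the request shown above.

```python
import json
import urllib.request

API_URL = "http://localhost:8080/v1/chat/completions"


def make_tool(name: str, description: str) -> dict:
    """Build an OpenAI-style schema for a tool that takes no arguments."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    }


TOOLS = [
    make_tool("get_current_datetime", "Get current date/time/timezone."),
    make_tool("get_system_stats", "Get Pi stats: CPU temp, RAM, disk, uptime."),
]


def build_payload(messages: list) -> dict:
    """Attach the same tool list to every request, whatever the question."""
    return {
        "model": "gemma-4-E2B",
        "messages": messages,
        "tools": TOOLS,
        "temperature": 0.7,
        "max_tokens": 1024,
    }


def call_llm(messages: list) -> dict:
    """POST the payload to llama-server and return the decoded JSON reply."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Note that `build_payload` never looks at the user's question. The tool list is unconditional, which is exactly why the decision to call a tool belongs to the model.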

What the LLM Actually Returns

The LLM’s response always comes back as JSON from the API. But the shape of that JSON differs based on what the model decided to do.

When the model wants to call a tool, the response looks like this (captured from a real request):

			
{
  "finish_reason": "tool_calls",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "type": "function",
        "function": {
          "name": "get_system_stats",
          "arguments": "{}"
        },
        "id": "AfUuK1ViPpmOyHXsvDOUxnjkoYbeZCgr"
      }
    ]
  }
}

		

Notice: content is empty. The model isn’t talking to the user — it’s requesting an action. The finish_reason is "tool_calls" instead of the usual "stop".

When the model replies with text (no tools needed), it looks like this:

			
{
  "finish_reason": "stop",
  "message": {
    "role": "assistant",
    "content": "The Pi is doing great! CPU is at 42.8°C..."
  }
}

		

The bot’s detection logic is dead simple — one field check:

			
tool_calls = msg.get("tool_calls")
if not tool_calls:
    pass  # text reply → show to user
else:
    pass  # execute tools → send results back to LLM

		

The Full Round Trip — Real Logs

Here’s a real tool calling session captured from the bot’s logs. The user asked “How is the Pi doing? What time is it?” — a question that requires two different tools.

Round 1: The bot sends the user message. The LLM comes back requesting get_system_stats.

			
[Round 1] Sending 2 messages to LLM
  [system] You are a helpful assistant running on a Raspberry Pi 4B...
  [user]   How is the pi doing
HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
[Round 1] LLM responded: finish_reason=tool_calls
[Round 1] LLM wants 1 tool call(s):
  → Calling get_system_stats({})
  ← Result: {"cpu_temp": "47.7'C", "cpu_usage_percent": "0.0",
             "ram_used": "582Mi/7.7Gi", "disk_used": "13G/229G (6%)",
             "uptime": "up 1 hour, 54 minutes"}

		

The bot executed get_system_stats locally on the Pi — it ran vcgencmd measure_temp, free -h, df -h, and uptime -p, packaged the results as JSON, and appended them to the conversation.
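As a rough sketch of that step, the snippet below shells out to two of those commands and packages the output as a dict. The helper names `run` and `parse_cpu_temp` are hypothetical, and the RAM and disk fields are left out to keep it short. It relies on `vcgencmd measure_temp` printing a line like `temp=47.7'C`.

```python
import subprocess


def parse_cpu_temp(raw: str) -> str:
    """Turn vcgencmd output such as "temp=47.7'C" into "47.7'C"."""
    return raw.strip().split("=", 1)[-1]


def run(args: list) -> str:
    """Run a command without a shell and return its trimmed stdout."""
    done = subprocess.run(args, capture_output=True, text=True, timeout=5)
    return done.stdout.strip()


def get_system_stats() -> dict:
    """Collect a subset of the stats shown in the log above."""
    return {
        "cpu_temp": parse_cpu_temp(run(["vcgencmd", "measure_temp"])),
        "uptime": run(["uptime", "-p"]),
    }
```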

Round 2: The bot sends the tool result back. Now the LLM has the system stats but still needs the time, so it requests get_current_datetime.

			
[Round 2] Sending 4 messages to LLM
  [system]    You are a helpful assistant...
  [user]      How is the pi doing
  [assistant] (tool_calls: get_system_stats)
  [tool]      {"cpu_temp": "47.7'C", ...}
[Round 2] LLM responded: finish_reason=tool_calls
  → Calling get_current_datetime({})
  ← Result: {"datetime": "2026-04-12 15:02:13",
             "timezone": "CEST", "weekday": "Sunday"}

		

Round 3: Both tool results are now in the conversation. The LLM has everything it needs and generates the final text response.

			
[Round 3] Sending 6 messages to LLM
  [system]    You are a helpful assistant...
  [user]      How is the Pi doing? What time is it?
  [assistant] (tool_calls: get_system_stats)
  [tool]      {"cpu_temp": "43.3'C", ...}
  [assistant] (tool_calls: get_current_datetime)
  [tool]      {"datetime": "2026-04-12 15:05:17", ...}
[Round 3] LLM responded: finish_reason=stop
Final answer: "The Pi's stats are: CPU temperature is 43.3°C,
RAM usage is 576Mi/7.7Gi, disk space is 13G/229G, and it has been
up for 1 hour and 48 minutes. The current time is Sunday,
April 12, 2026, at 15:05:17 CEST."

		

Three LLM round trips for one user message. The E2B model (2B parameters) calls tools sequentially — one per round — rather than batching both into a single response. Larger models would likely batch them. On the Pi 4B, each round takes about 20-30 seconds, so this whole exchange took roughly 76 seconds.
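The loop behind those three rounds can be sketched as follows. This is an assumption about the structure, not the bot's actual code: `run_tool_loop`, the `max_rounds` cap, and the fallback message are hypothetical, and `call_llm` is passed in so the loop can be exercised without a running server.

```python
import json


def run_tool_loop(call_llm, tools, messages, max_rounds=5):
    """Call the LLM until it answers with text instead of tool calls.

    call_llm: function taking the message list, returning the assistant message.
    tools:    dict mapping a tool name to a Python callable.
    """
    for _ in range(max_rounds):
        msg = call_llm(messages)
        tool_calls = msg.get("tool_calls")
        if not tool_calls:
            return msg.get("content", "")  # text reply: show it to the user
        messages.append(msg)  # keep the assistant's request in the history
        for call in tool_calls:
            name = call["function"]["name"]
            args = json.loads(call["function"]["arguments"] or "{}")
            result = tools[name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.get("id"),
                "content": json.dumps(result),
            })
    return "Sorry, I could not finish within the round limit."
```

Each tool round appends two messages (the assistant's request and the tool result), which is why the logs above go from 2 to 4 to 6 messages. The round cap guards against a model that never stops asking for tools.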

Going Deeper: What llama.cpp Does Under the Hood

Everything above is the API level. But the model doesn’t actually think in JSON. There’s a translation layer inside llama-server that converts between the structured API and the raw tokens the model understands.

I ran llama-server with --verbose to capture this layer. But before we look at what it does, let’s cover a few basics about how LLMs actually work. These concepts sound technical, but they’re simpler than you think.

First, Some Basics: How an LLM Generates Text

Every model ships with a fixed dictionary — baked in at training time, never changes at runtime. It maps words (and word-fragments) to numbers: “hello” might be 17534, “the” might be 623, “:” might be 236787. Text goes in → split into pieces → each piece looked up in the dictionary → out comes a sequence of numbers. That’s tokenization. The model never sees your actual text — only the numbers.

Tokens are just numbers. An LLM doesn’t read text the way we do. It works entirely with numbers. The word “hello” might be token 17534. The word “the” might be token 623. A piece of punctuation like “:” might be token 236787. Even special markers like <|tool_call> are just numbers — in Gemma 4’s case, that’s token 48. Everything the model reads and writes is a sequence of these numbers.

The tokenizer is just a dictionary. Gemma 4’s has ~250,000 entries, one for each token the model was trained on. When you type a message, the tokenizer splits your text into pieces and looks up the number for each piece. When the model generates output, the tokenizer does the reverse: it takes the numbers and looks up the text. That’s it. No magic, just a dictionary.
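A toy version of that dictionary lookup, as a sketch. The three IDs for "hello", "the" and ":" are the examples from the text, while the ID for the space is made up. Real tokenizers split text with learned rules rather than this greedy longest-match loop, so treat it as an illustration of the lookup idea only.

```python
# Toy vocabulary: three IDs are the examples from the text, the space is made up.
VOCAB = {"hello": 17534, "the": 623, ":": 236787, " ": 1}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(text: str) -> list:
    """Greedy longest-match lookup: text in, token IDs out."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):  # try the longest piece first
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token for {text[pos]!r}")
    return ids


def decode(ids: list) -> str:
    """The reverse lookup: token IDs in, text out."""
    return "".join(ID_TO_TEXT[i] for i in ids)
```

With this vocabulary, `encode("hello:the")` gives `[17534, 236787, 623]`, and `decode` turns that list back into the original string.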

The model predicts one number at a time. This is the core of how every LLM works. Given all the numbers so far, the model predicts the next number. Then that number gets added to the sequence, and the model predicts the next one. And the next one. It’s like autocomplete on your phone — except instead of predicting the next word, it’s predicting the next token (number). One at a time, in a loop.

Sampling is how the next token gets picked. The model doesn’t output a single “answer” for the next token. It outputs a probability for every possible token — like saying “there’s a 40% chance the next token is ‘the’, a 15% chance it’s ‘a’, a 3% chance it’s ‘get’…” and so on for all 250,000+ tokens in its vocabulary. Something needs to actually pick one. That’s what sampling does. Parameters like temperature control how adventurous the pick is — low temperature means “almost always pick the highest probability”, high temperature means “be more random”. This is handled by llama.cpp, not the model itself.
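Temperature sampling can be sketched in a few lines. This assumes the model hands over raw scores (logits) that are turned into probabilities with a softmax. The three scores are made up, and llama.cpp's real sampler chain has more stages than this.

```python
import math
import random


def softmax_with_temperature(logits: dict, temperature: float) -> dict:
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick one token at random, weighted by its probability."""
    probs = softmax_with_temperature(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


logits = {"the": 2.0, "a": 1.0, "get": -0.5}  # made-up scores for three tokens
rng = random.Random(0)
cold = [sample(logits, 0.05, rng) for _ in range(20)]  # near-greedy picks
hot = [sample(logits, 5.0, rng) for _ in range(20)]    # much more varied picks
```

At temperature 0.05 the top-scoring token takes almost all of the probability mass, so `cold` is effectively greedy. At 5.0 the three probabilities move much closer together.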

A PEG grammar is just a set of rules that acts as a referee. Normally, the model is free to generate any token it wants. But when we need structured output — like a valid tool call — we need guardrails. A PEG grammar is a set of rules that says “at this point in the output, only these tokens are allowed.” Think of it as a referee watching each token as it’s generated: “yes, that’s valid” or “no, pick again.” llama.cpp builds this grammar automatically from the tool schemas you provide. The model still predicts tokens normally, but the grammar filters out anything that wouldn’t form a valid tool call.
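One way to picture the referee is as a mask over the model's probabilities: forbidden tokens are dropped and the rest are renormalized before sampling. The sketch below assumes that masking view. The `constrain` helper, the candidate tokens, and their probabilities are all made up for illustration.

```python
def constrain(probs: dict, allowed: set) -> dict:
    """Drop tokens the grammar forbids, then renormalize what is left."""
    kept = {tok: p for tok, p in probs.items() if tok in allowed}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("grammar allows none of the candidate tokens")
    return {tok: p / total for tok, p in kept.items()}


# Made-up probabilities right after the model has emitted "call:".
probs = {"get": 0.5, "fetch": 0.3, "run": 0.2}
# Suppose the declared tools only allow names that start with "get" or "run".
allowed = {"get", "run"}
filtered = constrain(probs, allowed)
```

The model's preference order among the surviving tokens is unchanged. The grammar only removes options, it never adds or reorders them.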

With that context, here’s what llama-server actually does with your API request.

Step 1: llama-server converts the tool schemas into a grammar.

When the bot (a Python script) sends tool definitions as JSON, llama-server doesn’t pass them directly to the model. It builds a PEG grammar (the rules we just talked about) that constrains the model’s output:

			
Grammar (tool_calls):
root ::= tool-call
tool-call ::= ("<|tool_call>call:" (tool-get-current-datetime) "<tool_call|>")?
tool-get-current-datetime ::= ("get_current_datetime" ) gemma4-dict

This grammar says: if the model decides to call a tool, it must output the exact pattern <|tool_call>call:FUNCTION_NAME{ARGS}<tool_call|>. No hallucinated function names. No malformed arguments. The referee won’t allow it.

Step 2: llama-server converts the conversation into the model’s prompt format.

Our JSON messages get transformed into Gemma 4’s native prompt format using a Jinja chat template. This is what the model actually sees as its input (captured from the verbose log):

			
<bos><|turn>system
You are a helpful assistant.<|tool>declaration:get_current_datetime{description:<|"|>Get current date and time.<|"|>,parameters:{type:<|"|>OBJECT<|"|>}}<tool|><turn|>
<|turn>user
What time is it?<turn|>
<|turn>model

		

Look at that. The tool definition got converted from JSON into Gemma 4’s own markup format: <|tool>declaration:...<tool|>. The model was trained to understand this syntax during fine-tuning. It knows that when it sees a <|tool>declaration:... block, that’s a tool it can call.

Step 3: llama-server sets up a grammar trigger.

			
Grammar lazy: true
Grammar trigger token: 48 (`<|tool_call>`)

The grammar is lazy — it only activates when the model emits token 48 (<|tool_call>). If the model starts generating normal text instead, the grammar stays dormant and the model responds freely. The grammar is a safety net, not a cage.
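The lazy behaviour can be sketched as a tiny state machine. The `LazyGrammar` class is hypothetical, and a flat set of allowed token IDs stands in for the real grammar, which tracks where it is in the pattern. Only the trigger token ID comes from the log above.

```python
TRIGGER_TOKEN = 48  # <|tool_call> in the log above


class LazyGrammar:
    """Accept everything until the trigger token appears, then enforce rules."""

    def __init__(self, allowed_after_trigger: set):
        self.active = False
        self.allowed = allowed_after_trigger

    def accepts(self, token_id: int) -> bool:
        if not self.active:
            if token_id == TRIGGER_TOKEN:
                self.active = True  # from now on the rules apply
            return True  # dormant: free text passes through
        return token_id in self.allowed
```

Before token 48 shows up, `accepts` returns True for any token. After it, only tokens in the allowed set pass.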

The Raw Tokens — What the Model Actually Generates

This is the interesting part. Here’s the actual token-by-token generation from the verbose log when the model decides to call get_current_datetime:

			
n_decoded = 1  → token 48    '<|tool_call>'    ← grammar activates here
n_decoded = 2  → token 6639  'call'
n_decoded = 3  → token 236787 ':'
n_decoded = 4  → token 828   'get'
n_decoded = 5  → token 236779 '_'
n_decoded = 6  → token 4002  'current'
n_decoded = 7  → token 236779 '_'
n_decoded = 8  → token 19361 'datetime'
n_decoded = 9  → token 16454 '{}'
n_decoded = 10 → token 49    '<tool_call|>'
n_decoded = 11 → token 50    ''              ← stopped by EOS

		

11 tokens. That’s it. The model generated the <|tool_call> special token (remember — just the number 48), then the function name piece by piece, then the closing token. Each of those “words” in the log is what the tokenizer dictionary looked up for the number the model predicted. Total generation time: 4.1 seconds at 2.65 tokens per second on the Pi 4B.

The model doesn’t know about JSON. It doesn’t know about OpenAI’s API format. It was trained to emit these specific token patterns when it wants to invoke a function. That’s the fundamental mechanism.

Then llama-server parses those tokens back into JSON:

			
Parsing chat message: <|tool_call>call:get_current_datetime{}<tool_call|>
Parsed message: {"role":"assistant","content":"","tool_calls":[{"type":"function",
"function":{"name":"get_current_datetime","arguments":"{}"}}]}

Raw tokens → special syntax (matched) → structured JSON. That’s the conversion.
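To illustrate that conversion, the sketch below joins the token pieces from the verbose log and maps the result onto an OpenAI-style message with a regular expression. This is not how llama-server implements its parser. The `parse_tool_call` helper is hypothetical and deliberately handles only the empty-argument case shown in the log.

```python
import json
import re

# Token pieces copied from the verbose log above (EOS omitted).
PIECES = ["<|tool_call>", "call", ":", "get", "_", "current", "_",
          "datetime", "{}", "<tool_call|>"]

TOOL_CALL = re.compile(r"<\|tool_call>call:(\w+)(\{.*?\})<tool_call\|>", re.DOTALL)


def parse_tool_call(raw: str) -> dict:
    """Map the model's raw tool-call markup to an OpenAI-style message."""
    match = TOOL_CALL.fullmatch(raw)
    if match is None:
        return {"role": "assistant", "content": raw}  # plain text reply
    name, args = match.groups()
    if args != "{}":
        raise NotImplementedError("this sketch only handles empty arguments")
    return {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"type": "function",
                        "function": {"name": name, "arguments": "{}"}}],
    }


raw = "".join(PIECES)  # detokenize: concatenate the looked-up pieces
message = parse_tool_call(raw)
print(json.dumps(message))
```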

The Complete Conversion Chain

Here’s the full picture — every layer, from bot to model and back:

			
BOT → llama-server (inbound)
─────────────────────────────
JSON tool schemas            → PEG grammar constraining output
JSON messages                → Gemma 4 prompt format with <|tool> tags
OpenAI API request           → Tokenized prompt + grammar trigger
MODEL (generation)
─────────────────────────────
Model emits token 48         → <|tool_call> (grammar activates)
Model emits function name    → get_current_datetime
Model emits arguments        → {}
Model emits token 49         → <tool_call|> (grammar closes)
Model emits token 50         → EOS (generation stops)
llama-server → BOT (outbound)
─────────────────────────────
Raw: <|tool_call>call:get_current_datetime{}<tool_call|>
  → Parsed into structured JSON
  → Wrapped in OpenAI-compatible response
  → finish_reason: "tool_calls"

		

The model doesn’t speak JSON. It speaks tokens. Everything else is translation.

Summary

Layer	What Happens	Format
Bot	Sends tool schemas + messages	OpenAI JSON
llama-server (in)	Converts to prompt + grammar	Gemma 4 markup + PEG
Model	Predicts tokens one at a time, grammar filters invalid ones	<|tool_call>...<tool_call|>
llama-server (out)	Parses tokens back to JSON	OpenAI JSON
Bot	Detects tool_calls, executes, loops	Python

The LLM basically does a lot of calculation and assigns a probability to each token in its dictionary. That part is deterministic. The inference engine then decides which token to pick based on its configuration (temperature etc.), and that sampling step is the only non-deterministic part of the AI system (model + inference engine).

Tool calling isn’t magic. It’s just token prediction with guardrails.

Akash Gupta
Senior VoIP Engineer and AI Enthusiast

AI and VoIP Blog
