I wanted to find out whether I could run a local LLM on my Raspberry Pi: one I could query without sending data to any cloud, that runs 24/7, and that costs essentially nothing after the hardware. I already had a Raspberry Pi 4B (8GB) sitting on my desk doing light work, so I decided to see how far I could push it.
The answer: far enough to run Google’s newest Gemma 4 E2B model, fully locally, at 2.5 tokens per second. Here’s the full setup for running Gemma 4 on Raspberry Pi 4B.
Table of Contents
- What Is Gemma 4?
- What You Need
- The Tech Stack
- Step 1: Install Build Dependencies
- Step 2: Build llama.cpp from Source
- Step 3: Download the Model
- Step 4: Start the Server
- Step 5: Auto-Start on Boot
- Using the Gemma 4 Raspberry Pi API
- Apps That Work with Your Pi
- Real Numbers
What Is Gemma 4?
Gemma 4 is Google DeepMind’s latest family of open-weight models, released in April 2026. It’s fully permissive for commercial use and ships in four sizes. The one we care about for Gemma 4 Raspberry Pi deployments is the E2B — 2.3 billion effective parameters, built specifically for mobile and edge hardware.
The “E” stands for “effective parameters.” The model uses a technique called Per-Layer Embeddings (PLE) — each decoder layer gets its own small embedding table per token. This is why the model punches above its weight while keeping memory requirements manageable.
Beyond just text, Gemma 4 E2B handles images, audio, and a 128K context window. It also has a built-in reasoning mode — the model thinks through a problem step-by-step before answering, similar to chain-of-thought but native.
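To make the per-layer-embedding idea concrete, here is a minimal numpy sketch. The shapes, names, and projection step are illustrative only, not Gemma's actual architecture: the point is that many small per-layer tables replace one monolithic cost, which is what keeps the "effective" parameter count low.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab, d_model, d_ple, n_layers = 1000, 64, 8, 4

# One shared embedding table, as in a standard transformer.
tok_embed = rng.normal(size=(vocab, d_model))

# PLE: each layer also gets its own *small* per-token embedding table.
# Small per-layer tables are cheap to hold and can be fetched layer by
# layer, which is the memory trick behind "effective parameters".
ple_tables = rng.normal(size=(n_layers, vocab, d_ple))
ple_proj = rng.normal(size=(n_layers, d_ple, d_model)) * 0.01

def forward(token_ids):
    h = tok_embed[token_ids]                 # (seq, d_model)
    for layer in range(n_layers):
        ple = ple_tables[layer][token_ids]   # (seq, d_ple), looked up per layer
        h = h + ple @ ple_proj[layer]        # inject the per-layer signal
        # ... real attention/MLP blocks would go here ...
    return h

out = forward(np.array([1, 5, 42]))
print(out.shape)  # (3, 64)
```

The full vocab-sized table is touched once; each layer only adds a `d_ple`-wide lookup, so the per-layer tables stay small even with a large vocabulary.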
What You Need to Run Gemma 4 on Raspberry Pi
- Raspberry Pi 4B — 8GB RAM recommended (4GB will work but is tight)
- 64-bit OS — Raspberry Pi OS Bookworm (aarch64)
- ~5GB free disk space for the model + build artifacts
- SSH access to the Pi
- Internet connection for the initial download
No GPU. No cloud account. No API key.
The Tech Stack
The stack for running Gemma 4 on Raspberry Pi is lean: llama.cpp runs the model on CPU, exposes an OpenAI-compatible REST API via llama-server, and loads the model from a compressed GGUF file.
Your app / curl / any OpenAI client
↓
llama-server (HTTP, port 8080)
OpenAI-compatible REST API
↓
llama.cpp (CPU inference, ARM NEON + OpenBLAS)
↓
gemma-4-E2B-it-Q4_K_M.gguf (~2.9GB in RAM)
↓
Raspberry Pi 4B (aarch64, 8GB RAM, 4× ARM Cortex-A72)
llama.cpp is a pure C++ inference engine with ARM NEON optimizations and OpenBLAS support for matrix operations — exactly what you need when there is no GPU.
Step 1: Install Build Dependencies
Before we can build anything, we need a C++ compiler, cmake, and OpenBLAS — an optimized math library that accelerates matrix operations on the Pi’s ARM CPU. Without OpenBLAS, prompt processing is noticeably slower.
SSH into your Pi and run:
sudo apt update && sudo apt install -y cmake git build-essential \
libopenblas-dev python3-pip wget
This installs everything needed for the build in one shot. If you already have some of these, apt will skip them.
Step 2: Build llama.cpp from Source
llama.cpp is available as a prebuilt package, but for Gemma 4 you need to build from source. The packaged versions lag behind by months and don't include the Gemma 4 fixes that were merged in April 2026. Building from source ensures you get the latest chat template support and model compatibility.
Clone the repo and build with OpenBLAS enabled:
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4
The -j4 flag uses all four cores. This takes about 20–25 minutes on a Pi 4B. Go make a coffee. When it finishes, the binaries land in ~/llama.cpp/build/bin/ — the two we care about are llama-cli (interactive chat) and llama-server (HTTP API).
Step 3: Download the Gemma 4 Model
Now we need the actual model file. Unsloth maintains pre-quantized GGUF versions of the full Gemma 4 family on HuggingFace. We’ll use the Q4_K_M quantization — it’s 2.9GB, fits easily in RAM, and gives the best quality-to-speed tradeoff for Pi hardware.
Install the HuggingFace download tool and pull the model:
pip3 install huggingface_hub --break-system-packages
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
repo_id='unsloth/gemma-4-E2B-it-GGUF',
filename='gemma-4-E2B-it-Q4_K_M.gguf',
local_dir='/home/pi/models'
)
print('Done.')
"
This takes 5–15 minutes depending on your connection. The model file lands in ~/models/gemma-4-E2B-it-Q4_K_M.gguf.
Step 4: Start the Server
With the model downloaded and llama.cpp built, we can start llama-server. This spins up an HTTP server that exposes an OpenAI-compatible chat completions API on port 8080. Anything that can talk to ChatGPT can talk to this endpoint instead.
Start it up:
~/llama.cpp/build/bin/llama-server \
-m ~/models/gemma-4-E2B-it-Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
--threads 4 \
--ctx-size 4096
After about 15 seconds of loading, you’ll see:
main: model loaded
main: server is listening on http://0.0.0.0:8080
Pro tip: Keep --ctx-size at 4096 or lower. The model supports 128K but each token in the KV cache eats RAM — on Pi you feel it. Setting --threads 4 matches the Pi 4B’s four CPU cores exactly.
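Before wiring anything else up, it's worth a quick readiness probe. llama-server exposes a /health endpoint that answers once the model is loaded; this small helper is my own wrapper around it, not part of llama.cpp:

```python
import urllib.request
import urllib.error

def server_ready(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if llama-server's /health endpoint answers with HTTP 200."""
    url = f"http://{host}:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    # Replace with your Pi's address; returns True once the model is loaded.
    print(server_ready("192.168.0.10", 8080))
```

Handy in scripts that should wait for the Pi to finish the ~15-second model load before sending requests.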
Step 5: Auto-Start on Boot
You probably don’t want to SSH into the Pi and start the server by hand every time it reboots. A systemd service solves this — it starts the server automatically, restarts it if it crashes, and logs output to the system journal.
Create the service file and enable it:
sudo tee /etc/systemd/system/llama-server.service > /dev/null << 'EOF'
[Unit]
Description=llama-server (Gemma 4 E2B)
After=network.target
[Service]
User=pi
ExecStart=/home/pi/llama.cpp/build/bin/llama-server \
-m /home/pi/models/gemma-4-E2B-it-Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 --threads 4 --ctx-size 4096
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
This is the same pattern I used in the Raspberry Pi web page watcher post — a systemd service keeps things running without babysitting. You can check the status anytime with sudo systemctl status llama-server.
Using the Gemma 4 Raspberry Pi API
The server exposes an OpenAI-compatible API. Anything that talks to GPT-4 can talk to your Pi — just change the base URL.
curl
The quickest way to test is a simple curl request from any machine on your network. Replace the IP with your Pi’s address:
curl http://192.168.0.10:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"messages": [{"role": "user", "content": "Explain gradient descent in plain English."}],
"max_tokens": 300,
"temperature": 1.0,
"chat_template_kwargs": {"enable_thinking": false}
}'
Python (openai SDK)
If you’re building something in Python, the official OpenAI SDK works out of the box — just point it at your Pi instead of the OpenAI endpoint:
from openai import OpenAI
client = OpenAI(base_url="http://192.168.0.10:8080/v1", api_key="none")
response = client.chat.completions.create(
model="gemma4",
messages=[{"role": "user", "content": "Write a haiku about ARM processors."}],
max_tokens=100,
temperature=1.0,
extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)
print(response.choices[0].message.content)
Enabling Thinking Mode
Gemma 4 has a built-in reasoning mode. Drop chat_template_kwargs entirely and the model works through the problem step-by-step before giving a final answer. The internal reasoning lands in the reasoning_content field of the response.
On the Gemma 4 Raspberry Pi setup, thinking mode means significantly more tokens — budget for it. The per-token generation speed stays the same (~2.5 t/s), but the model produces hundreds of extra reasoning tokens before the actual answer. A question that takes 20 seconds with thinking off can take 2–4 minutes with it on. Keep it off for simple questions, turn it on for math or logic problems.
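With thinking on, the response message carries both fields. Here is a sketch of splitting them apart, using a hand-written sample dict in place of a live call (the field layout follows llama.cpp's reasoning_content convention; the sample text itself is made up):

```python
# Illustrative response shape; in practice this dict comes from the
# /v1/chat/completions endpoint with thinking left enabled.
sample = {
    "choices": [{
        "message": {
            "reasoning_content": "First, 17 is odd. An odd number squared is odd...",
            "content": "289 is odd.",
        }
    }]
}

msg = sample["choices"][0]["message"]
thinking = msg.get("reasoning_content", "")  # internal step-by-step reasoning
answer = msg["content"]                       # the user-facing answer

print(f"reasoning length (words): {len(thinking.split())}")
print(f"answer: {answer}")
```

Using .get() for reasoning_content keeps the same code working when thinking is disabled, since the field is simply absent then.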
Apps That Work with Your Gemma 4 Raspberry Pi
Because llama-server speaks the OpenAI API, you’re not limited to curl and Python scripts. Any application that supports a custom OpenAI-compatible endpoint can connect to your Pi. Here are two I’ve used.
Open WebUI — Full Chat Interface
Open WebUI gives you a ChatGPT-style web interface that runs locally. I wrote a full guide on setting it up: Open Web UI: Run local chat app with Image Generation using phi4, deepseek-r1 and stable diffusion. The setup in that post uses Ollama, but you can point Open WebUI at your Pi’s llama-server instead.
In the Open WebUI admin panel, go to Settings → Connections and add a new OpenAI-compatible connection:
URL: http://<pi-ip>:8080/v1
API Key: none
Open WebUI will discover the model automatically. You get conversation history, markdown rendering, and multi-turn chat — all hitting your Pi.
Zed Editor — AI Assistant in Your Code Editor
Zed is a fast, modern code editor with a built-in AI assistant. It supports custom OpenAI-compatible providers out of the box, which means you can use your Gemma 4 Raspberry Pi as a free, private coding assistant.
Open Zed’s settings (Cmd+,) and add this under language_models:
{
"language_models": {
"openai_compatible": {
"Gemma 4 Pi": {
"api_url": "http://<pi-ip>:8080/v1",
"available_models": [
{
"name": "gemma4",
"display_name": "Gemma 4 E2B (Raspberry Pi)",
"max_tokens": 4096
}
]
}
}
}
}
Zed will ask for an API key on first use; type anything, since the Pi doesn't check it. After that, you can open the AI assistant panel and chat with Gemma 4 directly from your editor. Ask it to explain code, generate functions, or review your work, all processed on your local network.
Pro tip: Any tool that lets you set a custom OpenAI base URL works the same way — aichat, Aider, Continue (VS Code), and others. Just point the base URL at http://<pi-ip>:8080/v1 and you’re in.
Real Numbers
These are from the actual Pi, not a benchmark sheet:
| Metric | Value |
|---|---|
| Prompt processing | 5.5 tokens/sec |
| Generation speed | ~2.5 tokens/sec |
| RAM used | ~3.1 GB |
| Model load time | ~15 seconds |
| Power draw | ~5W |
A 200-word response takes roughly 60–80 seconds. Slow by cloud standards. Free and private by any standard.
Summary
| Component | Tool |
|---|---|
| Hardware | Raspberry Pi 4B (8GB) |
| Model | Gemma 4 E2B (Q4_K_M, 2.9GB) |
| Inference engine | llama.cpp (built from source) |
| API | llama-server (OpenAI-compatible) |
| Auto-start | systemd service |
A private, always-on, OpenAI-compatible LLM endpoint on your home network — no subscriptions, no data leaving your house, no GPU required. That’s Gemma 4 on Raspberry Pi in a nutshell.
Akash Gupta
Senior VoIP Engineer and AI Enthusiast

AI and VoIP Blog
Thank you for visiting the blog. Hit the subscribe button to receive the next post right in your inbox. If you found this article helpful, don't forget to share your feedback in the comments and hit the like/clap button. This helps me know which topics resonate with you, so I can create more content that keeps you informed.
Thank you for reading, and stay tuned for more insights and guides!