🧠 Gemma3 Smart Q4 – Bilingual Offline Assistant for Raspberry Pi

Gemma3 Smart Q4 is a quantized bilingual (Italian–English) variant of Google's Gemma 3 1B model, optimized for edge devices such as the Raspberry Pi 4 and 5. It runs completely offline with Ollama or llama.cpp, ensuring privacy and speed without external dependencies.


💻 Optimized for Raspberry Pi

✅ Tested on Raspberry Pi 4 (4GB) – average speed 3.56–3.67 tokens/s
✅ Fully offline – no external APIs, no internet required
✅ Lightweight – under 800 MB in Q4 quantization
✅ Bilingual – seamlessly switches between Italian and English


🔑 Key Features

  • 🗣️ Bilingual AI – Automatically detects and responds in Italian or English
  • ⚡ Edge-optimized – Tuned parameters for low-power ARM devices
  • 🔒 Privacy-first – All inference happens locally on your device
  • 🧩 Two quantizations available:
    • Q4_K_M (≈769 MB) → better quality, more coherent reasoning
    • Q4_0 (≈687 MB) → ~3% faster in the benchmarks below, ideal for real-time interactions

📊 Benchmark Results

Tested on Raspberry Pi 4 (4GB RAM) with Ollama:

| Model | Avg Speed | Individual Runs | File Size | Use Case |
|---|---|---|---|---|
| gemma3-1b-q4_k_m.gguf | 3.56 tokens/s | 3.71, 3.58, 3.40 t/s | 769 MB | Better quality, long conversations |
| gemma3-1b-q4_0.gguf | 3.67 tokens/s | 3.65, 3.67, 3.70 t/s | 687 MB | Default choice, general use |

Test details:

  • Hardware: Raspberry Pi 4 (4GB RAM)
  • OS: Raspberry Pi OS (Debian Bookworm)
  • Runtime: Ollama 0.x
  • Prompts: Mixed Italian/English, typical assistant queries

Recommendation: use Q4_0 as the default (about 3% faster, 82 MB smaller, comparable quality). Choose Q4_K_M only if you need slightly better coherence in very long conversations (1000+ tokens).
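The per-run figures in the table come from the `eval rate` line that `ollama run --verbose` prints after each generation. A minimal sketch of averaging three recorded runs (the Q4_K_M numbers above) with awk:

```shell
# Per-run eval rates come from Ollama's --verbose output, e.g.:
#   ollama run gemma3-smart-q4 --verbose "Ciao!" 2>&1 | grep 'eval rate'
# Here the three Q4_K_M runs from the table are averaged with awk.
printf '3.71\n3.58\n3.40\n' | awk '{s += $1; n++} END {printf "%.2f tokens/s\n", s / n}'
# prints: 3.56 tokens/s
```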


🛠️ Quick Start with Ollama

Option 1: Pull from Hugging Face

Create a Modelfile:

cat > Modelfile <<'MODELFILE'
FROM hf.co/antonio/gemma3-smart-q4/gemma3-1b-q4_0.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05

SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful. If a task requires internet access or external services, clearly state this and suggest local alternatives when possible.

Sei un assistente AI offline che opera su Raspberry Pi. Rileva automaticamente la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile. Se un compito richiede accesso a internet o servizi esterni, indicalo chiaramente e suggerisci alternative locali quando possibile.
"""
MODELFILE

Then run:

ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Ciao! Chi sei?"

Option 2: Download and Use Locally

# Download the model
wget https://huggingface.co/antonio/gemma3-smart-q4/resolve/main/gemma3-1b-q4_0.gguf

# Create Modelfile
cat > Modelfile <<'MODELFILE'
FROM ./gemma3-1b-q4_0.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05

SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful.

Sei un assistente AI offline su Raspberry Pi. Rileva la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile.
"""
MODELFILE

# Create and run
ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Hello! Introduce yourself."
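The same GGUF files also work with llama.cpp, the other runtime mentioned above. A sketch using `llama-cli` (the CLI binary name in recent llama.cpp builds), with flags mirroring the recommended parameters below:

```shell
# Run the Q4_0 file directly with llama.cpp's CLI.
# -t 4: use all four Pi 4 cores; -c 1024: context length; -n 256: max new tokens.
./llama-cli -m ./gemma3-1b-q4_0.gguf \
  -t 4 -c 1024 -n 256 \
  --temp 0.7 --top-p 0.9 --repeat-penalty 1.05 \
  -p "Ciao! Chi sei?"
```

This is a plain one-shot generation; llama.cpp's own chat and server modes accept the same model file and sampling flags.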

⚙️ Recommended Parameters

For Raspberry Pi 4/5, use these optimized settings:

Temperature: 0.7          # Balanced creativity vs consistency
Top-p: 0.9                # Nucleus sampling for diverse responses
Context Length: 1024      # Optimal for Pi 4 memory
Threads: 4                # Utilizes all Pi 4 cores
Batch Size: 32            # Optimized for throughput
Repeat Penalty: 1.05      # Reduces repetitive outputs

For faster responses (e.g., voice assistant), reduce num_ctx to 512.
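That tweak can live in its own Modelfile so both variants coexist side by side. A sketch, where `Modelfile.fast` and the model name `gemma3-smart-q4-fast` are hypothetical names chosen for illustration:

```shell
# Low-latency variant: same weights, halved context window (num_ctx 512).
cat > Modelfile.fast <<'MODELFILE'
FROM ./gemma3-1b-q4_0.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 512
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05
MODELFILE

grep 'num_ctx' Modelfile.fast   # prints: PARAMETER num_ctx 512
# ollama create gemma3-smart-q4-fast -f Modelfile.fast   # then register it
```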


📦 Files Included

  • gemma3-1b-q4_k_m.gguf – Q4_K_M quantization (~769 MB), better quality
  • gemma3-1b-q4_0.gguf – Q4_0 quantization (~687 MB), faster speed

🔖 License & Attribution

This is a derivative work of Google's Gemma 3 1B. Please review and comply with the Gemma License.

Quantization, optimization, and bilingual configuration by Antonio.


🚀 Use Cases

  • Privacy-focused personal assistant – All data stays on your device
  • Offline home automation – Control IoT devices without cloud dependencies
  • Educational projects – Learn AI/ML without expensive hardware
  • Voice assistants – Fast enough for real-time speech interaction
  • Embedded systems – Industrial applications requiring offline inference
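For the automation and embedded scenarios above, scripts can talk to the locally running Ollama server over its HTTP API (default port 11434). A sketch assuming the model name created earlier and `jq` installed:

```shell
# POST a generate request to the local Ollama API and print only the reply text.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "gemma3-smart-q4", "prompt": "Spegni la luce del soggiorno", "stream": false}' \
  | jq -r '.response'
```

With `"stream": false` the server returns one JSON object, which keeps parsing trivial in shell scripts; streaming mode returns one JSON object per token instead.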

Built with ❤️ by Antonio 🇮🇹. Empowering privacy and edge computing, one model at a time.
