🧠 Gemma3 Smart Q4: Bilingual Offline Assistant for Raspberry Pi
Gemma3 Smart Q4 is a quantized bilingual (Italian-English) variant of Google's Gemma 3 1B model, optimized for edge devices like the Raspberry Pi 4 & 5. It runs completely offline with Ollama or llama.cpp, ensuring privacy and speed without external dependencies.
💻 Optimized for Raspberry Pi
- ✅ Tested on Raspberry Pi 4 (4 GB): average speed 3.56-3.67 tokens/s
- ✅ Fully offline: no external APIs, no internet required
- ✅ Lightweight: under 800 MB in Q4 quantization
- ✅ Bilingual: seamlessly switches between Italian and English
🚀 Key Features
- 🗣️ Bilingual AI: automatically detects and responds in Italian or English
- ⚡ Edge-optimized: fine-tuned parameters for low-power ARM devices
- 🔒 Privacy-first: all inference happens locally on your device
- 🧩 Two quantizations available:
  - Q4_K_M (~769 MB): better quality, more coherent reasoning
  - Q4_0 (~687 MB): ~3% faster in our benchmarks, ideal for real-time interactions
📊 Benchmark Results
Tested on Raspberry Pi 4 (4GB RAM) with Ollama:
| Model | Avg Speed | Individual Results | File Size | Use Case |
|---|---|---|---|---|
| gemma3-1b-q4_k_m.gguf | 3.56 tokens/s | 3.71, 3.58, 3.40 t/s | 769 MB | Better quality, long conversations |
| gemma3-1b-q4_0.gguf | 3.67 tokens/s | 3.65, 3.67, 3.70 t/s | 687 MB | Default choice, general use |
Test details:
- Hardware: Raspberry Pi 4 (4GB RAM)
- OS: Raspberry Pi OS (Debian Bookworm)
- Runtime: Ollama 0.x
- Prompts: Mixed Italian/English, typical assistant queries
Recommendation: use Q4_0 as the default (~3% faster, 82 MB smaller, comparable quality). Choose Q4_K_M only if you need slightly better coherence in very long conversations (1000+ tokens).
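With Ollama, per-run throughput can be read from `ollama run --verbose` output (the `eval rate` line). The averages in the table follow directly from the individual runs:

```python
# Average the per-run eval rates (tokens/s) reported by Ollama for each model.
runs = {
    "gemma3-1b-q4_k_m.gguf": [3.71, 3.58, 3.40],
    "gemma3-1b-q4_0.gguf": [3.65, 3.67, 3.70],
}

for model, rates in runs.items():
    avg = sum(rates) / len(rates)
    print(f"{model}: {avg:.2f} tokens/s")
# gemma3-1b-q4_k_m.gguf: 3.56 tokens/s
# gemma3-1b-q4_0.gguf: 3.67 tokens/s
```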
🛠️ Quick Start with Ollama
Option 1: Pull from Hugging Face
Create a Modelfile:
```bash
cat > Modelfile <<'MODELFILE'
FROM hf.co/antonio/gemma3-smart-q4/gemma3-1b-q4_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05
SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful. If a task requires internet access or external services, clearly state this and suggest local alternatives when possible.
Sei un assistente AI offline che opera su Raspberry Pi. Rileva automaticamente la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile. Se un compito richiede accesso a internet o servizi esterni, indicalo chiaramente e suggerisci alternative locali quando possibile.
"""
MODELFILE
```
Then run:
```bash
ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Ciao! Chi sei?"
```
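Once the model is created, it can also be queried programmatically through Ollama's local REST API (default port 11434). A minimal sketch in Python, assuming the `gemma3-smart-q4` model name from the step above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt: str, model: str = "gemma3-smart-q4") -> dict:
    # stream=False returns one JSON object instead of line-delimited chunks
    return {"model": model, "prompt": prompt, "stream": False}

def ask(prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (with the Ollama server running): ask("Ciao! Chi sei?")
```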
Option 2: Download and Use Locally
```bash
# Download the model
wget https://huggingface.co/antonio/gemma3-smart-q4/resolve/main/gemma3-1b-q4_0.gguf

# Create Modelfile
cat > Modelfile <<'MODELFILE'
FROM ./gemma3-1b-q4_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05
SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful.
Sei un assistente AI offline su Raspberry Pi. Rileva la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile.
"""
MODELFILE

# Create and run
ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Hello! Introduce yourself."
```
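The same GGUF file also works with llama.cpp, mentioned above as an alternative runtime. A minimal sketch using the same sampling parameters as the Modelfile (the binary name and path depend on your llama.cpp build; newer builds ship `llama-cli`, older ones `main`):

```bash
# -t 4: use all four Pi 4 cores; -c 1024: context length; -b 32: batch size
./llama-cli -m ./gemma3-1b-q4_0.gguf -t 4 -c 1024 -b 32 \
  --temp 0.7 --top-p 0.9 --repeat-penalty 1.05 \
  -p "Ciao! Chi sei?"
```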
⚙️ Recommended Parameters
For Raspberry Pi 4/5, use these optimized settings:
```text
Temperature:    0.7    # Balanced creativity vs. consistency
Top-p:          0.9    # Nucleus sampling for diverse responses
Context Length: 1024   # Optimal for Pi 4 memory
Threads:        4      # Utilizes all Pi 4 cores
Batch Size:     32     # Optimized for throughput
Repeat Penalty: 1.05   # Reduces repetitive outputs
```
For faster responses (e.g., voice assistant), reduce num_ctx to 512.
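For example, a voice-assistant profile only changes the context length relative to the Modelfile above (a sketch; the hypothetical names `Modelfile.fast` and `gemma3-smart-q4-fast` are illustrative, and the same local GGUF file is assumed):

```bash
cat > Modelfile.fast <<'MODELFILE'
FROM ./gemma3-1b-q4_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 512
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05
MODELFILE
ollama create gemma3-smart-q4-fast -f Modelfile.fast
```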
📦 Files Included
- `gemma3-1b-q4_k_m.gguf`: Q4_K_M quantization (~769 MB), better quality
- `gemma3-1b-q4_0.gguf`: Q4_0 quantization (~687 MB), faster
📄 License & Attribution
This is a derivative work of Google's Gemma 3 1B. Please review and comply with the Gemma License.
Quantization, optimization, and bilingual configuration by Antonio.
🔗 Links
- GitHub Repository: antonio/gemma3-smart-q4 (code, demos, benchmark scripts)
- Original Model: Google Gemma 3 1B IT
- Ollama Library: Coming soon (pending submission)
🎯 Use Cases
- Privacy-focused personal assistant: all data stays on your device
- Offline home automation: control IoT devices without cloud dependencies
- Educational projects: learn AI/ML without expensive hardware
- Voice assistants: fast enough for real-time speech interaction
- Embedded systems: industrial applications requiring offline inference
Built with ❤️ by Antonio 🇮🇹. Empowering privacy and edge computing, one model at a time.