🧠 Gemma3 Smart Q4 – Bilingual Offline Assistant for Raspberry Pi

Gemma3 Smart Q4 is a quantized bilingual (Italian–English) variant of Google's Gemma 3 1B model, optimized for edge devices such as the Raspberry Pi 4 and 5. It runs completely offline with Ollama or llama.cpp, ensuring privacy and speed without external dependencies.


💻 Optimized for Raspberry Pi

✅ Tested on Raspberry Pi 4 (4GB) – average speed 3.56–3.67 tokens/s
✅ Fully offline – no external APIs, no internet required
✅ Lightweight – under 800 MB in Q4 quantization
✅ Bilingual – seamlessly switches between Italian and English


🔑 Key Features

  • 🗣️ Bilingual AI – Automatically detects and responds in Italian or English
  • ⚡ Edge-optimized – Tuned parameters for low-power ARM devices
  • 🔒 Privacy-first – All inference happens locally on your device
  • 🧩 Two quantizations available:
    • Q4_K_M (≈769 MB) → better quality, more coherent reasoning
    • Q4_0 (≈687 MB) → ~3% faster in the benchmarks below, ideal for real-time interactions

📊 Benchmark Results

Tested on Raspberry Pi 4 (4GB RAM) with Ollama:

| Model | Avg Speed | Individual Runs | File Size | Use Case |
|---|---|---|---|---|
| gemma3-1b-q4_k_m.gguf | 3.56 tokens/s | 3.71, 3.58, 3.40 t/s | 769 MB | Better quality, long conversations |
| gemma3-1b-q4_0.gguf | 3.67 tokens/s | 3.65, 3.67, 3.70 t/s | 687 MB | Default choice, general use |

Test details:

  • Hardware: Raspberry Pi 4 (4GB RAM)
  • OS: Raspberry Pi OS (Debian Bookworm)
  • Runtime: Ollama 0.x
  • Prompts: Mixed Italian/English, typical assistant queries

Recommendation: use Q4_0 as the default (about 3% faster, 82 MB smaller, comparable quality). Choose Q4_K_M only if you need slightly better coherence in very long conversations (1000+ tokens).
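The per-run figures in the table come from the `eval rate` line that `ollama run --verbose` prints after each generation. A minimal sketch of averaging three recorded runs (the Q4_K_M numbers above) with awk:

```shell
# Per-run eval rates come from Ollama's --verbose output, e.g.:
#   ollama run gemma3-smart-q4 --verbose "Ciao!" 2>&1 | grep 'eval rate'
# Here the three Q4_K_M runs from the table are averaged with awk.
printf '3.71\n3.58\n3.40\n' | awk '{s += $1; n++} END {printf "%.2f tokens/s\n", s / n}'
# prints: 3.56 tokens/s
```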


🛠️ Quick Start with Ollama

Option 1: Pull from Hugging Face

Create a Modelfile:

cat > Modelfile <<'MODELFILE'
FROM hf.co/antonio/gemma3-smart-q4/gemma3-1b-q4_0.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05

SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful. If a task requires internet access or external services, clearly state this and suggest local alternatives when possible.

Sei un assistente AI offline che opera su Raspberry Pi. Rileva automaticamente la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile. Se un compito richiede accesso a internet o servizi esterni, indicalo chiaramente e suggerisci alternative locali quando possibile.
"""
MODELFILE

Then run:

ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Ciao! Chi sei?"

Option 2: Download and Use Locally

# Download the model
wget https://huggingface.co/antonio/gemma3-smart-q4/resolve/main/gemma3-1b-q4_0.gguf

# Create Modelfile
cat > Modelfile <<'MODELFILE'
FROM ./gemma3-1b-q4_0.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05

SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful.

Sei un assistente AI offline su Raspberry Pi. Rileva la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile.
"""
MODELFILE

# Create and run
ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Hello! Introduce yourself."
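The same GGUF files also work with llama.cpp, the other runtime mentioned above. A sketch using `llama-cli` (the CLI binary name in recent llama.cpp builds), with flags mirroring the recommended parameters below:

```shell
# Run the Q4_0 file directly with llama.cpp's CLI.
# -t 4: use all four Pi 4 cores; -c 1024: context length; -n 256: max new tokens.
./llama-cli -m ./gemma3-1b-q4_0.gguf \
  -t 4 -c 1024 -n 256 \
  --temp 0.7 --top-p 0.9 --repeat-penalty 1.05 \
  -p "Ciao! Chi sei?"
```

This is a plain one-shot generation; llama.cpp's own chat and server modes accept the same model file and sampling flags.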

⚙️ Recommended Parameters

For Raspberry Pi 4/5, use these optimized settings:

Temperature: 0.7          # Balanced creativity vs consistency
Top-p: 0.9                # Nucleus sampling for diverse responses
Context Length: 1024      # Optimal for Pi 4 memory
Threads: 4                # Utilizes all Pi 4 cores
Batch Size: 32            # Optimized for throughput
Repeat Penalty: 1.05      # Reduces repetitive outputs

For faster responses (e.g., voice assistant), reduce num_ctx to 512.
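That tweak can live in its own Modelfile so both variants coexist side by side. A sketch, where `Modelfile.fast` and the model name `gemma3-smart-q4-fast` are hypothetical names chosen for illustration:

```shell
# Low-latency variant: same weights, halved context window (num_ctx 512).
cat > Modelfile.fast <<'MODELFILE'
FROM ./gemma3-1b-q4_0.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 512
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05
MODELFILE

grep 'num_ctx' Modelfile.fast   # prints: PARAMETER num_ctx 512
# ollama create gemma3-smart-q4-fast -f Modelfile.fast   # then register it
```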


📦 Files Included

  • gemma3-1b-q4_k_m.gguf – Q4_K_M quantization (~769 MB), better quality
  • gemma3-1b-q4_0.gguf – Q4_0 quantization (~687 MB), faster speed

🔖 License & Attribution

This is a derivative work of Google's Gemma 3 1B. Please review and comply with the Gemma License.

Quantization, optimization, and bilingual configuration by Antonio.


🚀 Use Cases

  • Privacy-focused personal assistant – All data stays on your device
  • Offline home automation – Control IoT devices without cloud dependencies
  • Educational projects – Learn AI/ML without expensive hardware
  • Voice assistants – Fast enough for real-time speech interaction
  • Embedded systems – Industrial applications requiring offline inference
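For the automation and embedded scenarios above, scripts can talk to the locally running Ollama server over its HTTP API (default port 11434). A sketch assuming the model name created earlier and `jq` installed:

```shell
# POST a generate request to the local Ollama API and print only the reply text.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "gemma3-smart-q4", "prompt": "Spegni la luce del soggiorno", "stream": false}' \
  | jq -r '.response'
```

With `"stream": false` the server returns one JSON object, which keeps parsing trivial in shell scripts; streaming mode returns one JSON object per token instead.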

Built with ❤️ by Antonio 🇮🇹. Empowering privacy and edge computing, one model at a time.
