Qwen3-4B Tool Calling with llama-cpp-python
A specialized 4B parameter model fine-tuned for function calling and tool usage, optimized for local deployment with llama-cpp-python.
Features
- 4B Parameters - Sweet spot for local deployment
- Function Calling - Fine-tuned on 60K function calling examples
- GGUF Format - Optimized for CPU/GPU inference
- 3.99GB Download - Fits on any modern system
- 262K Context - Large context window for complex tasks
- Low VRAM - Runs the full context window within 6GB of VRAM
Model Details
- Base Model: Qwen3-4B-Instruct-2507
- Fine-tuning: LoRA on Salesforce xlam-function-calling-60k dataset
- Quantization: Q8_0 (8-bit) for optimal performance/size ratio
- Architecture: Qwen3 with specialized tool calling tokens
- License: Apache 2.0
Installation
Quick Install
# Clone the repository
git clone https://huggingface.co/Manojb/qwen3-4b-toolcall-gguf-llamacpp-codex
cd qwen3-4b-toolcall-gguf-llamacpp-codex
# Run the installation script
./install.sh
Manual Installation
Prerequisites
- Python 3.8+
- 6GB+ RAM (8GB+ recommended)
- 5GB+ free disk space
Install Dependencies
pip install -r requirements.txt
Download Model
# Download the model file
huggingface-cli download Manojb/qwen3-4b-toolcall-gguf-llamacpp-codex Qwen3-4B-Function-Calling-Pro.gguf
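If you prefer to script the download, the same file can be fetched from Python with huggingface_hub (a minimal sketch; it assumes the huggingface_hub package is installed and returns the path of the cached file):
from huggingface_hub import hf_hub_download

# Download (or reuse from the local cache) the GGUF file and get its path
model_path = hf_hub_download(
    repo_id="Manojb/qwen3-4b-toolcall-gguf-llamacpp-codex",
    filename="Qwen3-4B-Function-Calling-Pro.gguf"
)
print(model_path)  # pass this path to Llama(model_path=...)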
Alternative: Install with specific llama-cpp-python build
For better performance, you can install llama-cpp-python with specific optimizations:
# For CPU-only (default)
pip install llama-cpp-python
# For CUDA support (if you have an NVIDIA GPU; recent llama-cpp-python releases)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
# (older releases used CMAKE_ARGS="-DLLAMA_CUBLAS=on")
# For OpenBLAS support
CMAKE_ARGS="-DGGML_BLAS=on -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
Quick Start
Option 1: Using the Run Script
# Interactive mode (default)
./run_model.sh
# or
source ./run_model.sh
# Start Codex server
./run_model.sh server
# or
source ./run_model.sh server
# Show help
./run_model.sh help
# or
source ./run_model.sh help
Option 2: Direct Python Usage
from llama_cpp import Llama
# Load the model
llm = Llama(
    model_path="Qwen3-4B-Function-Calling-Pro.gguf",
    n_ctx=2048,
    n_threads=8
)
# Simple chat (sampling settings such as temperature are passed per call)
response = llm(
    "What's the weather like in London?",
    max_tokens=200,
    temperature=0.7
)
print(response['choices'][0]['text'])
Option 3: Quick Start Demo
python3 quick_start.py
Tool Calling Example
import json
import re
from llama_cpp import Llama
def extract_tool_calls(text):
    """Extract tool calls from the model response"""
    tool_calls = []
    # Simple heuristic: find bracketed JSON arrays (re.DOTALL lets them span lines)
    json_pattern = r'\[.*?\]'
    matches = re.findall(json_pattern, text, re.DOTALL)
    
    for match in matches:
        try:
            parsed = json.loads(match)
            if isinstance(parsed, list):
                for item in parsed:
                    if isinstance(item, dict) and 'name' in item:
                        tool_calls.append(item)
        except json.JSONDecodeError:
            continue
    return tool_calls
# Initialize model
llm = Llama(
    model_path="Qwen3-4B-Function-Calling-Pro.gguf",
    n_ctx=2048
)
# Chat with tool calling
prompt = "Get the weather for New York"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
response = llm(formatted_prompt, max_tokens=200, temperature=0.7, stop=["<|im_end|>", "<|im_start|>"])
response_text = response['choices'][0]['text']
# Extract tool calls
tool_calls = extract_tool_calls(response_text)
print(f"Tool calls: {tool_calls}")
Examples
1. Weather Tool Calling
# The model will generate:
# [{"name": "get_weather", "arguments": {"q": "London"}}]
2. Hotel Search
# The model will generate:
# [{"name": "search_stays", "arguments": {"check_in": "2023-04-01", "check_out": "2023-04-08", "city": "Paris"}}]
3. Flight Booking
# The model will generate:
# [{"name": "flights_search", "arguments": {"q": "New York to Tokyo"}}]
4. News Search
# The model will generate:
# [{"name": "search_news", "arguments": {"q": "AI", "gl": "us"}}]
Codex Integration
Setting up Codex Server
To use this model with Codex, you need to run a local server that Codex can connect to:
1. Install llama-cpp-python with server support
pip install "llama-cpp-python[server]"
2. Start the Codex-compatible server
python -m llama_cpp.server \
    --model Qwen3-4B-Function-Calling-Pro.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --n_ctx 2048 \
    --n_threads 8
Sampling settings such as temperature are passed per request rather than as server flags.
3. Configure Codex to use the local server
In your Codex configuration, set:
- Server URL: http://localhost:8000
- API Key: (not required for local server)
- Model: Qwen3-4B-Function-Calling-Pro
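Before pointing Codex at it, you can verify the server from Python with the OpenAI client, since llama_cpp.server exposes an OpenAI-compatible API (a quick check, assuming the openai package is installed):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Qwen3-4B-Function-Calling-Pro",
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    max_tokens=100
)
print(resp.choices[0].message.content)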
Codex Integration Example
# codex_integration.py
import requests
import json
class CodexClient:
    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url
        self.session = requests.Session()
    
    def chat_completion(self, messages, tools=None, temperature=0.7):
        """Send chat completion request to Codex"""
        payload = {
            "model": "Qwen3-4B-Function-Calling-Pro",
            "messages": messages,
            "temperature": temperature,
            "max_tokens": 512,
            "stop": ["<|im_end|>", "<|im_start|>"]
        }
        
        if tools:
            payload["tools"] = tools
        
        response = self.session.post(
            f"{self.base_url}/v1/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"}
        )
        
        return response.json()
    
    def extract_tool_calls(self, response):
        """Extract tool calls from Codex response"""
        tool_calls = []
        if "choices" in response and len(response["choices"]) > 0:
            message = response["choices"][0]["message"]
            if "tool_calls" in message:
                tool_calls = message["tool_calls"]
        return tool_calls
# Usage with Codex
codex = CodexClient()
# Define tools for Codex
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]
# Send request
messages = [{"role": "user", "content": "What's the weather in London?"}]
response = codex.chat_completion(messages, tools=tools)
tool_calls = codex.extract_tool_calls(response)
print(f"Response: {response}")
print(f"Tool calls: {tool_calls}")
Docker Setup for Codex
Create a Dockerfile for easy deployment:
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install llama-cpp-python with server support
RUN pip install llama-cpp-python[server]
# Copy model and scripts
COPY . .
# Expose port
EXPOSE 8000
# Start server
CMD ["python", "-m", "llama_cpp.server", \
     "--model", "Qwen3-4B-Function-Calling-Pro.gguf", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--n_ctx", "2048"]
Build and run:
docker build -t qwen3-codex-server .
docker run -p 8000:8000 qwen3-codex-server
Advanced Usage
Custom Tool Calling Class
import json
import re

from llama_cpp import Llama

class Qwen3ToolCalling:
    def __init__(self, model_path):
        self.llm = Llama(
            model_path=model_path,
            n_ctx=2048,
            n_threads=8,
            verbose=False
        )

    def chat(self, message, system_message=None):
        # Build prompt with the model's ChatML-style template
        prompt_parts = []
        if system_message:
            prompt_parts.append(f"<|im_start|>system\n{system_message}<|im_end|>")
        prompt_parts.append(f"<|im_start|>user\n{message}<|im_end|>")
        prompt_parts.append("<|im_start|>assistant\n")

        formatted_prompt = "\n".join(prompt_parts)

        # Generate response (sampling settings are passed per call)
        response = self.llm(
            formatted_prompt,
            max_tokens=512,
            stop=["<|im_end|>", "<|im_start|>"],
            temperature=0.7
        )

        response_text = response['choices'][0]['text']
        tool_calls = self.extract_tool_calls(response_text)

        return {
            'response': response_text,
            'tool_calls': tool_calls
        }

    def extract_tool_calls(self, text):
        """Parse JSON tool-call arrays out of the raw model output"""
        tool_calls = []
        for match in re.findall(r'\[.*?\]', text, re.DOTALL):
            try:
                parsed = json.loads(match)
            except json.JSONDecodeError:
                continue
            if isinstance(parsed, list):
                tool_calls.extend(
                    item for item in parsed
                    if isinstance(item, dict) and 'name' in item
                )
        return tool_calls
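A short usage example for the class above:
bot = Qwen3ToolCalling("Qwen3-4B-Function-Calling-Pro.gguf")
result = bot.chat(
    "Get the weather for Tokyo",
    system_message="You can call get_weather(q) when the user asks about weather."
)
print(result['tool_calls'])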
Performance
System Requirements
| Component | Minimum | Recommended | 
|---|---|---|
| RAM | 6GB | 8GB+ | 
| Storage | 5GB | 10GB+ | 
| CPU | 4 cores | 8+ cores | 
| GPU | Optional | NVIDIA RTX 3060+ | 
Benchmarks
- Inference Speed: ~75-100 tokens/second (CPU)
- Memory Usage: ~4GB RAM
- Model Size: 3.99GB (Q8_0 quantized)
- Context Length: 262K tokens
- Function Call Accuracy: 94%+ on test set
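Actual throughput depends on CPU, thread count, and quantization; a minimal sketch for measuring tokens/second on your own hardware (the prompt is arbitrary):
import time
from llama_cpp import Llama

llm = Llama(model_path="Qwen3-4B-Function-Calling-Pro.gguf", n_ctx=2048, n_threads=8, verbose=False)

start = time.time()
out = llm("Explain what a GGUF file is.", max_tokens=128)
elapsed = time.time() - start

# The completion response reports how many tokens were generated
generated = out['usage']['completion_tokens']
print(f"{generated / elapsed:.1f} tokens/second")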
Use Cases
- AI Agents - Building intelligent agents that can use tools
- Local Coding Assistants - Function calling without cloud dependencies
- API Integration - Seamless tool orchestration
- Privacy-Sensitive Development - 100% local processing
- Learning Function Calling - Educational purposes
Model Architecture
Special Tokens
The model includes specialized tokens for tool calling:
- <tool_call> - Start of tool call
- </tool_call> - End of tool call
- <tool_response> - Start of tool response
- </tool_response> - End of tool response
Chat Template
The model uses a custom chat template optimized for tool calling:
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
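Rather than assembling this template by hand, llama-cpp-python can apply the chat template itself through create_chat_completion; a sketch, assuming the template (chat_template.jinja) is also embedded in the GGUF metadata:
from llama_cpp import Llama

llm = Llama(model_path="Qwen3-4B-Function-Calling-Pro.gguf", n_ctx=2048, verbose=False)

# The library formats the messages with the model's own chat template
result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Get the weather for New York"}
    ],
    max_tokens=200,
    temperature=0.7
)
print(result['choices'][0]['message']['content'])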
Repository Structure
qwen3-4b-toolcall-llamacpp/
├── Qwen3-4B-Function-Calling-Pro.gguf    # Main model file
├── qwen3_toolcalling_example.py          # Complete example
├── quick_start.py                        # Quick start demo
├── codex_integration.py                  # Codex integration example
├── run_model.sh                          # Run script for llama-cpp
├── install.sh                            # Installation script
├── requirements.txt                      # Python dependencies
├── README.md                             # This file
├── config.json                           # Model configuration
├── tokenizer_config.json                 # Tokenizer configuration
├── special_tokens_map.json               # Special tokens mapping
├── added_tokens.json                     # Added tokens
├── chat_template.jinja                   # Chat template
├── Dockerfile                            # Docker configuration
├── docker-compose.yml                    # Docker Compose setup
└── .gitignore                            # Git ignore file
Citation
@misc{Manojb/Qwen3-4b-toolcall-gguf-llamacpp-codex,
  title={Qwen3-4B-toolcalling-gguf-codex: Local Function Calling},
  author={Manojb},
  year={2025},
  url={https://huggingface.co/Manojb/Qwen3-4b-toolcall-gguf-llamacpp-codex}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
Related Projects
- llama-cpp-python - Python bindings for llama.cpp
- Qwen3 - Base model
- xlam-function-calling-60k - Training dataset
Built with ❤️ for the developer community