Qwen3-4B Tool Calling with llama-cpp-python
A specialized 4B parameter model fine-tuned for function calling and tool usage, optimized for local deployment with llama-cpp-python.
Features
- 4B Parameters - Sweet spot for local deployment
- Function Calling - Fine-tuned on 60K function calling examples
- GGUF Format - Optimized for CPU/GPU inference
- 3.99GB Download - Fits on any modern system
- 262K Context - Large context window for complex tasks
- Low VRAM - Runs the full context window within 6GB of VRAM
Model Details
- Base Model: Qwen3-4B-Instruct-2507
- Fine-tuning: LoRA on Salesforce xlam-function-calling-60k dataset
- Quantization: Q8_0 (8-bit) for optimal performance/size ratio
- Architecture: Qwen3 with specialized tool calling tokens
- License: Apache 2.0
Installation
Quick Install
# Clone the repository
git clone https://huggingface.co/Manojb/qwen3-4b-toolcall-gguf-llamacpp-codex
cd qwen3-4b-toolcall-gguf-llamacpp-codex
# Run the installation script
./install.sh
Manual Installation
Prerequisites
- Python 3.8+
- 6GB+ RAM (8GB+ recommended)
- 5GB+ free disk space
Install Dependencies
pip install -r requirements.txt
Download Model
# Download the model file
huggingface-cli download Manojb/qwen3-4b-toolcall-gguf-llamacpp-codex Qwen3-4B-Function-Calling-Pro.gguf
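If you prefer to script the download, the same file can be fetched from Python with huggingface_hub (a minimal sketch; it assumes the huggingface_hub package is installed and returns the path of the cached file):
from huggingface_hub import hf_hub_download

# Download (or reuse from the local cache) the GGUF file and get its path
model_path = hf_hub_download(
    repo_id="Manojb/qwen3-4b-toolcall-gguf-llamacpp-codex",
    filename="Qwen3-4B-Function-Calling-Pro.gguf"
)
print(model_path)  # pass this path to Llama(model_path=...)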
Alternative: Install with specific llama-cpp-python build
For better performance, you can install llama-cpp-python with specific optimizations:
# For CPU-only (default)
pip install llama-cpp-python
# For CUDA support (if you have an NVIDIA GPU; recent llama-cpp-python releases)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
# (older releases used CMAKE_ARGS="-DLLAMA_CUBLAS=on")
# For OpenBLAS support
CMAKE_ARGS="-DGGML_BLAS=on -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
Quick Start
Option 1: Using the Run Script
# Interactive mode (default)
./run_model.sh
# or
source ./run_model.sh
# Start Codex server
./run_model.sh server
# or
source ./run_model.sh server
# Show help
./run_model.sh help
# or
source ./run_model.sh help
Option 2: Direct Python Usage
from llama_cpp import Llama
# Load the model
llm = Llama(
    model_path="Qwen3-4B-Function-Calling-Pro.gguf",
    n_ctx=2048,
    n_threads=8
)
# Simple chat (sampling settings such as temperature are passed per call)
response = llm(
    "What's the weather like in London?",
    max_tokens=200,
    temperature=0.7
)
print(response['choices'][0]['text'])
Option 3: Quick Start Demo
python3 quick_start.py
Tool Calling Example
import json
import re
from llama_cpp import Llama
def extract_tool_calls(text):
    """Extract tool calls from the model response"""
    tool_calls = []
    # Simple heuristic: find bracketed JSON arrays (re.DOTALL lets them span lines)
    json_pattern = r'\[.*?\]'
    matches = re.findall(json_pattern, text, re.DOTALL)
    
    for match in matches:
        try:
            parsed = json.loads(match)
            if isinstance(parsed, list):
                for item in parsed:
                    if isinstance(item, dict) and 'name' in item:
                        tool_calls.append(item)
        except json.JSONDecodeError:
            continue
    return tool_calls
# Initialize model
llm = Llama(
    model_path="Qwen3-4B-Function-Calling-Pro.gguf",
    n_ctx=2048
)
# Chat with tool calling
prompt = "Get the weather for New York"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
response = llm(formatted_prompt, max_tokens=200, temperature=0.7, stop=["<|im_end|>", "<|im_start|>"])
response_text = response['choices'][0]['text']
# Extract tool calls
tool_calls = extract_tool_calls(response_text)
print(f"Tool calls: {tool_calls}")
Examples
1. Weather Tool Calling
# The model will generate:
# [{"name": "get_weather", "arguments": {"q": "London"}}]
2. Hotel Search
# The model will generate:
# [{"name": "search_stays", "arguments": {"check_in": "2023-04-01", "check_out": "2023-04-08", "city": "Paris"}}]
3. Flight Booking
# The model will generate:
# [{"name": "flights_search", "arguments": {"q": "New York to Tokyo"}}]
4. News Search
# The model will generate:
# [{"name": "search_news", "arguments": {"q": "AI", "gl": "us"}}]
Codex Integration
Setting up Codex Server
To use this model with Codex, you need to run a local server that Codex can connect to:
1. Install llama-cpp-python with server support
pip install "llama-cpp-python[server]"
2. Start the Codex-compatible server
python -m llama_cpp.server \
    --model Qwen3-4B-Function-Calling-Pro.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --n_ctx 2048 \
    --n_threads 8
Sampling settings such as temperature are passed per request rather than as server flags.
3. Configure Codex to use the local server
In your Codex configuration, set:
- Server URL: http://localhost:8000
- API Key: (not required for local server)
- Model: Qwen3-4B-Function-Calling-Pro
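Before pointing Codex at it, you can verify the server from Python with the OpenAI client, since llama_cpp.server exposes an OpenAI-compatible API (a quick check, assuming the openai package is installed):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Qwen3-4B-Function-Calling-Pro",
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    max_tokens=100
)
print(resp.choices[0].message.content)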
Codex Integration Example
# codex_integration.py
import requests
import json
class CodexClient:
    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url
        self.session = requests.Session()
    
    def chat_completion(self, messages, tools=None, temperature=0.7):
        """Send chat completion request to Codex"""
        payload = {
            "model": "Qwen3-4B-Function-Calling-Pro",
            "messages": messages,
            "temperature": temperature,
            "max_tokens": 512,
            "stop": ["<|im_end|>", "<|im_start|>"]
        }
        
        if tools:
            payload["tools"] = tools
        
        response = self.session.post(
            f"{self.base_url}/v1/chat/completions",
            json=payload,
            headers={"Content-Type": "application/json"}
        )
        
        return response.json()
    
    def extract_tool_calls(self, response):
        """Extract tool calls from Codex response"""
        tool_calls = []
        if "choices" in response and len(response["choices"]) > 0:
            message = response["choices"][0]["message"]
            if "tool_calls" in message:
                tool_calls = message["tool_calls"]
        return tool_calls
# Usage with Codex
codex = CodexClient()
# Define tools for Codex
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]
# Send request
messages = [{"role": "user", "content": "What's the weather in London?"}]
response = codex.chat_completion(messages, tools=tools)
tool_calls = codex.extract_tool_calls(response)
print(f"Response: {response}")
print(f"Tool calls: {tool_calls}")
Docker Setup for Codex
Create a Dockerfile for easy deployment:
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install llama-cpp-python with server support
RUN pip install llama-cpp-python[server]
# Copy model and scripts
COPY . .
# Expose port
EXPOSE 8000
# Start server
CMD ["python", "-m", "llama_cpp.server", \
     "--model", "Qwen3-4B-Function-Calling-Pro.gguf", \
     "--host", "0.0.0.0", \
     "--port", "8000", \
     "--n_ctx", "2048"]
Build and run:
docker build -t qwen3-codex-server .
docker run -p 8000:8000 qwen3-codex-server
Advanced Usage
Custom Tool Calling Class
import json
import re

from llama_cpp import Llama

class Qwen3ToolCalling:
    def __init__(self, model_path):
        self.llm = Llama(
            model_path=model_path,
            n_ctx=2048,
            n_threads=8,
            verbose=False
        )

    def chat(self, message, system_message=None):
        # Build prompt with the model's ChatML-style template
        prompt_parts = []
        if system_message:
            prompt_parts.append(f"<|im_start|>system\n{system_message}<|im_end|>")
        prompt_parts.append(f"<|im_start|>user\n{message}<|im_end|>")
        prompt_parts.append("<|im_start|>assistant\n")

        formatted_prompt = "\n".join(prompt_parts)

        # Generate response (sampling settings are passed per call)
        response = self.llm(
            formatted_prompt,
            max_tokens=512,
            stop=["<|im_end|>", "<|im_start|>"],
            temperature=0.7
        )

        response_text = response['choices'][0]['text']
        tool_calls = self.extract_tool_calls(response_text)

        return {
            'response': response_text,
            'tool_calls': tool_calls
        }

    def extract_tool_calls(self, text):
        """Parse JSON tool-call arrays out of the raw model output"""
        tool_calls = []
        for match in re.findall(r'\[.*?\]', text, re.DOTALL):
            try:
                parsed = json.loads(match)
            except json.JSONDecodeError:
                continue
            if isinstance(parsed, list):
                tool_calls.extend(
                    item for item in parsed
                    if isinstance(item, dict) and 'name' in item
                )
        return tool_calls
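A short usage example for the class above:
bot = Qwen3ToolCalling("Qwen3-4B-Function-Calling-Pro.gguf")
result = bot.chat(
    "Get the weather for Tokyo",
    system_message="You can call get_weather(q) when the user asks about weather."
)
print(result['tool_calls'])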
Performance
System Requirements
| Component | Minimum | Recommended | 
|---|---|---|
| RAM | 6GB | 8GB+ | 
| Storage | 5GB | 10GB+ | 
| CPU | 4 cores | 8+ cores | 
| GPU | Optional | NVIDIA RTX 3060+ | 
Benchmarks
- Inference Speed: ~75-100 tokens/second (CPU)
- Memory Usage: ~4GB RAM
- Model Size: 3.99GB (Q8_0 quantized)
- Context Length: 262K tokens
- Function Call Accuracy: 94%+ on test set
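Actual throughput depends on CPU, thread count, and quantization; a minimal sketch for measuring tokens/second on your own hardware (the prompt is arbitrary):
import time
from llama_cpp import Llama

llm = Llama(model_path="Qwen3-4B-Function-Calling-Pro.gguf", n_ctx=2048, n_threads=8, verbose=False)

start = time.time()
out = llm("Explain what a GGUF file is.", max_tokens=128)
elapsed = time.time() - start

# The completion response reports how many tokens were generated
generated = out['usage']['completion_tokens']
print(f"{generated / elapsed:.1f} tokens/second")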
Use Cases
- AI Agents - Building intelligent agents that can use tools
- Local Coding Assistants - Function calling without cloud dependencies
- API Integration - Seamless tool orchestration
- Privacy-Sensitive Development - 100% local processing
- Learning Function Calling - Educational purposes
Model Architecture
Special Tokens
The model includes specialized tokens for tool calling:
- <tool_call> - Start of tool call
- </tool_call> - End of tool call
- <tool_response> - Start of tool response
- </tool_response> - End of tool response
Chat Template
The model uses a custom chat template optimized for tool calling:
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
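Rather than assembling this template by hand, llama-cpp-python can apply the chat template itself through create_chat_completion; a sketch, assuming the template (chat_template.jinja) is also embedded in the GGUF metadata:
from llama_cpp import Llama

llm = Llama(model_path="Qwen3-4B-Function-Calling-Pro.gguf", n_ctx=2048, verbose=False)

# The library formats the messages with the model's own chat template
result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Get the weather for New York"}
    ],
    max_tokens=200,
    temperature=0.7
)
print(result['choices'][0]['message']['content'])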
Repository Structure
qwen3-4b-toolcall-llamacpp/
├── Qwen3-4B-Function-Calling-Pro.gguf    # Main model file
├── qwen3_toolcalling_example.py          # Complete example
├── quick_start.py                        # Quick start demo
├── codex_integration.py                  # Codex integration example
├── run_model.sh                          # Run script for llama-cpp
├── install.sh                            # Installation script
├── requirements.txt                      # Python dependencies
├── README.md                             # This file
├── config.json                           # Model configuration
├── tokenizer_config.json                 # Tokenizer configuration
├── special_tokens_map.json               # Special tokens mapping
├── added_tokens.json                     # Added tokens
├── chat_template.jinja                   # Chat template
├── Dockerfile                            # Docker configuration
├── docker-compose.yml                    # Docker Compose setup
└── .gitignore                            # Git ignore file
Citation
@misc{Manojb/Qwen3-4b-toolcall-gguf-llamacpp-codex,
  title={Qwen3-4B-toolcalling-gguf-codex: Local Function Calling},
  author={Manojb},
  year={2025},
  url={https://huggingface.co/Manojb/Qwen3-4b-toolcall-gguf-llamacpp-codex}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
Related Projects
- llama-cpp-python - Python bindings for llama.cpp
- Qwen3 - Base model
- xlam-function-calling-60k - Training dataset
Built with ❤️ for the developer community