TinyLlama Function Calling (CPU Optimized)

This is a CPU-optimized version of TinyLlama-1.1B-Chat-v1.0, fine-tuned for function calling.

Model Details

  • Base Model: TinyLlama-1.1B-Chat-v1.0
  • Parameters: 1.1 billion
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Training Data: Function calling examples from Glaive Function Calling v2 dataset
  • Optimization: Merged LoRA weights, converted to float32 for CPU deployment

Key Features

  1. Function Calling Capabilities: Identifies when functions should be called and generates the appropriate function call syntax
  2. CPU Optimized: Runs efficiently on low-end hardware without a GPU
  3. Lightweight: At only 1.1B parameters, it fits on older hardware
  4. Low Resource Requirements: Needs only 4-6 GB of RAM to load

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the merged model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("tinyllama-function-calling-cpu-optimized")
tokenizer = AutoTokenizer.from_pretrained("tinyllama-function-calling-cpu-optimized")

# Example prompt for function calling
prompt = """### Instruction:
Given the available functions and the user query, determine which function(s) to call and with what arguments.

Available functions:
{
    "name": "get_exchange_rate",
    "description": "Get the exchange rate between two currencies",
    "parameters": {
        "type": "object",
        "properties": {
            "base_currency": {
                "type": "string",
                "description": "The currency to convert from"
            },
            "target_currency": {
                "type": "string",
                "description": "The currency to convert to"
            }
        },
        "required": [
            "base_currency",
            "target_currency"
        ]
    }
}

User query: What is the exchange rate from USD to EUR?

### Response:"""

# Tokenize and generate response
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
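The model emits its answer after the ### Response: marker, typically as a JSON function call. The sketch below shows one way to pull that call out of the generated text; it assumes the model returns a single JSON object with "name" and "arguments" keys (the Glaive-style format), which may not hold for every generation:

import json
import re

# Decode only the newly generated tokens, skipping the prompt
generated_ids = outputs[0][inputs["input_ids"].shape[1]:]
generated = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

# Greedily match from the first { to the last }
# (assumes a single, possibly nested, JSON object in the output)
match = re.search(r"\{.*\}", generated, re.DOTALL)
if match:
    try:
        call = json.loads(match.group(0))
        print("Function:", call.get("name"))
        print("Arguments:", call.get("arguments"))
    except json.JSONDecodeError:
        print("Output was not valid JSON:", generated)
else:
    print("No function call found in:", generated)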

Performance on Low-End Hardware

The CPU-optimized model requires approximately:

  • 4-6 GB RAM for loading
  • 2-4 CPU cores for inference (see the thread-pinning sketch at the end of this section)
  • No GPU required

This makes it suitable for:

  • Older laptops (2018 or newer)
  • Low-end desktops
  • Edge devices with ARM processors
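To keep inference inside that 2-4 core budget, you can pin PyTorch's CPU thread pools explicitly. A minimal sketch; the core count of 4 is an assumption, so adjust it to your machine:

import torch

# Limit intra-op parallelism to the cores dedicated to inference
torch.set_num_threads(4)

# Optionally limit inter-op parallelism too; this must be called before
# any inter-op parallel work has started, or PyTorch raises an error
torch.set_num_interop_threads(1)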

Training Process

The model was fine-tuned using LoRA (Low-Rank Adaptation) on the Glaive Function Calling v2 dataset. Only a subset of 50 examples was used for demonstration purposes.
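The training script itself is not included, but the workflow follows the standard PEFT pattern: attach LoRA adapters, fine-tune, then merge the adapters back into the base weights. A minimal sketch assuming the peft library; the rank, alpha, and target modules shown are illustrative, not necessarily the values used for this model:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Attach LoRA adapters (hyperparameters here are illustrative)
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)

# ... fine-tune on the Glaive Function Calling v2 examples ...

# Merge the adapters into the base weights and store in float32 for CPU deployment
merged = model.merge_and_unload()
merged = merged.to(torch.float32)
merged.save_pretrained("tinyllama-function-calling-cpu-optimized")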

License

This model is licensed under the Apache 2.0 license.
