TinyLlama Function Calling (CPU Optimized)
This is a CPU-optimized version of TinyLlama that has been fine-tuned for function calling capabilities.
Model Details
- Base Model: TinyLlama-1.1B-Chat-v1.0
- Parameters: 1.1 billion
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Training Data: Function calling examples from Glaive Function Calling v2 dataset
- Optimization: LoRA weights merged into the base model and converted to float32 for CPU deployment (see the sketch below)
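A typical way to perform this merge-and-convert step uses the peft library; the following is only a sketch, and the adapter path is a placeholder rather than an actual artifact name.

from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# Load the base model in float32 and attach the trained LoRA adapter
base = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float32
)
adapted = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder adapter path

# Fold the LoRA weights into the base weights and save a standalone float32 checkpoint
merged = adapted.merge_and_unload()
merged.save_pretrained("tinyllama-function-calling-cpu-optimized")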
Key Features
- Function Calling Capabilities: The model can identify when functions should be called and generate appropriate function call syntax
- CPU Optimized: Ready to run efficiently on low-end hardware without GPUs
- Lightweight: Only 1.1B parameters, making it suitable for older hardware
- Low Resource Requirements: Requires only 4-6 GB RAM for loading
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the model
model = AutoModelForCausalLM.from_pretrained("tinyllama-function-calling-cpu-optimized")
tokenizer = AutoTokenizer.from_pretrained("tinyllama-function-calling-cpu-optimized")
# Example prompt for function calling
prompt = """### Instruction:
Given the available functions and the user query, determine which function(s) to call and with what arguments.
Available functions:
{
    "name": "get_exchange_rate",
    "description": "Get the exchange rate between two currencies",
    "parameters": {
        "type": "object",
        "properties": {
            "base_currency": {
                "type": "string",
                "description": "The currency to convert from"
            },
            "target_currency": {
                "type": "string",
                "description": "The currency to convert to"
            }
        },
        "required": [
            "base_currency",
            "target_currency"
        ]
    }
}
User query: What is the exchange rate from USD to EUR?
### Response:"""
# Tokenize and generate response
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
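The exact response format is determined by the fine-tuning data; Glaive-style examples typically have the model emit the function call as a JSON object with a name and arguments. The following is a minimal parsing sketch under that assumption, reusing the response string from the example above.

import json
import re

# Keep only the text generated after the prompt's "### Response:" marker
generated = response.split("### Response:")[-1]

# Look for a JSON object in the generated text (the format is an assumption)
match = re.search(r"\{.*\}", generated, re.DOTALL)
if match:
    try:
        call = json.loads(match.group(0))
        print("Function:", call.get("name"))
        print("Arguments:", call.get("arguments"))
    except json.JSONDecodeError:
        print("No valid JSON function call found in:", generated)
else:
    print("No function call detected in:", generated)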
Performance on Low-End Hardware
The CPU-optimized model requires approximately:
- 4-6 GB RAM for loading (see the rough estimate after this list)
- 2-4 CPU cores for inference
- No GPU required
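As a rough sanity check on the RAM figure, the float32 weights alone account for most of it; this is back-of-the-envelope arithmetic, not a measured value.

# 1.1 billion parameters stored as float32 (4 bytes each)
params = 1.1e9
weight_bytes = params * 4
print(f"{weight_bytes / 1e9:.1f} GB")  # ~4.4 GB; tokenizer, activations, and Python overhead add the rest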
This makes it suitable for:
- Older laptops (roughly 2018 onward)
- Low-end desktops
- Edge devices with ARM processors
Training Process
The model was fine-tuned using LoRA (Low-Rank Adaptation) on the Glaive Function Calling v2 dataset. Only a subset of 50 examples was used for demonstration purposes.
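For reference, a LoRA setup of this kind with the peft library typically looks like the sketch below; the rank, alpha, dropout, and target modules shown are illustrative assumptions, not the exact values used for this model.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model to be adapted
base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Hyperparameters below are illustrative, not the values used for this checkpoint
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Wrap the base model so only the low-rank adapter weights are trainable
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()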
License
This model is licensed under the Apache 2.0 license.