DeepSeek-R1-Distill-Llama-8B-Stateful-CoreML
This repository contains a CoreML conversion of the DeepSeek-R1-Distill-Llama-8B model, optimized for Apple Silicon devices. The conversion uses stateful key-value (KV) caching for efficient text generation.
Model Description
DeepSeek-R1-Distill-Llama-8B is an 8-billion-parameter language model from the DeepSeek-AI team. It is built on the Llama architecture and was distilled from the larger DeepSeek-R1 model to retain much of its performance at a reduced parameter count.
This CoreML conversion provides:
- Full compatibility with Apple Silicon devices (M1, M2, M3 series)
- Stateful inference with KV caching for efficient text generation
- Optimized performance for on-device deployment
 
Technical Specifications
- Base Model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- Parameters: 8 billion
- Context Length: Configurable (default: 64 tokens, expandable based on memory constraints)
- Precision: FP16
- File Format: .mlpackage
- Deployment Target: macOS 15+
- Architecture: Stateful LLM with key-value caching
- Input Features: Flexible input size with dynamic shape handling
 
Key Features
- Stateful Inference: The model implements a custom SliceUpdateKeyValueCache to maintain conversation state between inference calls, significantly improving generation speed (a sketch of the idea follows this list).
- Dynamic Input Shapes: Variable input lengths are supported through a RangeDim specification.
- Optimized Memory Usage: The key-value cache is managed efficiently to minimize the memory footprint.
 
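The slice-update idea is what makes the stateful design work: instead of concatenating new keys and values onto a growing tensor, each call writes them into a fixed, preallocated buffer at the current position, so tensor shapes stay constant across calls. A minimal PyTorch sketch of the concept (shapes, argument names, and the omitted Transformers Cache plumbing are assumptions, not the exact implementation):

```python
import torch

class SliceUpdateKeyValueCache(torch.nn.Module):
    """Preallocated KV cache updated in place via slice assignment."""

    def __init__(self, num_layers, num_kv_heads, max_context, head_dim):
        super().__init__()
        shape = (num_layers, 1, num_kv_heads, max_context, head_dim)
        # Buffers registered this way can be exposed as CoreML state
        self.register_buffer("key_cache", torch.zeros(shape, dtype=torch.float16))
        self.register_buffer("value_cache", torch.zeros(shape, dtype=torch.float16))

    def update(self, keys, values, layer_idx, position):
        # Write the new slice at `position` rather than concatenating,
        # so the buffer's shape never changes between calls
        end = position + keys.shape[-2]
        self.key_cache[layer_idx, :, :, position:end] = keys
        self.value_cache[layer_idx, :, :, position:end] = values
        # Return the valid prefix for this layer's attention computation
        return (self.key_cache[layer_idx, :, :, :end],
                self.value_cache[layer_idx, :, :, :end])
```
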
Implementation Details
This conversion utilizes:
- A custom KvCacheStateLlamaForCausalLM wrapper around the Hugging Face Transformers implementation (a structural sketch follows this list)
- CoreML's state management capabilities for maintaining KV caches between inference calls
- Registration of the cache tensors as PyTorch buffers (register_buffer) so they persist as CoreML state
- Dynamic tensor shapes to accommodate various input and context lengths
 
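Putting these pieces together, the wrapper owns the Hugging Face model and the cache buffers and exposes a trace-friendly forward pass. A hedged structural sketch, reusing the SliceUpdateKeyValueCache sketch above (constructor arguments and the adaptation of the cache to the Transformers Cache interface are assumptions):

```python
import torch
from transformers import LlamaForCausalLM

class KvCacheStateLlamaForCausalLM(torch.nn.Module):
    """Wraps the HF model so the KV cache lives in registered buffers."""

    def __init__(self, model_id, max_context=64):
        super().__init__()
        self.model = LlamaForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.float16
        )
        cfg = self.model.config
        self.kv_cache = SliceUpdateKeyValueCache(
            cfg.num_hidden_layers,
            cfg.num_key_value_heads,
            max_context,
            cfg.hidden_size // cfg.num_attention_heads,
        )

    def forward(self, input_ids, causal_mask):
        # The real wrapper adapts the cache to the Transformers Cache
        # interface; shown here as a plain pass-through for brevity
        out = self.model(input_ids=input_ids,
                         attention_mask=causal_mask,
                         past_key_values=self.kv_cache,
                         use_cache=True)
        return out.logits
```
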
Usage
The model can be loaded and used with CoreML in your Swift or Python projects. A minimal Python example (the token IDs and mask shapes below are illustrative):

```python
import numpy as np
import coremltools as ct

# Load the model
model = ct.models.MLModel("DeepSeek-R1-Distill-Llama-8B.mlpackage")

# Create a fresh KV-cache state for this generation session
state = model.make_state()

# Prepare inputs for inference
input_ids = np.array([[128000, 9906]], dtype=np.int32)  # token IDs, shape (1, seq)
causal_mask = np.zeros((1, 1, 2, 2), dtype=np.float16)  # causal attention mask

# Run inference; the KV cache is updated in place inside `state`
output = model.predict({
    "inputIds": input_ids,
    "causalMask": causal_mask
}, state)
```
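Because the cache persists inside the state object, multi-token generation only needs to feed the newly sampled token at each step. A hedged sketch of a greedy decoding loop (the output name `logits` and the additive 0/-inf mask convention are assumptions about this conversion's interface):

```python
import numpy as np

def causal_mask(seq_len, total_len):
    """Additive mask: query i may attend to keys 0..(total_len - seq_len + i)."""
    mask = np.full((1, 1, seq_len, total_len), -np.inf, dtype=np.float16)
    past = total_len - seq_len
    for i in range(seq_len):
        mask[0, 0, i, : past + i + 1] = 0.0
    return mask

def generate(model, prompt_ids, max_new_tokens=32):
    """Greedy decoding that reuses one CoreML state across steps."""
    state = model.make_state()
    tokens = list(prompt_ids)
    ids = np.array([prompt_ids], dtype=np.int32)      # first call: full prompt
    for _ in range(max_new_tokens):
        mask = causal_mask(ids.shape[1], len(tokens))
        out = model.predict({"inputIds": ids, "causalMask": mask}, state)
        next_id = int(out["logits"][0, -1].argmax())  # greedy pick
        tokens.append(next_id)
        ids = np.array([[next_id]], dtype=np.int32)   # later calls: one token
    return tokens
```
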
Conversion Process
The model was converted using CoreML Tools with the following steps (a condensed sketch follows the list):
- Loading the original model from Hugging Face
- Wrapping it with custom state management
- Tracing with PyTorch's JIT
- Converting to CoreML format with state specifications
- Saving in the .mlpackage format
 
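A condensed sketch of that pipeline, reusing the wrapper sketched above (input, output, and state names are assumptions; in a real conversion the state names must match the buffer names registered in the traced module):

```python
import numpy as np
import torch
import coremltools as ct

MAX_CONTEXT = 64  # matches the default context length above

# 1. Load the original model from Hugging Face and wrap it
wrapped = KvCacheStateLlamaForCausalLM(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", max_context=MAX_CONTEXT
).eval()

# 2. Trace with PyTorch's JIT using example inputs
example_ids = torch.zeros((1, 2), dtype=torch.int32)
example_mask = torch.zeros((1, 1, 2, 2), dtype=torch.float16)
traced = torch.jit.trace(wrapped, (example_ids, example_mask))

# 3. Convert with dynamic shapes (RangeDim) and state specifications
seq = ct.RangeDim(1, MAX_CONTEXT)
kv_shape = tuple(wrapped.kv_cache.key_cache.shape)
mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="inputIds", shape=(1, seq), dtype=np.int32),
        ct.TensorType(name="causalMask",
                      shape=(1, 1, seq, ct.RangeDim(1, MAX_CONTEXT)),
                      dtype=np.float16),
    ],
    outputs=[ct.TensorType(name="logits", dtype=np.float16)],
    states=[
        # State names must match the registered buffer names
        ct.StateType(wrapped_type=ct.TensorType(shape=kv_shape, dtype=np.float16),
                     name="kv_cache.key_cache"),
        ct.StateType(wrapped_type=ct.TensorType(shape=kv_shape, dtype=np.float16),
                     name="kv_cache.value_cache"),
    ],
    minimum_deployment_target=ct.target.macOS15,
)

# 4. Save in the .mlpackage format
mlmodel.save("DeepSeek-R1-Distill-Llama-8B.mlpackage")
```
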
Requirements
To use this model:
- Apple Silicon Mac (M1/M2/M3 series)
- macOS 15 or later
- At least 16 GB of RAM recommended
 
Limitations
- The model requires significant memory for inference, especially with longer contexts
- Performance is highly dependent on the device's Neural Engine capabilities
- The default configuration supports a context length of 64 tokens; this can be increased at conversion time
 
License
This model conversion inherits the license of the original DeepSeek-R1-Distill-Llama-8B model.
Acknowledgments
- DeepSeek-AI for creating and releasing the original model
- Hugging Face for hosting the model and providing the Transformers library
- Apple for developing the CoreML framework
 
Citation
If you use this model in your research, please cite both the original DeepSeek model and this conversion.