---
license: apache-2.0
datasets:
  - yarenty/datafusion_QA
base_model:
  - Qwen/Qwen2.5-3B-Instruct
tags:
  - rust
  - datafusion
  - gguf
  - small
  - qwen
---

Qwen2.5-3B-DataFusion-Instruct Quantized Model

Model Card: Quantized Version

Model Name: Qwen2.5-3B-DataFusion-Instruct (Quantized)
File: qwen2.5-3B-datafusion.gguf
Size: 1.8GB
Type: Quantized GGUF Model
Base Model: Qwen2.5-3B-Instruct
Specialization: DataFusion SQL Engine and Rust Programming
License: Apache 2.0

Model Overview

This is the quantized version of the Qwen2.5-3B-DataFusion-Instruct model, optimized for production deployment and resource-constrained environments. The quantization process reduces memory usage while maintaining high accuracy for DataFusion and Rust programming tasks.

Quantization Details

Quantization Method

  • Format: GGUF (GGML Universal Format)
  • Quantization Level: Q4-class (4-bit) weights, inferred from the 1.8GB file size of a 3B-parameter model; the exact quant type is not specified
  • Precision: Reduced from full-precision (16-bit) weights to a lower-bit quantized representation
  • Memory Reduction: ~69%, from 5.8GB to 1.8GB ((5.8 - 1.8) / 5.8 ≈ 0.69)

Performance Characteristics

  • Inference Speed: Faster than the full-precision model
  • Memory Usage: Significantly reduced memory footprint
  • Accuracy: Minimal degradation in specialized domain knowledge
  • Deployment: Optimized for production environments

Technical Specifications

Model Architecture

  • Base Architecture: Qwen2.5-3B transformer model
  • Fine-tuning: Specialized on DataFusion ecosystem data
  • Context Handling: Optimized for technical Q&A format
  • Output Format: Structured responses with stop sequences (see the prompt-format sketch below)
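
Qwen2.5-family models use the ChatML prompt format, with `<|im_end|>` serving as the stop sequence. A minimal sketch of how a Q&A turn is framed, assuming this fine-tune kept the base chat template:

```python
# Hedged sketch: ChatML-style prompt framing used by Qwen2.5-family
# models. Assumes this fine-tune kept the base chat template.
prompt = (
    "<|im_start|>system\n"
    "You are a DataFusion and Rust assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "How do I create a SessionContext in DataFusion?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
# Generation then stops at "<|im_end|>", matching the
# "structured responses with stop sequences" behavior above.
```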

Inference Parameters

  • Temperature: 0.7 (balances creativity and consistency)
  • Top-p: 0.9 (nucleus sampling for output quality)
  • Repeat Penalty: 1.2 (discourages repetitive output)
  • Max Tokens: 1024 (caps response length; see the usage sketch below)
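
These parameters map directly onto common GGUF runtimes. A minimal sketch, assuming the llama-cpp-python bindings (any GGUF-capable runtime works) and a local copy of the model file:

```python
# Minimal sketch: running the quantized GGUF with the llama-cpp-python
# bindings (pip install llama-cpp-python). The path, n_ctx, and stop
# token are assumptions; adjust to your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-3B-datafusion.gguf",  # quantized file from this repo
    n_ctx=4096,  # context window (assumption; tune to your workload)
)

output = llm(
    "How do I register a CSV file as a table in DataFusion?",
    temperature=0.7,      # balanced creativity vs. consistency
    top_p=0.9,            # nucleus sampling for quality
    repeat_penalty=1.2,   # discourages repetitive output
    max_tokens=1024,      # controlled response length
    stop=["<|im_end|>"],  # Qwen2.5 ChatML stop token (assumed unchanged)
)
print(output["choices"][0]["text"])
```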

Performance Metrics

Memory Efficiency

  • Original Size: 5.8GB
  • Quantized Size: 1.8GB
  • Memory Reduction: 69%
  • RAM Usage: Significantly lower during inference

Speed Improvements

  • Inference Speed: 20-40% faster than the full-precision model
  • Loading Time: Reduced model loading time
  • Response Generation: Faster token generation
  • Batch Processing: Improved throughput

Accuracy Trade-offs

  • Domain Knowledge: Maintained DataFusion expertise
  • Code Generation: High-quality Rust and SQL output
  • Technical Explanations: Clear and accurate responses
  • Edge Cases: Slight degradation in complex scenarios

Deployment Guidelines

System Requirements

  • Minimum RAM: 4GB (vs. 8GB+ for the full-precision model)
  • CPU: Modern multi-core processor
  • Storage: 2GB available space
  • OS: Linux, macOS, or Windows

Recommended Configurations

  • Development: 8GB RAM, modern CPU
  • Production: 16GB+ RAM, dedicated CPU cores
  • High-Throughput: 32GB+ RAM, GPU acceleration (optional)

Integration Options

  • Ollama: Native support with optimized performance
  • llama.cpp: Direct GGUF file usage
  • Custom Applications: REST API integration (see the sketch after this list)
  • Batch Processing: High-volume inference pipelines
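
For the REST route, one common setup is Ollama's local HTTP API. A minimal sketch, assuming the GGUF has already been imported under the hypothetical local tag `qwen2.5-3b-datafusion` (e.g. via a Modelfile and `ollama create`):

```python
# Minimal sketch: calling the model through Ollama's local REST API.
# The tag "qwen2.5-3b-datafusion" is a hypothetical local name.
import json
import urllib.request

payload = {
    "model": "qwen2.5-3b-datafusion",  # hypothetical local tag
    "prompt": "Explain DataFusion logical vs. physical plans.",
    "stream": False,
    "options": {  # mirrors the recommended inference parameters
        "temperature": 0.7,
        "top_p": 0.9,
        "repeat_penalty": 1.2,
        "num_predict": 1024,
    },
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```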

Comparison with Full Model

| Metric | Quantized Model | Full Model |
| --- | --- | --- |
| File Size | 1.8GB | 5.8GB |
| Memory Usage | Lower | Higher |
| Inference Speed | Faster | Standard |
| Accuracy | High | Highest |
| Deployment | Production-ready | Development/Production |
| Resource Efficiency | High | Standard |

Best Practices

For Production Use

  1. Load Testing: Validate performance under expected load
  2. Memory Monitoring: Track RAM usage during operation
  3. Response Validation: Implement quality checks for outputs (points 2 and 3 are sketched after this list)
  4. Fallback Strategy: Plan for model switching if needed
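
A minimal sketch of points 2 and 3 combined, assuming psutil for memory tracking and a llama-cpp-python-style model handle; the validation rule is a placeholder for your own checks:

```python
# Minimal sketch: track process RAM around each inference call and
# apply a basic output check. psutil and the validation rule are
# assumptions; substitute your own monitoring stack.
import psutil

def monitored_generate(llm, prompt, **params):
    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    result = llm(prompt, **params)  # llama-cpp-python-style handle
    rss_after = proc.memory_info().rss
    print(f"RSS {rss_after / 2**20:.0f} MiB "
          f"(delta {(rss_after - rss_before) / 2**20:+.0f} MiB)")

    text = result["choices"][0]["text"]
    if not text.strip():  # hypothetical quality check
        raise ValueError("empty response; consider the fallback model")
    return text
```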

For Development

  1. Iterative Testing: Test with various input types
  2. Performance Profiling: Monitor inference times
  3. Quality Assessment: Compare outputs with the full-precision model (a comparison sketch follows this list)
  4. Integration Testing: Validate in target environment
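
For point 3, a minimal comparison harness, assuming two llama-cpp-python-style handles and an illustrative prompt set:

```python
# Minimal sketch: run a fixed prompt set through both the quantized
# and full-precision models and inspect the differences. Prompts and
# model handles are illustrative assumptions.
PROMPTS = [
    "How do I register a Parquet file in DataFusion?",
    "Write a DataFusion scalar UDF in Rust that trims whitespace.",
]

def compare_models(quantized_llm, full_llm, prompts=PROMPTS, **params):
    for prompt in prompts:
        q = quantized_llm(prompt, **params)["choices"][0]["text"]
        f = full_llm(prompt, **params)["choices"][0]["text"]
        print(f"--- {prompt}\n[quantized] {q}\n[full]      {f}\n")
```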

This quantized model provides an excellent balance of performance, accuracy, and resource efficiency, making it ideal for production deployment of DataFusion-specialized AI assistance.