# 🍎 CalorieCLIP: Accurate Food Calorie Estimation

![CalorieCLIP vs Other Models](assets/model_comparison.png)

CalorieCLIP is a fine-tuned CLIP model that estimates calories from food images with state-of-the-art accuracy. It outperforms all tested VLMs (including GPT-4o and Claude) while running entirely on-device.

## 🎯 Key Results

| Metric | Value |
|---|---|
| Mean Absolute Error | 51.4 calories |
| Within 50 calories | 67.6% |
| Within 100 calories | 90.5% |
| Inference Speed | <50ms on M1 Mac |

![Accuracy Breakdown](assets/accuracy_breakdown.png)

## 🍽️ Example Predictions

Real predictions from our validation set across multiple datasets:

| Image | Food | Dataset | Actual (cal) | Predicted (cal) | Error |
|---|---|---|---|---|---|
| Example 1 | Hamburger | Food-101 | 558 | 555 | 3 |
| Example 2 | Ramen | Food-101 | 431 | 437 | 6 |
| Example 3 | Greek Salad | Food-101 | 144 | 143 | 1 |
| Example 4 | Sashimi | Food-101 | 156 | 156 | 0 |
| Example 5 | Cafeteria Meal | Nutrition5k | 88 | 88 | 0 |
| Example 6 | Cafeteria Meal | Nutrition5k | 138 | 138 | 0 |
| Example 7 | Cafeteria Meal | Nutrition5k | 330 | 334 | 4 |
| Example 8 | Cafeteria Meal | Nutrition5k | 214 | 217 | 3 |
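Given (actual, predicted) pairs like those above, the headline metrics (MAE, within-N accuracy) reduce to a few lines. A minimal sketch in plain Python, using the eight curated examples as input:

```python
# (actual, predicted) calorie pairs from the example-predictions table
pairs = [(558, 555), (431, 437), (144, 143), (156, 156),
         (88, 88), (138, 138), (330, 334), (214, 217)]

errors = [abs(actual - pred) for actual, pred in pairs]
mae = sum(errors) / len(errors)                      # mean absolute error
within_50 = sum(e <= 50 for e in errors) / len(errors)

print(f"MAE: {mae:.1f} cal, within 50 cal: {within_50:.0%}")
# → MAE: 2.1 cal, within 50 cal: 100%
```

Note that these eight hand-picked examples score far better than the 51.4-calorie MAE reported over the full validation set; the computation is the same either way.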

## 🚀 Quick Start

### Installation

```bash
pip install open-clip-torch torch pillow
```

### Python Usage

```python
# Clone or download this repo first, then:
from calorie_clip import CalorieCLIP

# Load model from local directory
model = CalorieCLIP.from_pretrained(".")

# Predict calories
calories = model.predict("food_photo.jpg")
print(f"Estimated: {calories:.0f} calories")

# Batch prediction
images = ["breakfast.jpg", "lunch.jpg", "dinner.jpg"]
results = model.predict_batch(images)
```

### Direct Usage (no wrapper)

```python
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load the CLIP backbone and restore the fine-tuned weights
clip, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
checkpoint = torch.load('calorie_clip.pt', map_location='cpu', weights_only=False)
clip.load_state_dict(checkpoint['clip_state'], strict=False)

# Regression head: maps 512-dim CLIP image features to a single calorie value
class RegressionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.net(x)

head = RegressionHead()
head.load_state_dict(checkpoint['regressor_state'])
clip.eval()
head.eval()

# Predict
img = preprocess(Image.open('food.jpg')).unsqueeze(0)
with torch.no_grad():
    features = clip.encode_image(img)
    calories = head(features).item()
print(f"{calories:.0f} calories")
```

### Command Line

```bash
python calorie_clip.py my_food_image.jpg
# Output: my_food_image.jpg: 342 calories
```

## 📊 Training Progress

![Training Progress](assets/training_progress.png)

The model was trained for 30 epochs on the Nutrition5k dataset with:

- Huber loss for robustness to outliers
- Strong augmentation (rotation, color jitter, flips)
- Fine-tuning of the last 2 CLIP transformer blocks (9.4% of parameters)
- Differential learning rates (1e-5 for CLIP, 1e-3 for the regression head)
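The Huber loss mentioned above is quadratic for small residuals and linear beyond a threshold δ, which limits the influence of mislabeled or extreme-calorie samples. A minimal pure-Python sketch (δ = 1.0 is an illustrative choice, not necessarily the value used in training):

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear in the tails."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r            # behaves like MSE for small errors
    return delta * (r - 0.5 * delta)  # behaves like MAE for outliers

print(huber(0.5), huber(10.0))  # → 0.125 9.5
```

In PyTorch this is available as `torch.nn.HuberLoss(delta=...)`, and the differential learning rates are implemented by passing separate parameter groups (CLIP blocks vs. head) to the optimizer, each with its own `lr`.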

## 🔬 Technical Details

### Architecture

```
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐
│   Food Image    │────▶│  CLIP ViT-B  │────▶│  Regression │────▶ Calories
│   (224×224)     │     │   Encoder    │     │    Head     │
└─────────────────┘     │  (fine-tuned)│     │  (4 layers) │
                        └──────────────┘     └─────────────┘
                              │
                              ▼
                        512-dim features
```

### Model Specs

- Base Model: OpenAI CLIP ViT-B/32
- Fine-tuned Layers: Last 2 transformer blocks + regression head
- Trainable Parameters: 9.4% (8.5M of 90M)
- Input Size: 224×224 RGB
- Output: Single float (calories)

### Comparison to VLMs

We tested multiple Vision-Language Models on the same test set:

![Error Distribution](assets/error_distribution.png)

| Model | MAE (calories) | Notes |
|---|---|---|
| CalorieCLIP (Ours) | 51.4 | Local, fast, accurate |
| Claude 3.5 Sonnet | 71.7 | API required |
| GPT-4o | 80.2 | API required |
| Gemini 1.5 Pro | 86.7 | API required |
| GPT-4o-mini | 88.7 | API required |
| Qwen2-VL-7B (Local) | 160.7 | Mode collapse issues |

**Key finding:** All tested local VLMs (Qwen, Pixtral) suffered from mode collapse, outputting the same calorie value for all images. CalorieCLIP's regression approach avoids this entirely.
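A quick sanity check for this failure mode is to look at the spread of a model's predictions over a visually diverse batch: a collapsed model produces near-zero variance. A minimal sketch (the 5.0-calorie threshold is an arbitrary illustrative choice):

```python
from statistics import pstdev

def looks_collapsed(predictions: list[float], min_std: float = 5.0) -> bool:
    """Flag a model whose outputs barely vary across a diverse image batch."""
    return pstdev(predictions) < min_std

print(looks_collapsed([300.0] * 10))                  # → True (constant output)
print(looks_collapsed([95.0, 420.0, 310.0, 180.0]))   # → False (healthy spread)
```

Running a check like this on a handful of held-out images is a cheap way to catch a degenerate model before trusting its aggregate metrics.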

πŸ“ Files

```
CalorieCLIP/
├── config.json           # Model configuration
├── calorie_clip.pt       # Model weights (PyTorch)
├── calorie_clip.py       # Inference code
├── requirements.txt      # Dependencies
└── assets/
    ├── training_progress.png
    ├── model_comparison.png
    ├── accuracy_breakdown.png
    └── error_distribution.png
```

## 📋 Training Data

Trained on a combined dataset of:

- Nutrition5k: 5,006 real cafeteria food images with professional calorie measurements
- Food-101 subset: 8,000+ food images with estimated calories
- Total: 13,004 samples (11,053 train / 1,951 validation)
- Diverse foods: beignets, prime rib, ramen, hamburgers, bruschetta, chicken wings, pork chops, Greek salads, sashimi, and more
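The 11,053 / 1,951 split above corresponds to holding out roughly 15% of the 13,004 combined samples. A minimal sketch of such a split (the seed and the exact 15% ratio are illustrative assumptions, not documented training settings):

```python
import random

samples = list(range(13_004))       # stand-in indices for the combined dataset
random.seed(42)                     # illustrative seed for reproducibility
random.shuffle(samples)

n_val = round(0.15 * len(samples))  # ≈ 15% held out for validation
val, train = samples[:n_val], samples[n_val:]
print(len(train), len(val))  # → 11053 1951
```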

## ⚠️ Limitations

- Trained on cafeteria food; may be less accurate for restaurant or home-cooked meals
- Single-dish focused; complex multi-item plates may have higher error
- Portion-size estimation is inherently challenging from 2D images
- Not a replacement for professional nutrition advice

πŸ™ Citation

```bibtex
@software{calorieclip2024,
  author = {Haplo LLC},
  title = {CalorieCLIP: Accurate Food Calorie Estimation from Images},
  year = {2024},
  url = {https://huggingface.co/jc-builds/CalorieCLIP}
}
```

## 📄 License

MIT License - free for commercial and personal use.


Made with ❤️ by Haplo LLC
