Improve model card: Add metadata, links, and usage example
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,18 +1,73 @@
 ---
-license: cc-by-4.0
-datasets:
-- NingLab/MMECInstruct
 base_model:
 - meta-llama/Llama-3.2-3B-Instruct
+datasets:
+- NingLab/MMECInstruct
+license: cc-by-4.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---

 # CASLIE-S

-This repo contains the models for "Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data"
+This repo contains the models for "[Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data](https://huggingface.co/papers/2410.17337)".
+
+- 📄 [Paper](https://huggingface.co/papers/2410.17337)
+- 🌐 [Project Page](https://ninglab.github.io/CASLIE/)
+- 💻 [Code](https://github.com/ninglab/CASLIE)
+
+## Introduction
+We introduce [MMECInstruct](https://huggingface.co/datasets/NingLab/MMECInstruct), the first-ever, large-scale, and high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information. Leveraging MMECInstruct, we fine-tune a series of e-commerce Multimodal Foundation Models (MFMs) within CASLIE.
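+
+To get a quick look at the instruction data, you can load it with the `datasets` library. This is a minimal sketch: the exact split names and example fields are assumptions here, so check the dataset card for the actual schema.
+
+```python
+from datasets import load_dataset
+
+# Download MMECInstruct from the Hugging Face Hub
+dataset = load_dataset("NingLab/MMECInstruct")
+
+print(dataset)                  # available splits and their sizes
+first_split = next(iter(dataset))
+print(dataset[first_split][0])  # inspect one raw instruction example
+```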

 ## CASLIE Models
 The CASLIE-S model is instruction-tuned from the small base model [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).

+## Sample Usage
+
+To conduct multimodal inference with the CASLIE-S model using the Hugging Face `transformers` library, you can follow this example. The snippet loads the model and processor and performs basic image-text-to-text generation.
+
+```python
+import torch
+from transformers import AutoProcessor, AutoModelForCausalLM
+from PIL import Image
+
+# Load the model and processor.
+# `trust_remote_code=True` is necessary to load the custom model and processor definitions.
+model_path = "NingLab/CASLIE-S"
+processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
+)
+
+# Example: image and text input for a product description task.
+# Replace "image.png" with the actual path to your image file.
+try:
+    image = Image.open("image.png").convert("RGB")
+except FileNotFoundError:
+    print("Warning: 'image.png' not found. Using a dummy image; replace with a real image path.")
+    image = Image.new("RGB", (256, 256), color="red")
+
+question = "Describe the product in detail."
+
+# Prepare the conversation in chat-template format.
+# The "<image>" token is a placeholder that the processor replaces with image features.
+messages = [{"role": "user", "content": f"{question} <image>"}]
+
+# Apply the chat template, then process the text and image together.
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)
+
+# Generate a response and decode only the newly generated tokens,
+# so the echoed prompt is not included in the printed response.
+output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
+response = processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+
+print(f"Question: {question}")
+print(f"Response: {response}")
+
+# For more advanced usage, specific tasks, and detailed inference scripts,
+# please refer to the project's official GitHub repository:
+# https://github.com/ninglab/CASLIE
+```
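+
+Note that sampling is enabled (`do_sample=True` with `temperature=0.7`), so outputs vary across runs; pass `do_sample=False` to `generate` for deterministic greedy decoding.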
+
 ## Citation
 ```bibtex
 @article{ling2024captions,