Improve model card: Add metadata, links, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +59 -4
README.md CHANGED
@@ -1,18 +1,73 @@
 ---
-license: cc-by-4.0
-datasets:
-- NingLab/MMECInstruct
 base_model:
 - meta-llama/Llama-3.2-3B-Instruct
+datasets:
+- NingLab/MMECInstruct
+license: cc-by-4.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
 # CASLIE-S
 
-This repo contains the models for "Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data"
+This repo contains the models for "[Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data](https://huggingface.co/papers/2410.17337)".
+
+- πŸ“š [Paper](https://huggingface.co/papers/2410.17337)
+- 🌐 [Project Page](https://ninglab.github.io/CASLIE/)
+- πŸ’» [Code](https://github.com/ninglab/CASLIE)
+
+## Introduction
+We introduce [MMECInstruct](https://huggingface.co/datasets/NingLab/MMECInstruct), the first-ever, large-scale, and high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information. Leveraging MMECInstruct, we fine-tune a series of e-commerce Multimodal Foundation Models (MFMs) within CASLIE.
 
 ## CASLIE Models
 The CASLIE-S model is instruction-tuned from the small base model [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
 
+## Sample Usage
+
+To conduct multimodal inference with the CASLIE-S model using the Hugging Face `transformers` library, you can follow this example. It demonstrates how to load the model and processor and perform basic image-text-to-text generation.
+
+```python
+import torch
+from transformers import AutoProcessor, AutoModelForCausalLM
+from PIL import Image
+
+# Load model and processor
+model_path = "NingLab/CASLIE-S"
+# `trust_remote_code=True` is necessary to load the custom model and processor definitions.
+processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
+
+# Example: image and text input for a product-description task.
+# Replace "image.png" with the actual path to your image file.
+try:
+    image = Image.open("image.png").convert("RGB")
+except FileNotFoundError:
+    print("Warning: 'image.png' not found. Using a dummy image for demonstration. Please replace with a real image path.")
+    # Create a dummy image for demonstration if the actual image is not found.
+    image = Image.new("RGB", (256, 256), color="red")
+
+question = "Describe the product in detail."
+
+# Prepare the conversation in chat-template format.
+# The "<image>" token is a placeholder that the processor handles to embed image features.
+messages = [{"role": "user", "content": f"{question} <image>"}]
+
+# Apply the chat template and process the inputs (image and text).
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)
+
+# Generate a response from the model.
+output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
+response = processor.decode(output_ids[0], skip_special_tokens=True)
+
+print(f"Question: {question}")
+print(f"Response: {response}")
+
+# For more advanced usage, specific tasks, and detailed inference scripts,
+# please refer to the project's official GitHub repository:
+# https://github.com/ninglab/CASLIE
+```
+
 ## Citation
 ```bibtex
 @article{ling2024captions,