# Model Card
- Base model: meta-llama/Meta-Llama-3-8B
- Quantization method: FlatQuant
## How To Run

### Set Environment
```bash
git clone https://github.com/ruikangliu/FlatQuant.git
cd FlatQuant
conda create -n flatquant python=3.10 -y
conda activate flatquant
pip install -r requirements.txt
pip install -e .
pip install flash-attn --no-build-isolation
```
⚠️ CUDA required: if you run into CUDA-related errors, check `nvcc --version` and install the CUDA toolkit or fix the path to `nvcc`.

⚠️ Be sure to use a `_CUDA.so` file compiled for your own environment and GPU. A `.so` file built in a different environment may produce different kernel outputs; you can rebuild it in your environment with `pip install -e .`.
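As a quick sanity check before building or running (a minimal sketch, not part of the original setup), you can confirm that PyTorch sees a GPU and that the CUDA toolkit reported by `nvcc` is compatible with the CUDA version PyTorch was built against:

```python
# Sanity check: PyTorch CUDA build vs. installed CUDA toolkit.
import subprocess
import torch

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# nvcc should report a toolkit version compatible with torch.version.cuda.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```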
### Test Script
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

# Load the FlatQuant-quantized checkpoint (custom modeling code requires trust_remote_code).
model = AutoModelForCausalLM.from_pretrained(
    "Hyun9junn/Meta-Llama-3-8B-W4A4KV4-FlatQuant",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("Hyun9junn/Meta-Llama-3-8B-W4A4KV4-FlatQuant")
streamer = TextStreamer(tokenizer)

# The quantized kernels need a GPU; move the model if one is available.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    model = model.to(device)

prompt = "Summarize Barry Bonds's career so far as a legendary tale told by an old baseball coach.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,  # greedy decoding; temperature is ignored in this mode
        streamer=streamer,
    )
```
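The streamer prints tokens to stdout as they are generated. If you also want the completed text as a single string (a convenience step, not part of the original script), you can decode the returned tensor:

```python
# Decode the full generated sequence (prompt + completion) to plain text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```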
## Quantization Details
- Weight bits: 4
- Activation bits: 4
- KV cache bits: 4
- Weight symmetric: True
- Activation symmetric: True
- KV cache symmetric: False
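For reference, here is a minimal sketch of what the symmetric vs. asymmetric settings above mean for 4-bit quantization. This is illustrative only; FlatQuant additionally applies learned affine transformations to flatten weight and activation distributions before quantizing, which this toy example does not model.

```python
import torch

def quantize_int4(x: torch.Tensor, symmetric: bool) -> torch.Tensor:
    """Toy per-tensor 4-bit fake quantization, showing symmetric vs. asymmetric schemes."""
    if symmetric:
        # Symmetric: zero-point fixed at 0, integer levels in [-8, 7].
        scale = x.abs().max() / 7
        q = torch.clamp(torch.round(x / scale), -8, 7)
        return q * scale
    else:
        # Asymmetric: scale and zero-point map [min, max] onto integer levels [0, 15].
        x_min, x_max = x.min(), x.max()
        scale = (x_max - x_min) / 15
        zero_point = torch.round(-x_min / scale)
        q = torch.clamp(torch.round(x / scale) + zero_point, 0, 15)
        return (q - zero_point) * scale

w = torch.randn(4, 8)
print("symmetric  max error:", (w - quantize_int4(w, symmetric=True)).abs().max().item())
print("asymmetric max error:", (w - quantize_int4(w, symmetric=False)).abs().max().item())
```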