# Model Card
- Base model: meta-llama/Meta-Llama-3-8B
- Quantization method: FlatQuant
## How To Run

### Set Environment
```bash
git clone https://github.com/ruikangliu/FlatQuant.git
cd FlatQuant
conda create -n flatquant python=3.10 -y
conda activate flatquant
pip install -r requirements.txt
pip install -e .
pip install flash-attn --no-build-isolation
```
⚠️ CUDA required: if you run into CUDA-related errors, check `nvcc --version` and install the CUDA toolkit or fix the path to `nvcc`.

⚠️ Be sure to use a `_CUDA.so` file compiled for your own environment and GPU. A `.so` file built in a different environment may produce different kernel outputs; you can rebuild it in your environment with `pip install -e .`.
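As a quick sanity check before building or running (a minimal sketch, not part of the original setup), you can confirm that PyTorch sees a GPU and that the CUDA toolkit reported by `nvcc` is compatible with the CUDA version PyTorch was built against:

```python
# Sanity check: PyTorch CUDA build vs. installed CUDA toolkit.
import subprocess
import torch

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# nvcc should report a toolkit version compatible with torch.version.cuda.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```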
### Test Script
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

# Load the FlatQuant-quantized checkpoint (custom modeling code requires trust_remote_code).
model = AutoModelForCausalLM.from_pretrained(
    "Hyun9junn/Meta-Llama-3-8B-W4A4KV4-FlatQuant",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("Hyun9junn/Meta-Llama-3-8B-W4A4KV4-FlatQuant")
streamer = TextStreamer(tokenizer)

# The quantized kernels need a GPU; move the model if one is available.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    model = model.to(device)

prompt = "Summarize Barry Bonds's career so far as a legendary tale told by an old baseball coach.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,  # greedy decoding; temperature is ignored in this mode
        streamer=streamer,
    )
```
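The streamer prints tokens to stdout as they are generated. If you also want the completed text as a single string (a convenience step, not part of the original script), you can decode the returned tensor:

```python
# Decode the full generated sequence (prompt + completion) to plain text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```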
## Quantization Details
- Weight bits: 4
- Activation bits: 4
- KV cache bits: 4
- Weight symmetric: True
- Activation symmetric: True
- KV cache symmetric: False
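For reference, here is a minimal sketch of what the symmetric vs. asymmetric settings above mean for 4-bit quantization. This is illustrative only; FlatQuant additionally applies learned affine transformations to flatten weight and activation distributions before quantizing, which this toy example does not model.

```python
import torch

def quantize_int4(x: torch.Tensor, symmetric: bool) -> torch.Tensor:
    """Toy per-tensor 4-bit fake quantization, showing symmetric vs. asymmetric schemes."""
    if symmetric:
        # Symmetric: zero-point fixed at 0, integer levels in [-8, 7].
        scale = x.abs().max() / 7
        q = torch.clamp(torch.round(x / scale), -8, 7)
        return q * scale
    else:
        # Asymmetric: scale and zero-point map [min, max] onto integer levels [0, 15].
        x_min, x_max = x.min(), x.max()
        scale = (x_max - x_min) / 15
        zero_point = torch.round(-x_min / scale)
        q = torch.clamp(torch.round(x / scale) + zero_point, 0, 15)
        return (q - zero_point) * scale

w = torch.randn(4, 8)
print("symmetric  max error:", (w - quantize_int4(w, symmetric=True)).abs().max().item())
print("asymmetric max error:", (w - quantize_int4(w, symmetric=False)).abs().max().item())
```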