Model Card

  • Base model: meta-llama/Meta-Llama-3-8B
  • Quantization method: FlatQuant

How To Run

Set Environment

git clone https://github.com/ruikangliu/FlatQuant.git
cd FlatQuant

conda create -n flatquant python=3.10 -y
conda activate flatquant
pip install -r requirements.txt
pip install -e .
pip install flash-attn --no-build-isolation

⚠️ CUDA required: if you run into CUDA-related errors, check nvcc --version and either install the CUDA toolkit or set the path to nvcc correctly.

⚠️ Be sure to use a _CUDA.so kernel file that matches your environment and GPU. A .so file compiled in a different environment may produce different kernel outputs. You can rebuild it by running pip install -e . in your own environment.
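
To sanity-check the setup before running the model, the short script below prints the CUDA version PyTorch was built against and the nvcc version the kernels will be compiled with; the two should be compatible. This is a minimal sketch that relies only on PyTorch and the nvcc binary, not on FlatQuant itself.

# check_env.py -- quick sanity check of the CUDA toolchain (illustrative sketch)
import subprocess
import torch

print("torch.version.cuda :", torch.version.cuda)        # CUDA runtime PyTorch was built with
print("cuda available     :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu                :", torch.cuda.get_device_name(0))

# nvcc compiles the _CUDA.so kernels during `pip install -e .`,
# so its version should be compatible with the one above.
try:
    out = subprocess.run(["nvcc", "--version"], capture_output = True, text = True)
    print(out.stdout)
except FileNotFoundError:
    print("nvcc not found on PATH -- install the CUDA toolkit or set the path to nvcc")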

Test Script

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Hyun9junn/Meta-Llama-3-8B-W4A4KV4-FlatQuant",
    trust_remote_code = True,   # load the custom FlatQuant modeling code shipped with this repo
    torch_dtype = torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("Hyun9junn/Meta-Llama-3-8B-W4A4KV4-FlatQuant")
streamer = TextStreamer(tokenizer)

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    model = model.to(device)

prompt = "Summarize Barry Bonds's career so far as a legendary tale told by an old baseball coach.\n"

inputs = tokenizer(prompt, return_tensors = "pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens = 50,
        do_sample = False,   # greedy decoding; sampling parameters such as temperature are ignored
        streamer = streamer
    )
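
If you want to run several prompts at once, note that the base Llama-3 tokenizer ships without a padding token, so a common pattern is to reuse the EOS token for padding and pad on the left for decoder-only generation. The snippet below is a sketch of that pattern on top of the model and tokenizer loaded above; the prompts are placeholders.

# Batched generation sketch (reuses model and tokenizer from the script above)
tokenizer.pad_token = tokenizer.eos_token   # Llama-3 has no dedicated pad token
tokenizer.padding_side = "left"             # left padding for decoder-only generation

prompts = [
    "Explain 4-bit weight quantization in one sentence.\n",
    "Name three Hall of Fame outfielders.\n",
]
batch = tokenizer(prompts, return_tensors = "pt", padding = True).to(model.device)

with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens = 50,
        do_sample = False,
        pad_token_id = tokenizer.eos_token_id,
    )

for text in tokenizer.batch_decode(out, skip_special_tokens = True):
    print(text)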

Quantization Details

  • Weight bits: 4
  • Activation bits: 4
  • KV cache bits: 4
  • Weight symmetric: True
  • Activation symmetric: True
  • KV cache symmetric: False (see the sketch below for what symmetric vs. asymmetric means here)
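
For intuition on the symmetric/asymmetric settings above, the sketch below shows plain per-tensor 4-bit quantization arithmetic: symmetric quantization fixes the zero-point at 0 (as used for weights and activations here), while asymmetric quantization adds a zero-point so the range can follow skewed tensors (as used for the KV cache). This illustrates the general INT4 scheme only; it is not FlatQuant's actual transforms or kernels.

import torch

def quantize_sym_int4(x):
    # symmetric: signed range [-8, 7], zero-point fixed at 0
    scale = x.abs().max() / 7
    q = torch.clamp(torch.round(x / scale), -8, 7)
    return q, scale

def quantize_asym_int4(x):
    # asymmetric: unsigned range [0, 15] plus a zero-point
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / 15
    zero_point = torch.round(-x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, 15)
    return q, scale, zero_point

w = torch.randn(8)                       # roughly zero-centered, like weights
q, s = quantize_sym_int4(w)
print("dequantized (symmetric): ", q * s)

kv = torch.rand(8) * 3 + 1               # skewed, non-zero-centered tensor
q, s, zp = quantize_asym_int4(kv)
print("dequantized (asymmetric):", (q - zp) * s)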