Inference with llama.cpp + Open WebUI gives repeating `?`

by whoisjeremylam

Is there a specific build of llama.cpp that should be used to support AutoRound?

This is the command:

CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
  -t 23 \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --no-mmap \
  --host 0.0.0.0 \
  --port 5000 \
  -c 13056 \
  -ngl 999 \
  -ub 4096 -b 4096
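
To take Open WebUI out of the picture, the server can also be queried directly over llama-server's OpenAI-compatible API. A minimal check against the server as launched above (alias Ling-flash, port 5000) would look something like this; the prompt text is just an example:

# minimal sanity check against llama-server's OpenAI-compatible chat endpoint
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Ling-flash", "messages": [{"role": "user", "content": "Say hello"}]}'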

llama.cpp build from main:

$ git rev-parse --short HEAD
6de8ed751
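
In case the build itself matters, a typical CUDA build of current master looks roughly like this (the standard llama.cpp CMake flow; nothing AutoRound-specific is assumed here):

# rebuild llama.cpp from master with CUDA enabled (sketch, assumes CUDA toolkit installed)
cd ~/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j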

(screenshot: the model output is a stream of repeating '?')

Same here.
Latest llama.cpp (GitHub master), freshly built on Ubuntu + CUDA, using the llama.cpp built-in UI.
It returns a repeating '?' no matter what the prompt is.
Otherwise it works fine with other models.

Intel org

CPU works fine, but CUDA has issues; we're investigating the root cause.

Confirmed :( With `-ot exps=CPU` it works as expected (rough command sketch below).
@wenhuach any updates?
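
For anyone hitting the same thing, the workaround is roughly the original invocation with the override-tensor flag added so the expert tensors stay on the CPU. The `exps=CPU` pattern is taken verbatim from the comment above and may need adjusting to match the exact expert tensor names in this GGUF:

# original llama-server invocation with expert tensors kept on the CPU
# (other flags as in the command above)
CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --host 0.0.0.0 --port 5000 \
  -c 13056 -ngl 999 \
  -ot exps=CPU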

Intel org

We re-uploaded the model, and inference now works normally on CUDA. If there are still problems or other issues, please open an issue on our GitHub: https://github.com/intel/auto-round.git
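
If you grabbed the earlier upload, you will need to re-fetch the GGUF. A rough sketch with huggingface-cli, assuming the repo id and filename match the local path used in the command above:

# re-download the re-uploaded GGUF (repo id and filename assumed from the path above)
huggingface-cli download Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound \
  Ling-flash-Q2_K_S.gguf \
  --local-dir /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound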
