Inference with llama.cpp + Open WebUI gives repeating `?`

by whoisjeremylam

Is there a specific build of llama.cpp that should be used to support AutoRound?

This is the command:

CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
  -t 23 \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --no-mmap \
  --host 0.0.0.0 \
  --port 5000 \
  -c 13056 \
  -ngl 999 \
  -ub 4096 -b 4096
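
To take Open WebUI out of the picture, the server can also be queried directly over llama-server's OpenAI-compatible API. A minimal check against the server as launched above (alias Ling-flash, port 5000) would look something like this; the prompt text is just an example:

# minimal sanity check against llama-server's OpenAI-compatible chat endpoint
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Ling-flash", "messages": [{"role": "user", "content": "Say hello"}]}'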

llama.cpp build from main:

$ git rev-parse --short HEAD
6de8ed751
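
In case the build itself matters, a typical CUDA build of current master looks roughly like this (the standard llama.cpp CMake flow; nothing AutoRound-specific is assumed here):

# rebuild llama.cpp from master with CUDA enabled (sketch, assumes CUDA toolkit installed)
cd ~/llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j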

(screenshot: the model output is a stream of repeating '?')

Same here.
Latest llama.cpp (GitHub master), freshly built on Ubuntu + CUDA, using the llama.cpp built-in UI.
It returns a repeating '?' no matter what the prompt is.
Otherwise it works fine with other models.

Intel org

CPU works fine, but CUDA has issues; we're investigating the root cause.

Confirmed :( With `-ot exps=CPU` it works as expected (rough command sketch below).
@wenhuach any updates?
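
For anyone hitting the same thing, the workaround is roughly the original invocation with the override-tensor flag added so the expert tensors stay on the CPU. The `exps=CPU` pattern is taken verbatim from the comment above and may need adjusting to match the exact expert tensor names in this GGUF:

# original llama-server invocation with expert tensors kept on the CPU
# (other flags as in the command above)
CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
  -m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
  --alias Ling-flash \
  --host 0.0.0.0 --port 5000 \
  -c 13056 -ngl 999 \
  -ot exps=CPU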

Intel org

We re-uploaded the model, and inference now works normally on CUDA. If there are still problems or other issues, please open an issue on our GitHub: https://github.com/intel/auto-round.git
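
If you grabbed the earlier upload, you will need to re-fetch the GGUF. A rough sketch with huggingface-cli, assuming the repo id and filename match the local path used in the command above:

# re-download the re-uploaded GGUF (repo id and filename assumed from the path above)
huggingface-cli download Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound \
  Ling-flash-Q2_K_S.gguf \
  --local-dir /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound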
