TGI on Gaudi
Text Generation Inference (TGI) is supported on the Intel® Gaudi® AI Accelerator via the Intel® Gaudi® TGI repository. To start a TGI service on a Gaudi system, simply pull a TGI Gaudi Docker image and launch a local TGI service instance.
For example, a TGI service for the Llama 2 7B model can be started on Gaudi with:
docker run \
  -p 8080:80 \
  -v $PWD/data:/data \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice \
  --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 \
  --model-id meta-llama/Llama-2-7b-hf \
  --max-input-tokens 1024 \
  --max-total-tokens 2048

You can then send a simple request:
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
  -H 'Content-Type: application/json'

To run a static benchmark test, please refer to TGI's benchmark tool. More examples of running service instances on single or multi-HPU device systems are available in the Intel® Gaudi® TGI repository.
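For instance, a multi-card deployment shards the model across several HPUs through the launcher's sharding options. The following is a minimal sketch, assuming eight available Gaudi devices; --sharded and --num-shard are standard TGI launcher flags, and PT_HPU_ENABLE_LAZY_COLLECTIVES=true is typically required for the collective operations a sharded run performs:

docker run \
  -p 8080:80 \
  -v $PWD/data:/data \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  --cap-add=sys_nice \
  --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 \
  --model-id meta-llama/Llama-2-7b-hf \
  --sharded true \
  --num-shard 8 \
  --max-input-tokens 1024 \
  --max-total-tokens 2048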
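Besides the one-shot /generate endpoint used above, TGI also exposes a streaming endpoint that returns generated tokens as server-sent events; the request body is the same:

curl 127.0.0.1:8080/generate_stream \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
  -H 'Content-Type: application/json'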