Multiple bugs prevent model.generate() from working - AttributeError, TypeError, and KeyError issues
The model's implementation has several bugs that cause model.generate() to fail. They appear to stem from missing None checks when handling cached key-value pairs during generation.
Steps to Reproduce:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("lelapa/InkubaLM-0.4B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("lelapa/InkubaLM-0.4B", trust_remote_code=True)
model.to('cuda')
text = "Today I planned to"
inputs = tokenizer(text, return_tensors="pt").to('cuda')
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=20,
    pad_token_id=tokenizer.eos_token_id,
)
Errors Encountered:
- Line 578 in VulavulaLlamaModel.forward():
  AttributeError: 'NoneType' object has no attribute 'shape'
  Code: past_key_values_length = past_key_values[0][0].shape[2]
- Line 232 in VulavulaLlamaAttention.forward():
  AttributeError: 'NoneType' object has no attribute 'shape'
  Code: kv_seq_len += past_key_value[0].shape[-2]
- Lines 149-150 in VulavulaLlamaAttention.forward():
  TypeError: expected Tensor as element 0 in argument 0, but got NoneType
  Code: key_states = torch.cat([past_key_value[0], key_states], dim=2)
- Cache access issues:
  KeyError: 'Cache only has 1 layers, attempted to access layer with index 1'
Pattern:
All errors seem to stem from the code assuming that if past_key_values or past_key_value is not None, then it contains valid tensor data. However, during the generation process, these structures can contain None elements or have indexing mismatches.
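The kind of guard this pattern calls for can be sketched as a small helper. This is only an illustration, not the model's actual code: `cached_length` is a hypothetical name, and the sketch assumes a legacy-style cache in which each layer's entry is a `(key, value)` pair with the sequence length on dimension 2 of the key, matching the `shape[2]` access in the first error above.

```python
def cached_length(past_key_values, seq_dim=2):
    """Return the number of cached positions, or 0 if the cache is unusable.

    Hypothetical helper: assumes past_key_values[layer] is a (key, value)
    pair and that the key exposes .shape with the sequence length on
    dimension seq_dim.
    """
    if past_key_values is None:
        return 0
    try:
        first_layer = past_key_values[0]
    except (IndexError, KeyError):
        # No entry for layer 0 yet. KeyError is the exception type the
        # cache object raised for a missing layer in the report above.
        return 0
    if first_layer is None or first_layer[0] is None:
        # A slot exists for the layer but holds no tensor yet.
        return 0
    return first_layer[0].shape[seq_dim]
```

With a helper like this, a call site such as the one at line 578 could read `past_key_values_length = cached_length(past_key_values)` instead of indexing the cache unconditionally. The attention-layer code would need an equivalent check before `torch.cat`.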
Current Workaround:
Setting use_cache=False avoids the errors but slows generation significantly, since keys and values are then recomputed for the whole sequence at every step.
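For reference, the workaround applied to the reproduction script might look like the sketch below. The function name `generate_without_cache` is hypothetical. The arguments mirror the `model.generate()` call from the steps to reproduce, with `use_cache=False` as the only addition, and `model` and `tokenizer` are assumed to be loaded as shown there.

```python
def generate_without_cache(model, tokenizer, text, max_length=20):
    """Run generation with the key-value cache disabled (hypothetical wrapper)."""
    # Tokenize and move the inputs to whichever device the model is on.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    return model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id,
        use_cache=False,  # sidesteps the cache-handling code paths listed above
    )
```

Calling `generate_without_cache(model, tokenizer, "Today I planned to")` and decoding the result with `tokenizer.decode(outputs[0], skip_special_tokens=True)` should then produce text, at the cost of slower generation.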
Environment:
- transformers version: 4.55.4
- torch version: 2.7.1+cu128
- Python version: 3.12.11
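Anyone trying to confirm the issue on another setup can collect the same version details with a short script. This is a convenience sketch using only the standard library. `collect_versions` is a hypothetical helper name, and packages that are not installed are reported as None.

```python
import sys
from importlib import metadata


def collect_versions(packages=("transformers", "torch")):
    """Map each package name to its installed version, or None if absent."""
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in collect_versions().items():
    print(f"{name}: {version}")
```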