## Description

AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language.
## Evaluation Metrics

Evaluations were conducted on established benchmarks, following a systematic process that tests performance across a range of cognitive and technical tasks.
### Kazakh Leaderboard
| Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP | 
|---|---|---|---|---|---|---|---|
| Yi-Lightning | 0.812 | 0.720 | 0.852 | 0.820 | 0.940 | 0.880 | 0.660 | 
| DeepSeek V3 37A | 0.715 | 0.650 | 0.628 | 0.640 | 0.900 | 0.890 | 0.580 | 
| DeepSeek R1 | 0.798 | 0.753 | 0.764 | 0.680 | 0.868 | 0.937 | 0.784 | 
| Llama-3.1-70b-inst. | 0.639 | 0.610 | 0.585 | 0.520 | 0.820 | 0.780 | 0.520 | 
| KazLLM-1.0-70B | 0.766 | 0.660 | 0.806 | 0.790 | 0.920 | 0.770 | 0.650 | 
| GPT-4o | 0.776 | 0.730 | 0.704 | 0.830 | 0.940 | 0.900 | 0.550 | 
| AlemLLM | 0.826 | 0.757 | 0.837 | 0.775 | 0.949 | 0.917 | 0.719 | 
| QwQ 32B | 0.628 | 0.591 | 0.613 | 0.499 | 0.661 | 0.826 | 0.576 |
### Russian Leaderboard
| Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP | 
|---|---|---|---|---|---|---|---|
| Yi-Lightning | 0.834 | 0.750 | 0.854 | 0.870 | 0.960 | 0.890 | 0.680 | 
| DeepSeek V3 37A | 0.818 | 0.784 | 0.756 | 0.840 | 0.960 | 0.910 | 0.660 | 
| DeepSeek R1 | 0.845 | 0.838 | 0.811 | 0.827 | 0.972 | 0.928 | 0.694 | 
| Llama-3.1-70b-inst. | 0.752 | 0.660 | 0.691 | 0.730 | 0.920 | 0.880 | 0.630 | 
| KazLLM-1.0-70B | 0.748 | 0.650 | 0.806 | 0.860 | 0.790 | 0.810 | 0.570 | 
| GPT-4o | 0.808 | 0.776 | 0.771 | 0.880 | 0.960 | 0.890 | 0.570 | 
| AlemLLM | 0.848 | 0.801 | 0.858 | 0.843 | 0.959 | 0.896 | 0.729 | 
| QwQ 32B | 0.840 | 0.810 | 0.807 | 0.823 | 0.964 | 0.926 | 0.709 | 
### English Leaderboard
| Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP | 
|---|---|---|---|---|---|---|---|
| Yi-Lightning | 0.909 | 0.820 | 0.936 | 0.930 | 0.980 | 0.930 | 0.860 | 
| DeepSeek V3 37A | 0.880 | 0.840 | 0.790 | 0.900 | 0.980 | 0.950 | 0.820 | 
| DeepSeek R1 | 0.908 | 0.855 | 0.857 | 0.882 | 0.977 | 0.960 | 0.915 | 
| Llama-3.1-70b-inst. | 0.841 | 0.770 | 0.718 | 0.880 | 0.960 | 0.900 | 0.820 | 
| KazLLM-1.0-70B | 0.855 | 0.820 | 0.843 | 0.920 | 0.970 | 0.820 | 0.760 | 
| GPT-4o | 0.862 | 0.830 | 0.793 | 0.940 | 0.980 | 0.910 | 0.720 | 
| AlemLLM | 0.921 | 0.874 | 0.928 | 0.909 | 0.978 | 0.926 | 0.911 | 
| QwQ 32B | 0.914 | 0.864 | 0.886 | 0.897 | 0.969 | 0.969 | 0.896 |
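Across all three leaderboards, the Average column is the unweighted mean of the six benchmark scores, which can be verified in a few lines of Python (here for AlemLLM's Kazakh results):

```python
# Sanity check: Average = unweighted mean of the six benchmark scores.
# AlemLLM, Kazakh leaderboard (MMLU, Winogrande, Hellaswag, ARC, GSM8k, DROP):
scores = [0.757, 0.837, 0.775, 0.949, 0.917, 0.719]
average = sum(scores) / len(scores)
print(f"{average:.3f}")  # 0.826, matching the reported Average
```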
## Model specification

- Architecture: Mixture of Experts
- Total Parameters: 247B
- Activated Parameters: 22B
- Tokenizer: SentencePiece
- Quantization: BF16
- Vocabulary Size: 100352
- Number of Layers: 56
- Activation Function: SwiGLU
- Positional Encoding Method: RoPE
- Optimizer: AdamW
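These values should be reflected in the checkpoint's configuration. A minimal sketch for double-checking them, assuming the weights ship a standard Hugging Face config.json (field names vary by architecture, so treat the keys below as illustrative):

```python
import json

# Hypothetical path; point this at the downloaded checkpoint directory.
with open("/alemllm/weights/config.json") as f:
    config = json.load(f)

# Common HF config keys; the exact names depend on the model architecture.
print("vocab size:", config.get("vocab_size"))        # expected: 100352
print("layers:", config.get("num_hidden_layers"))     # expected: 56
print("activation:", config.get("hidden_act"))        # expected: a SwiGLU variant, e.g. "silu"
print("rope theta:", config.get("rope_theta"))        # present when RoPE is used
```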
## Run in Docker mode
- Ubuntu 24.04
- NVIDIA-SMI 535.247.01
- Driver Version: 535.247.01
- CUDA Version: 12.2
```bash
docker run -it --runtime nvidia -d \
  --restart=unless-stopped \
  --gpus all \
  -e OMP_NUM_THREADS=1 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -p 8000:8000 \
  -v shm:/dev/shm \
  -v /alemllm/tmp/:/tmp \
  -v /alemllm/tmp/:/root/.cache \
  -v /alemllm/tmp/:/root/.local \
  -v /alemllm/weights:/alemllm/weights/ \
  astanahubcloud/alemllm:latest \
  python3 -m vllm.entrypoints.openai.api_server \
  --model=/alemllm/weights/ \
  --trust-remote-code \
  --tokenizer-mode=slow \
  --disable-log-requests \
  --max-seq-len-to-capture=131072 \
  --gpu-memory-utilization=0.98 \
  --tensor-parallel-size=8 \
  --port=8000 \
  --host=0.0.0.0 \
  --served-model-name=astanahub/alemllm
```
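The container exposes vLLM's OpenAI-compatible API on port 8000. A minimal client sketch using the official openai Python package (the base URL and API key value are assumptions; vLLM ignores the key unless one is configured on the server):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; api_key is a placeholder unless the server enforces one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="astanahub/alemllm",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```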
## Run in Hugging Face mode

- Ubuntu 22.04
- CUDA 12.1
- Python 3.11
- pytorch==2.1.0
- transformers==4.40.1
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/alemllm"

# Load the model with automatic dtype selection and multi-GPU placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    rope_scaling=None,
    trust_remote_code=True,
)
# The SentencePiece tokenizer requires the slow (non-Rust) implementation.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template, appending the assistant header.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# Strip the prompt tokens so only the newly generated completion remains.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in
    zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
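For interactive use, the same setup can stream tokens as they are generated. A small sketch using transformers' built-in TextStreamer, reusing model, tokenizer, and model_inputs from the snippet above:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are produced, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=1024, streamer=streamer)
```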
## Run in TuringInfer mode

- Ubuntu 22.04
- CUDA 12.4
- pytorch==2.6.0
- transformers==4.51.0
```bash
python -m turing_serving.launcher \
  --model-path /path/to/alemllm \
  --model-name alemllm \
  --host 0.0.0.0 \
  --port 9528 \
  --solver server_solver \
  --backend vllm \
  --tensor-parallel-size 8 \
  --worker-timeout-seconds 7200 \
  --skip-authorization-check \
  --engine-args tokenizer-mode=slow disable-log-requests=__NULL__ trust-remote-code=__NULL__ kv-cache-dtype=fp8 quantization=fp8 max-seq-len-to-capture=131072 gpu-memory-utilization=0.98
```
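Since the launcher wraps a vLLM backend, the server is expected to speak an OpenAI-compatible protocol on port 9528. A minimal request sketch with the requests library; the endpoint path is an assumption based on the vLLM backend, so check the TuringInfer documentation for the exact route:

```python
import requests

# Assumed OpenAI-compatible route exposed by the vLLM backend; verify against TuringInfer docs.
resp = requests.post(
    "http://localhost:9528/v1/chat/completions",
    json={
        "model": "alemllm",  # matches --model-name above
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "max_tokens": 512,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```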
## License

The model is licensed under CC BY-NC 4.0, which prohibits unlawful, harmful, or abusive uses. For commercial usage inquiries, please contact us.
## Attribution
Developed with technical support from 01.AI.
## Intended Use & Limitations

- Intended Use: Research and development in line with Kazakhstan's AI initiatives.
- Limitations: The model may generate inaccurate, biased, or unsafe content; users must apply responsible use practices.
- Safety & Compliance: Publication is subject to applicable laws, export control, and cybersecurity regulations.
