## Description

AlemLLM is a large language model customized by Astana Hub to improve the helpfulness of LLM-generated responses in the Kazakh language.
## Evaluation Metrics

Evaluations were conducted on established benchmarks, following a systematic process that tests performance across a range of cognitive and technical tasks.
### Kazakh Leaderboard
| Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP | 
|---|---|---|---|---|---|---|---|
| Yi-Lightning | 0.812 | 0.720 | 0.852 | 0.820 | 0.940 | 0.880 | 0.660 | 
| DeepSeek V3 37A | 0.715 | 0.650 | 0.628 | 0.640 | 0.900 | 0.890 | 0.580 | 
| DeepSeek R1 | 0.798 | 0.753 | 0.764 | 0.680 | 0.868 | 0.937 | 0.784 | 
| Llama-3.1-70b-inst. | 0.639 | 0.610 | 0.585 | 0.520 | 0.820 | 0.780 | 0.520 | 
| KazLLM-1.0-70B | 0.766 | 0.660 | 0.806 | 0.790 | 0.920 | 0.770 | 0.650 | 
| GPT-4o | 0.776 | 0.730 | 0.704 | 0.830 | 0.940 | 0.900 | 0.550 | 
| AlemLLM | 0.826 | 0.757 | 0.837 | 0.775 | 0.949 | 0.917 | 0.719 | 
| QwQ 32B | 0.628 | 0.591 | 0.613 | 0.499 | 0.661 | 0.826 | 0.576 |
### Russian Leaderboard
| Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP | 
|---|---|---|---|---|---|---|---|
| Yi-Lightning | 0.834 | 0.750 | 0.854 | 0.870 | 0.960 | 0.890 | 0.680 | 
| DeepSeek V3 37A | 0.818 | 0.784 | 0.756 | 0.840 | 0.960 | 0.910 | 0.660 | 
| DeepSeek R1 | 0.845 | 0.838 | 0.811 | 0.827 | 0.972 | 0.928 | 0.694 | 
| Llama-3.1-70b-inst. | 0.752 | 0.660 | 0.691 | 0.730 | 0.920 | 0.880 | 0.630 | 
| KazLLM-1.0-70B | 0.748 | 0.650 | 0.806 | 0.860 | 0.790 | 0.810 | 0.570 | 
| GPT-4o | 0.808 | 0.776 | 0.771 | 0.880 | 0.960 | 0.890 | 0.570 | 
| AlemLLM | 0.848 | 0.801 | 0.858 | 0.843 | 0.959 | 0.896 | 0.729 | 
| QwQ 32B | 0.840 | 0.810 | 0.807 | 0.823 | 0.964 | 0.926 | 0.709 | 
### English Leaderboard
| Model | Average | MMLU | Winogrande | Hellaswag | ARC | GSM8k | DROP | 
|---|---|---|---|---|---|---|---|
| Yi-Lightning | 0.909 | 0.820 | 0.936 | 0.930 | 0.980 | 0.930 | 0.860 | 
| DeepSeek V3 37A | 0.880 | 0.840 | 0.790 | 0.900 | 0.980 | 0.950 | 0.820 | 
| DeepSeek R1 | 0.908 | 0.855 | 0.857 | 0.882 | 0.977 | 0.960 | 0.915 | 
| Llama-3.1-70b-inst. | 0.841 | 0.770 | 0.718 | 0.880 | 0.960 | 0.900 | 0.820 | 
| KazLLM-1.0-70B | 0.855 | 0.820 | 0.843 | 0.920 | 0.970 | 0.820 | 0.760 | 
| GPT-4o | 0.862 | 0.830 | 0.793 | 0.940 | 0.980 | 0.910 | 0.720 | 
| AlemLLM | 0.921 | 0.874 | 0.928 | 0.909 | 0.978 | 0.926 | 0.911 | 
| QwQ 32B | 0.914 | 0.864 | 0.886 | 0.897 | 0.969 | 0.969 | 0.896 |
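Across all three leaderboards, the Average column is the unweighted mean of the six benchmark scores, which can be verified in a few lines of Python (here for AlemLLM's Kazakh results):

```python
# Sanity check: Average = unweighted mean of the six benchmark scores.
# AlemLLM, Kazakh leaderboard (MMLU, Winogrande, Hellaswag, ARC, GSM8k, DROP):
scores = [0.757, 0.837, 0.775, 0.949, 0.917, 0.719]
average = sum(scores) / len(scores)
print(f"{average:.3f}")  # 0.826, matching the reported Average
```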
## Model specification

- Architecture: Mixture of Experts
- Total Parameters: 247B
- Activated Parameters: 22B
- Tokenizer: SentencePiece
- Quantization: BF16
- Vocabulary Size: 100352
- Number of Layers: 56
- Activation Function: SwiGLU
- Positional Encoding Method: RoPE
- Optimizer: AdamW
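These values should be reflected in the checkpoint's configuration. A minimal sketch for double-checking them, assuming the weights ship a standard Hugging Face config.json (field names vary by architecture, so treat the keys below as illustrative):

```python
import json

# Hypothetical path; point this at the downloaded checkpoint directory.
with open("/alemllm/weights/config.json") as f:
    config = json.load(f)

# Common HF config keys; the exact names depend on the model architecture.
print("vocab size:", config.get("vocab_size"))        # expected: 100352
print("layers:", config.get("num_hidden_layers"))     # expected: 56
print("activation:", config.get("hidden_act"))        # expected: a SwiGLU variant, e.g. "silu"
print("rope theta:", config.get("rope_theta"))        # present when RoPE is used
```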
## Run in Docker mode
- Ubuntu 24.04
- NVIDIA-SMI 535.247.01
- Driver Version: 535.247.01
- CUDA Version: 12.2
```bash
docker run -it --runtime nvidia -d \
  --restart=unless-stopped \
  --gpus all \
  -e OMP_NUM_THREADS=1 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -p 8000:8000 \
  -v shm:/dev/shm \
  -v /alemllm/tmp/:/tmp \
  -v /alemllm/tmp/:/root/.cache \
  -v /alemllm/tmp/:/root/.local \
  -v /alemllm/weights:/alemllm/weights/ \
  astanahubcloud/alemllm:latest \
  python3 -m vllm.entrypoints.openai.api_server \
  --model=/alemllm/weights/ \
  --trust-remote-code \
  --tokenizer-mode=slow \
  --disable-log-requests \
  --max-seq-len-to-capture=131072 \
  --gpu-memory-utilization=0.98 \
  --tensor-parallel-size=8 \
  --port=8000 \
  --host=0.0.0.0 \
  --served-model-name=astanahub/alemllm
```
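The container exposes vLLM's OpenAI-compatible API on port 8000. A minimal client sketch using the official openai Python package (the base URL and API key value are assumptions; vLLM ignores the key unless one is configured on the server):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; api_key is a placeholder unless the server enforces one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="astanahub/alemllm",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```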
## Run in Hugging Face mode

- Ubuntu 22.04
- CUDA 12.1
- Python 3.11
- pytorch==2.1.0
- transformers==4.40.1
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/alemllm"

# Load the model with automatic dtype selection and multi-GPU placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    rope_scaling=None,
    trust_remote_code=True,
)
# The SentencePiece tokenizer requires the slow (non-Rust) implementation.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template, appending the assistant header.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# Strip the prompt tokens so only the newly generated completion remains.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in
    zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
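For interactive use, the same setup can stream tokens as they are generated. A small sketch using transformers' built-in TextStreamer, reusing model, tokenizer, and model_inputs from the snippet above:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are produced, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=1024, streamer=streamer)
```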
## Run in TuringInfer mode

- Ubuntu 22.04
- CUDA 12.4
- pytorch==2.6.0
- transformers==4.51.0
```bash
python -m turing_serving.launcher \
  --model-path /path/to/alemllm \
  --model-name alemllm \
  --host 0.0.0.0 \
  --port 9528 \
  --solver server_solver \
  --backend vllm \
  --tensor-parallel-size 8 \
  --worker-timeout-seconds 7200 \
  --skip-authorization-check \
  --engine-args tokenizer-mode=slow disable-log-requests=__NULL__ trust-remote-code=__NULL__ kv-cache-dtype=fp8 quantization=fp8 max-seq-len-to-capture=131072 gpu-memory-utilization=0.98
```
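Since the launcher wraps a vLLM backend, the server is expected to speak an OpenAI-compatible protocol on port 9528. A minimal request sketch with the requests library; the endpoint path is an assumption based on the vLLM backend, so check the TuringInfer documentation for the exact route:

```python
import requests

# Assumed OpenAI-compatible route exposed by the vLLM backend; verify against TuringInfer docs.
resp = requests.post(
    "http://localhost:9528/v1/chat/completions",
    json={
        "model": "alemllm",  # matches --model-name above
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "max_tokens": 512,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```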
## License

The model is licensed under CC BY-NC 4.0, which prohibits unlawful, harmful, or abusive uses. For commercial usage inquiries, please contact us.
## Attribution
Developed with technical support from 01.AI.
## Intended Use & Limitations

- Intended Use: Research and development in line with Kazakhstan's AI initiatives.
- Limitations: The model may generate inaccurate, biased, or unsafe content; users must apply responsible use practices.
- Safety & Compliance: Publication is subject to applicable laws, export control, and cybersecurity regulations.
