Tashkeel-700M

Arabic Diacritization Model | ู†ูŽู…ููˆุฐูŽุฌู ุชูŽุดู’ูƒููŠู„ู ุงู„ู†ู‘ูุตููˆุตู ุงู„ู’ุนูŽุฑูŽุจููŠู‘ูŽุฉู

A 700-million-parameter model specialized for Arabic text diacritization (tashkeel). It was trained by fine-tuning LiquidAI/LFM2-700M on the arbml/tashkeela dataset.

  • Base model: LiquidAI/LFM2-700M
  • Dataset: arbml/tashkeela (see the loading sketch below)
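
The training corpus can be inspected directly with the datasets library. This is a minimal sketch; the "train" split name is an assumption, so check the dataset card for the actual splits and columns.

from datasets import load_dataset

# Load the Tashkeela corpus from the Hugging Face Hub
# (the "train" split name is an assumption; verify on the dataset card)
ds = load_dataset("arbml/tashkeela", split="train")
print(ds[0])  # inspect one diacritized example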

ูƒูŠููŠุฉ ุงู„ุงุณุชุฎุฏุงู…

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_id = "Etherll/Tashkeel-700M"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# ุฅุถุงูุฉ ุงู„ุชุดูƒูŠู„
prompt = "ุงู„ุณู„ุงู… ุนู„ูŠูƒู…" 
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

output = model.generate(
    input_ids,
    do_sample=False,     # greedy decoding for deterministic output
    max_new_tokens=256,  # cap on generated length; raise for longer inputs
)

print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))

Example

  • Input: ุงู„ุณู„ุงู… ุนู„ูŠูƒู…
  • Output: ุงู„ุณู‘ูŽู„ูŽุงู…ู ุนูŽู„ูŽูŠู’ูƒูู…ู’
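
For repeated use, the loading and generation steps above can be wrapped in a small helper. This is a minimal sketch that reuses the model and tokenizer objects loaded earlier; the diacritize function name and the max_new_tokens value are our own choices, not part of the model card.

def diacritize(text: str) -> str:
    # Format the raw text as a single-turn chat prompt
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": text}],
        add_generation_prompt=True,
        return_tensors="pt",
        tokenize=True,
    ).to(model.device)
    # Greedy decoding keeps the diacritization deterministic
    output = model.generate(input_ids, do_sample=False, max_new_tokens=256)
    # Strip the prompt tokens, returning only the diacritized text
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(diacritize("ุงู„ุณู„ุงู… ุนู„ูŠูƒู…"))  # ุงู„ุณู‘ูŽู„ูŽุงู…ู ุนูŽู„ูŽูŠู’ูƒูู…ู’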



This LFM2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
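
The exact training script is not published here. As a rough illustration, a comparable supervised fine-tune with Unsloth and TRL could look like the sketch below. The column names (text, diacritized), the hyperparameters, and Unsloth support for the LFM2 architecture are all assumptions to verify, and the SFTTrainer keyword for the tokenizer differs across TRL versions.

from unsloth import FastLanguageModel  # import unsloth first so its patches apply
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Load the base model through Unsloth for faster training
# (assumes Unsloth supports the LFM2 architecture)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="LiquidAI/LFM2-700M",
    max_seq_length=2048,
)

def to_messages(example):
    # Hypothetical column names: pair plain text with its
    # diacritized counterpart as one chat exchange
    return {"messages": [
        {"role": "user", "content": example["text"]},
        {"role": "assistant", "content": example["diacritized"]},
    ]}

dataset = load_dataset("arbml/tashkeela", split="train").map(to_messages)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
    train_dataset=dataset,
    args=SFTConfig(output_dir="tashkeel-700m-sft", num_train_epochs=1),
)
trainer.train()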
