MWirelabs/NortheastNER

NortheastNER is a Named Entity Recognition (NER) model fine-tuned by MWirelabs to recognize entities specific to Northeast India. It is based on xlm-roberta-base and trained on a mix of gazetteers, curated news, and domain-specific data (tribes, villages, flora, fauna, festivals, tourist places).


🔎 What it can recognize

  • PLACES → States, districts, villages, regions (e.g., Shillong, Tura, Ri-Bhoi)
  • TRIBES → Indigenous tribes & sub-tribes (e.g., Khasi, Nyishi, Wancho)
  • FESTIVALS → Local festivals (e.g., Wangala, Losar, Nyokum Yullo)
  • TOURIST → Landmarks & tourist spots (e.g., Tawang Monastery, Umiam Lake)
  • FLORA → Plants & crops of the Himalayan / NE region
  • FAUNA → Animals, birds, and wildlife of the NE region

📊 Evaluation

Evaluated on a 5k-sentence dev set:

Entity      Precision   Recall   F1
PLACES      0.963       0.969    0.966
TRIBES      0.927       0.927    0.927
FESTIVALS   (coming soon, fewer examples)
TOURIST     0.167       0.125    0.143
FLORA       1.000       0.800    0.889
FAUNA       0.000       0.000    0.000
Overall     0.962       0.967    0.964

⚠️ The low scores for TOURIST and FAUNA reflect very few training examples; performance should improve with more labeled data. Note: the current evaluation set does not include enough examples of NAMES, so that category is not reported in the table. Training data did include a small gazetteer of Khasi and regional names (~81 entries), but more labeled examples are needed for meaningful evaluation.
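The F1 column is the harmonic mean of precision and recall, so the table can be sanity-checked from its own numbers. The sketch below is a minimal check, not part of the model's evaluation code: the `f1_score` helper name is illustrative, and returning 0.0 when both inputs are zero is a common convention assumed here for the FAUNA row.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the evaluation table.
reported = {
    "PLACES": (0.963, 0.969),
    "TRIBES": (0.927, 0.927),
    "TOURIST": (0.167, 0.125),
    "FLORA": (1.000, 0.800),
    "FAUNA": (0.000, 0.000),
    "Overall": (0.962, 0.967),
}

# Recompute F1 for every row, rounded to the table's three decimals.
recomputed = {name: round(f1_score(p, r), 3) for name, (p, r) in reported.items()}
for name, value in recomputed.items():
    print(f"{name:<8} F1 = {value:.3f}")
```

Each recomputed value matches the F1 reported for that row to three decimals.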


⚙️ Training Setup

  • Base model: xlm-roberta-base
  • Max sequence length: 256
  • Batch size: 16
  • Learning rate: 3e-5
  • Epochs: 3
  • Weight decay: 0.01
  • Optimizer: AdamW
  • Framework: HuggingFace Transformers Trainer API
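For readers who want to reproduce a similar run, the hyperparameters above map onto `transformers.TrainingArguments` roughly as sketched below. This is an approximation, not the original training script: the `output_dir` value and the `build_training_arguments` helper are hypothetical, treating "Batch size: 16" as the per-device batch size is an assumption, and `optim="adamw_torch"` is one plausible reading of "AdamW".

```python
# Hyperparameters from the list above, keyed by the matching
# transformers.TrainingArguments parameter names.
training_kwargs = {
    "per_device_train_batch_size": 16,  # assumption: batch size is per device (single GPU)
    "learning_rate": 3e-5,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "optim": "adamw_torch",  # assumption: the card only says "AdamW"
}

# Not a TrainingArguments field: the sequence limit is applied at tokenization time,
# e.g. tokenizer(..., truncation=True, max_length=MAX_SEQ_LENGTH).
MAX_SEQ_LENGTH = 256


def build_training_arguments(output_dir: str = "northeastner-out"):
    """Build TrainingArguments lazily so this module imports without transformers."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs)
```

Calling `build_training_arguments()` requires `transformers` and PyTorch to be installed; the resulting object can be passed to `Trainer` together with a token-classification model and tokenized datasets.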

📦 Dataset Size

  • Train set: ~20,000 sentences
  • Dev set: ~5,000 sentences
  • Sources: Gazetteers (districts, tribes, flora/fauna, festivals, tourist sites, names), news articles, tourism/cultural descriptions
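Combined with the batch size and epoch count from the training setup, these sizes give a rough idea of how long a run is. The arithmetic below is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation, and uses the approximate sentence counts as if they were exact.

```python
import math

TRAIN_SENTENCES = 20_000  # approximate, from the list above
DEV_SENTENCES = 5_000     # approximate, from the list above
BATCH_SIZE = 16           # from the training setup
EPOCHS = 3                # from the training setup

# One optimizer step per batch (assumes no gradient accumulation).
steps_per_epoch = math.ceil(TRAIN_SENTENCES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

# Share of all labeled sentences held out for evaluation.
dev_fraction = DEV_SENTENCES / (TRAIN_SENTENCES + DEV_SENTENCES)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"dev share of labeled data: {dev_fraction:.0%}")
```

With these approximate counts the sketch reports 1,250 steps per epoch, 3,750 steps in total, and a 20% dev share; gradient accumulation or multi-GPU training would change the step counts.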

🔧 Environment

  • Transformers: 4.44.2
  • Datasets: 2.20.0
  • Evaluate: 0.4.2
  • PyTorch: 2.3.0+cu121
  • Python: 3.11
  • Hardware: Single NVIDIA A4500 GPU (20 GB VRAM), 62 GB RAM, 12 vCPU
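To compare a local setup against the versions listed above, something like the following can help. It is a small sketch using only the standard library; the `compare_versions` helper and the `PINNED` mapping are illustrative names, and the PyPI package names (`torch` for PyTorch) are the usual ones rather than anything stated in this card.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the environment section, keyed by PyPI package name.
PINNED = {
    "transformers": "4.44.2",
    "datasets": "2.20.0",
    "evaluate": "0.4.2",
    "torch": "2.3.0",  # the card lists 2.3.0+cu121 (a CUDA 12.1 build)
}


def compare_versions(pinned: dict) -> dict:
    """Map each package to a (pinned version, installed version or None) pair."""
    report = {}
    for package, wanted in pinned.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed)
    return report


for package, (wanted, installed) in compare_versions(PINNED).items():
    status = "missing" if installed is None else installed
    print(f"{package:<13} pinned {wanted:<8} installed {status}")
```

Other versions may well work; the listed ones are simply what the model was trained with.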

🚀 Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "MWirelabs/NortheastNER"

# Load the fine-tuned tokenizer and token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" groups adjacent tokens with the same label into one entity span.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Wangala festival is celebrated in Garo Hills near Tura."
print(ner(text))

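Output (abridged; the pipeline also returns start/end character offsets for each entity, and scores are shown rounded):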

[{'entity_group': 'FESTIVALS', 'word': 'Wangala', 'score': 0.99},
 {'entity_group': 'PLACES', 'word': 'Garo Hills', 'score': 0.98},
 {'entity_group': 'PLACES', 'word': 'Tura', 'score': 0.97}]
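The list of dicts returned by the pipeline is often easier to work with once grouped by entity type. The sketch below does that on the sample output shown above, so it runs without downloading the model; the `group_entities` helper and the `min_score` threshold of 0.5 are illustrative choices, not part of the model or the pipeline API.

```python
from collections import defaultdict

# Sample pipeline output copied from the example above (scores rounded).
predictions = [
    {"entity_group": "FESTIVALS", "word": "Wangala", "score": 0.99},
    {"entity_group": "PLACES", "word": "Garo Hills", "score": 0.98},
    {"entity_group": "PLACES", "word": "Tura", "score": 0.97},
]


def group_entities(preds, min_score=0.5):
    """Group predicted words by entity type, dropping spans scored below min_score."""
    grouped = defaultdict(list)
    for pred in preds:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


print(group_entities(predictions))
# {'FESTIVALS': ['Wangala'], 'PLACES': ['Garo Hills', 'Tura']}
```

Raising `min_score` trades recall for precision, which may be worth considering for the weaker classes such as TOURIST and FAUNA.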

📜 License

This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

You are free to use, share, and adapt the model for non-commercial purposes with attribution.


🗂 Data Licenses

  • Gazetteers of villages and tribes: compiled by MWirelabs (open reference use).
  • Festivals, tourist sites, and names: curated by MWirelabs team.
    Please ensure attribution when reusing any derived dataset.

📖 Citation

If you use this model in your research, please cite:

@misc{mwirelabs2025northeastner,
  title        = {NortheastNER: A Domain-Specific Named Entity Recognition Model for Northeast India},
  author       = {MWirelabs},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWirelabs/NortheastNER}},
}

⚠️ Limitations

  • Low support for TOURIST and FAUNA classes (few examples).
  • NAMES entity class trained but not evaluated due to lack of dev set coverage.
  • Possible confusion between TRIBES and PLACES where names overlap (e.g., Garo).
  • Model optimized for Northeast India texts; performance outside this domain may degrade.
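One way to gauge the TRIBES/PLACES overlap on your own text is to pool predictions from many sentences and look for surface forms that received both labels. The sketch below assumes pipeline-style dicts as input; the `find_label_conflicts` helper is a hypothetical name and the pooled predictions are made up for illustration, not real model output.

```python
from collections import defaultdict


def find_label_conflicts(predictions, labels=("TRIBES", "PLACES")):
    """Return surface forms that were predicted with more than one of the given labels."""
    seen = defaultdict(set)
    for pred in predictions:
        if pred["entity_group"] in labels:
            # Case-fold so "Garo" and "garo" count as the same surface form.
            seen[pred["word"].lower()].add(pred["entity_group"])
    return {word: sorted(tags) for word, tags in seen.items() if len(tags) > 1}


# Hypothetical predictions pooled from several sentences (not real model output).
pooled = [
    {"entity_group": "TRIBES", "word": "Garo", "score": 0.91},
    {"entity_group": "PLACES", "word": "Garo", "score": 0.62},
    {"entity_group": "PLACES", "word": "Tura", "score": 0.97},
]

print(find_label_conflicts(pooled))
# {'garo': ['PLACES', 'TRIBES']}
```

A word appearing under both labels is not necessarily an error, since "Garo" can legitimately refer to either a tribe or a place depending on context; the output is a list of candidates for manual review.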

🔮 Future Work

  • Add more gold-labeled examples for underrepresented classes (NAMES, FAUNA, TOURIST).
  • Explore active learning to identify low-confidence predictions for manual annotation.
  • Expand coverage of festivals and indigenous knowledge domains.
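The active-learning idea above could start from something as simple as least-confidence sampling: rank sentences by their weakest entity score and send the lowest ones to annotators. The following is a minimal sketch of that one strategy, not the authors' planned method; the `select_for_annotation` helper, the 0.7 threshold, the budget of 2, and the example scores are all hypothetical.

```python
def select_for_annotation(sentence_predictions, threshold=0.7, budget=2):
    """Pick sentences whose least-confident entity scores below the threshold.

    sentence_predictions maps a sentence to the list of dicts the NER pipeline
    returned for it. Results are ordered lowest confidence first.
    """
    scored = []
    for sentence, preds in sentence_predictions.items():
        if not preds:
            continue  # no entities predicted; such sentences need separate handling
        lowest = min(pred["score"] for pred in preds)
        if lowest < threshold:
            scored.append((lowest, sentence))
    scored.sort()
    return [sentence for _, sentence in scored[:budget]]


# Hypothetical scores for illustration only (not real model output).
batch = {
    "Hornbills nest near Pakke.": [{"entity_group": "FAUNA", "word": "Hornbills", "score": 0.41}],
    "Tura is in Garo Hills.": [{"entity_group": "PLACES", "word": "Tura", "score": 0.97}],
    "Visit Umiam Lake.": [{"entity_group": "TOURIST", "word": "Umiam Lake", "score": 0.58}],
}

print(select_for_annotation(batch))
# ['Hornbills nest near Pakke.', 'Visit Umiam Lake.']
```

Sentences where the model predicts no entities at all are skipped here, although missed entities in underrepresented classes may be just as valuable to annotate.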

๐Ÿข About

This model is developed by MWirelabs, pioneering AI solutions for the rich cultural and linguistic diversity of Northeast India. Contact: MWirelabs

Model size: 0.3B params (Safetensors, tensor type F32)

Evaluation results (self-reported, on Custom Northeast India Gazetteers + News Corpus)

  • Overall F1: 0.964
  • Precision: 0.962
  • Recall: 0.967