MWirelabs/NortheastNER
NortheastNER is a Named Entity Recognition (NER) model fine-tuned by MWirelabs to recognize entities specific to Northeast India. It is based on xlm-roberta-base and trained on a mix of gazetteers, curated news, and domain-specific data (tribes, villages, flora, fauna, festivals, tourist places).
What it can recognize
- PLACES: States, districts, villages, regions (e.g., Shillong, Tura, Ri-Bhoi)
- TRIBES: Indigenous tribes & sub-tribes (e.g., Khasi, Nyishi, Wancho)
- FESTIVALS: Local festivals (e.g., Wangala, Losar, Nyokum Yullo)
- TOURIST: Landmarks & tourist spots (e.g., Tawang Monastery, Umiam Lake)
- FLORA: Plants & crops of the Himalayan / NE region
- FAUNA: Animals, birds, and wildlife of the NE region
Evaluation
Evaluated on a 5k-sentence dev set:
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| PLACES | 0.963 | 0.969 | 0.966 |
| TRIBES | 0.927 | 0.927 | 0.927 |
| FESTIVALS | (coming soon, fewer examples) | | |
| TOURIST | 0.167 | 0.125 | 0.143 |
| FLORA | 1.000 | 0.800 | 0.889 |
| FAUNA | 0.000 | 0.000 | 0.000 |
| Overall | 0.962 | 0.967 | 0.964 |
Warning: The low scores for TOURIST and FAUNA reflect very few training examples; performance is expected to improve with more labeled data.

Note: The current evaluation set does not include enough examples of NAMES, so that category is not reported in the table. Training data did include a small gazetteer of Khasi and regional names (~81 entries), but more labeled examples are needed for meaningful evaluation.
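The F1 column in the table above is the harmonic mean of precision and recall. As a minimal sketch (the helper name `f1_score` is ours, not part of the model's evaluation code), the per-entity values can be re-derived from the reported precision and recall:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation table
rows = {
    "PLACES": (0.963, 0.969),
    "TRIBES": (0.927, 0.927),
    "TOURIST": (0.167, 0.125),
    "FLORA": (1.000, 0.800),
    "FAUNA": (0.000, 0.000),
}

for label, (p, r) in rows.items():
    print(f"{label}: F1 = {f1_score(p, r):.3f}")
```

The zero guard matters for the FAUNA row, where precision and recall are both 0 and the plain formula would divide by zero.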
Training Setup
- Base model: xlm-roberta-base
- Max sequence length: 256
- Batch size: 16
- Learning rate: 3e-5
- Epochs: 3
- Weight decay: 0.01
- Optimizer: AdamW
- Framework: HuggingFace Transformers Trainer API
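The settings above could map onto the Trainer API roughly as follows. This is a sketch under assumptions, not the authors' training script: the `output_dir` value and the names `HYPERPARAMS` and `build_training_args` are hypothetical, and the batch size is treated as per-device because training used a single GPU.

```python
# Hyperparameters as listed in the Training Setup section.
HYPERPARAMS = {
    "base_model": "xlm-roberta-base",
    "max_seq_length": 256,
    "batch_size": 16,
    "learning_rate": 3e-5,
    "num_epochs": 3,
    "weight_decay": 0.01,
}


def build_training_args(output_dir: str = "northeastner-out"):
    """Map the listed hyperparameters onto TrainingArguments.

    output_dir is a hypothetical path. AdamW is the Trainer's default
    optimizer, so it is not set explicitly here.
    """
    # Imported lazily so the dict above is usable without transformers/torch.
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=HYPERPARAMS["batch_size"],
        learning_rate=HYPERPARAMS["learning_rate"],
        num_train_epochs=HYPERPARAMS["num_epochs"],
        weight_decay=HYPERPARAMS["weight_decay"],
    )
```

The maximum sequence length is not a `TrainingArguments` field; it would typically be applied at tokenization time, for example `tokenizer(text, truncation=True, max_length=256)`.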
Dataset Size
- Train set: ~20,000 sentences
- Dev set: ~5,000 sentences
- Sources: Gazetteers (districts, tribes, flora/fauna, festivals, tourist sites, names), news articles, tourism/cultural descriptions
Environment
- Transformers: 4.44.2
- Datasets: 2.20.0
- Evaluate: 0.4.2
- PyTorch: 2.3.0+cu121
- Python: 3.11
- Hardware: Single NVIDIA A4500 GPU (20 GB VRAM), 62 GB RAM, 12 vCPU
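To compare a local setup against the versions listed above, a small check along these lines may help. It is only a sketch: the `PINNED` dict and `check_versions` helper are illustrative names, and PyTorch is left out because its version string carries a build suffix (`+cu121`) that depends on the installed CUDA build.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the Environment list above.
PINNED = {"transformers": "4.44.2", "datasets": "2.20.0", "evaluate": "0.4.2"}


def check_versions(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed}")
```

A mismatch does not necessarily mean the model will fail to load; the listed versions are simply the ones the model was trained with.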
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
model_id = "MWirelabs/NortheastNER"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "Wangala festival is celebrated in Garo Hills near Tura."
print(ner(text))
Output:
[{'entity_group': 'FESTIVALS', 'word': 'Wangala', 'score': 0.99},
{'entity_group': 'PLACES', 'word': 'Garo Hills', 'score': 0.98},
{'entity_group': 'PLACES', 'word': 'Tura', 'score': 0.97}]
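The pipeline returns a flat list of spans, so downstream code often needs them grouped by entity type. A minimal sketch, assuming the output shape shown above; the `group_entities` helper and the `min_score` cutoff of 0.5 are illustrative choices, not part of the model:

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.5):
    """Collect predicted words per entity group, skipping low-score spans."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Sample predictions copied from the example output
sample = [
    {"entity_group": "FESTIVALS", "word": "Wangala", "score": 0.99},
    {"entity_group": "PLACES", "word": "Garo Hills", "score": 0.98},
    {"entity_group": "PLACES", "word": "Tura", "score": 0.97},
]

print(group_entities(sample))
# {'FESTIVALS': ['Wangala'], 'PLACES': ['Garo Hills', 'Tura']}
```

In real use, `sample` would be replaced by the result of `ner(text)`.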
License
This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
You are free to use, share, and adapt the model for non-commercial purposes with attribution.
Data Licenses
- Gazetteers of villages and tribes: compiled by MWirelabs (open reference use).
- Festivals, tourist sites, and names: curated by MWirelabs team.
Please ensure attribution when reusing any derived dataset.
Citation
If you use this model in your research, please cite:
@misc{mwirelabs2025northeastner,
title = {NortheastNER: A Domain-Specific Named Entity Recognition Model for Northeast India},
author = {MWirelabs},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/MWirelabs/NortheastNER}},
}
Limitations
- Low support for TOURIST and FAUNA classes (few examples).
- NAMES entity class trained but not evaluated due to lack of dev set coverage.
- Possible confusion between TRIBES and PLACES where names overlap (e.g., Garo).
- Model optimized for Northeast India texts; performance outside this domain may degrade.
Future Work
- Add more gold-labeled examples for underrepresented classes (NAMES, FAUNA, TOURIST).
- Explore active learning to identify low-confidence predictions for manual annotation.
- Expand coverage of festivals and indigenous knowledge domains.
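One simple way the active-learning item above might start is least-confidence sampling: rank sentences by their weakest predicted span and send the lowest-ranked ones to annotators. The sketch below is an assumption about how that could look, not a description of MWirelabs' plans; the `least_confident` helper, the example sentences, and their scores are made up for illustration.

```python
def least_confident(batch, top_k=2):
    """Return the top_k sentences whose weakest predicted span scores lowest.

    batch: list of (sentence, predictions) pairs, where predictions is the
    list of dicts returned by the NER pipeline. Sentences with no predicted
    spans are skipped in this sketch.
    """
    scored = [
        (min(p["score"] for p in preds), sentence)
        for sentence, preds in batch
        if preds
    ]
    scored.sort(key=lambda pair: pair[0])  # lowest confidence first
    return [sentence for _, sentence in scored[:top_k]]


# Made-up sentences and scores for illustration only; not real model output.
batch = [
    ("Wangala festival is celebrated in Garo Hills near Tura.",
     [{"entity_group": "FESTIVALS", "word": "Wangala", "score": 0.99},
      {"entity_group": "PLACES", "word": "Tura", "score": 0.97}]),
    ("A hoolock gibbon was seen near Umiam Lake.",
     [{"entity_group": "FAUNA", "word": "hoolock gibbon", "score": 0.41},
      {"entity_group": "TOURIST", "word": "Umiam Lake", "score": 0.62}]),
]

print(least_confident(batch, top_k=1))
# ['A hoolock gibbon was seen near Umiam Lake.']
```

Sentences where the model predicts nothing at all can also hide missed entities, so a fuller selection strategy would likely sample from those as well.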
About
This model is developed by MWirelabs, pioneering AI solutions for the rich cultural and linguistic diversity of Northeast India. Contact: MWirelabs
Evaluation results
- Overall F1 on Custom Northeast India Gazetteers + News Corpus (self-reported): 0.964
- Precision on Custom Northeast India Gazetteers + News Corpus (self-reported): 0.962
- Recall on Custom Northeast India Gazetteers + News Corpus (self-reported): 0.967