BanglaT5-mHealth

Model Description

BanglaT5-mHealth is a Bengali paraphrasing model fine-tuned on the BanglaHealth dataset, an extensive collection of health-related Bengali sentences with high-quality paraphrase pairs. The model builds on the BanglaT5 text-to-text Transformer (a T5 model pretrained on Bengali) and adapts it to the low-resource, domain-specific context of Bengali healthcare content.

The model is designed to generate paraphrases that retain the semantic meaning of the original sentence while altering its lexical or syntactic form. It is suitable for downstream applications such as:

  • Medical chatbot development

  • Patient instruction simplification

  • Automatic content rewriting for Bengali health education

  • Dataset augmentation for Bengali NLP

  • Developed by: Faisal Ibn Aziz

  • Funded by: Military Institute of Science & Technology (MIST)

  • Model type: Text-to-text Transformer-based architecture

  • Language(s) (NLP): Bengali

  • License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

  • Finetuned from model: csebuetnlp/banglat5_banglaparaphrase (https://huggingface.co/csebuetnlp/banglat5_banglaparaphrase)

Training Details

| Field | Description |
| --- | --- |
| Base Model | csebuetnlp/banglat5_banglaparaphrase |
| Dataset | faisal4590aziz/bangla-health-related-paraphrased-dataset |
| Size | 200,000 sentence pairs |
| Fine-Tuning Strategy | Incremental domain-specific fine-tuning |
| Epochs | 5 (per incremental subset) |
| Batch Size | 12 |
| Learning Rate | 3e-5 |
| Loss Function | Cross-Entropy Loss |
| Decoding Strategy | Beam Search (beam width = 5) |
| License | CC BY 4.0 |

Model Sources

Using this model

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the trained model
model_name = 'faisal4590aziz/bangla-t5-mHealth'
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Function to generate paraphrase
def generate_paraphrase(model, tokenizer, sentence, max_length=128):
    input_ids = tokenizer.encode(sentence, return_tensors='pt')
    outputs = model.generate(input_ids, max_length=max_length, num_return_sequences=1)
    paraphrase = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return paraphrase

# Load the paraphrase dataset
from datasets import load_dataset
dataset = load_dataset("faisal4590aziz/bangla-health-related-paraphrased-dataset")

# Iterate over the rows of a split (a DatasetDict has no iterrows();
# the "train" split name is assumed here)
for row in dataset["train"]:
    paraphrase = generate_paraphrase(model, tokenizer, row["source_sentence"])
    print(paraphrase)
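
The Training Details table reports beam search with a beam width of 5 as the decoding strategy, whereas the snippet above relies on the default greedy decoding of generate(). A minimal variant that enables beam search is sketched below; the early_stopping flag is an assumption and is not taken from the original setup.

# Variant of generate_paraphrase() that uses beam search (width 5),
# matching the decoding strategy reported under Training Details.
def generate_paraphrase_beam(model, tokenizer, sentence, max_length=128, num_beams=5):
    input_ids = tokenizer.encode(sentence, return_tensors='pt')
    outputs = model.generate(
        input_ids,
        max_length=max_length,
        num_beams=num_beams,        # beam width = 5
        num_return_sequences=1,
        early_stopping=True,        # assumption; not specified in this card
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)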

📰 Related Publication

The dataset used to train this model is published in Data in Brief (Elsevier):

BanglaHealth: A Bengali paraphrase Dataset on Health Domain
Faisal Ibn Aziz, Dr. Mohammad Nazrul Islam, 2025

Training Data

Dataset: faisal4590aziz/bangla-health-related-paraphrased-dataset (the BanglaHealth dataset on the Hugging Face Hub).

Training Procedure

Incremental Training Approach

Training on the full dataset in a single run would require substantial computing power, so an "Incremental Training Approach" was introduced to develop the model. The dataset was divided into smaller parts, allowing the model to learn gradually without putting too much strain on the system. The procedure was as follows:

1. The complete dataset of 200,000 Bengali sentence pairs was divided into smaller, manageable batches of roughly 20,000 sentences each.

2. The first batch was used to fine-tune the base model, csebuetnlp/banglat5_banglaparaphrase, following standard training procedures.

3. Upon completion of this initial training, the model was saved and used as the starting point for the next iteration, in which the next batch of 20,000 sentences was introduced. This ensured that the model retained knowledge from previous batches while continually integrating new information.

4. The iterative process was repeated for each successive batch; every iteration started from the model trained on all previous batches, progressively refining and augmenting its capabilities with each new set of data.

5. After each batch, the updated model was saved and uploaded to the Hugging Face Hub, preserving the model's incremental progress so it could be reused for further training or evaluation.
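
The sketch below illustrates this incremental schedule with the Hugging Face Trainer API. It is an approximation for illustration only: the target column name ("target_sentence"), the split name, the shard ordering, and any hyperparameters not listed under Training Details are assumptions rather than details taken from the original training scripts.

# Illustrative sketch of the incremental fine-tuning loop described above.
from datasets import load_dataset
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

base_model = "csebuetnlp/banglat5_banglaparaphrase"
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False)  # slow sentencepiece tokenizer
model = T5ForConditionalGeneration.from_pretrained(base_model)

# Split name and target column are assumptions, not confirmed by this card.
dataset = load_dataset("faisal4590aziz/bangla-health-related-paraphrased-dataset", split="train")

def tokenize(batch):
    model_inputs = tokenizer(batch["source_sentence"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target_sentence"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

shard_size = 20_000  # one incremental subset of ~20,000 sentences
for shard_idx, start in enumerate(range(0, len(dataset), shard_size)):
    shard = dataset.select(range(start, min(start + shard_size, len(dataset))))
    tokenized = shard.map(tokenize, batched=True, remove_columns=shard.column_names)
    trainer = Seq2SeqTrainer(
        model=model,  # carries the weights learned from all previous shards
        args=Seq2SeqTrainingArguments(
            output_dir=f"banglat5-mhealth-shard-{shard_idx}",  # assumed path
            num_train_epochs=5,              # 5 epochs per incremental subset
            per_device_train_batch_size=12,  # batch size 12
            learning_rate=3e-5,              # learning rate 3e-5
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    trainer.save_model()  # checkpoint after each shard, as described above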

Benchmarking of Domain Specific Paraphrase Generation

Here, the base model is csebuetnlp/banglat5_banglaparaphrase and the proposed model is BanglaT5-mHealth (this model).

| Metric | Base Model | Proposed Model |
| --- | --- | --- |
| BLEU | 0.1919 | 0.4213 |
| ROUGE-1 (F1) | 0.5160 | 0.6924 |
| ROUGE-2 (F1) | 0.2761 | 0.4969 |
| ROUGE-L (F1) | 0.4903 | 0.6924 |
| METEOR | 0.5160 | 0.6924 |

Metrics

BLEU, ROUGE, METEOR
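
For reference, scores of this kind can be computed with the Hugging Face evaluate library. The sketch below is illustrative only: the prediction and reference strings are placeholders, and the exact test split and preprocessing used for the reported benchmark are not specified in this card.

# Minimal sketch: computing BLEU, ROUGE, and METEOR with the `evaluate` library.
# The prediction/reference strings are placeholders only.
import evaluate

predictions = ["generated paraphrase"]   # model outputs
references = ["reference paraphrase"]    # gold paraphrases

bleu = evaluate.load("bleu").compute(predictions=predictions,
                                     references=[[r] for r in references])
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)

print(bleu["bleu"], rouge["rouge1"], rouge["rouge2"], rouge["rougeL"], meteor["meteor"])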

Citation

If you use this model or dataset, please cite the following in your research.

@article{AZIZ2025111699,
  title = {BanglaHealth: A Bengali paraphrase Dataset on Health Domain},
  journal = {Data in Brief},
  pages = {111699},
  year = {2025},
  issn = {2352-3409},
  doi = {https://doi.org/10.1016/j.dib.2025.111699},
  url = {https://www.sciencedirect.com/science/article/pii/S2352340925004299},
  author = {Faisal Ibn Aziz and Muhammad Nazrul Islam},
  keywords = {Natural Language Processing (NLP), Paraphrasing, Bengali Paraphrasing, Bengali Language, Health Domain},
}

Model Card Authors

Faisal Ibn Aziz, Dr. Mohammad Nazrul Islam

Model Card Contact

[[email protected], [email protected]]
