AetherMind_SRL: Self-Reflective Learning for Robust Natural Language Inference

Community Article Published November 21, 2025

By: Sameer S. Najm
Project: AetherMind — Advanced Reasoning System
Model Repo: https://huggingface.co/samerzaher80/AetherMind_SRL
Dataset Repo: https://huggingface.co/datasets/samerzaher80/NLI_DataSets


Overview

AetherMind_SRL is a knowledge-distilled, self-reflective Transformer designed for robust, adversarial, and clinical Natural Language Inference (NLI).
It integrates:

  • Knowledge Distillation (KD) from DeBERTa-v3-base
  • Self-Reflective Learning (SRL) loops
  • ANLI adversarial fine-tuning
  • ADNI clinical reasoning (Alzheimer’s domain)
  • Smart Error Buffers and structured hard-example mining

This model is the Round 12 SRL-ANLI Smart checkpoint. It reaches about 90% accuracy on SNLI, MNLI, and XNLI (en), 67–80% on adversarial ANLI, and 100% on the ADNI clinical splits.

AetherMind_SRL is part of the broader AetherMind project, a multi-year effort to build an adaptive reasoning engine with human-like error correction.


What Makes AetherMind_SRL Unique?

1. Knowledge Distillation Core

The student is a compact, efficient version of a DeBERTa-v3-base teacher.

2. Self-Reflective Learning (SRL) Engine

SRL is a supervision-driven self-improvement loop:

  1. The model predicts on ANLI + ADNI
  2. It logs every prediction, both errors and correct ones
  3. It builds a structured error buffer from the logs
  4. It retrains on the buffered samples with their gold labels
  5. It repeats the cycle until performance stabilizes
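
The loop above can be sketched in a few lines. This is a minimal illustration, not the project's training code: `srl_loop`, the `predict` and `retrain` callbacks, `max_rounds`, and the `tolerance` stopping rule are all hypothetical names and choices, and "stability" is assumed here to mean that accuracy stops changing between rounds.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]  # assumed keys: "premise", "hypothesis", "label"


def srl_loop(
    examples: List[Example],
    predict: Callable[[Example], str],
    retrain: Callable[[List[Example]], None],
    max_rounds: int = 12,
    tolerance: float = 0.001,
) -> List[float]:
    """Run predict -> log -> buffer -> retrain until accuracy plateaus.

    `examples` must be non-empty. Returns the accuracy measured each round.
    """
    history: List[float] = []
    for _ in range(max_rounds):
        # Steps 1-2: predict on every example and log the outcome.
        logs = [{**ex, "pred": predict(ex)} for ex in examples]
        # Step 3: the error buffer keeps misclassified examples with gold labels.
        error_buffer = [
            {k: row[k] for k in ("premise", "hypothesis", "label")}
            for row in logs
            if row["pred"] != row["label"]
        ]
        accuracy = 1.0 - len(error_buffer) / len(examples)
        history.append(accuracy)
        # Step 5: stop when nothing is left to fix or accuracy has plateaued.
        if not error_buffer:
            break
        if len(history) > 1 and abs(history[-1] - history[-2]) < tolerance:
            break
        # Step 4: retrain on the buffered examples.
        retrain(error_buffer)
    return history
```

In practice `predict` would wrap the student model and `retrain` would run a short fine-tuning pass; both are left as callbacks so the control flow stays visible.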

3. Smart Error Buffer (Round 12)

A carefully engineered dataset built from ANLI R1/R2/R3 + corrected SRL samples:

  • Label-controlled (40% E / 30% N / 30% C)
  • Error-heavy (60%) + Anchor samples (40%)
  • Includes negation, multi-hop, lexical overlap challenges
  • Prevents catastrophic forgetting

4. Clinical-Grade Reasoning

The model is aligned with Alzheimer’s NLI tasks (MMSE claims), achieving perfect scores on ADNI Val/Test.


Full SRL Pipeline

Step 1: ANLI Global Error Mining

Model evaluated on:

  • ANLI R1 Dev
  • ANLI R2 Dev
  • ANLI R3 Dev

For each example, logs include:

  • premise, hypothesis
  • gold label
  • predicted label
  • logits + confidence
  • error flag
  • pattern category (if detected)
  • reason for failure (negation, overlap, etc.)

These logs are merged into:

global_error_buffer_anli_round12_train.csv
global_error_buffer_anli_round12_val.csv
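
A minimal sketch of how one such log row could be assembled and serialized follows. The column names, the label order in `LABELS`, and the helpers `make_log_row` and `rows_to_csv` are assumptions for illustration; the repository scripts may use a different schema. Confidence is taken here to be the softmax probability of the predicted class.

```python
import csv
import io
import math
from typing import Dict, List, Sequence

LABELS = ["entailment", "neutral", "contradiction"]  # assumed index order


def make_log_row(
    premise: str, hypothesis: str, gold: str, logits: Sequence[float]
) -> Dict[str, object]:
    """Build one log record from raw logits."""
    # Numerically stable softmax over the three logits.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred_idx = max(range(len(probs)), key=probs.__getitem__)
    pred = LABELS[pred_idx]
    return {
        "premise": premise,
        "hypothesis": hypothesis,
        "gold_label": gold,
        "pred_label": pred,
        "logits": " ".join(f"{x:.4f}" for x in logits),
        "confidence": round(probs[pred_idx], 4),
        "is_error": int(pred != gold),
    }


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize log rows to CSV text (write it to a file to persist)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

The pattern category and failure reason columns are omitted here because they are filled in by the classification step that follows.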

Step 2: Error Pattern Classification

| Pattern Category | Description | Frequency |
| --- | --- | --- |
| long_premise_multi_hop | Multi-step logic across long sentences | ~28% |
| negation_confusion | Misinterprets “not”, “never”, “no longer” | ~22% |
| lexical_overlap_confusion | Wrongly assumes entailment due to word overlap | ~18% |
| neutral_confusion_other | Needs subtle contextual/world knowledge | ~15% |
| Other (temporal, numeric) | More advanced cognitive reasoning | ~17% |

These patterns define next-step training priorities.
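
The article does not say how errors are assigned to these categories. One plausible approach is a set of simple rules, sketched below. Everything in this block is an assumption made for illustration: the thresholds, the order in which rules are checked, the tokenizer, and the function name `categorize_error`.

```python
import re
from typing import List

LONG_PREMISE_TOKENS = 60   # assumed cutoff for a "long" premise
OVERLAP_THRESHOLD = 0.8    # assumed share of hypothesis words found in premise


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer that keeps ASCII apostrophes (e.g. "don't")."""
    return re.findall(r"[a-z0-9']+", text.lower())


def has_negation(tokens: List[str]) -> bool:
    """Detect "not", "never", contractions ending in n't, and "no longer"."""
    if any(t in {"not", "never"} or t.endswith("n't") for t in tokens):
        return True
    return any(a == "no" and b == "longer" for a, b in zip(tokens, tokens[1:]))


def categorize_error(premise: str, hypothesis: str, gold: str, pred: str) -> str:
    """Assign a misclassified pair to one pattern category (rule-based guess)."""
    p_tokens, h_tokens = tokenize(premise), tokenize(hypothesis)
    if len(p_tokens) >= LONG_PREMISE_TOKENS:
        return "long_premise_multi_hop"
    if has_negation(p_tokens + h_tokens):
        return "negation_confusion"
    h_vocab = set(h_tokens)
    overlap = len(h_vocab & set(p_tokens)) / len(h_vocab) if h_vocab else 0.0
    if overlap >= OVERLAP_THRESHOLD and pred == "entailment" and gold != "entailment":
        return "lexical_overlap_confusion"
    if "neutral" in (gold, pred):
        return "neutral_confusion_other"
    return "other"
```

Because the rules are checked in order, a long premise that also contains a negation is counted as multi-hop. A real pipeline might allow several tags per example instead.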


Step 3: SMART Training Buffer Construction

SMART = Structured Misclassification-Aware Retraining Technique

Buffer Stats:

  • 1787 training examples
  • 448 validation examples
  • 60% error samples
  • 40% anchor samples
  • Label mix: 40% entailment (E) / 30% neutral (N) / 30% contradiction (C)
  • Sequence length: 192

Anchors stabilize behavior on SNLI/MNLI.
Errors provide adversarial pressure.
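
The 60/40 error-to-anchor mix can be sketched as a simple sampling routine. This is an illustration under stated assumptions: `build_smart_buffer` and its parameters are hypothetical, sampling is uniform within each pool, and the 40/30/30 label stratification is not shown.

```python
import random
from typing import Dict, List


def build_smart_buffer(
    errors: List[Dict],
    anchors: List[Dict],
    size: int,
    error_fraction: float = 0.6,
    seed: int = 0,
) -> List[Dict]:
    """Draw a shuffled buffer with the requested share of error samples."""
    rng = random.Random(seed)  # fixed seed keeps the buffer reproducible
    n_errors = round(size * error_fraction)
    n_anchors = size - n_errors
    if n_errors > len(errors) or n_anchors > len(anchors):
        raise ValueError("not enough examples to fill the requested mix")
    # Sample without replacement from each pool, then interleave by shuffling.
    buffer = rng.sample(errors, n_errors) + rng.sample(anchors, n_anchors)
    rng.shuffle(buffer)
    return buffer
```

With `size=1787` and the default fraction, this yields 1072 error samples and 715 anchors.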


Step 4: Fine-Tuning Strategy (Round 12 Smart)

  • Epochs: 1
  • LR: 1e-6 to 2e-6
  • Optimizer: AdamW
  • Loss: Cross-Entropy + Class Weights
  • Weight Boost:
    • Historically hard classes (Neutral, Contradiction) receive higher class weights
    • Error-flagged samples receive ×2–×3 weight
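
To make the weighting concrete, here is a framework-free sketch of cross-entropy with both class weights and an error-flag boost. The class weight values, the label index order, and the choice to normalize by the sum of weights are assumptions; only the ×2–×3 boost range comes from the text above.

```python
import math
from typing import List, Sequence, Tuple

# Assumed index order: 0 = entailment, 1 = neutral, 2 = contradiction.
CLASS_WEIGHTS = [1.0, 1.5, 1.5]  # hypothetical values favoring hard classes
ERROR_BOOST = 2.0                # within the x2-x3 range stated above


def weighted_cross_entropy(
    batch: List[Tuple[Sequence[float], int, bool]]
) -> float:
    """Weighted mean of per-sample negative log-likelihoods.

    Each batch item is (logits, gold_index, is_error_sample).
    """
    total, weight_sum = 0.0, 0.0
    for logits, gold, is_error in batch:
        # Stable log-sum-exp gives the log of the softmax normalizer.
        shift = max(logits)
        log_z = shift + math.log(sum(math.exp(x - shift) for x in logits))
        nll = log_z - logits[gold]
        weight = CLASS_WEIGHTS[gold] * (ERROR_BOOST if is_error else 1.0)
        total += weight * nll
        weight_sum += weight
    return total / weight_sum
```

Dividing by the sum of weights keeps the loss scale comparable across batches with different numbers of boosted samples, which matters at learning rates as small as 1e-6.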

Base checkpoint:

student_biomed_kd_fast\adni_srl_round11_smart

Step 5: ADNI Clinical SRL Loop (Special Domain)

Pipeline:

  1. Run on ADNI Cognitive NLI
  2. Extract ADNI-specific errors
  3. Boost memory-related reasoning:
    • Temporal sequences
    • Cognitive score changes
    • Decline/stability patterns
  4. Weighted cross-entropy (per-sample weights):
    • Correctly predicted samples = 1.0
    • Error samples = 3.0
  5. Repeat 2–3 micro-rounds
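
Written out, the weighted loss in step 4 might take the following form. The normalization by the sum of weights is an assumption; the article only states the two weight values.

```latex
\mathcal{L}_{\text{ADNI}}
  = \frac{\sum_{i=1}^{N} w_i \,\bigl(-\log p_\theta(y_i \mid x_i)\bigr)}
         {\sum_{i=1}^{N} w_i},
\qquad
w_i =
\begin{cases}
3.0 & \text{if sample } i \text{ was misclassified in the previous pass} \\
1.0 & \text{otherwise}
\end{cases}
```

As a worked example, in a batch of 9 correct samples and 1 error sample, the error sample carries 3.0 / (9 × 1.0 + 3.0) = 25% of the total weight instead of the 10% it would get unweighted.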

Result: 100% accuracy on ADNI Val/Test without destabilizing SNLI/MNLI/ANLI.


Evaluation Results — Round 12 (Final)

General NLI

| Dataset | Accuracy | Macro F1 | Samples |
| --- | --- | --- | --- |
| SNLI | 89.64% | 89.55% | 9824 |
| MNLI-M | 90.20% | 90.00% | 9815 |
| MNLI-MM | 89.61% | 89.35% | 9832 |
| XNLI (en) | 90.36% | 90.32% | 2490 |

Adversarial NLI (ANLI)

| Dataset | Accuracy | Macro F1 |
| --- | --- | --- |
| ANLI R1 | 79.90% | 79.89% |
| ANLI R2 | 67.50% | 67.35% |
| ANLI R3 | 67.33% | 66.81% |

Clinical NLI (ADNI)

| Split | Accuracy | Macro F1 |
| --- | --- | --- |
| Train | 100% | 100% |
| Val | 100% | 100% |
| Test | 100% | 100% |
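
For readers who want to check the two metrics reported in these tables, the sketch below computes accuracy and macro F1 from gold and predicted labels. The function name `accuracy_and_macro_f1` is hypothetical, and this is not the code in evaluate_model_hf_only.py. Macro F1 is the unweighted mean of per-class F1, so each class counts equally regardless of its frequency.

```python
from typing import Hashable, Sequence, Tuple


def accuracy_and_macro_f1(
    gold: Sequence[Hashable],
    pred: Sequence[Hashable],
    labels: Sequence[Hashable],
) -> Tuple[float, float]:
    """Return (accuracy, macro F1) as fractions in [0, 1]."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)
```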

Round-4 → Round-5 SRL Improvements (From Notes)

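| Dataset | Acc R4 (%) | Acc R5 (%) | F1 R4 (%) | F1 R5 (%) |
| --- | --- | --- | --- | --- |
| SNLI | 90.1 | 92.4 | 90.0 | 92.3 |
| MNLI-M | 84.5 | 86.7 | 84.2 | 86.0 |
| MNLI-MM | 84.0 | 86.0 | 83.8 | 85.5 |
| ANLI R1 | 62.0 | 65.0 | 61.5 | 64.0 |
| ANLI R2 | 47.0 | 49.0 | 46.5 | 48.0 |
| ANLI R3 | 45.0 | 47.0 | 44.0 | 46.0 |
| XNLI | 78.0 | 80.0 | 77.0 | 79.0 |
| ADNI | 83.0 | 85.0 | 82.0 | 84.0 |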

Repository Contents

Included scripts:

build_anli_global_error_buffer_round1.py
analyze_anli_errors_round1.py
evaluate_model_hf_only.py
srl_finetune_round5_smart.py

These implement the SRL-ANLI training engine.


Usage Example

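from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "samerzaher80/AetherMind_SRL"

# Use the GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
model.eval()  # make sure dropout is off for inference

premise = "The patient scored 28 on the MMSE last year."
hypothesis = "The patient shows signs of cognitive decline."

# Encode the pair, truncating to the 192-token length used for the training buffer.
inputs = tokenizer(
    premise, hypothesis, return_tensors="pt", truncation=True, max_length=192
).to(device)
with torch.no_grad():
    logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1).item()

# Label order as given in this article; check model.config.id2label to confirm.
print(["entailment", "neutral", "contradiction"][prediction])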

Intended Use

  • Research on self-reflective NLI
  • Adversarial reasoning (ANLI)
  • Clinical NLP (Alzheimer’s NLI)
  • Robust text understanding for downstream tasks

Limitations

  • English-only
  • ANLI remains challenging: accuracy on R2 and R3 is about 67%
  • Clinical generalization beyond ADNI not guaranteed

Acknowledgments

Thanks to:

  • Hugging Face
  • Open-source research community
  • ADNI dataset contributors
  • Supporters of the AetherMind project

Repos

Model: https://huggingface.co/samerzaher80/AetherMind_SRL
Dataset: https://huggingface.co/datasets/samerzaher80/NLI_DataSets
