Hebrew Manuscript Genre Classifier

A multi-label sentence classifier that predicts one or more MARC 655 genre/form values from a manuscript's title (MARC 245) plus general notes (MARC 500). Trained via distant supervision on ~26k records where MARC 655 was already populated.

Built for the Mapping Hebrew Manuscripts (MHM) pipeline (Bar-Ilan University) as a fallback for the ~31% of National Library of Israel manuscript records that have no MARC 655 genre headings.

Quick stats

Base: dicta-il/dictabert (warm-started from the provenance NER checkpoint)
Architecture: DictaBERT [CLS] → Dropout(0.3) → Linear(768 → 9) → sigmoid
Classes: 8 genre classes + 1 NOTA (none-of-the-above)
Threshold: 0.65 (a single value applied to each class's probability; tuned per fold)
micro-F1 (best fold): 0.9206
micro-F1 (mean over folds): 0.9180
Training samples: 25,421 records (≥100 examples per class) + 1,629 NOTA
Max length: 64 tokens (sliding window for longer text)
Validation: 5-fold stratified CV
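
The decision threshold listed above is chosen, according to the training details further down, by scanning [0.20, 0.80] in steps of 0.05 and keeping the value with the best validation micro-F1. The following is a minimal plain-Python sketch of such a scan, not the project's training code. The function names, the list-of-rows data layout, and the tie-breaking rule (the lowest threshold wins) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def micro_f1(y_true: Sequence[Sequence[int]],
             y_prob: Sequence[Sequence[float]],
             threshold: float) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all records and classes."""
    tp = fp = fn = 0
    for truth_row, prob_row in zip(y_true, y_prob):
        for truth, prob in zip(truth_row, prob_row):
            predicted = prob >= threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def scan_threshold(y_true: Sequence[Sequence[int]],
                   y_prob: Sequence[Sequence[float]],
                   lo: float = 0.20, hi: float = 0.80,
                   step: float = 0.05) -> Tuple[float, float]:
    """Return (best_threshold, best_micro_f1) over the grid lo, lo+step, ..., hi."""
    n_steps = int(round((hi - lo) / step))
    best_t, best_f1 = lo, -1.0
    for k in range(n_steps + 1):
        t = round(lo + k * step, 2)  # avoid float drift such as 0.35000000000000003
        f1 = micro_f1(y_true, y_prob, t)
        if f1 > best_f1:  # strict ">" keeps the lowest threshold on ties (assumption)
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy validation fold: 3 records, 2 classes (illustrative values only).
y_true = [[1, 0], [0, 1], [1, 1]]
y_prob = [[0.9, 0.3], [0.4, 0.7], [0.6, 0.55]]
best_t, best_f1 = scan_threshold(y_true, y_prob)
```

Because micro-F1 pools counts across classes, frequent classes dominate the selected threshold.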

Genre classes

id label
0 Piyyutim (liturgical poetry)
1 Poetry
2 Illustrated works (Manuscript)
3 Personal correspondence
4 Censored manuscripts
5 Autograph manuscripts
6 Records (Documents)
7 Bibliographies
8 __NOTA__ (explicit "none of the above")

The NOTA class lets the model abstain rather than over-confidently pick one of the 8 classes for an out-of-vocabulary genre.

How to use

from huggingface_hub import hf_hub_download
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

REPO = "alexgoldberg/hebrew-manuscript-genre-classifier"
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, "genre_classifier_model.pt"),
                  map_location="cpu", weights_only=False)
label2id = ckpt["genre_label2id"]
id2label = {v: k for k, v in label2id.items()}
threshold = ckpt["threshold"]
max_len = ckpt["max_length"]

class GenreModel(nn.Module):
    def __init__(self, base, n_classes):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)
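    def forward(self, input_ids, attention_mask):
        # Use the final-layer [CLS] embedding as the representation of the whole input.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]
        return self.classifier(self.dropout(cls))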

BASE = "dicta-il/dictabert"
tok = AutoTokenizer.from_pretrained(BASE)
model = GenreModel(BASE, n_classes=len(label2id))
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

text = "ספר תהלים. כתב יד עברי, ניקוד תימני."
enc = tok(text, max_length=max_len, padding="max_length",
          truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(enc["input_ids"], enc["attention_mask"]))
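# Keep every class at or above the threshold. __NOTA__ is filtered out,
# so an empty list means the model abstains.
fired = [(id2label[i], round(float(p), 4))
         for i, p in enumerate(probs[0])
         if p >= threshold and id2label[i] != "__NOTA__"]
print(fired)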

For text longer than 64 tokens use sliding-window inference (stride=32, average sigmoid probabilities across windows). See examples.py.
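
The windowing and averaging step can be sketched in plain Python as follows. This is an illustrative sketch, not the contents of examples.py: the helper names are hypothetical, `score_window` stands in for "run the model on one window and return its sigmoid probabilities", and the token ids are assumed to exclude special tokens, so a real caller would reserve room for [CLS] and [SEP] inside the 64-token budget.

```python
from typing import Callable, List, Sequence


def sliding_windows(token_ids: Sequence[int], max_len: int = 64,
                    stride: int = 32) -> List[List[int]]:
    """Split token ids into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    ids = list(token_ids)
    if len(ids) <= max_len:
        return [ids]
    windows = []
    start = 0
    while True:
        windows.append(ids[start:start + max_len])
        if start + max_len >= len(ids):  # this window reached the end of the text
            break
        start += stride
    return windows


def average_probs(per_window: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across windows (one row per window)."""
    n = len(per_window)
    return [sum(column) / n for column in zip(*per_window)]


def predict_long(token_ids: Sequence[int],
                 score_window: Callable[[List[int]], List[float]],
                 max_len: int = 64, stride: int = 32) -> List[float]:
    """Score each window with score_window, then average the probabilities."""
    windows = sliding_windows(token_ids, max_len, stride)
    return average_probs([score_window(w) for w in windows])


# 100 tokens with max_len=64 and stride=32 give windows starting at 0, 32 and 64.
windows = sliding_windows(list(range(100)))
```

The averaged probabilities are then compared against the threshold exactly as in the single-window case. Note that the final window may be shorter than the others and still counts equally in the average.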

Real input/output examples

Each example was produced by running the model on a real National Library of Israel manuscript record (MARC 245 title + 500 general notes, truncated for readability).

Example 1 — NLI 990000415290205171

MARC 245 (title): תשובות ופסקים.

MARC 500 (notes): קובץ גדול של שו"ת ופסקים מחכמי ארץ-ישראל והמזרח, תורכיה איטליה ועוד, מן המאות טז-יז, רובם אוטוגרפים, ועם חתימות רבות.|התשובות שבס' זרע אנשים, הוסיאטין תרס"ב, לקוחות ברובן מקובץ זה.|בראשו מפתח חלקי, …

Cataloger's MARC 655 label (gold reference): Autograph manuscripts

Top-3 raw scores (sigmoid probabilities):

Rank Class Probability
1 __NOTA__ 0.5837
2 Records (Documents) 0.5669
3 Autograph manuscripts 0.5641

Predictions above threshold (0.65): (none — model abstains)

Example 2 — NLI 990001801390205171

MARC 245 (title): קובץ.

MARC 500 (notes): עם ציורים וקישוטים.|מדף 298-473ב רשימות משפחתיות, בין השאר משנת "של"ח".|Photo © The Israel Museum, Jerusalem, by Ardon Bar-Hama|לתוכן וקישור לפריטים בקובץ ראה למטה.|נושא נוסף: ציורים. ציורים וקישוט…

Cataloger's MARC 655 label (gold reference): Family records|Illustrated works (Manuscript)

Top-3 raw scores (sigmoid probabilities):

Rank Class Probability
1 Illustrated works (Manuscript) 0.9011
2 Records (Documents) 0.4876
3 Bibliographies 0.2549

Predictions above threshold (0.65): Illustrated works (Manuscript) (0.901)

Example 3 — NLI 990000554170205171

MARC 245 (title): קובץ.

MARC 500 (notes): לתוכן וקישור לפריטים בקובץ ראה למטה.|נושא נוסף: כתב-יד. מכירה. תל"ה

Top-3 raw scores (sigmoid probabilities):

Rank Class Probability
1 Bibliographies 0.4277
2 __NOTA__ 0.3930
3 Personal correspondence 0.3760

Predictions above threshold (0.65): (none — model abstains)

Example 4 — NLI 990000569910205171

MARC 245 (title): קובץ.

MARC 500 (notes): לתוכן וקישור לפריטים בקובץ ראה למטה.|בראשו: חרוזים ושרבוטים שונים על ידי אחד הבעלים.

Top-3 raw scores (sigmoid probabilities):

Rank Class Probability
1 Poetry 0.6716
2 Illustrated works (Manuscript) 0.5198
3 Bibliographies 0.2277

Predictions above threshold (0.65): Poetry (0.672)

Example 5 — NLI 990000531620205171

MARC 245 (title): קובץ בקבלה.

MARC 500 (notes): לתוכן וקישור לפריטים בקובץ ראה למטה.|נושא נוסף: כתב-יד. מכירה. רכ"ה

Top-3 raw scores (sigmoid probabilities):

Rank Class Probability
1 __NOTA__ 0.4928
2 Bibliographies 0.4108
3 Personal correspondence 0.3536

Predictions above threshold (0.65): (none — model abstains)

Training details

  • Encoder: dicta-il/dictabert, warm-started from the provenance NER checkpoint (already domain-adapted on 12k Hebrew manuscript samples).
  • Frozen layers: bottom 10 of 12 BERT transformer layers frozen; top 2 layers + classifier head fine-tuned with differential learning rates (encoder 2e-6, head 2e-5).
  • Loss: focal loss (γ = 2.0) with per-class pos_weight = n_neg / n_pos to counter class imbalance.
  • Threshold tuning: per-fold scan over [0.20, 0.80] step 0.05, selecting the threshold that maximizes validation micro-F1.
  • Distant supervision: training samples drawn from the 123k-record NLI catalog by whole-token Hebrew keyword matching against MARC 500 notes (see scripts/extract_genre_samples.py in the MHM pipeline).
  • Class size floor: classes with fewer than 100 examples were dropped to keep predictions well-calibrated.
  • NOTA: 1,629 records whose MARC 655 lay outside the top-8 classes were included as explicit "none of the above" examples.
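
The loss described in the list above can be illustrated for a single (probability, label) pair. This is a sketch under assumptions: the card gives only γ = 2.0 and pos_weight = n_neg / n_pos, so the exact way the two are combined here (pos_weight scaling only the positive term, a common convention) is a guess rather than the project's confirmed formula, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def pos_weights(labels: Sequence[Sequence[int]]) -> List[float]:
    """Per-class pos_weight = n_neg / n_pos, computed over multi-hot label rows."""
    n = len(labels)
    weights = []
    for column in zip(*labels):
        n_pos = sum(column)
        if n_pos == 0:
            raise ValueError("class has no positive examples")
        weights.append((n - n_pos) / n_pos)
    return weights


def focal_bce(p: float, y: int, pos_weight: float = 1.0,
              gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Focal binary cross-entropy for one class of one example.

    p is the sigmoid probability and y the 0/1 label. The (1 - p_t)^gamma
    factor shrinks the loss of confidently correct predictions.
    Assumption: pos_weight multiplies the positive term only.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
    if y == 1:
        return -pos_weight * (1.0 - p) ** gamma * math.log(p)
    return -(p ** gamma) * math.log(1.0 - p)


# Toy label matrix: 4 records, 2 classes (illustrative values only).
weights = pos_weights([[1, 0], [0, 0], [0, 1], [0, 1]])
easy = focal_bce(0.9, 1)   # confident and correct, so strongly down-weighted
hard = focal_bce(0.6, 1)   # less confident, so it keeps more of its loss
```

With gamma set to 0 the expression reduces to ordinary weighted binary cross-entropy, which makes the effect of the focusing term easy to inspect in isolation.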

Limitations

  • Distant-supervision label noise: training labels are derived from existing MARC 655 entries, which are not applied consistently across catalogers, so some training records carry wrong or incomplete labels.
  • Top-8 classes only: the model cannot name any genre outside Piyyutim, Poetry, Illustrated works (Manuscript), Personal correspondence, Censored manuscripts, Autograph manuscripts, Records (Documents), and Bibliographies. Such records should either score highest on __NOTA__ or fire no class at all; in the examples above, __NOTA__ itself stays below the threshold. Frequent genres that are missing (Halakha, Kabbalah, Liturgy) require a future model.
  • NOTA handling: when no class fires above threshold, the parent MHM pipeline writes no MARC 655 / Wikidata P136 claim. Consumers should not silently default to a "miscellaneous" label.
  • Catalog scope: trained on NLI MARC, not validated on Bodleian / Vatican / JTS / Rylands / etc.

Pipeline integration

In the MHM pipeline (converter/wikidata/item_builder.py) this model is consulted only when MARC 655 is empty. MARC-sourced genres take precedence and are emitted without a confidence qualifier; ML-inferred genres get P1480 = Q18122778 ("presumably") and a P887 = Q2539 ("machine learning") reference on the Wikidata claim. This makes the distinction visible to downstream curators.
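
The precedence rule can be sketched as below. This is a hypothetical illustration, not the code in converter/wikidata/item_builder.py: the function name and the dictionary layout are invented for the example, and genre labels stand in for the Wikidata item ids a real claim would carry. Only the property and item ids (P136, P1480, Q18122778, P887, Q2539) come from the text above.

```python
from typing import Dict, List, Sequence

GENRE = "P136"
PRESUMABLY = ("P1480", "Q18122778")       # qualifier: "presumably"
MACHINE_LEARNING = ("P887", "Q2539")      # reference: "machine learning"


def genre_claims(marc_genres: Sequence[str],
                 ml_genres: Sequence[str]) -> List[Dict]:
    """Build genre claims, letting MARC 655 values take precedence over ML output."""
    if marc_genres:
        # Cataloger-supplied genres: emitted without a confidence qualifier.
        return [{"property": GENRE, "value": g, "qualifiers": {}, "references": {}}
                for g in marc_genres]
    # MARC 655 is empty: fall back to the classifier and mark each claim as inferred.
    # An empty ml_genres list (the model abstained) yields no claims at all.
    return [{"property": GENRE, "value": g,
             "qualifiers": {PRESUMABLY[0]: PRESUMABLY[1]},
             "references": {MACHINE_LEARNING[0]: MACHINE_LEARNING[1]}}
            for g in ml_genres]


from_marc = genre_claims(["Poetry"], ["Bibliographies"])
from_model = genre_claims([], ["Illustrated works (Manuscript)"])
abstained = genre_claims([], [])
```

Keeping the qualifier and reference on every ML-derived claim means a curator can filter or review inferred genres without consulting the pipeline logs.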

Pre-deployment estimate: MARC 655 / Wikidata P136 coverage rises from 69% to ~85% when this model is enabled (CLAUDE.md Rule 34).

Citation

@software{mhm_genre_classifier_2025,
  author = {Goldberg, Alexander},
  title  = {Hebrew Manuscript Genre Classifier (multi-label, 8 + NOTA)},
  year   = {2025},
  url    = {https://huggingface.co/alexgoldberg/hebrew-manuscript-genre-classifier},
  note   = {Mapping Hebrew Manuscripts (MHM) Pipeline, Bar-Ilan University},
}

License

Apache-2.0. The base model dicta-il/dictabert is © DICTA, used here under its published license.

Acknowledgments

DICTA (DictaBERT), National Library of Israel (catalog + MARC 655 distant supervision labels), Bar-Ilan University (MHM project).
