Hebrew Manuscript Genre Classifier
A multi-label sentence classifier that predicts one or more MARC 655 genre/form values from a manuscript's title (MARC 245) plus general notes (MARC 500). Trained via distant supervision on ~26k records where MARC 655 was already populated.
Built for the Mapping Hebrew Manuscripts (MHM) pipeline (Bar-Ilan University) as a fallback for the ~31% of National Library of Israel manuscript records that have no MARC 655 genre headings.
Quick stats
| Property | Value |
|---|---|
| Base | dicta-il/dictabert (warm-started from provenance NER checkpoint) |
| Architecture | DictaBERT [CLS] → Dropout(0.3) → Linear(768 → 9) → sigmoid |
| Classes | 8 genre classes + 1 NOTA (none-of-the-above) |
| Threshold | 0.65 (per-class, tuned per fold) |
| micro-F1 (best fold) | 0.9206 |
| micro-F1 (mean fold) | 0.9180 |
| Training samples | 25,421 records (≥100 examples per class) + 1,629 NOTA |
| Max length | 64 tokens (sliding window for longer text) |
| Validation | 5-fold stratified CV |
Genre classes
| id | label |
|---|---|
| 0 | Piyyutim (liturgical poetry) |
| 1 | Poetry |
| 2 | Illustrated works (Manuscript) |
| 3 | Personal correspondence |
| 4 | Censored manuscripts |
| 5 | Autograph manuscripts |
| 6 | Records (Documents) |
| 7 | Bibliographies |
| 8 | __NOTA__ (explicit "none of the above") |
The NOTA class lets the model abstain rather than over-confidently pick one of the 8 classes for an out-of-vocabulary genre.
How to use
from huggingface_hub import hf_hub_download
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

REPO = "alexgoldberg/hebrew-manuscript-genre-classifier"

# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, "genre_classifier_model.pt"),
                  map_location="cpu", weights_only=False)
label2id = ckpt["genre_label2id"]
id2label = {v: k for k, v in label2id.items()}
threshold = ckpt["threshold"]
max_len = ckpt["max_length"]

class GenreModel(nn.Module):
    def __init__(self, base, n_classes):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # The [CLS] token embedding represents the whole input.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]
        return self.classifier(self.dropout(cls))

BASE = "dicta-il/dictabert"
tok = AutoTokenizer.from_pretrained(BASE)
model = GenreModel(BASE, n_classes=len(label2id))
model.load_state_dict(ckpt["model_state_dict"])
model.eval()  # disables dropout for inference

text = "ספר תהלים. כתב יד עברי, ניקוד תימני."
enc = tok(text, max_length=max_len, padding="max_length",
          truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(enc["input_ids"], enc["attention_mask"]))

# Report every class at or above the threshold; NOTA is never reported as a genre.
fired = [(id2label[i], round(float(p), 4))
         for i, p in enumerate(probs[0])
         if p >= threshold and id2label[i] != "__NOTA__"]
print(fired)
For text longer than 64 tokens use sliding-window inference (stride=32,
average sigmoid probabilities across windows). See examples.py.
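The sliding-window scheme described above can be sketched in plain Python. This is an illustration only, not the contents of examples.py: the helper names `sliding_windows` and `average_probs` are hypothetical, `score_window` stands in for "tokenize one window, run the model, apply sigmoid", and anchoring the final window to the end of the text is an assumption about how leftover tokens are handled. How the special [CLS]/[SEP] tokens count against the 64-token budget is also left to the caller here.

```python
from typing import Callable, List, Sequence


def sliding_windows(token_ids: Sequence[int], window: int = 64,
                    stride: int = 32) -> List[List[int]]:
    """Split token ids into overlapping windows of at most `window` tokens."""
    if len(token_ids) <= window:
        return [list(token_ids)]
    starts = list(range(0, len(token_ids) - window + 1, stride))
    # Assumption: cover any leftover tail with one extra window anchored at the end.
    if starts[-1] + window < len(token_ids):
        starts.append(len(token_ids) - window)
    return [list(token_ids[s:s + window]) for s in starts]


def average_probs(token_ids: Sequence[int],
                  score_window: Callable[[List[int]], List[float]],
                  window: int = 64, stride: int = 32) -> List[float]:
    """Average per-class sigmoid probabilities across all windows."""
    per_window = [score_window(w) for w in sliding_windows(token_ids, window, stride)]
    n_classes = len(per_window[0])
    return [sum(p[c] for p in per_window) / len(per_window)
            for c in range(n_classes)]


# Usage with a stub scorer (hypothetical; a real scorer would call the model).
def stub(window_ids: List[int]) -> List[float]:
    return [window_ids[0] / 100, 1.0]


avg = average_probs(list(range(100)), stub)
```

The averaged vector is then compared against the threshold exactly as in the single-window case.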
Real input/output examples
Each example was produced by running the model on a real National Library of Israel manuscript record (MARC 245 title + 500 general notes, truncated for readability).
Example 1 — NLI 990000415290205171
MARC 245 (title): תשובות ופסקים.
MARC 500 (notes): קובץ גדול של שו"ת ופסקים מחכמי ארץ-ישראל והמזרח, תורכיה איטליה ועוד, מן המאות טז-יז, רובם אוטוגרפים, ועם חתימות רבות.|התשובות שבס' זרע אנשים, הוסיאטין תרס"ב, לקוחות ברובן מקובץ זה.|בראשו מפתח חלקי, …
Cataloger's MARC 655 label (gold reference): Autograph manuscripts
Top-3 raw scores (sigmoid probabilities):
| Rank | Class | Probability |
|---|---|---|
| 1 | __NOTA__ | 0.5837 |
| 2 | Records (Documents) | 0.5669 |
| 3 | Autograph manuscripts | 0.5641 |
Predictions above threshold (0.65): (none — model abstains)
Example 2 — NLI 990001801390205171
MARC 245 (title): קובץ.
MARC 500 (notes): עם ציורים וקישוטים.|מדף 298-473ב רשימות משפחתיות, בין השאר משנת "של"ח".|Photo © The Israel Museum, Jerusalem, by Ardon Bar-Hama|לתוכן וקישור לפריטים בקובץ ראה למטה.|נושא נוסף: ציורים. ציורים וקישוט…
Cataloger's MARC 655 label (gold reference): Family records|Illustrated works (Manuscript)
Top-3 raw scores (sigmoid probabilities):
| Rank | Class | Probability |
|---|---|---|
| 1 | Illustrated works (Manuscript) | 0.9011 |
| 2 | Records (Documents) | 0.4876 |
| 3 | Bibliographies | 0.2549 |
Predictions above threshold (0.65): Illustrated works (Manuscript) (0.901)
Example 3 — NLI 990000554170205171
MARC 245 (title): קובץ.
MARC 500 (notes): לתוכן וקישור לפריטים בקובץ ראה למטה.|נושא נוסף: כתב-יד. מכירה. תל"ה
Top-3 raw scores (sigmoid probabilities):
| Rank | Class | Probability |
|---|---|---|
| 1 | Bibliographies | 0.4277 |
| 2 | __NOTA__ | 0.3930 |
| 3 | Personal correspondence | 0.3760 |
Predictions above threshold (0.65): (none — model abstains)
Example 4 — NLI 990000569910205171
MARC 245 (title): קובץ.
MARC 500 (notes): לתוכן וקישור לפריטים בקובץ ראה למטה.|בראשו: חרוזים ושרבוטים שונים על ידי אחד הבעלים.
Top-3 raw scores (sigmoid probabilities):
| Rank | Class | Probability |
|---|---|---|
| 1 | Poetry | 0.6716 |
| 2 | Illustrated works (Manuscript) | 0.5198 |
| 3 | Bibliographies | 0.2277 |
Predictions above threshold (0.65): Poetry (0.672)
Example 5 — NLI 990000531620205171
MARC 245 (title): קובץ בקבלה.
MARC 500 (notes): לתוכן וקישור לפריטים בקובץ ראה למטה.|נושא נוסף: כתב-יד. מכירה. רכ"ה
Top-3 raw scores (sigmoid probabilities):
| Rank | Class | Probability |
|---|---|---|
| 1 | __NOTA__ | 0.4928 |
| 2 | Bibliographies | 0.4108 |
| 3 | Personal correspondence | 0.3536 |
Predictions above threshold (0.65): (none — model abstains)
Training details
- Encoder: dicta-il/dictabert, warm-started from the provenance NER checkpoint (already domain-adapted on 12k Hebrew manuscript samples).
- Frozen layers: bottom 10 of 12 BERT transformer layers frozen; top 2 layers + classifier head fine-tuned with differential learning rates (encoder 2e-6, head 2e-5).
- Loss: focal loss (γ = 2.0) with per-class pos_weight = n_neg / n_pos to counter class imbalance.
- Threshold tuning: per-fold scan over [0.20, 0.80] step 0.05, selecting the threshold that maximizes validation micro-F1.
- Distant supervision: training samples drawn from the 123k-record NLI catalog by whole-token Hebrew keyword matching against MARC 500 notes (see scripts/extract_genre_samples.py in the MHM pipeline).
- Class size floor: classes with fewer than 100 examples were dropped to keep predictions well-calibrated.
- NOTA: 1,629 records whose MARC 655 lay outside the top-8 classes were included as explicit "none of the above" examples.
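The loss and threshold-tuning items above can be illustrated with a small, dependency-free sketch. It is not the training code: the function names are hypothetical, `focal_bce` uses one common way of combining a focal term with pos_weight (the exact formulation used in training is not stated here), and breaking ties in favor of the lowest threshold is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def focal_bce(p: float, y: int, pos_weight: float = 1.0,
              gamma: float = 2.0) -> float:
    """Focal binary cross-entropy for one class of one sample (assumed form)."""
    eps = 1e-12  # guards against log(0)
    if y == 1:
        # Positives are up-weighted by pos_weight = n_neg / n_pos.
        return -pos_weight * (1.0 - p) ** gamma * math.log(p + eps)
    return -(p ** gamma) * math.log(1.0 - p + eps)


def micro_f1(probs: Sequence[Sequence[float]], gold: Sequence[Sequence[int]],
             threshold: float) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all samples and classes."""
    tp = fp = fn = 0
    for row_p, row_y in zip(probs, gold):
        for p, y in zip(row_p, row_y):
            pred = p >= threshold
            if pred and y:
                tp += 1
            elif pred:
                fp += 1
            elif y:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def scan_threshold(probs: Sequence[Sequence[float]],
                   gold: Sequence[Sequence[int]],
                   lo: float = 0.20, hi: float = 0.80,
                   step: float = 0.05) -> Tuple[float, float]:
    """Return the threshold in [lo, hi] that maximizes validation micro-F1."""
    n_steps = int(round((hi - lo) / step))
    candidates: List[float] = [round(lo + i * step, 2) for i in range(n_steps + 1)]
    best = max(candidates, key=lambda t: micro_f1(probs, gold, t))
    return best, micro_f1(probs, gold, best)


# Toy validation fold (made-up numbers, for illustration only).
val_probs = [[0.9, 0.3], [0.6, 0.7], [0.4, 0.1]]
val_gold = [[1, 0], [0, 1], [0, 0]]
best_t, best_f1 = scan_threshold(val_probs, val_gold)
```

With γ = 0 and pos_weight = 1 the loss reduces to ordinary binary cross-entropy, which is a quick sanity check on the formula.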
Limitations
- Distant-supervision label noise: training labels are derived from MARC 655 entries which themselves vary in cataloger consistency. Some records will be slightly mislabeled.
- Top-8 classes only: anything outside Piyyutim, Poetry, Illustrated works (Manuscript), Personal correspondence, Censored manuscripts, Autograph manuscripts, Records (Documents), Bibliographies will be predicted as __NOTA__. Frequent missing genres (Halakha, Kabbalah, Liturgy) require a future model.
- NOTA handling: when no class fires above threshold, the parent MHM pipeline writes no MARC 655 / Wikidata P136 claim. Consumers should not silently default to a "miscellaneous" label.
- Catalog scope: trained on NLI MARC, not validated on Bodleian / Vatican / JTS / Rylands / etc.
Pipeline integration
In the MHM pipeline (converter/wikidata/item_builder.py) this model
is consulted only when MARC 655 is empty. MARC-sourced genres take
precedence and are emitted without a confidence qualifier; ML-inferred
genres get P1480 = Q18122778 ("presumably") and a P887 = Q2539
("machine learning") reference on the Wikidata claim. This makes the
distinction visible to downstream curators.
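The precedence rule described above might look roughly like the following. This is a hedged sketch, not the code in converter/wikidata/item_builder.py: the function name `build_genre_claims` and the simplified dictionary layout are hypothetical (it is not the Wikibase JSON format), and genre labels are used as placeholder values because the label-to-item mapping is not given here.

```python
from typing import Dict, List

P_GENRE = "P136"
P_QUALIFIER = "P1480"        # qualifier property named in the text
Q_PRESUMABLY = "Q18122778"   # "presumably"
P_REFERENCE = "P887"         # reference property named in the text
Q_MACHINE_LEARNING = "Q2539"  # "machine learning"


def build_genre_claims(marc_genres: List[str],
                       ml_genres: List[str]) -> List[Dict]:
    """Build simplified P136 claims; MARC 655 values always take precedence."""
    if marc_genres:
        # Cataloger-supplied genres: no confidence qualifier, no ML reference.
        return [{"property": P_GENRE, "value": g,
                 "qualifiers": {}, "references": []}
                for g in marc_genres]
    # MARC 655 is empty: fall back to model output, flagged as inferred.
    # An empty ml_genres list (model abstained) yields no claim at all.
    return [{"property": P_GENRE, "value": g,
             "qualifiers": {P_QUALIFIER: Q_PRESUMABLY},
             "references": [{P_REFERENCE: Q_MACHINE_LEARNING}]}
            for g in ml_genres]


from_marc = build_genre_claims(["Poetry"], ["Piyyutim (liturgical poetry)"])
from_model = build_genre_claims([], ["Poetry"])
abstained = build_genre_claims([], [])
```

Returning an empty list when the model abstains keeps the behavior in line with the NOTA handling rule: no claim is written rather than a default label.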
Pre-deployment estimate: MARC 655 / Wikidata P136 coverage rises from 69% to ~85% when this model is enabled (CLAUDE.md Rule 34).
Citation
@software{mhm_genre_classifier_2025,
author = {Goldberg, Alexander},
title = {Hebrew Manuscript Genre Classifier (multi-label, 8 + NOTA)},
year = {2025},
url = {https://huggingface.co/alexgoldberg/hebrew-manuscript-genre-classifier},
note = {Mapping Hebrew Manuscripts (MHM) Pipeline, Bar-Ilan University},
}
License
Apache-2.0. The base model dicta-il/dictabert is © DICTA, used here
under its published license.
Acknowledgments
DICTA (DictaBERT), National Library of Israel (catalog + MARC 655 distant supervision labels), Bar-Ilan University (MHM project).