# 📚 Talk2Ref Cited Paper Encoder
This model encodes scientific papers (titles, abstracts, and publication years) into dense embeddings for Reference Prediction from Talks (RPT) within the Talk2Ref framework.
It serves as the key-side encoder in a dual-encoder (DPR-style) retrieval setup, paired with the Talk2Ref Query Talk Encoder.
## 🎯 Usage
Example with `transformers` (mean pooling is applied manually, matching the pooling setting listed below):

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("s8frbroy/talk2ref_ref_key_cited_paper_encoder")
model = AutoModel.from_pretrained("s8frbroy/talk2ref_ref_key_cited_paper_encoder")

# Example input
title = "Attention Is All You Need"
year = 2017
abstract = "The Transformer model replaces recurrence with attention mechanisms for ..."

# Build the key text in Talk2Ref format
key_text = f"Title: {title}. Published in {year}. Abstract: {abstract}"

# Tokenize and run the encoder
inputs = tokenizer([key_text], padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings over the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embedding.shape)  # (1, hidden_dim); 384 for the MiniLM backbone
```
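Because the backbone is a Sentence-BERT model, the checkpoint should also be loadable with the `sentence-transformers` library, which handles tokenization, truncation, and mean pooling internally (assuming the repository ships the usual sentence-transformers configuration files):

```python
from sentence_transformers import SentenceTransformer

# encode() tokenizes, truncates to the model's max length, and mean-pools internally
model = SentenceTransformer("s8frbroy/talk2ref_ref_key_cited_paper_encoder")
embedding = model.encode([key_text])
print(embedding.shape)  # (1, 384)
```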
## 🧩 Model Overview
| Property | Description |
|---|---|
| Architecture | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| Pooling | Mean pooling |
| Max sequence length | 512 tokens |
| Training data | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| Objective | Contrastive binary (DPR-style) loss |
| Task | Encode cited papers into a shared semantic space with talk transcripts |
## 🧠 Input Features
| Feature | Description |
|---|---|
| Title | Title of the cited paper |
| Abstract | Abstract text content |
| Year | Publication year |
Together, the title, year, and abstract are short enough to fit within the model's 512-token limit, so no chunking is required.
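As a quick sanity check (illustrative, not from the original card), the model's tokenizer can confirm that a key text fits:

```python
from transformers import AutoTokenizer

# Illustrative check: typical title + abstract key texts stay under 512 tokens
tokenizer = AutoTokenizer.from_pretrained("s8frbroy/talk2ref_ref_key_cited_paper_encoder")
key_text = "Title: Attention Is All You Need. Published in 2017. Abstract: ..."
n_tokens = len(tokenizer(key_text)["input_ids"])
print(n_tokens, n_tokens <= 512)
```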
## 🧮 Training Setup
The cited-paper encoder was trained jointly with the query-talk encoder under a dual-encoder contrastive framework inspired by Dense Passage Retrieval (Karpukhin et al., 2020).
Each talk $T_i$ and paper $R_j$ is encoded into an embedding $f_T(T_i)$ and $f_R(R_j)$, respectively.
Their dot-product similarity $s_{ij} = f_T(T_i) \cdot f_R(R_j)$ is optimized with a sigmoid-based binary loss that supports multiple positives per query:

$$
\mathcal{L} = -\sum_{i,j} \Bigl[ y_{ij} \log \sigma(s_{ij}) + (1 - y_{ij}) \log\bigl(1 - \sigma(s_{ij})\bigr) \Bigr]
$$

where $y_{ij} = 1$ if talk $T_i$ cites paper $R_j$ and $0$ otherwise.
Negatives are sampled in-batch from other talk–paper pairs.
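To make the objective concrete, here is a minimal PyTorch sketch of this in-batch binary loss; the function name, batch construction, and label layout are illustrative rather than taken from the Talk2Ref code:

```python
import torch
import torch.nn.functional as F

def in_batch_binary_loss(talk_emb, paper_emb, labels):
    """talk_emb: (B, d) embeddings f_T(T_i); paper_emb: (B, d) embeddings
    f_R(R_j); labels: (B, B) matrix with y_ij = 1 for cited talk-paper pairs."""
    scores = talk_emb @ paper_emb.T  # s_ij = f_T(T_i) . f_R(R_j)
    # BCE-with-logits computes -[y * log sigma(s) + (1 - y) * log(1 - sigma(s))] per pair
    return F.binary_cross_entropy_with_logits(scores, labels, reduction="sum")

# Toy usage: diagonal entries are positives, off-diagonal pairs act as in-batch negatives
B, d = 4, 384
talks, papers = torch.randn(B, d), torch.randn(B, d)
labels = torch.eye(B)  # extra 1s off the diagonal would mark multiple positives per talk
loss = in_batch_binary_loss(talks, papers, labels)
```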
Before the contrastive training, a domain-adaptation stage aligned each talk with its own paper's abstract to adapt the encoders to scientific and spoken-language data.
## Citation

If you use this model, please cite the following paper:
```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title         = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author        = {Frederik Broy and Maike Züfle and Jan Niehues},
  year          = {2025},
  eprint        = {2510.24478},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.24478}
}
```