📚 Talk2Ref Cited Paper Encoder

This model encodes scientific papers (titles, abstracts, and publication years) into dense embeddings for Reference Prediction from Talks (RPT) within the Talk2Ref framework.
It serves as the key-side encoder in a dual-encoder (DPR-style) retrieval setup, paired with the Talk2Ref Query Talk Encoder.



🎯 Usage

Example with transformers:

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "s8frbroy/talk2ref_ref_key_cited_paper_encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Example input
title = "Attention Is All You Need"
year = 2017
abstract = "The Transformer model replaces recurrence with attention mechanisms for ..."

# Build input in Talk2Ref format
key_text = f"Title: {title}. Published in {year}. Abstract: {abstract}"

# Tokenize and compute token embeddings
inputs = tokenizer([key_text], padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state

# Mean pooling over non-padding tokens (see Model Overview below)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (token_embeddings * mask).sum(1) / mask.sum(1)

print(embedding.shape)  # (1, hidden_dim)
```
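
For retrieval, candidate papers are typically encoded in batches and scored against a talk embedding by dot product (the DPR-style setup described under Training Setup below). A minimal sketch along those lines; the `encode` helper and batch size are illustrative, and in practice the talk embedding comes from the paired Query Talk Encoder rather than this model:

```python
def encode(texts, tokenizer, model, batch_size=32):
    """Mean-pooled embeddings for a list of key texts (illustrative helper)."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state
        mask = enc["attention_mask"].unsqueeze(-1).float()
        chunks.append((hidden * mask).sum(1) / mask.sum(1))
    return torch.cat(chunks)

paper_embeddings = encode([key_text], tokenizer, model)

# Stand-in talk embedding; a real one comes from the Query Talk Encoder,
# loaded with the same AutoModel/AutoTokenizer pattern as above.
talk_embedding = paper_embeddings[0]
scores = paper_embeddings @ talk_embedding  # dot-product similarity; higher = more relevant
print(scores)
```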

🧩 Model Overview

| Property | Description |
|---|---|
| Architecture | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| Pooling | Mean pooling |
| Max sequence length | 512 tokens |
| Parameters | ≈ 22.7 M (F32, Safetensors) |
| Training data | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| Objective | Contrastive binary (DPR-style) loss |
| Task | Encode cited papers into a shared semantic space with talk transcripts |

🧠 Input Features

| Feature | Description |
|---|---|
| Title | Title of the cited paper |
| Abstract | Abstract text content |
| Year | Publication year |

These inputs are short enough to fit within the model’s 512-token limit — no chunking required.
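
A small helper (hypothetical; not part of the released code) that assembles these three fields into the key-text format used in the usage example above:

```python
def build_key_text(title: str, year: int, abstract: str) -> str:
    """Format a cited paper's fields in the Talk2Ref key-text layout."""
    return f"Title: {title}. Published in {year}. Abstract: {abstract}"

print(build_key_text("Attention Is All You Need", 2017,
                     "The Transformer model replaces recurrence with ..."))
```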


🧮 Training Setup

The cited-paper encoder was trained jointly with the query-talk encoder under a dual-encoder contrastive framework inspired by Dense Passage Retrieval (Karpukhin et al., 2020).

Each talk $T_i$ and paper $R_j$ is encoded into embeddings $f_T(T_i)$ and $f_R(R_j)$.
Their dot-product similarity $s_{ij} = f_T(T_i) \cdot f_R(R_j)$ is optimized using a sigmoid-based binary loss that supports multiple positives per query:

$$
\mathcal{L} = - \sum_{i,j} \Big[\, y_{ij} \log \sigma(s_{ij}) + (1 - y_{ij}) \log\big(1 - \sigma(s_{ij})\big) \,\Big]
$$

Negatives are sampled in-batch from other talk–paper pairs.
Before training, a domain adaptation stage aligned each talk with its own paper’s abstract to adapt to scientific and spoken-language data.
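
As a sketch, this objective corresponds to binary cross-entropy over the in-batch score matrix. The function below assumes pre-computed embeddings and a binary label matrix; names and shapes are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def dpr_sigmoid_loss(talk_emb: torch.Tensor,
                     paper_emb: torch.Tensor,
                     labels: torch.Tensor) -> torch.Tensor:
    """Sigmoid-based binary contrastive loss with in-batch negatives.

    talk_emb:  (B_t, d) embeddings f_T(T_i)
    paper_emb: (B_p, d) embeddings f_R(R_j)
    labels:    (B_t, B_p) binary matrix; 1 where talk i cites paper j
               (multiple positives per talk are allowed)
    """
    scores = talk_emb @ paper_emb.T  # s_ij = f_T(T_i) · f_R(R_j)
    # PyTorch averages over pairs by default; use reduction="sum"
    # to match the summed form of the loss above.
    return F.binary_cross_entropy_with_logits(scores, labels.float())
```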


Citation

If you use this model or the Talk2Ref dataset, please cite the following paper:

```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title        = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author       = {Frederik Broy and Maike Züfle and Jan Niehues},
  year         = {2025},
  eprint       = {2510.24478},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2510.24478}
}
```