📚 Talk2Ref Cited Paper Encoder

This model encodes scientific papers (titles, abstracts, and publication years) into dense embeddings for Reference Prediction from Talks (RPT) within the Talk2Ref framework.
It serves as the key-side encoder in a dual-encoder (DPR-style) retrieval setup, paired with the Talk2Ref Query Talk Encoder.



🎯 Usage

Example with transformers:

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "s8frbroy/talk2ref_ref_key_cited_paper_encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Example input
title = "Attention Is All You Need"
year = 2017
abstract = "The Transformer model replaces recurrence with attention mechanisms for ..."

# Build input in Talk2Ref format
key_text = f"Title: {title}. Published in {year}. Abstract: {abstract}"

# Tokenize and compute token embeddings
inputs = tokenizer([key_text], padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state

# Mean pooling over non-padding tokens (see Model Overview below)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (token_embeddings * mask).sum(1) / mask.sum(1)

print(embedding.shape)  # (1, hidden_dim)
```
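
For retrieval, candidate papers are typically encoded in batches and scored against a talk embedding by dot product (the DPR-style setup described under Training Setup below). A minimal sketch along those lines; the `encode` helper and batch size are illustrative, and in practice the talk embedding comes from the paired Query Talk Encoder rather than this model:

```python
def encode(texts, tokenizer, model, batch_size=32):
    """Mean-pooled embeddings for a list of key texts (illustrative helper)."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state
        mask = enc["attention_mask"].unsqueeze(-1).float()
        chunks.append((hidden * mask).sum(1) / mask.sum(1))
    return torch.cat(chunks)

paper_embeddings = encode([key_text], tokenizer, model)

# Stand-in talk embedding; a real one comes from the Query Talk Encoder,
# loaded with the same AutoModel/AutoTokenizer pattern as above.
talk_embedding = paper_embeddings[0]
scores = paper_embeddings @ talk_embedding  # dot-product similarity; higher = more relevant
print(scores)
```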

🧩 Model Overview

| Property | Description |
|---|---|
| Architecture | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| Pooling | Mean pooling |
| Max sequence length | 512 tokens |
| Parameters | ≈ 22.7 M (F32, Safetensors) |
| Training data | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| Objective | Contrastive binary (DPR-style) loss |
| Task | Encode cited papers into a shared semantic space with talk transcripts |

🧠 Input Features

| Feature | Description |
|---|---|
| Title | Title of the cited paper |
| Abstract | Abstract text content |
| Year | Publication year |

These inputs are short enough to fit within the model’s 512-token limit — no chunking required.
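
A small helper (hypothetical; not part of the released code) that assembles these three fields into the key-text format used in the usage example above:

```python
def build_key_text(title: str, year: int, abstract: str) -> str:
    """Format a cited paper's fields in the Talk2Ref key-text layout."""
    return f"Title: {title}. Published in {year}. Abstract: {abstract}"

print(build_key_text("Attention Is All You Need", 2017,
                     "The Transformer model replaces recurrence with ..."))
```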


🧮 Training Setup

The cited-paper encoder was trained jointly with the query-talk encoder under a dual-encoder contrastive framework inspired by Dense Passage Retrieval (Karpukhin et al., 2020).

Each talk $T_i$ and paper $R_j$ is encoded into embeddings $f_T(T_i)$ and $f_R(R_j)$.
Their dot-product similarity $s_{ij} = f_T(T_i) \cdot f_R(R_j)$ is optimized using a sigmoid-based binary loss that supports multiple positives per query:

$$
\mathcal{L} = - \sum_{i,j} \Big[\, y_{ij} \log \sigma(s_{ij}) + (1 - y_{ij}) \log\big(1 - \sigma(s_{ij})\big) \,\Big]
$$

Negatives are sampled in-batch from other talk–paper pairs.
Before training, a domain adaptation stage aligned each talk with its own paper’s abstract to adapt to scientific and spoken-language data.
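
As a sketch, this objective corresponds to binary cross-entropy over the in-batch score matrix. The function below assumes pre-computed embeddings and a binary label matrix; names and shapes are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def dpr_sigmoid_loss(talk_emb: torch.Tensor,
                     paper_emb: torch.Tensor,
                     labels: torch.Tensor) -> torch.Tensor:
    """Sigmoid-based binary contrastive loss with in-batch negatives.

    talk_emb:  (B_t, d) embeddings f_T(T_i)
    paper_emb: (B_p, d) embeddings f_R(R_j)
    labels:    (B_t, B_p) binary matrix; 1 where talk i cites paper j
               (multiple positives per talk are allowed)
    """
    scores = talk_emb @ paper_emb.T  # s_ij = f_T(T_i) · f_R(R_j)
    # PyTorch averages over pairs by default; use reduction="sum"
    # to match the summed form of the loss above.
    return F.binary_cross_entropy_with_logits(scores, labels.float())
```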


Citation

If you use this model or the Talk2Ref dataset, please cite the following paper:

```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title        = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author       = {Frederik Broy and Maike Züfle and Jan Niehues},
  year         = {2025},
  eprint       = {2510.24478},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2510.24478}
}
```