SemCSE-Multi-Invasion-Biology Model Card

SemCSE-multi is a multifaceted embedding model that predicts multiple aspect-specific embeddings for a given scientific text. This version of the model targets the domain of invasion biology. It encodes six aspects: Hypothesis, Species, Ecosystem, Research Question, Methodology, and Recommendation.

Each aspect-specific embedding can be used to evaluate the similarity of two studies with regard to that aspect in isolation. For details, please see our paper.

Model Details

Model Description

Developed by: CLAUSE group at Bielefeld University
Model type: DeBERTa
Languages: English
Finetuned from model: KISTI-AI/Scideberta-full with additional projection heads

Model Sources

Repository: github.com/inas-argumentation/SemCSE-Multi
Paper: https://arxiv.org/abs/2510.11599

How to Get Started with the Model

A minimal example of how to create embeddings with our model:

import torch
from transformers import AutoTokenizer, AutoModel

# Invasion biology model
model = AutoModel.from_pretrained("CLAUSE-Bielefeld/SemCSE-Multi-Invasion-Biology", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("CLAUSE-Bielefeld/SemCSE-Multi-Invasion-Biology")

text = "This is a scientific abstract from the domain of invasion biology."
batch = tokenizer([text], return_tensors="pt")

# Get the embedding for the "species" aspect. Other options are:
# "hypothesis", "ecosystem", "researchquestion", "methodology" and "recommendation".
with torch.no_grad():  # inference only, so gradients are not needed
    output = model(**batch)["species"]

# The resulting embeddings can be compared using cosine similarity.
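
As a rough sketch of the comparison step, the snippet below computes cosine similarity between two studies, one aspect at a time. It assumes the embeddings have already been converted to plain Python lists of floats (for example with `.tolist()`, if the model output is a tensor with one row per input text). The helper names `cosine_similarity` and `aspect_similarities` are illustrative and not part of the model's API, and the toy vectors are made up.

```python
import math

# Aspect keys as listed in the example above.
ASPECTS = ["hypothesis", "species", "ecosystem",
           "researchquestion", "methodology", "recommendation"]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def aspect_similarities(study_a, study_b):
    """Compare two studies aspect by aspect.

    Each study is a dict mapping an aspect name to its embedding vector.
    Only aspects present in both dicts are compared.
    """
    return {
        aspect: cosine_similarity(study_a[aspect], study_b[aspect])
        for aspect in ASPECTS
        if aspect in study_a and aspect in study_b
    }


# Toy 3-dimensional vectors stand in for real model outputs (hypothetical data).
study_a = {"species": [1.0, 0.0, 0.0], "ecosystem": [0.0, 1.0, 0.0]}
study_b = {"species": [1.0, 0.0, 0.0], "ecosystem": [1.0, 0.0, 0.0]}

scores = aspect_similarities(study_a, study_b)
print(scores)
```

In this toy case the two studies match on the "species" aspect and differ on "ecosystem", which is the kind of distinction a single combined embedding would blur.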

Training Details

This model was trained on a dataset of summaries for approximately 37,000 scientific abstracts from the domain of invasion biology. We used a contrastive loss to encourage summaries of the same abstract to be placed close together in the embedding space. This is done for each aspect separately, and the individual models are then distilled into a single, multifaceted embedding model. The dataset and exact training procedure can be found in our GitHub repo.
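
To make the contrastive objective more concrete, here is a minimal sketch of an InfoNCE-style loss on a toy batch. It is a generic formulation chosen for illustration and is not necessarily the loss used to train this model; the exact procedure is in the GitHub repo. The function name `info_nce_loss`, the `temperature` value, and the toy vectors are all assumptions.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Generic InfoNCE loss (illustrative; not taken from the SemCSE-Multi code).

    anchors[i] and positives[i] are embeddings of two summaries of the same
    abstract; positives[j] with j != i act as in-batch negatives.
    """
    if len(anchors) != len(positives) or not anchors:
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [_cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching summary (index i) as the target.
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy batch (hypothetical data): in `aligned`, matching pairs point in
# similar directions; in `swapped`, the pairs are deliberately mismatched.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when summaries of the same abstract are already close and large when they are closer to summaries of other abstracts, which is the pressure that pulls same-abstract summaries together during training.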

Evaluation

Our model achieves state-of-the-art results on precise, aspect-specific similarity assessment. The evaluations are reported in our paper.

Citation

BibTeX:

@misc{brinner2025semcsemultimultifaceteddecodableembeddings,
      title={SemCSE-Multi: Multifaceted and Decodable Embeddings for Aspect-Specific and Interpretable Scientific Domain Mapping}, 
      author={Marc Brinner and Sina Zarrieß},
      year={2025},
      eprint={2510.11599},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.11599}, 
}

Model size: 0.2B params
Tensor type: F32 (Safetensors)
