SemCSE-Multi-Invasion-Biology Model Card

SemCSE-multi is a multifaceted embedding model that predicts multiple aspect-specific embeddings for a given scientific text. This version of the model targets the domain of invasion biology. It encodes six aspects: Hypothesis, Species, Ecosystem, Research Question, Methodology, and Recommendation.

Each aspect-specific embedding can be used to evaluate the similarity of two studies with regard to that aspect in isolation. For details, please see our paper.

Model Details

Model Description

  • Developed by: CLAUSE group at Bielefeld University
  • Model type: DeBERTa
  • Languages: English
  • Finetuned from model: KISTI-AI/Scideberta-full with additional projection heads

Model Sources

How to Get Started with the Model

A minimal example of how to create embeddings with our model:

from transformers import AutoTokenizer, AutoModel

# Invasion biology model
model = AutoModel.from_pretrained("CLAUSE-Bielefeld/SemCSE-Multi-Invasion-Biology", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("CLAUSE-Bielefeld/SemCSE-Multi-Invasion-Biology")

text = "This is a scientific abstract from the domain of invasion biology."
batch = tokenizer([text], return_tensors='pt')

# Get the embedding for the "species" aspect. Other options are: "hypothesis", "ecosystem", "researchquestion", "methodology" and "recommendation".
output = model(**batch)["species"]

# The resulting embeddings can be used for similarity assessments using cosine similarity.
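
As a rough illustration of the similarity assessment mentioned above, the following sketch ranks a few candidate studies by cosine similarity to a query study for a single aspect. It makes some assumptions that the model card does not state: that each aspect embedding can be converted to a flat list of floats (for example via `output[0].tolist()`), and that the toy vectors below stand in for real "species" embeddings. The helper names `cosine_similarity` and `rank_by_aspect` are hypothetical and not part of the model's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors given as sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_aspect(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for "species" embeddings (assumption: real embeddings would
# come from the model, one vector per abstract).
query = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_by_aspect(query, candidates)
print(ranking)
```

When working directly with PyTorch tensors, `torch.nn.functional.cosine_similarity` computes the same quantity without the conversion to lists.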

Training Details

This model was trained on a dataset of summaries for approximately 37,000 scientific abstracts from the domain of invasion biology. We used a contrastive loss to encourage summaries of the same abstract to be placed nearby in the embedding space. This is done for each aspect separately, and the individual models are then distilled into a single, multifaceted embedding model. The dataset and exact training procedure can be found in our GitHub repo.
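
To make the contrastive objective more concrete, here is a minimal sketch of one common formulation, an InfoNCE-style loss with in-batch negatives. This is an assumption for illustration only: the model card does not specify the exact loss, and the function name `info_nce_loss` and the temperature value are hypothetical. The paper and GitHub repo remain the authoritative description. In the sketch, `anchors[i]` and `positives[i]` are embeddings of two summaries of the same abstract, and all other entries of `positives` act as negatives.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss: anchors[i] should be closest to positives[i].

    The temperature value is a hypothetical default, not taken from the paper.
    """
    if not anchors or len(anchors) != len(positives):
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [_cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: aligned pairs should give a much lower loss than swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

Minimizing a loss of this kind pulls summaries of the same abstract together and pushes summaries of different abstracts apart, which matches the behavior described above.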

Evaluation

Our model achieves state-of-the-art scores on precise, aspect-specific similarity assessments. The evaluations are reported in our paper.

Citation

BibTeX:

@misc{brinner2025semcsemultimultifaceteddecodableembeddings,
      title={SemCSE-Multi: Multifaceted and Decodable Embeddings for Aspect-Specific and Interpretable Scientific Domain Mapping}, 
      author={Marc Brinner and Sina Zarrieß},
      year={2025},
      eprint={2510.11599},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.11599}, 
}