GIRCSE: Generative Iterative Refinement for Contrastive Sentence Embeddings

GIRCSE is a generative embedding framework that turns Large Language Models (LLMs) into text encoders by using their autoregressive generation ability. Unlike traditional encoder-only embedding models, GIRCSE generates a sequence of "soft refinement tokens" and uses an Iterative Contrastive Refinement (ICR) objective to progressively distill semantics into high-quality embeddings.

Model Details

Model Description

GIRCSE addresses the limitations of static LLM-based embeddings by treating the representation learning process as an iterative refinement task.

Key Innovation: Instead of a single forward pass, the model generates $k$ auxiliary soft tokens. These tokens capture latent concepts and implicit semantics (e.g., task-specific instructions) that are often missed by standard pooling methods.
Iterative Contrastive Refinement (ICR): A stepwise training objective that applies a contrastive loss at every generation step, so that each additional generated token is encouraged to improve the embedding quality.
Test-time Scaling: An emergent property where generating more tokens at inference time (e.g., 5 to 20 tokens) leads to better performance on downstream tasks, analogous to "Chain-of-Thought" for embeddings.
Developed by: Yu-Che (Roy) Tsai, et al.
Model type: Generative Text Embedding (based on Decoder-only LLM)
Language(s) (NLP): English
License: Apache 2.0
Finetuned from model: Qwen/Qwen2.5-7B
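
The generation loop described above can be illustrated with a toy, dependency-free sketch. It is not the authors' implementation. It assumes that each soft token is the probability-weighted average of the vocabulary embeddings and that the final embedding is the normalized mean of the per-step hidden states; neither detail is stated in this card. The function `toy_decoder_step` and all weights are hypothetical stand-ins for the LLM.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def mean_vectors(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def toy_decoder_step(sequence, weights):
    # Hypothetical stand-in for one LLM forward pass: summarize the running
    # sequence (input tokens plus soft tokens so far) into one hidden state.
    return [math.tanh(x) for x in matvec(weights, mean_vectors(sequence))]


def generative_embed(input_vectors, vocab_embeddings, weights, k):
    """Generate k soft tokens and pool the per-step hidden states."""
    sequence = list(input_vectors)
    hidden_states = []
    for _ in range(k):
        hidden = toy_decoder_step(sequence, weights)
        # Next-token distribution over the vocabulary.
        probs = softmax(matvec(vocab_embeddings, hidden))
        # Assumed soft token: expected vocabulary embedding under probs,
        # instead of sampling one discrete token.
        soft_token = [
            sum(p * row[j] for p, row in zip(probs, vocab_embeddings))
            for j in range(len(hidden))
        ]
        sequence.append(soft_token)
        hidden_states.append(hidden)
    return l2_normalize(mean_vectors(hidden_states))


rng = random.Random(0)
dim, vocab_size = 8, 16
vocab_embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
input_vectors = [vocab_embeddings[i] for i in (3, 7, 1)]

# The number of refinement steps k is a free parameter at inference time.
for k in (1, 5, 20):
    embedding = generative_embed(input_vectors, vocab_embeddings, weights, k)
    print(k, [round(x, 3) for x in embedding[:3]])
```

Because the weights here are random, more steps do not make the toy embedding any better; the sketch only shows the control flow in which `k` can be chosen freely at inference time.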

Model Sources

Repository: Roytsai27/GIRCSE
Paper: Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement
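
Since the checkpoint is a LoRA adapter fine-tuned from Qwen/Qwen2.5-7B, attaching it to the base model with PEFT might look like the sketch below. This is an assumption based on the standard PEFT loading path, not a documented usage from the authors, and the helper name `load_gircse_backbone` is hypothetical. It only restores the adapted backbone; the soft-token generation and pooling logic lives in the repository listed above.

```python
BASE_MODEL_ID = "Qwen/Qwen2.5-7B"
ADAPTER_ID = "Roytsai27/GIRCSE-QWEN7B"


def load_gircse_backbone(base_model_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Load the base LLM and attach the LoRA adapter (hypothetical helper)."""
    # Imported lazily so the heavy dependencies are only required on call.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto")
    # Assumes the adapter repository follows the standard PEFT layout.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()
    return tokenizer, model
```

Calling `load_gircse_backbone()` downloads the full 7B base model, so it is best run on a machine with enough disk space and memory.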

Training Details

Training Data

The model was trained on a curated mix of contrastive datasets (e.g., MS-MARCO, NLI) totaling approximately 200K samples.

Training Procedure

Method: LoRA (Low-Rank Adaptation)
Objective: Iterative Contrastive Refinement (ICR) with Stepwise Contrastive Loss.
Steps: 5 refinement steps were used during training.
Framework: PEFT 0.15.2 + Transformers.
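
As a rough illustration of a stepwise contrastive loss, the sketch below computes an InfoNCE loss with in-batch negatives at every refinement step and averages the results. The exact ICR formulation is defined in the paper; the averaging scheme, the cosine similarity, the temperature value, and the function names here are assumptions made for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, positives, temperature=0.05):
    """InfoNCE with in-batch negatives: positives[i] is the match for queries[i]."""
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, p) / temperature for p in positives]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def stepwise_contrastive_loss(query_steps, positive_steps, temperature=0.05):
    """Average the contrastive loss over all refinement steps.

    query_steps[t] and positive_steps[t] hold the batch embeddings obtained
    after t + 1 generated tokens.
    """
    per_step = [
        info_nce(q, p, temperature) for q, p in zip(query_steps, positive_steps)
    ]
    return sum(per_step) / len(per_step), per_step


# Toy batch of two pairs over two refinement steps (made-up numbers).
query_steps = [
    [[1.0, 0.2], [0.9, 0.4]],  # after step 1: pairs are poorly separated
    [[1.0, 0.0], [0.0, 1.0]],  # after step 2: pairs are well separated
]
positive_steps = [
    [[1.0, 0.3], [0.8, 0.5]],
    [[1.0, 0.1], [0.1, 1.0]],
]

total, per_step = stepwise_contrastive_loss(query_steps, positive_steps)
print("per-step losses:", per_step)
print("averaged loss:", total)
```

Supervising every step, rather than only the last one, is what gives intermediate embeddings a training signal, which is consistent with the card's claim that embeddings remain usable at different generation lengths.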

Citation

If you find this work helpful, please cite:

BibTeX:

@article{tsai2025gircse,
  title={Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement},
  author={Tsai, Yu-Che and others},
  journal={arXiv preprint arXiv:2509.24291},
  year={2025}
}

Contact

For questions, please open an issue in the GitHub repository (Roytsai27/GIRCSE).

Paper for Roytsai27/GIRCSE-QWEN7B: Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement (arXiv 2509.24291, published Sep 29, 2025)