SentenceTransformer based on BAAI/bge-m3

This is a sentence-transformers model finetuned from BAAI/bge-m3. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: BAAI/bge-m3
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 1024 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
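
The architecture above ends with CLS-token pooling followed by a Normalize() step. As a rough illustration of what those two stages do to the transformer's per-token outputs, here is a minimal pure-Python sketch. The helper names `cls_pool` and `l2_normalize` and the 4-dimensional toy vectors are assumptions for illustration only; the real model works on 1024-dimensional tensors inside the library.

```python
import math

def cls_pool(token_embeddings):
    """Use the first token's vector (the CLS position) as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

# Toy token embeddings: 3 tokens, 4 dimensions (hypothetical values).
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # CLS token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.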

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("aaa961/finetuned-bge-m3-base-en")
# Run inference
sentences = [
    'Shell integration: bash and zsh don\'t serialize \\n and ; characters Part of https://github.com/microsoft/vscode/issues/155639\r\n\r\nRepro:\r\n\r\n1. Open a bash or zsh session\r\n2. Run:\r\n  ```sh\r\n  echo "a\r\n  … b"\r\n  ```\r\n  ![image](https://github.com/microsoft/vscode/assets/2193314/95ce755b-cec5-4da1-8791-75923a088efa)\r\n3. ctrl+alt+r to run recent command, select the last command, 🐛 it\'s run without the new line\r\n  ![image](https://github.com/microsoft/vscode/assets/2193314/98e81851-3a68-422c-94e8-e3e4bb21ef53)\r\n',
    'TreeView state out of sync Testing #117304\r\n\r\nRepro: Not Sure\r\n\r\nTest state shows passed in file but still running in tree view.\r\n\r\n![image](https://user-images.githubusercontent.com/6561887/108932683-a082c600-75fe-11eb-8ccd-afa5d0e2a91e.png)\r\n',
    'Setting icon and color in createTerminal API no longer works correctly See https://github.com/fabiospampinato/vscode-terminals/issues/77\r\n\r\nLooks like the default tab color/icon change probably regressed this.\r\n\r\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4264, 0.4315],
#         [0.4264, 1.0000, 0.4278],
#         [0.4315, 0.4278, 1.0000]])
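
The similarity matrix above can also be used to rank candidate texts against a query, for example to look for likely duplicate issues. The sketch below shows the underlying arithmetic in plain Python. The function `rank_by_similarity` and the 2-dimensional vectors are hypothetical stand-ins for real embeddings returned by `model.encode`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar to the query."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for embeddings (hypothetical values).
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

ranking = rank_by_similarity(query, candidates)
for index, score in ranking:
    print(index, round(score, 4))
```

Cosine similarity ignores vector length, so the candidate `[2.0, 0.0]` ranks first even though its magnitude differs from the query's.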

Evaluation

Metrics

Triplet

Dataset: bge-base-en-train
Evaluated with TripletEvaluator

Metric	Value
cosine_accuracy	1.0

Triplet

Dataset: bge-base-en-train
Evaluated with TripletEvaluator

Metric	Value
cosine_accuracy	0.9524
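
For context on the metric: a triplet accuracy of this kind is usually the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative. The following is a minimal sketch of that calculation under this assumption, using made-up 2-dimensional vectors. It is not the TripletEvaluator code itself, and `triplet_cosine_accuracy` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_cosine_accuracy(triplets):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)
    )
    return correct / len(triplets)

# Hypothetical (anchor, positive, negative) triplets.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer: counted as correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # positive is closer: counted as correct
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # negative is closer: counted as incorrect
    ([1.0, 0.0], [1.0, 0.2], [-1.0, 0.0]),  # positive is closer: counted as correct
]

accuracy = triplet_cosine_accuracy(triplets)
print(accuracy)
```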

Training Details

Training Dataset

Unnamed Dataset

Size: 336 training samples
Columns: texts and label

Approximate statistics based on the first 336 samples:

	texts	label
type	string	int
details	min: 12 tokens mean: 340.9 tokens max: 996 tokens	0: ~1.19% 5: ~0.60% 7: ~0.60% 8: ~0.60% 9: ~0.60% 10: ~0.60% 12: ~0.60% 13: ~0.60% 15: ~0.60% 16: ~0.60% 17: ~0.60% 19: ~0.60% 20: ~0.60% 21: ~0.60% 25: ~0.60% 26: ~0.60% 27: ~0.60% 28: ~0.60% 29: ~0.89% 30: ~0.89% 31: ~2.08% 33: ~1.49% 34: ~0.60% 36: ~0.60% 37: ~0.89% 38: ~0.60% 42: ~0.89% 43: ~1.19% 45: ~0.60% 47: ~0.60% 48: ~0.60% 49: ~0.60% 50: ~0.60% 51: ~0.60% 52: ~1.19% 53: ~0.60% 55: ~0.60% 57: ~1.19% 59: ~0.60% 60: ~0.60% 61: ~0.60% 62: ~0.60% 63: ~0.60% 64: ~0.89% 65: ~0.60% 67: ~0.60% 68: ~0.89% 69: ~0.60% 70: ~0.60% 71: ~0.60% 72: ~0.60% 73: ~0.60% 74: ~0.60% 75: ~0.60% 77: ~0.60% 78: ~0.60% 82: ~0.89% 84: ~0.60% 85: ~0.60% 86: ~0.60% 87: ~0.60% 88: ~0.60% 89: ~0.60% 90: ~0.60% 91: ~0.60% 92: ~0.89% 93: ~0.60% 95: ~0.60% 96: ~0.60% 97: ~0.60% 98: ~0.60% 99: ~0.60% 100: ~0.60% 101: ~0.60% 103: ~0.60% 105: ~0.60% 108: ~0.60% 109: ~0.60% 110: ~0.89% 113: ~0.60% 115: ~0.60% 116: ~0.60% 117: ~0.60% 118: ~0.60% 119: ~0.60% 120: ~1.49% 121: ~0.60% 126: ~0.60% 127: ~0.60% 130: ~0.60% 131: ~0.60% 134: ~0.60% 135: ~0.60% 136: ~0.60% 137: ~0.60% 138: ~0.60% 139: ~0.60% 140: ~0.60% 143: ~0.60% 144: ~0.60% 147: ~0.60% 148: ~0.60% 149: ~0.60% 150: ~0.60% 152: ~0.60% 153: ~0.60% 154: ~0.60% 155: ~0.60% 157: ~0.60% 158: ~0.60% 159: ~0.60% 160: ~0.60% 161: ~0.60% 163: ~0.60% 164: ~0.60% 165: ~0.60% 168: ~0.60% 170: ~0.60% 171: ~0.89% 174: ~0.89% 175: ~0.60% 176: ~0.60% 177: ~0.60% 178: ~0.60% 179: ~0.60% 180: ~0.60% 181: ~0.60% 182: ~0.60% 183: ~0.60% 185: ~0.60% 186: ~0.60% 187: ~0.60% 192: ~0.60% 193: ~0.60% 194: ~0.60% 195: ~0.89% 196: ~0.60% 197: ~0.60% 198: ~0.60% 199: ~0.60% 202: ~0.60% 204: ~0.60% 205: ~0.89% 207: ~0.60% 208: ~0.60% 209: ~0.60% 210: ~0.60% 213: ~0.60% 214: ~0.60% 215: ~0.60% 216: ~0.60% 218: ~0.60%

Samples:

texts	label
`Branch list is sometimes out of order`
Type: Bug

1. Open a workspace
2. Quickly open the branch picker and type `main`

Bug
The first time you do this, sometimes you end up with an unordered list:




The correct order shows up when you keep start typing or try doing this again:








VS Code version: Code - Insiders 1.91.0-insider (Universal) (0354163c1c66b950b0762364f5b4cd37937b624a, 2024-06-26T10:12:33.304Z)
OS version: Darwin arm64 23.5.0
Modes:


System Info

\|Item\|Value\|
\|---\|---\|
\|CPUs\|Apple M2 Max (12 x 2400)\|
\|GPU Status\|2d_canvas: unavailable_software canvas_oop_rasterization: disabled_off direct_rendering_display_compositor: disabled_off_ok gpu_compositing: disabled_software multiple_raster_threads: enabled_on ope...	`218`
`Git Branch Picker Race Condition If I paste the branch too quickly and then press enter, it does not switch to it, but creates a new branch.`
This breaks muscle memory, as it works when you do it slowly.




Once loading completes, it should select the branch again.	`218`
`links aren't discoverable to screen reader users in markdown documents They're only discoverable via visual distinction and the action that can be taken (IE opening them) is only indicated in the tooltip AFAICT.`

https://github.com/microsoft/vscode/assets/29464607/09d28b81-c2cc-4477-b1fc-7b1de1baae74

	`177`

Loss: BatchSemiHardTripletLoss
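
To give an intuition for this loss, the sketch below implements a simplified semi-hard mining rule in plain Python: for each (anchor, positive) pair with the same label, it picks the closest negative that is still farther away than the positive, and falls back to the farthest negative when none exists. This is an illustrative approximation, not the library's batched implementation. The function name, the margin of 1.0, the Euclidean distance, and the toy embeddings are all assumptions; only the labels 218 and 177 are borrowed from the samples above.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def semi_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Average hinge loss over (anchor, positive) pairs with semi-hard negatives."""
    losses = []
    n = len(embeddings)
    for a in range(n):
        # Distances from the anchor to every sample with a different label.
        neg_dists = [
            euclidean(embeddings[a], embeddings[k])
            for k in range(n)
            if labels[k] != labels[a]
        ]
        if not neg_dists:
            continue
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = euclidean(embeddings[a], embeddings[p])
            # Semi-hard: negatives farther than the positive; take the closest one.
            farther = [d for d in neg_dists if d > d_ap]
            d_an = min(farther) if farther else max(neg_dists)
            losses.append(max(d_ap - d_an + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch: two samples per label (hypothetical embeddings).
embeddings = [[0.0, 0.0], [1.0, 0.0], [1.5, 0.0], [4.0, 0.0]]
labels = [218, 218, 177, 177]

loss = semi_hard_triplet_loss(embeddings, labels, margin=1.0)
print(loss)
```

Note that a loss of this kind needs at least two samples per label in a batch to form any (anchor, positive) pair, which is worth keeping in mind given how sparse the label distribution above is.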

Evaluation Dataset

Unnamed Dataset

Size: 70 evaluation samples
Columns: texts and label

Approximate statistics based on the first 70 samples:

	texts	label
type	string	int
details	min: 58 tokens mean: 303.57 tokens max: 864 tokens	1: ~2.86% 2: ~2.86% 6: ~2.86% 11: ~5.71% 14: ~2.86% 23: ~2.86% 32: ~5.71% 35: ~2.86% 39: ~2.86% 40: ~2.86% 46: ~2.86% 54: ~2.86% 83: ~2.86% 102: ~2.86% 104: ~4.29% 111: ~2.86% 122: ~2.86% 123: ~2.86% 125: ~2.86% 145: ~2.86% 146: ~2.86% 162: ~2.86% 166: ~2.86% 169: ~2.86% 184: ~2.86% 188: ~2.86% 190: ~2.86% 200: ~2.86% 201: ~4.29% 203: ~2.86% 206: ~2.86% 217: ~2.86%

Samples:

texts	label
`Ctrl+I stopped working after first hold+talk+release Testing #213355`

Screencast shows that it seems to be in the wrong context and is trying to stop the session?



Repro was just asking "Testing testing" and then trying to ask something else	`217`
`Ctrl + I does not work when chat input field has focus Testing #213355`

`Ctrl + I` works in the editor and when I hold it, I get into speech mode. But when the chat input field (panel or inline chat) already has focus, `Ctrl + I` does not work.

(Connected to Windows through Remote Desktop in case that matters.)
	`217`
`Terminal renaming not functioning as expected in editor area`






Does this issue occur when all extensions are disabled?: Yes

Safetensors

Model size: 0.6B params
Tensor type: F32
