---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:Distillation
base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
pipeline_tag: sentence-similarity
library_name: PyLate
language: en
license: apache-2.0
---

# BiomedBERT ColBERT

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext). It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
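
MaxSim scores a query-document pair by taking, for each query token vector, the maximum similarity over all document token vectors, then summing across query tokens. A minimal NumPy sketch of the operator (shapes and names here are illustrative, not part of the model API):

```python
import numpy as np

def maxsim(query_embeddings, document_embeddings):
    """MaxSim late-interaction score between two token embedding matrices."""
    # Normalize rows so dot products are cosine similarities
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    d = document_embeddings / np.linalg.norm(document_embeddings, axis=1, keepdims=True)

    # (num_query_tokens, num_doc_tokens) similarity matrix
    sims = q @ d.T

    # Best-matching document token per query token, summed over query tokens
    return sims.max(axis=1).sum()

query = np.random.rand(8, 128)     # 8 query tokens, 128-dim vectors
document = np.random.rand(50, 128) # 50 document tokens
score = maxsim(query, document)
```

Because every query token is matched independently, MaxSim captures fine-grained term-level interactions that a single pooled vector cannot.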

## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

```python
import txtai

embeddings = txtai.Embeddings(
  path="neuml/biomedbert-base-colbert",
  content=True
)

# Index a collection, where documents() yields the records to index
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```

Late interaction models also excel as rerankers in retrieval pipelines.

```python
from txtai.pipeline import Reranker, Similarity

# Reuses the embeddings database created in the example above
similarity = Similarity(path="neuml/biomedbert-base-colbert", lateencode=True)
ranker = Reranker(embeddings, similarity)
ranker("query to run")
```

## Usage (PyLate)

Alternatively, the model can be loaded with [PyLate](https://github.com/lightonai/pylate).

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="neuml/biomedbert-base-colbert",
)

# Encode queries and documents separately
queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

# Rerank each candidate list against its query using MaxSim
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

## Evaluation Results

The performance of this model is compared to previously released models trained on medical literature. The most commonly used small embeddings model is also included for comparison.

The following datasets were used to evaluate model performance.
|
- [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
  - Split: test, Pair: (title, text)
- [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
  - Subset: pubmed, Split: validation, Pair: (article, abstract)
|
Evaluation results are shown below. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.
|
| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
| ----------------------------------------------------- | --------- | ------------- | -------------- | --------- |
| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
| [bioclinical-modernbert-base-embeddings](https://hf.co/neuml/bioclinical-modernbert-base-embeddings) | 92.49 | 97.10 | 97.04 | 95.54 |
| [**biomedbert-base-colbert**](https://hf.co/neuml/biomedbert-base-colbert) | **94.59** | **97.18** | **96.21** | **95.99** |
| [biomedbert-base-reranker](https://hf.co/neuml/biomedbert-base-reranker) | 97.66 | 99.76 | 98.81 | 98.74 |
| [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
| [pubmedbert-base-embeddings-8M](https://hf.co/neuml/pubmedbert-base-embeddings-8M) | 90.05 | 94.29 | 94.15 | 92.83 |
|
This is the best performing model we've released that's not a cross-encoder. With [MUVERA encoding](https://arxiv.org/abs/2405.19504), this model can be used to index large datasets for semantic search. It can also serve as a faster re-ranker than a cross-encoder model.
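
To illustrate the idea behind MUVERA, the sketch below builds a fixed dimensional encoding (FDE): token vectors are partitioned into buckets with random hyperplanes (SimHash), aggregated per bucket, and concatenated, so that a single inner product between query and document FDEs approximates the multi-vector MaxSim score. This is a simplified sketch under stated assumptions (no repetitions, projections or empty-bucket filling from the paper); all names and parameters are chosen here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K_SIM = 128, 3                     # token dimension, hyperplanes per partition
planes = rng.normal(size=(K_SIM, DIM))  # shared random hyperplanes (SimHash)

def bucket_ids(tokens):
    # Each token's bucket is the sign pattern of its hyperplane projections
    bits = (tokens @ planes.T) > 0
    return bits @ (1 << np.arange(K_SIM))

def fde(tokens, is_query):
    # Aggregate token vectors per bucket, then concatenate into one vector;
    # queries sum within a bucket, documents average (as in one MUVERA variant)
    out = np.zeros((2 ** K_SIM, DIM))
    ids = bucket_ids(tokens)
    for b in range(2 ** K_SIM):
        members = tokens[ids == b]
        if len(members):
            out[b] = members.sum(axis=0) if is_query else members.mean(axis=0)
    return out.ravel()  # single (2^K_SIM * DIM)-dimensional vector

q_fde = fde(rng.normal(size=(8, DIM)), is_query=True)
d_fde = fde(rng.normal(size=(50, DIM)), is_query=False)
approx_score = q_fde @ d_fde  # single dot product approximating MaxSim
```

Because the FDE is one flat vector, it can be stored in a standard dense vector index, turning multi-vector retrieval into ordinary single-vector search with optional exact MaxSim re-scoring of the top candidates.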
|
## Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 511, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity', 'use_residual': False})
)
```