# Vir2vec: A Genome-Wide Viral Embedding
## Model description
vir2vec is a viral genomic language model (gLM) that produces fixed-length, genome-level embeddings and can be fine-tuned for downstream tasks such as viral discrimination, host-range prediction, and variant typing. For more details and training scripts, see the GitHub repository.
## Intended use
vir2vec embeddings are intended for tasks including, but not limited to, the following (a downstream-classifier sketch is given after the list):
- Virus vs non-virus genome/read discrimination
- DNA vs RNA virus classification
- Host-range prediction
- Intra-genus separation (e.g., HIV-1 vs HIV-2)
- Variant/subtype typing (e.g., SARS-CoV-2 lineages)
- Phenotypic signal detection (e.g., tissue tropism proxies)
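As an illustration of the first task, here is a minimal sketch of a linear probe for virus vs non-virus discrimination on top of vir2vec embeddings. The `model` and `tokenizer` objects are assumed to be loaded as shown under "How to use" below; the sequences, labels, and the `embed` helper are hypothetical placeholders, not part of the released model:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def embed(seq):
    # Hypothetical helper: max-pooled last-hidden-state embedding,
    # following the recipe in "Compute embeddings" below.
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1].max(dim=1).values[0].numpy()

# Hypothetical toy data: 1 = virus, 0 = non-virus.
sequences = ["ACGTAGCATCGCGATGACTGCATCACT", "TTGACCGGTACGATCGATCGTAGCTAG"]
labels = [1, 0]

X = np.stack([embed(s) for s in sequences])             # [n_samples, hidden_dim]
clf = LogisticRegression(max_iter=1000).fit(X, labels)  # linear probe on frozen embeddings
print(clf.predict(X))
```

In practice the probe would be trained and evaluated on separate splits of labeled genomes; the point of the sketch is only that the fixed-length embeddings drop directly into standard classifiers.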
## Model sizes
- 422M
- 138M
- 17M
## How to use
### Load from Hugging Face
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pass revision="138M" or revision="17M" to change the model size; 422M is the default.
tokenizer = AutoTokenizer.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model.eval()
```
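For example, to load the 17M checkpoint explicitly, pass the revision to both the tokenizer and the model:

```python
tokenizer = AutoTokenizer.from_pretrained(
    "pabloarozarenad/Vir2vec", revision="17M", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "pabloarozarenad/Vir2vec", revision="17M", trust_remote_code=True
)
```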
### Compute embeddings
```python
dna = "ACGTAGCATCGCGATGACTGCATCACT"
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1]       # [1, seq_len, hidden_dim]
embedding = last_hidden.max(dim=1).values[0]  # [hidden_dim] (max pooling over the sequence)
print(embedding.shape)
```
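To embed several sequences at once, the same max pooling can be made padding-aware. This is a minimal sketch, assuming the tokenizer defines a pad token (if it does not, set `tokenizer.pad_token = tokenizer.eos_token` first); the `embed_batch` helper is hypothetical:

```python
def embed_batch(seqs):
    # Hypothetical helper: batch-embed sequences with mask-aware max pooling.
    inputs = tokenizer(seqs, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[-1]                        # [batch, seq_len, hidden_dim]
    mask = inputs["attention_mask"].unsqueeze(-1).bool()  # [batch, seq_len, 1]
    hidden = hidden.masked_fill(~mask, float("-inf"))     # exclude padding from the max
    return hidden.max(dim=1).values                       # [batch, hidden_dim]

embeddings = embed_batch(["ACGTAGCATCGCGATGACTGCATCACT", "TTGACC"])
print(embeddings.shape)
```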
## Access
Access to vir2vec is granted upon request. Please provide an institutional email address, a brief description of the intended use, and the associated IRB protocol number.
## Base model
- RaphaelMourad/Mistral-DNA-v1-422M-hg38