# Vir2vec: A Genome-Wide Viral Embedding
## Model description
vir2vec is a viral genomic language model (gLM) that produces fixed-length, genome-level embeddings and can be fine-tuned for downstream tasks such as viral discrimination, host-range prediction, and variant typing. For more details and training scripts, see the GitHub repository.
## Intended use
vir2vec embeddings are intended for tasks including, but not limited to, the following (a downstream-classifier sketch is given after the list):
- Virus vs non-virus genome/read discrimination
- DNA vs RNA virus classification
- Host-range prediction
- Intra-genus separation (e.g., HIV-1 vs HIV-2)
- Variant/subtype typing (e.g., SARS-CoV-2 lineages)
- Phenotypic signal detection (e.g., tissue tropism proxies)
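As an illustration of the first task, here is a minimal sketch of a linear probe for virus vs non-virus discrimination on top of vir2vec embeddings. The `model` and `tokenizer` objects are assumed to be loaded as shown under "How to use" below; the sequences, labels, and the `embed` helper are hypothetical placeholders, not part of the released model:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def embed(seq):
    # Hypothetical helper: max-pooled last-hidden-state embedding,
    # following the recipe in "Compute embeddings" below.
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1].max(dim=1).values[0].numpy()

# Hypothetical toy data: 1 = virus, 0 = non-virus.
sequences = ["ACGTAGCATCGCGATGACTGCATCACT", "TTGACCGGTACGATCGATCGTAGCTAG"]
labels = [1, 0]

X = np.stack([embed(s) for s in sequences])             # [n_samples, hidden_dim]
clf = LogisticRegression(max_iter=1000).fit(X, labels)  # linear probe on frozen embeddings
print(clf.predict(X))
```

In practice the probe would be trained and evaluated on separate splits of labeled genomes; the point of the sketch is only that the fixed-length embeddings drop directly into standard classifiers.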
## Model sizes
- 422M
- 138M
- 17M
## How to use
### Load from Hugging Face
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pass revision="138M" or revision="17M" to change the model size; 422M is the default.
tokenizer = AutoTokenizer.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("pabloarozarenad/Vir2vec", trust_remote_code=True)
model.eval()
```
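For example, to load the 17M checkpoint explicitly, pass the revision to both the tokenizer and the model:

```python
tokenizer = AutoTokenizer.from_pretrained(
    "pabloarozarenad/Vir2vec", revision="17M", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "pabloarozarenad/Vir2vec", revision="17M", trust_remote_code=True
)
```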
### Compute embeddings
```python
dna = "ACGTAGCATCGCGATGACTGCATCACT"
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

last_hidden = outputs.hidden_states[-1]       # [1, seq_len, hidden_dim]
embedding = last_hidden.max(dim=1).values[0]  # [hidden_dim] (max pooling over the sequence)
print(embedding.shape)
```
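To embed several sequences at once, the same max pooling can be made padding-aware. This is a minimal sketch, assuming the tokenizer defines a pad token (if it does not, set `tokenizer.pad_token = tokenizer.eos_token` first); the `embed_batch` helper is hypothetical:

```python
def embed_batch(seqs):
    # Hypothetical helper: batch-embed sequences with mask-aware max pooling.
    inputs = tokenizer(seqs, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[-1]                        # [batch, seq_len, hidden_dim]
    mask = inputs["attention_mask"].unsqueeze(-1).bool()  # [batch, seq_len, 1]
    hidden = hidden.masked_fill(~mask, float("-inf"))     # exclude padding from the max
    return hidden.max(dim=1).values                       # [batch, hidden_dim]

embeddings = embed_batch(["ACGTAGCATCGCGATGACTGCATCACT", "TTGACC"])
print(embeddings.shape)
```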
## Access
Access to vir2vec is granted upon request. Please provide an institutional email address, a brief description of the intended use, and the associated IRB protocol number.
## Base model
- RaphaelMourad/Mistral-DNA-v1-422M-hg38