---
license: apache-2.0
datasets:
- tahoebio/Tahoe-100M
tags:
- biology
- single-cell
- RNA
- chemistry
- tahoebio
- pytorch
---
# Tahoe-x1
Tahoe-x1 is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics. Pretrained on 266 million single-cell transcriptomic profiles including the _Tahoe-100M_ perturbation compendium, Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks with 3–30× higher compute efficiency than other cell-state models.
**Quick Links:**
- ✨ [**Blog Post**](https://www.tahoebio.ai/news/tahoe-x1-blog) - Read our announcement
- 📄 [**Preprint**](https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1) - Read our preprint on bioRxiv
- 💻 [**GitHub Repository**](https://github.com/tahoebio/tahoe-x1) - Access the code
- 🎮 [**Interactive Demo**](https://huggingface.co/spaces/tahoebio/tx1-demo) - Try the model with no code required!
## 🤖 Model Sizes
We provide pretrained weights for three model sizes:
- **Tx1-70M**: ~70M parameters
- **Tx1-1B**: ~1.3B parameters
- **Tx1-3B**: ~3B parameters
## 🚀 Quickstart
Load a model directly from Hugging Face and generate cell embeddings:
```python
from tahoe_x1.model import ComposerTX
import scanpy as sc
# Load model from Hugging Face in a single line
# Options: "70m", "1b", or "3b"
model, vocab, model_cfg, collator_cfg = ComposerTX.from_hf(
    repo_id="tahoebio/Tahoe-x1",
    model_size="3b"
)
# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")
# Generate embeddings (see tutorials for full example)
# Cell embeddings are stored in adata.obsm
```
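If a GPU is available, you can move the model over and switch it to inference mode before embedding. This is a minimal sketch that assumes the returned model behaves like a standard PyTorch module:
```python
import torch

# Assumption: ComposerTX.from_hf returns a standard PyTorch module, so the usual
# device and eval-mode calls apply. Skip the .to("cuda") move on CPU-only machines.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```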
### 📦 Installation
To use the models, install the `tahoex` package from GitHub:
```bash
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1
cd tahoe-x1
# Install using Docker (recommended) or uv
# See installation guide: https://github.com/tahoebio/tahoe-x1#installation
```
**Docker installation** provides the most reproducible environment and is the recommended setup. For a native installation, use `uv` or `pip` to manage dependencies.
Model checkpoints and configs are automatically downloaded from this Hugging Face repository when using `ComposerTX.from_hf()`. Training data is hosted publicly on S3 (s3://tahoe-hackathon-data) and will be downloaded as needed.
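If you want to inspect or pre-download the training data manually, the public bucket can be read with the AWS CLI. This sketch assumes the bucket allows anonymous access:
```bash
# Browse the public training-data bucket (assumes anonymous read access)
aws s3 ls s3://tahoe-hackathon-data/ --no-sign-request
```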
## 📚 Tutorials
Please refer to the tutorials in the [GitHub repository](https://github.com/tahoebio/tahoe-x1/tree/main/tutorials) for detailed examples:
| Tutorial | Description | Link |
|----------|-------------|------|
| **Clustering Tutorial** | Cell clustering and UMAP visualization with Tahoe-x1 embeddings | [clustering_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/clustering_tutorial.ipynb) |
| **Training Tutorial** | Step-by-step guide to training and fine-tuning Tahoe-x1 models | [training_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/training_tutorial.ipynb) |
## 🧬 Generating Cell and Gene Embeddings
### Using Configuration Files
1. Create a configuration file (see `scripts/inference/configs/predict.yaml` in the [GitHub repo](https://github.com/tahoebio/tahoe-x1)):
```yaml
# Key options (illustrative sketch; see scripts/inference/configs/predict.yaml
# in the repo for the full schema, including the input-data paths)
paths:
  hf_repo_id: tahoebio/Tahoe-x1      # Hugging Face repository
  hf_model_size: 3b                  # model size: 70m, 1b, or 3b
  adata_output: path/to/output.h5ad  # where to save AnnData output including embeddings
predict:
  return_gene_embeddings: true       # set to true to also extract gene embeddings
```
See the Clustering Tutorial for a full example of preparing input data and configuration files.
2. Run the embedding script:
```bash
python scripts/inference/predict_embeddings.py path/to/config.yaml
# Optional: override config values via command line
python scripts/inference/predict_embeddings.py path/to/config.yaml \
    --paths.model_name=tx --batch_size=128
```
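Once the script finishes, the resulting AnnData file can be loaded to check the stored embeddings. A minimal sketch, where the output path is whatever `paths.adata_output` points to in your config:
```python
import scanpy as sc

# Load the AnnData written by predict_embeddings.py and list the stored embeddings
adata = sc.read_h5ad("path/to/output.h5ad")  # matches paths.adata_output in your config
print(list(adata.obsm.keys()))
```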
### Advanced Usage
For memory-efficient gene embedding extraction over large datasets, use the lower-level API:
```python
from tahoex.tasks import get_batch_embeddings
cell_embs, gene_embs = get_batch_embeddings(
    adata=adata,
    model=model,
    vocab=vocab,
    model_cfg=model_cfg,
    collator_cfg=collator_cfg,
    return_gene_embeddings=True
)
```
Cell embeddings are saved to `adata.obsm` and gene embeddings to `adata.varm` (if `return_gene_embeddings=True`).
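From there, the embeddings drop into a standard scanpy workflow, as in the clustering tutorial. A minimal sketch, where the `obsm` key name is illustrative:
```python
import scanpy as sc

# If the helper has not already written the embeddings into adata.obsm, store them
# under an illustrative key, then run a standard neighbors -> UMAP -> Leiden pipeline
adata.obsm["X_tx1"] = cell_embs
sc.pp.neighbors(adata, use_rep="X_tx1")
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color="leiden")
```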
## 🏋️ Training and Fine-tuning
Tahoe-x1 models can be trained from scratch or fine-tuned on your own data. See the [GitHub repository](https://github.com/tahoebio/tahoe-x1) for detailed instructions.
### Quick Training Example
```bash
# Use the test configuration to train a small model on Tahoe-100M
composer scripts/train.py -f configs/test_config.yaml
# Fine-tune from a pretrained checkpoint
composer scripts/train.py -f configs/finetune_config.yaml \
    --load_path s3://tahoe-hackathon-data/MFM/ckpts/3b/
```
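Training is launched through the Composer CLI, which also handles multi-GPU runs. For example, assuming the standard Composer launcher flags:
```bash
# Launch on 8 GPUs with the Composer launcher (-n sets the number of processes)
composer -n 8 scripts/train.py -f configs/test_config.yaml
```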
For more details on training infrastructure, datasets, and benchmarks, please visit the [GitHub repository](https://github.com/tahoebio/tahoe-x1).
## 📊 Model Details
### Architecture
- **Base Architecture**: Transformer-based encoder pretrained with masked language modeling (MLM)
- **Training Data**: 266M single-cell profiles from CellxGene, scBaseCamp, and Tahoe-100M
- **Training Infrastructure**: Built on MosaicML Composer with FlashAttention, FSDP, and mixed precision
### Benchmarks
Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks:
- **DepMap Essentiality**: Predicting gene dependencies in cancer cell lines
- **MSigDB Hallmarks**: Recovering pathway memberships from gene embeddings
- **Cell-Type Classification**: Classifying cell types across multiple tissues
- **Perturbation Prediction**: Predicting transcriptional responses to perturbations
See the [preprint](https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1) for detailed benchmark results.
## 📄 Citation
If you use Tahoe-x1 in your research, please cite:
```bibtex
@article{gandhi2025tahoe,
title = {Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters},
author = {Gandhi, Shreshth and Javadi, Farnoosh and Svensson, Valentine and Khan, Umair and Jones, Matthew G. and Yu, Johnny and Merico, Daniele and Goodarzi, Hani and Alidoust, Nima},
journal = {bioRxiv},
year = {2025},
doi = {10.1101/2025.10.23.683759},
url = {https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1},
publisher = {Cold Spring Harbor Laboratory}
}
```
## 📜 License
Model weights and code are released under the Apache 2.0 license.
## 📧 Contact
For questions or collaboration inquiries, please:
- Open an issue on [GitHub](https://github.com/tahoebio/tahoe-x1) or [HuggingFace](https://huggingface.co/tahoebio/Tahoe-x1)
- Email us at [[email protected]](mailto:[email protected])