---
license: apache-2.0
datasets:
- tahoebio/Tahoe-100M
tags:
- biology
- single-cell
- RNA
- chemistry
- tahoebio
- pytorch
---
# Tahoe-x1
Tahoe-x1 is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics. Pretrained on 266 million single-cell transcriptomic profiles including the _Tahoe-100M_ perturbation compendium, Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks with 3–30× higher compute efficiency than other cell-state models.
**Quick Links:**
- ✨ [**Blog Post**](https://www.tahoebio.ai/news/tahoe-x1-blog) - Read our announcement
- 📄 [**Preprint**](https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1) - Read our preprint on bioRxiv
- 💻 [**GitHub Repository**](https://github.com/tahoebio/tahoe-x1) - Access the code
- 🎮 [**Interactive Demo**](https://huggingface.co/spaces/tahoebio/tx1-demo) - Try the model with no code required!
## 🤖 Model Sizes
We provide pretrained weights for three model sizes:
- **Tx1-70M**: ~70M parameters
- **Tx1-1B**: ~1.3B parameters
- **Tx1-3B**: ~3B parameters
## 🚀 Quickstart
Load a model directly from Hugging Face and generate cell embeddings:
```python
from tahoe_x1.model import ComposerTX
import scanpy as sc
# Load model from Hugging Face in a single line
# Options: "70m", "1b", or "3b"
model, vocab, model_cfg, collator_cfg = ComposerTX.from_hf(
    repo_id="tahoebio/Tahoe-x1",
    model_size="3b"
)
# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")
# Generate embeddings (see tutorials for full example)
# Cell embeddings are stored in adata.obsm
```
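If a GPU is available, you can move the model over and switch it to inference mode before embedding. This is a minimal sketch that assumes the returned model behaves like a standard PyTorch module:
```python
import torch

# Assumption: ComposerTX.from_hf returns a standard PyTorch module, so the usual
# device and eval-mode calls apply. Skip the .to("cuda") move on CPU-only machines.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```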
### 📦 Installation
To use the models, install the `tahoex` package from GitHub:
```bash
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1
cd tahoe-x1
# Install using Docker (recommended) or uv
# See installation guide: https://github.com/tahoebio/tahoe-x1#installation
```
**Docker installation** provides the most reproducible environment and is the recommended setup. For a native installation, use `uv` or `pip` to manage dependencies.
Model checkpoints and configs are automatically downloaded from this Hugging Face repository when using `ComposerTX.from_hf()`. Training data is hosted publicly on S3 (s3://tahoe-hackathon-data) and will be downloaded as needed.
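If you want to inspect or pre-download the training data manually, the public bucket can be read with the AWS CLI. This sketch assumes the bucket allows anonymous access:
```bash
# Browse the public training-data bucket (assumes anonymous read access)
aws s3 ls s3://tahoe-hackathon-data/ --no-sign-request
```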
## 📚 Tutorials
Please refer to the tutorials in the [GitHub repository](https://github.com/tahoebio/tahoe-x1/tree/main/tutorials) for detailed examples:
| Tutorial | Description | Link |
|----------|-------------|------|
| **Clustering Tutorial** | Cell clustering and UMAP visualization with Tahoe-x1 embeddings | [clustering_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/clustering_tutorial.ipynb) |
| **Training Tutorial** | Step-by-step guide to training and fine-tuning Tahoe-x1 models | [training_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/training_tutorial.ipynb) |
## 🧬 Generating Cell and Gene Embeddings
### Using Configuration Files
1. Create a configuration file (see `scripts/inference/configs/predict.yaml` in the [GitHub repo](https://github.com/tahoebio/tahoe-x1)):
```yaml
# Key options (illustrative sketch; see scripts/inference/configs/predict.yaml
# in the repo for the full schema, including the input-data paths)
paths:
  hf_repo_id: tahoebio/Tahoe-x1      # Hugging Face repository
  hf_model_size: 3b                  # model size: 70m, 1b, or 3b
  adata_output: path/to/output.h5ad  # where to save AnnData output including embeddings
predict:
  return_gene_embeddings: true       # set to true to also extract gene embeddings
```
See the Clustering Tutorial for a full example of preparing input data and configuration files.
2. Run the embedding script:
```bash
python scripts/inference/predict_embeddings.py path/to/config.yaml
# Optional: override config values via command line
python scripts/inference/predict_embeddings.py path/to/config.yaml \
    --paths.model_name=tx --batch_size=128
```
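Once the script finishes, the resulting AnnData file can be loaded to check the stored embeddings. A minimal sketch, where the output path is whatever `paths.adata_output` points to in your config:
```python
import scanpy as sc

# Load the AnnData written by predict_embeddings.py and list the stored embeddings
adata = sc.read_h5ad("path/to/output.h5ad")  # matches paths.adata_output in your config
print(list(adata.obsm.keys()))
```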
### Advanced Usage
For memory-efficient gene embedding extraction over large datasets, use the lower-level API:
```python
from tahoex.tasks import get_batch_embeddings
cell_embs, gene_embs = get_batch_embeddings(
    adata=adata,
    model=model,
    vocab=vocab,
    model_cfg=model_cfg,
    collator_cfg=collator_cfg,
    return_gene_embeddings=True
)
```
Cell embeddings are saved to `adata.obsm` and gene embeddings to `adata.varm` (if `return_gene_embeddings=True`).
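From there, the embeddings drop into a standard scanpy workflow, as in the clustering tutorial. A minimal sketch, where the `obsm` key name is illustrative:
```python
import scanpy as sc

# If the helper has not already written the embeddings into adata.obsm, store them
# under an illustrative key, then run a standard neighbors -> UMAP -> Leiden pipeline
adata.obsm["X_tx1"] = cell_embs
sc.pp.neighbors(adata, use_rep="X_tx1")
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color="leiden")
```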
## 🏋️ Training and Fine-tuning
Tahoe-x1 models can be trained from scratch or fine-tuned on your own data. See the [GitHub repository](https://github.com/tahoebio/tahoe-x1) for detailed instructions.
### Quick Training Example
```bash
# Use the test configuration to train a small model on Tahoe-100M
composer scripts/train.py -f configs/test_config.yaml
# Fine-tune from a pretrained checkpoint
composer scripts/train.py -f configs/finetune_config.yaml \
    --load_path s3://tahoe-hackathon-data/MFM/ckpts/3b/
```
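Training is launched through the Composer CLI, which also handles multi-GPU runs. For example, assuming the standard Composer launcher flags:
```bash
# Launch on 8 GPUs with the Composer launcher (-n sets the number of processes)
composer -n 8 scripts/train.py -f configs/test_config.yaml
```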
For more details on training infrastructure, datasets, and benchmarks, please visit the [GitHub repository](https://github.com/tahoebio/tahoe-x1).
## 📊 Model Details
### Architecture
- **Base Architecture**: Transformer-based encoder pretrained with masked language modeling (MLM)
- **Training Data**: 266M single-cell profiles from CellxGene, scBaseCamp, and Tahoe-100M
- **Training Infrastructure**: Built on MosaicML Composer with FlashAttention, FSDP, and mixed precision
### Benchmarks
Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks:
- **DepMap Essentiality**: Predicting gene dependencies in cancer cell lines
- **MSigDB Hallmarks**: Recovering pathway memberships from gene embeddings
- **Cell-Type Classification**: Classifying cell types across multiple tissues
- **Perturbation Prediction**: Predicting transcriptional responses to perturbations
See the [preprint](https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1) for detailed benchmark results.
## 📄 Citation
If you use Tahoe-x1 in your research, please cite:
```bibtex
@article{gandhi2025tahoe,
title = {Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters},
author = {Gandhi, Shreshth and Javadi, Farnoosh and Svensson, Valentine and Khan, Umair and Jones, Matthew G. and Yu, Johnny and Merico, Daniele and Goodarzi, Hani and Alidoust, Nima},
journal = {bioRxiv},
year = {2025},
doi = {10.1101/2025.10.23.683759},
url = {https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1},
publisher = {Cold Spring Harbor Laboratory}
}
```
## 📜 License
Model weights and code are released under the Apache 2.0 license.
## 📧 Contact
For questions or collaboration inquiries, please:
- Open an issue on [GitHub](https://github.com/tahoebio/tahoe-x1) or [HuggingFace](https://huggingface.co/tahoebio/Tahoe-x1)
- Email us at [[email protected]](mailto:[email protected])