---
license: apache-2.0
datasets:
  - tahoebio/Tahoe-100M
tags:
  - biology
  - single-cell
  - RNA
  - chemistry
  - tahoebio
  - pytorch
---

# Tahoe-x1

Tahoe-x1 is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics. Pretrained on 266 million single-cell transcriptomic profiles, including the _Tahoe-100M_ perturbation compendium, Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks with 3–30× higher compute efficiency than other cell-state models.

**Quick Links:**

- ✨ [**Blog Post**](https://www.tahoebio.ai/news/tahoe-x1-blog) - Read our announcement
- 📄 [**Preprint**](https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1) - Read our preprint on bioRxiv
- 💻 [**GitHub Repository**](https://github.com/tahoebio/tahoe-x1) - Access the code
- 🎮 [**Interactive Demo**](https://huggingface.co/spaces/tahoebio/tx1-demo) - Try the model with no code required!

## 🤖 Model Sizes

We provide pretrained weights for three model sizes:

- **Tx1-70M**: ~70M parameters
- **Tx1-1B**: ~1.3B parameters
- **Tx1-3B**: ~3B parameters

## 🚀 Quickstart

Load a model directly from Hugging Face and generate cell embeddings:

```python
from tahoe_x1.model import ComposerTX
import scanpy as sc

# Load model from Hugging Face in a single line
# Options: "70m", "1b", or "3b"
model, vocab, model_cfg, collator_cfg = ComposerTX.from_hf(
    repo_id="tahoebio/Tahoe-x1",
    model_size="3b",
)

# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")

# Generate embeddings (see tutorials for full example)
# Cell embeddings are stored in adata.obsm
```

### 📦 Installation

To use the models, install the `tahoex` package from GitHub:

```bash
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1
cd tahoe-x1

# Install using Docker (recommended) or uv
# See installation guide: https://github.com/tahoebio/tahoe-x1#installation
```

**Docker installation** provides better reproducibility and is recommended for the best experience. For native installation, use `uv` or `pip` for dependency management.

Model checkpoints and configs are automatically downloaded from this Hugging Face repository when using `ComposerTX.from_hf()`. Training data is hosted publicly on S3 (`s3://tahoe-hackathon-data`) and will be downloaded as needed.

## 📚 Tutorials

Please refer to the tutorials in the [GitHub repository](https://github.com/tahoebio/tahoe-x1/tree/main/tutorials) for detailed examples:

| Tutorial | Description | Link |
|----------|-------------|------|
| **Clustering Tutorial** | Cell clustering and UMAP visualization with Tahoe-x1 embeddings | [clustering_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/clustering_tutorial.ipynb) |
| **Training Tutorial** | Step-by-step guide to training and fine-tuning Tahoe-x1 models | [training_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/training_tutorial.ipynb) |

## 🧬 Generating Cell and Gene Embeddings

### Using Configuration Files

1. Create a configuration file (see `scripts/inference/configs/predict.yaml` in the [GitHub repo](https://github.com/tahoebio/tahoe-x1)):

   ```yaml
   # Key configuration options:
   # - paths.hf_repo_id: Hugging Face repository (tahoebio/Tahoe-x1)
   # - paths.hf_model_size: model size (70m, 1b, or 3b)
   # - paths.adata_output: where to save AnnData output including embeddings
   # - predict.return_gene_embeddings: True (for extracting gene embeddings)
   ```

   See the Clustering Tutorial for a full example of preparing input data and configuration files.

2. Run the embedding script:

   ```bash
   python scripts/inference/predict_embeddings.py path/to/config.yaml

   # Optional: override config values via command line
   python scripts/inference/predict_embeddings.py path/to/config.yaml \
       --paths.model_name=tx --batch_size=128
   ```

### Advanced Usage

For memory-efficient gene embedding extraction over large datasets, use the lower-level API:

```python
from tahoex.tasks import get_batch_embeddings

cell_embs, gene_embs = get_batch_embeddings(
    adata=adata,
    model=model,
    vocab=vocab,
    model_cfg=model_cfg,
    collator_cfg=collator_cfg,
    return_gene_embeddings=True,
)
```

Cell embeddings are saved to `adata.obsm` and gene embeddings to `adata.varm` (if `return_gene_embeddings=True`).

## 🏋️ Training and Fine-tuning

Tahoe-x1 models can be trained from scratch or fine-tuned on your own data. See the [GitHub repository](https://github.com/tahoebio/tahoe-x1) for detailed instructions.

### Quick Training Example

```bash
# Use the test configuration to train a small model on Tahoe-100M
composer scripts/train.py -f configs/test_config.yaml

# Fine-tune from a pretrained checkpoint
composer scripts/train.py -f configs/finetune_config.yaml \
    --load_path s3://tahoe-hackathon-data/MFM/ckpts/3b/
```

For more details on training infrastructure, datasets, and benchmarks, please visit the [GitHub repository](https://github.com/tahoebio/tahoe-x1).
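However they are produced, the extracted cell embeddings are plain matrices, so downstream sanity checks need nothing model-specific. Below is a minimal self-contained sketch of computing pairwise cosine similarities between cells, a common first step before building neighbor graphs and clustering; a random matrix stands in for real Tahoe-x1 embeddings, and the shapes are illustrative:

```python
import numpy as np


def cosine_similarity_matrix(embs: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows (cells)."""
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    unit = embs / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T


# Stand-in for embeddings that would be read from adata.obsm
rng = np.random.default_rng(0)
cell_embs = rng.normal(size=(100, 512))  # 100 cells, 512-dim embeddings

sim = cosine_similarity_matrix(cell_embs)
```

In practice, `cell_embs` would come from `adata.obsm` after running the embedding script, or directly from the `get_batch_embeddings` call shown above.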
## 📊 Model Details

### Architecture

- **Base Architecture**: Transformer-based encoder pretrained with masked language modeling (MLM)
- **Training Data**: 266M single-cell profiles from CellxGene, scBaseCamp, and Tahoe-100M
- **Training Infrastructure**: Built on MosaicML Composer with FlashAttention, FSDP, and mixed precision

### Benchmarks

Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks:

- **DepMap Essentiality**: Predicting gene dependencies in cancer cell lines
- **MSigDB Hallmarks**: Recovering pathway memberships from gene embeddings
- **Cell-Type Classification**: Classifying cell types across multiple tissues
- **Perturbation Prediction**: Predicting transcriptional responses to perturbations

See the [paper](http://www.tahoebio.ai/news/tahoe-x1) for detailed benchmark results.

## 📄 Citation

If you use Tahoe-x1 in your research, please cite:

```bibtex
@article{gandhi2025tahoe,
  title     = {Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters},
  author    = {Gandhi, Shreshth and Javadi, Farnoosh and Svensson, Valentine and Khan, Umair and Jones, Matthew G. and Yu, Johnny and Merico, Daniele and Goodarzi, Hani and Alidoust, Nima},
  journal   = {bioRxiv},
  year      = {2025},
  doi       = {10.1101/2025.10.23.683759},
  url       = {https://www.biorxiv.org/content/10.1101/2025.10.23.683759v1},
  publisher = {Cold Spring Harbor Laboratory}
}
```

## 📜 License

Model weights and code are released under the Apache 2.0 license.

## 📧 Contact

For questions or collaboration inquiries, please:

- Open an issue on [GitHub](https://github.com/tahoebio/tahoe-x1) or [HuggingFace](https://huggingface.co/tahoebio/Tahoe-x1)
- Email us at [admin@tahoebio.ai](mailto:admin@tahoebio.ai)