Shreshth Gandhi and Claude committed
Commit 341544f · 1 Parent(s): 402d670

Enhance model card with comprehensive documentation and visuals


- Add prominent quickstart section with ComposerTX.from_hf() usage
- Include abstract logo images with dark/light mode support
- Add detailed sections: tutorials, embeddings, training, benchmarks
- Expand model details with architecture and training info
- Include both Tahoe-x1 and Tahoe-100M citations
- Add contact information and improve overall structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

README.md CHANGED
@@ -10,61 +10,179 @@ tags:
  - tahoebio
  - pytorch
  ---
- # Tahoe-x1

- Tahoe-x1 is a family of perturbation-trained single-cell foundation models developed by Tahoe Therapeutics.
- For more details, see our blog post and [📄 preprint](http://www.tahoebio.ai/news/tahoe-x1).
- In this repository, we provide pretrained weights for three model sizes:

- - ~70M parameters (TahoeX1-70M)
- - ~1B parameters (TahoeX1-1B)
- - ~3B parameters (TahoeX1-3B)
- ![Abstract Logo](https://huggingface.co/tahoebio/tahoe-x1/resolve/main/assets/abstract_logo_light_mode.png)

- ## Installation

- To use the model, you must first install the `tahoe-x1` package from GitHub.

- 1. Clone the repository:
- ```bash
- git clone https://github.com/tahoebio/tahoe-x1
- cd tahoe-x1
- ```

- 2. Follow the installation steps for the Docker or uv based installation as described [here](https://github.com/tahoebio/tahoe-x1#installation) in the repository.

- Additional files (including vocabulary files and data for the included benchmarks) are hosted on S3 in a publicly accessible bucket (s3://tahoe-hackathon-data/MFM). These files will be automatically downloaded as needed.

  ## Quickstart
- You can quickly load the model in this way:

  ```python
  from tahoex.model import ComposerTX

- model, vocab, model_config, collator_config = ComposerTX.from_hf(
-     repo_id="tahoebio/tahoe-x1",
-     model_size="70m",  # or "1b", "3b"
      return_gene_embeddings=True,  # optional, default True
      use_chem_inf=False  # optional, default False
  )
  ```

  ## Citation

  If you use Tahoe-x1 in your research, please cite:

  ```bibtex
- @misc{shreshth2025tahoe,
-   title={Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion parameters},
-   author={Shreshth Gandhi, Farnoosh Javadi, Valentine Svensson, Umair Khan, Matthew G Jones, and others},
-   year={2025},
-   publisher={Hugging Face},
-   howpublished={\url{https://huggingface.co/tahoebio/tahoe-x1}}
  }
  ```

  ## License
- We release the model weights and associated code under the Apache 2.0 license.
 
  - tahoebio
  - pytorch
  ---

+ # Tahoe-x1

+ Tahoe-x1 is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics. Pretrained on 266 million single-cell transcriptomic profiles, including the _Tahoe-100M_ perturbation compendium, Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks with 3–30× higher compute efficiency than prior implementations.

+ **Paper**: [Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters](http://www.tahoebio.ai/news/tahoe-x1) | **GitHub**: [tahoebio/tahoe-x1](https://github.com/tahoebio/tahoe-x1)

+ <p align="center">
+   <picture>
+     <source media="(prefers-color-scheme: dark)" srcset="./assets/abstract_logo_dark_mode.png">
+     <source media="(prefers-color-scheme: light)" srcset="./assets/abstract_logo_light_mode.png">
+     <img src="./assets/abstract_logo_light_mode.png" alt="Tahoe-x1 Abstract" width="600">
+   </picture>
+ </p>

+ ## Model Sizes

+ We provide pretrained weights for three model sizes:

+ - **Tx1-70M**: ~70M parameters, 1024 context length
+ - **Tx1-1B**: ~1.3B parameters, 2048 context length
+ - **Tx1-3B**: ~3B parameters, 2048 context length
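As a rough hardware guide (an editor's back-of-envelope estimate, not a figure from the model card): at bf16 precision, weights occupy about 2 bytes per parameter, so the sizes above imply roughly the following weight memory.

```python
# Back-of-envelope weight-memory estimate at bf16 (2 bytes/param).
# Parameter counts are taken from the size list above; activations,
# optimizer state, and framework overhead are NOT included.
PARAMS = {"Tx1-70M": 70e6, "Tx1-1B": 1.3e9, "Tx1-3B": 3e9}
BYTES_PER_PARAM = 2  # bf16 assumption

estimates = {name: n * BYTES_PER_PARAM / 1e9 for name, n in PARAMS.items()}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```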
 
  ## Quickstart
+ 
+ Load a model directly from Hugging Face and generate cell embeddings:

  ```python
  from tahoex.model import ComposerTX
+ import scanpy as sc

+ # Load model from Hugging Face in a single line
+ # Options: "70m", "1b", or "3b"
+ model, vocab, model_cfg, collator_cfg = ComposerTX.from_hf(
+     repo_id="tahoebio/Tahoe-x1",
+     model_size="3b",
      return_gene_embeddings=True,  # optional, default True
      use_chem_inf=False  # optional, default False
  )
+ 
+ # Load your single-cell data
+ adata = sc.read_h5ad("your_data.h5ad")
+ 
+ # Generate embeddings (see tutorials for full example)
+ # Cell embeddings are stored in adata.obsm
+ ```
+ 
+ ### Installation
+ 
+ To use the models, install the `tahoex` package from GitHub:
+ 
+ ```bash
+ # Clone the repository
+ git clone https://github.com/tahoebio/tahoe-x1
+ cd tahoe-x1
+ 
+ # Install using Docker (recommended) or uv
+ # See installation guide: https://github.com/tahoebio/tahoe-x1#installation
+ ```
+ 
+ **Docker installation** provides better reproducibility and is recommended. For native installation, use `uv` or `pip` for dependency management.
+ 
+ Model checkpoints and configs are automatically downloaded from this Hugging Face repository when using `ComposerTX.from_hf()`. Training data is hosted publicly on S3 (s3://tahoe-hackathon-data/MFM) and will be downloaded as needed.
+ 
+ ## Tutorials
+ 
+ Please refer to the tutorials in the [GitHub repository](https://github.com/tahoebio/tahoe-x1/tree/main/tutorials) for detailed examples:
+ 
+ | Tutorial | Description | Link |
+ |----------|-------------|------|
+ | **Clustering Tutorial** | Cell clustering and UMAP visualization with Tahoe-x1 embeddings | [clustering_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/clustering_tutorial.ipynb) |
+ | **Training Tutorial** | Step-by-step guide to training and fine-tuning Tahoe-x1 models | [training_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/training_tutorial.ipynb) |
+ 
+ ## Generating Cell and Gene Embeddings
+ 
+ ### Using Configuration Files
+ 
+ 1. Create a configuration file (see `scripts/inference/configs/predict.yaml` in the [GitHub repo](https://github.com/tahoebio/tahoe-x1)):
+    ```yaml
+    # Key configuration options:
+    # - paths.hf_repo_id: Hugging Face repository (tahoebio/Tahoe-x1)
+    # - paths.hf_model_size: model size (70m, 1b, or 3b)
+    # - paths.adata_output: where to save AnnData output including embeddings
+    # - predict.return_gene_embeddings: True (for extracting gene embeddings)
+    ```
+ 
+ 2. Run the embedding script:
+    ```bash
+    python scripts/inference/predict_embeddings.py path/to/config.yaml
+ 
+    # Optional: override config values via command line
+    python scripts/inference/predict_embeddings.py path/to/config.yaml \
+        --paths.model_name=tx --batch_size=128
+    ```
+ 
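For orientation, a config with the keys listed in step 1 might look like the sketch below. The nesting and values here are illustrative assumptions only; check `scripts/inference/configs/predict.yaml` in the repository for the real schema.

```yaml
# Hypothetical sketch -- not the shipped predict.yaml
paths:
  hf_repo_id: tahoebio/Tahoe-x1
  hf_model_size: 70m
  adata_output: outputs/embeddings.h5ad
predict:
  return_gene_embeddings: true
```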
+ ### Advanced Usage
+ 
+ For memory-efficient gene embedding extraction, use the lower-level API:
+ 
+ ```python
+ from tahoex.tasks import get_batch_embeddings
+ 
+ cell_embs, gene_embs = get_batch_embeddings(
+     adata=adata,
+     model=model,
+     vocab=vocab,
+     model_cfg=model_cfg,
+     collator_cfg=collator_cfg,
+     return_gene_embeddings=True
+ )
+ ```
+ 
+ Cell embeddings are saved to `adata.obsm` and gene embeddings to `adata.varm` (if `return_gene_embeddings=True`).
+ 
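Once embeddings are in `adata.obsm`, downstream analysis is ordinary array math. A self-contained sketch (using random stand-in vectors, since loading a real model is out of scope here) of the pairwise cosine similarities a clustering step would consume:

```python
import math
import random

# Random stand-in for cell embeddings pulled from adata.obsm; the
# actual obsm key depends on your predict config, so none is hard-coded.
random.seed(0)
cells = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Pairwise similarity matrix over cells
sim = [[cosine(u, v) for v in cells] for u in cells]
```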
+ ## Training and Fine-tuning
+ 
+ Tahoe-x1 models can be trained from scratch or fine-tuned on your own data. See the [GitHub repository](https://github.com/tahoebio/tahoe-x1) for detailed instructions.
+ 
+ ### Quick Training Example
+ 
+ ```bash
+ # Train from a configuration file
+ composer scripts/train.py -f configs/test_run.yaml
+ 
+ # Fine-tune from a pretrained checkpoint
+ composer scripts/train.py -f configs/finetune_config.yaml \
+     --load_path s3://tahoe-hackathon-data/MFM/ckpts/3b/
  ```

+ For more details on training infrastructure, datasets, and benchmarks, please visit the [GitHub repository](https://github.com/tahoebio/tahoe-x1).
+ 
+ ## Model Details
+ 
+ ### Architecture
+ 
+ - **Base Architecture**: Transformer-based encoder pretrained with masked language modeling (MLM)
+ - **Training Data**: 266M single-cell profiles from CellxGene, scBaseCamp, and Tahoe-100M
+ - **Context Length**: 1024 (70M) or 2048 (1B, 3B) tokens
+ - **Training Infrastructure**: Built on MosaicML Composer with FlashAttention, FSDP, and mixed precision
+ 
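To make the MLM objective above concrete, here is a toy sketch of the input/target split it implies: mask a fraction of a cell's expression tokens and train the model to reconstruct them. The 15% mask ratio and the sentinel token are illustrative assumptions, not Tahoe-x1's published recipe.

```python
import random

random.seed(7)
SEQ_LEN = 1024      # context length of the 70M model
MASK_RATIO = 0.15   # illustrative assumption
MASK_TOKEN = -1     # sentinel standing in for a learned [MASK] embedding

# Toy binned expression values for one cell's gene tokens
expr = [random.randint(0, 9) for _ in range(SEQ_LEN)]
mask = [random.random() < MASK_RATIO for _ in range(SEQ_LEN)]

# The model sees masked inputs; the loss is computed only on the
# expression values at the masked positions.
inputs = [MASK_TOKEN if m else x for x, m in zip(expr, mask)]
targets = [x for x, m in zip(expr, mask) if m]
```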
+ ### Benchmarks
+ 
+ Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks:
+ 
+ - **DepMap Essentiality**: Predicting gene dependencies in cancer cell lines
+ - **MSigDB Hallmarks**: Recovering pathway memberships from gene embeddings
+ - **Cell-Type Classification**: Classifying cell types across multiple tissues
+ - **Perturbation Prediction**: Predicting transcriptional responses to perturbations
+ 
+ See the [paper](http://www.tahoebio.ai/news/tahoe-x1) for detailed benchmark results.

  ## Citation

  If you use Tahoe-x1 in your research, please cite:

  ```bibtex
+ @article{gandhi2025-tx1,
+   author = {Gandhi, Shreshth and Javadi, Farnoosh and Svensson, Valentine and Khan, Umair and Jones, Matthew G. and Yu, Johnny and Merico, Daniele and Goodarzi, Hani and Alidoust, Nima},
+   title = {Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters},
+   year = {2025},
+   note = {Preprint},
+   url = {https://www.tahoebio.ai/news/tahoe-x1}
  }
  ```

  ## License
+ 
+ Model weights and code are released under the Apache 2.0 license.
+ 
+ ## Contact
+ 
+ For questions or collaboration inquiries, please:
+ - Open an issue on [GitHub](https://github.com/tahoebio/tahoe-x1)
+ - Email us at [[email protected]](mailto:[email protected])
 
assets/abstract_logo_dark_mode.png ADDED
assets/tahoe-white-logo.png ADDED