Enhance model card with comprehensive documentation and visuals

- Add prominent quickstart section with ComposerTX.from_hf() usage
- Include abstract logo images with dark/light mode support
- Add detailed sections: tutorials, embeddings, training, benchmarks
- Expand model details with architecture and training info
- Include both Tahoe-x1 and Tahoe-100M citations
- Add contact information and improve overall structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Files changed (3)

README.md +146 -28
assets/abstract_logo_dark_mode.png +0 -0
assets/tahoe-white-logo.png +0 -0

README.md CHANGED

@@ -10,61 +10,179 @@ tags:
 - tahoebio
 - pytorch
 ---
-# Tahoe-x1
-Tahoe-x1 is a family of perturbation-trained single-cell foundation models  developed by Tahoe Therapeutics.
-For more details, see our blog post and [📄 preprint](http://www.tahoebio.ai/news/tahoe-x1).
-In this repository, we provide pretrained weights for three model sizes:
-- ~70M parameters (TahoeX1-70M)
-- ~1B parameters (TahoeX1-1B)
-- ~3B parameters (TahoeX1-3B)
-![Abstract Logo](https://huggingface.co/tahoebio/tahoe-x1/resolve/main/assets/abstract_logo_light_mode.png)
-## Installation
-To use the model, you must first install the `tahoe-x1` package from GitHub.
-1. Clone the repository:
-   ```bash
-   git clone https://github.com/tahoebio/tahoe-x1
-   cd tahoe-x1
-   ```
-2. Follow the installation steps for the docker or uv based insallation as described [here](https://github.com/tahoebio/tahoe-x1#installation) in the repository
-Additional files (including vocabulary files and data for the included benchmarks) are hosted on S3 in a publicly accesible bucket (s3://tahoe-hackathon-data/MFM). These files will be automatically downloaded as needed.
 ## Quickstart
-You can quickly load the model in this way:
 ```python
 from tahoex.model import ComposerTX
-model, vocab, model_config, collator_config = ComposerTX.from_hf(
-    repo_id="tahoebio/tahoe-x1",
-    model_size="70m",  # or "1b", "3b".
     return_gene_embeddings=True,  # optional, default True
     use_chem_inf=False  # optional, default False
 )
 ```
 ## Citation
 If you use Tahoe-x1 in your research, please cite:
 ```bibtex
-@misc{shreshth2025tahoe,
-  title={Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion parameters},
-  author={Shreshth Gandhi, Farnoosh Javadi, Valentine Svensson, Umair Khan, Matthew G Jones, and others},
-  year={2025},
-  publisher={Hugging Face},
-  howpublished={\url{https://huggingface.co/tahoebio/tahoe-x1}}
 }
 ```
 ## License
-We release the model weights and associated code under the Apache 2.0 license.

 - tahoebio
 - pytorch
 ---
+# Tahoe-x1
+Tahoe-x1 is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics. Pretrained on 266 million single-cell transcriptomic profiles including the _Tahoe-100M_ perturbation compendium, Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks with 3–30× higher compute efficiency than prior implementations.
+**Paper**: [Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters](http://www.tahoebio.ai/news/tahoe-x1) | **GitHub**: [tahoebio/tahoe-x1](https://github.com/tahoebio/tahoe-x1)
+<p align="center">
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="./assets/abstract_logo_dark_mode.png">
+    <source media="(prefers-color-scheme: light)" srcset="./assets/abstract_logo_light_mode.png">
+    <img src="./assets/abstract_logo_light_mode.png" alt="Tahoe-x1 Abstract" width="600">
+  </picture>
+</p>
+## Model Sizes
+We provide pretrained weights for three model sizes:
+- **Tx1-70M**: ~70M parameters, 1024 context length
+- **Tx1-1B**: ~1.3B parameters, 2048 context length
+- **Tx1-3B**: ~3B parameters, 2048 context length
 ## Quickstart
+Load a model directly from Hugging Face and generate cell embeddings:
 ```python
 from tahoex.model import ComposerTX
+import scanpy as sc
+# Load model from Hugging Face in a single line
+# Options: "70m", "1b", or "3b"
+model, vocab, model_cfg, collator_cfg = ComposerTX.from_hf(
+    repo_id="tahoebio/Tahoe-x1",
+    model_size="3b",
     return_gene_embeddings=True,  # optional, default True
     use_chem_inf=False  # optional, default False
 )
+# Load your single-cell data
+adata = sc.read_h5ad("your_data.h5ad")
+# Generate embeddings (see tutorials for full example)
+# Cell embeddings are stored in adata.obsm
+```
+### Installation
+To use the models, install the `tahoex` package from GitHub:
+```bash
+# Clone the repository
+git clone https://github.com/tahoebio/tahoe-x1
+cd tahoe-x1
+# Install using Docker (recommended) or uv
+# See installation guide: https://github.com/tahoebio/tahoe-x1#installation
+```
+**Docker installation** provides better reproducibility and is recommended for the best experience. For native installation, use `uv` or `pip` for dependency management.
+Model checkpoints and configs are automatically downloaded from this Hugging Face repository when using `ComposerTX.from_hf()`. Training data is hosted publicly on S3 (s3://tahoe-hackathon-data/MFM) and will be downloaded as needed.
+## Tutorials
+Please refer to the tutorials in the [GitHub repository](https://github.com/tahoebio/tahoe-x1/tree/main/tutorials) for detailed examples:
+| Tutorial | Description | Link |
+|----------|-------------|------|
+| **Clustering Tutorial** | Cell clustering and UMAP visualization with Tahoe-x1 embeddings | [clustering_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/clustering_tutorial.ipynb) |
+| **Training Tutorial** | Step-by-step guide to training and fine-tuning Tahoe-x1 models | [training_tutorial.ipynb](https://github.com/tahoebio/tahoe-x1/blob/main/tutorials/training_tutorial.ipynb) |
+## Generating Cell and Gene Embeddings
+### Using Configuration Files
+1. Create a configuration file (see `scripts/inference/configs/predict.yaml` in the [GitHub repo](https://github.com/tahoebio/tahoe-x1)):
+```yaml
+# Key configuration options:
+#   - paths.hf_repo_id: Hugging Face repository (tahoebio/Tahoe-x1)
+#   - paths.hf_model_size: model size (70m, 1b, or 3b)
+#   - paths.adata_output: where to save AnnData output including embeddings
+#   - predict.return_gene_embeddings: True (for extracting gene embeddings)
+```
+2. Run the embedding script:
+```bash
+python scripts/inference/predict_embeddings.py path/to/config.yaml
+# Optional: override config values via command line
+python scripts/inference/predict_embeddings.py path/to/config.yaml \
+    --paths.model_name=tx --batch_size=128
+```
+### Advanced Usage
+For memory-efficient gene embedding extraction, use the lower-level API:
+```python
+from tahoex.tasks import get_batch_embeddings
+cell_embs, gene_embs = get_batch_embeddings(
+    adata=adata,
+    model=model,
+    vocab=vocab,
+    model_cfg=model_cfg,
+    collator_cfg=collator_cfg,
+    return_gene_embeddings=True
+)
+```
+Cell embeddings are saved to `adata.obsm` and gene embeddings to `adata.varm` (if `return_gene_embeddings=True`).
+## Training and Fine-tuning
+Tahoe-x1 models can be trained from scratch or fine-tuned on your own data. See the [GitHub repository](https://github.com/tahoebio/tahoe-x1) for detailed instructions.
+### Quick Training Example
+```bash
+# Train from a configuration file
+composer scripts/train.py -f configs/test_run.yaml
+# Fine-tune from a pretrained checkpoint
+composer scripts/train.py -f configs/finetune_config.yaml \
+    --load_path s3://tahoe-hackathon-data/MFM/ckpts/3b/
 ```
+For more details on training infrastructure, datasets, and benchmarks, please visit the [GitHub repository](https://github.com/tahoebio/tahoe-x1).
+## Model Details
+### Architecture
+- **Base Architecture**: Transformer-based encoder pretrained with masked language modeling (MLM)
+- **Training Data**: 266M single-cell profiles from CellxGene, scBaseCamp, and Tahoe-100M
+- **Context Length**: 1024 (70M) or 2048 (1B, 3B) tokens
+- **Training Infrastructure**: Built on MosaicML Composer with FlashAttention, FSDP, and mixed precision
+### Benchmarks
+Tahoe-x1 achieves state-of-the-art performance on cancer-relevant tasks:
+- **DepMap Essentiality**: Predicting gene dependencies in cancer cell lines
+- **MSigDB Hallmarks**: Recovering pathway memberships from gene embeddings
+- **Cell-Type Classification**: Classifying cell types across multiple tissues
+- **Perturbation Prediction**: Predicting transcriptional responses to perturbations
+See the [paper](http://www.tahoebio.ai/news/tahoe-x1) for detailed benchmark results.
 ## Citation
 If you use Tahoe-x1 in your research, please cite:
 ```bibtex
+@article{gandhi2025-tx1,
+  author       = {Gandhi, Shreshth and Javadi, Farnoosh and Svensson, Valentine and Khan, Umair and Jones, Matthew G. and Yu, Johnny and Merico, Daniele and Goodarzi, Hani and Alidoust, Nima},
+  title        = {Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters},
+  type         = {Preprint},
+  year         = {2025},
+  note         = {Preprint},
+  url          = {http://www.tahoebio.ai/news/tahoe-x1}
 }
 ```
 ## License
+Model weights and code are released under the Apache 2.0 license.
+## Contact
+For questions or collaboration inquiries, please:
+- Open an issue on [GitHub](https://github.com/tahoebio/tahoe-x1)
+- Email us at [[email protected]](mailto:[email protected])

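The Model Details section of the new README describes the models as Transformer-based encoders pretrained with masked language modeling (MLM). As a rough illustration of that objective only, the sketch below hides a random subset of input tokens and keeps the originals as prediction targets. The mask ratio, the mask token id, and the use of plain integer tokens are assumptions made for the example. The diff does not state how Tahoe-x1 selects or encodes masked positions, and `mask_tokens` is a hypothetical helper.

```python
import random


def mask_tokens(token_ids, mask_id, mask_ratio=0.15, seed=None):
    """Replace a random subset of tokens with `mask_id`.

    Returns (masked_ids, targets). targets[i] holds the original token at
    masked positions and None everywhere else, so a loss would be computed
    only where a token was hidden.
    """
    if not 0.0 < mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in (0, 1]")
    if not token_ids:
        return [], []
    rng = random.Random(seed)
    # Mask at least one position, and never more than the sequence length.
    n_mask = min(len(token_ids), max(1, round(len(token_ids) * mask_ratio)))
    positions = set(rng.sample(range(len(token_ids)), n_mask))
    masked = [mask_id if i in positions else t for i, t in enumerate(token_ids)]
    targets = [t if i in positions else None for i, t in enumerate(token_ids)]
    return masked, targets


# Hypothetical token ids standing in for the genes of one cell; 0 is the assumed mask id.
gene_tokens = [11, 42, 7, 93, 5, 68, 24, 3, 80, 56]
masked, targets = mask_tokens(gene_tokens, mask_id=0, mask_ratio=0.3, seed=0)
print(masked)
print(targets)
```

During pretraining, a model would receive `masked` as input and be scored on how well it recovers the non-`None` entries of `targets`.
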
assets/abstract_logo_dark_mode.png ADDED

assets/tahoe-white-logo.png ADDED