# scGPT (no prior): Replogle K562/Jurkat/HepG2 → RPE1
Produced as part of the sc-interp single-cell model comparison repo.
## Provenance
- Source code commit: `1d58906`
- Runner: `scripts/run_scgpt.py`
- Dataset manifest: `data/manifests/replogle.yaml`
## Base model
scGPT whole-human pretrained (Cui et al. 2024), used as-is with the model's original learnable gene-token embeddings (no external prior). 12 transformer blocks, 8 heads, d_model=512, max_seq_len=1536. This run is the baseline counterpart to matthewshu/scgpt-replogle-esm-ft, which adds a frozen ESM2-15B per-gene prior; both runs use identical training data, splits, optimizer, and budget except for the prior.
## Training
Source dataset: arcinstitute/State-Replogle-Filtered, the CRISPRi essential-genome screens from Replogle et al. 2022 and Nadig et al. 2025. Training: 362,327 cells from K562 + Jurkat + HepG2 with 1,383 perturbations and 8,569 val pairs (held-out K562 perturbations). Evaluation: 109,207 RPE1 cells perturbed by the 1,047 genes overlapping the K562 training perturbation set, plus 10,691 real RPE1 controls.
Fine-tuned the scGPT whole-human pretrained checkpoint on this split with no additional gene prior. Used --stop-metric pearson_delta (per-perturbation Pearson on Δ-expression) for early-stopping and best-checkpoint selection; this metric directly measures perturbation-effect prediction quality, whereas full-expression Pearson is dominated by the unchanged-genes baseline. Training ran the full 30-epoch budget without triggering early stopping; the best checkpoint is from epoch 27.
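As a sketch of what the stop metric computes (not the cell-eval implementation; function and argument names here are illustrative), per-perturbation pearson_delta correlates the predicted and observed shifts from the control mean:

```python
import numpy as np

def pearson_delta(pred_mean, true_mean, control_mean):
    """Pearson correlation of predicted vs. observed delta-expression.

    Each argument is an (n_genes,) vector: mean expression of the
    predicted perturbed cells, the real perturbed cells, and the
    unperturbed controls. Correlating deltas removes the
    unchanged-genes baseline that dominates full-expression Pearson.
    """
    pred_delta = pred_mean - control_mean
    true_delta = true_mean - control_mean
    return float(np.corrcoef(pred_delta, true_delta)[0, 1])
```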
## Budget and stopping
| Setting | Value |
|---|---|
| Hardware | NVIDIA H100 PCIe (80 GB) |
| Train batch size | 192 |
| Eval batch size | 192 |
| Max epochs | 30 |
| Early-stop patience | 10 |
| Stop metric | pearson_delta |
| Epochs trained | 30 |
| Best epoch | 27 |
| Best val pearson_delta | 0.1993 |
| Training cells seen | 5,400,630 |
| Wall clock | 393.6 min (~6.56 h) |
| Stop reason | max_epochs |
| AMP | fp16 |
| Optimizer | Adam, lr=1e-4, StepLR γ=0.9 |
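Under the StepLR schedule above, the learning rate decays as lr = 1e-4 · 0.9^epoch, assuming PyTorch's step_size=1 (the run records only the base lr and γ, so step_size is an assumption here). A minimal sketch:

```python
def steplr_lr(epoch, base_lr=1e-4, gamma=0.9, step_size=1):
    """Learning rate after `epoch` completed epochs under StepLR decay.

    Assumes step_size=1; only base_lr and gamma are recorded for this run.
    """
    return base_lr * gamma ** (epoch // step_size)
```

By the best epoch (27), the learning rate has decayed to roughly 5.8e-6.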
## Test set metrics (cell-eval)
| metric | mean | median | max |
|---|---|---|---|
| pearson_delta | 0.1825 | 0.1424 | 0.6170 |
| pr_auc | 0.5206 | 0.5159 | 0.9191 |
| roc_auc | 0.3626 | 0.3603 | 0.4858 |
| overlap_at_N | 0.5081 | 0.4978 | 0.9252 |
| de_sig_genes_recall | 0.5313 | 0.5135 | 0.9527 |
| de_direction_match | 0.5252 | 0.5336 | 0.7896 |
| discrimination_score_l1 | 0.5091 | 0.5091 | 1.0000 |
| mae_delta | 0.1763 | 0.1737 | 0.2336 |
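The mean/median/max columns above aggregate the 1,047 per-perturbation rows in eval/results.csv. A stdlib-only sketch of that aggregation (the shipped tables come from cell-eval's pandas describe(), not this function):

```python
import statistics

def summarize(values):
    """Collapse one per-perturbation metric column into the
    mean/median/max summary reported per metric in the table."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
    }
```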
Compare with matthewshu/scgpt-replogle-esm-ft (same data, same budget, with an ESM2-15B prior injected at the gene-embedding layer): the +ESM run reaches pearson_delta = 0.508 on test vs 0.183 here, and improves de_direction_match, de_sig_genes_recall, and overlap_at_N by 5–11 absolute percentage points. The two runs share commit, runner code, dataset manifest, split, optimizer, and batch size.
## Known limitations
- Cell line distribution shift: trained on K562/Jurkat/HepG2, evaluated on RPE1.
- Test set restricted to the 1,047 perturbed genes overlapping K562 training perturbations; it does not test out-of-distribution perturbed genes.
- roc_auc < 0.5 on test (also seen in the +ESM counterpart). Both runs use the same eval pipeline, so this is a cell-eval/data convention quirk rather than a model defect.
## Files
- `best_model.pt`: fine-tuned scGPT weights (PyTorch state_dict, best val pearson_delta)
- `args.json`: scGPT pretrained args (whole-human checkpoint config)
- `vocab.json`: scGPT gene-token vocabulary
- `training_stats.json`: wall clock, wandb run URL, epoch count, best metrics, stop reason
- `eval/agg_results.csv`: cell-eval `describe()` table over 1,047 RPE1 test perturbations
- `eval/results.csv`: per-perturbation cell-eval metrics (1,047 rows × 28 metric columns)
- `predictions/scgpt_replogle_test.h5ad`: self-contained predictions h5ad with predicted expression in `.X` and ground truth in `.layers['truth']`; includes 10,691 real RPE1 control cells. Layout produced by `scripts/run_scgpt.py:save_predictions`.