Geneformer-10M (TransformerEngine-Optimized) Overview

Description:

Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-specific predictions in network biology, particularly in settings with limited data.

This version of the Geneformer model is optimized with NVIDIA's TransformerEngine library. It is based on the original Geneformer V1 model and, within numerical precision, has identical weights and outputs.
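One way to check the "identical within numerical precision" claim is to compare embeddings from the original and the TransformerEngine-optimized checkpoints under a tolerance appropriate for reduced-precision arithmetic. The sketch below uses placeholder arrays in place of real model outputs; the tolerance value is an assumption, not a documented bound.

```python
import numpy as np

# Placeholder for an embedding from the original Geneformer V1 checkpoint.
ref = np.array([0.1234, -0.5678, 0.9012], dtype=np.float32)
# Placeholder for the TE-optimized output, with small drift from
# fused / lower-precision kernels.
opt = ref + np.float32(1e-4)

# Equality "within numerical precision": element-wise closeness under a
# tolerance chosen for fp16/bf16-level arithmetic (assumed, illustrative).
assert np.allclose(ref, opt, atol=1e-2)
```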

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see the non-NVIDIA Geneformer Model Card.

License/Terms of Use:

Geneformer is licensed under the Apache 2.0 license.

Deployment Geography:

Global

Use Case:

Network biology and therapeutic discovery, particularly in data-limited settings such as rare diseases or diseases affecting hard-to-access tissues.

Release Date:

Hugging Face 12/19/2025 via https://huggingface.co/nvidia/geneformer_V1_10M

Reference(s):

Model Architecture:

Architecture Type: Transformer
Network Architecture: BERT

This model was developed based on: Geneformer
Number of model parameters: 1 x 10^7 (10 million)

Input:

Input Type: Numeric (each row represents a cell, containing gene names and single-cell expression counts)
Input Format: Array (AnnData)
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: This model supports a context length of 2048.
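Geneformer represents each cell as a rank-value encoding: expressed genes are ordered by expression and the resulting token sequence is truncated to the context length. The sketch below illustrates that idea with NumPy; it is a simplified assumption-level example (the real pipeline also normalizes each gene's count by its non-zero median across the corpus, which is omitted here), and `rank_encode` is a hypothetical helper, not part of the released tooling.

```python
import numpy as np

def rank_encode(counts, gene_ids, max_len=2048):
    """Simplified rank-value encoding: keep expressed genes, order them
    by descending expression, and truncate to the context length."""
    expressed = counts > 0
    counts = counts[expressed]
    gene_ids = gene_ids[expressed]
    order = np.argsort(-counts, kind="stable")  # highest expression first
    return gene_ids[order][:max_len]

# Toy cell: five genes, where gene 12 is the most highly expressed.
counts = np.array([3.0, 0.0, 9.0, 1.0, 5.0])
genes = np.array([10, 11, 12, 13, 14])
tokens = rank_encode(counts, genes, max_len=4)
# tokens -> [12, 14, 10, 13]
```

In the real model, `max_len` would be the supported context length of 2048.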

Output:

Output Type: Dense Embedding Predictions
Output Format: Vector
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Numeric floating-point vector (fp16, bf16, or fp32); Geneformer-10M outputs 256-dimensional embeddings.
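A common way to obtain a single cell embedding from a BERT-style encoder like this is to mean-pool the per-token hidden states over the attended positions. The sketch below uses a random array as a stand-in for real model output; the pooling strategy is an assumption for illustration, and downstream pipelines may pool differently.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for per-token hidden states from the encoder: one 256-dim
# vector per position in the 2048-token context.
hidden = rng.standard_normal((2048, 256)).astype(np.float32)

# Pretend this cell had 1500 expressed genes; the rest is padding.
attention_mask = np.ones(2048, dtype=bool)
attention_mask[1500:] = False

# Mean-pool over non-padding positions to get one 256-dim cell embedding.
cell_emb = hidden[attention_mask].mean(axis=0)
assert cell_emb.shape == (256,)
```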

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Transformer Engine
  • PyTorch

Supported Hardware Microarchitecture Compatibility:

  • A100
  • H100
  • H200
  • GB200

Preferred/Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

  • Geneformer-V1-10M
  • Geneformer-V2-104M
  • Geneformer-V2-316M
  • Geneformer-V2-104M_CLcancer

Training and Evaluation Datasets:

Training Datasets:

Link: Genecorpus-30M

Data Modality:

  • Text (Human single-cell transcriptomes)

Text Training Data Size:

  • 1 Billion to 10 Trillion Tokens

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • N/A

Properties: The single-cell transcriptomes were assembled from a broad range of publicly available data sources. The researchers collected raw counts from sources like NCBI Gene Expression Omnibus (GEO), Human Cell Atlas, and Tumor Immune Single-cell Hub (TISCH), among others. They excluded cells with high mutational burdens, such as malignant cells and immortalized cell lines, and included only droplet-based sequencing platforms to ensure data comparability. The raw data was then converted into a uniform loom HDF5 file format.
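The corpus-assembly filters described above (excluding high-mutational-burden cells and keeping only droplet-based platforms) amount to simple boolean masks over per-cell metadata. The sketch below illustrates that filtering step with toy metadata; the platform names and flags are assumptions for illustration only.

```python
import numpy as np

# Toy per-cell metadata (assumed values, illustrative only).
platforms = np.array(["10x", "smart-seq2", "10x", "drop-seq"])
malignant = np.array([False, False, True, False])

# Keep droplet-based platforms and drop malignant/immortalized cells,
# mirroring the corpus-assembly filters described above.
droplet_based = np.isin(platforms, ["10x", "drop-seq"])
keep = droplet_based & ~malignant
# keep -> [True, False, False, True]
```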

Evaluation Datasets:

Link: A cross-disorder dosage sensitivity map of the human genome

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Not Applicable

Properties: The data was collected by harmonizing and meta-analyzing rare copy-number variants (rCNVs) from nearly one million individuals across 54 different disorders. This approach created a genome-wide catalog of dosage sensitivity.

Link: Single-cell Transcriptome Analysis Reveals Dynamic Cell Populations and Differential Gene Expression Patterns in Control and Aneurysmal Human Aortic Tissue

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Human

Properties: The data was collected by performing single-cell RNA sequencing (scRNA-seq) on human ascending aortic tissues. Tissues were obtained from 11 study participants, consisting of 8 patients with ascending thoracic aortic aneurysm (ATAA) and 3 control subjects.

Link: Systematic Comparison of High-throughput Single-Cell and Single-Nucleus Transcriptomes during Cardiomyocyte Differentiation

Data Collection Method by dataset:

  • Automated

Labeling Method by dataset:

  • Human

Properties: The researchers used two different sequencing platforms to collect data from the same biological process: induced pluripotent stem cell (iPSC) differentiation into cardiomyocytes. The two platforms used were Drop-seq (single-cell) and DroNc-seq (single-nucleus). The study involved two iPSC lines and collected data over a 15-day time period.

Link: A human cell atlas of fetal gene expression

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Hybrid: Human, Automated

Properties: The data was collected by profiling the gene expression of millions of single cells from 15 different human fetal organs.

Link: Single-nuclei profiling of human dilated and hypertrophic cardiomyopathy

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Hybrid: Human, Automated

Properties: The data was collected by performing single-nucleus RNA sequencing (snRNA-seq) on left ventricle samples from human hearts. The study included samples from 11 hearts with dilated cardiomyopathy, 15 hearts with hypertrophic cardiomyopathy, and 16 non-failing hearts. In total, nearly 600,000 nuclei were sequenced.

Inference:

Acceleration Engine: Transformer Engine, PyTorch

Test Hardware:

  • A100
  • H100
  • H200
  • GB200

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and comply with applicable safety regulations and ethical standards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
