The model organizes genes into a structured biological coordinate system rather than opaque features
Different transformer layers encode different kinds of biological information in a hierarchical manner
The findings have implications for regulatory network inference, drug target prioritization, and model auditing
Full Retelling
Researcher Ihor Kendiukhov published a study on arXiv on February 24, 2026, that systematically decodes the geometric structure of scGPT's internal representations to determine what biological knowledge single-cell foundation models encode. The study, titled 'Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations,' ran 63 iterations of automated hypothesis screening (183 hypotheses tested) and found that scGPT organizes genes into a structured biological coordinate system rather than an opaque feature space.
The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one extreme and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein-protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000 across n = 5 STRING confidence quintiles, p = 0.017).
In a compact six-dimensional spectral subspace, the model distinguishes transcription factors from their target genes (0.744, significant in all 12 layers). Early layers preserve which specific genes regulate which targets, while deeper layers compress this information into a coarser regulator-versus-regulated distinction. Repression edges are geometrically more prominent than activation edges, and the B-cell master regulators BATF and BACH2 converge toward the B-cell identity anchor PAX5 across transformer depth. Cell-type marker genes cluster with high fidelity (0.851), and residual-stream geometry encodes biological structure complementary to attention patterns.
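The protein-protein interaction statistic above (Spearman rho = 1.000 across n = 5 quintiles, p = 0.017) can be checked with a short calculation. With five tie-free points there are only 5! = 120 possible rank orderings, and only two of them (perfectly increasing and perfectly decreasing) reach |rho| = 1, so an exact two-sided permutation test gives p = 2/120, which rounds to 0.017. The sketch below reproduces that arithmetic. The source does not state which test was used, so the permutation test is an assumption, and the `fidelity` values are hypothetical placeholders chosen only to be strictly increasing.

```python
from itertools import permutations


def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data (classic d^2 formula)."""
    n = len(x)

    def ranks(values):
        ordered = sorted(values)
        return [ordered.index(v) + 1 for v in values]

    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def exact_two_sided_p(x, y):
    """Fraction of all orderings of y whose |rho| is at least the observed |rho|."""
    observed = abs(spearman_rho(x, y))
    orderings = list(permutations(y))
    extreme = sum(abs(spearman_rho(x, p)) >= observed - 1e-12 for p in orderings)
    return extreme / len(orderings)


# STRING confidence quintiles 1..5 (from the summary) paired with
# HYPOTHETICAL fidelity scores; only their monotonic order matters here.
quintile = [1, 2, 3, 4, 5]
fidelity = [0.10, 0.20, 0.30, 0.40, 0.50]

rho = spearman_rho(quintile, fidelity)
p = exact_two_sided_p(quintile, fidelity)
print(f"rho = {rho:.3f}, exact two-sided p = {p:.3f}")
```

One consequence worth noting: with n = 5, a p-value of about 0.017 is the smallest a two-sided exact test can produce, so the reported figure reflects a perfectly monotonic trend across quintiles rather than a large sample.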
Spectral geometry is a field in mathematics which concerns relationships between geometric structures of domains and manifolds and spectra of canonically defined differential operators. The case of the Laplace–Beltrami operator on a closed Riemannian manifold has been most intensively studied, altho...
Regulation of gene expression, or gene regulation, includes a wide range of mechanisms that are used by cells to increase or decrease the production of specific gene products (protein or RNA). Sophisticated programs of gene expression are widely observed in biology, for example to trigger developmen...
The cells of eukaryotic organisms are elaborately subdivided into functionally-distinct membrane-bound compartments. Some major constituents of eukaryotic cells are: extracellular space, plasma membrane, cytoplasm, nucleus, mitochondria, Golgi apparatus, endoplasmic reticulum (ER), peroxisome, vacuo...
arXiv:2602.22247 (q-bio), Quantitative Biology > Genomics [Submitted on 24 Feb 2026]
Title: Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations
Authors: Ihor Kendiukhov
Abstract: Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening (183 hypotheses tested), revealing that the model organizes genes into a structured biological coordinate system rather than an opaque feature space. The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one pole and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein-protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000 across n = 5 STRING confidence quintiles, p = 0.017). In a compact six-dimensional spectral subspace, the model distinguishes transcription factors from their target genes (0.744, all 12 layers significant). Early layers preserve which specific genes regulate which targets, while deeper layers compress this into a coarser regulator versus regulated distinction. Repression edges are geometrically more prominent than activation edges, and B-cell master regulators BATF and BACH2 show convergence toward the B-cell identity anchor PAX5 across transformer depth. Cell-type marker genes cluster with high fidelity (0.851).
Residual-stream geometry encodes biological structure complementary to attention patte...