Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis

Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, but the resulting datasets are large and hard to interpret. To address this, researchers have developed Cell2Sentence-Scale (C2S-Scale), a family of large language models that can "read" and "write" biological data at the single-cell level. C2S-Scale transforms each cell's gene expression profile into a sequence of text, called a "cell sentence", so that language models can be applied directly to scRNA-seq data. This representation makes single-cell data more accessible, interpretable, and flexible. The C2S-Scale model family is trained on over 1 billion tokens drawn from real-world transcriptomic datasets, biological metadata, and scientific literature. The models respond to diverse input queries for both prediction and generation tasks, enabling conversational single-cell analysis: C2S-Scale can answer questions about single-cell data, generate biological summaries of scRNA-seq data, and predict how a cell will respond to a perturbation. Performance improves predictably as model size increases, following clear scaling laws. The ability to simulate cellular behavior in silico could accelerate drug discovery, support personalized medicine, and help prioritize experiments. The Cell2Sentence models and resources are now available on platforms such as HuggingFace and GitHub, so researchers can explore and experiment with their own single-cell data.
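
To make the "cell sentence" idea concrete, here is a minimal sketch in Python. It assumes the rank-ordering approach described in the Cell2Sentence work, where a cell's detected genes are listed by name from highest to lowest expression. The function name `cell_to_sentence`, the `top_k` cutoff, and the toy expression values are hypothetical and chosen for illustration only; this is not the project's actual API.

```python
from typing import Mapping


def cell_to_sentence(expression: Mapping[str, float], top_k: int = 100) -> str:
    """Turn one cell's gene -> expression mapping into a 'cell sentence'.

    Assumption: genes are ordered by descending expression and joined with
    spaces. The top_k cutoff is an illustrative choice, not a documented one.
    """
    # Keep only genes that were actually detected in this cell.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by expression, highest first; break ties alphabetically so the
    # output is deterministic.
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    # The sentence is the gene names in rank order.
    return " ".join(gene for gene, _ in expressed[:top_k])


# Toy cell with made-up counts (not real data).
cell = {"CD3E": 5.0, "MALAT1": 42.0, "ACTB": 17.0, "GAPDH": 0.0, "CD8A": 5.0}
sentence = cell_to_sentence(cell)
print(sentence)  # MALAT1 ACTB CD3E CD8A
```

A string like this can be tokenized like any other text, which is what lets a language model consume it alongside metadata and natural-language prompts.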

research.google

RSS Hunter

2025-04-16