The Biggest AI for Biology Yet Writes Genomes From Scratch

On-demand DNA for every branch of life.

Mother Nature is perhaps the most powerful generative “intelligence.” With just four genetic letters (A, T, C, and G), she has crafted the dazzling variety of life on Earth.

Can generative AI expand on her work?

A new algorithm, Evo 2, trained on roughly 128,000 genomes—9.3 trillion DNA letter pairs—spanning all of life’s domains, is now the largest generative AI model for biology to date. Built by scientists at the Arc Institute, Stanford University, and Nvidia, Evo 2 can write whole chromosomes and small genomes from scratch.

It also learned how DNA mutations affect proteins, RNA, and overall health, shedding light on “non-coding” regions in particular. These mysterious sections of DNA don’t make proteins but often control gene activity and are linked to diseases.

The team has released Evo 2’s software code and model parameters to the scientific community for further exploration. Researchers can also access the tool through a user-friendly web interface. With Evo 2 as a foundation, scientists may develop more specific AI models. These could predict how mutations affect a protein’s function, how genes operate differently across cell types, or even help researchers design new genomes for synthetic biology.

Evo 2 marks “a key moment in the emerging field of generative biology” because machines can now read, write, and “think” in the language of DNA, said study author Patrick Hsu in an Arc Institute blog.

Upping the Game

Evo 2 builds on an earlier model introduced last year. Both are large language models, or LLMs, like the algorithms behind popular chatbots. The original Evo was trained on roughly three million genomes from a range of microbes and bacteria-infecting viruses.

Evo 2 expanded this to include genes from humans, plants, yeast, and other organisms made of more complex cells. These are all known as eukaryotes. Eukaryotic genomes are far more intricate than bacterial ones. Some DNA snippets, for example, have specific functions, such as turning a gene on or off. Others allow a single gene to churn out multiple versions of a protein.

“These features underpin the emergence of multicellularity, sophisticated traits, and intelligent behaviors that are unique to eukaryotic life,” wrote the team in a pre-print paper on bioRxiv.

Though critical for the emergence of complex life, these control mechanisms are a headache for generative AI. Regulatory elements can be far apart from their associated genes, making it difficult to hunt them down. They’re usually hidden in regions of the genome that don’t make proteins but are still crucial to gene expression or the maintenance of chromosomes.

The team explicitly included these regions in Evo 2’s training. They curated a dataset of DNA sequences from 128,000 genomes encompassing all branches on the tree of life. Together, the dataset, OpenGenome2, contains 9.3 trillion DNA letters.

They created two versions of Evo 2: a smaller version trained on 2.4 trillion letters and a full version trained on the entire dataset. Both were designed to churn through mountains of data quickly, including longer stretches of DNA. This allows Evo 2 to broaden its “search window” and find patterns across a larger genetic landscape, which is crucial for eukaryotic cells, whose DNA sequences are far longer than those of bacteria. Compared to its predecessor, Evo 2 trained on 30 times more data and can crunch 8 times as many DNA letters at a time. The whole training process took several months on over 2,000 Nvidia H100 GPUs.
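To put those figures in proportion, here is a quick back-of-the-envelope calculation in Python. It uses only the numbers quoted above. The variable names are made up for illustration, and the last figure is an inference, not something the team reported here: it assumes "30 times more data" is a simple ratio against the full 9.3 trillion letters.

```python
# Figures quoted in the article.
TOTAL_LETTERS = 9.3e12        # DNA letters in OpenGenome2
GENOMES = 128_000             # genomes in the dataset
SMALL_MODEL_LETTERS = 2.4e12  # letters used to train the smaller Evo 2
DATA_MULTIPLIER = 30          # "30 times more data" than the original Evo

# Average genome size in the dataset, in DNA letters.
avg_letters_per_genome = TOTAL_LETTERS / GENOMES

# Share of the full dataset seen by the smaller model.
small_model_share = SMALL_MODEL_LETTERS / TOTAL_LETTERS

# ASSUMPTION: reading "30 times more" as a plain ratio against the full
# dataset, this is the training-data size it implies for the original Evo.
implied_original_evo_letters = TOTAL_LETTERS / DATA_MULTIPLIER

print(f"Average letters per genome: {avg_letters_per_genome:,.0f}")
print(f"Smaller model's share of the data: {small_model_share:.1%}")
print(f"Implied original Evo training data: {implied_original_evo_letters:.2e} letters")
```

On these assumptions, the average genome in the dataset runs to roughly 73 million letters, the smaller model saw about a quarter of the data, and the original Evo would have trained on something like 310 billion letters.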

Genetic Sleuth

Once trained, Evo 2 beat state-of-the-art models at predicting the effects of mutations in BRCA1, a gene linked to breast cancer. It especially outshone its competitors when the test included both protein-coding and non-coding letter changes. The AI separated benign mutations from potentially harmful ones with over 90 percent accuracy.
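How can a model judge a mutation from sequence alone? This article does not spell out the scoring method. One common approach with sequence language models, assumed here purely for illustration, is to ask how much less "natural" a stretch of DNA looks to the model after a letter is swapped. The Python sketch below uses a tiny stand-in model (a second-order Markov chain) in place of Evo 2. Every name and sequence in it is hypothetical, and none of it is Evo 2's actual code or API.

```python
import math
from collections import Counter


def train_markov(sequences, k=2, alphabet="ACGT"):
    """Fit a toy order-k Markov model with add-one smoothing.

    This is a stand-in for a real DNA language model, not Evo 2 itself.
    Returns a function that gives the log-probability of a sequence.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            context, nxt = seq[i - k:i], seq[i]
            context_counts[context] += 1
            transition_counts[(context, nxt)] += 1

    def log_prob(seq):
        total = 0.0
        for i in range(k, len(seq)):
            context, nxt = seq[i - k:i], seq[i]
            # Add-one smoothing keeps unseen transitions from having zero probability.
            p = (transition_counts[(context, nxt)] + 1) / (
                context_counts[context] + len(alphabet)
            )
            total += math.log(p)
        return total

    return log_prob


def variant_score(log_prob, reference, position, alt_base):
    """Log-likelihood of the mutated sequence minus that of the reference.

    More negative means the model finds the mutation more surprising.
    """
    mutated = reference[:position] + alt_base + reference[position + 1:]
    return log_prob(mutated) - log_prob(reference)


# Hypothetical toy data: the "model" has only ever seen ATG repeats.
log_prob = train_markov(["ATGATGATGATGATGATG"])
reference = "ATGATGATGATG"

disruptive = variant_score(log_prob, reference, position=5, alt_base="C")
unchanged = variant_score(log_prob, reference, position=5, alt_base="G")
print(f"G>C at position 5: {disruptive:.3f}")
print(f"no change:         {unchanged:.3f}")
```

In this toy setup, a swap that breaks the repeating pattern gets a negative score, while leaving the letter alone scores zero. A real system would then need a threshold or a trained classifier on top of such scores to sort variants into benign and harmful.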

Using AI to screen for cancer isn’t new. But older methods often made diagnoses using medical images. Evo 2 used DNA sequences alone. With further validation, the tool could one day help scientists find the genetic causes of diseases—especially those hidden in non-coding regions.

It could also aid new treatments that target specific tissues, according to study author Hani Goodarzi. “If you have a gene therapy that you want to turn on only in neurons to avoid side effects, or only in liver cells, you could design a genetic element that is only accessible in those specific cells.”

Potential medical uses aside, Evo 2 learned a variety of complex genetic traits across multiple species. For example, the tool fished out patterns in the human genome that could also be used to annotate that of a woolly mammoth. Our genome is different from that of the extinct beast, but Evo 2 found a shared genetic vocabulary and grammar that transcended the divide.

“Evo 2 represents a significant step in learning DNA regulatory grammar,” Christina Theodoris at the Gladstone Institutes told Nature.

Genome Architect

Scientists used the original Evo to design a variety of new CRISPR gene-editing tools and a full-length bacterial genome from scratch. Although the latter contained genes essential for survival, the AI also “hallucinated” unnatural sequences that prevented it from being functional.

Evo 2 fared better. The team first challenged the model to create a full set of human mitochondrial DNA. With only 13 protein-coding genes and a handful of RNA types, these genomes are relatively small, but the resulting proteins and RNA do intricate work together.

The AI generated 250 unique mitochondrial genomes, each containing roughly 16,000 letters. Using a protein prediction tool, AlphaFold 3, the team found these sequences yielded proteins similar to those found naturally in mitochondria. The team also used Evo 2 to create a minimal bacterial genome with just 580,000 DNA letters and a 330,000-letter-long yeast chromosome. And they added a Morse code message to a mouse’s genome.

To be clear, these generated DNA blueprints have yet to be tested inside living cells, but experiments are in the works.

Evo 2 is a step towards designing complex genomes. Combined with other AI tools in biology, it inches us closer to programming entirely new forms of synthetic life, wrote the authors.

The post The Biggest AI for Biology Yet Writes Genomes From Scratch appeared first on SingularityHub.
