AI-enabled protein design is a technology area you’ll want to keep an eye on in the year ahead, according to Nature. With massive datasets to train on and ever more sophisticated deep learning approaches, technologies like our own RFdiffusion All-Atom are opening the door to custom enzymes, advanced biomaterials, and more.
The article, “Seven technologies to watch in 2024” (excerpted below), was written by Michael Eisenstein and published today in Nature, with an illustration by The Project Twins.
Deep learning for protein design
Two decades ago, David Baker at the University of Washington in Seattle and his colleagues achieved a landmark feat: they used computational tools to design an entirely new protein from scratch. ‘Top7’ folded as predicted, but it was inert: it performed no meaningful biological functions. Today, de novo protein design has matured into a practical tool for generating made-to-order enzymes and other proteins. “It’s hugely empowering,” says Neil King, a biochemist at the University of Washington who collaborates with Baker’s team to design protein-based vaccines and vehicles for drug delivery. “Things that were impossible a year and a half ago — now you just do it.”
Much of that progress comes down to increasingly massive data sets that link protein sequence to structure. But sophisticated methods of deep learning, a form of artificial intelligence (AI), have also been essential.
‘Sequence based’ strategies use the large language models (LLMs) that power tools such as the chatbot ChatGPT (see ‘ChatGPT? Maybe next year’). By treating protein sequences like documents comprising polypeptide ‘words’, these algorithms can discern the patterns that underlie the architectural playbook of real-world proteins. “They really learn the hidden grammar,” says Noelia Ferruz, a protein biochemist at the Molecular Biology Institute of Barcelona, Spain. In 2022, her team developed an algorithm called ProtGPT2 that consistently comes up with synthetic proteins that fold stably when produced in the laboratory. Another tool co-developed by Ferruz, called ZymCTRL, draws on sequence and functional data to design members of naturally occurring enzyme families.
Sequence-based approaches can build on and adapt existing protein features to form new frameworks, but they’re less effective for the bespoke design of structural elements or features, such as the ability to bind specific targets in a predictable fashion. ‘Structure based’ approaches are better for this, and 2023 saw notable progress in this type of protein-design algorithm, too. Some of the most sophisticated of these use ‘diffusion’ models, which also underlie image-generating tools such as DALL-E. These algorithms are initially trained to remove computer-generated noise from large numbers of real structures; by learning to discriminate realistic structural elements from noise, they gain the ability to form biologically plausible, user-defined structures.
The RFdiffusion software developed by Baker’s lab and the Chroma tool by Generate Biomedicines in Somerville, Massachusetts, exploit this strategy to remarkable effect. For example, Baker’s team is using RFdiffusion to engineer novel proteins that can form snug interfaces with targets of interest, yielding designs that “just conform perfectly to the surface,” Baker says. A newer ‘all atom’ iteration of RFdiffusion allows designers to computationally shape proteins around non-protein targets such as DNA, small molecules and even metal ions. The resulting versatility opens new horizons for engineered enzymes, transcriptional regulators, functional biomaterials and more.
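For readers curious about the denoising idea behind diffusion models described above, here is a minimal sketch in Python. It is not the RFdiffusion or Chroma implementation. It assumes a generic DDPM-style linear noise schedule, and the function names (`alpha_bar`, `add_noise`, `remove_noise`), the schedule parameters, and the three-point toy "backbone" are all hypothetical choices made for illustration. The sketch blends clean coordinates with Gaussian noise and shows that, if the noise is predicted exactly, the clean coordinates can be recovered. A real model would replace the known noise with a neural network's prediction.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after t steps of a linear noise schedule."""
    product = 1.0
    for step in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (step - 1) / (num_steps - 1)
        product *= 1.0 - beta
    return product


def add_noise(coords, t, num_steps, rng):
    """Forward process: blend clean coordinates with Gaussian noise at step t."""
    a = alpha_bar(t, num_steps)
    noise = [[rng.gauss(0.0, 1.0) for _ in atom] for atom in coords]
    noisy = [
        [math.sqrt(a) * c + math.sqrt(1.0 - a) * n for c, n in zip(atom, eps)]
        for atom, eps in zip(coords, noise)
    ]
    return noisy, noise


def remove_noise(noisy, predicted_noise, t, num_steps):
    """Invert the blend, given a prediction of the noise that was added."""
    a = alpha_bar(t, num_steps)
    return [
        [(x - math.sqrt(1.0 - a) * n) / math.sqrt(a) for x, n in zip(atom, eps)]
        for atom, eps in zip(noisy, predicted_noise)
    ]


# Toy "structure": three points spaced 3.8 units apart along a line (illustrative only).
backbone = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]]
rng = random.Random(0)

noisy, noise = add_noise(backbone, t=500, num_steps=1000, rng=rng)

# A trained network would estimate `noise` from `noisy`; here the true noise
# stands in for a perfect prediction.
recovered = remove_noise(noisy, noise, t=500, num_steps=1000)
```

In this toy setup the training signal is the added noise itself: a network that learns to predict it from noisy inputs can be run step by step, starting from pure noise, to produce new structures.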