I've been planning to write this post for some time, and today's news provides the perfect opportunity: the Nobel Committee has just awarded a share of the Nobel Prize in Chemistry to John Jumper and Demis Hassabis, the creators of AlphaFold at Google DeepMind.
AlphaFold and the quest to determine protein structures have been close to my heart since I began my journey into biology with DeepTrait, where we develop a product for analyzing genetic data. Life on Earth is built from proteins. Proteins are synthesized from RNA, which in turn is transcribed from DNA. The DeepTrait product identifies which genetic mechanisms, or variations in them, affect specific phenotypes. Naturally, understanding how those variations influence protein structures was extremely interesting and potentially valuable for our users.
Despite this interest, we decided not to incorporate AlphaFold into our product. Here's why.
AlphaFold
(This text intentionally simplifies biological processes—they are important for understanding the issue, but only at a basic level.)
All the characteristics of an organism—including talents and flaws, diseases, their development, and their effects—result from protein interactions. At the turn of the 20th century, Dr. Paul Ehrlich proposed that chemical compounds could precisely influence physiological processes in the human body. This idea, which he called the "magic bullet," is the foundation of modern pharmacology. Nearly all drugs we know today interact with specific proteins—therapeutic targets—either enhancing or inhibiting their function.
According to various estimates, our bodies contain over two million proteins. We know the functions of only a small fraction of them. How can we find out which proteins we need to "target" to cure a specific disease?
Proteins are built from our DNA code. Molecular machinery reads specific regions of DNA that code for proteins—genes—and carries this code, in the form of RNA molecules, to ribosomes. The ribosomes then assemble a chain of amino acids, which folds into the required shape to become a protein.
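To make that pipeline concrete, here is a minimal Python sketch of the DNA-to-RNA-to-protein steps. The gene sequence and the few-entry codon table are purely illustrative (the real genetic code has 64 codons, and real genes are thousands of bases long).

```python
# A minimal sketch of the DNA -> RNA -> protein pipeline described above.
# The gene and the tiny codon table are illustrative, not the full genetic code.

CODON_TABLE = {  # a few entries of the standard genetic code (RNA codons)
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "GAU": "Asp", "UGC": "Cys", "UAA": "STOP",
}

def transcribe(dna: str) -> str:
    """Transcription: the coding DNA sequence is copied into mRNA (T -> U)."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> list[str]:
    """Translation: the ribosome reads codons (triplets) and emits amino acids."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "???")
        if residue == "STOP":
            break
        protein.append(residue)
    return protein

gene = "ATGTTTGGCAAAGATTGCTAA"      # hypothetical coding sequence
mrna = transcribe(gene)             # "AUGUUUGGCAAAGAUUGCUAA"
print(translate(mrna))              # ['Met', 'Phe', 'Gly', 'Lys', 'Asp', 'Cys']
```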
Proteins differ from inorganic molecules primarily in size. While a typical inorganic molecule may consist of tens of atoms, proteins can contain over a million. This structural complexity is what allows proteins to perform far more sophisticated functions, and a protein's function is largely determined by its shape.
Today, sequencing human DNA costs about 300 US dollars, and the price continues to decrease. We have vast amounts of genetic information, and we know the amino acid sequences of virtually every protein in our bodies. But how can we determine their shapes?
AlphaFold offers a brilliant solution. DeepMind found a way to represent the shape of a protein in a form that a neural network could predict. They trained a machine learning model to predict elements of a protein's shape from its amino acid sequence.
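To give a feel for what "a form a neural network could predict" can mean, here is a minimal sketch. It is not AlphaFold's actual representation or architecture; it only shows the general idea that a 3D structure can be recast as a pairwise distance matrix, a fixed array that a model could learn to output from the amino acid sequence. The sequence and coordinates below are made up.

```python
# Illustrative only: recasting a 3D structure as an L x L distance matrix,
# i.e. an array-shaped target a neural network could be trained to predict.
import numpy as np

sequence = "MFGKDC"                      # amino acid sequence (model input)
coords = np.array([                      # hypothetical C-alpha coordinates, one per residue
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.5, 0.0],
    [10.9, 2.1, 0.3],
    [13.2, 5.0, 1.1],
    [14.0, 8.7, 1.9],
])

# Training target: distance between every pair of residues (shape L x L).
dist_matrix = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Input features: one-hot encoding of the sequence (shape L x 20).
AA = "ACDEFGHIKLMNPQRSTVWY"
features = np.eye(len(AA))[[AA.index(a) for a in sequence]]

# A model f would then be trained so that f(features) approximates dist_matrix;
# AlphaFold's real representation and network are far more elaborate.
print(features.shape, dist_matrix.shape)   # (6, 20) (6, 6)
```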
The first version of AlphaFold was published in 2020, and DeepMind along with Google were not too modest. In their blog, DeepMind called AlphaFold "the solution to a 50-year-old grand challenge." John Moult from the University of Maryland said that "in some sense, the protein folding problem is solved." And Andrei Lupas, an evolutionary biologist at the Max Planck Institute for Developmental Biology in Tübingen, Germany, stated: "This will change medicine. It will change research. It will change bioengineering. It will change everything."
In 2021, Google and DeepMind used AlphaFold to predict and publicly release some 365,000 protein structures, covering the human proteome and the proteomes of 20 other model organisms.
The Problem
The main issue with AlphaFold, which is extremely difficult to fix, lies in the limitations of the data itself.
Any machine learning model requires data for training. In the case of protein structures, this data is very expensive.
Historically, protein shapes were determined using X-ray crystallography. Proteins were first crystallized, and then X-rays were passed through the crystal. By analyzing the diffraction patterns, scientists could determine the structures of the proteins—the components of the crystal.
Crystallography was an exceedingly complex and costly method. Venki Ramakrishnan, a Nobel laureate in Chemistry for determining the structure of the ribosome, described this in detail in his book Gene Machine. Anything in the experiment could go wrong and negate all the work: the protein might not be pure enough, crystallization could fail, and the X-ray machines needed for diffraction were extremely expensive and scarce, accessible only to a small group of scientists and requiring meticulous advance planning. Missing a slot on the machine meant waiting many months for the next opportunity.
Today, we have cryo-electron microscopy—a more accurate but even more expensive method for determining protein structures.
Determining a protein's structure has been and remains an extremely lengthy and expensive process. At the cost of tremendous effort, humanity has managed to determine the structures of approximately 170,000 proteins—the data on which DeepMind trained AlphaFold.
By their nature, machine learning models excel at approximating values within the range of their training data (interpolation) but perform poorly outside that range (extrapolation). Therefore, to train a high-quality model, we need training data drawn from the same distribution as the data on which the model will operate (the familiar independent-and-identically-distributed assumption). In other words, if you want to train a model to identify types of vehicles from photographs, your training set needs to include sedans, trucks, bicycles, and sports cars—ideally, in proportions reflecting their actual presence on real roads.
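A toy illustration of that limitation, with synthetic data and an off-the-shelf regressor (almost any model shows the same effect): trained on inputs from a narrow range, it does well inside that range and fails far outside it.

```python
# A toy sketch of interpolation vs. extrapolation on synthetic data.
# The model fits well inside the range it was trained on and fails far
# outside it; tree-based models in particular flatten out beyond that range.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def true_fn(x):
    return np.sin(x) + 0.1 * x                   # the "real" relationship

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 3, size=(500, 1))       # training inputs cover only [0, 3]
y_train = true_fn(X_train.ravel())

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print(model.predict(np.array([[1.5]])), true_fn(1.5))  # inside the range: close to the truth
print(model.predict(np.array([[8.0]])), true_fn(8.0))  # outside the range: far off
```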
Determining the structure of a single protein costs between $50,000 and $250,000 and can take from six months to five years or more. With such costs, which proteins become priorities for biological efforts and budgets?
Naturally, we prioritize determining the structures of proteins associated with specific diseases. Proteins interact in chains of reactions—pathways—and, in studying a disease, we focus on the proteins that interact in its formation or progression. We do not determine the structure of "every twentieth protein, picked at random from all known proteins." That would provide an ideal dataset for training a machine learning model, but we don't have it. Instead, we have a dataset of thoroughly studied proteins from clusters related to certain diseases, and virtually nothing about the vast majority of other organism characteristics (phenotypes).
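Here is a minimal synthetic sketch of why that composition also matters for evaluation: when "proteins" cluster by pathway, a random train/test split leaks near-duplicates into the test set and looks impressive, while holding out entire pathways (the realistic "novel protein" scenario) reveals how little generalizes. The data and the pathway grouping below are invented for illustration.

```python
# Synthetic sketch: cluster-structured data makes random splits look optimistic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
n_pathways, per_pathway, n_features = 30, 20, 8

centers = rng.normal(size=(n_pathways, n_features))   # one feature profile per pathway
effects = rng.normal(size=n_pathways)                  # pathway-specific target value
groups = np.repeat(np.arange(n_pathways), per_pathway)
X = centers[groups] + 0.05 * rng.normal(size=(len(groups), n_features))
y = effects[groups] + 0.05 * rng.normal(size=len(groups))

def r2(train_idx, test_idx):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    return model.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])

# Random split: every test "protein" has close relatives in the training set.
tr, te = train_test_split(np.arange(len(y)), test_size=0.3, random_state=0)
print("random split R^2: ", round(r2(tr, te), 2))      # near 1.0, looks solved

# Pathway-held-out split: whole pathways are unseen, as for truly novel proteins.
tr, te = next(GroupShuffleSplit(test_size=0.3, random_state=0).split(X, y, groups))
print("pathway split R^2:", round(r2(tr, te), 2))      # near zero or negative
```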
Could it happen that, purely by chance, data from this very unbalanced dataset would be sufficient to determine the structure of any other protein in any organism? The probability of that is close to zero.
Did the authors of AlphaFold know this during the planning stage of their experiment? Of course they did; their expertise suggests they couldn't have overlooked such an important point.
Did they actually manage to develop a model capable of predicting the structures of unknown proteins? No, they did not.
Today, numerous publications demonstrate that AlphaFold's predictions can serve at most as hypotheses for determining protein structures and cannot replace experimental validation. AlphaFold can more or less accurately predict the structures of proteins from pathways represented in the training set but cannot make accurate predictions for proteins from pathways about which we know nothing.
More importantly, AlphaFold is unable to predict the effect of single-nucleotide polymorphisms (SNPs) on protein structure. SNPs are the most common type of mutation that alters protein structures and leads to genetic diseases.
Is it possible to fix these shortcomings algorithmically? No, it's not. The source of these issues is the nature of the training data, and without collecting an additional dataset of random proteins from random pathways, it's impossible to correct them.
Did this stop Google's marketing machine from promoting AlphaFold as a universal and functional solution? Not at all. Moreover, the story of AlphaFold shows that when return on investment requires convincing the general public of a technology's omnipotence, small scientific groups genuinely interested in seeking the truth have little chance of being heard.
And this is the main problem with modern AI.
Conclusions
This post in no way aims to diminish the achievements of the AlphaFold authors—it is a highly original model with a brilliant representation of protein structures, one that makes them amenable to computation and machine learning.
I have attempted to portray the story of AlphaFold in context and to draw attention to the work of other scientists who lack access to the marketing machinery of large corporations but whose voices and results should not be forgotten.
AlphaFold is just one example, but a symptomatic one. This is how the information bubble works, and since we—AI engineers, entrepreneurs, and enthusiasts—have found ourselves within such a bubble, we should be mindful that in bubbles, things are rarely as they seem at first glance.