When it came to the substantial number that were unknown, the team conducted one more study, using the organism that is best understood at the genetic level: Drosophila melanogaster. These fruit flies have been the subject of research for more than a century because they are easy and inexpensive to breed, have a short life cycle, produce lots of young, and can be genetically modified in numerous ways.
The team used gene editing to dial down the use of around 300 low-scoring genes found in both humans and fruit flies. “We found that one-quarter of these unknown genes were lethal—when knocked out, they caused the flies to die, and yet nobody had ever known anything about them,” says Freeman. “Another 25 percent of them caused changes in the flies—phenotypes—that we could detect in many ways.” These genes were linked with fertility, development, locomotion, protein quality control, and resilience to stress. “That so many fundamental genes are not understood was eye-opening,” Freeman says. It’s possible that variation in these genes could have very big impacts on human health.
All of this “unknomics” information is held in a database, which the team is making available for other researchers to use to discover new biology. The next step may be to hand the data on these mystery genes, and on the mystery proteins they create, over to AI.
DeepMind’s AlphaFold, for example, can provide important insights into what mystery proteins do, notably by revealing how they interact with other proteins, says Alex Bateman of the European Bioinformatics Institute, based near Cambridge, UK. So can cryo-EM, which is a way of producing images of large, complex molecules, he says. And a University College London team has shown a systematic way to use machine learning to figure out what proteins do in yeast.
The Unknome is unusual in that it’s a biology database that will shrink as we understand it better. The paper shows that over the past decade “we have moved from 40 percent to 20 percent of the human proteome having a certain level of unknownness,” says Bateman. However, at the current rate of progress, working out the function of all human protein-coding genes could take more than half a century, Freeman estimates.
The discovery that so many genes remain poorly understood reflects what is called the streetlight effect, or the drunkard’s search principle, an observational bias that occurs when people search for something only where it is easiest to look. In this case, it has caused what Freeman and Munro call a “bias in biological research toward the previously studied.”
The same goes for funding: researchers tend to win support for work in relatively well-understood areas, rather than for going off into what Freeman calls the wilderness. This is why the database is so important, Munro explains: it fights back against the economics of academia, which steer researchers away from things that are very poorly understood. “There is a need for a different type of support to address these unknowns,” says Munro.
But even with the database becoming available and researchers picking through it, there will still be some knowledge blind spots. The study focused only on genes that code for proteins. Over the past two decades, uncharted areas of the genome have also been found to harbor the code for small RNAs, scraps of genetic material that can affect other genes and that are critical regulators of normal development and bodily functions. There may be more “unknown unknowns” lurking in the human genome.
For now, there’s still plenty to get into, and Freeman hopes this work will encourage others to study the genetic Terra Incognita: “There’s more than enough Unknome for anyone who wants to explore genuinely new biology.”