Berndt Lab Machine Learning Model Sets a New Pace in Protein Sensor Design

Two scientists in lab coats sitting in front of monitors in a lab
Sarah Wait and Dr. Andre Berndt

In the human brain, billions of neurons transmit unfathomable amounts of information across an elaborate signaling network that allows us to perform all of the conscious and unconscious functions that are essential for everyday life. How this system works, and what goes wrong when it fails, is a question of great intrigue and importance to scientists and physicians who want to solve health challenges that impact the brain.

While the idea of brain mapping may bring to mind images of wired electrodes attached to a patient’s head as neuronal activity flickers on a monitor, scientists have another tool capable of revealing the secrets of the brain: genetically-encoded protein-based fluorescent indicators (GEFIs) – or protein sensors, for short.

Andre Berndt, PhD, an assistant professor of Bioengineering and faculty member in the Institute for Stem Cell and Regenerative Medicine (ISCRM), leads a research lab determined to push the boundaries of protein-based sensor technology. For Berndt, that means creating a real-time view of neural patterns down to the single-cell level, like a traffic cam with the power to track the live movement of individual cars.

Protein sensors are programmed to flash in response to certain biological activities, for example, a change in levels of a particular chemical, like calcium. For that reason, protein sensors must be engineered to perform specific tasks, which requires an ability to predict the necessary structure of the protein, a job that scientists like Berndt are naturally assigning to powerful computers.

“The reality is that humans don’t have the mental capacity to analyze the complexity of the sequence, structure, and function of proteins, but computers are really good at making correlations between these properties,” says Berndt. “Right now, though, a lot of the current research offers frameworks that are largely theoretical and that don’t significantly improve on what humans have done.”

That leaves scientists to ask whether computers can be trained to reliably guide the design of protein sensors with practical applications. To answer this question, Sarah Wait, a graduate student in the Berndt Lab, designed a machine learning model and put it to the test. What she and Berndt found went beyond the theoretical. Over the course of a multiyear, collaborative investigation, they showed that their trained machine learning models correctly predicted several variants of the calcium indicator GCaMP with record-setting speed and accuracy, outperforming all previous generations of these sensors. The breakthrough marks a significant leap in functional protein engineering and accelerates the development of high-performing optogenetic sensors, with broad applications across diverse protein engineering challenges.

“The impact of this study will be huge,” Berndt said. “It’s going to shift protein engineering away from the hit-and-miss approaches and spur people to invest more into machine learning and other computational approaches.”

The study’s results were published in the journal Nature Computational Science with Wait as the first author and ISCRM faculty members Mike Regnier, PhD, David Baker, PhD, and Farid Moussavi-Harami as co-authors of the paper.

Broad Applications for Record-Setting Machine Learning Models

Magnified image of glowing neurons
The image at left depicts rat neurons expressing GCaMP protein in a low-calcium condition; the image at right shows the same GCaMP-expressing neurons in a high-calcium condition.

Of course, even computers would have a hard time matching the structure of a protein to a specialized task without large amounts of data to crunch first. To give the computers something to sink their teeth into – to learn from – Berndt and Wait trained three algorithms on the amino acid sequences of more than 1,000 versions of GCaMP whose properties were known. Some sequences were known to be more efficient and some less. This allowed the algorithms to develop statistical models that could identify sequences that were likely to be more efficient. Other than the sequences of the different versions of the protein, no other details, such as information about the structure of a protein, were provided to the algorithms.

Then the trained algorithms analyzed the sequences of 1,423 versions of GCaMP whose properties were unknown, and predicted which would likely be the most efficient. Specifically, Berndt and Wait were seeking candidate GCaMP proteins that were able to fluoresce brightly when exposed to calcium but then quickly dim and be ready to fluoresce again. These properties are ideal because they allow GCaMP to more accurately reveal the activity of rapidly firing neurons.
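For readers curious what "train on known sequences, then rank unknown ones" can look like in practice, here is a minimal sketch in Python. It is not the model used in the study, which this article does not describe in detail. It assumes a deliberately simple stand-in: an additive model in which each amino acid at each position nudges a single performance score up or down. The function names, the four-letter toy sequences, and the scores are all hypothetical and invented purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_additive_model(sequences, scores):
    """Estimate a baseline score plus one effect per (position, residue) pair.

    Assumption: all sequences are aligned and of equal length.
    """
    baseline = mean(scores)
    residuals = defaultdict(list)
    for seq, score in zip(sequences, scores):
        for pos, residue in enumerate(seq):
            # Credit each residue with how far this variant sits from the baseline.
            residuals[(pos, residue)].append(score - baseline)
    effects = {key: mean(values) for key, values in residuals.items()}
    return baseline, effects


def predict(model, seq):
    """Predict a score for a sequence; residues unseen in training contribute 0."""
    baseline, effects = model
    return baseline + sum(effects.get((pos, r), 0.0) for pos, r in enumerate(seq))


def rank_candidates(model, candidates):
    """Sort candidate sequences from highest to lowest predicted score."""
    return sorted(candidates, key=lambda s: predict(model, s), reverse=True)


# Toy training set (invented): sequences with known scores.
train_seqs = ["MKVA", "MKVG", "MRVA", "MRVG"]
train_scores = [1.0, 2.0, 3.0, 4.0]
model = fit_additive_model(train_seqs, train_scores)

# Sequences only, no structural information, mirroring the setup described above.
candidates = ["MKVA", "MRVG", "MRVA"]
ranked = rank_candidates(model, candidates)
print(ranked)
```

In this toy example the model learns that the residues R at the second position and G at the fourth are associated with higher scores, so the candidate carrying both ranks first. A real pipeline would likely use richer sequence encodings and several properties at once, such as brightness and how quickly the signal dims.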

“We trained our machine learning models on data about existing GCaMP variants,” explains Wait. “Then we took the mutations the models predicted and tested them downstream. Through that process, we identified several variants that made the sensor function the way we wanted it to function. And we were able to apply the model to other sensors and significantly speed up our engineering approach.”

The machine learning algorithms identified three promising versions of GCaMP. Subsequent laboratory testing found that all three were brighter and faster than any previously reported GCaMP proteins. One variant, called eGCaMP2+, was twice as bright as the versions currently considered state of the art.

Berndt adds that while the focus of this proof-of-concept paper was on calcium activity in the brain, the machine learning model – and protein sensor technology in general – are relevant to other areas of research. “We want to apply this method to other proteins that we’re developing, including opioid, hormone, and oxidative stress sensors, and our ISCRM colleagues [in the Regnier and Moussavi-Harami labs] are using this technology in the context of genetic mutations that cause dysfunction in the heart.”

Acknowledgement