We are also working at the intersection of proteomics and deep learning.
Proteins are responsible for most functions in our bodies. They are synthesized as extended chains of amino acids that fold into a three-dimensional (3D) structure, and that structure governs their function. Determining a protein's 3D structure is key to understanding how it works, why it can cause disease, and how researchers can design drugs to block or activate its function. Predicting a protein's atomic structure from its amino acid sequence has stood as a grand challenge in biology for 50 years. Current experimental approaches for determining the 3D structure of proteins are too costly to keep pace with the wealth of protein sequences that genome-sequencing projects generate; in fact, the rate at which new protein sequences are discovered outpaces the rate of experimental structure determination by orders of magnitude. Thus, a computational approach that bridges the sequence-structure knowledge gap would significantly impact bioinformatics, biology, and medicine.
Two major developments are fueling hopes that a computational solution may be within reach:
(1) the Protein Data Bank, which stores all experimentally solved structures, has grown large enough to train deep learning algorithms, and
(2) the combination of deep learning algorithms and computing power can now extract information from very large datasets. The impressive performance of AlphaFold, developed by DeepMind (Google's artificial intelligence research company), in the independent, community-wide Critical Assessment of Protein Structure Prediction (CASP) experiments, which were established to assess and advance structure prediction, provided a strong indication that a solution is near.
Efficient deep learning methods can perform structural analysis tasks at large scale, ranging from classifying experimentally determined protein structures to assessing the quality of, and ranking, computationally generated protein models in the context of protein structure prediction. Yet the literature on these methods rarely interprets what the models learn during training or identifies the specific data attributes that drive the classification or regression task.
While 3D and 2D CNNs have been widely used for structural data, they have several limitations when applied to structural proteomics data. Graph-based convolutional neural networks (GCNNs) are an efficient and interpretable alternative: they learn effectively from simple graph representations of protein structures while making it possible to examine what the network learns during training and how it applies that knowledge to its task.
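The graph representation itself can be quite lightweight. As a minimal, hypothetical sketch (not the group's actual pipeline), the snippet below treats each residue as a node and adds an edge between two residues whose C-alpha atoms lie within an assumed 8 Å cutoff; the coordinates, the cutoff value, and the helper name `residue_contact_graph` are illustrative assumptions.

```python
# Minimal sketch: encode a protein structure as a residue-contact graph.
# Assumptions: nodes are residues, edges connect C-alpha pairs within a cutoff.
import numpy as np

def residue_contact_graph(ca_coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """Build a binary adjacency matrix from C-alpha coordinates.

    ca_coords: (N, 3) array of C-alpha positions, one row per residue.
    cutoff:    contact distance threshold in angstroms (8 Å is a common choice).
    Returns an (N, N) adjacency matrix with self-edges removed.
    """
    # Pairwise Euclidean distances between all residues.
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adjacency = (dist < cutoff).astype(np.float32)
    np.fill_diagonal(adjacency, 0.0)  # no self-edges
    return adjacency

# Toy usage with a made-up 4-residue "structure":
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [30.0, 0.0, 0.0]])
print(residue_contact_graph(coords))
```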
GCNNs match their 2D CNN counterparts in predictive performance, although 2D CNNs train faster. The graph-based data representation also makes GCNNs a more efficient option than 3D CNNs for large-scale datasets, since preprocessing costs and data-storage requirements are negligible by comparison.
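To make the GCNN operation concrete, here is a minimal, self-contained sketch of a single graph-convolution layer using the widely used symmetric-normalization propagation rule of Kipf and Welling (2017); the toy adjacency, feature dimensions, and random weights are placeholders and do not reflect the group's trained models.

```python
# Minimal sketch of one graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
# The adjacency, features, and weights below are illustrative placeholders.
import numpy as np

def gcn_layer(adjacency: np.ndarray, features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    n = adjacency.shape[0]
    a_hat = adjacency + np.eye(n)                   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # symmetric degree normalization
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)  # ReLU activation

# Toy usage: 4 residues, 20-dimensional node features (e.g. one-hot amino-acid type),
# projected to 8 hidden dimensions.
rng = np.random.default_rng(0)
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 0],
                      [0, 0, 0, 0]], dtype=np.float32)
features = rng.standard_normal((4, 20))
weights = rng.standard_normal((20, 8))
print(gcn_layer(adjacency, features, weights).shape)  # (4, 8)
```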