Our group works at the intersection of healthcare, big data, and artificial intelligence


To help further research and improve precision care, the Department of Veterans Affairs (VA) and the Department of Energy (DOE) have partnered to bring together VA’s healthcare and genomic data with DOE’s high-performance computing (HPC) resources and expertise. The vast VA data include electronic health records (EHRs) for almost 24 million patients spanning 22 years. The EHR combines multiple data types, including genomics, structured data such as demographics and diagnoses, and unstructured data such as physicians’ and nurses’ notes. So far, the structured data has been used most often for predicting outcomes, due in part to the complexity of dealing with the unstructured data, which comprise roughly 4.65 trillion documents in the VA dataset. However, studies have shown that models built on structured data alone achieve limited predictive performance. We are developing AI models that integrate structured, unstructured, and geospatial data to improve VA’s ability to identify patients at risk of suicide, overdose, or complications from obstructive sleep apnea, and to assess response to lung cancer treatment.

We are developing Natural Language Processing methods


Unstructured notes are rich in detail and capture information such as significant patient stressors (e.g., housing instability), symptoms, and medical history more completely than structured fields. We are developing Natural Language Processing (NLP) methods to extract this embedded patient information from the unstructured data for use within a framework that integrates all data types to predict health outcomes. Our work focuses on helping VA stakeholders improve healthcare outcomes in areas such as suicide and overdose risk, sleep apnea, and lung cancer, and the cross-disciplinary approach can be extended to the prediction of other diseases. However, extracting meaningful information from trillions of documents poses many challenges: the data are vast, highly heterogeneous, imbalanced, and noisy. [Read more]

A UMAP projection of GPT embeddings for mentions of firearms (ACCESS2_MEANS) in the VA corpus (200,000 snippets from 120,000 patients). Each dot is an extracted span of text anchored around firearm vocabulary. The color gradient indicates the likelihood that the mention belongs to the clinical text of a veteran whose underlying cause of death, per the National Death Index (NDI), is suicide/overdose (red). Probabilities were computed over the 500 nearest neighbors. Clustering is apparent for mentions more commonly found in patients with a recorded completed suicide/overdose. Nearest-neighbor classification can be used to score weakly labeled mentions and filter them by specificity to suicide outcomes.
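The nearest-neighbor scoring described in the caption can be sketched in a few lines. The snippet below is only an illustration of the idea, not our production pipeline: the synthetic arrays, variable names, and plotting choices are assumptions, and it presumes that snippet embeddings and NDI-derived outcome labels are already in hand.

```python
# Minimal sketch: project snippet embeddings with UMAP and score each mention by the
# fraction of suicide/overdose outcomes among its nearest neighbors. `embeddings` and
# `outcome` are placeholder stand-ins for precomputed GPT embeddings and NDI labels.
import numpy as np
import umap                                    # pip install umap-learn
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 768))      # stand-in for GPT snippet embeddings
outcome = rng.integers(0, 2, size=5000)        # stand-in for suicide/overdose labels

# k-NN outcome probability in the embedding space (k = 500 in the figure)
k = 500
nn = NearestNeighbors(n_neighbors=k).fit(embeddings)
_, idx = nn.kneighbors(embeddings)
p_outcome = outcome[idx].mean(axis=1)          # fraction of positive neighbors per snippet

# 2D UMAP projection for visualization, colored by the neighbor-based probability
coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=p_outcome, cmap="coolwarm", s=2)
plt.colorbar(label="P(suicide/overdose) among 500 nearest neighbors")
plt.title("UMAP of firearm-mention embeddings (illustrative)")
plt.show()
```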


We are developing geospatial methods


Integrating medical records and genomic data is not enough. According to a 2016 study published in the American Journal of Preventive Medicine, access to medical care accounts for only 20 percent of the contributors to healthy outcomes; socioeconomic and environmental factors of the community are far more important, contributing 50 percent (Hood, C. M., K. P. Gennuso, G. R. Swain, and B. B. Catlin. 2016. County health rankings: Relationships between determinant factors and health outcomes. American Journal of Preventive Medicine 50(2):129-135. https://doi.org/10.1016/j.amepre.2015.08.024). The challenge is to understand how hundreds of Social Determinants of Health (SDoH) variables affect individual health outcomes and how those effects vary across the areas where people live. These variables are typically spread across large datasets provided by multiple sources. [Read more]
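As a purely illustrative sketch of what linking area-level SDoH variables to patient records can look like, the snippet below joins a hypothetical county-level SDoH table to patient data by FIPS code. The file names, column names, and the income variable are placeholders, not our actual data sources.

```python
# Illustrative sketch only: attach county-level SDoH variables to patient records by
# county FIPS code. Real sources (e.g., County Health Rankings, Census/ACS) each have
# their own schemas; everything named here is hypothetical.
import pandas as pd

patients = pd.read_csv("patients.csv", dtype={"county_fips": str})   # one row per patient
sdoh = pd.read_csv("county_sdoh.csv", dtype={"county_fips": str})    # one row per county,
                                                                      # many SDoH columns

# Attach every county-level SDoH variable to each patient living in that county
linked = patients.merge(sdoh, on="county_fips", how="left", validate="many_to_one")

# Simple completeness check before modeling: how many patients lack SDoH coverage?
missing = linked["median_household_income"].isna().mean()
print(f"Patients without matched county SDoH data: {missing:.1%}")
```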

We are also working at the intersection of proteomics and deep learning


Proteins are responsible for most functions in our body. They are synthesized as extended chains of amino acids that fold into three-dimensional (3D) structures governing their function. Determining proteins’ 3D structures is key to understanding how they work, why they cause disease, and how researchers can design drugs to block or activate their functions. Predicting a protein’s atomic structure from its amino acid sequence has been a grand challenge in biology for 50 years. Current experimental approaches to discern the 3D structure of proteins are too costly to keep pace with the wealth of protein sequences that genome-sequencing projects generate. In fact, the rate at which new protein sequences are discovered outpaces the rate of experimental structure determination by orders of magnitude. Thus, establishing a computational approach that bridges the sequence-structure knowledge gap will significantly impact bioinformatics, biology, and medicine.

Two major developments are fueling hopes that a computational solution may be within reach: 

(1) the Protein Data Bank, which stores all experimentally solved structures, has grown large enough to train deep learning algorithms, and

(2) the combination of deep learning algorithms and computing power can now extract information from very large datasets. The impressive performance of AlphaFold, developed by DeepMind (the artificial intelligence company associated with Google), in the independent community-wide Critical Assessment of Protein Structure Prediction (CASP) experiments, established to assess and advance structure prediction, provided strong evidence that a solution is within reach.


Efficient deep learning methods perform structural analysis tasks at large scale, ranging from the classification of experimentally determined proteins to the quality assessment and ranking of computationally generated protein models in the context of protein structure prediction. Yet the literature describing these methods rarely interprets what the models have learned during training or identifies the specific data attributes that contribute to the classification or regression task.

While 3D and 2D CNNs have been widely used for structural data, they have several limitations when applied to structural proteomics data. Graph-based convolutional neural networks (GCNNs) are an efficient and interpretable alternative: they learn effectively from simple graph representations of protein structures while making it possible to examine what the network learns during training and how it applies that knowledge to perform its task.
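To make the graph representation concrete, here is a minimal sketch assuming residues as nodes and C-alpha contacts within a fixed distance cutoff as edges, followed by a single Kipf-and-Welling-style graph-convolution layer. The cutoff, feature dimensions, and random inputs are placeholders, not our actual model.

```python
# Minimal sketch of the idea, not our published model: represent a protein as a residue
# graph and apply one basic graph-convolution layer (normalized A_hat, then X W).
import numpy as np
import torch
import torch.nn as nn

def contact_adjacency(ca_coords: np.ndarray, cutoff: float = 8.0) -> torch.Tensor:
    """Symmetric, self-loop-augmented, degree-normalized adjacency from C-alpha coordinates."""
    dist = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    a = (dist < cutoff).astype(np.float32)          # contact map; diagonal gives self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_hat = deg_inv_sqrt[:, None] * a * deg_inv_sqrt[None, :]
    return torch.from_numpy(a_hat)

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, a_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Linear transform of node features, then neighborhood aggregation via A_hat
        return torch.relu(a_hat @ self.linear(x))

# Toy protein: 120 residues with 20-dimensional node features (random placeholders)
coords = np.random.rand(120, 3) * 30.0
features = torch.randn(120, 20)
a_hat = contact_adjacency(coords)
hidden = GCNLayer(20, 64)(a_hat, features)
graph_embedding = hidden.mean(dim=0)                # pooled representation for scoring/ranking
print(graph_embedding.shape)                        # torch.Size([64])
```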

GCNNs match the predictive performance of their 2D CNN counterparts, although the 2D CNNs train faster. The graph-based data representation makes GCNNs a more efficient option than 3D CNNs when working with large-scale datasets, since the preprocessing costs and data storage requirements are negligible in comparison. [Read more]