Natural Language Processing

Natural Language Processing (NLP) at a glance

Neural language modeling was established as early as 2001, with use cases in speech recognition, text auto-completion, information retrieval, and more. Simpler models such as Word2Vec, GloVe and FastText, which generalize one-hot encoding into dense vector embeddings whose distances capture word similarity, are some of the first renowned language models and are still widely used.
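
The idea of measuring word similarity as a distance between embedding vectors can be sketched in a few lines. This is only an illustration: the three-dimensional vectors below are invented for the example and are not taken from Word2Vec, GloVe, FastText or any trained model, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, invented for illustration only.
toy_embeddings = {
    "insomnia": [0.9, 0.1, 0.2],
    "sleeplessness": [0.8, 0.2, 0.3],
    "housing": [0.1, 0.9, 0.4],
}


def most_similar(word, embeddings):
    """Return the other word whose vector has the highest cosine similarity to `word`."""
    target = embeddings[word]
    scored = [
        (cosine_similarity(target, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    return max(scored)[1]
```

With these toy vectors, "insomnia" sits much closer to "sleeplessness" than to "housing", which is the kind of relationship a trained embedding model is expected to learn from text.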

As NLP has progressed, more sophisticated models have been developed to better analyze relations in language and even to perform prediction tasks. These range from recurrent neural networks (RNN) and long short-term memory (LSTM) networks to transformers, which are deemed 'the most powerful architecture' of today. In-depth historical analyses of these NLP tools are available in the literature (Wysocki, O., Florea, M., Landers, D., & Freitas, A. (2021). Architectures of Meaning, A Systematic Corpus Analysis of NLP Systems. arXiv preprint arXiv:2107.08124).

Since their debut in 2017, transformers have become the center of state-of-the-art language modeling, as present research makes clear. Sophisticated language models built on transformer-based architectures include Bidirectional Encoder Representations from Transformers (BERT), XLNet and Generative Pre-trained Transformer (GPT).

While these models are powerful and can extract information that was once out of reach, many require high-performance computing (HPC) to be successful. As researchers at LBNL, we are fortunate to have the opportunity to use one of the fastest supercomputers available today. [Read more about Perlmutter]

Suicide and Overdose Prevention


Suicide affects communities across the world. About 800,000 people die from suicide every year. In the US, suicide is the 10th leading cause of death and the 2nd leading cause of death among white males aged 18-34. It impacts the lives of individuals, their families and communities, at a cost of more than $69B every year in the US alone. Veterans are a particularly vulnerable population: on average, 18 Veterans die by suicide each day. Unfortunately, our ability to predict suicide remains quite limited. As a result, suicide prevention remains VA’s top clinical priority. While our ability to identify Veterans at risk for suicide has been dramatically improved with the implementation of the Recovery Engagement and Coordination for Health – Veterans Enhanced Treatment (REACH VET) suicide risk prediction model, overall classification of Veterans' risk for suicide remains poor, and there is a pressing need for improved suicide risk prediction models for Veterans.

Previously, our group investigated different natural language processing (NLP) techniques and developed NLP pipelines to accurately identify stressful life events that are under-reported in structured electronic health record (EHR) data and that can potentially help improve suicide risk prediction. Currently, REACH VET does not include NLP-derived life events as potential predictors and thus largely ignores the role of social determinants of health. These life events include, but are not limited to, access to lethal means, detoxification, food insecurity, housing instability, justice, job instability, military sexual trauma, and social connections relating to relationships and isolation.
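
As a rough illustration of how a lexicon can flag life events in free text, the sketch below matches a small set of phrases against a note. It is not our production pipeline: the lexicon terms, the function names and the example note are all hypothetical and invented for this example, and the sketch deliberately ignores negation ("denies being homeless"), misspellings and context, all of which a real clinical NLP pipeline has to handle.

```python
import re

# Hypothetical mini-lexicon; the terms are invented for illustration only.
LEXICON = {
    "housing instability": ["homeless", "eviction", "evicted", "living in car"],
    "food insecurity": ["food stamps", "skipping meals", "food pantry"],
    "job instability": ["laid off", "lost his job", "lost her job", "unemployed"],
}


def compile_lexicon(lexicon):
    """Build one case-insensitive, whole-word regex per life event."""
    patterns = {}
    for event, terms in lexicon.items():
        # Longer phrases first so they win over shorter overlapping terms.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in ordered)
        patterns[event] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return patterns


def find_life_events(note, patterns):
    """Return {event: [matched phrases]} for each event mentioned in the note."""
    hits = {}
    for event, pattern in patterns.items():
        matches = [m.group(0).lower() for m in pattern.finditer(note)]
        if matches:
            hits[event] = matches
    return hits


PATTERNS = compile_lexicon(LEXICON)

# Synthetic sentence written for this example; not real patient text.
example_note = "Veteran reports he was evicted last month and has been unemployed since March."
example_hits = find_life_events(example_note, PATTERNS)
```

For the synthetic note above, `example_hits` contains "housing instability" (from "evicted") and "job instability" (from "unemployed"). A knowledge-based lexicon of this kind is curated by hand, whereas a data-driven lexicon would be expanded from terms learned from the notes themselves.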

In this project, we aim to scale up and extend our existing NLP methods in order to integrate the stressful life events we have already identified into our overall predictive modeling framework, at a scale that covers all Veteran patient data. With the use of high-performance computing (HPC), we are able to train and analyze more sophisticated machine learning models, extract more information than was previously possible, and integrate it to produce performance metrics that surpass previous research. With the use of NLP and HPC, we can deliver information that will better guide future treatment for suicide and overdose risk.

Average number of life events identified prior to death using a knowledge-based lexicon versus data-driven lexicons. Average numbers of life events measured along structured EHR variables are plotted as dotted lines for comparison.

Obstructive Sleep Apnea and Comorbidities


Obstructive Sleep Apnea (OSA) contributes to the development of cardiovascular diseases (CVD) such as hypertension, atrial fibrillation, myocardial infarction, TIA/stroke, and congestive heart failure, as well as diabetes mellitus type 2 (T2DM). 24% of Veterans have been diagnosed with OSA, and the number of patients diagnosed has increased dramatically over the years, suggesting significant underreporting.

The current approach of treating everyone with moderate-to-severe sleep apnea, even if they are not tired, is unsustainable. To avoid undertreating or overtreating, a well-defined risk profile must be created.

In this project, we aim to cluster and characterize the phenotypes associated with different groups of patients who develop various comorbidities in addition to OSA and therefore require treatment. We will then use this phenotyping to gather data and train a predictive model for evaluating excess risk. We aim to observe the immediate contribution of OSA to the risk of developing associated comorbidities or death. Given the low sensitivity of structured variables, our group will use NLP to inform the clustering of associated phenotypes by collecting textual identifiers of OSA and comorbidities from administrative data. With the use of NLP and HPC, we can deliver information that will better guide future treatment of OSA.

Acronyms: diabetes mellitus type 2 (T2DM), hypertension (HTN), heart failure (HF), atrial fibrillation (AF), coronary artery disease (CAD), polysomnogram (PSG), home sleep apnea test (HSAT)

Lung Cancer Treatment

Lung cancer accounts for 30% of all cancers among American war Veterans and remains the leading cause of cancer-related deaths. Though the introduction of immune checkpoint inhibitors (ICI) has improved outcomes for Veterans with locally advanced (stage III) or metastatic (stage IV) non-small-cell lung cancer (NSCLC), there is an unmet need to develop predictive models to identify Veterans at risk of poor ICI response, ICI toxicity, and non-cancer (competing) mortality. Until these models are developed, personalized oncologic care for Veterans, including treatment intensification, selection of alternative therapies, enrollment in clinical trials, or increased toxicity monitoring, cannot be optimized.

In this project, we aim to build and validate neural network models enabling precision care for Veterans with lung cancer. Our overarching hypothesis is that deep neural networks trained on complex, multimodal structured health data, clinical text, and genomic data can improve the prediction of ICI efficacy, toxicity, and non-cancer mortality in Veterans with lung cancer. We aim to use high-dimensional feature selection techniques to identify novel predictors of ICI efficacy and toxicity, allowing for rapid repurposing of medications to modulate ICI response, or identifying subsets of patients. Given the low sensitivity of structured variables, our group will use NLP to impute predictor and outcome variables used downstream in modeling ICI efficacy. With the use of NLP and HPC, we can deliver information that will better guide future treatment of lung cancer.