Data

Natural Language Processing

Annotation Tool


Annotation tools are necessary to extract the rich information found in complex, unstructured electronic health records (EHR). These tools vary in sophistication, and many are publicly available.

Our NLP pipeline relies on NILE (Narrative Information Linear Extraction) for rule-based annotation or weak labeling [1]. To deploy NILE in a distributed framework on high-performance computing (HPC) resources, we used a wrapper interface to the Java Runtime Environment (JRE) that distributes NILE’s workload across computing resources. NILE allows for the addition of both novel observation definitions (e.g., phenotype mentions) and word-sense modifiers (e.g., negations and temporal modifiers). Lexicons developed with data-driven methods can be easily added to existing NILE configurations.

1. Yu, S., Cai, T., & Cai, T. (2013). NILE: Fast natural language processing for electronic health records. arXiv preprint arXiv:1311.6063.
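
The idea of rule-based annotation with a lexicon and word-sense modifiers can be illustrated with a toy Python sketch. This is not NILE and does not reflect NILE's API or configuration format; the lexicon entries, negation cues, token window, and the `annotate` function are all hypothetical and chosen only for illustration. The sketch matches lexicon terms in a note and marks a mention as negated when a negation cue appears within a few tokens before it in the same sentence.

```python
import re

# Hypothetical lexicon: surface term -> concept label (illustrative entries only)
LEXICON = {
    "chest pain": "CHEST_PAIN",
    "shortness of breath": "DYSPNEA",
    "fever": "FEVER",
}

# Hypothetical negation cues (a real modifier list would be far larger)
NEGATION_CUES = ("no", "denies", "without")

# Assumed number of tokens before a mention to search for a negation cue
WINDOW = 3


def annotate(text, lexicon=LEXICON, cues=NEGATION_CUES, window=WINDOW):
    """Return (concept, negated) pairs in order of appearance in the text."""
    lowered = text.lower()
    # Try longer terms first so multi-word entries win over shorter overlaps
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    mentions = []
    for match in pattern.finditer(lowered):
        # Restrict the search to the current sentence (split on "." or ";")
        last_period = lowered.rfind(".", 0, match.start())
        last_semicolon = lowered.rfind(";", 0, match.start())
        sentence_start = max(last_period, last_semicolon) + 1
        # Keep only the last `window` word tokens before the mention
        context = lowered[sentence_start:match.start()]
        preceding = re.findall(r"[a-z]+", context)[-window:]
        negated = any(token in cues for token in preceding)
        mentions.append((lexicon[match.group(1)], negated))
    return mentions


note = "Patient denies chest pain. Reports fever and shortness of breath."
print(annotate(note))
```

In this sketch, adding a data-driven lexicon amounts to passing a larger dictionary, for example `annotate(note, lexicon={**LEXICON, "ptsd": "PTSD"})`. Because each note is annotated independently, a batch of notes could in principle be split across workers, which is the general property that makes distributing an annotator's workload feasible.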

Lexicons:

The lexicon is a guiding component of most annotation tools. The terminology it supplies is essential for extracting the specific information relevant to a given use case. Common in healthcare are the Unified Medical Language System (UMLS) and similar terminology systems. While these are appropriate in many instances, researchers, including our group, have found that data-driven lexicons are needed to improve extraction recall. We have begun documenting the process of curating lexicons for our various projects. This documentation, including lexicon versioning, is available for other researchers to use on our GitLab. [Request Access]

Methodology: 

Common terminology in machine learning and NLP is defined here to make our supporting documentation and publications easier to understand.

Geospatial

We aim to build tools that help the VA transform data into actionable intelligence by spatially targeting interventions, precisely identifying vulnerable regions, and dynamically evaluating the use of resources. We use geographic data on socioeconomic features to predict areas with higher suicide rates and to find geospatial clusters of these areas.