Big data driven co-occurring evidence discovery in chronic obstructive pulmonary disease patients
© The Author(s) 2017
Received: 4 January 2017
Accepted: 15 March 2017
Published: 4 April 2017
Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung disease that affects airflow to the lungs. Discovering the co-occurrence of COPD with other diseases, symptoms, and medications is invaluable to medical staff. Building co-occurrence indexes and finding causal relationships with COPD can be difficult because disease prevalence within a population often influences results. A method that better separates occurrence within COPD patients from population prevalence is therefore desirable. Large hospital systems may hold tens of millions of patient records spanning decades of collection, so a scalable big data approach is also desirable. The presented method, Co-Occurring Evidence Discovery (COED), provides a methodology and framework to address these issues.
Natural Language Processing methods are used to examine 64,371 deidentified clinical notes and discover associations between COPD and medical terms. Apache cTAKES is leveraged to annotate and structure clinical notes. Several extensions to cTAKES have been written to parallelize the annotation of large sets of clinical notes. A co-occurrence score is presented which can penalize scores based on term prevalence, as well as a baseline method traditionally used for finding co-occurrence. These scoring systems are implemented using Apache Spark. Dictionaries of ground truth terms for diseases, medications, and symptoms have been created using clinical domain knowledge. COED and baseline methods are compared using precision, recall, and F1 score.
The highest scoring diseases using COED are lung and respiratory diseases. In contrast, baseline methods for co-occurrence rank diseases with high population prevalence highest. Medications and symptoms evaluated with COED share similar results. When evaluated against ground truth dictionaries, the maximum improvements in recall for symptoms, diseases, and medications were 0.212, 0.130, and 0.174. The maximum improvements in precision for symptoms, diseases, and medications were 0.303, 0.333, and 0.180. Median increases in F1 score for symptoms, diseases, and medications were 38.1%, 23.0%, and 17.1%. A paired t-test was performed and the F1 score increases were found to be statistically significant (p < 0.01).
Penalizing terms which are highly frequent in the corpus results in better precision and recall performance. Penalizing frequently occurring terms gives a better picture of the diseases, symptoms, and medications co-occurring with COPD. Using a mathematical and computational approach rather than a purely expert-driven approach, large dictionaries of COPD-related terms can be assembled in a short amount of time.
Keywords: Big data, Decision support system, Data mining, Health informatics
Chronic obstructive pulmonary disease (COPD) is a family of diseases associated with reduced airflow to the lungs. Over time, patients will experience decreasing airflow as well as increasing inflammation of the tissues that line the airway. The National Institutes of Health (NIH) estimates that approximately 24 million Americans have COPD, many of whom are unaware of it. Though the disease primarily affects smokers, COPD may also occur in those who are genetically predisposed or as a result of air pollution. COPD has no known cure.
Historically, many medical professionals diagnosed COPD as chronic bronchitis or emphysema. More recently, diseases characterized by chronic cough with sputum production and increasing shortness of breath are encompassed in the blanket diagnosis of COPD. This means that COPD often co-occurs with related lung diseases. However, many diseases that co-occur with COPD are not contained within the family of COPD diseases. For example, hypertension often co-occurs with COPD because smoking increases the risk for both diseases. Other diseases such as asthma may also affect the lungs and have a high co-occurrence with COPD. Additionally, many medications not specifically created for COPD treatment are highly correlated with COPD. Aspirin has been shown to help in the treatment of COPD patients and is often prescribed by medical professionals.
The discovery of co-occurring diseases, symptoms, and medications can be useful to researchers and medical professionals. Researchers have shown interest in developing clinical guidelines which consider multimorbidities [5, 6]. Indexes exist which measure the likelihood of patient death based upon which diseases are present. Building these indexes using a computational approach by algorithmically discovering co-occurring diseases, symptoms, and medications would greatly expand their accuracy and coverage.
Currently, no standard set of ground-truth terms exists for evaluating the performance of COPD co-occurrence analysis. The contributions of our work are (1) a proposed methodology and the manual creation of an expert-reviewed dictionary and (2) new mathematical formulas and a big data computational framework for finding COPD-related terms. Once the ground-truth dictionary has been created, it is used to evaluate the proposed method against traditional methods for finding disease and term co-occurrence, using precision and recall.
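The evaluation against a ground-truth dictionary can be illustrated with a minimal sketch, assuming the scored terms are available as a ranked list of identifiers and the dictionary as a set. The function name, the cut-off parameter `k`, and the example identifiers are hypothetical and are not taken from the original implementation.

```python
def precision_recall_f1(ranked_terms, ground_truth, k):
    """Score the top-k ranked terms against a ground-truth set."""
    top_k = ranked_terms[:k]
    hits = sum(1 for term in top_k if term in ground_truth)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical identifiers, for illustration only.
ranked = ["T1", "T2", "T3", "T4"]
truth = {"T1", "T3", "T5"}
p, r, f1 = precision_recall_f1(ranked, truth, k=4)
```

Here two of the four returned terms are in the dictionary, so precision is 2/4 and recall is 2/3. Running the same function over the output of two scoring methods at the same `k` gives a like-for-like comparison.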
Clinical data sources
The past several years have seen an increase in the electronic storage of patient records using Electronic Health Records (EHR). Data is typically stored in two formats: structured and unstructured. Structured data is stored in a form which can be directly queried, with results returned as a normalized data structure. Structured data includes ICD-10 codes and patient demographics. ICD-10 codes are commonly used for billing purposes and may not provide a complete picture of the patient. Additionally, these billing codes may be assigned based on financial reimbursement incentives rather than accuracy.
Clinical notes written by medical professionals during the treatment and discharge of patients are considered unstructured data. These notes often contain more information than ICD-10 codes because they are dictated for the purpose of comprehensive patient documentation rather than billing. Medical professionals can quickly dictate large amounts of unstructured information without the conversion losses a structured system would incur. However, these notes have the disadvantage that they cannot be easily queried. The field which processes such unstructured data, or natural language, is known as Natural Language Processing (NLP).
The medical domain has been an early adopter of NLP. Unlike many domains which use general NLP techniques, clinical NLP systems attempt to incorporate domain knowledge in order to increase the quality of extracted information. A common practice in NLP is stemming, which attempts to find the root of a word. This technique can potentially increase the quality of extracted data, but general techniques such as the Porter stemmer have limited use in the clinical domain. Variants of common English terms often contain no more than an additional prefix or suffix. Clinical term variants may be medication brand names of the same generic drug or diseases which may have different common names internationally. Additional concerns exist such as extracting numeric values and units of laboratory results, distinguishing family history from patient history, and negation of findings. These concerns have led to the development of domain-specific clinical NLP systems spanning back many decades.
The Linguistic String Project–Medical Language Processor (LSP–MLP) is the oldest traceable NLP system directed at information extraction from clinical notes. LSP began in 1965 for the purpose of developing an English language parser that could process scientific literature. Researchers attempted to structure text in a way that could be easily queried, making it an early question–answer system. The NIH funded an expansion of LSP to be applied to clinical notes. This research resulted in the MLP system. MLP was aimed at supplementing LSP with domain-specific knowledge to increase the quality of information. LSP–MLP does not appear to be under active development or maintenance, and no related research papers have been published in over a decade.
Many of the NLP techniques used by LSP–MLP have been superseded by researchers outside of the clinical domain. Modern approaches typically use statistical methods for the identification of parts of speech, sentence boundaries, and other structure. LSP attempts to structure language using grammars. Modern systems additionally make use of controlled medical vocabularies which are developed independently by organizations such as the National Library of Medicine (NLM). Additionally, LSP–MLP is fragile because it depends on the syntactic structure of the text rather than semantic meaning.
The Medical Language Extraction and Encoding System (MedLEE) originated as a system to structure radiology reports. MedLEE eventually evolved into a general-purpose clinical NLP system. MedLEE was created without the help of additional NLP libraries and frameworks. Modern clinical NLP systems often use commercial off-the-shelf (COTS) NLP tools such as UIMA, GATE, and OpenNLP to prevent duplication of effort. MedLEE has improved in recent years to include medical vocabularies from the National Library of Medicine’s Unified Medical Language System (UMLS). UMLS contains medical terms from many data sources in a standardized format.
Although cTAKES and UIMA provide useful features, both tools are designed to process one document at a time. This limits their use in document aggregation. Analyzing the frequency of disease occurrence across a document corpus would not be possible with a UIMA annotator alone. Annotations which use frequency counts would need system extensions. Our research makes use of Apache cTAKES, and we have written the code necessary to annotate document aggregations.
Health Information Text Extraction (HITEx) is an information extraction system aimed at general-purpose processing of medical texts. The system departs from previous works in that it uses a component-based architecture built on the GATE framework. The use of GATE allowed researchers to focus solely on domain-specific concerns rather than low-level tasks such as tokenization and sentence segmentation. This project has stalled and has not published new research since 2006.
The highest layer of HITEx is the UMLS concept mapper which maps medical terms to UMLS concepts. This subsystem uses both exact string matching and fuzzy matching through truncation and normalization. The system has been used successfully as a hybrid system combining ICD-9 structured data and unstructured clinical notes. In addition to basic NLP tasks, HITEx contains modules capable of discerning the patient’s primary diagnosis and smoking status.
Zeng et al. have created a system to assist in the detection of co-morbidities in clinical notes. This system primarily uses HITEx to assist in the finding of co-morbidities. The existence of COPD and another disease in a clinical note is considered a comorbidity. This methodology is common in the determination of co-occurring diseases. However, this methodology may not be ideal. Diseases which occur with high prevalence in a general population will statistically also co-occur with high frequency independent of COPD status. Ideally, penalizing diseases, medications, and symptoms which occur with high frequency in a general population would allow a more accurate picture. While such penalizations have been extensively researched in the information retrieval (IR) community, few have attempted to adapt these methods to clinical NLP.
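The difference between simple co-occurrence counting and prevalence penalization can be illustrated with a toy sketch. It uses an inverse-document-frequency weight in the style of TF-IDF from the IR literature; this weight is a stand-in chosen for illustration and is not the COED score. The documents and term names below are invented.

```python
import math
from collections import Counter

# Each toy document is a set of terms; all contents are invented for illustration.
docs = [
    {"copd", "hypertension", "emphysema"},
    {"copd", "hypertension", "emphysema"},
    {"copd", "hypertension"},
    {"hypertension"},
    {"hypertension", "fracture"},
    {"fracture"},
]

target = "copd"
n_docs = len(docs)

# Document frequency of every term across the whole corpus.
doc_freq = Counter(term for doc in docs for term in doc)

# Baseline: number of documents in which a term appears together with the target.
baseline = Counter(
    term for doc in docs if target in doc for term in doc if term != target
)

# IDF-style penalty: terms common across the corpus receive a smaller weight.
penalized = {
    term: count * math.log(n_docs / doc_freq[term])
    for term, count in baseline.items()
}

baseline_rank = sorted(baseline, key=baseline.get, reverse=True)
penalized_rank = sorted(penalized, key=penalized.get, reverse=True)
```

In this toy corpus the baseline ranks "hypertension" first because it appears in most documents regardless of COPD status, while the penalized ranking places "emphysema" first.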
Clinical big data
Data mining of clinical data has been well studied since the emergence of the field. However, big data approaches have been far less studied. A review by Herland et al. documents several big data systems. Current clinical big data systems are often built for the purpose of supervised machine learning tasks. In practice, many medical professionals use Clinical Decision Support Systems (CDSS). These allow the practitioner to make conclusions rather than relying on algorithmic classification. Many classification algorithms provide evidence which is difficult to interpret. Medical professionals may be uncomfortable diagnosing patients without clearly interpretable results. Aggregated data, summary statistics, and similar patients are examples of common information presented to medical professionals using a CDSS. Big data approaches in CDSS have seen little research to date.
Our research attempts to create a computational framework for the discovery of co-occurring diseases, medications, and symptoms in COPD patients. COPD was chosen because it is tangential to many lung diseases. Clinical notes are used as the primary data source due to a potentially high yield of information. Several NLP techniques are employed in this framework in an effort to maximize the information captured within these notes. With the emergence of electronic patient record databases, many large systems containing big data are now available. Examples of these are the Veterans Affairs (VA) hospital system and England’s National Health Service (NHS). Our methodology uses a big data approach to finding co-occurring evidence and is validated using a dataset containing approximately 64,000 instances. Because access to databases as large as the VA system is rare, this dataset was the largest available to our research group. However, the methodology was designed with big data techniques as the foundation, and it can be employed by organizations such as the VA hospital system without scalability issues. Although an increasing number of researchers are using NLP with clinical notes as a data source [18, 19], few have explored COPD clinical notes, and there is no documented evidence in Google Scholar of this methodology applied to big data.
The Apache Hadoop ecosystem is leveraged for COED. The Hadoop Distributed File System (HDFS) is used for the storage and distribution of deidentified patient discharge summaries. Apache Spark is utilized for MapReduce operations, and the PySpark Python interface is used for programming. Documents are represented as Resilient Distributed Datasets (RDD). Apache cTAKES is used for the extraction of medical terms from unstructured clinical notes. cTAKES offers several UIMA pipelines, and the UMLS fast-dictionary-lookup pipeline is used as the primary pipeline. Disease, medication, and symptom annotations are stored, excluding annotations marked “history” and those that have been negated. UMLS Concept IDs (CID) are extracted from each annotation and used as the primary term identifier.
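The filtering step described above can be sketched as follows. The `Annotation` record and its field names are hypothetical simplifications of what an annotation pipeline might emit, not the cTAKES type system, and the identifiers are placeholders rather than real UMLS concepts. Returning a set, so that each concept is kept once per document, is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical flattened view of one extracted annotation."""
    concept_id: str   # concept identifier used as the term key
    category: str     # e.g. "disease", "medication", or "symptom"
    negated: bool     # True if the mention was negated
    history: bool     # True if the mention was marked as history


KEPT_CATEGORIES = {"disease", "medication", "symptom"}


def select_concept_ids(annotations):
    """Return the distinct concept IDs of retained annotations for one document."""
    return {
        a.concept_id
        for a in annotations
        if a.category in KEPT_CATEGORIES and not a.negated and not a.history
    }


# Placeholder identifiers, not real UMLS concepts.
note = [
    Annotation("C_A", "disease", negated=False, history=False),
    Annotation("C_B", "symptom", negated=True, history=False),
    Annotation("C_C", "medication", negated=False, history=True),
    Annotation("C_D", "procedure", negated=False, history=False),
]
kept = select_concept_ids(note)
```

Only `C_A` survives: `C_B` is negated, `C_C` is marked as history, and `C_D` falls outside the three stored categories.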
Summary of dictionaries used by cTAKES
SNOMED-CT
SNOMED-CT is a set of clinical terms maintained by the International Health Terminology Standards Development Organization (IHTSDO). Its development dates back to 1965, and it is known for its comprehensive coverage of clinical terms. SNOMED-CT consists of concepts, descriptions, and relationships and can be used for semantic processing.
NCI Thesaurus
The National Cancer Institute (NCI) Thesaurus was created to assist in research systems made available by NCI. It covers clinical terminology regarding cancers, findings, drugs, therapies, anatomy, genes, and many other cancer research related terms. The NCI Thesaurus offers a partial model of how these subjects relate to each other and aims to provide a common system for cancer researchers to communicate.
MeSH
Medical Subject Headings (MeSH) is an NLM controlled vocabulary used for indexing articles on NIH’s PubMed. Additionally, relationships between terms are provided which can act as a thesaurus.
ICD-9
International Classification of Diseases (ICD) is a coding system designed for classification of diseases and disorders. ICD is maintained by the World Health Organization (WHO), and ICD-9 is the ninth revision of the system. In the United States, ICD-9 has seen popular usage in medical billing. The system has been adopted by many organizations, including the Centers for Disease Control and Prevention, for reporting mortality and morbidity statistics.
Co-occurrence evidence discovery framework
Aggregator—Gathers annotations into a single data file suitable for processing. In the Hadoop ecosystem, this is a tab separated file with one line per document.
Analyzer—Documents are counted and terms mapped to COPD and non-COPD lookup tables. Each document is considered a COPD document if COPD was the primary diagnosis. Terms contained within the same document are considered to be COPD terms and counts incremented within the lookup table. Documents that do not contain COPD as a primary diagnosis are mapped to the non-COPD lookup table in a similar fashion. Terms may exist in both tables. These term counts are later used with Eq. 5 to calculate COED scores.
Score—The scoring mechanism then scores each term using equations and parameters outlined in the next section.
Ranker—Scores are then ranked and recombined with UMLS definitions for user accessible output.
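The Aggregator and Analyzer steps listed above can be sketched in plain Python. The column layout of the tab-separated file (document ID, a `1`/`0` flag for a COPD primary diagnosis, then one concept ID per remaining column) is an assumption made for illustration, as is counting each term at most once per document; the source only states that the file is tab separated with one line per document.

```python
from collections import Counter


def analyze(lines):
    """Build COPD and non-COPD term lookup tables from aggregated lines."""
    copd_counts, other_counts = Counter(), Counter()
    n_copd = n_other = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        is_copd = fields[1] == "1"
        # Count each term once per document (assumption); ignore empty columns.
        terms = {f for f in fields[2:] if f}
        if is_copd:
            n_copd += 1
            copd_counts.update(terms)
        else:
            n_other += 1
            other_counts.update(terms)
    return copd_counts, other_counts, n_copd, n_other


# Hypothetical aggregated file contents with placeholder concept IDs.
aggregated = [
    "doc1\t1\tC_A\tC_B\n",
    "doc2\t1\tC_A\n",
    "doc3\t0\tC_A\tC_C\n",
]
copd, other, n_copd, n_other = analyze(aggregated)
```

A term such as `C_A` can appear in both tables, which matches the description of the Analyzer; the two counts per term are what a scoring step would consume.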
After all annotations are complete, they are aggregated into a Spark-compatible file and a signal is sent to the second phase. COED runs as a series of map and reduce tasks. Word counts are performed for COPD and non-COPD documents. Results are stored in two separate RDDs and reduced by key using the add callback function. The two RDDs are then joined using the PySpark join() method, creating a merged RDD of the form [K,(V1,V2)]. Each term is then scored using Eq. 5 and sorted by value using a custom sort function. No final reduce operation is required, as the previous reduce has ensured distinct keys. Results are then written to a file for further analysis.
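The reduce, join, and sort sequence can be emulated in plain Python to show the shape of the data at each stage. This is a sketch of the data flow only, not Spark code: `reduce_by_key` and `inner_join` are hypothetical helpers that mimic the behavior of a reduce-by-key step and an inner join, and `placeholder_score` is an invented stand-in because the scoring equation is not reproduced here.

```python
from operator import add


def reduce_by_key(pairs, func):
    """Combine values that share a key, as a reduce-by-key step would."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


def inner_join(left, right):
    """Keep only keys present in both tables, yielding (K, (V1, V2))."""
    return [(k, (left[k], right[k])) for k in left if k in right]


def placeholder_score(copd_count, other_count):
    """Hypothetical stand-in; the actual scoring equation is not shown here."""
    return copd_count / (copd_count + other_count)


# (term, 1) pairs as a word-count map step would emit them; IDs are placeholders.
copd_pairs = [("C_A", 1), ("C_B", 1), ("C_A", 1), ("C_C", 1)]
other_pairs = [("C_A", 1), ("C_B", 1), ("C_B", 1), ("C_B", 1)]

copd_counts = reduce_by_key(copd_pairs, add)
other_counts = reduce_by_key(other_pairs, add)
merged = inner_join(copd_counts, other_counts)
ranked = sorted(
    ((k, placeholder_score(v1, v2)) for k, (v1, v2) in merged),
    key=lambda kv: kv[1],
    reverse=True,
)
```

In this sketch the term `C_C` is dropped because an inner join keeps only keys present in both tables. PySpark's RDD `join()` is likewise an inner join, so terms that occur in only one of the two RDDs would need an outer join or default counts if they are to be scored.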
Selection of ground-truth terms
Chronic lung disease
Acute respiratory failure
Gastro esophageal reflux
Carcinoma of lung
Clubbing (morphologic abnormality)
Carbon dioxide, increased level
Congestive heart failure
Diseases and disorders
Selection of top ten results for diseases and disorders
Congestive heart failure
Congestive heart failure
Chronic respiratory failure
Acute respiratory distress
Acute chronic respiratory failure
Chronic respiratory insufficiency
Selection of top ten results for symptoms
Decreased air entry
Selection of top ten results for medications
As shown in the results, penalizing terms which are highly frequent in the corpus results in better precision and recall performance. Penalizing frequently occurring terms gives a better picture of the diseases, symptoms, and medications co-occurring with COPD. Using a mathematical and computational approach rather than a purely expert-driven approach, large dictionaries of COPD-related terms can be assembled in a short amount of time. Additionally, localized data may return slightly different results based on patient population. This allows dictionaries to be created on a per-hospital basis rather than nationally, where localized concerns may not be accounted for.
Future work intends to expand this methodology to other diseases to increase confidence in results. Many diseases lack ground-truth dictionaries for information retrieval analysis, and these must be created using a similar methodology. Finally, we intend to integrate the software directly into an EHR system to give medical professionals analytical feedback about their patient population. This can serve as a decision support system to assist medical staff in developing patient treatment procedures.
CB carried out the conception, design, and implementation of this research as well as interpretation of results. AA and XZ made substantial contributions to the conception and design of this research as well as critically reviewing and interpreting results. CB carried out the drafting of manuscript. AA and XZ critically reviewed the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
This research was supported in part by NSF Grants IIP-1444949 and IIP-1624497.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. American Lung Association. COPD Fact Sheet, 2014. http://bit.ly/1rOoy1i. Accessed 05 Aug 2016.
2. Petty TL. The history of COPD early historical landmarks. Int J COPD. 2006;1:3–14.
3. Marengoni A, Rizzuto D, Wang HX, Winblad B, Fratiglioni L. Patterns of chronic multimorbidity in the elderly population. J Am Geriatr Soc. 2009;57(2):225–30.
4. Aaron CP, Schwartz JE, Hoffman EA, Tracy R, Austin JHM, Smith LJ, Jacobs DR, Watson KE, Barr RG. Aspirin use and longitudinal progression of percent emphysema on CT: the MESA lung study. Am J Respir Crit Care Med. 2015;191:A6354.
5. Guthrie B, Payne K, Alderson P, McMurdo MET, Mercer SW. Adapting clinical guidelines to take account of multimorbidity. Br Med J. 2012;345:e6341.
6. Tinetti ME, Fried TR, Boyd CM, Badalà F, Nouri-mahdavi K, Raoof DA. Designing health care for the most common chronic condition—multimorbidity. JAMA. 2012;307(23):2493–4.
7. D’Hoore W, Sicotte C, Tilquin C. Risk adjustment in outcome assessment: the Charlson comorbidity index. Methods Inf Med. 1993;32(5):382–7.
8. Danielsen RD, Simon AF, Pavlick R. The culture of cheating: from the classroom to the exam room. J Phys Assist Educ. 2006;17(1):23–9.
9. Rosenbloom ST, Denny JC, Xu H, Lorenzi N, Stead WW, Johnson KB. Data from clinical notes: a perspective on the tension between structure and flexible documentation. J Am Med Inform Assoc. 2011;18(2):181–6.
10. Porter MF. An algorithm for suffix stripping. Program. 1980;14(3):130–7.
11. Sager N. Natural language information processing. Advanced Book Program. Boston: Addison-Wesley Publishing Company; 1981.
12. Friedman C. A broad-coverage natural language processing system. Proceedings of the AMIA Symposium. American Medical Informatics Association. 2000;270–4.
13. Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, Chute CG. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507–13.
14. Zeng QT, Goryachev S, Weiss S, Sordo M, Murphy SN, Lazarus R. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak. 2006;6:30.
15. Ramos J, Eden J, Edu R. Using TF-IDF to determine word relevance in document queries. Process Manag. 2003;24(5):513–23.
16. Wu ST, Liu H, Li D, Tao C, Musen MA, Chute CG, Shah NH. Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis. J Am Med Inform Assoc. 2012;19(e1):e149–56.
17. Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health informatics. J Big Data. 2014;1(1):2.
18. Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform. 2009;42(5):760–72.
19. Wu Y, Denny JC, Rosenbloom ST, Miller RA, Giuse DA, Xu H. A comparative study of current Clinical Natural Language Processing systems on handling abbreviations in discharge summaries. AMIA Annu Symp Proc. 2012;2012:997–1003.
20. Ruch P, Gobeill J, Lovis C, Geissbühler A. Automatic medical encoding with SNOMED categories. BMC Med Inform Decis Mak. 2008;8(1):S6.
21. Sioutos N, de Coronado S, Haber MW, Hartel FW, Shaiu WL, Wright LW. NCI thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform. 2007;40(1):30–43.
22. Lipscomb CE. Medical subject headings (MeSH). Bull Med Libr Assoc. 2000;88(3):265–6.
23. United States Department of Health and Human Services. The international classification of diseases. Geneva: World Health Organization; 1969.
24. Slee VN. The International classification of diseases: ninth revision (ICD-9). Ann Intern Med. 1978;88(3):424–6.
25. International Classification of Diseases, Ninth Revision (ICD-9). http://www.cdc.gov/nchs/icd/icd9.htm. Accessed 11 Jul 2016.
26. WebMD. COPD Comorbid Conditions: heart disease, osteoporosis, and more. http://wb.md/2dGwUqq. Accessed 01 Aug 2016.
27. CDC. Addressing the Nation’s most common cause of disability at A Glance 2015. http://bit.ly/1FKbR7i. Accessed 01 Aug 2016.