
Open Access Methodology

Potential identification of pediatric asthma patients within pediatric research database using low rank matrix decomposition

Teeradache Viangteeravat

Author Affiliations

Biomedical Informatics Core, Children’s Foundation Research Institute, Department of Pediatrics, The University of Tennessee Health Science Center, 50 N. Dunlap, 38013, Memphis, TN, USA

Journal of Clinical Bioinformatics 2013, 3:16  doi:10.1186/2043-9113-3-16

The electronic version of this article is the complete one and can be found online at: http://www.jclinbioinformatics.com/content/3/1/16


Received: 20 July 2013
Accepted: 22 August 2013
Published: 28 September 2013

© 2013 Viangteeravat; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Asthma is a prevalent disease in pediatric patients, and most cases begin in the very early years of a child's life. Early identification of patients at high risk of developing the disease can alert us to provide the best treatment to manage their asthma symptoms. Evaluating patients at high risk of developing asthma from huge data sets (e.g., electronic medical records) is often challenging and very time consuming, and a lack of complex data analysis or proper clinical logic can produce invalid results and irrelevant treatments. In this article, we used data from the Pediatric Research Database (PRD) to develop an asthma prediction model from past All Patient Refined Diagnosis Related Group (APR-DRG) coding assignments. The knowledge gleaned from this asthma prediction model, from both routine use by physicians and experimental findings, will be fused into a knowledge-based database for dissemination to those involved with asthma patients. Success with this model may lead to expansion to other diseases.

Keywords:
Clinical research; Translational research; Medical informatics; Biomedical informatics; Machine learning; Data mining; Feature extraction; Classification

Background

Data mining in medical informatics

Because of their predictive power, various healthcare systems are attempting to use available data mining techniques to discover hidden relationships and trends in the huge amounts of data available within clinical databases and to convert them into valuable information that can be used by physicians and other clinical decision makers. In general, data mining techniques can learn from past examples and model oftentimes non-linear relationships between independent and dependent variables. The resulting model provides formalized knowledge and outcome prediction. For example, Shekar et al. used a data mining based decision tree algorithm to discover the most common refractive errors in both males and females [1]. Palaniappan et al. presented a prototype that combines the strengths of online analytical processing (OLAP) and data mining techniques for clinical decision support systems (DSS) [2]. Prather et al. used data mining techniques to explore the factors contributing to the cost of prenatal care and its outcomes [3]. Chae et al. used a data mining approach to policy analysis in the health insurance domain [4]. Advanced data mining techniques have also been used to help evaluate healthcare utilization costs for employees and dependents in organizations [5].

More advanced machine learning methods, such as artificial neural networks and support vector machines, have been adopted in various areas of biomedicine and bioinformatics, including genomics and proteomics [6]. For biological data, clustering is probably the most widely used data mining technique; methods in common use include hierarchical clustering, k-means clustering, self-organizing maps, fuzzy clustering, and expectation maximization, alongside supervised approaches such as backpropagation neural networks and support vector machines [7,8]. Bayesian models have been widely used to classify data into predefined classes based on a set of features. Given the training examples, a Bayesian model stores the probability of each class, the probability of each feature, and the probability of each feature given each class. When a new, unseen example arrives, it can be classified according to these probabilities [9,10]. This classification technique is one of the most widely used in medical data mining. Decision tree models, such as the Iterative Dichotomiser 3 (ID3) heuristic, belong to the subfield of machine learning; the ID3 heuristic uses entropy to measure disorder in a set of data [11,12]. The idea behind the ID3 heuristic is to find the best attribute for classifying the records in the data set. The outcome is a set of learned rules and a model used to predict unseen examples based on past seen examples. Non-negative matrix factorization (NMF) has been widely used in the field of text mining [13,14]. The constraint that distinguishes it from other methods is that the factorization of V into the two matrices W and H (i.e., nmf(V) → WH) must be non-negative, i.e., all elements must be greater than or equal to zero. Typically, W and H are initialized with random non-negative values to start the NMF algorithm. Convergence time varies, and convergence to a global minimum is not guaranteed [15].
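
As an illustration of this constraint, the following minimal sketch (assuming NumPy; the non-negative input matrix V and rank k are placeholders) runs the widely used multiplicative updates, which keep W and H non-negative throughout:

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9):
    # Factor a non-negative m x n matrix V into W (m x k) and H (k x n),
    # nmf(V) -> WH, with every element >= 0, via multiplicative updates.
    # W and H start from random non-negative values; convergence to a
    # global minimum is not guaranteed [15].
    m, n = V.shape
    W, H = np.random.rand(m, k), np.random.rand(k, n)
    for _ in range(n_iter):
        H *= np.dot(W.T, V) / (np.dot(np.dot(W.T, W), H) + eps)
        W *= np.dot(V, H.T) / (np.dot(W, np.dot(H, H.T)) + eps)
    return W, H
```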

Here, we are working on a methodology and classification technique in data mining called Low Rank Matrix Decomposition (LRMD) that allows a computer to learn from past APR-DRG data sets for asthma, extract dominant features, and then predict outcomes. A summary of APR-DRGs and the mathematics behind LRMD are given below.

All patient refined diagnosis related groups (APR-DRGs)

APR-DRG is a grouping methodology developed in a joint effort between 3M Health Information Systems (HIS) and the National Association of Children’s Hospitals and Related Institutions (NACHRI). APR-DRGs are proprietary and offer the most comprehensive and complete severity-of-illness classification for pediatric patients. The system was designed to be more appropriate for a general patient population than the older Diagnosis Related Groups (DRG) [16]: while the DRG was designed and normed on Medicare patients only, the APR-DRG was designed and normed on a general population. We use APR-DRG based weights normed on a pediatric patient population. There are 316 APR-DRGs; common codes include, but are not limited to, 138 Bronchiolitis/RSV pneumonia, 141 Asthma, 160 Major repair of heart anomaly, 225 Appendectomy, 420 Diabetes, 440 Kidney transplant, 662 Sickle cell anemia crisis, and 758 Childhood behavioral disorder. Each group has four severity of illness (SOI) levels and four risk of mortality (ROM) levels, whereas the DRG and Medicare Severity Diagnosis Related Groups (MS-DRG) have only a single severity and risk of mortality per group. For example, there are multiple diagnosis codes for asthma, and an encounter might have asthma as the principal diagnosis or a secondary diagnosis; if the encounter was primarily for asthma treatment, the APR-DRG code will be 141, and all such asthma encounters are assigned the same APR-DRG code. In our internal system we code inpatient encounters to APR-DRG as well as DRG. Data are available from our PRD back through 2009 [17], including Emergency Room (ER), Ambulatory Surgery (AS), and Observation (OBS) encounters.

Methods

Singular value decomposition

In general, Singular Value Decomposition is a method for decomposing any matrix A ∈ R^{M×N}, where M ≥ N, into a product UV^T, where U ∈ R^{M×k} and V ∈ R^{N×k} [18,19]. Since any rank-k matrix can be decomposed in such a way, and any pair of such matrices yields a rank-k matrix, the problem becomes an unconstrained minimization over pairs of matrices (U, V) with the minimization objective

$$ f(U, V) \;=\; \| A - UV^{T} \|_{2}^{2} \;=\; \sum_{i=1}^{M} \sum_{j=1}^{N} \bigl( A_{ij} - (UV^{T})_{ij} \bigr)^{2} \qquad (1) $$

$$ A^{(k)} \;=\; UV^{T} $$

where A^{(k)} is a rank-k approximation of the matrix A. To find the optimal choices of U and V in the l_2 norm sense [20,21], we take the partial derivatives of the objective f(U, V) with respect to U and V:

$$ \frac{\partial f}{\partial U} \;=\; 2\,(UV^{T} - A)\,V \qquad (2) $$

$$ \frac{\partial f}{\partial V} \;=\; 2\,(UV^{T} - A)^{T}\,U \qquad (3) $$

Solving ∂f/∂U = 0 for U yields U = AV(V^TV)^{-1}. Restricting attention to orthogonal solutions, for which V^TV = I, gives U = AV. Substituting back into ∂f/∂V = 0, we have

$$ A^{T}A\,V \;=\; V\,\bigl(V^{T}A^{T}A\,V\bigr) \qquad (4) $$

The columns of V are mapped by A^TA to multiples of themselves, i.e., they are eigenvectors of A^TA. Therefore, the gradient vanishes at an orthogonal (U, V) if and only if the columns of V are eigenvectors of A^TA and the columns of U are eigenvectors of AA^T, scaled by the square roots of their eigenvalues [18,19]. More generally, the gradient vanishes at any (U, V) if and only if the columns of U are spanned by eigenvectors of AA^T and the columns of V are spanned by eigenvectors of A^TA. In terms of the singular value decomposition A = U_0 S V_0^T, the gradient vanishes at (U, V) if and only if there exist matrices P_U and P_V such that U = U_0 S P_U and V = V_0 P_V. Thus, the singular vectors corresponding to the largest singular values can represent the global properties (i.e., feature vectors) of A while satisfying the minimization in the l_2 norm sense [19].
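
To make this concrete, the following brief sketch (assuming NumPy) forms the best rank-k approximation from the SVD and checks the stated eigenvector relationship:

```python
import numpy as np

A = np.random.randn(8, 5)                      # any M x N matrix with M >= N
U0, s, V0t = np.linalg.svd(A, full_matrices=False)

k = 2                                          # best rank-k approximation (l2)
A_k = np.dot(U0[:, :k] * s[:k], V0t[:k, :])

# The right singular vectors (columns of V0) are eigenvectors of A^T A,
# with eigenvalues equal to the squared singular values.
eigvals = np.linalg.eigh(np.dot(A.T, A))[0]
assert np.allclose(np.sort(eigvals)[::-1][:k], s[:k] ** 2)
```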

Low rank matrix decomposition

Suppose that it is desired to represent a matrix X ∈ R^{M×N} as a sum of simple rank-one matrices so as to capture the nature of the matrix, where X is represented by a summation of r terms, r being the rank of the matrix. In this case, the outer-product expansion can be written as:

$$ X \;=\; \sum_{i=1}^{r} u_{i}\, v_{i}^{T} \qquad (5) $$

where X ∈ R^{M×N}, and {u_1, u_2, …, u_r} and {v_1, v_2, …, v_r} are sets of linearly independent column vectors of dimensions M and N, respectively. Each constituent outer product u_i v_i^T is a rank-one M×N matrix whose column (row) vectors are each a scalar multiple of the vector u_i (v_i). To be more precise, a necessary condition is that the vector set {u_1, u_2, …, u_r} form a basis for the column space of X and the vector set {v_1, v_2, …, v_r} form a basis for the row space of X. It is noted, however, that there exist infinitely many distinct selections of these basis vectors for the case r ≥ 2; it follows that there are infinitely many distinct decompositions when the matrix has rank r ≥ 2. The ultimate selection is typically based on the application as well as computational considerations. To provide a mathematically based method for selecting the required basis vectors, let us consider the functional relationship

$$ f_{k}\bigl(\{u_{i}\},\{v_{i}\}\bigr) \;=\; \Bigl\| X - \sum_{i=1}^{k} u_{i}\, v_{i}^{T} \Bigr\|_{p} \qquad (6) $$

for p = 1, 2, where the integer k ranges over the interval 1 ≤ k ≤ r.

It can be readily shown that function (6) is a convex function of the set {u_i} for a fixed set {v_i}, and vice versa; for the proof, please refer to [22]. The convexity property is important since it ensures that any local minimum of f_k with one vector set fixed is also a global minimum. With regard to the above equation, a specific selection of the vector sets {u_1, u_2, …, u_k} ∈ R^M and {v_1, v_2, …, v_k} ∈ R^N is to be made so as to minimize this functional. The optimal selection then provides the best rank-k approximation of the matrix X in the l_p norm sense, designated by

$$ X^{(k)} \;=\; \sum_{i=1}^{k} u_{i}^{o}\, (v_{i}^{o})^{T} \qquad (7) $$

This optimal matrix is said to be the best rank-k approximation of the matrix X in the l_p norm sense. For convenience, we express equation (5) in a normalized form as:

$$ X^{(k)} \;=\; \sum_{i=1}^{k} \sigma_{i}^{o}\, u_{i}^{o}\, (v_{i}^{o})^{T} \qquad (8) $$

where ‖u_i^o‖ = ‖v_i^o‖ = 1 and the σ_i^o are positive scalars. The most widely employed matrix decomposition procedure is the Singular Value Decomposition (SVD). The SVD method provides an effective means of mitigating the deleterious effects of additive noise and is characterized by the function f_k({u_i},{v_i}) in the l_2 norm sense, that is

$$ f_{k}\bigl(\{u_{i}\},\{v_{i}\}\bigr) \;=\; \Bigl\| X - \sum_{i=1}^{k} \sigma_{i}\, u_{i}\, v_{i}^{T} \Bigr\|_{2} \qquad (9) $$

The l_1 norm criterion can be of practical use when analyzing data that contain outliers. Namely, it is useful to express equation (9) as an objective function that seeks the best rank-k approximation of the matrix X ∈ R^{M×N} as measured under the l_1 norm criterion. That is,

$$ f_{k}\bigl(\{u_{i}\},\{v_{i}\}\bigr) \;=\; \Bigl\| X - \sum_{i=1}^{k} \sigma_{i}\, u_{i}\, v_{i}^{T} \Bigr\|_{1} \qquad (10) $$

In order to find the optimal solution that minimizes the objective function (10), we introduce a concept called alternating optimization, explained in detail below.

Alternating optimization

We can rewrite equation (10) in terms of the matrices U and V as f(U, V) = ‖X − UV^T‖_1. By fixing U, the objective function becomes:

$$ f(V) \;=\; \| X - UV^{T} \|_{1} \;=\; \sum_{j=1}^{N} \bigl\| x_{j} - U\,\bar{v}_{j} \bigr\|_{1} \qquad (11) $$

where x_j denotes the j-th column of X and, similarly, the columns of V^T are denoted by v̄_j. It is straightforward to see that f(V) can be rewritten as a sum of independent criteria

$$ f(V) \;=\; \sum_{j=1}^{N} f_{j}(\bar{v}_{j}), \qquad f_{j}(\bar{v}_{j}) \;=\; \bigl\| x_{j} - U\,\bar{v}_{j} \bigr\|_{1} \qquad (12) $$

where each ‖x_j − U v̄_j‖_1 term may be minimized independently by selecting an appropriate v̄_j. The solution method for each of these subproblems is given in [23]. Grouping the resulting v̄_j together to obtain V^T, we get a solution for equation (11). On the other hand, by fixing V, the objective function can be expressed as:

$$ f(U) \;=\; \| X^{T} - V\,U^{T} \|_{1} \;=\; \sum_{i=1}^{M} \bigl\| \tilde{x}_{i} - V\,\bar{u}_{i} \bigr\|_{1} \qquad (13) $$

and a similar method may be used to solve for U, where x̃_i denotes the i-th column of X^T and ū_i the i-th column of U^T. The iteration proceeds by alternately solving for V and then for U (i.e., alternating optimization) until a stopping criterion is met, i.e., the matrices from two successive iterations are sufficiently close, for example ‖X_{i+1}^{(k)} − X_i^{(k)}‖_1 < ϵ. However, it must be noted that finding a global minimum is not guaranteed. In the following section, we establish a guideline for selecting the initial factors.
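
The alternating scheme above can be sketched in a few lines. The following is a minimal illustration, not the solution method of [23]: each l_1 subproblem min over v̄_j of ‖x_j − U v̄_j‖_1 is posed as a least-absolute-deviations linear program and handed to SciPy's general-purpose linprog, an assumption made purely for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def lad_solve(U, x):
    # Least-absolute-deviations subproblem argmin_v ||x - U v||_1, posed
    # as a linear program with slack variables t >= |x - U v| elementwise.
    M, k = U.shape
    c = np.concatenate([np.zeros(k), np.ones(M)])       # minimize sum of t
    A_ub = np.block([[U, -np.eye(M)], [-U, -np.eye(M)]])
    b_ub = np.concatenate([x, -x])
    bounds = [(None, None)] * k + [(0, None)] * M       # v free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:k]

def l1_low_rank(X, U0, n_iter=20):
    # Alternating optimization for min ||X - U V^T||_1: with U fixed, solve
    # each column of V^T independently; with V fixed, solve each row of U.
    U = U0.copy()
    for _ in range(n_iter):
        Vt = np.column_stack([lad_solve(U, X[:, j]) for j in range(X.shape[1])])
        V = Vt.T
        U = np.vstack([lad_solve(V, X[i, :]) for i in range(X.shape[0])])
        # a practical stopping test compares successive U V^T iterates
    return U, V
```

A production implementation would substitute the dedicated subproblem solver of [23] and add the stopping test described above.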

Selection criterion

In this section, let us direct our attention to the selection criterion for the initial choice of U, where U ∈ R^{M×k}. We consider the cases of (i) a rank k = 1 approximation and (ii) rank 1 < k ≤ r, where r = rank(X). In order to take the global data into account, a good initial value of U for a rank k = 1 (i.e., U ∈ R^{M×1}) approximation may be obtained as follows. First, we compute the l_1 norm of each column vector in X and denote these norms by {c_j}. Next, we compute the l_1 norm of each row vector in X and denote these norms by {r_i}. We then find the maximum value in {c_j} ∪ {r_i}. If the maximum corresponds to a column norm, say from column j, then we choose that column (i.e., U = X(:, j)) as the initial choice for U. If the maximum corresponds to a row norm, say row i, then we start with the transposed form of the criterion in (11) and choose that row (i.e., V^T = X(i, :)) as the initial choice for V^T. We can also extend this concept to find the initial choice of U for rank k = 2; essentially, we apply the rank-one approximation twice in succession. The objective function can then be expressed as:

$$ f(U, V) \;=\; \bigl\| X - u_{1} v_{1}^{T} - u_{2} v_{2}^{T} \bigr\|_{1} \qquad (14) $$

where U = [u_1 u_2] and V = [v_1 v_2], and u_1, u_2, v_1, v_2 are vectors. The initial choice for U (rank k = 2) is therefore U = [u_1 u_2] (i.e., the two largest l_1 column or row norms). In a similar fashion, a selection criterion for the initial U for rank k (1 < k ≤ r) can also be obtained. Thus the column space of X (i.e., U) represents a feature vector that is considered a global property (i.e., the best low-rank approximation) of X, minimizing the above objective function in the l_1 norm sense [22].
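
As a sketch of this selection criterion (assuming NumPy), the initial rank-one factor can be chosen by comparing the l_1 norms of all columns and rows of X:

```python
import numpy as np

def initial_rank_one_factor(X):
    # Compute the l1 norm of every column and every row of X and return
    # whichever single column or row has the largest norm, as the initial
    # choice for U (or, in the transposed form, for V^T).
    col_norms = np.abs(X).sum(axis=0)    # {c_j}
    row_norms = np.abs(X).sum(axis=1)    # {r_i}
    j, i = col_norms.argmax(), row_norms.argmax()
    if col_norms[j] >= row_norms[i]:
        return 'U', X[:, j]              # U = X(:, j)
    return 'Vt', X[i, :]                 # V^T = X(i, :), transposed criterion
```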

Convergence subsequence

The error sequence generated at each iteration can be expressed as:

$$ E_{i} \;=\; \bigl\| X - U_{i}\, V_{i}^{T} \bigr\|_{1} \qquad (15) $$

Since the error sequence is bounded below (i.e., E(U, V) ≥ 0), we have

$$ E_{1} \;\ge\; E_{2} \;\ge\; \cdots \;\ge\; E_{i} \;\ge\; \cdots \;\ge\; 0 \qquad (16) $$

and lim_{i→∞} E_i = E_final ≥ 0. Therefore the entire infinite sequence lies inside a hypersphere (i.e., a closed and bounded set of points) of finite volume centered at X with radius E_1. Since this hypersphere has finite volume, it is possible to construct a finite number of smaller hyperspheres, each with radius ϵ > 0, such that the union of all these small hyperspheres contains the large hypersphere of radius E_1. For every ϵ > 0 there will be at least one hypersphere of radius ϵ containing an infinite number of points of the sequence. Thus, there is at least one cluster point, and the cluster point is the limit of a convergent subsequence. Therefore, the sequence of X_i^{(k)} produced by the algorithm must contain at least one convergent subsequence.

Feature extraction methodology

For the purpose of this preliminary study, we acquired de-identified data sets from the PRD representing patient visits in 2012. The total number of observations is 92,175 encounters. Among all encounters, we selected those with APR-DRG code 141 Asthma, 144 Respiratory signs & minor diagnoses, 131 Cystic fibrosis – pulmonary disease, or 132 BPD & chronic respiratory disease for our initial data sets. The total meeting these criteria is 8,895 encounters for 7,011 distinct patient records (see Figure 1 and Figure 2). Among all patients, 57.8% (4,052) were male, 11.7% (817) were white, and 81.1% (5,685) were black or African-American. The PRD has UTHSC Institutional Review Board (IRB) approval for the creation and maintenance of the database; the waiver applies to the medical records of patients who received care in 2009 or later.

Figure 1. The PRD search cohort demonstrates the ability to search for possible asthma cases based on APR-DRGs.

Figure 2. The PRD visual analytic feature demonstrates the ability to sort and filter by various criteria (tree view sorted by patient’s gender and filtered by patient’s age).

Text parsing software and the Natural Language Toolkit [24] (written in Python) were used to parse all encounter data sets for this preliminary study. Let X = [x_ij] define the m × n term-by-encounter matrix for decomposition. Each element x_ij of the matrix X is a weighted frequency at which term i occurs in encounter j, where term i ∈ {gender, age, discharge status, admitting diagnosis, secondary diagnoses, principal diagnosis, principal procedure, secondary procedures}. The stop-word corpus from NLTK was used to filter out unimportant terms.
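
As an illustration of this construction, the following sketch (assuming NumPy and the NLTK stop-word corpus [24]) builds a term-by-encounter count matrix; the encounter field names and the plain unweighted counts are illustrative assumptions rather than the exact weighting used in the study.

```python
from collections import Counter

import numpy as np
from nltk.corpus import stopwords      # NLTK stop-word corpus [24]

STOP = set(stopwords.words('english'))

# Hypothetical encounter field names, standing in for the fields listed above.
FIELDS = ['gender', 'age', 'discharge_status', 'admitting_diagnosis',
          'secondary_diagnoses', 'principal_diagnosis',
          'principal_procedure', 'secondary_procedures']

def term_by_encounter(encounters):
    # Build the m x n term-by-encounter matrix X = [x_ij], where x_ij counts
    # occurrences of term i in encounter j; stop words are filtered out.
    counts = [Counter(tok for f in FIELDS
                      for tok in str(enc.get(f, '')).lower().split()
                      if tok not in STOP)
              for enc in encounters]
    terms = sorted(set(t for c in counts for t in c))
    X = np.array([[c[t] for c in counts] for t in terms], dtype=float)
    return X, terms
```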

In evaluating the classification performance, we randomly selected a subset of 1,200 encounters and divided it into four subsets of equal size (i.e., four-fold cross-validation). The system is trained and tested over four iterations (see Figure 3).

Figure 3. Classification workflow.

In each iteration, three subsets of data are used as training data and the remaining subset is used as testing data. In rotation, each subset serves as the testing set in exactly one iteration. The rank used to test the LRMD was k = 4; hence, the matrix factors U and V^T were (number of terms) × 4 and 4 × 1,200, respectively. The number of possible asthma encounters used for training was 900, and the remaining 300 encounters were used for testing our classifier. The initial matrix factors U and V were selected to meet our selection criterion (see Selection criterion), and the alternating iteration was continued until the matrices from two successive iterations were sufficiently close (see Alternating optimization). All classification results were obtained using Python version 2.7.4.
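
A minimal sketch of this four-fold rotation, assuming NumPy; the 1,200 selected encounters are represented by their indices:

```python
import numpy as np

def four_fold_splits(n=1200, k_folds=4, seed=0):
    # Randomly partition n encounter indices into k_folds equal subsets.
    # Each iteration uses one subset for testing and the rest for training,
    # so every subset serves as the test set exactly once.
    idx = np.random.RandomState(seed).permutation(n)
    folds = np.array_split(idx, k_folds)
    for i in range(k_folds):
        test = folds[i]                                    # 300 encounters
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test                                  # 900 / 300 split
```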

Results

Table 1 demonstrates an example of dominant features for the classifier when applied to the training data sets (900 randomly selected out of 1,200 encounters). We note that among all features, admitting diagnosis = 786.07 (wheezing), secondary diagnosis = 786.05 (shortness of breath), age 4–8, and a family history of asthma (ICD-9-CM V17.5) would potentially progress toward asthma, i.e., APR-DRG code 141 Asthma. The 2nd feature shows asthma patients with a pneumonia condition (ICD-9-CM 486) during the hospital stay. The 4th feature demonstrates the connection between asthma symptoms and another pulmonary condition, bronchitis (ICD-9-CM 466.0). When the two conditions co-exist, bronchitis can make asthma symptoms worse, i.e., trigger an asthma attack.

Table 1. Example of dominant features using LRMD

To evaluate the performance of our classifier for this preliminary study, we plotted receiver operating characteristic (ROC) curves. Figure 4 shows the ROC curves (true positive rate versus false positive rate).

Figure 4. The receiver operating characteristic (ROC) shows the sensitivity and specificity of our classifier.

Our goal is to maximize the number of true positives (correctly identified asthma cases) with an acceptable number of false positives (false alarms). From Table 2 and Figure 4, we note that as sensitivity goes up, specificity goes down. We can begin to see the trade-off: when we insist on higher sensitivity (fewer missed asthma cases), the result is lower specificity and more false positives. In practice, it is much worse to miss an asthma case than to endure unnecessary treatment, so we tend to choose a higher-sensitivity cutoff (e.g., cutoff score > 0.65 or > 0.75). As is apparent from Table 2, LRMD yields very promising results for disease identification and classification. However, we still have much work to do to enhance the LRMD classifier, as discussed further below.

Table 2. Sensitivity and specificity
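
For illustration, sensitivity and specificity at a given cutoff can be computed as in the following sketch (assuming NumPy; scores and labels are hypothetical classifier outputs and reference labels):

```python
import numpy as np

def sensitivity_specificity(scores, labels, cutoff):
    # labels: 1 = asthma (APR-DRG 141), 0 = other; scores: classifier output.
    pred = scores > cutoff
    pos, neg = labels == 1, labels == 0
    sensitivity = np.sum(pred & pos) / float(np.sum(pos))   # true positive rate
    specificity = np.sum(~pred & neg) / float(np.sum(neg))  # true negative rate
    return sensitivity, specificity

# Sweeping the cutoff traces out the ROC curve: a lower cutoff misses fewer
# asthma cases (higher sensitivity) at the cost of more false alarms.
```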

Discussion

The results presented in this paper should not be taken as an accurate representation of our patient data (as they do not include all the data records). These data are meant to demonstrate the potential of the PRD and the feasibility of the LRMD data mining technique. Additional experiments with a larger number of features (rank k > 4) and encounter data sets (2009–2012) should produce better models that capture the diversity of contexts described by those encounters. Using ICD-9-CM codes has limitations because they are generally used for billing purposes rather than for clinical research. We are planning to access free-text fields, such as physician and clinician notes, in the near future and include them in our classifier. Additional socio-demographic variables such as income, type of insurance, environment, nutrition, genome, and comorbidity covariates could potentially be added to the model to support the evaluation of potential causes for readmission.

Conclusions

Using data mining techniques to learn from past examples within rich data sources such as electronic medical records not only permits users to detect expected events, such as might be predicted by models, but also helps users discover unexpected patterns and relationships that can then be examined and assessed to develop new insights. We hope that rules learned with the LRMD technique will greatly advance progress toward the goal of identifying pediatric patients at high risk of asthma and help support clinical decisions.

Competing interests

The author declares that they have no competing interests.

Acknowledgements

The author thanks the UTHSC Department of ITS Computing Systems and the Office of Biomedical Informatics for the use of informatics resources and for collaborations, and gratefully acknowledges Rae Shell and Grady Wade for assistance with proofreading and for their helpful comments. This work was supported by the Children’s Foundation Research Institute (CFRI).

References

  1. Chandra Shekar DV, Sesha Srinivas V: Clinical data mining: an approach for identification of refractive errors. In Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS 2008), Vol I. Hong Kong; 19-21 March 2008.

  2. Palaniappan S, Ling C: Clinical decision support using OLAP with data mining. IJCSNS International Journal of Computer Science and Network Security 2008, 8(9).

  3. Prather JC, et al.: Medical data mining: knowledge discovery in a clinical data warehouse. Proc AMIA Annu Fall Symp 1997:101-105.

  4. Chae YM, et al.: Data mining approach to policy analysis in a health insurance domain. Int J Med Inform 2001, 62(2-3):103-111.

  5. Hedberg SR: The data gold-rush. Byte 1995, 20(10):83-88.

  6. Mohri M, Rostamizadeh A, Talwalkar A: Foundations of Machine Learning. New York: The MIT Press; 2012.

  7. Huang Z: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 1998, 2(3):283-304.

  8. Jain AK, Murty MN, Flynn PJ: Data clustering: a review. ACM Computing Surveys 1999, 31(3):264-323.

  9. Neapolitan RE: Learning Bayesian Networks. Prentice Hall; 2004.

  10. Gelman A: A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. International Statistical Review 2003, 71(2):369-382.

  11. Mitchell TM: Machine Learning. McGraw-Hill; 1997:55-58.

  12. Grzymala-Busse JW: Selected algorithms of machine learning from examples. Fundamenta Informaticae 1993, 18:193-207.

  13. Liu WX, et al.: Nonnegative matrix factorization and its applications in pattern recognition. Chinese Science Bulletin 2006, 51(1):7-18.

  14. Cemgil AT: Bayesian inference for nonnegative matrix factorisation models. Comput Intell Neurosci 2009:785152.

  15. Berry MW, Gillis N, Glineur F: Document classification using nonnegative matrix factorization and underapproximation. IEEE; 2009.

  16. Sedman AB, Bahl V, Bunting E, Bandy K, Jones S, Nasr SZ, Schulz K, Campbell DA: Clinical redesign using all patient refined diagnosis related groups. Pediatrics 2004, 114(4):965-969.

  17. Viangteeravat T: Giving raw data a chance to talk: a demonstration of de-identified Pediatric Research Database and exploratory analysis techniques for possible cohort discovery and identifiable high risk factors for readmission. In Proceedings of the 12th Annual UT-ORNL-KBRIN Bioinformatics Summit; 2013.

  18. Srebro N, Jaakkola T: Weighted low rank approximations. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003). Washington, DC; 2003.

  19. Young E: Singular Value Decomposition. http://www.schonemann.de//svd.htm

  20. Cadzow JA: Signal enhancement: a useful signal processing tool. In Spectrum Estimation and Modeling, Fourth Annual ASSP Workshop 1988:162-167.

  21. Cadzow JA: Minimum l1, l2, and l-infinity norm approximate solutions to an overdetermined system of linear equations. Digital Signal Processing 2002, 12(4):524-560.

  22. Viangteeravat T: Discrete Approximation Using L1 Norm Techniques. Master's thesis. Vanderbilt University, Electrical Engineering; 2000.

  23. Cadzow JA: Application of the l1 Norm in Signal Processing. Nashville: Vanderbilt University, Department of Electrical Engineering; 1999.

  24. Perkins J: Python Text Processing with NLTK 2.0 Cookbook. Birmingham: Packt Publishing; 2010.