Abstract
Asthma is a prevalent disease in pediatric patients, and most cases begin in the very early years of life. Early identification of patients at high risk of developing the disease allows us to provide them with the best treatment to manage asthma symptoms. Evaluating patients at high risk of developing asthma from huge data sets (e.g., electronic medical records) is often challenging and very time consuming, and a lack of complex data analysis or proper clinical logic determination may produce invalid results and irrelevant treatments. In this article, we used data from the Pediatric Research Database (PRD) to develop an asthma prediction model from past All Patient Refined Diagnosis Related Grouping (APRDRG) coding assignments. The knowledge gleaned from this asthma prediction model, from both routine use by physicians and experimental findings, will be fused into a knowledge-based database for dissemination to those involved with asthma patients. Success with this model may lead to expansion to other diseases.
Keywords:
Clinical research; Translational research; Medical informatics; Biomedical informatics; Machine learning; Data mining; Feature extraction; Classification

Background
Data mining in medical informatics
Because of their predictive power, various healthcare systems are attempting to use data mining techniques to discover hidden relationships and trends in the huge amounts of data available within clinical databases and to convert them into valuable information for physicians and other clinical decision makers. In general, data mining techniques learn from past examples and model oftentimes nonlinear relationships between independent and dependent variables. The resulting model provides formalized knowledge and outcome prediction. For example, Shekar et al. used a data-mining-based decision tree algorithm to discover the most common refractive errors in both males and females [1]. Palaniappan et al. presented a prototype that combines the strengths of online analytical processing (OLAP) and data mining techniques for clinical decision support systems (DSS) [2]. Prather et al. used data mining techniques to explore the factors contributing to the cost and outcomes of prenatal care [3]. Chae et al. applied a data mining approach to policy analysis in the health insurance domain [4]. Advanced data mining techniques have also been used to help evaluate healthcare utilization costs for employees and dependents in organizations [5].
More advanced machine learning methods, such as artificial neural networks and support vector machines, have been adopted in various areas of biomedicine and bioinformatics, including genomics and proteomics [6]. For biological data, clustering is probably the most widely used data mining technique; common methods include hierarchical clustering, k-means clustering, back-propagation neural networks, self-organizing maps, fuzzy clustering, expectation maximization, and support vector machines [7,8]. Bayesian models are widely used to classify data into predefined classes based on a set of features. Given the training examples, a Bayesian model stores the probability of each class, the probability of each feature, and the probability of each feature given each class. When a new, unseen example occurs, it can be classified according to these probabilities [9,10]. This classification technique is one of the most widely used in medical data mining. Decision tree models, such as the Iterative Dichotomiser 3 (ID3) heuristic, also belong to the field of machine learning. The ID3 heuristic uses a measure called "entropy" to quantify disorder in a set of data [11,12]. The idea behind the ID3 heuristic is to find the best attribute for classifying the records in the data set. The outcome is a set of learned rules and a model used to predict unseen examples from past examples. Nonnegative matrix factorization (NMF) has been widely used in text mining applications [13,14]. The constraint that distinguishes it from other methods is that the factorization of V into two matrices W and H (i.e., nmf(V) → WH) must be nonnegative, that is, all elements must be greater than or equal to zero. Typically, W and H are initialized with random nonnegative values to start the NMF algorithm. The convergence time varies, and convergence to a global minimum is not guaranteed [15].
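As a concrete illustration of the NMF constraint described above, the following sketch implements the classic Lee–Seung multiplicative updates with NumPy. The toy matrix, iteration count, and random seed are illustrative only, not from this study; both factors stay nonnegative throughout, and convergence is only to a local optimum.

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Lee-Seung multiplicative updates: factor a nonnegative V (m x n)
    into nonnegative W (m x k) and H (k x n). Converges only to a local
    optimum, depending on the random initialization."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 1e-4                # random nonnegative start
    H = rng.random((k, n)) + 1e-4
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)   # H stays >= 0
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)   # W stays >= 0
    return W, H

V = np.arange(12, dtype=float).reshape(4, 3)     # rank-2 toy matrix
W, H = nmf(V, k=2)
print(np.abs(W @ H - V).mean())                  # l1-type reconstruction error
```

Because the updates are multiplicative, an element that starts positive can never become negative, which is exactly the constraint that distinguishes NMF from unconstrained factorizations.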
Here, we develop a methodology and classification technique in data mining called Low Rank Matrix Decomposition (LRMD) that allows a computer to learn from past APRDRG data sets for asthma, extract dominant features, and then predict outcomes. A summary of APRDRGs and the mathematics behind LRMD are discussed below.
All patient refined diagnosis related groups (APRDRGs)
APRDRG is a grouping methodology developed in a joint effort between 3M Health Information Systems (HIS) and the National Association of Children’s Hospitals and Related Institutions (NACHRI). APRDRGs are proprietary and provide the most comprehensive and complete severity-of-illness classification available for pediatric patients. The system was designed to be more appropriate for general patient populations than the older Diagnosis Related Groups (DRGs) [16]. While the DRG system was designed and normed on Medicare patients only, the APRDRG system was designed and normed on a general population; we use APRDRG-based weights normed on a pediatric patient population. There are 316 APRDRGs; common codes include, but are not limited to, 138 Bronchiolitis/RSV pneumonia, 141 Asthma, 160 Major repair of heart anomaly, 225 Appendectomy, 420 Diabetes, 440 Kidney transplant, 662 Sickle cell anemia crisis, and 758 Childhood behavioral disorder. Each group has four severity-of-illness (SOI) levels and four risk-of-mortality (ROM) levels, whereas the DRG and Medicare Severity Diagnosis Related Groups (MSDRG) have only a single severity and risk of mortality per group. For example, there are multiple diagnosis codes for asthma, and an encounter might have asthma as a principal or secondary diagnosis; if the encounter was primarily for asthma treatment, the APRDRG code will be 141, and all such asthma encounters are assigned the same APRDRG code. In our internal system we code inpatient encounters to APRDRG as well as DRG. Data are available from our PRD back through 2009 [17], including Emergency Room (ER), Ambulatory Surgery (AS), and Observation (OBS) encounters.
Methods
Singular value decomposition
In general, Singular Value Decomposition is a method for decomposing any matrix A∈R^{MxN}, where M ≥ N, into a product UV^{T}, where U∈R^{Mxk} and V∈R^{Nxk} [18,19]. Since any rank k matrix can be decomposed in such a way, and any pair of such matrices yields a rank k matrix, the problem becomes an unconstrained minimization over pairs of matrices (U,V) with the minimization objective

f(U,V) = ‖A − UV^{T}‖_{2}^{2} = ‖A − A^{(k)}‖_{2}^{2}     (1)

where A^{(k)} = UV^{T} is a rank k approximation of matrix A. To find the optimum choices of U,V in the l_{2} norm sense [20,21], the partial derivatives of the objective f(U,V) with respect to U and V are

∂f/∂U = 2(UV^{T} − A)V     (2)

∂f/∂V = 2(UV^{T} − A)^{T}U     (3)
Setting the partial derivative with respect to U to zero and solving for U yields U = AV(V^{T}V)^{−1}. Considering an orthogonal V (i.e., V^{T}V = I) and requiring U^{T}U = Λ to be diagonal, this reduces to U = AV. Substituting back into the partial derivative with respect to V, we have

A^{T}AV = VΛ     (4)
The columns of V are mapped by A^{T}A to multiples of themselves; i.e., they are eigenvectors of A^{T}A. Therefore, the gradient vanishes at an orthogonal (U,V) if and only if the columns of V are eigenvectors of A^{T}A and the columns of U are eigenvectors of AA^{T}, scaled by the square roots of their eigenvalues [18,19]. More generally, the gradient vanishes at any (U,V) if and only if the columns of U are spanned by eigenvectors of AA^{T} and the columns of V are spanned by eigenvectors of A^{T}A. In terms of the singular value decomposition A = U_{O}SV_{O}^{T}, the gradient vanishes at (U,V) if and only if there exist matrices P_{U} and P_{V} such that U = U_{O}SP_{U} and V = V_{O}P_{V}. Thus, the singular vectors corresponding to the largest singular values represent the global properties (i.e., feature vectors) of A while satisfying the minimization in the l_{2} norm sense [19].
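The best rank k approximation described above can be computed directly with NumPy's SVD routine. This is a minimal sketch on a toy matrix (not from the study): keeping the k largest singular values and their singular vectors yields the best rank k approximation in the l_{2} sense.

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A in the l2 (Frobenius) sense:
    keep the k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[3., 1., 1.],
              [1., 3., 1.],
              [1., 1., 3.],
              [2., 2., 2.]])
A1 = rank_k_approx(A, 1)                 # dominant feature direction of A
assert np.linalg.matrix_rank(A1) == 1
# By the Eckart-Young theorem, no other rank-1 matrix is closer to A
# in the Frobenius norm.
print(np.linalg.norm(A - A1))
```

The residual norm equals the root-sum-square of the discarded singular values, which is why the leading singular vectors capture the global properties of A.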
Low rank matrix decomposition
Suppose it is desired to represent a matrix X∈R^{MxN} as a sum of simple rank one matrices so as to capture the nature of the matrix, where r is the rank of the matrix. In this case, the sum of outer products can be written as:

X = Σ_{i=1}^{r} u_{i}v_{i}^{T}     (5)
where X∈R^{MxN}, and {u_{1}, u_{2}, …, u_{r}} and {v_{1}, v_{2}, …, v_{r}} are sets of linearly independent column vectors of dimensions M and N, respectively. Each constituent outer product u_{i}v_{i}^{T} is a rank one MxN matrix whose column (row) vectors are each a linear multiple of the vector u_{i} (v_{i}). To be more precise, a necessary condition is that the vector set {u_{1}, u_{2}, …, u_{r}} form a basis for the column space of matrix X and the vector set {v_{1}, v_{2}, …, v_{r}} form a basis for the row space of matrix X. It is noted, however, that for r ≥ 2 there exist an infinite number of distinct selections of these basis vectors, and hence an infinite number of distinct rank r decompositions. The ultimate selection to be made is typically based on the application as well as computational considerations. To provide a mathematically based method for selecting the required basis vectors, let us consider the functional relationship

f_{k}({u_{i}},{v_{i}}) = ‖X − Σ_{i=1}^{k} u_{i}v_{i}^{T}‖_{p}     (6)

for 1 ≤ k ≤ r and p = 1, 2.
It can be readily shown that the function (6) is a convex function of the set {u_{i}} for a fixed set {v_{i}}, and vice versa; for the proof, please refer to [22]. The convexity property is important since it ensures that any local minimum of f_{k} with one vector set fixed is also a global minimum. With regard to equation (6), a specific selection of the vector sets {u_{1},u_{2},…,u_{k}}∈R^{M} and {v_{1},v_{2},…,v_{k}}∈R^{N} is to be made so as to minimize this functional. The optimal selection then provides the best rank k approximation of matrix X in the l_{p} norm sense, designated by

X^{(k)} = Σ_{i=1}^{k} u_{i}v_{i}^{T}     (7)
This optimal matrix is said to be the best rank k approximation of matrix X in the l_{p} norm sense. For convenience, we express equation (5) in the normalized form

X = Σ_{i=1}^{r} σ_{i}^{o}u_{i}^{o}v_{i}^{oT}     (8)

where ‖u_{i}^{o}‖ = ‖v_{i}^{o}‖ = 1 and the σ_{i}^{o} are positive scalars. The most commonly employed matrix decomposition procedure is the Singular Value Decomposition (SVD). The SVD provides an effective method for mitigating the deleterious effects of additive noise and is characterized by minimizing the function f_{k}({u_{i}},{v_{i}}) in the l_{2} norm sense, that is

f_{k}({u_{i}},{v_{i}}) = ‖X − Σ_{i=1}^{k} σ_{i}u_{i}v_{i}^{T}‖_{2}^{2}     (9)
The use of the l_{1} norm criterion can be of practical use when analyzing data that contains outliers. Namely, equation (9) can be recast as an objective function that seeks the best rank k approximation of matrix X∈R^{MxN} in the l_{1} norm sense. That is

min_{{u_{i}},{v_{i}}} ‖X − Σ_{i=1}^{k} u_{i}v_{i}^{T}‖_{1}     (10)
To find the optimum solution that minimizes the objective function (10), we introduce a concept called alternating optimization. This concept is explained in detail below.
Alternating optimization
We can rewrite equation (10) in terms of the matrices U and V as f(U,V) = ‖X − UV^{T}‖_{1}. By fixing U, the objective function becomes:

f(V) = ‖X − UV^{T}‖_{1}     (11)
where the columns of X are denoted by x_{j} and, similarly, the columns of V^{T} are denoted by v_{j}. It is straightforward to see that f(V) can be rewritten as a sum of independent criteria

f(V) = Σ_{j=1}^{N} ‖x_{j} − Uv_{j}‖_{1}
where each term may be minimized independently by selecting an appropriate v_{j}. The solution method for each of these subproblems is given in [23]. Grouping the resulting v_{j} together to obtain V^{T}, we get a solution for equation (11). On the other hand, by fixing V, the objective function can be expressed as:

f(U) = ‖X^{T} − VU^{T}‖_{1}
and a similar method may be used to solve for U. The iteration process, which proceeds by finding V given U and then finding U given V (i.e., alternating optimization), is continued until a stopping criterion is met, i.e., the matrices from two successive iterations are sufficiently close, for example ‖V_{i+1} − V_{i}‖_{1} < ε. However, it must be noted that finding a global minimum is not guaranteed. In the following section, we establish a guideline for selecting the initial factors.
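A minimal sketch of this alternating scheme follows. Each l_{1} subproblem min_{v} ‖x_{j} − Uv‖_{1} is solved here as a linear program via scipy.optimize.linprog, which is one standard formulation; the solver used in [23] may differ, and the toy data and random initialization are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

def l1_fit(U, x):
    """Solve min_v ||x - U v||_1 as a linear program:
    minimize sum(t) subject to -t <= x - U v <= t, t >= 0."""
    m, k = U.shape
    c = np.concatenate([np.zeros(k), np.ones(m)])   # cost on slacks t only
    A_ub = np.block([[U, -np.eye(m)],               #  U v - t <=  x
                     [-U, -np.eye(m)]])             # -U v - t <= -x
    b_ub = np.concatenate([x, -x])
    bounds = [(None, None)] * k + [(0, None)] * m   # v free, t nonnegative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:k]

def alt_l1(X, k, iters=3, seed=0):
    """Alternate: fix U and solve each column of V^T, then fix V and
    solve each row of U; the l1 error is non-increasing at every step."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((X.shape[0], k))
    for _ in range(iters):
        Vt = np.column_stack([l1_fit(U, X[:, j]) for j in range(X.shape[1])])
        U = np.vstack([l1_fit(Vt.T, X[i, :]) for i in range(X.shape[0])])
    return U, Vt

X = np.outer([1., 2., 3.], [1., 1., 2.])   # rank-1 toy matrix
U, Vt = alt_l1(X, k=1)
print(np.abs(X - U @ Vt).sum())            # l1 reconstruction error
```

Because each subproblem is solved to optimality, the overall l_{1} objective cannot increase between iterations, which is what makes the error sequence bounded and monotone as discussed in the convergence section.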
Selection criterion
In this section, let us direct our attention to the selection criterion for the initial choice of U, where U∈R^{Mxk}. We consider the cases (i) rank k = 1 approximation and (ii) rank 1 < k ≤ r, where r = rank(X). To take the global data into account, a good initial value of U for a rank k = 1 approximation (i.e., U∈R^{Mx1}) may be obtained as follows. First, we compute the l_{1} norm of each column vector in X, denoted c_{j}. Next, we compute the l_{1} norm of each row vector in X, denoted r_{i}. We then find the maximum value among all the c_{j} and r_{i}. If the maximum corresponds to a column norm, say from column j, then we choose that column (i.e., U = X(:,j)) as the initial choice for U. If the maximum corresponds to a row norm, say row i, then we start with the transposed form of the criterion in (11) and choose that row (i.e., V^{T} = X(i,:)) as the initial choice for V^{T}. We can extend this concept to find the initial choice of U for rank k = 2 by applying the rank one approximation twice in succession. The objective function can then be expressed as:

f({u_{i}},{v_{i}}) = ‖X − u_{1}v_{1}^{T} − u_{2}v_{2}^{T}‖_{1}
where U = [u_{1} u_{2}] and V = [v_{1} v_{2}], and u_{1}, u_{2}, v_{1}, v_{2} are vectors. The initial choice for U (rank k = 2) is therefore U = [u_{1} u_{2}], i.e., the two columns (or rows) with the largest l_{1} norms. In a similar fashion, a selection criterion for the initial choice of U for rank k (1 < k ≤ r) can also be obtained. Thus the column space of X (i.e., U) represents a feature vector that is considered a global property (i.e., the best low rank approximation) of X minimizing the above objective function in the l_{1} norm sense [22].
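The rank one selection criterion can be sketched as follows (toy matrix for illustration): compute the l_{1} norm of every column and row of X and take whichever vector attains the overall maximum as the initial factor.

```python
import numpy as np

def initial_factor(X):
    """Initial rank-1 factor: the column or row of X with the
    largest l1 norm (ties go to the column)."""
    col_norms = np.abs(X).sum(axis=0)     # l1 norm of each column
    row_norms = np.abs(X).sum(axis=1)     # l1 norm of each row
    j, i = col_norms.argmax(), row_norms.argmax()
    if col_norms[j] >= row_norms[i]:
        return 'column', X[:, j].copy()   # initial choice for U
    return 'row', X[i, :].copy()          # initial choice for V^T

X = np.array([[1., 0., 2.],
              [0., 5., 2.],
              [1., 1., 0.]])
kind, vec = initial_factor(X)
print(kind, vec)   # row 1 has the largest l1 norm (7)
```

Starting from the most energetic column or row anchors the first factor on a globally dominant direction of the data rather than an arbitrary random vector.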
Convergence subsequence
The error at each iteration can be expressed as:

E_{i} = ‖X − U_{i}V_{i}^{T}‖_{1}

Since the error sequence is monotonically non-increasing and bounded below (i.e., E(U,V) ≥ 0), we have

E_{1} ≥ E_{2} ≥ … ≥ E_{i} ≥ … ≥ 0
and lim_{i → ∞}E_{i} = E_{final} ≥ 0. Therefore the entire infinite sequence lies inside a hypersphere (i.e., a closed and bounded set of points) of finite volume centered at X with radius E_{1}. Since this hypersphere has finite volume, it is possible to construct a finite number of smaller hyperspheres, each with radius ϵ > 0, such that their union contains the large hypersphere of radius E_{1}. For every ϵ > 0 there will be at least one hypersphere of radius ϵ containing an infinite number of points of the sequence. Thus, there is at least one cluster point, and that cluster point is the limit of a convergent subsequence. Therefore, the sequence X_{i}^{(k)} produced by the algorithm must contain at least one convergent subsequence.
Feature extraction methodology
For the purpose of this preliminary study, we acquired de-identified data sets from the PRD covering patient visits in 2012. The data comprise 92,175 encounters in total. Among all encounters, we selected those with APRDRG code 141 Asthma, 144 Respiratory signs & minor diagnoses, 131 Cystic fibrosis – pulmonary disease, or 132 BPD & chronic respiratory disease for our initial data sets. A total of 8,895 encounters for 7,011 distinct patients met these criteria (see Figure 1 and Figure 2). Among all patients, 57.8% (4,052) were male, 11.7% (817) were white, and 81.1% (5,685) were black or African-American. The PRD has UTHSC Institutional Review Board (IRB) approval for the creation and maintenance of the database; the waiver applies to the medical records of patients who received care in 2009 or later.
Figure 1. The PRD search cohort demonstrates the ability to search for possible asthma cases based on APRDRGs.
Figure 2. The PRD visual analytic feature demonstrates the ability to sort and filter by various criteria (tree view sorted by patient’s gender and filtered by patient’s age).
The text parsing software and Natural Language Toolkit (NLTK) [24] (written in Python) were used to parse all encounter data sets for this preliminary study. Let X = [x_{ij}] define the m × n term-by-encounter matrix for decomposition. Each element x_{ij} of the matrix X is a weighted frequency at which term i occurs in encounter j, where term i ∈ {gender, age, discharge status, admitting diagnosis, secondary diagnoses, principal diagnosis, principal procedure, secondary procedures}. The corpus stop words from NLTK were used to filter out unimportant terms.
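A toy sketch of constructing such a term-by-encounter matrix is shown below. The encounter strings and the small stop-word set are illustrative stand-ins for the PRD fields and the NLTK corpus stop words, and raw term frequency stands in for whatever weighting scheme is used.

```python
import numpy as np

# Rows of X are terms, columns are encounters; x_ij is the (weighted)
# frequency of term i in encounter j.
stop_words = {'the', 'of', 'and', 'a', 'with'}   # stand-in for NLTK's list

encounters = [                                    # hypothetical encounters
    "male age 4 wheezing shortness of breath asthma",
    "female age 6 wheezing family history of asthma",
    "male age 10 appendectomy",
]

docs = [[t for t in e.lower().split() if t not in stop_words]
        for e in encounters]
terms = sorted({t for d in docs for t in d})      # vocabulary (rows of X)

X = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for t in d:
        X[terms.index(t), j] += 1                 # raw term frequency
print(X.shape)                                    # (terms, encounters)
```

The resulting X is the matrix handed to the LRMD factorization, with each encounter represented as a sparse column of term weights.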
To evaluate classification performance, we randomly selected a subset of 1,200 encounters and divided it into four subsets of equal size (i.e., four-fold cross-validation). The system is trained and tested over four iterations (see Figure 3).
Figure 3. Classification workflow.
In each iteration, three subsets of data are used as training data and the remaining subset is used as testing data; in rotation, each subset serves as the testing set in exactly one iteration. The rank used to test the LRMD was k = 4; hence, the U and V matrix factors were (number of terms) × 4 and 4 × 1,200, respectively. Of the possible asthma encounters, 900 were used for training and the remaining 300 for testing our classifier. The initial matrix factors U and V were selected to meet our selection criterion (see Selection criterion), and the alternating iteration was continued until the matrices from two successive iterations were sufficiently close (see Alternating optimization). All classification results were obtained using Python version 2.7.4.
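The four-fold rotation can be sketched as follows; the classifier fit itself is elided, but the fold sizes match the 900/300 split described above.

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(1200)          # encounter indices, shuffled once
folds = np.array_split(idx, 4)       # four folds of 300 encounters each

for i, test in enumerate(folds):
    # the other three folds form the training set for this iteration
    train = np.concatenate([f for j, f in enumerate(folds) if j != i])
    assert len(train) == 900 and len(test) == 300
    # fit the LRMD factors on the 900 training encounters and
    # classify the 300 held-out encounters here
    print(f"fold {i}: train={len(train)} test={len(test)}")
```

Each encounter is held out exactly once, so every prediction used for the ROC analysis comes from a model that never saw that encounter during training.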
Results
Table 1 shows an example of dominant features for the classifier when applied to the training data sets (900 encounters randomly selected out of 1,200). We note that, among all features, an admitting diagnosis of 786.07 (wheezing), a secondary diagnosis of 786.05 (shortness of breath), age 4–8, and a family history of asthma (ICD-9-CM V17.5) would potentially indicate progression toward asthma, i.e., APRDRG code 141. The 2nd feature shows asthma patients with a pneumonia condition (ICD-9-CM 486.00) during their hospital stay. The 4th feature demonstrates the connection between asthma symptoms and another pulmonary condition, bronchitis (ICD-9-CM 466.0). When the two conditions coexist, bronchitis can worsen asthma symptoms, i.e., trigger an asthma attack.
Table 1. Example of dominant features using LRMD
To evaluate the performance of our classifier for this preliminary study, we plotted receiver operating characteristic (ROC) curves. Figure 4 shows the ROC curves (true positive rate versus false positive rate).
Figure 4. The receiver operating characteristic (ROC) shows sensitivity and specificity of our classifier.
Our goal is to maximize the number of true positives (correctly diagnosed asthma cases) with an acceptable number of false positives (false alarms). From Table 2 and Figure 4, we note that as sensitivity goes up, specificity goes down. We can begin to see the tradeoff: when we insist on higher sensitivity (fewer missed asthma cases), the result is lower specificity and more false positives. In practice, it is much worse to miss an asthma case than to endure unnecessary treatment, so we tend to choose a higher-sensitivity cutoff (e.g., cutoff score > 0.65 or > 0.75). As is apparent from Table 2, LRMD yields very promising results for disease identification and classification. However, much work remains to enhance the LRMD classifier, as discussed further below.
Table 2. Sensitivity and specificity
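The sensitivity/specificity tradeoff at a given cutoff can be sketched as follows; the scores and labels here are illustrative only, not the study's data.

```python
import numpy as np

def sens_spec(scores, labels, cutoff):
    """Sensitivity and specificity of thresholding classifier scores.
    labels: 1 = asthma encounter, 0 = non-asthma encounter."""
    pred = scores > cutoff
    tp = np.sum(pred & (labels == 1))    # correctly flagged asthma
    fn = np.sum(~pred & (labels == 1))   # missed asthma
    tn = np.sum(~pred & (labels == 0))   # correctly cleared
    fp = np.sum(pred & (labels == 0))    # false alarms
    return tp / (tp + fn), tn / (tn + fp)

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   1,   0,   1,   0,   0,   0])

for c in (0.25, 0.65):
    se, sp = sens_spec(scores, labels, c)
    print(f"cutoff {c}: sensitivity={se:.2f} specificity={sp:.2f}")
```

Sweeping the cutoff over all score values and plotting the resulting (1 − specificity, sensitivity) pairs is exactly how the ROC curve in Figure 4 is traced out.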
Discussion
The results presented in this paper should not be taken as an accurate representation of our patient data, as they do not include all the data records. These data are meant to demonstrate the potential of the PRD and the feasibility of data mining using LRMD. Additional experiments with a larger number of features (rank k > 4) and encounter data sets (2009–2012) should produce better models that capture the diversity of contexts described by those encounters. Using ICD-9-CM codes has limitations because they are generally used for billing purposes rather than clinical research. We plan to access free-text fields, such as physician and clinician notes, in the near future and include them in our classifier. Additional sociodemographic variables such as income, type of insurance, environment, nutrition, genome, and comorbidity covariates could potentially be added to the model to support the evaluation of potential causes for readmission.
Conclusions
Using data mining techniques to learn from past examples within rich data sources such as electronic medical records not only permits users to detect expected events, such as those predicted by models, but also helps users discover unexpected patterns and relationships that can then be examined and assessed to develop new insights. We hope that the rules learned with the LRMD technique will greatly advance progress toward the goal of identifying pediatric patients at high risk of asthma and help support clinical decisions.
Competing interests
The author declares no competing interests.
Acknowledgements
The author thanks the UTHSC department of ITS Computing Systems and Office of Biomedical Informatics for use of informatics resources and collaborations. We gratefully acknowledge Rae Shell and Grady Wade for assistance on proofreading and providing good comments. This work was supported by the Children’s Foundation Research Institute (CFRI).
References

Chandra Shekar DV, Sesha Srinivas V: Clinical Data Mining: An Approach for Identification of Refractive Errors. Proceedings of the International MultiConference of Engineers and Computer Scientists 2008 (IMECS 2008), Vol I. Hong Kong; 19-21 March 2008.

Palaniappan S, Ling C: Clinical Decision Support Using OLAP With Data Mining. IJCSNS International Journal of Computer Science and Network Security 2008, 8(9).

Prather JC, et al.: Medical data mining: knowledge discovery in a clinical data warehouse.

Chae YM, et al.: Data mining approach to policy analysis in a health insurance domain. Int J Med Inform 2001, 62(2-3):103-111.

Mohri M, Rostamizadeh A, Talwalkar A: Foundations of Machine Learning. New York: The MIT Press; 2012.

Huang Z: Extensions to the k-means algorithm for clustering large data sets with categorical values.

Jain AK, Murty MN, Flynn PJ: Data Clustering: A Review. Ohio: ACM Computing Surveys; 1999.

Neapolitan RE: Learning Bayesian Networks. Illinois: Prentice Hall; 2004.

Gelman A: A Bayesian formulation of exploratory data analysis and goodness-of-fit testing.

Grzymala-Busse JW: Selected algorithms of machine learning from examples.

Liu WX, et al.: Nonnegative matrix factorization and its applications in pattern recognition. Chinese Science Bulletin 2006, 51(1):7-18.

Cemgil AT: Bayesian inference for nonnegative matrix factorisation models.

Berry MW, Gillis N, Glineur F: Document Classification Using Nonnegative Matrix Factorization and Underapproximation. IEEE; 2009.

Sedman AB, Bahl V, Bunting E, Bandy K, Jones S, Nasr SZ, Schulz K, Campbell DA: Clinical redesign using all patient refined diagnosis related groups. Pediatrics 2004, 114(4):965-969.

Viangteeravat T: Giving Raw Data a Chance to Talk: A demonstration of de-identified Pediatric Research Database and exploratory analysis techniques for possible cohort discovery and identifiable high risk factors for readmission. Proceedings of the 12th Annual UT-ORNL-KBRIN Bioinformatics Summit; 2013.

Srebro N, Jaakkola T: Weighted Low Rank Approximation. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003). Washington, DC; 2003.

Singular Value Decomposition. http://www.schonemann.de//svd.htm

Cadzow JA: Signal enhancement: a useful signal processing tool. Spectrum Estimation and Modeling.

Cadzow JA: Minimum l(1), l(2), and l(infinity) norm approximate solutions to an overdetermined system of linear equations. Digital Signal Processing 2002, 12(4):524-560.

Viangteeravat T: Discrete Approximation Using l1 Norm Techniques. Master's Thesis: Electrical Engineering, Vanderbilt University; 2000.

Cadzow JA: Application of the l1 Norm in Signal Processing. Department of Electrical Engineering, Vanderbilt University, Nashville; 1999.

Perkins J: Python Text Processing with NLTK 2.0 Cookbook. Birmingham: Packt Publishing; 2010.