Selection and organization of study groups
A total of 175 volunteers in the age range of 18–69 years were recruited for this study. The volunteers were divided into two groups: (1) healthy volunteers (FBGL: 80–120 mg/dl; 41 female; 46 male; age range 18–62 years; mean age 35 ± 11 years), and (2) clinically diagnosed type II diabetes mellitus patients (FBGL ≥ 120 mg/dl; 47 female; 41 male; age range 21–69 years; mean age 47 ± 10 years). The following subjects were excluded from this study: (1) individuals with any salivary pathological condition such as salivary calculi or viral parotitis, (2) pregnant women, (3) people with gum bleeding, gingivitis or oral disorders such as oral cancer, (4) individuals with any systemic illness other than diabetes, or with severe diabetic complications, and (5) subjects on drugs such as anticholinergics, sympathomimetics, skeletal muscle relaxants, antimigraine drugs, cytotoxics, retinoids, anti-HIV drugs and cytokines, which are known to affect the salivary flow rate and its composition. The inclusion criteria for a person suffering from diabetes mellitus were based on the recommendations of the Expert Committee on Diagnosis and Classification of Diabetes Mellitus (Kahn 2003), which include features of polydipsia, polyphagia, polyuria and elevated BGLs.
Sample collection and analysis protocol
The participants were instructed to arrive in a fasting state between 8:00 and 10:00 A.M. without brushing their teeth. They were asked to swallow their existing saliva and were seated on a comfortable chair in an isolated room, with all ambient conditions kept the same so as to maintain their circadian rhythm. Each individual was asked to spit approximately 2 mL of saliva into a pre-autoclaved collecting vial. These saliva samples were analyzed immediately for the various electrochemical parameters, before they could degrade proteolytically. The pH and oxidation reduction potential (ORP) values were measured using the F-71 Laqua Lab (Japan) pH/ORP meter. The conductivity and concentrations of the electrolytes (mainly Na+, K+, and Ca++) were recorded using the Horiba Laqua twin series ion selective models (Malik et al. 2015). For comparison with the current gold standard, the FBGL of all volunteers was also measured in venous plasma and analyzed by an automatic biochemical analyzer (Cobas integra 400 plus).
Data preprocessing
The electrochemical data obtained from the saliva samples were used to train machine learning algorithms such as logistic regression, SVM and ANN, so that the results for unknown samples could be predicted in the future. Machine learning recognizes patterns and mines trends in large data sets and is now routinely used in the pharmaceutical industry. In our study, the mathematical models were coded in MATLAB R2014a (version 8.3). Prior to data fitting, an essential feature scaling operation was performed on all the parameters, namely pH, ORP, conductivity, electrolyte concentrations and volunteer's age, to obtain normalized data centered on zero (approximately in the range of −1 to 1). This was done to avoid any bias generated by the differences in the parameters' measurement units. The relationship used for feature normalization is shown in Eq. 1,
$$x_{i}^{{\prime }} = \frac{{x_{i} - \mu }}{\sigma }$$
(1)
where \(x_{i}\) is the input feature variable (pH, ORP, age, etc.), \(x_{i}^{{\prime }}\) is the normalized feature variable, and \(\mu\) and \(\sigma\) are the mean and standard deviation of all the data obtained for that feature. The FBGL values measured in the venous plasma were classified as 1 (high FBGL) if ≥120 mg/dl and as 0 otherwise, and fitted against the normalized training set data to determine the coefficients of the fitted variables related by the general equation (Eq. 2),
$$Y = f_{\theta } \left( x \right).$$
(2)
Here, Y is the predicted output FBGL value of either 0 or 1, x represents either linear or non-linear combination of input variables and θ is the coefficient value corresponding to x.
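The feature scaling of Eq. 1 is an ordinary z-score standardization. A minimal sketch in Python (the study itself used MATLAB); the feature values in the toy matrix are illustrative only:

```python
import numpy as np

def standardize(X):
    """Eq. 1: z-score each feature column so that it has zero mean
    and unit standard deviation across all volunteers."""
    mu = X.mean(axis=0)        # per-feature mean
    sigma = X.std(axis=0)      # per-feature standard deviation
    return (X - mu) / sigma

# Toy feature matrix: columns could be pH, ORP and age (hypothetical values).
X = np.array([[6.8, 120.0, 25.0],
              [7.2, 150.0, 47.0],
              [7.0, 135.0, 36.0]])
X_norm = standardize(X)
```

After this transformation every feature is centered on zero, so no single parameter dominates the fit simply because of its measurement units.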
Once the data from all 175 volunteers were normalized, they were divided into three equal, randomly generated data sets for threefold cross-validation. At a time, two of the sets were used for training and the third for testing; one complete cycle therefore generated three different combinations of training and testing sets. The motivation was to create shuffled training and testing data sets with no biasing. The training set was used to train the algorithm, which in turn provided a model for FBGL prediction of 0 or 1, and the testing set was used to evaluate the utility of the trained model. The entire cycle was repeated 20 times with different randomly selected partitions to further enhance the fitting accuracy and stabilize the results; the reported data and classifier performance index (CPI) parameters (discussed later) are the averages over these 20 iterations.
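The shuffled threefold split described above can be sketched as follows (in Python rather than the study's MATLAB; only the fold logic is the point here):

```python
import numpy as np

def three_fold_splits(n, rng):
    """Shuffle the n sample indices and cut them into three near-equal
    folds; each fold serves once as the test set while the remaining
    two folds form the training set (one complete cycle)."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, 3)
    return [(np.concatenate([folds[j] for j in range(3) if j != k]), folds[k])
            for k in range(3)]

rng = np.random.default_rng(0)
splits = three_fold_splits(175, rng)   # the study repeats this cycle 20 times
```

Re-running the function with a fresh shuffle for each of the 20 cycles yields the repeated, unbiased partitions over which the results are averaged.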
Logistic regression method
A linear logistic regression model was developed to detect high FBGL from age and salivary electrochemical parameters. The logistic regression model generates output in terms of probabilities and we chose 0.5 as the threshold equivalent to 120 mg/dl of BGL (Malik et al. 2015). Predicted output value (POV) depends on the input variables \(x_{i }\) and their coefficients \(\theta_{i}\) as shown in Eq. 3 below,
$$POV = \frac{1}{{1 + e^{{ - \left( {\theta_{0} + \sum_{1}^{n} \left( {x_{i} \theta_{i} } \right)} \right)}} }}$$
(3)
The values of \(\theta_{i}\) were initialized to zero to keep the initial condition unbiased, since the normalized data were distributed around zero. The gradient descent algorithm was then applied to the training data set to calculate the values of the coefficients using the mean square error (MSE) method (Additional file 1).
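A minimal sketch of this fitting procedure in Python (the study used MATLAB): batch gradient descent on the squared-error cost with zero-initialized coefficients, as described above. The toy data, learning rate and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, iters=5000):
    """Gradient descent on the MSE of the sigmoid output (Eq. 3),
    with all theta initialized to zero as in the text."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])           # prepend bias term theta_0
    theta = np.zeros(n + 1)
    for _ in range(iters):
        h = sigmoid(Xb @ theta)                    # predicted output values (POV)
        grad = Xb.T @ ((h - y) * h * (1 - h)) / m  # d(MSE)/d(theta)
        theta -= lr * grad
    return theta

def predict(theta, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return (sigmoid(Xb @ theta) >= 0.5).astype(int)  # 0.5 threshold = 120 mg/dl

# Hypothetical 1-D separable data standing in for the normalized features.
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic(X, y)
```

The 0.5 probability threshold plays the role of the 120 mg/dl FBGL cut-off used throughout the study.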
Artificial neural network (ANN)
ANN is another machine learning tool that can fit non-linear functions with high precision and accuracy to analyze the associated complex patterns (Chen and Billings 1992). We used a feed-forward ANN with a back-propagation gradient descent algorithm to distinguish the diabetic patients from the healthy ones using their salivary data. The ANN classifier architecture consisted of an input layer with 7 neurons (one for each parameter), a hidden layer with 33 neurons, and an output layer with two neurons, one for each class (Additional file 1: Fig. S1). The 33-neuron hidden layer was chosen because it gave maximum accuracy with minimum deviations (see Additional file 1: Fig. S2). The ANN was trained by reducing the MSE of the training dataset (Additional file 1: Fig. S3). Once the MSE was minimized, the values of the constants obtained were stored internally to validate the model using half of the remaining data. The results of validation created a platform for testing the model with the other half of the remaining data (Additional file 1: Fig. S3).
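A compact sketch of such a 7–33–2 feed-forward network trained by back-propagation on the MSE, written in Python/NumPy rather than the MATLAB toolbox actually used; the random toy data, learning rate and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class FeedForwardANN:
    """7 input neurons, 33 hidden neurons, 2 output neurons (one per
    class), trained by back-propagation gradient descent on the MSE."""
    def __init__(self, n_in=7, n_hidden=33, n_out=2, lr=1.0):
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, X):
        self.H = sigmoid(X @ self.W1 + self.b1)      # hidden activations
        self.O = sigmoid(self.H @ self.W2 + self.b2)  # output activations
        return self.O

    def train(self, X, Y, epochs=5000):
        m = len(X)
        for _ in range(epochs):
            O = self.forward(X)
            dO = (O - Y) * O * (1 - O)                # d(MSE)/d(output pre-activation)
            dH = (dO @ self.W2.T) * self.H * (1 - self.H)
            self.W2 -= self.lr * self.H.T @ dO / m
            self.b2 -= self.lr * dO.mean(axis=0)
            self.W1 -= self.lr * X.T @ dH / m
            self.b1 -= self.lr * dH.mean(axis=0)

    def predict(self, X):
        return self.forward(X).argmax(axis=1)         # index 1 = high FBGL

# Hypothetical stand-in data: 60 samples of 7 standardized salivary features.
X = rng.normal(size=(60, 7))
y = (X.sum(axis=1) > 0).astype(int)
net = FeedForwardANN()
net.train(X, np.eye(2)[y])                            # one-hot targets
```

The winning output neuron (argmax) decides the predicted class, mirroring the two-neuron output layer of the study's architecture.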
Support vector machine (SVM)
SVM is another powerful tool now routinely used in clinical applications (Cortes and Vapnik 1995; Maglogiannis et al. 2009). In our study, it was used to map the salivary data from the original input space to a higher-dimensional feature space, in which the high and low FBGL classes could be separated with maximum margin by a hyperplane, using various non-linear kernels as shown in Eq. 4 (Cristianini and Shawe-Taylor 2000). Here, \(x_{i}\) is the normalized feature vector and \(x_{j}\) is the support vector.
$$k\left( {x_{i} ;x_{j} } \right) = f\left( {x_{i} } \right)^{T} f\left( {x_{j} } \right)$$
(4)
$$k_{linear} \left( {x_{i} ;x_{j} } \right) = x_{i}^{T} x_{j}$$
(5)
$$k_{Gaussian} \left( {x_{i} ;x_{j} } \right) = \exp \left( { - \gamma \left\| {x_{i} - x_{j} } \right\|^{2} } \right).$$
(6)
The SVM classifier was implemented using the LibSVM software package in MATLAB (Chang and Lin 2011) with the linear and Gaussian (radial basis function; RBF) kernel functions represented in Eqs. 5 and 6, respectively (Thurston et al. 2009). To develop an optimal SVM model, two key parameters, C and γ, were preselected for the kernels. C, commonly known as the penalty parameter, controls over-fitting of the model: a larger C penalizes misclassified training points more heavily, which fits the training data more closely at the risk of over-fitting. The parameter γ controls the degree of non-linearity of the model. C is used in implementing both the linear and RBF kernels, whereas γ is specific to the RBF kernel (see Additional file 1: Fig. S5).
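The two kernels of Eqs. 5 and 6 are simple to state directly; a sketch in Python/NumPy, with the γ value chosen purely for illustration:

```python
import numpy as np

def k_linear(xi, xj):
    """Eq. 5: linear kernel, a plain inner product."""
    return xi @ xj

def k_gaussian(xi, xj, gamma=0.5):
    """Eq. 6: Gaussian (RBF) kernel; similarity decays with the
    squared distance between the two vectors, scaled by gamma."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))
```

A larger γ makes the kernel more sharply peaked around each support vector, i.e. a more non-linear decision boundary, which is why γ is tuned only for the RBF kernel.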
Classifier performance index (CPI)
The model performances were determined using the confusion matrix (also known as the error or contingency matrix in machine learning) and the receiver operating characteristic (ROC) curve (Qin 2005) (Fig. 2a). True positives (TP) were defined as the cases where both the actual and predicted FBGL values lay in the ≥120 mg/dl range. Similarly, true negatives (TN) were cases where both the actual and predicted values had FBGL <120 mg/dl. False positives (FP) represented cases where the actual state of disease was absent but the model predicted it to be present, and vice versa for false negatives (FN). The data in the confusion matrix were used to estimate a set of statistically relevant performance indicators defined below,
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN},$$
(7)
$$Precision = \frac{TP}{TP + FP},$$
(8)
$$Recall = \frac{TP}{TP + FN},$$
(9)
$$F_{1} \;score = \frac{2TP}{2TP + FP + FN}.$$
(10)
Accuracy gives the fraction of correctly predicted cases, both high and low FBGL, in the total participating population (Eq. 7). Precision gives the fraction of the predicted diseased cases that are actually diseased (Eq. 8). Similarly, recall or sensitivity estimates the fraction of the actual diseased cases that are truly detected (Eq. 9). The F1 score, another important parameter, is defined as the harmonic mean of recall and precision (Eq. 10); it is used when the performances of different classifiers are to be contrasted with a single evaluation metric. For definitions of other CPIs such as specificity and negative predictive value, see Additional file 1.
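Eqs. 7–10 follow directly from the four confusion-matrix counts; a minimal Python sketch with hypothetical label vectors:

```python
def cpi(y_true, y_pred):
    """Classifier performance indices (Eqs. 7-10) from the confusion
    matrix of binary labels (1 = high FBGL, 0 = normal FBGL)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),   # Eq. 7
        "precision": tp / (tp + fp),                    # Eq. 8
        "recall":    tp / (tp + fn),                    # Eq. 9
        "f1":        2 * tp / (2 * tp + fp + fn),       # Eq. 10
    }

# Hypothetical actual vs. predicted FBGL classes (TP=3, TN=3, FP=1, FN=1).
scores = cpi([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

With these counts all four indices evaluate to 0.75, illustrating that the F1 score coincides with precision and recall when the two are equal.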
In the ROC curve, a well-accepted graphical tool for illustrating the performance of a binary classifier (Slaby 2007), the true positive rate (TPR) is plotted on the y-axis against the false positive rate (FPR) on the x-axis. TPR is mathematically the same as recall (Eq. 11), whereas FPR (Eq. 12) signifies how many wrong positive results occur among all the negative samples available during the test. For a binary classifier to perform reasonably, the ratio of TPR to FPR should be high.
$$True\;positive\;rate = \frac{TP}{TP + FN}$$
(11)
$$False\;positive\;rate = \frac{FP}{FP + TN}$$
(12)
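TPR and FPR at a fixed classification threshold give one point of the ROC curve; a small Python sketch (the label vectors are hypothetical):

```python
def roc_point(y_true, y_pred):
    """Eqs. 11-12: one (FPR, TPR) point of the ROC curve for a
    fixed classification threshold (1 = high FBGL, 0 = normal)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return fp / (fp + tn), tp / (tp + fn)   # (FPR, TPR)

# Sweeping the probability threshold of a classifier and plotting the
# resulting points traces out the full ROC curve.
fpr, tpr = roc_point([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

Here TPR well above FPR (0.75 vs. 0.25) corresponds to a point in the upper-left region of the ROC plot, i.e. a useful classifier.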