Identifying influential metrics in the combined metrics approach of fault prediction

Fault prediction is a prominent area of empirical software engineering that has seen a surge of activity over the last couple of decades. In developing a fault prediction model, combining metrics improves the explanatory power of the model. Since the metrics used in combination are often correlated and do not have an additive effect, the impact of one metric on another, i.e. interaction, should be taken into account. Interaction effects are rarely considered when developing regression-based fault prediction models in software engineering; however, two-term and three-term interactions are analyzed in detail in the social and behavioral sciences. Interactions beyond three terms are scarce, because interaction effects at such a high level are difficult to interpret. In our earlier findings (Softw Qual Prof 15(3):15-23) we statistically established the pertinence of considering the interaction between metrics, which results in a considerable improvement in the explanatory power of the corresponding predictive model. However, in that approach the number of variables involved in fault prediction also increases with interaction. Furthermore, the interacting variables do not contribute equally to the prediction capability of the model. This study contributes towards the development of an efficient predictive model involving interaction among predictive variables with a reduced set of influential terms, obtained by applying stepwise regression.


Background
Fault prediction models based on different modelling techniques have been widely used to improve software quality for the last three decades. Of the many modelling techniques used by researchers, regression and its variants still draw a major share of the research community's attention (Basili et al. 1996; Denaro et al. 2003; Yu 2012; Bibi et al. 2008; Thwin and Quah 2005; Briand et al. 2000; Khoshgoftaar et al. 2002; Gyimothy et al. 2005). Regression has also been compared with evolutionary algorithm based techniques (Raj Kiran and Ravi 2008; Radjenovic et al. 2013).
The application of regression analysis focuses on identifying potential complexity metrics and building relationship models that are capable of identifying fault-prone software modules.
No single set of metrics exists which can be applied to all projects equally. Therefore by taking failure [scenarios] and their correlation into account within a project, the capability to design an improved prediction model can be achieved by combining metrics (Nagappan et al. 2006).
In the recent literature, the benefits and comparative advantages of using a combination of source code metrics to predict bugs have been illustrated by D' Ambros et al. (2012) and Okutan and Yildiz (2012). However, combining metrics may lead to interactions among metrics, which have not yet been properly dealt with in the software engineering literature, though they have been reported in other areas of the sciences and engineering.
This issue has been highlighted in our previous study, in which we developed eight different models by considering two types of metrics, i.e. Chidamber and Kemerer (CK) and other object oriented (OO) metrics (Chidamber and Kemerer 1994). These models describe different possibilities of two-term interaction: the first four models take combinations of CK and OO metrics into consideration, and the remaining four consider CK, OO and their combination separately, with or without quadratic terms.
Through our earlier findings, we statistically established that the full-interaction model, which includes linear two-term interactions together with self-interacting terms, outperforms the other models.
Though the models developed in the previous study were statistically effective, the large number of predictive variables arising from interaction may lead to the over-fitting of data, thereby giving rise to prediction errors.
In this study our goal is to select the most influential metrics, derived through the interaction, since all candidate complexity metrics may not have equally resolute predictive powers. In order to reduce the dimensionality of data a feature selection technique needs to be utilized. For the purpose of this paper, we have used stepwise regression.
Through applying stepwise regression a subset of predictors that optimally models the measured responses has been computed, which yields the most influential combination of predictive variables.

Data and mathematical methods used
The following methodology was used to select suitable variables from the set of candidate predictor variables considered in this study.

Selection and structure of the dataset
For the purpose of validating the method and mechanism proposed in this paper we have taken a publicly available bug prediction dataset (D' Ambros et al. 2010), available at http://bug.inf.usi.ch. From the data available in this dataset, we have taken into consideration 6 CK (Chidamber and Kemerer) metrics and 11 OO (object oriented) metrics for five software systems, i.e. Eclipse, Mylyn, Equinox, PDE and Lucene. Within the purview of this paper, we use the single-version approach to bug prediction, which assumes that the current design and behaviour of the program influence the presence of future defects and therefore does not require the history of the system (D' Ambros et al. 2010). Table 1 describes the metrics of the dataset used in this study.

Multiple linear regression (MLR)
MLR models the relationship between two or more independent variables ($x_1, x_2, \dots, x_k$) and the dependent variable ($y$) (Eq. 1), and can be expressed as Data = Fit + Residual (Pedhazur 1997; Cohen et al. 2003).

$$ y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \dots + \alpha_k x_k + \epsilon \qquad (1) $$

where $\alpha_0$ is the intercept term, $\alpha_1, \alpha_2, \dots, \alpha_k$ are the coefficients of the independent variables and $\epsilon$ is a random error component.

Table 1 Metrics of the dataset used in this study (CK metrics)
Coupling between object classes (CBO): Investigates the coupling between classes by taking the dependency of one class on other classes into consideration.
Depth of the inheritance tree (DIT): Investigates the complexity of the inheritance hierarchy by counting ancestor levels in the inheritance tree.
Lack of cohesion metric (LCOM): Investigates cohesion within a class by measuring the dissimilarity of methods.
Response for the classes (RFC): Investigates the coupling between classes by calculating the sum of the number of local methods and the methods that can be called remotely.
Weighted methods per class (WMC): Investigates the complexity of a class by summing up the complexity of its methods.
Number of children (NOC): Investigates the complexity of the inheritance hierarchy by counting the number of immediate subclasses of a class.

MLR with interaction
In MLR, $y$ is a linear function of all $k$ input variables. However, to add a further level to the regression, the interaction between variables ought to be considered (Eq. 2). This in turn captures the synergistic effect of combined predictors. For example, with two interacting variables $x_1$ and $x_2$ the model is as follows:

$$ y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_{12} x_1 x_2 + \epsilon \qquad (2) $$

where $\alpha_0$ is the intercept term, $\alpha_1$ and $\alpha_2$ are the coefficients of the independent variables, $\alpha_{12}$ is the coefficient of the interaction term, $x_1$ and $x_2$ are the values taken by the independent variables, and $\epsilon$ is the random error component.
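As an illustration of the two-variable interaction model above, the following minimal sketch fits the intercept, the two main effects and the product term by ordinary least squares. The function name `fit_with_interaction` and the synthetic data are assumptions made for this example; they are not part of the original study, which was carried out in Matlab.

```python
import numpy as np

def fit_with_interaction(x1, x2, y):
    """Least-squares fit of y = a0 + a1*x1 + a2*x2 + a12*x1*x2."""
    x1, x2, y = (np.asarray(v, dtype=float) for v in (x1, x2, y))
    # Design matrix: intercept, two main effects, one product (interaction) column.
    X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # ordered as [a0, a1, a2, a12]

# Synthetic, noise-free data (illustrative only, not taken from the bug dataset).
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + 0.3 * x1 * x2

coef = fit_with_interaction(x1, x2, y)
print(np.round(coef, 3))
```

Because the interaction enters as an extra column of the design matrix, the model stays linear in its coefficients, so the usual MLR machinery applies unchanged.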

Formation of a set of linearly interacting terms
Step 1: Consider n variables, i.e. $x_1$ to $x_n$.
Step 2: For a variable $x_1$, consider its pairwise interaction with the remaining n-1 variables.
Step 3: Repeat step 2 for all the remaining variables.
The systematic execution of steps 1-3 results in [n(n-1)/2] + n variables: the n(n-1)/2 pairwise interactions, arranged as a triangular matrix with zero diagonal values since the self-interactions that would give quadratic terms are not considered here, plus the n original linear terms. For example, for 17 variables the set comprises [(17 × 16)/2] + 17 = 153 linearly interacting terms. The total number of terms, including linear interaction, for the different kinds of metrics considered (i.e. CK, OO and their combination) is as follows:

(i) For CK metric analysis: 21
(ii) For OO metric analysis: 66
(iii) For CK + OO analysis: 153

Triangular matrix representing interacting terms
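The counts above can be checked with a short sketch that enumerates the linear terms and all pairwise, non-self interactions. The helper names and the `A*B` label used for an interaction term are assumptions made for this example only.

```python
from itertools import combinations

def interaction_terms(names):
    """Return the linear terms followed by all pairwise (non-self) interactions."""
    linear = list(names)
    # combinations() yields each unordered pair once and never pairs a name with itself.
    pairs = [f"{a}*{b}" for a, b in combinations(names, 2)]
    return linear + pairs

def term_count(n):
    """Number of terms for n variables: n(n-1)/2 interactions plus n linear terms."""
    return n * (n - 1) // 2 + n

ck = ["CBO", "DIT", "LCOM", "RFC", "WMC", "NOC"]
terms = interaction_terms(ck)
print(len(terms), term_count(6), term_count(11), term_count(17))  # 21 21 66 153
```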

Experimental design and statistical measures used
In our experiment, we perform cross-validation with 50 folds of 90%-10% splits into training and validation sets, which further validates the values of the statistical measures reported by D' Ambros et al. (2010) for CK and OO metrics in isolation. The experiments were implemented and simulated in the Matlab 7.9.0 (R2009b) environment. Table 2 highlights the empirical aspects of the dataset provided for single-version CK and OO metrics.
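One possible reading of this validation scheme is 50 repeated random splits, each using 90% of the records for training and 10% for validation. The sketch below follows that reading; it is an assumption, as are the function name `repeated_holdout_r2`, the use of the validation-set R² as the score, and the synthetic data.

```python
import numpy as np

def repeated_holdout_r2(X, y, n_repeats=50, train_frac=0.9, seed=0):
    """Validation R^2 of an OLS model over repeated random train/validation splits."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    scores = []
    for _ in range(n_repeats):
        idx = rng.permutation(n)
        tr, va = idx[:n_train], idx[n_train:]
        # Fit on the training part (with intercept), score on the held-out part.
        A_tr = np.column_stack([np.ones(len(tr)), X[tr]])
        A_va = np.column_stack([np.ones(len(va)), X[va]])
        coef, *_ = np.linalg.lstsq(A_tr, y[tr], rcond=None)
        sse = np.sum((y[va] - A_va @ coef) ** 2)
        sst = np.sum((y[va] - y[va].mean()) ** 2)
        scores.append(1.0 - sse / sst)
    return np.array(scores)

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
scores = repeated_holdout_r2(X, y)
print(scores.mean())
```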
To compare the performance of the models developed, we present $R^2$ and adjusted $R^2$ values as statistical measures. $R^2$ measures the percentage of variation in the dependent variable that is explained by a predictive model when every independent variable is taken into consideration. Its value lies between 0 and 1, with a value closer to 1 indicating a strong predictive capability of the model developed. However, the value of $R^2$ can be increased simply by including more independent variables, even ones without sufficient explanatory power. Thus, the value of $R^2$ needs to be adjusted for the degrees of freedom. The adjusted $R^2$ is a preferred statistical measure for ascertaining the fitness of the model; it quantifies the percentage of variance explained by only those independent variables which actually affect the dependent variable (Runkler 2012). A value of Adj. $R^2$ approaching 1 indicates better performance of the predictive model.
$R^2$ and Adj. $R^2$ can be computed as follows (refer to Eqs. 3 and 4):

$$ R^2 = 1 - \frac{SSE}{SST} \qquad (3) $$

$$ \text{Adj. } R^2 = 1 - \frac{SSE / (n - p - 1)}{SST / (n - 1)} \qquad (4) $$

where SSE = sum of squared errors of the dependent variable, SST = sum of squared deviations of the dependent variable, n = sample size and p = number of predictors (independent variables).

Step-wise regression (SWR)
In regression analysis with a long list of independent variables, some of which may not be useful predictors, the purpose is to find the best subset of independent variables. Trying out all subsets would result in too large a number of possibilities. For example, in our experiment the number of possibilities would be $2^{153} - 1$, which is far too large a number to enumerate, thereby making the problem computationally intractable. The stepwise model-building technique (Draper and Smith 1981) is one potential solution to this problem. Within this technique the predictor variables are included one at a time, depending upon whether the included variable increases the adjusted $R^2$ or not. Initially, the $R^2$ value of each variable is considered independently, following which stepwise regression is implemented, starting with the variable that has the highest value of $R^2$ and moving on to the variable with the next highest $R^2$ value. This process continues until the adjusted $R^2$ starts decreasing. The adjusted $R^2$ is used as the "stepping" criterion here.
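The stepping procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: candidates are ranked once by the $R^2$ each achieves alone, then added in that order until the adjusted $R^2$ stops increasing. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def r2_scores(X, y):
    """Return (R^2, adjusted R^2) of an OLS fit with intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ coef) ** 2)   # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)   # sum of squared deviations
    r2 = 1.0 - sse / sst
    adj = 1.0 - (sse / (n - p - 1)) / (sst / (n - 1))
    return r2, adj

def stepwise_by_adjusted_r2(X, y):
    """Add predictors in order of individual R^2 while adjusted R^2 increases."""
    order = sorted(range(X.shape[1]),
                   key=lambda j: r2_scores(X[:, [j]], y)[0],
                   reverse=True)
    selected, best_adj = [], -np.inf
    for j in order:
        _, adj = r2_scores(X[:, selected + [j]], y)
        if adj <= best_adj:
            break  # adjusted R^2 no longer increases: stop stepping
        selected.append(j)
        best_adj = adj
    return selected, best_adj

# Synthetic data: only columns 0 and 3 carry signal (illustration only).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 3] + 0.5 * rng.normal(size=300)
selected, best_adj = stepwise_by_adjusted_r2(X, y)
print(selected, round(best_adj, 3))
```

Each step costs one regression, so at most as many fits as there are candidate terms are needed after the initial ranking, in contrast to the exhaustive search over all subsets.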

Results and discussion
The repercussion of considering interaction amongst metrics in the development of a predictive model
To highlight the importance of interaction, the statistics generated from all five modules of the dataset considered, along with the number of corresponding variables, are shown in Table 3. CK (WOI) refers to CK metrics without interaction and CK (WI) to CK metrics with interaction. We have used similar terminology for the other metrics considered as well. The data in Table 3 show that after considering interaction among CK metrics, there is a significant improvement in the adjusted $R^2$ value for all software modules, accompanied by an undesired increase in the number of variables, i.e. from 6 to 21. Similarly for OO metrics, the improvement in adjusted $R^2$ comes with an undesired increase in the number of variables from 11 to 66. Taking a combination of CK and OO metrics returns an even greater value of adjusted $R^2$ across all five software modules, but this improved predictive power is achieved at the cost of the number of variables increasing from 17 to 153.
In Table 3, Mylyn exhibits lower values of Adj. $R^2$ than the other software modules for both CK and OO metrics (with and without interaction). This may be due to the fact that the procedural code complexity of the methods of a class has not been taken into account, as this study focuses only on object oriented metrics.

Obtaining a reduced set of influential terms
In order to find the best subset of interacting variables, which provides enhanced explanatory and predictive power, stepwise regression (SWR) was performed up to a threshold level of 10% of the improved Adj. $R^2$. Tables 4, 5 and 6 (for the five software modules) show the reduced number of interacting metrics. Initially, SWR was performed for the combination of CK and OO metrics up to a threshold level of 10% of the corresponding adjusted $R^2$ value of each software module. For Mylyn, 36 metrics out of 153 total possibilities are sufficient. Similarly, for the other software modules we observe a significant reduction in the number of metrics to be considered relevant, as is evident from Table 4. SWR was then conducted for CK and OO metrics in isolation. The total number of possibilities is 21 in the case of CK and 66 for OO. Again, we observe a significant reduction in the number of relevant interacting metrics, as is evident from Tables 5 and 6.

Superset of interacting terms for all software modules
The superset of a reduced number of metrics is obtained with the intent to construct a cross-project and robust fault prediction model, which adequately acts upon all five different software modules (Peters et al. 2013). It is obtained by computing the union of the set of the reduced number of metrics, for all five software modules. The superset of interacting metrics for CK, OO and their combination is depicted in Table 7.
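Computing the superset amounts to a set union over the per-module reduced sets. The sketch below uses made-up term sets for three of the modules; they are placeholders and are NOT the entries of Tables 4-7, and the `A*B` label for an interaction term is a notation assumed for this example.

```python
def superset(reduced_sets):
    """Union of the per-module reduced term sets, sorted for stable output."""
    union = set()
    for terms in reduced_sets.values():
        union |= set(terms)
    return sorted(union)

# Hypothetical reduced sets (placeholders, not the values reported in the tables).
reduced = {
    "Eclipse": {"CBO", "RFC", "CBO*LCOM"},
    "Mylyn": {"CBO", "LCOM", "RFC*WMC"},
    "Lucene": {"RFC", "CBO*RFC"},
}
all_terms = superset(reduced)
print(all_terms)
```

A term selected for any single module is kept, so the superset can never be smaller than the largest per-module set and never larger than the full set of candidate terms.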
The influential metrics thus identified have increased information content as fault predictors, encompassing different aspects of the measurement of software characteristics. A brief description of the metrics discussed in the results is given in Table 1.
Referring to Table 7 and first considering CK metrics with interaction [CK (WI)], the coupling between objects (CBO), lack of cohesion of methods (LCOM) and response for class (RFC) metrics are influential in isolation, hence appearing individually and affirming the results reported by Gyimothy et al. (2005). In addition, the other influential metrics derived in CK (WI), as shown in Table 7, appear as interacting terms. LCOM measures the level of relatedness among the methods of a class, and CBO measures the dependence of this class on other classes. The interrelatedness of these individual metrics, i.e. CBO and LCOM, can be justified by the fact that they both share class attributes, member functions and the use of the attributes by these methods, consequently appearing as CBO + LCOM.
Weighted methods per class (WMC) is the weighted sum of the complexity of the methods, and both CBO and RFC are based on the invocation of a method from another class, thereby making them related to one another (RFC + WMC, CBO + RFC). Further, the interdependence of CBO and depth of inheritance tree (DIT) can be explained by the fact that the coupling between classes arising from inheritance will be higher for classes which have high values of DIT (Subramanyam and Krishnan 2003). CK metrics, in general, refer to the different aspects of a class design, that is, identification, semantics and relationship with other classes, and are often interrelated (Chidamber et al. 1998).
Regarding the predictive capability of inheritance metrics i.e. DIT and number of children (NOC), contradictory results have been reported in literature (Okutan and Yildiz 2012;Yu 2012;Basili et al. 1996;Gyimothy et al. 2005;Subramanyam and Krishnan 2003). Nevertheless, our results indicate that their combination (interaction) with other metrics like WMC, LCOM and RFC becomes a determining factor in the accuracy of the fault prediction model.
Other OO metrics used in this paper have the additional advantage of simplicity in the measurement of software characteristics; that is complexity, reusability, encapsulation and modularity. As is evident from Table 3, these metrics exhibit predictive power equivalent to CK metrics, if not better.
Similar to the argument presented for CK (WI), within OO (WI) in Table 7 the dominant OO metrics in isolation are number of attributes (NOA), FanIn, FanOut, number of attributes inherited (NOAI), NLOC, number of methods (NOM), number of private methods (NOPRM) and number of public attributes (NOPA), whereas the FanOut, FanIn, NLOC, number of private attributes (NOPRA) and NOPRM metrics are more frequently used in interacting terms.
The majority (42 out of 83) of the influential metrics considered under CK + OO (WI) are derived from the combination (with interaction) of CK and OO metrics. Further, it has been observed that the metric FanOut appears in combination with all CK metrics, which further validates the applicability of inter-class metrics when used in combination. Out of the 30 interacting OO metrics within CK + OO (WI), metrics like NOPRA, NOPA, NOPRM, number of public methods (NOPM) and number of methods inherited (NOMI) frequently appear in combination. These primitive OO metrics quantify the basic building blocks of a typical object oriented software module and contribute significantly to the development of a fault prediction model. The number of metrics within the superset of all interacting terms indicates a significant reduction in the total number of metrics to be considered in the design of a predictive model, while maintaining an adequate level of accuracy for all five software modules. Table 8 shows the statistics generated by including only those variables found in the superset for CK metrics, OO metrics and their combination. The value of the statistical measure, i.e. Adj. $R^2$, is consistent and acceptable (almost 90%) in comparison to the values obtained with all possible interacting terms for CK, OO and their combined metrics respectively. This establishes the significance of the reduction in the number of interacting metrics.

Threats to validity
Certain issues that could have an effect on the results of the study, and may subsequently have limited our interpretations, were identified. The scope of this paper is restricted to two-term interaction effects in the context of linear regression. Nonlinear regression has other well developed heuristic based approaches to feature selection, which are beyond the scope of this paper.
In SWR a unique optimal subset of variables is presumed; however, the presence of multiple optimal solutions cannot be ruled out. Thus, the process presented herein may be augmented by an additional step to identify the "best" of all the possible subsets obtained after the execution of a cycle of SWR.
Five different Java based software modules, each with a reasonable number of records, were considered in this study. In order to further support the derived results, software modules implemented in other programming languages may also be considered.