Improving the prediction of going concern of Taiwanese listed companies using a hybrid of LASSO with data mining techniques

The purpose of this study is to establish rigorous and reliable going concern doubt (GCD) prediction models. This study first uses the least absolute shrinkage and selection operator (LASSO) to select variables and then applies data mining techniques, namely neural network (NN), classification and regression tree (CART), and support vector machine (SVM), to establish prediction models. The samples of this study include 48 GCD listed companies and 144 NGCD (non-GCD) listed companies from 2002 to 2013 in the TEJ database. We conduct fivefold cross validation in order to estimate the prediction accuracy. According to the empirical results, the prediction accuracy of the LASSO–NN model is 88.96 % (Type I error rate is 12.22 %; Type II error rate is 7.50 %), the prediction accuracy of the LASSO–CART model is 88.75 % (Type I error rate is 13.61 %; Type II error rate is 14.17 %), and the prediction accuracy of the LASSO–SVM model is 89.79 % (Type I error rate is 10.00 %; Type II error rate is 15.83 %).

in light of the numerical value of some given classification data in order to acquire the relevant classification rule for every classification, bringing unknown classification data into the rule in order to acquire the final classification result. Many going concern prediction (GCP) studies have applied neural network (NN) to build classification models and to acquire results for going concern (GC) issues (Chen and Church 1992; Cornier et al. 1995; Mutchler et al. 1997; Foster et al. 1998; Carcello and Neal 2000; Gaganis et al. 2007; Chen and Lee 2015).
Among the statistical tools used to handle large-scale data analysis, machine learning has risen sharply in recent years. It identifies unknown information in complex data and draws inferences from a structured model, which can serve as a reference when making decisions for different purposes, often ones related to GC issues (Lenard et al. 1995; Anandarajan and Anandarajan 1999; Brabazon and Keenan 2004; Gaganis et al. 2007; Kirkos et al. 2007a, b; Martens et al. 2008; Mokhatab et al. 2011; Salehi and Fard 2013; Yeh et al. 2014; Chen and Lee 2015). The classification method is used most often in these studies, and its results are able to serve as the basis for both decisions and forecasts. However, whether any one machine learning algorithm is more suitable for GCP than the others remains disputed.
Aside from the accuracy of the prediction models, the occurrence of Type I and Type II errors cannot be ignored (O'Leary 1998; Kirkos et al. 2007a, b; Tasi and Huang 2010). A Type II error in particular may cause damage and high costs. If an auditor issues a wrong audit report due to his/her misjudgment, then it affects not only the enterprise and stakeholders, but also many investors. Moreover, the CPA may be sued. The costs of Type II errors are rather severe in the U.S.; examples include the Enron scandal in 2001 (Benston and Hartgraves 2002) and the WorldCom fraud in 2003. Taiwan has had its own financial fraud cases, involving Procomp Informatics and Infodisc in 2004 and Summit Computer in 2006. The purpose of this study is to develop a satisfactory model for forecasting the GCD of firms, so as to provide an early warning of such doubt and to reduce damage to both investors and auditors. This study takes machine learning methods, namely NN, support vector machine (SVM), and classification and regression tree (CART), as its basis and combines each of them with LASSO in order to establish separate classification models and compare them.

Literature review
Going concern concept and reports
Before investors invest in a company, they should understand the viability of the company. This kind of viability relates to the ability of management to properly manage the company's overall resources in order to survive. In uncertain situations, investors expect auditors to provide early warnings of business failure and risks of bankruptcy (Chen and Church 1996).
Pursuant to the provisions of SAS No. 59, an auditor's consideration of an entity's ability to continue as a going concern requires an explicit evaluation of the auditee's continued viability during the audit process. As a result, the GCD report serves as a warning sign when an auditor suspects that an auditee's ability to continue as a going concern is in doubt (Lenard et al. 1995).

Criteria for issuing an audit report by CPA for going concern
Taiwan's Auditing Standards Bulletin No. 16 stipulates that financial statements are generally compiled on the assumption of going concern. It further requires auditors to comply with the bulletin when they evaluate whether the going concern assumption is reasonable. CPAs may issue an unqualified opinion audit report if their doubt about the auditee's ability to continue as a going concern is eliminated after they evaluate the reasonableness of the going concern assumption. If CPAs consider the auditee's future measures to be reasonable and necessary to disclose in the financial report, then a qualified opinion audit report or an adverse opinion audit report is needed. If the CPA cannot eliminate doubts about the auditee's ability to continue as a going concern, but the matter has been properly disclosed in the auditee's financial statements, then the CPA shall issue an unqualified-modified opinion audit report. If the matter has not been properly disclosed in the auditee's financial statements, then the CPA shall issue a qualified opinion audit report or an adverse opinion audit report, depending on the significance. If a CPA has confirmed that the going concern assumption underlying the financial statements is not consistent with the actual situation and would have serious consequences, then the CPA shall issue an adverse opinion audit report. If the CPA cannot eliminate doubt, or the assumption is not consistent with the actual situation, then explanatory notes should be included in the audit report and should form part of it (Auditing Standards Board of the Republic of China Accounting Research Development Foundation, Auditing standard bulletin and auditing practice, 2013).

Traditional classification studies
The GCP model carries out a computation that depends mainly on the numerical values of financial and non-financial indicators in the training subset in order to acquire the relevant classification rule for every class, and then brings the remaining data subsets into the rule in order to acquire the final classification result.
Given the difficulty of GCD assessment, many authors apply logistic regression (LR) in order to make a GCP classification in relation to the GC issue (Chen and Church 1992; Cornier et al. 1995; Mutchler et al. 1997; Foster et al. 1998; Carcello and Neal 2000; Gaganis et al. 2007). However, the traditional classification method is limited in that the data must satisfy specific assumptions.

Machine learning classification methods
The machine learning approach has often been adopted in the literature. Many studies have attempted to apply the machine learning approach as a base to build a classification model, and they report that this method leads to outstanding prediction accuracy. Several studies apply a machine learning approach, e.g. SVM, decision tree (DT), or NN, to GCD and indicate that these approaches are able to forecast the GC status of businesses and provide useful financial information on the GC issue (Brabazon and Keenan 2004; Koh and Low 2004; Martens et al. 2008; Mokhatab et al. 2011; Salehi and Fard 2013; Yeh et al. 2014).
On a similar classification issue, Tasi and Wu (2008) apply NN to bankruptcy prediction and credit scoring. Chen et al. (2014) employ DT, SVM, and LR to forecast fraudulent financial statements and obtain excellent classification results. Based on these studies, this study utilizes the aforementioned LR, SVM, NN, and DT approaches as the basis upon which to build a classification model.

Methods
The purpose of this study is to establish a two-stage going concern doubt prediction model that integrates financial and non-financial indicators. In the first stage, this study uses the least absolute shrinkage and selection operator (LASSO) to screen the candidate indicators and retain the important indicators of GCD. In the second stage, the forecast models are built with the following machine learning techniques: NN, DT, and SVM. Finally, this study compares and analyzes the models in order to obtain better GC prediction results.

Least absolute shrinkage and selection operator (LASSO)
Stepwise regression has been applied in related work in the past, but stepwise methods have significant problems, which have been admirably summarized by Harrell (2001). This study therefore applies LASSO, first proposed by Tibshirani (1996), as its feature selection method. This algorithm minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,$$

where $y_i$ is the response of observation $i$, $x_{ij}$ is the value of predictor $j$ for observation $i$, $N$ is the number of observations, $p$ is the number of predictors, and $t \ge 0$ is the LASSO parameter.
If $t > \sum_{j=1}^{p} |\hat{\beta}^{0}_{j}|$, where $\hat{\beta}^{0}_{j}$ denotes the ordinary least squares (OLS) estimate, then the LASSO algorithm yields the same estimate as the OLS estimate.
However, if $0 < t < \sum_{j=1}^{p} |\hat{\beta}^{0}_{j}|$, then the problem is equivalent to:

$$\hat{\beta} = \arg\min_{\beta} \Big[ \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big],$$

where $\lambda > 0$. The relation between $\lambda$ and the LASSO parameter $t$ is one-to-one. Due to the nature of the constraint, LASSO tends to produce some coefficients that are exactly zero. Compared to OLS, whose estimated coefficient $\hat{\beta}^{0}$ is an unbiased estimator, both ridge regression and LASSO accept a little bias in order to reduce the variance of the predicted values and improve the overall prediction accuracy. In the past decade, LASSO has been widely applied in many different ways and variants (Tibshirani et al. 2005; Colombani et al. 2013; Yamada et al. 2014; Toiviainen et al. 2014; Connor et al. 2015).
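The tendency of LASSO to set some coefficients exactly to zero can be illustrated with a minimal sketch of coordinate descent on the penalized form above. This is not the SAS procedure used in this study; the function names, the toy data, and the value of `lam` are illustrative assumptions, and the sketch omits the intercept and any standardization of the predictors.

```python
def soft_threshold(rho, threshold):
    """Shrink rho toward zero by threshold; return 0 inside [-threshold, threshold]."""
    if rho > threshold:
        return rho - threshold
    if rho < -threshold:
        return rho + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise sum_i (y_i - sum_j x_ij b_j)^2 + lam * sum_j |b_j| (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            z = sum(X[i][j] ** 2 for i in range(n))
            # The factor 1/2 comes from differentiating the squared residuals.
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# Hypothetical toy data: y depends almost entirely on the first predictor.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.1], [4.0, -0.4], [5.0, 0.2]]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta_unpenalized = lasso_coordinate_descent(X, y, lam=0.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=5.0)
print(beta_unpenalized, beta_lasso)
```

With `lam=0.0` both coefficients are non-zero, whereas with `lam=5.0` the second coefficient is removed and the first is shrunk slightly, which is the screening behavior this study relies on.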

Neural networks (NN)
Neural networks refer to information processing systems that simulate bio-neural networks. They use a large number of connected artificial neurons in order to simulate the capacity of neural networks (Anandarajan and Anandarajan 1999; Tasi and Wu 2008; Korol 2013). Since NN is equipped with the functions of high-speed calculation and noise filtering, it is capable of solving many sophisticated classification and forecasting issues. The most common NN model has three layers: input layer, hidden layer, and output layer. The input layer is used to receive variables. The hidden layer is made up of neurons, and its major purpose is to increase the complexity of neural networks, so that they can simulate complicated non-linear relations. The output layer generates the prediction results after processing. The three layers of the NN model are illustrated in Fig. 1.
The multilayer perceptron (MLP) network is a function of one or more predictors that minimizes the prediction error of one or more targets. Predictors and targets can be a mix of categorical and continuous fields. The general architecture for MLP networks starts from the input layer, which has $J_0 = P$ units, $a_{0:1}, \ldots, a_{0:J_0}$, with $a_{0:j} = x_j$. The training proceeds through at least one complete pass of the data, and the search is then stopped according to the stopping criteria. Here, $Y^{(m)}$ is the target vector for pattern $m$; $I$ is the number of layers, discounting the input layer; $J_i$ is the number of units in layer $i$, with $J_0 = P$ and $J_I = R$, discounting the bias unit; $\Gamma^{c}$ and $\Gamma$ are the sets of categorical outputs and continuous outputs, respectively; $\Gamma_h$ is the set of sub-vectors of $Y^{(m)}$ containing the 1-of-$c$ coded $h$th categorical field; and $w_{i:j,k}$ is the weight leading from layer $i-1$, unit $j$ to layer $i$, unit $k$. No weights connect $a^{m}_{i-1:j}$ and the bias $a^{m}_{i:0}$; that is, there is no $w_{i:j,0}$ for any $j$. Finally, $\gamma_i(c^{m}_{i:k})$ is the activation function for layer $i$, applied to the input $c^{m}_{i:k}$ of unit $k$.
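The layer-by-layer computation can be sketched as a forward pass. The sketch below is a minimal illustration rather than the SPSS Modeler implementation: it assumes that each unit's input is the weighted sum of the previous layer's outputs plus a bias unit fixed at 1, uses tanh as the activation for every layer, and takes hypothetical weights and inputs.

```python
import math


def forward(x, layers, activation=math.tanh):
    """Propagate input x through the layers of an MLP.

    Each layer is a list of units; each unit is a list of weights
    [w_bias, w_1, ..., w_J] over the previous layer's outputs.
    """
    a = list(x)
    for units in layers:
        a_with_bias = [1.0] + a  # assumed bias unit, fixed at 1
        # Weighted input of each unit, then the activation function.
        a = [activation(sum(w * v for w, v in zip(unit, a_with_bias)))
             for unit in units]
    return a


# Hypothetical weights: 4 inputs -> 2 hidden units -> 1 output unit.
hidden = [[0.1, 0.5, -0.2, 0.3, 0.8], [-0.3, 0.2, 0.4, -0.6, 0.1]]
output = [[0.05, 1.2, -0.7]]
score = forward([0.6, 0.1, 1.5, 0.08], [hidden, output])
print(score)
```

Training would adjust the weights to reduce the prediction error; only the evaluation of a fixed network is shown here.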

Support vector machine (SVM)
Support vector machine (SVM) was developed by Boser et al. (1992) to provide better solutions than other traditional classifiers, such as neural networks. SVM is a type of maximal margin classifier, in which the classification problem can be represented as an optimization process that finds the maximum-margin hyper-plane from a given training dataset $D$, described by:

$$D = \{ (x_i, y_i) \mid x_i \in \mathbb{R}^{p},\ y_i \in \{-1, 1\} \}_{i=1}^{n},$$

where $y_i$ is either 1 or −1, and $n$ is the number of training data. Each $x_i$ is a $p$-dimensional real vector of feature values. Any hyper-plane can be written as:

$$w \cdot x - b = 0,$$

where $w$ is the normal vector to the hyper-plane. If the training data are linearly separable, then the two hyper-planes bounding the margin can be described as:

$$w \cdot x - b = 1 \quad \text{and} \quad w \cdot x - b = -1.$$

The distance between these two hyper-planes is $2/\|w\|$, and so the purpose is to minimize $\|w\|$. Therefore, the algorithm can be rewritten as:

$$\min_{w,b} \|w\| \quad \text{under the condition of} \quad y_i (w \cdot x_i - b) \ge 1, \ \text{for any } 1 \le i \le n.$$

We can also reformulate the problem without changing the solution as:

$$\min_{w,b} \frac{1}{2}\|w\|^{2} \quad \text{under the condition of} \quad y_i (w \cdot x_i - b) \ge 1, \ \text{for any } 1 \le i \le n.$$

The hyper-plane, or a set of hyper-planes, can be used as the separating boundary in a classification. The SVM approach has recently been used in several financial applications (Martens et al. 2008; Tasi 2008; Li and Sun 2009; Chen et al. 2014; Yeh et al. 2010, 2014).
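The link between the constraint and the margin can be checked numerically. The following minimal sketch, with hypothetical toy data and hand-picked candidate hyper-planes, verifies that two candidates satisfy $y_i (w \cdot x_i - b) \ge 1$ and shows that the one with the smaller $\|w\|$ has the wider margin. It does not solve the optimization problem itself.

```python
import math


def margin_width(w):
    """Distance 2 / ||w|| between the hyper-planes w.x - b = 1 and w.x - b = -1."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


def satisfies_constraints(w, b, X, y):
    """Check y_i (w . x_i - b) >= 1 for every training point (labels are +1 or -1)."""
    return all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) - b) >= 1.0
        for xi, yi in zip(X, y)
    )


# Hypothetical, linearly separable toy data in two dimensions.
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]

# Two hand-picked candidate hyper-planes (w, b); both separate the data.
wide = ([0.5, 0.5], 1.0)
narrow = ([1.0, 1.0], 2.0)

for w, b in (wide, narrow):
    print(w, b, satisfies_constraints(w, b, X, y), margin_width(w))
```

Both candidates are feasible, so the optimization prefers the first one because its margin is wider.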

Classification and regression tree (CART)
Classification and regression tree (CART) is a flexible method to describe how the variable $Y$ is distributed after assigning the forecast vector $X$ (Patil et al. 2012). It is able to classify huge amounts of data according to the division rule so as to identify valid data and thereby achieve ideal results (Kirkos et al. 2007a, b; Salehi and Fard 2013; Kim and Upneja 2014; Marsala and Petturiti 2015). CART uses a binary tree to divide the forecast space into subsets within which the distribution of the target variable is as homogeneous as possible. The "leaf" nodes correspond to different division areas that are determined by the splitting rules relating to each internal node. By moving from the tree root to a leaf node, any forecast sample is assigned to exactly one leaf node. This algorithm uses the Gini index to determine on which attribute the branch should be generated. The building process of the model is to choose the attribute whose Gini index is a minimum after splitting. For a sample set $T$ in which $p_j$ is the proportion of samples belonging to class $j$, it can be described as:

$$\mathrm{Gini}(T) = 1 - \sum_{j} p_j^{2}.$$

Let $X$ be divided into $n$ subsets, $\{T_1, T_2, \ldots, T_n\}$. Among them, $T_i$'s sample number is $n_i$, and $N = \sum_{i=1}^{n} n_i$. Thus, the Gini index divided according to property $X$ is described as:

$$\mathrm{Gini}_X(T) = \sum_{i=1}^{n} \frac{n_i}{N}\, \mathrm{Gini}(T_i).$$

CART splits on the property that leads to the minimum value after the division.
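The split criterion can be sketched as follows. This is a minimal illustration of choosing a binary threshold on a single attribute by the lowest weighted Gini index, not the SPSS Modeler implementation; the function names and the debt-ratio values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_j p_j^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(subsets):
    """Size-weighted Gini index of a partition {T_1, ..., T_n}."""
    total = sum(len(t) for t in subsets)
    return sum(len(t) / total * gini(t) for t in subsets)


def best_threshold(values, labels):
    """Return (threshold, gini index) of the best binary split on one attribute."""
    best = None
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = gini_split([left, right])
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical debt ratios and class labels for five firms.
debt_ratio = [0.3, 0.4, 0.5, 0.8, 0.9]
status = ["NGCD", "NGCD", "NGCD", "GCD", "GCD"]
print(best_threshold(debt_ratio, status))
```

A full tree repeats this search over every candidate attribute at each node until a stopping rule, such as a maximum depth, is reached.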

Data collection and sampling
Research samples are drawn from GCD and NGCD firms in Taiwan from 2002 to 2013. Forty-eight GCD firms are selected from all the listed companies in the Taiwan Economic Journal (TEJ) Data Bank. We adopt the 1-by-3 pairing technique in order to match 144 NGCD firms. Thus, there are 192 firms in total that serve as our research sample of GCD and NGCD firms, as shown in Table 1. Based on the indicators selected in prior studies on GCD (Anandarajan and Anandarajan 1999; Behn et al. 2001; Kirkos et al. 2007a, b; Martens et al. 2008; Yeh et al. 2014), we prepare a set of 22 variables, as displayed in Table 2. These indicators are available in the TEJ database.
Given the number of samples, in order to avoid having too few samples in the test group and to improve test accuracy, we randomly draw 5 subsets from our original sample set and conduct fivefold cross validation.
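One common way to carry out fivefold cross validation on a sample of this size is sketched below. The exact resampling procedure of this study is not reproduced here; the stratification by class, the fixed seed, and the function name are assumptions made for illustration.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class out to the folds in turn.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Sample composition taken from the text: 48 GCD and 144 NGCD firms.
labels = ["GCD"] * 48 + ["NGCD"] * 144
folds = stratified_folds(labels)

# Each fold serves once as the test set; the other four form the training set.
splits = []
for test_idx in folds:
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    splits.append((train_idx, test_idx))

print([len(test) for _, test in splits])
```

Each model is then trained on the training indices and evaluated on the held-out fold, and the five test results are averaged.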

Model development
This study begins by reducing the indicators using the LASSO screening method. The screened variables serve as the input variables for NN, CART, and SVM. Next, the study carries out model training and testing with every method. Finally, the study compares the classification performance of the models and provides relevant suggestions based on the analytic results. Model construction is divided into three parts. The first part is replacement sampling; the second part is the LASSO feature selection; and the third part compares the test results of the three kinds of classification models. The research process of this study is shown in Fig. 2.

Important variable screening
While constructing the classification model, many variables may be included, but not all of these variables are actually important. Therefore, unimportant variables need to be eliminated in order to construct a simpler classification model. There are quite a number of ways to screen variables, of which the LASSO algorithm has shown excellent performance in reducing variables (Connor et al. 2015). This study therefore adopts the suggestions of Connor et al. (2015) and screens the important indicators using the LASSO technique in order to retain only input variables with a significant influence. We employ the LASSO available in the SAS software to calculate the AIC values and coefficients of variable importance. The input variables of the study are screened using LASSO to acquire the results shown in Table 3 and Figs. 3, 4, 5, 6 and 7. This study proposes a GCD prediction model for CPAs. Thus, the study adopts as input variables the indicators that were selected in each screening process (Work-Groups 1-5). The important variables selected by using LASSO include: X4 (Debt ratio), X6 (Undistributed surplus), X20 (Total assets turnover), and X22 (Return on assets; ROA).

Table 2 (recovered rows X17 to X22)
X17 Change of CPA firm (CPA) or not: 1 is for change; 0 is for non-change. Sources: Anandarajan and Anandarajan (1999), Yeh et al. (2014) and Chen and Lee (2015)
X18 Current liabilities: Natural logarithm of current liabilities. Source: Salehi and Fard (2013)
X19 Operating income: Natural logarithm of operating income. Source: Salehi and Fard (2013)
X20 Total assets turnover: Net Sales/Average total assets. Sources: Sun and Li (2008) and Sun et al. (2011)
X21 Earnings before interest and tax (EBIT). Source: Salehi and Fard (2013)
X22 Return on assets (ROA)

X4 (Debt ratio: Total liabilities/Total assets) is an important measure of the capital structure of a company. Generally, capital is sourced from stockholders or external financing. Debt financing provides leverage that can increase the return on investment.
Moreover, interest costs are tax-deductible, and thus financing has numerous advantages, but if debt is high, then financial leverage may increase risk. If a firm's operations are not as good as expected, then bankruptcy may occur. X6 (Undistributed surplus) is net income after the appropriation of legal and special surplus reserves and can be used for cash dividends, expansion, or R&D. X20 (Total assets turnover: Net Sales/Average total assets) is an important measure of the operating quality and utilization efficiency of corporate assets. The greater the turnover rate is, the faster the turnover of total assets, and the stronger the sales ability. X22 (Return on assets (ROA): [Net income + interest expense × (1 − tax rate)]/Average total assets) shows, as a percentage, how profitable a company is relative to its assets.
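The three ratio definitions above translate directly into code. The sketch below is illustrative only: the figures for the firm are hypothetical, and average total assets is assumed to be the mean of the opening and closing balances, which the text does not specify.

```python
def debt_ratio(total_liabilities, total_assets):
    """X4: Total liabilities / Total assets."""
    return total_liabilities / total_assets


def average_assets(assets_begin, assets_end):
    """Assumed definition: mean of the opening and closing total assets."""
    return (assets_begin + assets_end) / 2.0


def total_assets_turnover(net_sales, assets_begin, assets_end):
    """X20: Net sales / Average total assets."""
    return net_sales / average_assets(assets_begin, assets_end)


def return_on_assets(net_income, interest_expense, tax_rate, assets_begin, assets_end):
    """X22: [Net income + interest expense x (1 - tax rate)] / Average total assets."""
    adjusted_income = net_income + interest_expense * (1.0 - tax_rate)
    return adjusted_income / average_assets(assets_begin, assets_end)


# Hypothetical firm (monetary amounts in the same arbitrary unit).
x4 = debt_ratio(total_liabilities=600.0, total_assets=1000.0)
x20 = total_assets_turnover(net_sales=1500.0, assets_begin=900.0, assets_end=1100.0)
x22 = return_on_assets(net_income=80.0, interest_expense=25.0, tax_rate=0.2,
                       assets_begin=900.0, assets_end=1100.0)
print(x4, x20, x22)
```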
This study subsequently takes the 4 variables above as new input predictors in order to construct a prediction/classification model. The descriptive statistics and correlations of the input variables are shown in Tables 4 and 5.

Classification model
This study employs IBM SPSS Modeler 14.0 to build the NN, CART, and SVM classification models. The cross-validation results of the training and testing subsets are shown in Tables 6, 7 and 8.

LASSO-NN model
The NN model is set as follows: the model type is Multilayer Perceptron (MLP) with one hidden layer, and training stops after a maximum of 250 cycles. The LASSO-NN model classification results are shown in Table 6.
On average, 9 of the 72 NGCD samples are incorrectly classified as GCD, and the Type I error rate is 12.22 %. In addition, 22 of the 24 GCD samples are correctly classified, while the remaining 2 GCD samples are incorrectly classified as NGCD. The Type II error rate is 7.50 %. The weight of each node and the importance of the variables are shown in Figs. 8 and 9.

LASSO-CART model
This study constructs the LASSO-CART model, sets the maximum depth at 5, and adopts the Gini index as the impurity measure for categorical targets. The forecast results of the LASSO-CART prediction model are shown in Table 7. On average, 62 of the 72 NGCD samples are correctly classified, while 10 of them are incorrectly classified as GCD, for a Type I error rate of 13.61 %. On the other hand, 20 of the 24 GCD samples are correctly classified, with the remaining 4 GCD samples incorrectly classified as NGCD. The Type II error rate is 14.17 %.

LASSO-SVM model
In terms of the LASSO-SVM model, the kernel type is set at "Linear", the stopping criterion is set at 1.0E−3, the regularization parameter is set at 10, and the regression precision is set at 0.1. The LASSO-SVM classification results are shown in Table 8. On average, 66 of the 72 NGCD samples are correctly classified, while 6 of them are incorrectly classified as GCD. The Type I error rate is 10.00 %. In addition, 20 of the 24 GCD samples are correctly classified, with the remaining 4 GCD samples incorrectly classified as NGCD. The Type II error rate is 15.83 %.

Model comparison and statistical test
According to the empirical results (Tables 6, 7, 8), the prediction accuracy of the LASSO-NN model is 88.96 % (Type I error rate is 12.22 %; Type II error rate is 7.50 %), the prediction accuracy of the LASSO-CART model is 88.75 % (Type I error rate is 13.61 %; Type II error rate is 14.17 %), and the prediction accuracy of the LASSO-SVM model is 89.79 % (Type I error rate is 10.00 %; Type II error rate is 15.83 %). Our comparison follows that of Kirkos et al. (2007a, b), Tasi and Huang (2010) and Chen et al. (2014). We not only focus on the hit ratio of the models, but also consider the Type I error and Type II error rates.
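The three measures compared here can be computed from actual and predicted labels as in the following minimal sketch. It follows the usage of this study, in which a Type I error is an NGCD firm classified as GCD and a Type II error is a GCD firm classified as NGCD. The counts are hypothetical and are not taken from Tables 6, 7 and 8.

```python
def error_rates(actual, predicted):
    """Return (accuracy, Type I error rate, Type II error rate)."""
    pairs = list(zip(actual, predicted))
    ngcd_total = sum(1 for a, _ in pairs if a == "NGCD")
    gcd_total = sum(1 for a, _ in pairs if a == "GCD")
    # Type I: an NGCD firm is wrongly flagged as GCD.
    type_i = sum(1 for a, p in pairs if a == "NGCD" and p == "GCD") / ngcd_total
    # Type II: a GCD firm is wrongly passed as NGCD.
    type_ii = sum(1 for a, p in pairs if a == "GCD" and p == "NGCD") / gcd_total
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    return accuracy, type_i, type_ii


# Hypothetical test set: 100 NGCD firms (10 misclassified) and 50 GCD firms (5 misclassified).
actual = ["NGCD"] * 100 + ["GCD"] * 50
predicted = ["NGCD"] * 90 + ["GCD"] * 10 + ["GCD"] * 45 + ["NGCD"] * 5

accuracy, type_i, type_ii = error_rates(actual, predicted)
print(accuracy, type_i, type_ii)
```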
Unlike past works, which typically use Type I errors to judge the performance of a forecasting model, GCP studies prefer to use Type II errors to determine the performance of forecasting models. In order to test whether the prediction models differ significantly, this study uses the Wilcoxon two-sample test and the Kruskal-Wallis test, with the results shown in Table 9. The test results reveal a significant difference among the LASSO-NN, LASSO-CART, and LASSO-SVM prediction models.
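The Kruskal-Wallis statistic for three groups can be sketched as follows. This is a minimal illustration, not the software procedure used in this study: the group scores are hypothetical, no tie correction is applied, and the p-value uses the chi-square approximation with 2 degrees of freedom, for which the upper tail probability is exp(−H/2).

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for q in range(pos, end + 1):
            ranks[order[q]] = mean_rank
        pos = end + 1
    return ranks


def kruskal_wallis_three_groups(groups):
    """Return (H, p) for exactly three groups, without tie correction."""
    if len(groups) != 3:
        raise ValueError("this sketch only handles three groups (2 degrees of freedom)")
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * h - 3.0 * (n_total + 1)
    return h, math.exp(-h / 2.0)


# Hypothetical per-fold scores for three models (not the results of this study).
group_a = [0.90, 0.88, 0.91, 0.87, 0.89]
group_b = [0.80, 0.82, 0.79, 0.81, 0.83]
group_c = [0.95, 0.94, 0.96, 0.93, 0.97]

h_stat, p_value = kruskal_wallis_three_groups([group_a, group_b, group_c])
print(h_stat, p_value)
```

A small p-value indicates that at least one group tends to score differently from the others; the Wilcoxon two-sample test plays the same role for a pair of groups.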

Conclusions
Certified public accountants (CPAs) and auditors check firms' financial statements and issue their audit opinions and audit reports. These audit opinions and audit reports are very important for enterprises, stakeholders, and financial markets, especially investors. Thus, it is necessary to establish more accurate going concern doubt prediction models. The purpose of this study is to set up rigorous and reliable going concern doubt prediction models for auditors. This study applies the least absolute shrinkage and selection operator (LASSO) and data mining techniques (NN, CART, and SVM) to establish the prediction models.
According to the empirical results, the prediction accuracy is 88.96 % for the LASSO-NN model, 88.75 % for the LASSO-CART model, and 89.79 % for the LASSO-SVM model. This study uses LASSO to select important variables, which include: X4 (Debt ratio), X6 (Undistributed surplus), X20 (Total assets turnover), and X22 (Return on assets; ROA). As such, a firm's top management, CPAs, and auditors all should pay close attention to them.
Type I errors may not have serious consequences when compared to Type II errors. If the auditor wrongly classifies a firm with going concern doubt as healthy, then he/she can be sued. If an auditor issues a wrong audit report due to his/her misjudgment, then this will affect not only the enterprise and stakeholders, but also many investors. The costs of Type II errors are thus rather severe. We have developed three GCD prediction models. In the LASSO-NN model, the Type I error rate is 12.22 % and the Type II error rate is 7.50 %; in the LASSO-CART model, the Type I error rate is 13.61 % and the Type II error rate is 14.17 %; and in the LASSO-SVM model, the Type I error rate is 10.00 % and the Type II error rate is 15.83 %. These error rates are all lower than 20 %, and in the LASSO-NN model the Type II error rate is only 7.50 %. This is a key contribution of this paper. Finally, the empirical results of this study can provide a reference for enterprises' top management, CPAs, auditors, and future studies.

Table 9 Statistical tests
Pr > Chi-square: 0.0413**, 0.0328**
* Significant at P < 0.1; ** significant at P < 0.05; *** significant at P < 0.01

Limitations
There are several limitations in this study. 1. The financial market in Taiwan is smaller than those of China, the U.S., the UK, the EU, Japan, etc. 2. The Taiwan government has strict control over the listed companies and the financial market; thus, GCD listed companies are fewer. 3. If the GCD prediction models are used in countries other than Taiwan, then the GCD indicators (variables) should be measured according to national or regional audit laws, regulations, and financial practice.