The purpose of this study is to establish a two-stage going concern doubt (GCD) prediction model that integrates financial and non-financial indicators. In the first stage, the least absolute shrinkage and selection operator (LASSO) screens the candidate indicators and retains those that are important for GCD. In the second stage, classification models are built with three machine learning techniques: neural networks (NN), decision trees (DT), and support vector machines (SVM). Finally, the models are compared and analyzed in order to identify the one that gives the best going concern prediction results.

### Least absolute shrinkage and selection operator (LASSO)

Stepwise regression has been applied in related work in the past, but stepwise methods have significant problems, which are summarized by Harrell (2001): (1) The \( R^{2} \) values are biased high. (2) The F test statistics do not have the claimed distribution. (3) The standard errors of the parameter estimates are too small. (4) Consequently, the confidence intervals around the parameter estimates are too narrow. (5) The parameter estimates are biased away from zero in absolute value. (6) Collinearity problems are exacerbated.

This study applies LASSO, first proposed by Tibshirani (1996), as a feature selection method. The algorithm minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant.

$$ \mathop {\hat{\beta }^{L} }\limits_{\sim} = \arg \min \left\{ {\sum\limits_{i = 1}^{N} {\left( {y_{i} - \alpha - \sum\limits_{j} {\beta_{j} x_{ij} } } \right)^{2} } } \right\} $$

(1)

$$ \begin{aligned} & {\text{subject to}} \\ & \sum\limits_{j = 1}^{p} {\left| {\hat{\beta }_{j}^{L} } \right|} \le t \\ \end{aligned} $$

(2)

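Here \( t \ge 0 \) is the bound in constraint (2) and \( \hat{\beta }_{j}^{0} \) denotes the OLS estimate of the *j*th coefficient. If \( t > \sum\nolimits_{j = 1}^{p} {\left| {\hat{\beta }_{j}^{0} } \right|} , \) then the constraint is not binding and LASSO yields the same estimate as OLS.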

However, if \( 0 < t < \sum\nolimits_{j = 1}^{p} {\left| {\hat{\beta }_{j}^{0} } \right|} , \) then the problem is equivalent to:

$$ \mathop {\hat{\beta }^{L} }\limits_{\sim} = \arg \min \left[ {\sum\limits_{i = 1}^{N} {\left( {y_{i} - \alpha - \sum\limits_{j} {\beta_{j} x_{ij} } } \right)^{2} + \lambda \sum\limits_{j} {\left| {\beta_{j} } \right|} } } \right] $$

(3)

where λ > 0. We shall show later that the relation between λ and the LASSO parameter t is one-to-one.
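To illustrate the link between λ and t, consider a simple special case that is assumed here purely for exposition and is not taken from the study: a single predictor, with \( y \) and \( x \) centered so that α drops out, and \( x \) scaled so that \( \sum\nolimits_{i} {x_{i}^{2} } = 1 \). The OLS estimate is then \( \hat{\beta }^{0} = \sum\nolimits_{i} {x_{i} y_{i} } \), and minimizing the penalized criterion in Eq. (3) gives the soft-thresholded estimate:

$$ \hat{\beta }^{L} = {\text{sign}}(\hat{\beta }^{0} )\left( {\left| {\hat{\beta }^{0} } \right| - \frac{\lambda }{2}} \right)_{ + } $$

Under these assumptions, the estimate is exactly zero whenever \( \lambda \ge 2\left| {\hat{\beta }^{0} } \right| \). For \( 0 < \lambda < 2\left| {\hat{\beta }^{0} } \right| \), the constraint holds with equality, \( t = \left| {\hat{\beta }^{L} } \right| = \left| {\hat{\beta }^{0} } \right| - \lambda /2 \), so each λ in that range corresponds to exactly one t in \( (0,\left| {\hat{\beta }^{0} } \right|) \).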

Due to the nature of the constraint, LASSO tends to produce some coefficients that are exactly zero, which is what makes it usable for feature selection. The OLS estimate \( \mathop {\beta^{0} }\limits_{\sim} \) is an unbiased estimator of \( \mathop \beta \limits_{\sim} ; \) by comparison, both ridge regression and LASSO accept a small amount of bias in order to reduce the variance of the predicted values and improve the overall prediction accuracy. Over the past decade, LASSO has been widely applied in many different forms and variants (Tibshirani et al. 2005; Colombani et al. 2013; Yamada et al. 2014; Toiviainen et al. 2014; Connor et al. 2015).
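The screening behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: it minimizes the penalized form of the criterion by cyclic coordinate descent with soft-thresholding, and the function names, the synthetic data, and the value of `lam` are all assumptions chosen for the example.

```python
import random


def soft_threshold(rho, threshold):
    """Shrink rho toward zero; values within [-threshold, threshold] become 0."""
    if rho > threshold:
        return rho - threshold
    if rho < -threshold:
        return rho + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize sum_i (y_i - alpha - sum_j beta_j x_ij)^2 + lam * sum_j |beta_j|."""
    n, p = len(X), len(X[0])
    # Center the columns and the target so the intercept can be recovered at the end.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [v - y_mean for v in y]  # residual for beta = 0
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n))
            if z == 0.0:
                continue  # constant column carries no information
            # Inner product of feature j with the residual that excludes feature j.
            rho = sum(Xc[i][j] * (residual[i] + beta[j] * Xc[i][j]) for i in range(n))
            new_beta = soft_threshold(rho, lam / 2.0) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * Xc[i][j]
                beta[j] = new_beta
    alpha = y_mean - sum(b * m for b, m in zip(beta, col_means))
    return alpha, beta


# Synthetic data (hypothetical): only features 0 and 2 influence the target.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[2] + random.gauss(0, 0.5) for row in X]

alpha, beta = lasso_coordinate_descent(X, y, lam=100.0)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", [round(b, 3) for b in beta])
print("selected feature indices:", selected)
```

Indicators whose coefficient is exactly zero are dropped; the rest would be passed to the classifiers in the second stage. In practice λ is usually chosen by cross-validation rather than fixed by hand as it is here.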

### Neural networks (NN)

Neural networks are information processing systems that simulate biological neural networks. They use a large number of connected artificial neurons to imitate the capacity of biological neural networks (Anandarajan and Anandarajan 1999; Tasi and Wu 2008; Korol 2013; Chen et al. 2015). Because an NN offers high-speed computation and de-noising of information, it can solve many sophisticated classification and forecasting problems. The most common NN model has three layers: an input layer, a hidden layer, and an output layer. The input layer receives the variables. The hidden layer consists of neurons, and its main purpose is to increase the complexity of the network so that it can model complicated nonlinear relations. The output layer generates the prediction results after processing. The three layers of the NN model are illustrated in Fig. 1.

The multilayer perceptron (MLP) network is a function of one or more predictors that minimizes the prediction error of one or more targets. Predictors and targets can be a mix of categorical and continuous fields. The general architecture of an MLP network can be described as:

$$ {\text{Input layer:}}\;J_{0} = P\;{\text{units,}}\;a_{0:1} , \ldots ,a_{{0:J_{0} }} ;\;{\text{with}}\;a_{0:j} = x_{j} $$

(4)

$$ i{\text{th hidden layer:}}\;J_{i} \;{\text{units,}}\;a_{i:1} , \ldots ,a_{{i:J_{i} }} ;\;{\text{with}}\;a_{i:k} = \gamma_{i} (c_{i:k} )\;{\text{and}}\;c_{i:k} = \sum\limits_{j = 0}^{{J_{i - 1} }} {w_{i:j,k} a_{i - 1:j} } ,\;{\text{where}}\;a_{i - 1:0} = 1 $$

(5)

$$ {\text{Output layer:}}\;J_{I} = R\;{\text{units,}}\;a_{I:1} , \ldots ,a_{{I:J_{I} }} ;\;{\text{with}}\;a_{I:k} = \gamma_{I} (c_{I:k} )\;{\text{and}}\;c_{I:k} = \sum\limits_{j = 0}^{{J_{I - 1} }} {w_{I:j,k} a_{I - 1:j} } ,\;{\text{where}}\;a_{I - 1:0} = 1 $$

(6)

where \( X^{(m)} = (x_{1}^{(m)} , \ldots ,x_{P}^{(m)} ) \) is the input vector for pattern *m*, *m* = 1, …, *M*; \( Y^{(m)} = (y_{1}^{(m)} , \ldots ,y_{R}^{(m)} ) \) is the target vector for pattern *m*; *I* is the number of layers, discounting the input layer; \( J_{i} \) is the number of units in layer *i*, discounting the bias unit, with \( J_{0} = P \) and \( J_{I} = R \); \( \Gamma ^{c} \) and \( \Gamma \) are the sets of categorical and continuous outputs, respectively; \( \Gamma _{h} \) is a set of sub-vectors of \( Y^{(m)} \) containing the 1-of-*c* coded *h*th categorical field; and \( w_{i:j,k} \) is the weight leading from layer *i* − 1, unit *j* to layer *i*, unit *k*. No weights connect \( a_{i - 1:j}^{m} \) to the bias \( a_{i:0}^{m} \); that is, there is no \( w_{i:j,0} \) for any *j*. Finally, \( c_{i:k}^{m} = \sum\nolimits_{j = 0}^{{J_{i - 1} }} {w_{i:j,k} a_{i - 1:j}^{m} } \), *i* = 1, …, *I*, is the net input to unit *k* of layer *i* for pattern *m*, and \( \gamma_{i} (c) \) is the activation function for layer *i*.

Training proceeds through at least one complete pass of the data and is then stopped according to the stopping criteria.
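The layer-by-layer computation in Eqs. (4)–(6) can be sketched as a forward pass. The example below is only an illustration under stated assumptions: the activation functions (hyperbolic tangent for the hidden layer, logistic sigmoid for the output layer), the network size, and the weight values are hypothetical and are not taken from this study; training of the weights is not shown.

```python
import math


def sigmoid(c):
    """Logistic activation, assumed here for the output layer."""
    return 1.0 / (1.0 + math.exp(-c))


def forward(x, weights, activations):
    """Propagate one input pattern through the network.

    weights[i - 1][k - 1][j] holds w_{i:j,k}; index j = 0 is the weight
    attached to the bias unit a_{i-1:0} = 1.
    """
    a = list(x)  # input layer: a_{0:j} = x_j
    for layer_weights, gamma in zip(weights, activations):
        a_with_bias = [1.0] + a  # prepend the bias unit a_{i-1:0} = 1
        # c_{i:k} = sum_j w_{i:j,k} * a_{i-1:j}
        c = [sum(w * a_j for w, a_j in zip(unit_weights, a_with_bias))
             for unit_weights in layer_weights]
        a = [gamma(c_k) for c_k in c]  # a_{i:k} = gamma_i(c_{i:k})
    return a


# Hypothetical network: P = 2 inputs, one hidden layer with 2 units, R = 1 output.
weights = [
    [[0.0, 1.0, -1.0],    # hidden unit 1: bias weight, then weights from x_1, x_2
     [0.0, -1.0, 1.0]],   # hidden unit 2
    [[-1.0, 2.0, 2.0]],   # output unit: bias weight, then weights from hidden units
]
activations = [math.tanh, sigmoid]

output = forward([0.5, 0.5], weights, activations)
print("network output:", output)
```

Each entry of `weights` corresponds to one layer *i* = 1, …, *I*, so adding a hidden layer only requires appending another weight matrix and activation function.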

### Support vector machine (SVM)

Support vector machine (SVM) was developed by Boser et al. (1992) to provide better solutions than other traditional classifiers, such as neural networks. SVM is a type of maximal margin classifier, in which the classification problem can be represented as an optimization process, which finds the maximum-margin hyper-plane from a given training dataset D as described by:

$$ D = \left\{ {(x_{i} ,y_{i} )\left| {x_{i} \in {\mathbb{R}}^{p} ,y_{i} \in \{ - 1,1\} } \right.} \right\}_{i = 1}^{n} $$

(7)

where \( y_{i} \) is either −1 or 1 and indicates the class of \( x_{i} \), and n is the number of training samples. Each \( x_{i} \) is a p-dimensional real-valued feature vector. Any hyper-plane can be written as:

$$ w \cdot x - b = 0 $$

(8)

where w is the normal vector to the hyper-plane. If the training data are linearly separable, then two parallel hyper-planes that separate the two classes, with no training points between them, can be described as:

$$ w \cdot x - b = 1\;{\text{and}}\;w \cdot x - b = - 1 $$

(9)

The distance between these two hyper-planes is \( 2 /\left\| w \right\|, \) and so maximizing the margin amounts to minimizing \( \left\| w \right\| \). Therefore, the algorithm can be rewritten as:

$$ {\text{Minimize:}}\;\left\| w \right\|,{\text{ under the condition of}}\;y_{i} (w \cdot x_{i} - b) \ge 1,{\text{ for any}}\; 1\le {\text{i}} \le {\text{n}} $$

(10)

We can also reformulate the equation without changing the solution as:

$$ {\mathop{\arg \min}\limits_{(w,b)}} \frac{1}{2}\left\| w \right\|^{2} ,\,{\text{under the condition of}}\;y_{i} (w \cdot x_{i} - b) \ge 1,{\text{ for any}}\;1 \le {\text{i}} \le {\text{n}} $$

(11)

The resulting hyper-plane, or set of hyper-planes, serves as the separating boundary in a classification task. The SVM approach has recently been used in several financial applications (Martens et al. 2008; Tasi 2008; Li and Sun 2009; Chen et al. 2014; Yeh et al. 2010, 2014).
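The role of the constraint and of the margin \( 2/\left\| w \right\| \) can be made concrete with a small numerical check. The sketch below does not train an SVM; it only evaluates hand-picked candidate pairs (w, b) on a toy data set, and the data, the candidates, and the function names are all hypothetical.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def is_feasible(w, b, data):
    """True if y_i * (w . x_i - b) >= 1 holds for every training pair."""
    return all(y * (dot(w, x) - b) >= 1.0 for x, y in data)


def margin_width(w):
    """Distance 2 / ||w|| between the hyper-planes w.x - b = 1 and w.x - b = -1."""
    return 2.0 / math.sqrt(dot(w, w))


def classify(w, b, x):
    """Assign the class according to the side of the hyper-plane w.x - b = 0."""
    return 1 if dot(w, x) - b >= 0 else -1


# Toy, linearly separable training set: (feature vector, label in {-1, 1}).
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Hand-picked candidates, not the output of an optimizer.
candidates = {
    "wide": ((0.5, 0.5), 1.0),
    "narrow": ((1.0, 1.0), 2.0),
    "too_wide": ((0.25, 0.25), 0.5),
}

for name, (w, b) in candidates.items():
    print(name, "feasible:", is_feasible(w, b, data),
          "margin:", round(margin_width(w), 3))
```

Among the feasible candidates, the one with the smaller norm of w has the wider margin, which is why the optimization minimizes \( \left\| w \right\| \) subject to the constraints. The third candidate has an even smaller norm but violates the constraints, so it is not admissible.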

### Classification and regression tree (CART)

Classification and regression tree (CART) is a flexible method for describing how the variable Y is distributed given the predictor vector X (Patil et al. 2012). It classifies large amounts of data according to division rules so as to identify valid data and thereby achieve good results (Kirkos et al. 2007a, b; Salehi and Fard 2013; Kim and Upneja 2014; Marsala and Petturiti 2015). CART uses a binary tree to divide the predictor space into subsets within which the distribution of the target variable is as homogeneous as possible. The “leaf” nodes correspond to the different regions, which are determined by the splitting rules attached to each internal node. By moving from the root of the tree down to a leaf, every sample is assigned to exactly one leaf node.

This algorithm uses the Gini index to determine on which attribute a branch should be generated. The model is built by choosing, at each node, the attribute whose Gini index after splitting is the smallest. For a node containing the sample set T, the Gini index is:

$$ GINI(T) = 1 - \sum\limits_{i = 1}^{m} {P_{i}^{2} } $$

(12)

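where \( P_{i} \) is the proportion of samples in T that belong to class *i* and *m* is the number of classes. Let attribute X divide T into *n* subsets, \( \{ T_{1} ,T_{2} , \ldots ,T_{n} \} , \) where \( T_{i} \) contains \( n_{i} \) samples and \( N = \sum\nolimits_{i = 1}^{n} {n_{i} } \) is the total number of samples in T. The Gini index of the split on attribute X is then the size-weighted sum of the Gini indices of the subsets:

$$ GINI_{X} (T) = \sum\limits_{i = 1}^{n} {\frac{{n_{i} }}{N}GINI(T_{i} )} $$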

(13)

CART splits on the attribute that yields the minimum Gini index after the split.
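The attribute selection step can be sketched as follows. This is a minimal illustration rather than a full CART implementation: it scores one split per attribute by partitioning on the attribute's values and computing the size-weighted average of the subset impurities, which is the quantity that is smallest for the purest split. The records, the attribute names, and the labels are invented for the example.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(subsets):
    """Size-weighted average of the Gini impurity over the subsets of a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)


def split_by_attribute(rows, labels, attribute):
    """Group the labels by the value each row takes on the given attribute."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    return list(groups.values())


def best_attribute(rows, labels, attributes):
    """Return the attribute with the smallest Gini index after splitting."""
    return min(attributes,
               key=lambda a: gini_split(split_by_attribute(rows, labels, a)))


# Hypothetical toy records; neither the attributes nor the labels come from the study.
rows = [
    {"negative_equity": "yes", "big4_auditor": "yes"},
    {"negative_equity": "yes", "big4_auditor": "no"},
    {"negative_equity": "yes", "big4_auditor": "yes"},
    {"negative_equity": "no", "big4_auditor": "no"},
    {"negative_equity": "no", "big4_auditor": "yes"},
    {"negative_equity": "no", "big4_auditor": "no"},
]
labels = ["GC", "GC", "GC", "non-GC", "non-GC", "GC"]

attributes = ["negative_equity", "big4_auditor"]
scores = {a: gini_split(split_by_attribute(rows, labels, a)) for a in attributes}
chosen = best_attribute(rows, labels, attributes)
print("Gini before split:", round(gini(labels), 4))
print("Gini after split:", {a: round(s, 4) for a, s in scores.items()})
print("chosen attribute:", chosen)
```

A complete tree would apply the same selection recursively to each resulting subset until a stopping rule is met; CART additionally restricts every split to two branches.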