Transformer fault diagnosis using continuous sparse autoencoder
 Lukun Wang^{1},
 Xiaoying Zhao^{2},
 Jiangnan Pei^{3} and
 Gongyou Tang^{1}
Received: 10 December 2015
Accepted: 5 April 2016
Published: 14 April 2016
Abstract
This paper proposes a novel continuous sparse autoencoder (CSAE) that can be used for unsupervised feature learning. The CSAE adds a Gaussian stochastic unit to the activation function to extract features from nonlinear data. In this paper, the CSAE is applied to the problem of transformer fault recognition. Firstly, based on the dissolved gas analysis method, the IEC three ratios are calculated from the concentrations of dissolved gases. The IEC three-ratio data are then normalized to reduce data singularity and improve training speed. Secondly, a deep belief network is built from two layers of CSAE and one layer of back propagation (BP) network. Thirdly, the CSAE is trained without supervision to extract features, and the BP network is then trained with supervision to identify the transformer fault. Finally, experimental data from the IEC TC 10 dataset are used to illustrate the effectiveness of the presented approach. Comparative experiments show that the CSAE can extract features from the original data and achieves a superior correct differentiation rate in transformer fault diagnosis.
Background
The transformer is one of the most important pieces of equipment in a power network, and its failure brings huge economic losses to the network. Periodic monitoring of the condition of the transformer is therefore necessary. Many methods are used for detecting power failures, such as the oil breakdown voltage test, the resistivity test and moisture analysis of transformer oil (Saha 2003). Among these methods, dissolved gas analysis (DGA) is the most widely used (Arakelian 2004). This method diagnoses the transformer fault by analyzing the concentrations of gases dissolved in transformer oil (Duval 2003). The gases in transformer oil mainly include hydrocarbons, such as methane (CH_{4}), ethane (C_{2}H_{6}), ethylene (C_{2}H_{4}) and acetylene (C_{2}H_{2}), and other gases, such as hydrogen (H_{2}) and carbon dioxide (CO_{2}). In recent years, researchers have proposed transformer fault diagnosis methods including particle swarm optimization (Ballal et al. 2013), support vector machine (Chen et al. 2009), fuzzy learning vector quantization network (Yang et al. 2001) and back propagation (BP) neural network (Patel and Khubchandani 2004). Miranda et al. (2012) built a diagnosis system based on a set of autoassociative neural networks to diagnose the faults of power transformers. The information theoretic mean shift (ITMS) algorithm was adopted to densify the data clusters. Dhote and Helonde (2012) proposed a new five fuzzy ratios method and developed a fuzzy diagnostic expert system to diagnose the transformer fault. Souahlia et al. (2012) combined the Rogers and Doernenburg ratios to form the gas signature, and a multilayer perceptron neural network was applied for decision making. Bhalla et al. (2012) applied a pedagogical approach for rule extraction from function approximating ANN (REFANN). REFANN derives linear equations by approximating the hidden unit activation function and splitting the input space into subregions. Ren et al. (2010) used rough set theory to reduce the complexity of the training samples, which increased the speed of learning and training. A quantum neural network was then applied as the classifier for transformer fault diagnosis.
In 1996, sparse coding was proposed by Olshausen and Field (1996), who showed that the receptive fields of simple cells in the mammalian primary visual cortex could learn higher-level representations from outside input signals (Vinje and Gallant 2000). Afterwards, the autoencoder was proposed to learn higher-level features. In 2006, a new neural network model called the deep belief network (DBN) was proposed by Hinton and Salakhutdinov (2006) (Cottrell 2006). With the development of deep learning theory, the DBN has become widely used in many AI areas (Le Roux and Bengio 2010).
According to Bengio et al. (2006), a DBN can be successfully built from autoencoders (AE). They used the AE as the basic model of the DBN. With this structure, training on handwritten digit recognition achieved an accuracy of more than 99 %. This showed that the AE can replace the restricted Boltzmann machine (RBM) as the basic element of a DBN. Vincent et al. (2008) proposed the denoising autoencoder (DAE), which can be applied to corrupted data. The DAE learns to project corrupted data back onto the manifold, which makes the learned features more robust. On this basis, Vincent et al. (2010) introduced the stacked denoising autoencoder (SDAE) by stacking several layers of DAE with the category constraint. At present, the AE has been successfully applied to speech recognition (Dahl et al. 2012), handwritten digit recognition, natural language processing (Glorot et al. 2011), etc.
Current research on transformer fault diagnosis that uses neural networks as the classification algorithm is mainly based on single-layer neural networks. In this paper, instead of a single-layer neural network, a deep network composed of multiple layers of continuous sparse autoencoder (CSAE) is designed to solve the problem of transformer fault recognition. The second section describes the DGA method and introduces the relationship between the transformer fault classes and the concentrations of five fault gases. In the third section, the basic autoencoder is briefly reviewed and a new continuous sparse autoencoder is proposed to extract the features of nonlinear data. In the fourth section, several experiments are designed to verify the validity of the CSAE. The last section concludes our work and points out future directions.
Dissolved gas analysis
DGA is an analytic technique based on detecting the gases dissolved in transformer oil. When a transformer breaks down, the insulating materials release small amounts of hydrocarbons. The concentrations of these hydrocarbons can be used to classify electrical faults. The gases generated by the transformer therefore carry useful information and can be applied to the diagnosis of electrical equipment.
Fault classification
Symbol  Transformer fault 

PD  Partial discharges 
LED  Low energy discharge 
HED  High energy discharge 
TF1  Thermal faults <700 °C 
TF2  Thermal faults >700 °C 
Gas importance by faults
Cause of gas generation  H_{2}  CH_{4}  C_{2}H_{6}  C_{2}H_{4}  C_{2}H_{2} 

Electrical fault  
PD  ●  ○  
LED  ●  ●  
HED  ●  ○  ●  
Thermal fault  
TF1  ○  ●  ●  ●  
TF2  ○  ○  ●  ○ 
Methods
DBN model

Step 1 Each layer of AE is used for unsupervised feature learning. During training, each layer of AE extracts different features from the input data. These features are stored in the feature vector W. In this step, each layer is optimized on its own; the optimization does not cover the entire DBN.

Step 2 One layer of BP neural network is set at the bottom layer of the DBN. The reason for adding this BP layer is to receive the trained AE weights. After the unsupervised training of the AE layers, the BP layer calculates the error between the DBN output and the expected output. The error is passed back to the preceding AE layers, and the weight matrices of the whole DBN are updated accordingly. This process is repeated, up to the set number of epochs, until the error converges. In this way the learned features are fine-tuned.
DBN overcomes two disadvantages of the single-layer neural network: falling into local optima and long training time.
Basic autoencoder
In the basic autoencoder, \(x_{i}, i \in \{1,\ldots,n\}\) is the input, \(h_{j}, j \in \{1,\ldots,k\}\) is the value of the hidden units, \(\hat{x}_{i}, i \in \{1,\ldots,n\}\) is the output, which is trained to match the input, and \(W^{(l)}, l \in \{1,2\}\) denotes the weight matrices. The AE tries to learn a function \(h_{W,b} \left( x \right) \approx x\) so that \(\hat{x}\) approximates x. Here \(h_{W,b} \left( x \right)\) is the output of the network, obtained by applying the activation function layer by layer. The purpose of training the AE is to obtain the parameters \(\left\{ {W^{\left( l \right)} ,b^{\left( l \right)} } \right\}\).
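As a minimal sketch of the mapping described above, the following pure-Python snippet runs one forward pass of an autoencoder with a single hidden layer and measures the squared reconstruction error. The sigmoid activation, the layer sizes (3 inputs, 2 hidden units), the random initial weights and the input values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    """Logistic activation, assumed here for both layers."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W v + b for a weight matrix W (list of rows) and bias list b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def forward(x, W1, b1, W2, b2):
    """Encode x into hidden units h, then decode h into the reconstruction x_hat."""
    h = [sigmoid(z) for z in affine(W1, b1, x)]
    x_hat = [sigmoid(z) for z in affine(W2, b2, h)]
    return h, x_hat


def reconstruction_error(x, x_hat):
    """Half the squared distance between the input and its reconstruction."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))


rng = random.Random(0)
n, k = 3, 2  # illustrative sizes: n inputs, k hidden units
W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(k)]
b1 = [0.0] * k
W2 = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
b2 = [0.0] * n

x = [0.2, 0.5, 0.9]  # a made-up, already normalized input
h, x_hat = forward(x, W1, b1, W2, b2)
error = reconstruction_error(x, x_hat)
```

Training would adjust `W1`, `b1`, `W2` and `b2` to reduce `error` over all samples; the untrained weights here only show the data flow.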
Continuous sparse autoencoder
1. Setting \(\varDelta W^{\left( l \right)} : = 0,\varDelta b^{\left( l \right)} : = 0\)
2. Calculating \(\nabla_{{W^{\left( l \right)} }} J\left( {W,b;x,y} \right)\) and \(\nabla_{{b^{\left( l \right)} }} J\left( {W,b;x,y} \right)\)
3. Calculating \(\varDelta W^{\left( l \right)} : = \varDelta W^{\left( l \right)} + \nabla_{{W^{\left( l \right)} }} J\left( {W,b;x,y} \right)\) and \(\varDelta b^{\left( l \right)} : = \varDelta b^{\left( l \right)} + \nabla_{{b^{\left( l \right)} }} J\left( {W,b;x,y} \right)\)
4. Updating the weight:
$$W^{\left( l \right)} = W^{\left( l \right)} - \alpha \left[ {\left( {\frac{1}{m}\varDelta W^{\left( l \right)} } \right) + \lambda W^{\left( l \right)} } \right]$$
$$b^{\left( l \right)} = b^{\left( l \right)} - \alpha \left[ {\left( {\frac{1}{m}\varDelta b^{\left( l \right)} } \right)} \right]$$
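The four steps above can be sketched in plain Python as follows. The sketch assumes that the per-sample gradients of step 2 have already been computed (for example by backpropagation) and are passed in as nested lists. The function name `batch_update`, the toy gradients and the weight-decay value `lam=0.1` are hypothetical; `alpha=0.05` matches the learning rate reported in the experiments.

```python
def batch_update(W, b, sample_grads, alpha, lam):
    """Accumulate per-sample gradients, then take one averaged update step.

    W is a list of rows, b is a list, and sample_grads is a list of
    (grad_W, grad_b) pairs with the same shapes, one pair per sample.
    """
    m = len(sample_grads)
    # Step 1: reset the accumulators.
    dW = [[0.0] * len(row) for row in W]
    db = [0.0] * len(b)
    # Steps 2-3: add each sample's (precomputed) gradient to the accumulators.
    for grad_W, grad_b in sample_grads:
        for i, row in enumerate(grad_W):
            for j, g in enumerate(row):
                dW[i][j] += g
        for i, g in enumerate(grad_b):
            db[i] += g
    # Step 4: averaged gradient step; the weight decay term applies to W only.
    new_W = [
        [W[i][j] - alpha * (dW[i][j] / m + lam * W[i][j]) for j in range(len(W[i]))]
        for i in range(len(W))
    ]
    new_b = [b[i] - alpha * (db[i] / m) for i in range(len(b))]
    return new_W, new_b


# Toy example with hand-made gradients for two samples (hypothetical numbers).
W = [[1.0, -2.0]]
b = [0.5]
sample_grads = [([[0.2, 0.4]], [0.1]), ([[0.6, 0.0]], [0.3])]
new_W, new_b = batch_update(W, b, sample_grads, alpha=0.05, lam=0.1)
```

For these numbers the averaged gradient of the first weight is 0.4 and its decay term is 0.1, so the weight moves from 1.0 to 0.975; the bias moves from 0.5 to 0.49.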
In this paper, manifold learning is used to analyze the effect of the stochastic unit. According to manifold learning theory, high-dimensional data can be represented by a low-dimensional manifold. The operator \(p(x \mid \tilde{x})\) attempts to transform the high-dimensional \(x\) to the low-dimensional \(\tilde{x}\). During learning, the distribution of the stochastic unit does not lie on the high-dimensional manifold, so the gradient of \(p(x \mid \tilde{x})\) must change greatly to approximate x. Essentially, the CSAE can be considered a manifold learning algorithm. The stochastic unit added to the activation function can change the gradient direction and prevent overfitting.
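One simple way to picture a Gaussian stochastic unit is to add zero-mean Gaussian noise to a unit's pre-activation before applying the nonlinearity. The sketch below assumes a sigmoid nonlinearity and a noise scale `sigma`; the function name and both choices are illustrative assumptions, not the paper's definition of the CSAE activation.

```python
import math
import random


def stochastic_activation(pre_activation, sigma, rng):
    """Sigmoid of the pre-activation perturbed by zero-mean Gaussian noise."""
    noisy = pre_activation + sigma * rng.gauss(0.0, 1.0)
    return 1.0 / (1.0 + math.exp(-noisy))


rng = random.Random(42)
# The same pre-activation yields a different output on every call when sigma > 0.
outputs = [stochastic_activation(0.3, sigma=0.5, rng=rng) for _ in range(5)]
# With sigma = 0 the unit reduces to an ordinary deterministic sigmoid.
deterministic = stochastic_activation(0.3, sigma=0.0, rng=rng)
```

Because the noise changes the output on every pass, the gradient seen during training is perturbed as well, which is the regularizing effect the paragraph above attributes to the stochastic unit.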
Experiments
Dataset and normalization
In this paper we use the IEC TC 10 database (Duval and DePablo 2001), provided by Mirowski and LeCun (2012), as the experimental dataset. There are 134 transformer fault samples in this dataset. Each sample contains the concentrations of CH_{4}, C_{2}H_{2}, C_{2}H_{4}, C_{2}H_{6} and H_{2} in parts per million (ppm). Three ratios, CH_{4}/H_{2}, C_{2}H_{2}/C_{2}H_{4} and C_{2}H_{4}/C_{2}H_{6}, are calculated and used as the input of the DBN. The five classes of transformer fault are encoded as binary codes and used as the output of the DBN: 00001 (partial discharges), 00010 (low energy discharge), 00100 (high energy discharge), 01000 (thermal faults <700 °C) and 10000 (thermal faults >700 °C).
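The input and output encoding described above can be sketched as follows. The gas concentrations are made-up values, not samples from IEC TC 10. The small `eps` guard against a zero denominator and the min-max normalization are assumptions, since the normalization formula is not given in this section; all function names are hypothetical.

```python
FAULT_CODES = {
    "PD": "00001",   # partial discharges
    "LED": "00010",  # low energy discharge
    "HED": "00100",  # high energy discharge
    "TF1": "01000",  # thermal faults <700 °C
    "TF2": "10000",  # thermal faults >700 °C
}


def safe_ratio(numerator, denominator, eps=1e-9):
    """Divide two concentrations; eps avoids division by zero (assumption)."""
    return numerator / (denominator + eps)


def iec_three_ratios(sample):
    """Return [CH4/H2, C2H2/C2H4, C2H4/C2H6] from gas concentrations in ppm."""
    return [
        safe_ratio(sample["CH4"], sample["H2"]),
        safe_ratio(sample["C2H2"], sample["C2H4"]),
        safe_ratio(sample["C2H4"], sample["C2H6"]),
    ]


def min_max_normalize(rows):
    """Scale each column to [0, 1]; an assumed normalization scheme."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Two made-up samples (ppm).
samples = [
    {"H2": 100, "CH4": 50, "C2H2": 0, "C2H4": 20, "C2H6": 10},
    {"H2": 200, "CH4": 400, "C2H2": 10, "C2H4": 100, "C2H6": 25},
]
ratios = [iec_three_ratios(s) for s in samples]
inputs = min_max_normalize(ratios)
# Target vector for a sample labelled as a thermal fault <700 °C.
target = [int(bit) for bit in FAULT_CODES["TF1"]]
```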
Network structure
Parameters setting
Parameters are very important for a neural network. Recent studies (Nguyen et al. 2013) have shown that if the parameters are not set properly, the correct differentiation rate will be low and convergence will be slow. Based on previous experience, the authors set the parameters as follows.
Learning rate: if the learning rate is too large, the system becomes unstable; if it is too small, training takes too many epochs. Generally, a relatively small learning rate makes the error converge asymptotically. The learning rate should also be adjusted according to the size of the network. In this experiment, the learning rate is set to 0.05.
Sparse parameter: the sparse parameter is used to determine the unit activation. In this experiment, the sparse parameter is set to 0.01.
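The learning-rate trade-off described above can be seen on a toy problem. The sketch below minimizes the one-dimensional function f(w) = w², which is an assumption chosen for illustration and is not the CSAE cost function, with three fixed learning rates.

```python
def gradient_descent(alpha, steps=50, w0=1.0):
    """Minimize f(w) = w**2 with a fixed learning rate and return the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w  # derivative of w**2
        w = w - alpha * grad
    return abs(w)


final_small = gradient_descent(alpha=0.001)  # very small step: slow progress
final_paper = gradient_descent(alpha=0.05)   # the value used in this experiment
final_large = gradient_descent(alpha=1.1)    # too large: the iterates grow without bound
```

After 50 steps the smallest rate has barely moved away from the starting point, the rate 0.05 is close to the minimum at 0, and the largest rate has diverged.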
Simulation
The simulation environment is as follows: the software is Matlab 8.1.0, the hardware is a desktop computer with an Intel i5 processor running at 2.5 GHz and 8 GB of RAM, and the operating system is Microsoft Windows 8.1 Professional. In this experiment, 125 samples are used as the training dataset and the remaining 9 samples as the prediction dataset. K-fold cross-validation is adopted. K is set to 5, which means that the 125 samples are divided into 5 partitions. In each round, one partition is used for testing and the other 4 partitions are used for training. The process is repeated 5 times so that each partition is used once for testing.
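The 5-fold split described above can be sketched with the standard library alone. The shuffling, the fixed seed and the function name `k_fold_indices` are assumptions for illustration; the paper does not state how samples are assigned to partitions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_samples is not divisible by k.
        stop = start + fold_size if fold < k - 1 else n_samples
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test


folds = list(k_fold_indices(n_samples=125, k=5))
```

With 125 samples and K = 5, every round trains on 100 samples and tests on 25, and each sample appears in exactly one test partition.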
Classification accuracy of KNN
K  10 (%)  15 (%)  20 (%)  60 (%) 

Accuracy (%)  88.9  90  83.9  77.8 
Classification accuracy of SVM
Kernel function  SVM_RBF (%)  SVM_SIG (%)  SVM_POLY (%) 

Accuracy (%)  79.9  59.5  68.8 
Classification accuracy of BP and CSAE
Classification  CSAE (%)  BP (%) 

TF1 (%)  100  86.6 
TF2 (%)  93.7  81.2 
PD (%)  83.3  83.3 
LED (%)  95.6  82.6 
HED (%)  95.5  86.6 
Results of Wilcoxon rank sum test
State  CSAE (%)  BP (%) 

Standard deviation (%)  6.22  2.44 
Average accuracy (%)  93.6  84.1 
p-value  0.0195 
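The average accuracy and standard deviation in the table above can be reproduced from the per-class accuracies reported for CSAE and BP. The sketch below uses the sample standard deviation (divisor n − 1); that this is the definition behind the table is an inference from the fact that the numbers match.

```python
import statistics

# Per-class accuracies (%) in the order TF1, TF2, PD, LED, HED.
csae = [100, 93.7, 83.3, 95.6, 95.5]
bp = [86.6, 81.2, 83.3, 82.6, 86.6]

csae_mean = round(statistics.mean(csae), 1)  # 93.6
csae_std = round(statistics.stdev(csae), 2)  # 6.22
bp_mean = round(statistics.mean(bp), 1)      # 84.1
bp_std = round(statistics.stdev(bp), 2)      # 2.44
```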
A part of training results
No  CH_{4}/H_{2}  C_{2}H_{2}/C_{2}H_{4}  C_{2}H_{4}/C_{2}H_{6}  Actual fault  Forecast fault 

1  0.06  0  1.35  LED  LED 
2  1  0.007  2.52  TF1  TF1 
3  0.96  0.025  8.12  TF2  TF2 
4  2.3  0  3.83  TF2  TF2 
5  7.19  0.005  8.63  TF2  TF2 
6  0.235  1.1  7.67  PD  PD 
7  1.3  0  1.22  TF1  TF1 
8  1.23  0.05  9.22  TF2  TF2 
9  0.17  1  9.615  PD  PD 
Conclusion and future work
In this paper, we propose a novel CSAE model that can be used for unsupervised learning of representations. The CSAE, which adds a Gaussian stochastic unit to the activation function, is adopted to solve the problem of transformer fault recognition. The IEC three ratios are calculated from the concentrations of dissolved gases and then normalized to reduce data singularity. In the experiments, a DBN is built from two layers of CSAE and one layer of BP. The CSAE is used for unsupervised training and feature extraction, and the BP layer is used for supervised training and transformer fault classification. Comparative experiments show the advantages of the CSAE in transformer fault diagnosis. The proposed neural network diagnosis algorithm outperforms the traditional algorithms it was compared with, which indicates its value for practical transformer fault diagnosis.
The CSAE model has the advantages of outstanding recognition ability for continuous data, unsupervised feature learning, high precision and robustness. Its main disadvantages are long training time and the need for a high-performance computer. In summary, the CSAE has great potential. In future work, we will continue to study the CSAE and try to shorten its training time. Furthermore, we plan to investigate optimization strategies for diagnosing transformer faults.
Declarations
Authors’ contributions
A mathematical model for transformer fault diagnosis has been proposed. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by National Natural Science Foundation of China (41276086), National Natural Science Foundation of Shandong Province (ZR2015FM004).
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 Andrew NG (2012) Autoencoders and sparsity. http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity
 Arakelian VG (2004) The long way to the automatic chromatographic analysis of gases dissolved in insulating oil. IEEE Electr Insul Mag 20(6):8–25. doi:10.1109/MEI.2004.1367506
 Ballal MS, Ballal DM, Suryawanshi HM, Choudhari BN (2013) Computational intelligence algorithm based condition monitoring system for power transformer. In: IEEE 1st international conference on condition assessment techniques in electrical systems, IEEE CATCON 2013, pp 154–159. doi:10.1109/CATCON.2013.6737538
 Bengio Y, Lamblin P, Popovici D, Larochelle H (2006) Greedy layer-wise training of deep networks. In: 20th annual conference on neural information processing systems, NIPS 2006, pp 153–160
 Bhalla D, Bansal RK, Gupta HO (2012) Function analysis based rule extraction from artificial neural networks for transformer incipient fault diagnosis. Int J Electr Power 43(1):1196–1203. doi:10.1016/j.ijepes.2012.06.042
 Chen W, Pan C, Yun Y, Liu Y (2009) Wavelet networks in power transformers diagnosis using dissolved gas analysis. IEEE Trans Power Deliv 24(1):187–194. doi:10.1109/TPWRD.2008.2002974
 Cottrell GW (2006) New life for neural networks. Science 313(5786):454–455. doi:10.1126/science.1129813
 Dahl GE, Yu D, Deng L, Acero A (2012) Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans Audio Speech 20(1):30–42. doi:10.1109/TASL.2011.2134090
 Dhote NK, Helonde JB (2012) Diagnosis of power transformer faults based on five fuzzy ratio method. WSEAS Trans Power Syst 7(3):114–125
 Duval M (2003) New techniques for dissolved gas-in-oil analysis. IEEE Electr Insul Mag 19(2):6–15. doi:10.1109/MEI.2003.1192031
 Duval M, DePablo A (2001) Interpretation of gas-in-oil analysis using new IEC publication 60599 and IEC TC 10 databases. IEEE Electr Insul Mag 17(2):31–41. doi:10.1109/57.917529
 Glorot X, Bordes A, Bengio Y (2011) Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 28th international conference on machine learning, ICML 2011, pp 513–520
 Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507. doi:10.1126/science.1127647
 Hsu C, Lin C (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Network 13(2):415–425. doi:10.1109/72.991427
 Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86. doi:10.1214/aoms/1177729694
 Le Roux N, Bengio Y (2010) Deep belief networks are compact universal approximators. Neural Comput 22(8):2192–2207. doi:10.1162/neco.2010.08-09-1081
 Miranda V, Castro ARG, Lima S (2012) Diagnosing faults in power transformers with autoassociative neural networks and mean shift. IEEE Trans Power Deliv 27(3):1350–1357. doi:10.1109/TPWRD.2012.2188143
 Mirowski P, LeCun Y (2012) Statistical machine learning and dissolved gas analysis: a review. IEEE Trans Power Deliv 27(4):1791–1799. http://www.mirowski.info/pub/dga
 Nguyen TD, Tran T, Phung D, Venkatesh S (2013) Learning parts-based representations with nonnegative restricted Boltzmann machine. In: 5th Asian conference on machine learning, ACML 2013, pp 133–148
 Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583):607–609. doi:10.1038/381607a0
 Patel NK, Khubchandani RK (2004) ANN based power transformer fault diagnosis. J Inst Eng (India) Electr Eng Div 85:60–63
 Ren X, Zhang F, Zheng L, Men X (2010) Application of quantum neural network based on rough set in transformer fault diagnosis. In: Proceedings of the power and energy engineering conference (APPEEC), 2010 Asia-Pacific, 28–31 March 2010, pp 1–4. doi:10.1109/APPEEC.2010.5448911
 Saha TK (2003) Review of modern diagnostic techniques for assessing insulation condition in aged transformers. IEEE Trans Dielectr Electr Insul 10(5):903–917. doi:10.1109/TDEI.2003.1237337
 Souahlia S, Bacha K, Chaari A (2012) MLP neural network-based decision for power transformers fault diagnosis using an improved combination of Rogers and Doernenburg ratios DGA. Int J Electr Power 43(1):1346–1353. doi:10.1016/j.ijepes.2012.05.067
 Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning, pp 1096–1103
 Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408
 Vinje WE, Gallant JL (2000) Sparse coding and decorrelation in primary visual cortex during natural vision. Science 287(5456):1273–1276. doi:10.1126/science.287.5456.1273
 Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
 Yang HT, Liao CC, Chou JH (2001) Fuzzy learning vector quantization networks for power transformer condition assessment. IEEE Trans Dielectr Electr Insul 8(1):143–149. doi:10.1109/94.910437