On multivariate imputation and forecasting of decadal wind speed missing data
- Ronald Wesonga^{1}
Received: 6 October 2014
Accepted: 23 December 2014
Published: 13 January 2015
Abstract
This paper demonstrates the application of multiple imputation by chained equations and time series forecasting to wind speed data. The study was motivated by the high prevalence of missing values in historic wind speed data. The fully conditional specification under multiple imputation by chained equations provided reliable imputations of the missing wind speed data. Further, the forecasting model yields a smoothing parameter, alpha (0.014), close to zero, confirming that recent past observations are more suitable for forecasting wind speeds. The maximum decadal wind speed for Entebbe International Airport was estimated to be 17.6 metres per second at a 0.05 level of significance, with a bound on the error of estimation of 10.8 metres per second. The large bound on the error of estimation confirms the dynamic tendencies of wind speed at the airport under study.
Keywords
Wind speed; Missing data; Imputations; Forecasting; Statistical models
Introduction
Wind speed studies have increased in the recent past (Li and Shi 2010; Lun and Lam 2000). The interest among researchers may be attributed to the increasing magnitude and frequency of both the beneficial and the adverse effects of wind (Dorvlo 2002; Calif et al. 2011). It has become apparent that there is a close correlation with the climate change phenomenon. Winds are advantageous in some circumstances, such as the provision of wind energy; hence the desire for more regular high wind speeds (Qin et al. 2011). Strong winds, however, can result in hazardous phenomena such as wind shear, an atmospheric phenomenon consisting of a sudden variation in the intensity and direction of the wind. Wind shear is responsible for many incidents and accidents that occur during takeoff or landing and during low altitude flights. Understanding the tendencies of its occurrence can improve control, management and planning of air traffic flow at an airport (Wesonga et al. 2013). Forecasting of winds, in turn, supports wind energy planning and sustainable production, and helps pilots and air traffic control operators in their ongoing surveillance activities by giving real-time warning of wind shear phenomena.
This process of wind forecasting provides much needed information for energy generation to complement hydroelectricity, which is sometimes unreliable. In many African airports, however, wind data are scanty, and where attempts are made to record them, the records are not consistent over periods long enough to support systematic statistical analysis (van Buuren 2011). For example, at Entebbe International Airport, although wind data were observed, recorded and used in developing daily forecasts, the capacity to maintain wind databases for a long time was curtailed by a lack of the necessary infrastructure and equipment. None of the reviewed literature on wind speed attempted to handle missing wind speed data systematically. The main objective of this study was to perform missing data imputation on the available historic wind speed data at Entebbe International Airport. I also developed a wind speed time series model using the imputed data to provide knowledge about the behaviour of wind speed in the absence of properly maintained databases for both wind speed and direction (Wesonga and Nabugoomu 2014). The long term goal is to use wind speed data to inform operations such as air traffic flow management and wind energy production.
Data description and methodology
Table 1 Wind speed decadal data variable description
Variable name | Variable description |
---|---|
Year | Year |
Month | Month of the year |
dmax1 | Decadal one maximum wind speed |
dmax2 | Decadal two maximum wind speed |
dmax3 | Decadal three maximum wind speed |
Table 2 Sample of original maximum wind speed decadal data
Year | Month | dmax1 | dmax2 | dmax3 |
---|---|---|---|---|
2003 | 10 | NA | NA | NA |
2003 | 11 | NA | NA | NA |
2003 | 12 | NA | NA | NA |
2004 | 1 | 48 | 15 | 25 |
2004 | 2 | 20 | 20 | 13 |
2004 | 3 | 15 | 24 | 20 |
2004 | 4 | 22 | 25 | 20 |
2004 | 5 | 15 | 14 | 20 |
2004 | 6 | 15 | 12 | 15 |
2004 | 7 | NA | NA | NA |
2004 | 8 | 16 | 15 | 18 |
2004 | 9 | 18 | 10 | 15 |
2004 | 10 | 15 | 15 | 25 |
2004 | 11 | 15 | 16 | 18 |
2004 | 12 | 14 | 25 | 20 |
2005 | 1 | NA | NA | NA |
2005 | 2 | NA | NA | NA |
2005 | 3 | NA | NA | NA |
2005 | 4 | NA | NA | NA |
2005 | 5 | NA | NA | NA |
Markov Chain Monte Carlo (MCMC) method
The Markov chain Monte Carlo (MCMC) method has its origin in physics as a tool for exploring equilibrium distributions of interacting molecules (Walsh 2004). In statistical applications, it is used to generate pseudorandom draws from multidimensional and otherwise intractable probability distributions via Markov chains. A Markov chain is a sequence of random variables in which the distribution of each element depends on the value of the previous one. For imputation, each iteration consists of two steps:
- 1. The imputation, I-step: With the estimated mean vector and covariance matrix, the I-step simulates the missing values for each observation independently. That is, if the variables with missing values for observation \( i \) are denoted by \( Y_{i(mis)} \) and the variables with observed values by \( Y_{i(obs)} \), then the I-step draws values for \( Y_{i(mis)} \) from the conditional distribution of \( Y_{i(mis)} \) given \( Y_{i(obs)} \).
- 2. The posterior, P-step: The P-step simulates the posterior population mean vector and covariance matrix from the complete sample estimates. These new estimates are then used in the next I-step. Without prior information about the parameters, a non-informative prior is used. Informative priors can also be used. For example, prior information about the covariance matrix may help to stabilise inference about the mean vector when the covariance matrix is nearly singular.
The two steps are iterated long enough for the results to be reliable for a multiply imputed data set (Schafer and Olsen 1998). The goal is for the iterates to converge to their stationary distribution and then to simulate an approximately independent draw of the missing values.
That is, with a current parameter estimate \( \theta^{(t)} \) at the \( t \)th iteration, the I-step draws \( Y_{mis}^{(t+1)} \) from \( p\left(Y_{mis} \mid Y_{obs}, \theta^{(t)}\right) \) and the P-step draws \( \theta^{(t+1)} \) from \( p\left(\theta \mid Y_{obs}, Y_{mis}^{(t+1)}\right) \). This creates a Markov chain \( \left(Y_{mis}^{(1)}, \theta^{(1)}\right),\ \left(Y_{mis}^{(2)}, \theta^{(2)}\right), \dots \) which converges in distribution to \( p\left(Y_{mis}, \theta \mid Y_{obs}\right) \).
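The alternation between the I-step and the P-step can be sketched in base R. This is a minimal illustration, not the procedure used in the study: it assumes a single normally distributed variable instead of the multivariate case, a non-informative prior, and a hypothetical helper name `data_augmentation`. The input values are the dmax1 entries of the sample table with three gaps.

```r
# Data augmentation for one normal variable with missing values.
# Hypothetical helper; a one-variable stand-in for the multivariate case.
data_augmentation <- function(y, n_iter = 200, seed = 1) {
  set.seed(seed)
  mis <- is.na(y)
  n <- length(y)
  # Starting parameter values from the observed data only
  mu <- mean(y[!mis])
  sigma2 <- var(y[!mis])
  y_work <- y
  draws <- matrix(NA_real_, nrow = n_iter, ncol = 2,
                  dimnames = list(NULL, c("mu", "sigma2")))
  for (t in seq_len(n_iter)) {
    # I-step: draw the missing values given the current parameters
    y_work[mis] <- rnorm(sum(mis), mean = mu, sd = sqrt(sigma2))
    # P-step: draw the parameters from their posterior given the completed
    # data (non-informative prior: scaled inverse chi-square, then normal)
    s2 <- var(y_work)
    sigma2 <- (n - 1) * s2 / rchisq(1, df = n - 1)
    mu <- rnorm(1, mean = mean(y_work), sd = sqrt(sigma2 / n))
    draws[t, ] <- c(mu, sigma2)
  }
  list(completed = y_work, draws = draws)
}

# dmax1 values from the sample table, with three gaps
y <- c(48, 20, 15, 22, 15, 15, NA, 16, 18, 15, 15, 14, NA, NA)
fit <- data_augmentation(y)
fit$completed      # observed values unchanged, gaps filled by the last I-step
head(fit$draws)    # successive parameter draws forming the Markov chain
```

Running the chain several times with different seeds and keeping the last completed vector from each run would give several imputed data sets, which is the idea behind multiple imputation.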
Table 3 presents a summary of the packages and functions used for performing the multiple imputations of the missing wind speed data and the time series analysis. Further documentation of R and its functions is provided by the R Core Team (R Core Team 2014).
Table 3 Abridged R code for missing wind speed imputation and forecasting
R code for imputation and time series prediction of wind speed data
    # Required R packages
    library(VIM)
    library(mice)
    library(lattice)
    library(TTR)
    library(forecast)
    # Inspection of the missing data; `wind` is the data frame of decadal maxima
    p <- md.pairs(wind)
    # marginplot() needs two columns; the pair chosen here is an assumption
    marginplot(wind[, c("dmax1", "dmax2")])
    # MICE uses predictive mean matching, pmm
    imp <- mice(wind)
    # Further diagnostic checking; assumes `wind` holds the speeds in a column named `values`
    imp$imp$values
    c1 <- complete(imp)
    # Inspection of the distributions of original and the imputed data
    com <- complete(imp, "long", include = TRUE)
    # Perform time series prediction modelling (three decadal values per month, 36 per year)
    windts <- ts(c1$values, start = c(1995, 1), frequency = 36)
    # Decompose seasonal data
    windtscmpnts <- decompose(windts)
    plot(windtscmpnts)
    # Seasonally adjusting
    windtssadjusted <- windts - windtscmpnts$seasonal
    # Simple exponential smoothing: no trend (beta) and no seasonal (gamma) component
    windforecasts <- HoltWinters(windts, beta = FALSE, gamma = FALSE)
    windforecasts$fitted
    plot(windforecasts)
    windforecasts$SSE
    # Forecast and forecast errors; forecast() dispatches on the HoltWinters object,
    # since recent versions of the forecast package no longer export forecast.HoltWinters()
    windforecasts1 <- forecast(windforecasts, h = 360)
Multiple imputations of wind speed data
Data manipulation alone could not solve the problem of missing data: 47 of the 168 months (28%) had missing values. The problem was compounded by the fact that, in some cases, maximum decadal wind speed data were missing for two years (2002 and 2003).
In R, the mice package imputes incomplete multivariate data by fully conditional specification, FCS (Bartlett et al. 2014). Multivariate imputation by chained equations (MICE) first appeared in the year 2000 as an S-PLUS library and in 2001 as an R package. The first version introduced predictor selection, passive imputation and automatic pooling. Other extensions, including imputation of multi-level data, automatic predictor selection, data handling, post-processing of imputed values, specialised pooling and model selection, have since been added. Two general approaches to imputing multivariate data based on MCMC are distinguished. The first is joint modelling (JM), which involves specifying a multivariate distribution for the missing data and drawing imputations from their conditional distribution by MCMC techniques. The second is FCS, in which a separate conditional model is specified for each incomplete variable and imputations are drawn variable by variable. Multiple imputation then proceeds in three steps:
- 1) missing data are filled in m times to generate m complete data sets, a process known as imputation;
- 2) the m complete data sets are analysed by using standard procedures, commonly known as analysis; and
- 3) the results from the m complete data sets are combined for inference, also known as pooling.
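The pooling step can be illustrated for a single scalar quantity with Rubin's rules, the standard combining rules for multiple imputation. The sketch below is not taken from the paper: the helper name `pool_rubin` is hypothetical and the five estimates and variances are made-up numbers used only to show the arithmetic.

```r
# Pool m point estimates and their variances with Rubin's rules.
# Hypothetical helper for a single scalar quantity.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)          # pooled point estimate
  w <- mean(variances)              # within-imputation variance
  b <- var(estimates)               # between-imputation variance
  total <- w + (1 + 1 / m) * b      # total variance of the pooled estimate
  c(estimate = q_bar, within = w, between = b,
    total = total, se = sqrt(total))
}

# Made-up estimates of a mean wind speed from m = 5 completed data sets
est <- c(17.2, 17.9, 17.5, 17.8, 17.6)
v <- c(0.40, 0.38, 0.42, 0.41, 0.39)
pool_rubin(est, v)
```

The between-imputation term is what carries the extra uncertainty due to the missing values; with no missing data all m estimates would coincide and it would vanish.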
In the original dataset, 28% of the 168 records are missing. The MCMC approach iterates the given steps until they converge to their stationary distribution and then simulates an approximately independent draw of the missing wind speed values. MCMC was applied through the mice package, specifically using its FCS (fully conditional specification) routine, in the R language for statistical computing.
Three steps are employed under FCS as provided for in the mice package, namely imputation, analysis and pooling, for which the functions mice(), with() and pool() are applied respectively. Each step has its own storage class: mids for the multiply imputed data set, mira for the repeated analyses and mipo for the pooled results. The completed datasets, with the missing values filled in, are extracted from the mids object with complete().
Time series analysis of the imputed wind speed data
Given the short-term time effects within wind speeds, forecasts were made by exponential smoothing under an additive model with constant level and no seasonality. The additive Holt-Winters prediction function was employed to develop the wind speed forecasting model. The smoothing scheme begins by setting \( S_2 \) to \( y_1 \), where \( S_i \) stands for the smoothed observation or EWMA (exponentially weighted moving average) and \( y_i \) stands for the original observation. The subscripts refer to the time periods \( 1, 2, \dots, n \). Thus, \( S_t = \alpha y_t + (1 - \alpha) S_{t-1} \) for \( t \ge 3 \), where the smoothing parameter satisfies \( 0 < \alpha \le 1 \). The speed at which older wind speed values are dampened is a function of the smoothing constant \( \alpha \). As \( \alpha \) approaches 0, the smoothed value stays close to its value one period before, \( S_{t-1} \), and new observations have little influence. When \( \alpha \) is 1, the smoothed value depends only on the current observation \( y_t \). Determining the smoothing constant has been a challenge; Hyndman and Khandakar (2007), Collopy and Armstrong (1992) and Gardner and McKenzie (1985) agree that alpha is better estimated from the data than guessed.
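The recursion can be written out directly in base R to see the effect of the smoothing parameter. This is a minimal sketch with a hypothetical helper name, `exp_smooth`; it follows the initialisation stated above and is not the internal algorithm of HoltWinters(), which estimates alpha and handles start values itself. The input is the first six dmax1 values of the sample table.

```r
# Simple exponential smoothing following the recursion in the text:
# S_2 = y_1 and S_t = alpha * y_t + (1 - alpha) * S_{t-1} for t >= 3.
exp_smooth <- function(y, alpha) {
  stopifnot(alpha > 0, alpha <= 1, length(y) >= 2)
  s <- rep(NA_real_, length(y))   # S_1 is undefined
  s[2] <- y[1]
  if (length(y) >= 3) {
    for (t in 3:length(y)) {
      s[t] <- alpha * y[t] + (1 - alpha) * s[t - 1]
    }
  }
  s
}

y <- c(48, 20, 15, 22, 15, 15)   # dmax1 values from the sample table
exp_smooth(y, alpha = 0.014)     # stays close to the starting level
exp_smooth(y, alpha = 1)         # reproduces the observations from t = 3
```

With alpha = 0.014, the value reported in the abstract, each new observation receives a weight of only 1.4%, so the smoothed series moves slowly away from its starting level.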
Discussions
Discussions are based on a descriptive analysis of the missing wind speed data and a time series analysis of the imputed and original wind speed data.
Descriptive analysis for level of missing wind speed data
The descriptive analysis shows that there were 168 records, equal to the number of months from 1995 through 2008. Each of the three decadal wind speed variables had the same number of observed values (121) and of missing or incomplete values (47). Correspondingly, the median year of the observed wind speeds (2000) was the same for the three decadal variables, as was the median year of the missing or incomplete observations (2003). Outlier missingness was observed in the earlier periods of 1995 and 1997 for the three decadal variables.
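Counts of this kind can be obtained with a few lines of base R. The sketch below is only an illustration on the 20-row sample shown in the data description (October 2003 to May 2005), not on the full 168-month record, and the object names are hypothetical.

```r
# The 20-row sample of decadal maxima from the data description
wind_sample <- data.frame(
  Year  = c(rep(2003, 3), rep(2004, 12), rep(2005, 5)),
  Month = c(10:12, 1:12, 1:5),
  dmax1 = c(NA, NA, NA, 48, 20, 15, 22, 15, 15, NA, 16, 18, 15, 15, 14, rep(NA, 5)),
  dmax2 = c(NA, NA, NA, 15, 20, 24, 25, 14, 12, NA, 15, 10, 15, 16, 25, rep(NA, 5)),
  dmax3 = c(NA, NA, NA, 25, 13, 20, 20, 20, 15, NA, 18, 15, 25, 18, 20, rep(NA, 5))
)

decades <- c("dmax1", "dmax2", "dmax3")
n_missing <- colSums(is.na(wind_sample[decades]))    # missing months per variable
n_observed <- colSums(!is.na(wind_sample[decades]))  # observed months per variable
pct_missing <- round(100 * n_missing / nrow(wind_sample), 1)
data.frame(n_observed, n_missing, pct_missing)

# Same arithmetic for the full record reported in the text
round(100 * 47 / 168)   # percentage of the 168 months that are missing
```

In the sample, as in the full record, a month is either observed for all three decadal variables or missing for all three, which is why the counts agree across columns.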
Time series analysis of imputed versus original wind speed data
Differences between the imputed and original wind speed data
Wind speed data imputation required a principled way of filling data gaps that may occur at the beginning, within or at the end of the wind speed dataset. The power and accuracy of imputation methods often vary; therefore, a survey of the most suitable approach was carried out to ensure the validity of the imputed wind speed dataset. Although imputation methods are sometimes capable of predicting future occurrences, the necessary assumptions should be made to produce reliable results.
Whereas data characterisation may be a good approach to understanding the general structure of data, it does not offer a good solution where further data analysis is required to guide decisions and policy for national operations. Maintenance of character of data, however, is a key factor in determining the reliability of the imputation methods applied.
One of the standards for assessing imputation methods is their ability to preserve the structure and probability functions of the imputed data. Thus, an attempt was made to test the hypothesis that there was a significant difference between the original and the imputed wind speed datasets, \( H_A : \mu_{original} \ne \mu_{imputed} \). Findings, using the t-test statistic (t = 0.3915; P(|T| > |t|) = 0.6955), showed that there was no significant difference between the original and imputed wind speed datasets. The test therefore failed to reject the null hypothesis that the mean wind speed of the original dataset was equal to the mean wind speed of the imputed dataset. These findings support the reliability of the imputation method applied in this study.
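A comparison of this kind can be sketched with t.test() in base R. The example below makes several assumptions that are not in the paper: it uses only the dmax1 values of the sample table, it fills the gaps by resampling observed values as a stand-in for the mice imputation, and it uses the Welch form of the test, since the paper does not state which variant was applied. Its output is therefore not expected to match the reported statistics.

```r
set.seed(2014)
# dmax1 values from the sample table, with three gaps
original <- c(48, 20, 15, 22, 15, 15, NA, 16, 18, 15, 15, 14, NA, NA)
observed <- original[!is.na(original)]

# Stand-in imputation: resample observed values (not the mice procedure)
completed <- original
completed[is.na(original)] <- sample(observed, sum(is.na(original)),
                                     replace = TRUE)

# Two-sample t-test of the observed values against the completed series
res <- t.test(observed, completed)   # Welch test by default
res$statistic
res$p.value
```

Because the completed series contains the observed values, the two samples are not independent; the test is best read as a descriptive check that imputation did not shift the mean.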
Time series model for the imputed wind speed data
Conclusions
In this study, missing decadal wind speeds were imputed using multivariate imputation methods. Multiple imputation by chained equations (MICE), as implemented in the R language for statistical computing, was applied. The approach was found to preserve not only the relations and categorisations within the data, but also the uncertainties about these relations over time. Furthermore, a time series model for the imputed decadal wind speeds was developed with a view to presenting a simple exponential smoothing prediction model. Further work on time series modelling is recommended to reconstruct the stochastic tendencies of wind speeds at the case study airport. The maximum decadal wind speed for Entebbe International Airport was estimated to vary within (17.6 ± 10.8) metres per second.
As in other developing countries, wind speed data management, analysis and prediction receive the least priority, given the competing demands these countries face. Management of these vital data can be improved to minimise threats to lives and property, especially during departures and arrivals (Wesonga et al. 2012). Technologically, low level wind shear alert systems (LLWAS) are recommended for the management, analysis and prediction of wind phenomena.
On the positive side, these high wind speeds, once better analysed and understood, could be a good source of energy when harnessed by the competent authorities in charge of energy generation. The wind energy would complement the energy supply for units of operation at the airport (Burton et al. 2011; Giebel et al. 2011). Data are a conveyor of information: complete and timely data, especially wind speed data at an international airport, can facilitate smooth airport operations if appropriately tapped.
Declarations
Acknowledgement
My appreciations go to the staff of the School of Statistics and Planning, Makerere University for providing a challenging, but sometimes favorable study environment and to the staff and Management of the Uganda National Meteorological Authority for availing the wind speed data for Entebbe International Airport. I am grateful to the three anonymous reviewers for their guidance and direction that shaped this paper into its current form.
References
- Bartlett JW, Seaman SR, White IR, Carpenter JR (2014) Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. arXiv preprint arXiv:1210.6799
- Burton T, Jenkins N, Sharpe D, Bossanyi E (2011) Wind energy handbook. John Wiley & Sons
- Calif R, Emilion R, Soubdhan T (2011) Classification of wind speed distributions using a mixture of Dirichlet distributions. Renew Energy 36:3091–3097
- Collopy F, Armstrong JS (1992) Rule-based forecasting: development and validation of an expert systems approach to combining time series extrapolations. Manage Sci 38:1394–1414
- Dorvlo ASS (2002) Estimating wind speed distribution. Energy Conversion Manage 43:2311–2318
- Gardner ES Jr, McKenzie ED (1985) Forecasting trends in time series. Manage Sci 31:1237–1246
- Gelman A (2004) Parameterization and bayesian modeling. J Am Stat Assoc 99:537–545
- Giebel G, Brownsword R, Kariniotakis G, Denhard M, Draxl C (2011) The state-of-the-art in short-term prediction of wind power: a literature overview. ANEMOS Plus: Technical University of Denmark, Denmark
- Heckerman D, Chickering DM, Meek C, Rounthwaite R, Kadie C (2001) Dependency networks for inference, collaborative filtering, and data visualization. J Mach Learn Res 1:49–75
- Hyndman RJ, Khandakar Y (2007) Automatic time series for forecasting: the forecast package for R. Monash University, Department of Econometrics and Business Statistics
- Kennickell AB (1991) Imputation of the 1989 survey of consumer finances: stochastic relaxation and multiple imputation. Proceedings of the Survey Research Methods Section of the American Statistical Association, 1–10
- Li G, Shi J (2010) Application of Bayesian model averaging in modeling long-term wind speed distributions. Renew Energy 35:1192–1202
- Lun IYF, Lam JC (2000) A study of Weibull parameters using long-term wind observations. Renew Energy 20:145–153
- Qin Z, Li W, Xiong X (2011) Estimating wind speed probability distribution using kernel density method. Electr Power Syst Res 81:2139–2146
- R Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
- Raghunathan TE, Lepkowski JM, Van Hoewyk J, Solenberger P (2001) A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodol 27:85–96
- Rubin DB (2003) Nested multiple imputation of NMES via partially incompatible MCMC. Statistica Neerlandica 57:3–18
- Schafer JL (1999) Multiple imputation: a primer. Stat Methods Med Res 8:3–15
- Schafer JL (2003) Multiple imputation in multivariate problems when the imputation and analysis models differ. Statistica Neerlandica 57:19–35
- Schafer JL, Olsen MK (1998) Multiple imputation for multivariate missing-data problems: a data analyst’s perspective. Multivariate Behav Res 33:545–571
- Van Buuren S (2007) Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res 16:219–242
- van Buuren S (2011) Multiple imputation of multilevel data. In: Handbook of advanced multilevel analysis, pp 173–196
- Van Buuren S, Oudshoorn K (1999) Flexible multivariate imputation by MICE. TNO Prevention Center, Leiden, The Netherlands
- Van Buuren S, Boshuizen HC, Knook DL (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 18:681–694
- Walsh B (2004) Markov chain monte carlo and gibbs sampling
- Wesonga R, Nabugoomu F (2014) Bayesian model averaging: an application to the determinants of airport departure delay in Uganda. Am J Theoretical Appl Stat 3:1–5
- Wesonga R, Nabugoomu F, Jehopio P (2012) Parameterized framework for the analysis of probabilities of aircraft delay at an airport. J Air Transport Manage 23:1–4
- Wesonga R, Nabugoomu F, Jehopio P, Mugisha X (2013) Modelling airport efficiency with distributions of the inefficient error term: an application of time series data for aircraft departure delay. Int J Sci Basic Appl Res 1:115–123
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.