MI is a principled missing data method that provides valid statistical inferences under the MAR condition (Little and Rubin 2002). MI was proposed to impute missing data while acknowledging the uncertainty associated with the imputed values (Little and Rubin 2002). Specifically, MI acknowledges the uncertainty by generating a set of m plausible values for each unobserved data point, resulting in m complete data sets, each with one unique estimate of the missing values. The m complete data sets are then analyzed individually using standard statistical procedures, resulting in m slightly different estimates for each parameter. At the final stage of MI, m estimates are pooled together to yield a single estimate of the parameter and its corresponding SE. The pooled SE of the parameter estimate incorporates the uncertainty due to the missing data treatment (the between imputation uncertainty) into the uncertainty inherent in any estimation method (the within imputation uncertainty). Consequently, the pooled SE is larger than the SE derived from a single imputation method (e.g., mean substitution) that does not consider the between imputation uncertainty. Thus, MI minimizes the bias in the SE of a parameter estimate derived from a single imputation method.
In sum, MI handles missing data in three steps: (1) it imputes missing data m times to produce m complete data sets; (2) it analyzes each data set using a standard statistical procedure; and (3) it combines the m results into one using formulae from Rubin (1987) or Schafer (1997). Below we discuss each step in greater detail and demonstrate MI with a real data set in the section Demonstration.
Step 1: imputation
The imputation step is the most complicated of the three steps. Its aim is to fill in missing values multiple times using the information contained in the observed data. Many imputation methods are available to serve this purpose. The preferred method is the one that matches the missing data pattern. Given a univariate or monotone missing data pattern, one can impute missing values using the regression method (Rubin 1987), or the predictive mean matching method if the missing variable is continuous (Heitjan and Little 1991; Schenker and Taylor 1996). When data are missing arbitrarily, one can use the Markov Chain Monte Carlo (MCMC) method (Schafer 1997), or the fully conditional specification (also referred to as chained equations) if the missing variable is categorical or non-normal (Raghunathan et al. 2001; van Buuren 2007; van Buuren et al. 1999; van Buuren et al. 2006). The regression method and the MCMC method are described next.
The regression method for univariate or monotone missing data pattern
Suppose that there are p variables, $Y_1, Y_2, \ldots, Y_p$, in a data set and missing data are uniformly or monotonically present from $Y_j$ to $Y_p$, where $1 < j \le p$. To impute the missing values for the jth variable, one first constructs a regression model using observed data on $Y_1$ through $Y_{j-1}$ to predict the missing values on $Y_j$:
$$Y_j = \hat{\beta}_0 + \hat{\beta}_1 Y_1 + \hat{\beta}_2 Y_2 + \cdots + \hat{\beta}_{j-1} Y_{j-1} \tag{3}$$
The regression model in Equation 3 yields the estimated regression coefficients $\hat{\beta}$ and the corresponding covariance matrix. Based on these results, one can draw one set of regression coefficients $\beta^*$ from the sampling distribution of $\hat{\beta}$. Next, the missing values in $Y_j$ can be imputed by plugging $\beta^*$ into Equation 3 and adding a random error. After missing data in $Y_j$ are imputed, missing data in $Y_{j+1}, \ldots, Y_p$ are imputed subsequently in the same fashion, resulting in one complete data set. The above steps are repeated m times to derive m sets of missing values (Rubin 1987, pp. 166–167; SAS Institute Inc 2011).
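To make the regression method concrete, the following Python sketch imputes one incomplete variable from a single complete predictor. It is a minimal illustration under stated assumptions, not the SAS implementation: the function name `impute_regression`, the use of `None` to mark missing values, and the restriction to one centered predictor (so that the intercept and slope can be drawn independently) are ours. The residual variance is drawn first, and the coefficients are then drawn around their estimates, following the general logic of the regression method described above.

```python
import math
import random


def impute_regression(y1, y2, m=5, seed=0):
    """Return m completed copies of y2; missing entries (None) are imputed from y1.

    Hypothetical helper for illustration: y1 is fully observed, y2 has missing values.
    """
    rng = random.Random(seed)
    obs = [(a, b) for a, b in zip(y1, y2) if b is not None]
    n = len(obs)
    x_bar = sum(a for a, _ in obs) / n
    y_bar = sum(b for _, b in obs) / n
    sxx = sum((a - x_bar) ** 2 for a, _ in obs)
    # Least-squares slope; with a centered predictor the intercept estimate is y_bar
    b1_hat = sum((a - x_bar) * (b - y_bar) for a, b in obs) / sxx
    sse = sum((b - y_bar - b1_hat * (a - x_bar)) ** 2 for a, b in obs)
    df = n - 2  # residual degrees of freedom for intercept plus one slope

    completed = []
    for _ in range(m):
        # Draw the residual variance: SSE divided by a chi-square(df) variate
        sigma = math.sqrt(sse / rng.gammavariate(df / 2, 2))
        # Draw one set of coefficients around the estimates
        b0 = rng.gauss(y_bar, sigma / math.sqrt(n))
        b1 = rng.gauss(b1_hat, sigma / math.sqrt(sxx))
        # Plug the drawn coefficients into the regression and add a random error
        completed.append([
            b if b is not None else b0 + b1 * (a - x_bar) + rng.gauss(0, sigma)
            for a, b in zip(y1, y2)
        ])
    return completed


if __name__ == "__main__":
    y1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y2 = [2.1, 3.9, None, 8.2, None, 12.1]
    for data_set in impute_regression(y1, y2, m=3, seed=1):
        print([round(v, 2) for v in data_set])
```

Observed values are left untouched in every completed data set; only the entries marked `None` differ across the m copies, and that variation is what carries the between imputation uncertainty.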
The MCMC method for arbitrary missing pattern
When the missing data pattern is arbitrary, it is difficult to develop analytical formulae for the missing data. One has to turn to numerical simulation methods, such as MCMC (Schafer 1997) in this case. The MCMC technique used by the MI procedure of SAS is described below [interested readers should refer to SAS/STAT 9.3 User’s Guide (SAS Institute Inc 2011) for a detailed explanation].
Recall that the goal of the imputation step is to draw random samples of missing data based on information contained in the observed data. Since the parameter (θ) of the data is also unknown, the imputation step actually draws random samples of both missing data and θ based on the observed data. Formally, the imputation step is to draw random samples from the distribution $P(\theta, Y_{mis} \mid Y_{obs})$. Because it is much easier to draw estimates of $Y_{mis}$ from $P(Y_{mis} \mid Y_{obs}, \theta)$ and estimates of θ from $P(\theta \mid Y_{obs}, Y_{mis})$ separately, the MCMC method draws samples in two steps. At step one, given the current estimate $\theta^{(t)}$ at the tth iteration, a random sample $Y_{mis}^{(t+1)}$ is drawn from the conditional predictive distribution $P(Y_{mis} \mid Y_{obs}, \theta^{(t)})$. At step two, given $Y_{mis}^{(t+1)}$, a random sample $\theta^{(t+1)}$ is drawn from the distribution $P(\theta \mid Y_{obs}, Y_{mis}^{(t+1)})$. According to Tanner and Wong (1987), the first step is called the I-step (not to be confused with the first imputation step in MI) and the second step is called the P-step (or the posterior step). Starting with an initial value $\theta^{(0)}$ (usually an arbitrary guess), MCMC iterates between the I-step and the P-step, leading to a Markov Chain: $(Y_{mis}^{(1)}, \theta^{(1)}), (Y_{mis}^{(2)}, \theta^{(2)}), \ldots, (Y_{mis}^{(t)}, \theta^{(t)})$, and so on. It can be shown that this Markov Chain converges in distribution to $P(\theta, Y_{mis} \mid Y_{obs})$. It follows that the sequence $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(t)}, \ldots$ converges to $P(\theta \mid Y_{obs})$ and the sequence $Y_{mis}^{(1)}, Y_{mis}^{(2)}, \ldots, Y_{mis}^{(t)}, \ldots$ converges to $P(Y_{mis} \mid Y_{obs})$. Thus, after the Markov Chain converges, m draws of $Y_{mis}$ can form m imputations for the missing data. In practice, the m draws are separated by several iterations to avoid correlations between successive draws. Computation formulae of $P(Y_{mis} \mid Y_{obs}, \theta)$ and $P(\theta \mid Y_{obs}, Y_{mis})$ based on the multivariate normal distribution can be found in SAS/STAT 9.3 User’s Guide (SAS Institute Inc 2011). At the end of the first step in MI, m sets of complete data are generated.
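The alternation between the I-step and the P-step can be sketched in a few lines of Python. The example is deliberately simplified and rests on assumptions that are ours: a single normally distributed variable with unknown mean and variance, a noninformative prior, `None` as the missing value marker, and hypothetical argument names `burn_in` and `thin`. PROC MI works with the multivariate normal model instead, so this sketch shows only the structure of the algorithm.

```python
import math
import random


def data_augmentation(y, m=5, burn_in=200, thin=50, seed=0):
    """Return m completed copies of y drawn by alternating I-steps and P-steps.

    Toy model (an assumption for illustration): y ~ Normal(mu, sigma2).
    """
    rng = random.Random(seed)
    observed = [v for v in y if v is not None]
    n = len(y)
    # theta^(0): starting values taken from the observed cases
    mu = sum(observed) / len(observed)
    sigma2 = sum((v - mu) ** 2 for v in observed) / (len(observed) - 1)

    imputations = []
    for t in range(1, burn_in + m * thin + 1):
        # I-step: draw the missing values given the current parameter draw
        completed = [
            v if v is not None else rng.gauss(mu, math.sqrt(sigma2)) for v in y
        ]
        # P-step: draw the parameters given the completed data
        y_bar = sum(completed) / n
        ss = sum((v - y_bar) ** 2 for v in completed)
        sigma2 = ss / rng.gammavariate((n - 1) / 2, 2)  # SS / chi-square(n - 1)
        mu = rng.gauss(y_bar, math.sqrt(sigma2 / n))
        # Keep every thin-th completed data set after the burn-in period
        if t > burn_in and (t - burn_in) % thin == 0:
            imputations.append(completed)
    return imputations


if __name__ == "__main__":
    y = [4.1, None, 5.3, 6.0, None, 4.8, 5.5, None, 5.1, 4.9]
    for data_set in data_augmentation(y, m=3, burn_in=100, thin=20, seed=2):
        print([round(v, 2) for v in data_set])
```

Discarding the first `burn_in` iterations gives the chain time to move away from its starting value, and keeping only every `thin`-th iteration separates the retained draws, as described above.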
Step 2: statistical analysis
The second step of MI analyzes the m sets of data separately using a statistical procedure of a researcher’s choice. At the end of the second step, m sets of parameter estimates are obtained from separate analyses of m data sets.
Step 3: combining results
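The third step of MI combines the m estimates into one. Rubin (1987) provided formulae for combining m point estimates and SEs into a single parameter estimate and its SE. Suppose $\hat{Q}_i$ denotes the estimate of a parameter Q (e.g., a regression coefficient) from the ith data set. Its corresponding estimated variance is denoted as $\hat{U}_i$. Then the pooled point estimate of Q is given by:
$$\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i \tag{4}$$
The variance of $\bar{Q}$ is the weighted sum of two variances: the within imputation variance ($\bar{U}$) and the between imputation variance (B). Specifically, these two variances and the total variance (T) are computed as follows:
$$\bar{U} = \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i \tag{5}$$
$$B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2 \tag{6}$$
$$T = \bar{U} + \left(1 + \frac{1}{m}\right)B \tag{7}$$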
In Equation 7, the $1/m$ term is an adjustment for the randomness associated with a finite number of imputations. Theoretically, estimates derived from MI with small m yield larger sampling variances than ML estimates (e.g., those derived from FIML), because the latter do not involve randomness caused by simulation.
The statistic $(Q - \bar{Q})/\sqrt{T}$ is approximately distributed as a t distribution. The degrees of freedom ($\nu_m$ or $\nu_m^*$) for this t distribution are calculated by Equations 8–10 (Barnard and Rubin 1999):
$$r = \frac{(1 + 1/m)B}{\bar{U}} \tag{8}$$
$$\nu_m = (m - 1)\left(1 + \frac{1}{r}\right)^2 \tag{9}$$
$$\nu_m^* = \left[\frac{1}{\nu_m} + \frac{1}{(1 - \gamma)\,\nu_0(\nu_0 + 1)/(\nu_0 + 3)}\right]^{-1} \tag{10}$$
In Equation 8, r is the relative increase in variance due to missing data, defined as the adjusted between-imputation variance standardized by the within-imputation variance. In Equation 10, $\gamma = (1 + 1/m)B/T$, and $\nu_0$ is the degrees of freedom if the data are complete. $\nu_m^*$ is a correction of $\nu_m$ for use when $\nu_0$ is small and the missing rate is moderate (SAS Institute Inc 2011).
According to Rubin (1987), the severity of missing data is measured by the fraction of missing information ($\hat{\lambda}$), defined as:
$$\hat{\lambda} = \frac{r + 2/(\nu_m + 3)}{r + 1} \tag{11}$$
As the number of imputations increases to infinity, $\hat{\lambda}$ is reduced to the ratio of the between-imputation variance over the total variance. In its limiting form, $\hat{\lambda}$ can be interpreted as the proportion of total variance (or total uncertainty) that is attributable to the missing data (Schafer 1999).
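The pooling rules for a single parameter are short enough to write out directly. The sketch below is an illustration in Python rather than the code of any particular package; the function name `pool_rubin`, its argument names, and the dictionary keys are our own choices. It computes the pooled estimate, the within, between, and total variances, the relative increase in variance, the degrees of freedom, and the fraction of missing information as described above.

```python
import math


def pool_rubin(estimates, variances, nu0=None):
    """Pool m point estimates and their squared SEs for one parameter.

    estimates: the m estimates of Q; variances: their m estimated variances.
    nu0: complete-data degrees of freedom (optional, for the adjusted df).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                # pooled estimate
    u_bar = sum(variances) / m                                # within variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)    # between variance
    t = u_bar + (1 + 1 / m) * b                               # total variance
    r = (1 + 1 / m) * b / u_bar                               # relative increase
    # If the m estimates are identical, r is 0 and the df are unbounded
    nu_m = (m - 1) * (1 + 1 / r) ** 2 if r > 0 else math.inf
    lambda_hat = (r + 2 / (nu_m + 3)) / (r + 1)               # missing information
    result = {
        "q_bar": q_bar, "u_bar": u_bar, "b": b, "t": t,
        "se": math.sqrt(t), "r": r, "nu_m": nu_m, "lambda_hat": lambda_hat,
    }
    if nu0 is not None:
        gamma = (1 + 1 / m) * b / t
        nu_obs = (1 - gamma) * nu0 * (nu0 + 1) / (nu0 + 3)
        result["nu_m_star"] = 1 / (1 / nu_m + 1 / nu_obs)     # adjusted df
    return result


if __name__ == "__main__":
    pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], nu0=10)
    for name, value in pooled.items():
        print(f"{name}: {value:.4f}")
```

In the small example, the three estimates differ considerably relative to their within imputation variance, so the between imputation component dominates the total variance and the estimated fraction of missing information is high.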
For multivariate parameter estimation, Rubin (1987) provided a method to combine several estimates into a vector or matrix. The pooling procedure is a multivariate version of Equations (4) through (7), which incorporates the estimates of covariances among parameters. Rubin’s method assumes that the fraction of missing information (i.e., $\hat{\lambda}$) is the same for all variables (SAS Institute Inc 2011). To our knowledge, no published studies have examined whether this assumption is realistic with real data sets, or whether Rubin’s method is robust to violation of this assumption.
MI related issues
When implementing MI, the researcher needs to be aware of several practical issues, such as the multivariate normality assumption, the imputation model, the number of imputations, and the convergence of MCMC. Each is discussed below.
The multivariate normality assumption
The regression and MCMC methods implemented in statistical packages (e.g., SAS) assume multivariate normality for variables. It has been shown that MI based on the multivariate normal model can provide valid estimates even when this assumption is violated (Demirtas et al. 2008; Schafer 1997, 1999). Furthermore, MI is robust to violations of this assumption when the sample size is large and when the missing rate is low, although the definition for a large sample size or for a low rate of missing is not specified in the literature (Schafer 1997).
When an imputation model contains categorical variables, one cannot use the regression method or MCMC directly. Techniques such as logistic regression and discriminant function analysis can substitute for the regression method if the missing data pattern is monotonic or univariate. If the missing data pattern is arbitrary, MCMC based on other probability models (such as the joint distribution of normal and binary) can be used for imputation. The free MI software NORM developed by Schafer (1997) has two add-on modules, CAT and MIX, that deal with categorical data. Specifically, CAT imputes missing data for categorical variables, and MIX imputes missing data for a combination of categorical and continuous variables. Other software packages are also available for imputing missing values in categorical variables, such as the ICE module in Stata (Royston 2004, 2005, 2007; Royston and White 2011), the mice package in R and S-Plus (van Buuren and Groothuis-Oudshoorn 2011), and IVEware (Raghunathan et al. 2001). Interested readers are referred to a special volume of the Journal of Statistical Software (Yucel 2011) for recent developments in MI software.
When researchers use statistical packages that impose a multivariate normal distribution assumption on categorical variables, a common practice is to impute missing values based on the multivariate normal model, then round the imputed value to the nearest integer or to the nearest plausible value. However, studies have shown that this naïve way of rounding does not provide desirable results for binary missing values (Ake 2005; Allison 2005; Enders 2010). For example, Horton et al. (2003) showed analytically that rounding the imputed values led to biased estimates, whereas imputed values without rounding led to unbiased results. Bernaards et al. (2007) compared three approaches to rounding binary missing values: (1) rounding the imputed value to the nearest plausible value, (2) randomly drawing from a Bernoulli trial using the imputed value, between 0 and 1, as the probability in the Bernoulli trial, and (3) using an adaptive rounding rule based on the normal approximation to the binomial distribution. Their results showed that the second method was the worst in estimating odds ratios, and the third method provided the best results. One merit of their study is that it is based on a real-world data set. However, other factors may influence the performance of the rounding strategies, such as the missing mechanism, the size of the model, and the distributions of the categorical variables. These factors are not within a researcher’s control. Additional research is needed to identify one or more good strategies for dealing with categorical variables in MI when multivariate normal-based software is used to perform MI.
Unfortunately, even less is known about the effect of rounding in MI when imputing ordinal variables with three or more levels. It is possible that as the number of levels of the categorical variable increases, the effect of rounding decreases. Again, studies are needed to further explore this issue.
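The three approaches to rounding imputed binary values compared by Bernaards et al. (2007) can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions on our part rather than statements from the text above: imputed values outside the interval from 0 to 1 are clipped before the Bernoulli draw, and the adaptive threshold is written as $\bar{\omega} - \Phi^{-1}(\bar{\omega})\sqrt{\bar{\omega}(1 - \bar{\omega})}$, where $\bar{\omega}$ is the mean of the unrounded variable. The threshold formula is our recollection of the adaptive rule and should be checked against the original article before use.

```python
import math
import random
from statistics import NormalDist, fmean


def simple_round(values):
    """Approach 1: round each imputed value to the nearest of 0 and 1."""
    return [1 if v >= 0.5 else 0 for v in values]


def bernoulli_draw(values, rng):
    """Approach 2: treat each imputed value as a success probability.

    Assumption: values below 0 or above 1 are clipped to 0 or 1 first.
    """
    return [1 if rng.random() < min(max(v, 0.0), 1.0) else 0 for v in values]


def adaptive_round(values):
    """Approach 3: round against a threshold that depends on the mean.

    Assumption: the threshold formula is our recollection of the adaptive rule.
    """
    w = min(max(fmean(values), 1e-6), 1 - 1e-6)  # keep inv_cdf well defined
    threshold = w - NormalDist().inv_cdf(w) * math.sqrt(w * (1 - w))
    return [1 if v > threshold else 0 for v in values]


if __name__ == "__main__":
    imputed = [0.15, 0.62, -0.08, 0.41, 0.93, 0.27]
    print(simple_round(imputed))
    print(bernoulli_draw(imputed, random.Random(3)))
    print(adaptive_round(imputed))
```

When the mean of the variable is .5, the adaptive threshold equals .5 and the rule coincides with simple rounding; the two rules diverge as the mean moves toward 0 or 1.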
The imputation model
MI requires two models: the imputation model used in step 1 and the analysis model used in step 2. Theoretically, MI assumes that the two models are the same. In practice, they can be different (Schafer 1997). An appropriate imputation model is the key to the effectiveness of MI; it should have the following two properties.
First, an imputation model should include useful variables. Rubin (1996) recommended a liberal approach when deciding if a variable should be included in the imputation model. Schafer (1997) and van Buuren et al. (1999) recommended three kinds of variables to be included in an imputation model: (1) variables that are of theoretical interest, (2) variables that are associated with the missing mechanism, and (3) variables that are correlated with the variables with missing data. The latter two kinds of variables are sometimes referred to as auxiliary variables (Collins et al. 2001). The first kind of variable is necessary, because omitting such variables will bias downward the relation between them and other variables in the imputation model. The second kind of variable makes the MAR assumption more plausible, because such variables account for the missing mechanism. The third kind of variable helps to estimate missing values more precisely. Thus, each kind of variable makes a unique contribution to the MI procedure. However, including too many variables in an imputation model may inflate the variance of estimates, or lead to non-convergence. Thus, researchers should carefully select variables to be included in an imputation model. van Buuren et al. (1999) recommended not including auxiliary variables that have too much missing data. Enders (2010) suggested selecting auxiliary variables that have absolute correlations greater than .4 with variables with missing data.
Second, an imputation model should be general enough to capture the assumed structure of the data. If an imputation model is more restrictive than the analysis model, namely, it imposes restrictions that the analysis model does not, one of two consequences may follow. One consequence is that the results are valid but the conclusions may be conservative (i.e., failing to reject a false null hypothesis), if the additional restrictions are true (Schafer 1999). Another consequence is that the results are invalid because one or more of the restrictions is false (Schafer 1999). For example, an imputation model may restrict the relationship between a variable and other variables in the model to be merely pairwise. Consequently, any interaction effect that involves at least three variables will be biased toward zero. To handle interactions properly in MI, Enders (2010) suggested that the imputation model include the product of the two variables if both are continuous. For categorical variables, Enders suggested performing MI separately for each subgroup defined by the combination of the levels of the categorical variables.
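Returning to the selection of auxiliary variables, the correlation-based screening suggested by Enders (2010) can be sketched as follows. The helper names `pearson_r` and `select_auxiliary`, the use of `None` for missing values, and the choice to compute each correlation from pairwise complete cases are assumptions made for this illustration; the .4 cutoff is the one cited above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation using only cases observed on both variables."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def select_auxiliary(target, candidates, cutoff=0.4):
    """Return names of candidates whose |r| with the target exceeds the cutoff."""
    return [
        name for name, values in candidates.items()
        if abs(pearson_r(target, values)) > cutoff
    ]


if __name__ == "__main__":
    incomplete = [1.0, 2.0, 3.0, 4.0, None]
    candidates = {
        "a": [2.0, 4.0, 6.0, 8.0, 10.0],
        "b": [1.0, -1.0, -1.0, 1.0, 1.0],
    }
    print(select_auxiliary(incomplete, candidates))
```

A screen of this kind addresses only the third kind of variable; variables related to the missing mechanism still need to be identified on substantive grounds.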
Number of imputations
The number of imputations needed in MI is a function of the rate of missing information in a data set. A data set with a large amount of missing information requires more imputations. Rubin (1987) provided a formula to compute the relative efficiency of imputing m times, instead of an infinite number of times: $RE = (1 + \hat{\lambda}/m)^{-1}$, where $\hat{\lambda}$ is the fraction of missing information, defined in Equation 11.
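The relative efficiency formula is easy to tabulate. The short Python sketch below, with a hypothetical function name `relative_efficiency`, evaluates RE for a few illustrative combinations of the fraction of missing information and the number of imputations; the particular values in the loops are ours and carry no special meaning.

```python
def relative_efficiency(lam, m):
    """Relative efficiency of m imputations versus infinitely many.

    lam: fraction of missing information (between 0 and 1); m: imputations.
    """
    return 1 / (1 + lam / m)


if __name__ == "__main__":
    for lam in (0.1, 0.5, 0.9):
        row = ", ".join(
            f"m={m}: {relative_efficiency(lam, m):.3f}" for m in (3, 5, 20, 100)
        )
        print(f"lambda={lam}: {row}")
```

With a fraction of missing information of .5, for example, five imputations give an RE of about .91 and twenty give about .98, which is why RE on its own suggests that a small m is sufficient.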
However, methodologists have not agreed on the optimal number of imputations. Schafer and Olsen (1998) suggested that “in many applications, just 3–5 imputations are sufficient to obtain excellent results” (p. 548). Schafer and Graham (2002) were more conservative in asserting that 20 imputations are enough in many practical applications to remove noise from the estimates. Graham et al. (2007) commented that RE should not be an important criterion when specifying m, because RE has little practical meaning. Other factors, such as the SE, p-value, and statistical power, are more closely related to empirical research and should also be considered, in addition to RE. Graham et al. (2007) reported that statistical power decreased much faster than RE as λ increased and/or m decreased. In an extreme case in which λ = .9 and m = 3, the power for MI was only .39, while the power of an equivalent FIML analysis was .78. Based on these results, Graham et al. (2007) provided a table for the number of imputations needed, given λ and an acceptable power falloff, such as 1%. They defined the power falloff as the percentage decrease in power, compared to an equivalent FIML analysis, or compared to m = 100. For example, to ensure a power falloff less than 1%, they recommended m = 20, 40, 100, or > 100 for a true λ = .1, .5, .7, or .9, respectively. Their recommended m is much larger than what is derived from the Rubin rule based on RE (Rubin 1987). Unfortunately, Graham et al.’s study is limited to testing a small standardized regression coefficient (β = 0.0969) in a simple regression analysis. The power falloff of MI may be less severe when the true β is larger than 0.0969. At present, the literature does not shed light on the performance of MI when the regression model is more complex than a simple regression model.
Recently, White et al. (2011) argued that in addition to relative efficiency and power, researchers should also consider Monte Carlo errors when specifying the optimal number of imputations. Monte Carlo error is defined as the standard deviation of the estimates (e.g., regression coefficients, test statistics, p-values) “across repeated runs of the same imputation procedure with the same data” (White et al. 2011, p. 387). Monte Carlo error converges to zero as m increases. A small Monte Carlo error implies that results from a particular run of MI could be reproduced in a subsequent repetition of the MI analysis. White et al. also suggested that the number of imputations should be greater than or equal to the percentage of missing observations in order to ensure an adequate level of reproducibility. For studies that compare different statistical methods, the number of imputations should be even larger than the percentage of missing observations, usually between 100 and 1000, in order to control the Monte Carlo error (Royston and White 2011).
It is clear from the above discussions that a simple recommendation for the number of imputations (e.g., m = 5) is inadequate. For data sets with a large amount of missing information, more than five imputations are necessary in order to maintain the power level and control the Monte Carlo error. A larger imputation model may require more imputations than a smaller or simpler model, because a large imputation model results in increased SEs. Therefore, for a large model, additional imputations are needed to offset the increased SEs. Specific guidelines for choosing m await empirical research. In general, it is a good practice to specify a sufficient m to ensure the convergence of MI within a reasonable computation time.
Convergence of MCMC
The convergence of the Markov Chain is one of the determinants of the validity of the results obtained from MI. If the Markov Chain does not converge, the imputed values are not considered random samples from the posterior distribution of the missing data, given the observed data, i.e., $P(Y_{mis} \mid Y_{obs})$. Consequently, statistical results based on these imputed values are invalid. Unfortunately, the importance of assessing the convergence was rarely mentioned in articles that reviewed the theory and application of MCMC (Schafer 1999; Schafer and Graham 2002; Schlomer et al. 2010; Sinharay et al. 2001). Because the convergence is defined in terms of both probability and procedures, it is complex and difficult to determine the convergence of MCMC (Enders 2010). One way to roughly assess convergence is to visually examine the trace plot and the autocorrelation function plot; both are provided by SAS PROC MI (SAS Institute Inc 2011). For a parameter θ, a trace plot displays the value of $\theta^{(t)}$ on the vertical axis against the iteration number (t) on the horizontal axis. If the MCMC converges, there is no indication of a systematic trend in the trace plot. The autocorrelation plot displays the autocorrelations between $\theta^{(t)}$s at lag k on the vertical axis against k on the horizontal axis. Ideally, the autocorrelation at any lag should not be statistically significantly different from zero. Since the convergence of a Markov Chain may be at different rates for different parameters, one needs to examine these two plots for each parameter. When there are many parameters, one can choose to examine the worst linear function (or WLF, Schafer 1997). The WLF is a constructed statistic that converges more slowly than all other parameters in the MCMC method. Thus if the WLF converges, all parameters should have converged (see pp. 2–3 of the Appendix for an illustration of both plots for WLF, accessible from https://oncourse.iu.edu/access/content/user/peng/Appendix.Dong%2BPeng.Principled%20missing%20methods.current.pdf). Another way to assess the convergence of MCMC is to start the chain multiple times, each with a different initial value. If all the chains yield similar results, one can be confident that the algorithm has converged.
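The quantity plotted in an autocorrelation plot can be computed directly from a saved chain of parameter draws. The following Python sketch is an illustration only: the function name `autocorrelation` and the list-based representation of the chain are our assumptions, and the estimator shown is the usual sample autocorrelation, which may differ in minor details from what a given package reports.

```python
def autocorrelation(chain, lag):
    """Sample autocorrelation of a sequence of parameter draws at a given lag."""
    n = len(chain)
    mean = sum(chain) / n
    denom = sum((x - mean) ** 2 for x in chain)
    num = sum(
        (chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag)
    )
    return num / denom


if __name__ == "__main__":
    # A deliberately sticky toy chain: each value stays close to the previous one
    chain = [0.0]
    for step in range(1, 200):
        chain.append(0.9 * chain[-1] + (1 if step % 2 else -1) * 0.1)
    for k in (1, 5, 10, 20):
        print(f"lag {k}: {autocorrelation(chain, k):.3f}")
```

Autocorrelations that remain far from zero at large lags suggest that successive draws are still dependent, in which case more iterations between the retained imputations may be needed.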