 Methodology
 Open Access
Impact of alternative approaches to assess outlying and influential observations on health care costs
SpringerPlus volume 2, Article number: 614 (2013)
Abstract
The distributions of medical costs are often skewed to the right because small numbers of patients use large amounts of health care resources. Using data from a study of colon cancer costs, we showed, by example, the impact and magnitude of outliers and influential observations on health care costs and compared the effects of statistical costing methods for addressing the disproportionate influence of outliers and influential observations. We used data from a retrospective cohort study of 3,842 elderly veterans with colon cancer who were enrolled in, and used health care from, both the Department of Veterans Affairs and Medicare in 1999–2004. After calculating the average colon cancer episode cost and distribution for the full cohort, we used boxplot methods, Winsorization, DFBETAs, and Cook's distance to identify and assess or adjust the outlying and/or influential observations. The number of observations identified as outlying and/or influential ranged from 13 when the predicted DFBETA measurement was greater than 0.15 and the observation was a qualified boxplot outlier to 384 cases using the Winsorization method at the 5th and 95th percentiles. Average costs of colon cancer episodes using these methods were similar. Based on the results of this particular analysis, the method of choice depends on whether the purpose is to control only for influential observations or to simultaneously control for outliers and influential observations. Understanding how estimates could change with each approach is important in assessing the impact of a particular method on the results.
Introduction
Determining the costs of episodes of medical care is an important step in making policy decisions about allocating health care resources. However, as has been well documented in the literature, accurately estimating costs is challenging due to right skewing when small numbers of patients use larger amounts of health care resources than most other patients (Mullahy 1998). In 2009, for example, 22% of total health care expenditures in the United States were allocated to just 1% of the U.S. population, and almost 50% of health care spending was devoted to 5% of the population (Cohen and Yu 2012). In addition, no single estimator is appropriate for all of the processes typically used to generate health care costs data (Basu et al. 2011). The data values for patients at the extreme ends of the value range do not represent the typical experience and can disproportionately influence statistical point estimates. The lack of symmetry, or skewness that is frequently observed in medical cost data, is characterized by these extreme values, known as outliers.
Statistical procedures are useful to identify cases that have deviated from other cases in the sample, resulting in skewness in large datasets. Some of the statistical techniques are nonparametric and avoid assumptions that the data are represented by a particular statistical distribution.
In the medical literature, outliers are often identified by selecting data on patients with the highest costs based on statistical trimming rules (Gregori et al. 2009). Researchers often use cutoff levels ranging from the upper 0.5% to 20% of the cost distribution, for example. Other approaches include selecting outliers based on the geometric mean plus one or more standard deviations or the interquartile method (Cots et al. 2003; Pirson et al. 2006). The arithmetic mean is then calculated based on the data that remain after the outliers have been trimmed. Disadvantages of these approaches are that the analysis results are relevant only to the sample used and the findings cannot be compared to those of other studies.
In addition to identifying outlying cases in a sample, investigators frequently identify observations that are influential. An influential observation is a type of outlying observation whose exclusion results in major changes in the fitted regression function or parameters. Usually, observations exhibiting high leverage (potential to influence regression results) and large residual (in absolute value) are influential. Although all influential observations are outliers, not all outliers are influential observations.
Standard linear regression models are often used to predict average costs for patients because these models are easy to use and their results are easy to interpret. However, these models are based on the assumption that the regression errors have a normal distribution and linear relationships (Paddock et al. 2004; Barber and Thompson 2004). When these assumptions are violated, as in data on costs of episodes of care with values that are markedly different from the rest of the sample, these models are not appropriate.
Generalized linear models (GLMs) can accommodate skewness in large datasets by weighting variances (Blough and Ramsey 2000). Using these models involves specifying an appropriate model for the mean of the outcome variable and the correct mean-variance relationship (variance function) (Mihaylova et al. 2011). Parameters are then estimated after these structural assumptions are taken into consideration. The mean function estimates from GLMs are generally robust, and GLMs are less sensitive than linear regression models to outliers and/or influential observations. However, misspecifying the variance function in GLMs could result in losses of precision. Also, GLMs can lose efficiency if the data have a large log-scale error variance or the distribution of errors on the log scale is symmetrical but has a heavy tail (Manning and Mullahy 2001; Mihaylova et al. 2011).
Several statistical techniques can be used to identify and address outlying and/or influential cases in highly skewed cost datasets, potentially improving the precision and efficiency of GLMs. Techniques to assess outliers include boxplot analysis (interquartile method), which involves the use of distributional characteristics to identify outliers (Pirson et al. 2006). Winsorization can be used to transform the costs of outlier episodes so that they are equal to a preestablished percentile of the data (Thomas and Ward 2006). For example, if the maximum percentile is set at 95% and the minimum at 5%, Winsorization transforms costs for patients with costs above the 95th percentile to the costs of patients in the 95th percentile and those with costs in the bottom 5% to the costs of patients in the 5th percentile. Approaches to identify influential observations include DFBETAs, which are measures of standardized differences between regression coefficients when a given observation is included or excluded (Choi 2009). Cook’s distance, another method for identifying influential observations, summarizes the influence of each observation on the fitted model parameters after deleting each observation from the estimation and measuring the resulting aggregate changes in estimated costs (Indurkhya et al. 2001).
The goal of this study was to demonstrate, by example, how to identify and handle outliers and how to assess and handle influential observations by measuring their magnitude and impact on colon cancer-related costs (including average episode-based costs and key cost drivers). This study also compared the effects of statistical costing methods and approaches for overcoming the disproportionate influence of outliers and influential observations.
Methods
Study design
We examined data from a retrospective cohort study of veterans aged 66 years or older with colon cancer who were enrolled in both the Department of Veterans Affairs (VA) and Medicare between July 1999 and December 2001. Data included health care use and cost data from the VA; Medicare; eight National Cancer Institute Surveillance, Epidemiology, and End Results (SEER)-affiliated cancer registries; and the VA Central Cancer Registry. A description of a similar cohort is available elsewhere (Tarlov et al. 2012). We excluded patients who had no colon cancer-related costs, were enrolled in a Medicare health maintenance organization, or had an unknown cancer stage at diagnosis. The final sample comprised 3,842 elderly veterans with stages I–IV colon cancer.
The Edward Hines, Jr. VA Hospital institutional review board (IRB) and the IRBs of the SEER registries approved the study and waived the requirement for informed consent.
Measures and data sources
We measured colon cancer-related costs in the 12 months following diagnosis; the methods are described elsewhere (Hynes et al. 2010). In brief, we classified encounters in Medicare claims and VA records during this period as colon cancer related if they included an International Classification of Diseases, 9th revision, colon cancer diagnosis or colectomy procedure code; Current Procedural Terminology, 4th edition, chemotherapy or colectomy procedure code; Medicare revenue center code; VA outpatient clinic stop code; or Pharmacy Benefits Management (PBM) pharmacy class code for chemotherapy or chemotherapy-related service.
We based costs of services provided through Medicare on payments in institutional inpatient (Medicare Provider Analysis and Review file) and outpatient (Outpatient Standard Analytical File) claims. We also included allowed charge amounts from noninstitutional provider claims for care provided under Medicare (Carrier file). We obtained data on costs of care provided through the VA from the Health Economic Resource Center (HERC) Average Cost datasets. HERC estimated average costs for VA inpatient stays using a Medicare cost function estimate developed using patient admission characteristics (Wagner et al. 2003). HERC estimated average costs for VA outpatient visits based on reimbursement rates from Medicare and other health care payers and adjusted these payments to reflect the actual aggregate cost of VA outpatient care (Phibbs et al. 2003). We used VA Fee Basis data (Inpatient, Inpatient Ancillary, and Outpatient files) to identify costs of covered care provided to VA patients outside of VA facilities. Our VA pharmacy costs came from PBM data. The costs we calculated did not include the costs of home health, long-term care (VA only), or hospice care.
We combined the colon cancer-related health care costs for VA and Medicare to determine the costs of a 12-month colon cancer episode of care for each patient in our cohort. We used the Consumer Price Index to adjust these costs to 2004 dollars (Bureau of Labor Statistics and U.S. Department of Labor 2012).
Approaches to identify outliers and influential observations
We examined four approaches, alone or in combination, for identifying and assessing or adjusting outliers (boxplot analysis and Winsorization) and influential observations (DFBETAs and Cook’s distance) in our full cohort (Tukey 1962; Barnett and Lewis 1994).
The boxplot (interquartile method) is a graphical approach that displays the distribution of data and indicates which observations might be outliers (Pirson et al. 2006). We identified observations from the full cohort as boxplot outliers if ln(cost) > Q3 + 1.5*IQR or ln(cost) < Q1 – 1.5*IQR, where ln refers to the natural logarithm, Q3 is the 75th percentile (upper quartile), Q1 is the 25th percentile (lower quartile), and the interquartile range (IQR) is Q3 – Q1. We used the natural logarithm transformation because the link function we chose for our examination of the GLMs was the logarithmic function.
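The fence rule above can be illustrated with a minimal sketch. The cost values are hypothetical, and the quartile definition (the "inclusive" interpolation of Python's `statistics.quantiles`) is an assumption; statistical packages differ in how they compute Q1 and Q3, so the flagged set may vary slightly between implementations.

```python
import math
import statistics

def boxplot_outliers(costs):
    """Flag costs whose natural log falls outside Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    logs = [math.log(c) for c in costs]
    # n=4 returns the three cut points [Q1, median, Q3]; the "inclusive"
    # interpolation is an assumed choice, not the one documented in the study.
    q1, _, q3 = statistics.quantiles(logs, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v < lower or v > upper for v in logs]

# Hypothetical episode costs in dollars (not study data).
costs = [43, 9_500, 12_000, 15_000, 21_000, 28_000, 35_000, 44_000, 60_000, 679_472]
flags = boxplot_outliers(costs)
outlying_costs = [c for c, f in zip(costs, flags) if f]
```

Because the rule is applied on the log scale, it can flag very small costs as lower outliers as well as very large costs as upper outliers.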
Winsorization involves replacing (or limiting) extreme values to reduce the effect of outlying values (Thomas and Ward 2006). We Winsorized costs at the 2nd and 98th percentiles by assigning the cost of the 2nd percentile to observations with costs less than that value and by assigning costs of the 98th percentile to costs above that value. In an additional analysis, we Winsorized costs at the 5th and 95th percentiles.
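As a rough sketch of this replacement step, the function below clamps values to a lower and an upper percentile. The data are hypothetical, and the percentile definition is again the assumed "inclusive" interpolation from the Python standard library rather than the one used in the study's software.

```python
import statistics

def winsorize(costs, lower_pct=2, upper_pct=98):
    """Replace values below/above the given percentiles with those percentile values."""
    # n=100 returns 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(costs, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    adjusted = [min(max(c, lo), hi) for c in costs]
    return adjusted, (lo, hi)

# 101 hypothetical costs: 1, 2, ..., 101 (not study data).
costs = list(range(1, 102))
adjusted, (lo, hi) = winsorize(costs, lower_pct=5, upper_pct=95)
n_replaced = sum(1 for a, c in zip(adjusted, costs) if a != c)
```

Unlike exclusion-based methods, Winsorization keeps every observation in the sample; only the extreme values change.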
DFBETAs measure, for each regressor in the model, the standardized difference between the regression coefficient when the jth observation is included or excluded. This measurement can be used to determine an observation’s magnitude of influence on each regression parameter estimate. We predicted DFBETA measurements for each regressor in the model. We identified an observation as influential if the absolute value of the predicted DFBETA measurements for stage at diagnosis and colectomy (key cost-driving characteristics) was greater than the size-adjusted cutoff value of 2/√N (here 2/√3,842, or approximately 0.03) (Belsley et al. 1980). We also used 0.15 as a cutoff value for identifying an observation as influential because 10–15% change-in-estimate criteria are frequently used to assess confounding in epidemiological studies (Rothman et al. 2008).
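To make the definition concrete, the following sketch computes DFBETAs by brute-force deletion for an ordinary least squares fit. This is an illustrative assumption: the simulated data, the single regressor, and the use of least squares in place of the study's GLM are all hypothetical simplifications, and statistical packages normally use an equivalent closed-form formula instead of refitting.

```python
import numpy as np

def dfbetas(X, y):
    """Leave-one-out DFBETAs for an OLS fit; X must contain an intercept column."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    scale = np.sqrt(np.diag(xtx_inv))
    out = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        beta_i, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
        resid_i = yi - Xi @ beta_i
        # Residual standard deviation estimated without observation i.
        s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))
        out[i] = (beta - beta_i) / (s_i * scale)
    return out

# Simulated data (hypothetical): one high-leverage point with a large residual.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 6.0, -20.0
X = np.column_stack([np.ones(n), x])

d = dfbetas(X, y)
cutoff = 2 / np.sqrt(n)                  # size-adjusted cutoff, 2/sqrt(N)
flagged = np.abs(d[:, 1]) > cutoff       # influence on the slope coefficient
study_cutoff = 2 / np.sqrt(3842)         # approximately 0.03 for N = 3,842
```

The size-adjusted cutoff shrinks as the sample grows, which is why it flags far more observations in a cohort of 3,842 than a fixed cutoff of 0.15 does.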
Cook’s distance is a technique to measure the aggregate change in the estimated parameter coefficients when each observation is omitted from the estimation and then summarize how each observation influences the fitted model (Indurkhya et al. 2001). We identified observations from the full cohort as influential if their predicted Cook’s distance measurement was greater than the conventional size-adjusted cutoff value of 4/N (here 4/3,842, or approximately 0.001) (Fox 1991).
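A minimal sketch of Cook's distance with the 4/N cutoff follows, again for an ordinary least squares fit on simulated data. Both are assumptions made for illustration; the study applied the measure within its own modeling framework. The closed form below uses the leverage (diagonal of the hat matrix) and is algebraically equivalent to refitting the model without each observation.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation in an OLS fit (X includes an intercept)."""
    n, p = X.shape
    hat = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(hat)                      # leverage of each observation
    resid = y - hat @ y
    s2 = resid @ resid / (n - p)          # residual variance estimate
    return resid**2 / (p * s2) * h / (1 - h) ** 2

# Simulated data (hypothetical): one high-leverage point with a large residual.
rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[0], y[0] = 6.0, -20.0
X = np.column_stack([np.ones(n), x])

cooks = cooks_distance(X, y)
influential = cooks > 4 / n               # conventional size-adjusted cutoff
```

Whereas DFBETAs give one value per coefficient, Cook's distance gives a single value per observation that summarizes its influence on all coefficients together.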
We also considered an observation from the full cohort to be influential and outlying if the predicted DFBETA measurement was greater than 0.15 and the observation was a qualified boxplot outlier.
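The combined criterion amounts to a logical AND of the two flags. The sketch below uses hypothetical costs, DFBETA magnitudes, and boxplot flags; how a boxplot outlier "qualifies" is not detailed here, so the flags are simply taken as given.

```python
def combine_flags(influential, outlying):
    """An observation is selected only if it is both influential and outlying."""
    return [a and b for a, b in zip(influential, outlying)]

# Hypothetical values for five episodes (not study data).
costs = [12_000, 30_000, 45_000, 250_000, 400_000]
dfbeta_abs = [0.01, 0.02, 0.20, 0.18, 0.40]
is_boxplot_outlier = [False, False, False, True, True]

influential = [d > 0.15 for d in dfbeta_abs]
both = combine_flags(influential, is_boxplot_outlier)

# Mean cost of the episodes retained after excluding the selected cases.
retained = [c for c, flag in zip(costs, both) if not flag]
mean_retained = sum(retained) / len(retained)
```

In this example the third episode exceeds the DFBETA cutoff but is not a boxplot outlier, so it is retained, which mirrors how the combined rule selects fewer cases than the DFBETA cutoff alone.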
Identification and comparison of outlying/influential observations
We calculated the average episode of care cost and distribution for the full cohort. We then identified outlying and/or influential observations using boxplot methods, DFBETAs, and Cook’s distance, and assessed the impact on our calculations of not including these observations. We also adjusted cost values for outlying observations using the Winsorization method. We compared the average costs of each episode of care to those of the cohorts we identified using these methods for handling outliers and influential observations.
Multivariate analysis
We used multivariate GLMs (gamma family based on modified Park test (Manning and Mullahy 2001) with log link, where ln(E(y|x)) = xβ), to evaluate the association between select key cost-driving characteristics (stage at diagnosis and colectomy) and 12-month colon cancer episode costs of care, while controlling for additional factors. We also performed the GLM modeling using the Poisson and inverse Gaussian families to compare the robustness of our parameter estimates. We calculated estimated expense rate ratios (ERRs) and 95% confidence intervals (CIs). We then compared the key cost-driving variable estimates and the CI widths as a measure of precision from the full cohort to the estimates we obtained after employing the approaches for handling outliers and influential observations described above. Finally, we calculated postmodeling adjusted cost predictions for the key cost-driving variables from the full cohort and we compared these to the cost predictions calculated after we employed the approaches described above.
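The model structure can be written out as follows. The subscripts, the exponent λ, and the reading of the ERR as an exponentiated coefficient are notational assumptions added for illustration. The ERR reading is the usual interpretation of a coefficient under a log link but is not stated explicitly in the text.

```latex
\begin{aligned}
\ln \mathrm{E}(y_i \mid x_i) &= x_i^{\top}\beta, \\
\mathrm{Var}(y_i \mid x_i) &\propto \left[\mathrm{E}(y_i \mid x_i)\right]^{\lambda}, \\
\mathrm{ERR}_j &= \exp(\beta_j).
\end{aligned}
```

Under this notation, λ = 1 corresponds to a Poisson-type variance, λ = 2 to the gamma family, and λ = 3 to the inverse Gaussian family; the modified Park test estimates λ by regressing the log of the squared raw-scale residuals on the log of the predicted values. As a worked example, an ERR of 1.51 corresponds to a coefficient of ln(1.51) ≈ 0.41, that is, expected costs 51% higher when the indicator equals 1, with other covariates held fixed.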
We used SAS (version 9.3; SAS Institute, Cary, NC) and Stata® MP software (version 12.1, Stata, College Station, TX) for our analyses. Figures were produced using Stata®.
Results
Cohort characteristics
Among the 3,842 veterans with colon cancer in our cohort who were enrolled in both the VA and Medicare between 1999 and 2001, the average age was 76 years (standard deviation [SD] = 5.7), 96.5% were male, and 15.5% were African American (Table 1). Of these veterans, 26.8% had Stage I, 30.7% had Stage II, 23.2% had Stage III, and 19.3% had Stage IV colon cancer. In addition, 89.4% had undergone cancer-directed colectomy and 33.6% had received chemotherapy within the 12 months following diagnosis. Twenty-three percent had a modified Deyo-Charlson comorbidity score with Romano adaptations of 2 or higher (higher scores indicate a worse baseline health status) (Charlson et al. 1987; Romano et al. 1993; Klabunde et al. 2000; Klabunde et al. 2006). The average cost of colon cancer episodes for the cohort was $38,327 (SD = 37,388), with a range of $43 to $679,472 (Figure 1).
Comparisons after the identification of outlying/influential observations
The number of observations we identified as outlying and/or influential varied widely depending on the method we employed.
The boxplot method identified 227 observations as outlying (Table 2). Based on their distribution, 45 observations were upper outlying values and 182 were lower outlying values. Cases identified as outlying using the boxplot method had the lowest average cost ($52,952) of all the methods we used, and the boxplot method identified the second highest number of outlying cases.
Winsorization at the 2nd and 98th percentiles replaced 152 observations (76 observations in the lower end and 76 in the upper end; Table 2). By definition, Winsorization at this level replaced the top 2% of observations in the right tail. The cases identified by this method had an intermediate average cost ($108,152) compared to the other methods. Winsorization at the 5th and 95th percentiles replaced 384 observations (192 observations in the lower end and 192 in the upper end). Winsorization at this level replaced the top 5% of observations in the right tail. The average cost ($77,669) of outlying cases was lower for Winsorization at this level than at the 2nd and 98th percentiles.
The DFBETA method identified 275 observations as influential at the 0.03 cutoff value and 16 at the 0.15 cutoff value (Table 2). The 0.03 threshold, as expected, identified a much larger number of influential observations (more than 15 times as many) than the 0.15 threshold. This method identified observations that were influential on both the upper and lower ends, as shown by the lowest cost of all cases identified as influential, $100. The 0.03 threshold resulted in a lower average cost ($99,398) for cases identified as influential compared to the other influential observation methods. The average cost ($265,093) of influential observations identified using the DFBETA method was higher with a 0.15 threshold than a 0.03 threshold, and the minimum cost of influential cases identified using the 0.15 threshold was $50,397.
The Cook’s distance method identified 113 observations as influential using the specified cutoff value (Table 2). Among these influential cases, the average cost ($164,845) was higher than for cases identified with the boxplot and Winsorization methods. The lowest cost of all cases identified as influential by the Cook’s distance method was $33,642.
The method that combined a DFBETA threshold of 0.15 and qualified boxplot outliers identified 13 observations as influential and outlying. Imposing the additional boxplot outlier criterion led to the selection of three fewer cases than the DFBETA method with a 0.15 threshold alone (Table 2). Compared to the other methods, this combined method had the highest average cost ($299,690) for cases identified as influential while identifying the smallest number of influential cases. In addition, the minimum cost, at $174,413, was the highest of all the methods we used.
The average 12-month episode of care costs in the cohorts generated using all of the methods for handling outliers and influential observations were similar (Table 3). The average cost for each colon cancer episode was lowest ($33,619, SD = 22,633, range $43–$210,530) in the cohort generated using the DFBETA method with a threshold of 0.03. The average colon cancer episode cost was highest, at $37,440 (SD = 33,754; range $43–$679,472), in the analysis that combined a DFBETA threshold of 0.15 and qualified boxplot outliers.
Multivariate analysis comparisons
The GLM regression results using the gamma family (Figure 2) for the full cohort indicate that costs were 51% higher in patients who underwent colectomy (ERR: 1.51, 95% CI: 1.31–1.73) than in those who did not have a colectomy. The colectomy ERRs were similar (range 1.37–1.58) after we employed each of the approaches for handling outliers and influential observations, except for the boxplot method for defining outliers, which resulted in an ERR for colectomy of 1.18. The stage at diagnosis ERRs for the full cohort were of similar magnitude to those obtained with each of the outlier/influential observation methods; the estimates from some of the methods were consistently lower than the estimates from the full cohort, and those from others were consistently higher. When we examined the CIs for each of the key cost-driving variables, the widths were consistently narrowest for the DFBETA method with a threshold of 0.03 and widest for the method that combined a DFBETA threshold of 0.15 and qualified boxplot outliers.
The parameter estimates generated with GLM modeling using the Poisson family (results not shown) were qualitatively similar to the estimates that resulted from our use of the gamma family GLM (i.e., the stage at diagnosis and colectomy estimates were in the same direction after we used each method for identifying outliers and/or influential observations). The results were also quantitatively similar. All estimates produced from the Poisson modeling were closer to the null hypothesis value than the gamma family estimates, except for the DFBETA method with a threshold of 0.03, which produced estimates for Stage II and Stage IV colon cancer that were further from the null hypothesis value than the gamma family estimates. However, this difference was small.
The results of the GLM modeling using the inverse Gaussian family (results not shown) were also qualitatively similar to the gamma family estimates. The magnitude of the estimates was consistently larger for the inverse Gaussian modeling. The estimates from all methods for identifying outliers and/or influential observations in the inverse Gaussian modeling were further from the null hypothesis value than the gamma family estimates, except for the full sample, boxplot method, and Winsorization at the 5th and 95th percentiles, whose estimates for stage IV colon cancer were closer to the null hypothesis value than the gamma family estimates. Again, these differences appeared to be negligible.
Postmodeling predictions revealed that the adjusted costs for patients grouped by stage at diagnosis and colectomy status were consistently lower for each of the methods for identifying outliers and/or influential observations compared to the full sample. Exceptions were the boxplot method, which yielded higher predictions for Stage I colon cancer and patients who did not have a colectomy, and Winsorization at the 5th and 95th percentiles, which yielded higher predictions for patients who did not have a colectomy (Figure 3). Although the ERR estimates were qualitatively similar to one another, the adjusted averages varied depending on the method used. The method that combined a DFBETA threshold of 0.15 and qualified boxplot outliers yielded the predicted adjusted average costs that were most often closest to those of the full sample, while selecting the smallest number of cases.
Discussion
In this study, we examined four approaches, alone and in combination, for addressing outliers and influential observations in a cohort of 3,842 elderly veterans with colon cancer. The number of observations we identified as outlying and/or influential varied widely depending on the method we employed, from 13 cases when the predicted DFBETA measurement was greater than 0.15 and the observation was a qualified boxplot outlier to 384 cases when we used the Winsorization method at the 5th and 95th percentiles. The average cost of outlying/influential observations ranged from $52,952 with the boxplot method to $299,690 with the combination of a DFBETA threshold of 0.15 and qualified boxplot outliers. But in spite of these differences, the average costs of colon cancer episodes in the cohorts we identified using all of these methods for handling outliers and influential observations were similar.
The variations in the numbers of observations identified as outlying and/or influential by the different methods we employed can be explained by each method’s ability to distinguish between different degrees of skewness to the right. The boxplot method, which identified slightly more than 1% of the skewness to the right, might have overemphasized the lower values of the cost distribution. Similarly, Winsorization might have placed too much emphasis on the lower percentiles of the distribution. The fact that the DFBETA method with the 0.03 threshold resulted in a lower average cost for cases identified as influential compared to the other influential observation methods demonstrates that this method identified more cases on the lower end of the right-skewed distribution. However, the fact that the average cost ($265,093) of influential observations was higher and the minimum cost was $50,397 with the DFBETA method and a 0.15 threshold demonstrates that more cases were removed to the right of the population average and that this method selects high leverage values with large residual error. The minimum cost of all cases identified as influential by Cook’s distance, at $33,642, shows that the Cook’s distance method identifies the larger costs of a right-skewed cost distribution. Using the method that combined a DFBETA threshold of 0.15 and qualified boxplot outliers resulted in the highest average cost ($299,690) for cases identified as influential, the smallest number of influential cases, and the highest minimum cost for influential cases of all the methods we used. This method targets those observations that are skewed to the right and have a greater than 15% effect on the parameter estimate.
All of the methods for handling outliers and influential observations appeared to yield similar results with regard to the average cost estimates. The number of cases that we identified as influential was highest when we used the DFBETA method with a threshold of 0.03, which explains why the mean cost was lowest in the cohort generated with this method. The average colon cancer episode cost was highest, at $37,440 (SD = 33,754; range $43–$679,472), in the analysis that used the combination of a DFBETA threshold of 0.15 and qualified boxplot outliers because this method identified the smallest number of cases. Although these cases were highly influential and outlying, their number was too small to induce a major change in the cost of the average colon cancer episode.
The colectomy ERRs were similar (range 1.37–1.58) after we employed each of the approaches for handling outliers and influential observations, except for the boxplot method for defining outliers, which resulted in an ERR of 1.18. The identification and handling of cases on both the lower and upper ends of the distribution in the boxplot method greatly reduced the margin of difference in cost between the colectomy and no-colectomy cases.
The ERR estimates for stage at diagnosis were consistently lower for the boxplot and Winsorization methods than the ERR estimates for the full cohort. The upper outlying cost values from the cohort identified using the boxplot method had a larger impact on the regression estimates than the lower outlying values because the estimates for each cancer stage were consistently lower than the estimates for each stage in the full cohort even though the boxplot method identified more lower outlying values than higher outlying values. If the patients with the highest costs in our original cohort tended to have Stage III or Stage IV colon cancer, the Winsorization processes were most likely to adjust for these higher costs. As a result, Winsorization consistently yielded stage estimates that were lower than for the full cohort.
Winsorization at the 5th and 95th percentiles resulted in estimates that were lower than Winsorization at the 2nd and 98th percentiles. One likely reason for this was that the method adjusted the higher costs associated with advanced-stage cases to a smaller value (the value of cases at the 95th percentile) than the costs of cases in the 98th percentile, and the costs of Stage I cases, which were lower than the costs of more advanced-stage cases, were adjusted to the cost of cases in the 5th percentile, which was higher than the costs of cases in the 2nd percentile. When we compared the regression estimates of the full sample to the estimates from the boxplot and Winsorization methods, reductions in estimates were generally greater for patients with more advanced-stage cancer because these cases were more likely to be identified as cost outliers.
The regression estimates for stage at diagnosis were consistently higher for the two methods that identified influential costs (DFBETA and Cook’s distance) than for the full cohort. Thus, it is possible that the DFBETA and Cook’s distance methods identified many cases with low or middle costs that were influential in addition to some influential high-cost records, which would increase the regression estimates and gradually increase estimated costs from lower-stage to higher-stage colon cancer.
The method that combined the DFBETA threshold of 0.15 and qualified boxplot outliers produced regression estimates that were very similar to those of the full cohort. A possible explanation might be that, at only 13, the number and value of influential observations we identified using the combined criteria was too small to induce a large change in the model. This method is robust as it uses a combination of outlying and influential criteria and yields results that are consistent with the regression estimates for the full cohort.
We observed that the CI widths were consistently narrowest for the DFBETA method with a threshold of 0.03. Even though this method identified the largest number of influential observations, the CI widths were almost half those of the full cohort, indicating that this method produces the greatest improvement in precision. In contrast, the widths were greatest for the method that combined a DFBETA threshold of 0.15 and qualified boxplot outliers. This method identified the smallest number of influential observations, and although it was robust in its targeting of outlying and influential observations, precision was the lowest of all methods.
The similarity of the regression estimates comparing the GLM models using the gamma family to the Poisson and inverse Gaussian families (results not shown) suggested robustness of GLMs in addressing skewness in large datasets. Although we observed this similarity in our data, this might not be the case in all circumstances and careful consideration should be given to successfully specifying the variance function (Manning and Mullahy 2001; Mihaylova et al. 2011).
This study showed that although each of the methods we used identified different numbers of cases as outliers and/or influential observations, these methods produced generally similar overall average costs and average costs by stage at diagnosis and colectomy receipt. Furthermore, the ERRs of the key cost drivers produced from the GLM modeling were quantitatively and qualitatively similar and of comparable magnitude. However, our postmodeling predictions of average costs for stage at diagnosis and colectomy receipt varied slightly depending on the method we used.
This study compared the effects of using alternative approaches to identifying outlying and influential observations on costs of colon cancer episodes of care. Understanding how estimates could change with each approach is important in determining whether to use a particular method. The rule-of-thumb cutoff values we used to identify observations as outlying or influential are, to some extent, arbitrary, and our findings might have been different if we had used different cutoff values. These remedial measures for handling outliers and influential observations should be employed if omitting cases leads to major changes in the inferences drawn from the fitted model (Kutner et al. 2004).
Conclusions
Although we do not recommend any single method for all analyses, we believe, based on the results of this study, that the method of choice can depend on the analytic purpose. If the purpose is to control only for influential observations, then the method of choice is the DFBETA method with a threshold of 0.03, because it produced estimates of similar magnitude to those produced using the full cohort while demonstrating the greatest improvement in precision, with consistently the shortest CI widths. If the purpose is to simultaneously control for outliers and influential observations, then the method of choice is the one that combines a DFBETA threshold of 0.15 with qualified boxplot outliers, because this method targets observations that are skewed to the right and have a substantial influence on the parameter estimate. This method produced the average colon cancer episode cost closest to that of the full cohort and similar regression estimates, but did so at the expense of precision. The analysis of skewed data should always consider different options for handling outlying and influential cases. Although the conditional methods of choice were applied to cost data in this case, the methods could be appropriate for other right-skewed data as well. The analyst should select an approach for handling outliers and influential observations based on the specific data structure and subject matter knowledge.
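The two conditional choices described above can be sketched as simple flagging rules. This is an illustration under stated assumptions, not the authors' implementation: the DFBETA values are taken as already computed from a fitted model, the boxplot rule is assumed to be the upper fence Q3 + 1.5 × IQR, absolute DFBETA values are compared with the threshold, and the cost and DFBETA values are hypothetical.

```python
import statistics


def upper_fence(costs, k=1.5):
    """Upper boxplot fence Q3 + k * IQR (k = 1.5 is an assumed convention)."""
    q1, _median, q3 = statistics.quantiles(costs, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def flag_influential_only(dfbetas, threshold=0.03):
    """Flag observations by DFBETA alone (purpose: influential observations only)."""
    return [abs(d) > threshold for d in dfbetas]


def flag_outlying_and_influential(costs, dfbetas, threshold=0.15, k=1.5):
    """Flag observations that are both right-tail boxplot outliers and influential."""
    fence = upper_fence(costs, k)
    return [c > fence and abs(d) > threshold for c, d in zip(costs, dfbetas)]


# Hypothetical episode costs (in thousands of dollars) and DFBETA values.
costs = [10, 12, 15, 18, 20, 22, 25, 30, 35, 400]
dfbetas = [0.01, -0.02, 0.04, 0.00, 0.01, -0.05, 0.02, 0.01, 0.20, 0.31]

influential_only = flag_influential_only(dfbetas)
combined = flag_outlying_and_influential(costs, dfbetas)
```

In this toy example the DFBETA-only rule flags four observations, whereas the combined rule flags only the single extreme-cost case, mirroring the narrower targeting of the combined method.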
Abbreviations
VA: Department of Veterans Affairs
GLM: Generalized linear model
SEER: Surveillance, Epidemiology, and End Results
IRB: Institutional review board
PBM: Pharmacy Benefits Management
HERC: Health Economic Resource Center
Q1: 25th percentile (lower quartile)
Q3: 75th percentile (upper quartile)
IQR: Interquartile range
ERR: Expense rate ratio
CI: Confidence interval
SD: Standard deviation
BP: Boxplot
References
Barber J, Thompson S: Multiple regression of cost data: use of generalised linear models. J Health Serv Res Policy 2004, 9(4):197–204. 10.1258/1355819042250249
Barnett V, Lewis T: Outliers in Statistical Data. 3rd edition. Chichester, England: John Wiley & Sons; 1994.
Basu A, Polsky D, Manning WG: Estimating treatment effects on healthcare costs under exogeneity: is there a "magic bullet"? Health Serv Outcomes Res Methodol 2011, 11: 1–26. 10.1007/s10742-011-0072-8
Belsley DA, Kuh E, Welsch RE: Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley; 1980.
Blough DK, Ramsey SD: Using generalized linear models to assess medical care costs. Health Serv Outcomes Res Methodol 2000, 1(2):185–202. 10.1023/a:1012597123667
Bureau of Labor Statistics, U.S. Department of Labor: Consumer Price Index. 2012. http://www.bls.gov/cpi/. Accessed March 5, 2012.
Charlson ME, Pompei P, Ales KL, MacKenzie CR: A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis 1987, 40: 373–383. 10.1016/0021-9681(87)90171-8
Choi SW: The effect of outliers on regression analysis: regime type and foreign direct investment. Q J Political Sci 2009, 4: 153–165. 10.1561/100.00008021
Cohen SB, Yu W: Statistical Brief #354: The Concentration and Persistence in the Level of Health Expenditures over Time: Estimates for the U.S. Population, 2008–2009, vol March 29, 2012. Rockville, MD: Agency for Healthcare Research and Quality; 2012.
Cots F, Elvira D, Castells X, Saez M: Relevance of outlier cases in case mix systems and evaluation of trimming methods. Health Care Manag Sci 2003, 6: 27–35. 10.1023/A:1021908220013
Fox J: Regression Diagnostics. Newbury Park, CA: Sage Publications; 1991.
Gregori D, Petrinco M, Barbati G, Bo S, Desideri A, Zanetti R, Merletti F, Pagano E: Extreme regression models for characterizing high-cost patients. J Eval Clin Pract 2009, 15(1):164–171. 10.1111/j.1365-2753.2008.00976.x
Hynes DM, Tarlov E, Durazo-Arvizu R, Perrin R, Zhang Q, Weichle T, Ferreira MR, Lee T, Benson AB, Bhoopalam N, Bennett CL: Surgery and adjuvant chemotherapy use among veterans with colon cancer: insights from a California study. J Clin Oncol 2010, 28(15):2571–2576. 10.1200/jco.2009.23.5200
Indurkhya A, Gardiner JC, Luo Z: The effect of outliers on confidence interval procedures for cost-effectiveness ratios. Stat Med 2001, 20(9–10):1469–1477.
Klabunde CN, Potosky AL, Legler JM, Warren JL: Development of a comorbidity index using physician claims data. J Clin Epidemiol 2000, 44: 921–928.
Klabunde CN, Harlan LC, Warren JL: Data sources for measuring comorbidity: a comparison of hospital records and Medicare claims for cancer patients. Med Care 2006, 44(10):921–928. 10.1097/01.mlr.0000223480.52713.b9
Kutner MH, Nachtsheim CJ, Neter J, Li W: Applied Linear Statistical Models. 5th edition. Chicago, Illinois: McGraw-Hill/Richard D. Irwin, Inc.; 2004.
Manning WG, Mullahy J: Estimating log models: to transform or not to transform? J Health Econ 2001, 20(4):461–494. 10.1016/S0167-6296(01)00086-8
Mihaylova B, Briggs A, O'Hagan A, Thompson SG: Review of statistical methods for analysing healthcare resources and costs. Health Econ 2011, 20(8):897–916. 10.1002/hec.1653
Mullahy J: Much ado about two: reconsidering retransformation and the two-part model in health econometrics. J Health Econ 1998, 17: 247–281. 10.1016/S0167-6296(98)00030-7
Paddock SM, Wynn BO, Carter GM, Buntin MB: Identifying and accommodating statistical outliers when setting prospective payment rates for inpatient rehabilitation facilities. Health Serv Res 2004, 39(6 Pt 1):1859–1879. 10.1111/j.1475-6773.2004.00322.x
Phibbs CS, Bhandari A, Yu W, Barnett PG: Estimating the costs of VA ambulatory care. Med Care Res Rev 2003, 60(3 Suppl):54S–73S.
Pirson M, Dramaix M, Leclercq P, Jackson T: Analysis of cost outliers within APR-DRGs in a Belgian general hospital: two complementary approaches. Health Policy (Amsterdam) 2006, 76(1):13–25. 10.1016/j.healthpol.2005.04.008
Romano PS, Roos LL, Jollis JG: Adapting a clinical comorbidity index for use with ICD-9-CM administrative data: differing perspectives. J Clin Epidemiol 1993, 46(10):1075–1090. 10.1016/0895-4356(93)90103-8
Rothman KJ, Greenland S, Lash TL: Modern Epidemiology. Philadelphia, Pennsylvania: Lippincott, Williams & Wilkins; 2008.
Tarlov E, Lee T, Weichle T, Durazo-Arvizu R, Zhang Q, Perrin R, Bentrem DJ, Hynes DM: Reduced overall and event-free survival among colon cancer patients using dual system care. Cancer Epidemiol Biomarkers Prev 2012, 21(12):2231–2241. 10.1158/1055-9965.EPI-12-0548
Thomas JW, Ward K: Economic profiling of physician specialists: use of outlier treatment and episode attribution rules. Inquiry 2006, 43(3):271–282.
Tukey JW: The future of data analysis. Ann Math Stat 1962, 33(1):1–67.
Wagner TH, Chen S, Barnett PG: Using average cost methods to estimate encounter-level costs for medical-surgical stays in the VA. Med Care Res Rev 2003, 60(3 Suppl):15S–36S.
Acknowledgements
This work was supported in part by funding from the Department of Veterans Affairs, Veterans Health Administration, Health Services Research and Development Service (Project Number IIR 03–196). Support for VA/CMS data was provided by the Department of Veterans Affairs, VA Health Services Research and Development Service, VA Information Resource Center (Project Numbers SDR 02–237 and 98–004). The collection of the California cancer incidence data used in this study was supported by the California Department of Public Health and by the Northern California Cancer Center under contract from the National Cancer Institute's Surveillance, Epidemiology and End Results Program. Dr. Hynes is also supported by a Department of Veterans Affairs Research Career Scientist Award (RCS 98–352). The University of Illinois at Chicago provided financial support through the Research Open Access Publishing (ROAAP) Fund for the open access publishing fee. The opinions expressed in this paper are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs or other institutions. The authors would like to acknowledge other members of the research project team (Todd Lee, Ruth Perrin) and clinical advisory committee (Al B. Benson, Nirmala Bhoopalam, Vivien W. Chen, Marc T. Goodman, M. Rosario Ferreira, Dawn Provenzale, Beth Virnig, Min-Woong Sohn) for work in original analyses used in our example. The authors would also like to thank Debby Berlyne for editorial assistance in manuscript preparation.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
TW performed the statistical analysis, evaluated and interpreted the data and statistical analysis, and drafted the manuscript. DMH contributed toward the overall study design and critical revision of the manuscript. RD contributed toward the evaluation and interpretation of the data and statistical analysis and critical revision of the manuscript. ET contributed toward critical revision of the manuscript. QZ carried out data acquisition and data management and assisted with data analysis. All authors read and approved the final manuscript.
Keywords
 Health care costs
 Outliers
 Influential observations
 Episode of care
 Colon cancer