Interim analysis for binary outcome trials with a long fixed follow-up time and repeated outcome assessments at pre-specified times
© Parpia et al.; licensee Springer. 2014
Received: 2 April 2014
Accepted: 20 June 2014
Published: 26 June 2014
In trials with binary outcomes that are assessed repeatedly at pre-specified times, and where the subject is considered to have experienced a failure at the first occurrence of the outcome, interim analyses are generally performed after half or more of the subjects have completed follow-up. Depending on the duration of accrual relative to the length of follow-up, this may be inefficient, since the trial may have completed accrual prior to the interim analysis. An alternative is to plan the interim analysis after subjects have completed follow-up to a time that is less than the fixed full follow-up duration. Using simulations, we evaluated three methods to estimate the event proportion for the interim analysis in terms of type I and II errors and the probability of early stopping. We considered: 1) estimation of the event proportion based on subjects who have been followed for a pre-specified time (less than the full follow-up duration) or who experienced the outcome; 2) estimation of the event proportion based on data from all subjects that have been randomized by the time of the interim analysis; and 3) the Kaplan-Meier approach to estimate the event proportion at the time of the interim analysis. Our results show that all three methods preserve the type I and II errors, at comparable levels, in certain scenarios. In these cases, we recommend using the Kaplan-Meier method because it incorporates all the available data and has a greater probability of early stopping when a treatment effect exists.
Interim analyses that permit early stopping of a randomized controlled trial (RCT) for extremely positive results or for futility are included in the design for ethical and economic reasons. Strategies have been developed for interim analyses such that the overall type I error of the entire trial is preserved at a fixed level (Haybittle 1971; O'Brien and Fleming 1979; Peto et al. 1976; Pocock 1977).
Often, the primary outcome is whether or not a subject experienced an event over a fixed period of time T. In some trials, the outcome is assessed repeatedly at pre-specified times during follow-up, and the subject is considered a failure if the event occurs at any time. For example, in a cardiovascular RCT investigating the effect of an intervention for preventing post-thrombotic syndrome, subjects can be assessed every 6 months for up to 24 months using a disease-specific questionnaire (Enden et al. 2012; Vedantham et al. 2013). A failure has occurred if the questionnaire score exceeds a pre-specified threshold. Another example would be a breast cancer radiotherapy RCT where adverse cosmesis (i.e. a dichotomy), assessed at 1, 3 and 5 years post-randomization, would be the primary safety outcome and the focus of the interim analysis.
Interim analyses are generally performed after half or more of the subjects have completed follow-up (Pedley 2011). Depending on the duration of accrual relative to the length of follow-up, this strategy may be inefficient because it is possible that accrual will have been completed and patients will have finished treatment prior to the interim analysis. If, however, the interim analysis was done earlier and a statistically significant effect was found, the trial may be stopped, and all future subjects would receive the experimental therapy.
In this situation, one alternative is to plan an interim analysis after a smaller percentage of subjects have completed full follow-up. However, there is a low probability of terminating the trial early when the interim analysis is based on so little information, and, therefore, such an analysis would unnecessarily spend alpha (Togo and Iwasaki 2013). A second alternative is to plan the interim analysis after half or more of the subjects have completed a specified portion of the follow-up R, where R < T, and T is the fixed full follow-up duration for each subject.
Several researchers have studied methods that combine data from subjects who have completed full follow-up with those who have been followed for duration R in situations where the outcome is reversible (Marschner and Becker 2001; Sooriyarachchi et al. 2006; Whitehead et al. 2008). In our research, however, the situation is different in that the outcome can be ascertained at any of the pre-specified visits during follow-up and is irreversible.
In this paper, we consider 3 methods of estimating the interim event proportion (risk) for each treatment group in an RCT for an interim analysis: 1) estimated event proportion based only on subjects who have been followed for at least duration R or who had an outcome event; 2) the event proportion based on data from subjects that have been randomized by the time of the interim analysis, and 3) the Kaplan-Meier approach to estimate the event proportion. We investigate the effect of each method on the type I and II errors and the probability of early stopping through computer simulation of various trial scenarios.
The treatment groups are compared using a normally distributed test statistic based on the observed event proportions in the standard and experimental groups, where n0 and n1 are the group sample sizes, and we are testing the one-sided hypotheses H0: π1 ≥ π0 versus H1: π1 < π0. Furthermore, we assume 90% power, an alpha of 0.025 and a 1:1 randomization. Since the normal distribution is symmetric, the p-value for a one-sided test is equivalent to half of the two-sided p-value.
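As an illustration of the one-sided comparison of two proportions described here, the following is a minimal sketch. It assumes an unpooled standard error, which is one common choice and not necessarily the form used in this paper; the function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_z_test(events0, n0, events1, n1):
    """One-sided z-test of H0: pi1 >= pi0 against H1: pi1 < pi0.

    Assumption: unpooled (Wald-type) standard error; a pooled version
    would be an equally plausible choice.
    """
    p0 = events0 / n0  # observed proportion, standard group
    p1 = events1 / n1  # observed proportion, experimental group
    se = sqrt(p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1)
    if se == 0:
        raise ValueError("standard error is zero; test statistic undefined")
    z = (p0 - p1) / se  # large positive z favours the experimental group
    p_value = 1 - NormalDist().cdf(z)  # upper tail only (one-sided)
    return z, p_value


# Hypothetical counts: 30/100 events on standard, 10/100 on experimental.
z, p = one_sided_z_test(events0=30, n0=100, events1=10, n1=100)
print(f"z = {z:.2f}, one-sided p = {p:.5f}")
```

Because only the upper tail is used, the reported p-value is half of the corresponding two-sided p-value whenever z is positive.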
Notation table for estimation of event proportions
Visit number j | Visit time t_j | Subjects at risk m_j | New events e_j | Incidence at visit j, d_j
0 | t0 (<6 m) | m0 | e0 = 0 | d0 = 0
1 | t1 (6 m) | m1 | e1 | d1 = e1/m1
2 | t2 (12 m) | m2 | e2 | d2 = e2/m2
3 | t3 (18 m) | m3 | e3 | d3 = e3/m3
4 | t4 (24 m) | m4 | e4 | d4 = e4/m4
Method 1: event proportion based on subjects followed for at least duration R or who had an event
The individuals who have experienced an event but have not completed duration R of follow-up are included in the numerator and the denominator.
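The inclusion rule for method 1 can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes each subject is summarized by the months of follow-up accrued at the interim and the month of the visit at which the event was first observed, and that every event observed by the interim counts. The data layout, function name and toy data are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# (months of follow-up at the interim, month of first observed event or None)
Subject = Tuple[float, Optional[float]]


def method1_proportion(subjects: Sequence[Subject], r_months: float) -> float:
    """Events / subjects, restricted to subjects followed for at least
    r_months or who have already had an observed event."""
    events = 0
    included = 0
    for followup, event_month in subjects:
        had_event = event_month is not None and event_month <= followup
        if had_event or followup >= r_months:
            included += 1          # counted in the denominator
            events += had_event    # counted in the numerator if an event
    if included == 0:
        raise ValueError("no subject meets the inclusion rule")
    return events / included


# Toy data for one treatment group, with R = 12 months.
toy = [(24, 6), (18, None), (12, 12), (12, None), (6, 6), (6, None), (3, None)]
print(method1_proportion(toy, r_months=12))
```

In the toy data, the subject with an event at 6 months and only 6 months of follow-up is included, while the two event-free subjects with less than 12 months of follow-up are not.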
Method 2: event proportion based on data from subjects that have been randomized by the time of the interim analysis
Under this method, the event proportion is simply the total number of observed events divided by the number of subjects randomized by time τ1, the time of the interim analysis.
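A minimal sketch of this calculation is given below, assuming the same hypothetical per-subject layout of (months of follow-up at the interim, month of first observed event or None). The function name and toy data are illustrative only.

```python
from typing import Optional, Sequence, Tuple

# (months of follow-up at the interim, month of first observed event or None)
Subject = Tuple[float, Optional[float]]


def method2_proportion(subjects: Sequence[Subject]) -> float:
    """Observed events divided by all subjects randomized by the interim,
    including those who have not yet reached their first assessment."""
    if not subjects:
        raise ValueError("no subjects randomized")
    events = sum(
        1
        for followup, event_month in subjects
        if event_month is not None and event_month <= followup
    )
    return events / len(subjects)


# Toy data for one treatment group.
toy = [(24, 6), (18, None), (12, 12), (12, None), (6, 6), (6, None), (3, None)]
print(method2_proportion(toy))
```

Every randomized subject enters the denominator, so recently randomized subjects with no assessment yet pull the estimate towards zero.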
Method 3: Kaplan-Meier approach
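In the notation of the table above, a Kaplan-Meier type estimate of the event proportion multiplies the visit-specific probabilities of remaining event-free, 1 − d_j, and subtracts the product from 1. The sketch below assumes this standard product-limit form and leaves the choice of which visits to include to the caller; the function name and the counts are hypothetical.

```python
from typing import Sequence


def km_event_proportion(at_risk: Sequence[int], new_events: Sequence[int]) -> float:
    """1 - prod_j (1 - e_j / m_j) over the supplied assessment visits.

    at_risk[j]    : m_j, subjects at risk at visit j
    new_events[j] : e_j, new events first observed at visit j
    """
    event_free = 1.0
    for m_j, e_j in zip(at_risk, new_events):
        if m_j <= 0:
            break  # nobody at risk: no information beyond this visit
        if not 0 <= e_j <= m_j:
            raise ValueError("new events must lie between 0 and subjects at risk")
        d_j = e_j / m_j            # incidence at visit j
        event_free *= 1 - d_j      # conditional probability of no event
    return 1 - event_free


# Hypothetical counts at the 6, 12, 18 and 24 month visits.
m = [100, 80, 50, 30]
e = [10, 8, 5, 3]
print(km_event_proportion(m, e))          # all four visits
print(km_event_proportion(m[:2], e[:2]))  # first two visits only
```

Subjects who have not yet reached a visit simply do not appear in that visit's m_j, which is how censoring is handled without discarding their earlier assessments.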
We evaluated these methods in terms of overall type I and II errors and the probability of early stopping of the trial for a positive result at the interim. The interim analysis was performed using the Haybittle-Peto (Haybittle 1971; Peto et al. 1976) and O’Brien-Fleming (O'Brien and Fleming 1979) monitoring boundaries for extreme positive results. These boundaries are conservative and require small p-values for early stopping of the trial. Other less conservative boundaries such as the Pocock approach were not evaluated (Freidlin and Korn 2009; Pocock 2005).
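To illustrate how such boundaries control the overall type I error with a single interim look, the following sketch simulates the interim and final z-statistics under the null hypothesis. The critical values are assumptions taken from commonly tabulated designs (an interim one-sided p of about 0.001 for Haybittle-Peto, and 2.797 and 1.977 for a two-look O'Brien-Fleming design); they are not stated in this paper, and the function name is hypothetical.

```python
import random
from math import sqrt


def overall_type1_error(interim_z, final_z, info_fraction=0.5,
                        n_sims=100_000, seed=1):
    """Monte Carlo estimate of P(reject at interim or final) under H0.

    The interim and final z-statistics are standard normal with
    correlation sqrt(info_fraction).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        z1 = rng.gauss(0, 1)  # interim statistic
        increment = rng.gauss(0, 1)  # independent later information
        z2 = sqrt(info_fraction) * z1 + sqrt(1 - info_fraction) * increment
        if z1 >= interim_z or z2 >= final_z:
            rejections += 1
    return rejections / n_sims


# Assumed critical values (one-sided, information fraction 0.5).
hp_error = overall_type1_error(interim_z=3.09, final_z=1.96)
obf_error = overall_type1_error(interim_z=2.797, final_z=1.977)
print(f"Haybittle-Peto: {hp_error:.4f}, O'Brien-Fleming: {obf_error:.4f}")
```

Under these assumed values the interim threshold is lower for O'Brien-Fleming than for Haybittle-Peto, so early stopping is somewhat easier with the former.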
Summary of six trials considered for simulation with β = 0.10 and a one-sided α = 0.025
Standard group event proportion (π0) | Experimental group event proportion (π1) | Absolute risk reduction (π0 − π1)
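For orientation, the sketch below computes a fixed-sample size per group for comparing two proportions with β = 0.10 and a one-sided α = 0.025. It uses the standard normal-approximation formula with unpooled variances and no adjustment for the interim analysis; this is an assumption, and the sample sizes actually used in the simulations may differ. The two pairs of proportions are those mentioned in the results.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(pi0, pi1, alpha=0.025, power=0.90):
    """Normal-approximation sample size per group, 1:1 randomization."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = pi0 * (1 - pi0) + pi1 * (1 - pi1)
    return ceil((z_alpha + z_beta) ** 2 * variance / (pi0 - pi1) ** 2)


for pi0, pi1 in [(0.30, 0.10), (0.50, 0.45)]:
    print(pi0, pi1, n_per_group(pi0, pi1))
```

Under this formula, a large absolute risk reduction needs fewer than 100 subjects per group, while a reduction of 0.05 from a baseline of 0.5 needs roughly 2,100 per group.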
Summary of the event distribution probabilities for the simulated scenarios
Event distribution probabilities by visit time (t1, t2, t3, t4)
Standard group | Experimental group
0.25, 0.25, 0.25, 0.25 | same as standard
0.35, 0.30, 0.20, 0.15 | same as standard
0.15, 0.20, 0.30, 0.35 | same as standard
0.15, 0.20, 0.30, 0.35 | 0.35, 0.30, 0.20, 0.15
0.35, 0.30, 0.20, 0.15 | 0.15, 0.20, 0.30, 0.35
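One way to generate a subject's outcome from these quantities is sketched below. It assumes that the event distribution probabilities describe when the event is first observed, given that the subject has an event within the full follow-up, and that visits occur at 6, 12, 18 and 24 months. The function name and this two-step sampling scheme are illustrative, not the authors' simulation code.

```python
import random


def simulate_event_visit(pi, timing_probs, visit_months=(6, 12, 18, 24),
                         rng=random):
    """Return the month of the visit at which the event is first observed,
    or None if the subject has no event over the full follow-up."""
    if rng.random() >= pi:
        return None  # no event within the fixed follow-up
    # Given an event, pick the visit according to the timing distribution.
    return rng.choices(visit_months, weights=timing_probs, k=1)[0]


rng = random.Random(1)
outcomes = [simulate_event_visit(0.30, (0.35, 0.30, 0.20, 0.15), rng=rng)
            for _ in range(20_000)]
events = [month for month in outcomes if month is not None]
print(len(events) / len(outcomes))                      # close to 0.30
print(sum(month <= 12 for month in events) / len(events))  # close to 0.65
```

With an early-event distribution such as (0.35, 0.30, 0.20, 0.15), about 65% of all events are observed by the 12 month visit.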
The type I error rates for the three methods are shown graphically in Figure 2. The three methods have comparable type I error rates across each of the trials and event distribution scenarios. The methods in general have nominal or close-to-nominal type I error rates when the event distribution probabilities are equivalent between treatment groups or when the experimental treatment group events occurred earlier in the trial compared with the standard group. However, under these same scenarios, slightly greater-than-nominal type I error rates are seen in the trials where (π0, π1) = (0.30, 0.10) and (π0, π1) = (0.50, 0.45), where the type I error rates are approximately 0.03. For the scenario where the experimental group events occurred later in the trial compared with the standard group, the type I error was generally inflated for all methods.
In RCTs with binary endpoints, interim analyses are generally conducted after a considerable percentage of subjects have completed follow-up. However, under certain situations this approach is not optimal since the trial may have completed accrual and all the subjects will have been treated by that time. We evaluated three approaches for an interim analysis when a considerable percentage of subjects complete a follow-up time that is less than the planned trial follow-up.
We observed that the type I error rates were comparable for all three methods. For most trials simulated, under the scenarios where the event distributions were equivalent between treatment groups or the experimental group had events occur earlier than the standard group, the type I error rates were close to the nominal value. These results concur with those of Pedley (2011), who showed that conducting the interim analysis after a considerable percentage of subjects had completed full follow-up (using method 2) produced nominal type I error rates, albeit in the situation where events could be measured at any time during follow-up and not just at specific time points. However, we also observed that the type I error rate increased with increasing absolute risk reduction (ARR) for trials with a standard group event proportion of 0.3, thus resulting in slightly higher type I error rates for the trial with an ARR of 0.20. In addition, similar slightly higher type I error rates were seen in the trial with a standard group event proportion of 0.5 and an ARR of 0.05. This is perhaps due to a combination of less variability and a small sample size for the former, and a large sample size and small ARR for the latter. Therefore, trialists should be cautious about using any of these methods in these situations.
While there were situations in which the type I errors were slightly inflated with all methods, the methods performed much better with regard to the type II errors under all scenarios, suggesting that these methods will not have a negative effect on the power to detect the hypothesized difference between treatment groups provided the difference exists. Under the scenarios where the experimental group had events occur later compared with the standard group, the methods showed increased overall power because the probability of early stopping was greater in these scenarios. However, under these scenarios, the type I error rates are inflated.
The methods differed in the probability of early stopping under the alternative hypothesis, with method 2 having the lowest probability. This is because method 2 includes all subjects that have been randomized by the time of the interim analysis in the denominator of the estimated event proportion, even though a subgroup of these patients would not have had any assessment of the outcome since they would not have reached their first time point for outcome assessment. The consequence is dilution of the interim treatment effect, leading to lower interim power. Method 3 also uses all available data from randomized subjects at the time of the interim analysis. However, it employs a conditional probability approach which differentiates between those subjects who have not yet had an assessment visit (i.e. censored) and those who are at risk at each assessment visit, thus yielding a greater probability of early stopping. Similarly, since method 1 uses only a subset of randomized subjects at the time of the interim analysis, the estimated interim treatment effect is less diluted and, therefore, method 1 has a greater probability of early stopping than method 2. However, since method 1 uses a smaller number of subjects compared with method 3, its probability of early stopping is slightly lower than that of method 3 in trials where the standard group event proportion is 0.5, because the variability is greater for proportions closer to 0.5. Furthermore, we observed that the probabilities of early stopping are greater using the O’Brien-Fleming boundary compared with the Haybittle-Peto boundary, since the former is less conservative.
Although the largest probabilities of early stopping under the alternative hypothesis and the smallest type II errors were seen under the scenario where the experimental group had events occurring later compared with the standard group, the type I error is greatly inflated and, therefore, none of the methods can be recommended in this situation. Since there is a delay in occurrence of the event in the experimental group, this may be perceived as an effect of treatment. However, in situations where investigators are interested in the occurrence of an event over a fixed time period, this scenario, although rare, would still be considered under the null hypothesis.
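A back-of-envelope calculation shows why delayed events in the experimental group inflate the type I error at an early look. The sketch assumes the null hypothesis with π0 = π1 = 0.30, that the event distribution probabilities give the timing of the event among subjects who have one, and, as a simplification, that the interim only sees the first two visits (12 months). The function name is hypothetical.

```python
def expected_proportion_by(pi, timing_probs, visits_completed):
    """Expected cumulative event proportion after the given number of visits."""
    return pi * sum(timing_probs[:visits_completed])


# Null hypothesis: same 24 month event proportion in both groups.
pi_null = 0.30
standard = expected_proportion_by(pi_null, (0.35, 0.30, 0.20, 0.15), 2)
experimental = expected_proportion_by(pi_null, (0.15, 0.20, 0.30, 0.35), 2)
print(f"standard: {standard:.3f}, experimental: {experimental:.3f}")
```

Under these assumptions the expected 12 month proportions are 0.195 and 0.105, an apparent benefit of the experimental treatment at the interim even though the 24 month event proportions are identical.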
Our study had some limitations. First, the generalizability of our findings may be limited since we evaluated six trial scenarios with particular event distributions over time. In diseases where the event distributions over time differ from the ones evaluated in this research, further simulations would be required to evaluate these methods. Second, we evaluated trials with one interim analysis after 50% of the subjects completed 12 months of follow-up using the O’Brien-Fleming or Haybittle-Peto approach. These findings may not be applicable to trials in which interim analyses are required at multiple times or when using the alpha spending function approach to monitor the trial. Finally, the biases of the interim event proportions and treatment effects were not evaluated, primarily because it is well known that estimators at the interim are biased, especially estimators that allow for early stopping for positive results. However, further investigation of the estimators is needed.
Nonetheless, we have shown that under certain scenarios, conducting an interim analysis when a considerable number of subjects have some follow-up data, using any of the methods, preserves the type I and II errors. Although all three methods preserve type I and II errors under these scenarios, we recommend using the Kaplan-Meier method because it incorporates all the available data and has a greater probability of early stopping when a treatment effect exists. We have also shown that under certain scenarios, none of these methods is suitable for an interim analysis, and trialists should be cautious when using them. Finally, when possible, an interim analysis should be undertaken when data from a considerable number of subjects who have completed full follow-up are available. However, if waiting for a considerable number of subjects to complete full follow-up is not an efficient approach, such as in the examples described, the methods outlined in this paper should be considered and evaluated to fit the specific needs of the trial.
SP, JAJ, CG, LT and MNL conceived the study. SP conducted literature review, designed and implemented the simulation, and wrote the initial draft of the manuscript. All authors reviewed and revised the draft version of the manuscript. All authors read and approved the final version of the manuscript.
This research was funded in part by funds from the CANNeCTIN Program.
- Enden T, Haig Y, Klow NE, Slagsvold CE, Sandvik L, Ghanima W, Hafsahl G, Holme PA, Holmen LO, Njaastad AM, Sandbaek G, Sandset PM, CaVenT Study Group: Long-term outcome after additional catheter-directed thrombolysis versus standard treatment for acute iliofemoral deep vein thrombosis (the CaVenT study): a randomised controlled trial. Lancet 2012, 379(9810):31-38. 10.1016/S0140-6736(11)61753-4
- Freidlin B, Korn EL: Stopping clinical trials early for benefit: impact on estimation. Clin Trials 2009, 6(2):119-125. 10.1177/1740774509102310
- Haybittle JL: Repeated assessment of results in clinical trials of cancer treatment. Br J Radiol 1971, 44(526):793-797. 10.1259/0007-1285-44-526-793
- Marschner IC, Becker SL: Interim monitoring of clinical trials based on long-term binary endpoints. Stat Med 2001, 20(2):177-192. 10.1002/1097-0258(20010130)20:2<177::AID-SIM653>3.0.CO;2-K
- O'Brien P, Fleming T: A multiple testing procedure for clinical trials. Biometrics 1979, 35: 549-556. 10.2307/2530245
- Pedley A: Applying survival analysis techniques to interim analysis and sample size reassessment of clinical trials with dichotomous endpoint. ProQuest, UMI Dissertations Publishing, Dissertation, Boston University; 2011.
- Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J, Smith PG: Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 1976, 34(6):585-612. 10.1038/bjc.1976.220
- Pocock S: Group sequential methods in the design and analysis of clinical trials. Biometrika 1977, 64: 191-199. 10.1093/biomet/64.2.191
- Pocock SJ: When (not) to stop a clinical trial for benefit. JAMA 2005, 294(17):2228-2230. 10.1001/jama.294.17.2228
- Sooriyarachchi MR, Whitehead J, Whitehead A, Bolland K: The sequential analysis of repeated binary responses: a score test for the case of three time points. Stat Med 2006, 25(12):2196-2214.
- Togo K, Iwasaki M: Optimal timing for interim analyses in clinical trials. J Biopharm Stat 2013, 23(5):1067-1080. 10.1080/10543406.2013.813522
- Vedantham S, Goldhaber SZ, Kahn SR, Julian J, Magnuson E, Jaff MR, Murphy TP, Cohen DJ, Comerota AJ, Gornik HL, Razavi MK, Lewis L, Kearon C: Rationale and design of the ATTRACT Study: a multicenter randomized trial to evaluate pharmacomechanical catheter-directed thrombolysis for the prevention of postthrombotic syndrome in patients with proximal deep vein thrombosis. Am Heart J 2013, 165(4):530. e3
- Whitehead A, Sooriyarachchi MR, Whitehead J, Bolland K: Incorporating intermediate binary responses into interim analyses of clinical trials: a comparison of four methods. Stat Med 2008, 27(10):1646-1666. 10.1002/sim.3046
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.