Bayesian test for hazard ratio in survival analysis

Over the decades, testing the equivalence of hazard functions has received wide attention in survival analysis. In this paper, we propose a Bayesian test to address this equivalence testing problem. Most of all, the proposed test is methodologically flexible, so that no procedure for determining weights is required when the proportional hazards assumption is violated. In comparison with popularly used methods, the proposed test is shown to be more powerful and robust in testing differences between hazard functions, even in the presence of crossing hazard functions. Extensive applications to simulated and real data were conducted, demonstrating that the proposed test delivers outstanding performance and holds desirable numerical properties.

In practice, we often encounter a censoring random variable C, and observe X = min(T , C).
The hazard function of T is defined as λ(t) = lim_{δ↓0} P(t + δ > T ≥ t | T ≥ t)/δ. (1) If we have separate groups and our main interest lies in testing differences between hazard functions, we need to address testing the equality of the hazard functions. To this end, Mantel (1966) proposed the log rank test, and many analogous methods motivated by the log rank test (e.g., the weighted log rank tests) were studied by Gehan (1965), Peto and Peto (1972), and Prentice (1978). The log rank test commonly suffers low power when the ratio of the hazard functions varies over time. For this reason, the weighted log rank tests were developed to overcome this limitation, and various theoretical properties of these tests were established in Gill (1980), Harrington et al. (1982), Fleming and Harrington (2005), and Andersen et al. (1993), relying on martingale theory. More importantly, it was shown that these tests are consistent, and the test proposed in Harrington et al. (1982) proved to be the locally most powerful rank test within a specific class of survival functions. However, the power of the aforementioned tests may vary depending on the shape of the hazard functions. The Renyi test, motivated by Rényi (1953), has also been widely used in practice; it requires weights similar to the weighted log rank tests.
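As a concrete reference point, the two-sample log rank statistic mentioned above can be computed directly from its risk-set definition. The following is a minimal NumPy sketch (the function name and interface are ours, not from the paper):

```python
import numpy as np

def logrank_stat(time, event, group):
    """Two-sample log rank statistic (approximately chi-squared with
    1 df under equal hazards)."""
    time, event, group = map(np.asarray, (time, event, group))
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t                        # risk set at t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()        # at risk in group 1
        d = ((time == t) & (event == 1)).sum()     # events at t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        obs_minus_exp += d1 - d * n1 / n           # observed - expected
        if n > 1:                                  # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var
```

Weighted log rank variants differ only in multiplying each per-time term by a weight w(t), which is exactly the adaptive choice the proposed Bayesian test avoids.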
In this paper, we primarily focus on testing the equivalence of hazard functions through Cox's proportional hazards model (Cox 1972), λ(t|z) = λ_0(t) exp(βz), where β ∈ R and z is a covariate. If we perform a test of M_0: β = 0 against M_1: β ≠ 0, where z is an indicator variable for each group (0 = control group, 1 = treatment group), this is equivalent to testing the equivalence of the hazard functions against λ_1(t)/λ_0(t) = c ≠ 1 for all t. Thus this test may lose power when λ_1(t)/λ_0(t) is a time-varying function, especially in cases such as λ_1(t)/λ_0(t) = exp(t − 1/2) on [0, 1], i.e., crossing hazards. If we instead consider a time-varying Cox model incorporating a time-varying coefficient, λ(t|z) = λ_0(t) exp(β(t)z), and test β(·) ≡ 0, then we can construct a test that works well in spite of crossing hazards. In the frequentist approach, Hess (1994) and Verweij and van Houwelingen (1995) studied time-varying coefficient models in Cox regression and provided estimation methodology; in particular, Verweij and van Houwelingen (1995) proposed a test procedure using B-spline basis functions. Yang and Prentice (2005) proposed an advanced semiparametric model that includes both the proportional hazards model and the proportional odds model, together with a test procedure for detecting crossing hazards. The Yang and Prentice test has no adaptive step such as selecting weights, and shows efficient performance. Recently, Chauvel and O'Quigley (2014) studied a test based on Cox regression with time-varying coefficients; they used a stochastic integral and its limiting distribution to test β(·) ≡ 0.
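To make the crossing-hazards case concrete, the following sketch (our illustration, not code from the paper) evaluates the hazard ratio exp(β(t)) induced by a time-varying coefficient β(t) = t − 1/2 on [0, 1]:

```python
import numpy as np

# Under lambda(t|z) = lambda0(t) * exp(beta(t) * z) with z in {0, 1},
# the hazard ratio between the two groups is exp(beta(t)).
def hazard_ratio(t, beta=lambda s: s - 0.5):   # beta(t) = t - 1/2 on [0, 1]
    return np.exp(beta(np.asarray(t, float)))

hr = hazard_ratio([0.1, 0.5, 0.9])
# hr is below 1 before t = 1/2, equal to 1 at t = 1/2, and above 1 after:
# the hazards cross, so no constant-beta Cox model can represent this ratio.
```

Any single constant β forces the ratio to stay on one side of 1 for all t, which is why the constant-coefficient test loses power here.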
When it comes to testing the equivalence of hazards, including crossing hazards, Bayesian approaches have scarcely been utilized. Although Kalbfleisch (1978), Hjort (1990), and Kim (2006) turned to Bayesian methodology for estimation of hazard or survival functions, and Kim et al. (2011) proposed a Bayesian test for monotone hazards, to the best of our knowledge only a few studies address testing the equivalence of hazards, including crossing hazards.
In the Bayesian context, the testing problem is equivalent to model selection using the posterior probabilities of M_0 and M_1, where the coefficient under M_1 lies in a function class F such as a Sobolev space. Bayesian asymptotic theory shows that, if data are randomly sampled, the Bayesian test is consistent when P(M_1 | D_{1:n}) → 1 in probability as n → ∞ under M_1, and P(M_1 | D_{1:n}) → 0 in probability as n → ∞ under M_0. A Bayesian test cannot give the typical p value, but the construction of the test procedure is easy and its interpretation is straightforward.
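The model-selection view above can be made explicit: with prior mass q on M_0 and 1 − q on M_1, the posterior probability of M_1 follows from the two marginal likelihoods. A small sketch (our helper function, not from the paper):

```python
import math

def posterior_prob_M1(log_marg_M1, log_marg_M0, q):
    """Posterior probability of M1 from the log marginal likelihoods
    of the two models, with prior mass q on M0 and 1 - q on M1."""
    odds = (1 - q) / q * math.exp(log_marg_M1 - log_marg_M0)
    return odds / (1 + odds)
```

With equal marginal likelihoods and q = 1/2 this returns 1/2; the test is consistent exactly when this probability converges to 1 under M_1 and to 0 under M_0.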
In addition, the theoretical study of Kim (2012) implies consistency of this Bayesian test using only the partial likelihood when we use a prior π(β) under M_1 supported on a function class that is absolutely bounded and spanned by the B-spline basis functions (the prior for β under M_0 is a Dirac measure at 0). Under regularity conditions and prior masses q and 1 − q (0 < q < 1) for the models M_0 and M_1, respectively, Kim (2012) shows that we can take F to be the function class whose derivatives of orders 0 to p (p ∈ N) are absolutely bounded on a compact set in the time line.
In this paper, we construct a Bayesian test based on the results of Kim (2012). We first explain the model, the data, and the test, and then present the priors and posteriors for the Bayesian test. We perform various simulation studies and a real data analysis, and concluding remarks and discussion are presented in the last section.

Model and Bayesian test
Assume that we have data D_{1:n} = {(X_i, δ_i, z_i) : i = 1, …, n}, where X_i = min(T_i, C_i) and δ_i = I(T_i ≤ C_i) for t ∈ [0, ∞); here T_i ~ F_{z_i}, C_i ~ G, and F_{z_i}, G, and I are distribution functions and an indicator function, respectively. (X_i, δ_i) is the random vector of the censored observation time and censoring indicator, and z_i is a group indicator. We also assume that for some 0 < τ < ∞, G(t−) = G(t) on t ∈ [0, τ) and G(τ) = 1. Note that there are no ties in the uncensored failure times X_i, and the observed X_i are bounded by τ.
Since the survival function of T_i given z_i is S(t | z_i) = exp(−∫_0^t λ_0(s) exp(β(s) z_i) ds), we can consider the testing problem M_0: β(·) ≡ 0 against M_1: β ∈ F_{p,M} \ {0}, where F_{p,M} = {β : sup_t |β^{(j)}(t)| ≤ M, j = 0, …, p} and β^{(p)} is the pth (p ∈ N) derivative of β.

Partial likelihood, priors and posteriors for the test
Before describing the test procedure, we write the likelihood L(β, Λ_0) and the partial likelihood L(β) as
L(β, Λ_0; D_{1:n}) = ∏_{i=1}^n [λ_0(X_i) exp(β(X_i) z_i)]^{δ_i} exp(−∫_0^{X_i} λ_0(s) exp(β(s) z_i) ds), (12)
L(β; D_{1:n}) = ∏_{i: δ_i = 1} exp(β(X_i) z_i) / Σ_{k ∈ R(X_i)} exp(β(X_i) z_k), (13)
where R(t) = {k : X_k ≥ t}. The use of (12) requires difficult computations because of Λ_0 and the integration in the exponent; especially in the Bayesian method, priors for Λ_0 can greatly increase the computational burden. On the other hand, (13) is relatively easy to work with. We refer to Andersen and Gill (1982) for the large-sample properties of the partial likelihood. It is therefore attractive to use only the partial likelihood in Bayesian analysis, and Kim (2006, 2012) reported that we can use the partial likelihood and π(β), the prior for β, to obtain Bayes estimators of β, since L(β; D_{1:n})π(β) is proportional to the marginal posterior of β when beta processes are used as priors for Λ_0.
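The partial likelihood (13) with a time-varying coefficient is straightforward to evaluate numerically; below is a minimal sketch (our code; the interface is an assumption):

```python
import numpy as np

def log_partial_lik(beta_fn, time, event, z):
    """Log of the partial likelihood (13): the sum over uncensored i of
    beta(X_i) z_i - log sum_{k in R(X_i)} exp(beta(X_i) z_k),
    with risk set R(t) = {k : X_k >= t}."""
    time, event, z = map(np.asarray, (time, event, z))
    ll = 0.0
    for i in np.flatnonzero(event == 1):
        b = beta_fn(time[i])
        risk = time >= time[i]                     # risk set R(X_i)
        ll += b * z[i] - np.log(np.exp(b * z[risk]).sum())
    return ll

# At beta == 0 every factor reduces to one over the risk-set size,
# which gives a quick sanity check of the implementation.
ll0 = log_partial_lik(lambda t: 0.0, [1, 2, 3], [1, 1, 1], [0, 1, 0])
```

Note that, unlike (12), no baseline hazard appears anywhere, which is exactly why the Bayesian analysis below works with (13) alone.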
For the test, we consider an expansion of β in the B-spline basis functions, β(X_i) = η Σ_{l=1}^{a_n} γ_l B_{d,a_n,l}(X_i), where the B_{d,a_n,l} are B-spline basis functions of degree d with equally spaced knots and η ∈ {0, 1}. See de Boor (2001) and Lyche and Mørken (2008) for details of B-splines; Fig. 1 shows the B-spline basis functions of degrees 1 and 2. We can use d ≥ p − 1 to approximate functions in F_{p,M}. Priors are then placed on η ∈ {0, 1} and the γ_l, and we let a_n = (n/ log n)^{1/(2p+1)} for a consistent Bayesian test (Kim 2012). If we obtain the posterior probability of η = 1, we can test β(·) ≡ 0.
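For readers who want to reproduce the basis, the functions B_{d,a_n,l} with equally spaced knots can be evaluated by the Cox-de Boor recursion. The following is a self-contained sketch (clamped boundary knots are our choice of boundary handling, not specified in the paper):

```python
import numpy as np

def bspline_basis(x, degree, n_basis, lo=0.0, hi=1.0):
    """Evaluate n_basis B-spline basis functions of the given degree with
    equally spaced (clamped) knots on [lo, hi] via Cox-de Boor recursion.
    Returns an array of shape (len(x), n_basis)."""
    n_inner = n_basis - degree - 1
    inner = np.linspace(lo, hi, n_inner + 2)[1:-1] if n_inner > 0 else []
    knots = np.r_[[lo] * (degree + 1), inner, [hi] * (degree + 1)]
    x = np.atleast_1d(np.asarray(x, float))
    # degree-0 basis: indicators of the knot intervals
    B = np.array([(x >= knots[j]) & (x < knots[j + 1])
                  for j in range(len(knots) - 1)], float)
    B[n_basis - 1] += (x == hi)   # make the last real interval right-closed
    for d in range(1, degree + 1):
        Bn = np.zeros((len(knots) - d - 1, len(x)))
        for j in range(len(knots) - d - 1):
            den1 = knots[j + d] - knots[j]
            den2 = knots[j + d + 1] - knots[j + 1]
            if den1 > 0:          # skip degenerate (repeated-knot) terms
                Bn[j] += (x - knots[j]) / den1 * B[j]
            if den2 > 0:
                Bn[j] += (knots[j + d + 1] - x) / den2 * B[j + 1]
        B = Bn
    return B.T
```

The basis is nonnegative and sums to 1 at every point of [lo, hi] (a partition of unity), which is a convenient correctness check.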
For the priors on η and the γ_l, we consider the following: P(η = 0) = q and P(η = 1) = 1 − q, and the γ_l are independently distributed with density proportional to φ(γ_l; 0, σ²) I(|γ_l| ≤ L), where φ(·; 0, σ²) is the probability density function of the normal distribution with mean 0 and variance σ² > 0, and L > 0 is large. Here σ² and L are hyperparameters.
We obtain posteriors using only the partial likelihood (13) instead of the full likelihood (12). If we let P_1(D_{1:n} | {γ_l}_{l=1}^{a_n}) denote the partial likelihood evaluated at β(·) = Σ_{l=1}^{a_n} γ_l B_{d,a_n,l}(·) and P_0(D_{1:n}) the partial likelihood evaluated at β(·) ≡ 0, then we have
π(D_{1:n}, η, {γ_l}_{l=1}^{a_n}, q) ∝ P_1(D_{1:n} | {γ_l}_{l=1}^{a_n})^η P_0(D_{1:n})^{1−η} π(η) π({γ_l}_{l=1}^{a_n})
from the partial likelihood, and the posteriors follow. Here the posterior probability of η = 1 is equivalent to the posterior probability of M_1, where π({γ_l}_{l=1}^{a_n}) is the prior for M_1 (shown in the Appendix). Posteriors can be obtained from the Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970) or rejection sampling. Note that since zβ appears in the exponent, too large a value of it can break down the MCMC (Markov chain Monte Carlo) algorithm; we therefore choose a moderate σ² > 0 with a sufficiently large L > 0.
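As a simplified illustration of the Metropolis-Hastings step (a sketch with a single constant coefficient rather than the full B-spline expansion; all tuning values are assumptions), a random-walk sampler targeting partial likelihood times the truncated normal prior looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pl(beta, time, event, z):
    # Cox log partial likelihood for a constant coefficient beta
    ll = 0.0
    for i in np.flatnonzero(event == 1):
        risk = time >= time[i]
        ll += beta * z[i] - np.log(np.exp(beta * z[risk]).sum())
    return ll

def mh_beta(time, event, z, n_iter=2000, sigma2=1.0, L=10.0, step=0.5):
    """Random-walk Metropolis for beta with a N(0, sigma2) prior
    truncated to [-L, L]; the target is partial likelihood x prior."""
    time, event, z = map(np.asarray, (time, event, z))
    beta, draws = 0.0, []
    lp = log_pl(beta, time, event, z) - beta ** 2 / (2 * sigma2)
    for _ in range(n_iter):
        prop = beta + step * rng.standard_normal()
        if abs(prop) <= L:                         # truncation at +/- L
            lp_prop = log_pl(prop, time, event, z) - prop ** 2 / (2 * sigma2)
            if np.log(rng.random()) < lp_prop - lp:
                beta, lp = prop, lp_prop
        draws.append(beta)
    return np.array(draws)
```

With z ≡ 0 the partial likelihood does not depend on β, so the draws simply explore the truncated prior, which gives a direct correctness check of the sampler.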
Although we could put a prior on σ², we instead treat it as a hyperparameter for computational simplicity. We also use the Bayesian bootstrap proposed by Kim and Lee (2003) for posterior sampling of the γ_l, since it is fast and gives more stable results; details of the Bayesian bootstrap and its applications can be found in Kim and Lee (2003) and Kim et al. (2011). After obtaining the posterior samples, we calculate the Bayes estimate of P(η = 1 | D_{1:n}). If the estimate exceeds 0.5, we reject the equivalence of the hazard functions, i.e., we choose the model with the higher posterior probability. Kass and Raftery (1995) proposed procedures for model selection, but we have only two models, so this approach is reasonable, though it may be seen as slightly liberal.

Simulation studies
In this section, we perform numerical studies for various choices of β(t). Let τ = 6, L = 10, and σ² = 1.0, and generate the censoring random variables from truncated exponential distributions. Knot selection is commonly critical, especially when censoring rates are high, but here we consider the simple scheme in which the inner knots are equally spaced on [l_n + ε, u_n − ε] for very small ε > 0. Note that data outside the range [l_n, u_n] have no effect on testing β. Emmanuel et al. (2010) and Eduard and Paulo (2014) introduced adaptive knot selection, whereas our simulations adopt simplified scenarios to examine the properties of the proposed test.
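A minimal version of this knot scheme can be sketched as follows (treating l_n and u_n as the smallest and largest uncensored times is our assumption, since the paper's definition is not reproduced here):

```python
import numpy as np

def inner_knots(time, event, n_knots, eps=1e-3):
    """Equally spaced inner knots on [l_n + eps, u_n - eps], where l_n
    and u_n are taken as the min/max uncensored times (an assumption)."""
    t = np.asarray(time)[np.asarray(event) == 1]   # uncensored times only
    return np.linspace(t.min() + eps, t.max() - eps, n_knots)
```

Placing the knots inside the observed failure range avoids spending basis functions where no events contribute to the partial likelihood.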
The quantity reported for the proposed test is the estimate of P(η = 1 | D_{1:n}). From these simulations, we were able to verify the numerical properties of the proposed method. The details of the three models are given below, and further details are illustrated in Fig. 2, with the results summarized in Table 1. First, we conducted the log rank, Yang and Prentice, Fleming and Harrington, and Renyi tests using p values (rejecting the null hypothesis of hazard equivalence if the p value is not greater than 0.05). The proposed test was implemented with five B-spline basis functions.
All numbers in Table 1 represent the rejection ratio of the equivalence of hazard functions from 100 replications. In the Fleming and Harrington tests, 1 and 2 mean that we use Ŝ(t) and 1 − Ŝ(t) as weights, respectively, where Ŝ(t) is the pooled estimator of the survival function; in the Renyi tests, 1 and 2 mean giving more weight to early and to late differences, respectively. As shown in Table 1, the power of the log rank test is outstandingly high when the proportional hazards assumption holds. The Fleming and Harrington test performs similarly to the log rank test under the proportional hazards assumption, while its performance varies when the ratio changes over time and is very sensitive to weight selection. The behavior of the Renyi tests is similar to that of the Fleming and Harrington tests, with slightly lower power. The Yang and Prentice test largely performs well across scenarios because it theoretically covers a wider class of models than the proportional hazards model. It is also interesting to note that the proposed test performs well under the various simulation conditions, particularly when the ratio of the hazard functions is far from 1, even when that ratio is not continuous.

Crossing and diverging hazards
The following simulation setups are motivated by crossing hazards. For example, Schein (1982) (Gastrointestinal Tumor Study Group) reported a trial comparing chemotherapy with combined chemotherapy and radiation therapy in which the ratio of survival functions (with the former group in the denominator) varied from under 1 to over 1, crossing along the time line. Importantly, this implied that enduring radiation is somewhat risky but increases the life expectancy of patients. It is a classical instance of crossing hazards, which tends to cause low power in the standard log rank test. Crossing hazards raise interesting problems, including the identification of change points in the ratio of hazard functions and the estimation of the hazard functions, studied by Muggeo and Miriam (2010). We also consider diverging hazards, in which the hazard ratio is a monotone function that never equals 1 (if the ratio attains 1, the problem reduces to crossing hazards).
Figure 3 shows the survival functions generated from each model. The survival functions show no crossing, despite the crossing hazards; interestingly, however, the difference between the survival functions varies in curvature.
To examine numerical properties and testing power, we increase the data size in combination with varied censoring rates (n = 50, 100, and 200 with censoring rates of 0.30, 0.50, and 0.70). We devised simulation schemes similar to those in the previous section and summarize the results in Table 2; the numbers represent the rejection ratio of hazard function equivalence from 100 replications.
Above all, Table 2 clearly shows that larger data sizes and lower censoring rates improve performance. Note that the performance of the Fleming and Harrington and Renyi tests still depends on the weights: they performed best with appropriate weights, while wrong weight selection tends to result in fairly poor performance. Moreover, we found that the proposed test performs better than the log rank and Yang and Prentice tests in all simulation scenarios when the censoring rate is not high. However, high censoring rates generally attenuate its performance relative to the other tests when the data size is relatively large.

Fig. 3 Survival functions of crossing hazards and diverging hazards
In the left side of each plot in Fig. 4, the proposed test is shown to perform best when the censoring rate is low or moderate, but it is inferior to Fleming and Harrington test 2 when censoring rates are high. For diverging hazard functions, the proposed test, the Yang and Prentice test, and the log rank test revealed similar patterns when the censoring rate is not high. With a high censoring rate, the proposed test did not outperform the other tests in M 4, while it performed well in M 3. We omitted the Renyi tests from this figure because their performance is similar to that of the Fleming and Harrington tests; Fig. 6 in the Appendix compares these two tests.
Since our test is based on B-spline basis functions and is therefore nonparametric in nature, the data size is strongly associated with the testing power. In high-censoring environments, uncensored data are rare, which can reduce the efficiency of the proposed test, primarily because of its nonparametric nature.
When performing data analysis, it is essential to select the number of knots and the knot points carefully to circumvent this shortcoming of the proposed method. Nevertheless, it is clear that the proposed method accommodates many challenging equivalence testing problems that existing methods cannot effectively address.
Remark Each MCMC chain in the simulations had a size of 200, obtained after 1000 burn-in iterations and thinning by 25. We show the posterior of η in one replication in Fig. 7 of the Appendix; the cumulative means stabilize as the posterior sample grows, implying that the estimates of P(η = 1 | D_{1:n}) are stable. In addition, Fig. 8

Real data analysis
We consider the data set available in the R package YPmodel, analyzed by Yang and Prentice (2005). The data set includes 90 patients, half of whom were treated with chemotherapy and the other half with chemotherapy combined with radiation therapy; there were two censored observations in the former group and six in the latter. Yang and Prentice (2005) showed that the survival functions, by Kaplan-Meier estimates (Kaplan and Meier 1958), crossed near 1000 days on the x-axis in Fig. 5, which provides strong evidence of crossing hazards. In addition, the Fleming and Harrington tests (1/2) give p values of 0.04 and 0.16, respectively, and the Renyi tests (1/2) give p values of 0.01 and 0.30, respectively, supporting the existence of crossing hazards. The posterior probability of η = 1 is 0.62 from the proposed test, and the Yang and Prentice test gives a p value of 0.03, while the log rank test gives a p value over 0.25. Taken together, these results show that the proposed test performs well, and that the proposed and Yang and Prentice tests identify the non-equivalence of the hazard functions.

Conclusions
We showed that the Bayesian test works well for testing hazard function equivalence, especially when crossing hazards appear. Bayesian tests commonly suffer from computational complexity or inconsistency; even so, we can construct a consistent Bayesian test via the B-spline basis functions. However, we also found that the selection of p and of the number of B-spline basis functions remains an open issue. Using P-splines or putting priors on p in a_n could be considered further, possibly giving better performance in high-censoring environments.
In addition, we can extend the proposed test to k (≥ 3) groups by modeling λ(t|z) = λ_0(t) exp(Σ_{i=1}^{k−1} β_i(t) z_i), where {z_i}_{i=1}^{k−1} are indicators distinguishing each group from the baseline group. In this paper we only consider testing β(·) ≡ 0; however, the estimation of β and the detection of the crossing time are of interest in medical research. Since the proposed approach is based on Bayesian methodology, the extension to estimation of β is feasible and tractable to implement. We leave Bayesian estimation and testing for crossing hazards as interesting future work.
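The indicator coding for the k-group extension can be sketched as follows (our helper; the group labels and baseline choice are illustrative):

```python
import numpy as np

def group_indicators(labels, baseline=0):
    """Return the k-1 indicator columns z_1, ..., z_{k-1} that
    distinguish each non-baseline group from the baseline group."""
    labels = np.asarray(labels)
    groups = [g for g in np.unique(labels) if g != baseline]
    return np.column_stack([(labels == g).astype(int) for g in groups])
```

Each column z_i then receives its own time-varying coefficient β_i(t) in the extended model, and the joint test is β_1(·) ≡ … ≡ β_{k−1}(·) ≡ 0.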