# Bayesian test for hazard ratio in survival analysis

- Gwangsu Kim^{1} (corresponding author)
- Seong-Whan Lee^{2}

**Received:** 24 November 2015

**Accepted:** 22 April 2016

**Published:** 17 May 2016

## Abstract

Over the decades, testing for equivalence of hazard functions has received wide attention in survival analysis. In this paper, we propose a Bayesian test to address this problem. Above all, the proposed test is methodologically flexible, so that a procedure for determining weights is not required when the proportional hazards assumption is violated. In comparison with popular methods, the proposed test is shown to be more powerful and robust in testing differences of hazard functions, even in the presence of crossing hazard functions. Extensive simulations and a real data analysis were conducted, demonstrating that the proposed test shows outstanding performance and holds desirable numerical properties.

## Background

Inference of the survival function \(\mathrm{P}( T >t )\) is a main focus of survival analysis, where *T* follows the distribution *F* on \([0, \infty ).\) Survival functions play a key role in testing the effects of clinical therapies or drugs, in reliability analysis in engineering, and in estimating the risk of bankruptcy.

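Let *T* be a survival time. The hazard function of *T*, \(\lambda (t)\), is defined as \(\lambda (t) = \lim _{\Delta \downarrow 0} \mathrm{P}( t \le T < t + \Delta \mid T \ge t ) / \Delta .\) If *F* has a density *f* dominated by Lebesgue measure, then \(\lambda (t) = f(t) / ( 1 - F(t) ).\) In survival analysis we also have a censoring variable *C*, and observe \(X = \mathrm{min} (T, C).\)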
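As a minimal sketch of the censoring mechanism, the following Python snippet draws hypothetical survival and censoring times and forms the observed pair \((X, \delta )\). The exponential distributions, the rates, and the helper names `observe` and `simulate` are illustrative assumptions and are not taken from the paper.

```python
import random


def observe(t, c):
    """Return (X, delta): X = min(T, C), delta = 1 if the failure was observed."""
    return (t, 1) if t <= c else (c, 0)


def simulate(n, failure_rate=1.0, censoring_rate=0.5, seed=0):
    """Draw n right-censored observations (exponential T and C are assumptions)."""
    rng = random.Random(seed)
    return [
        observe(rng.expovariate(failure_rate), rng.expovariate(censoring_rate))
        for _ in range(n)
    ]


data = simulate(100)
# Empirical share of censored observations (delta == 0).
censored_fraction = sum(1 for _, delta in data if delta == 0) / len(data)
```

Under these assumed exponential rates, the expected censoring fraction is `censoring_rate / (failure_rate + censoring_rate)`, so changing the censoring rate is one simple way to mimic lighter or heavier censoring.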

If we have separate groups and our main interest is in differences between their hazard functions, we need to test the equality of the hazard functions. To this end, Mantel (1966) proposed the log rank test, and many analogous methods motivated by it (e.g., the weighted log rank tests) were studied by Gehan (1965), Peto and Peto (1972), and Prentice (1978). The log rank test commonly suffers from low power when the ratio of the hazard functions varies over time. For this reason, the weighted log rank tests were developed, and various theoretical properties of these tests, relying on martingale theory, were established in Gill (1980), Harrington et al. (1982), Fleming and Harrington (2005), and Andersen et al. (1993). More importantly, it was shown that the tests are consistent, and the test proposed in Harrington et al. (1982) was proved to be the locally most powerful rank test in a specific class of survival functions. However, the power of the aforementioned tests may vary depending on the form of the hazard functions. The Renyi test, motivated by Rényi (1953), has also been widely used in practice. This test requires weights similar to those of the weighted log rank tests.
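The unweighted log rank test mentioned above can be sketched in a few lines of Python. This is an illustrative implementation of the standard two-sample statistic (observed minus expected failures in one group, with the hypergeometric variance), not code from the paper; the function name `logrank_test` and its argument layout are assumptions.

```python
import math


def logrank_test(times, events, groups):
    """Two-sample log rank test.

    times: observed X_i; events: 1 = failure, 0 = censored; groups: 0 or 1.
    Returns the chi-square statistic (1 df) and its p value.
    """
    failure_times = sorted({x for x, d in zip(times, events) if d == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in failure_times:
        n = sum(1 for x in times if x >= t)  # pooled number at risk
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n  # observed minus expected in group 1
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        raise ValueError("both groups must be at risk at some failure time")
    stat = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square, 1 df
    return stat, p_value
```

Because every failure time receives the same weight, early and late differences of opposite sign can cancel in `obs_minus_exp`, which is the mechanism behind the loss of power under crossing hazards.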

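Cox (1972) proposed the proportional hazards model \(\lambda (t \mid z) = \lambda _0(t) \exp ( \beta z ),\) where \(\lambda _0\) is a baseline hazard function and *z* is a covariate. If we perform a test procedure for \(M_0 : \beta =0 \ \mathrm{against} \ M_1 : \beta \ne 0\) where *z* is an indicator variable for each group (0 = control group, 1 = treatment group), it is equivalent to testing equivalence of the hazard functions against \(\lambda (t)/\lambda _0(t) =c \ne 1\) for all *t*. Thus this test may lose power when \(\lambda (t)/\lambda _0(t)\) is a time-varying function, especially in the case of \(\lambda (t)/\lambda _0(t) = \exp (t-1/2)\) on [0, 1], i.e., crossing hazards. Thus if we consider the time-varying Cox’s model such as \(\lambda (t \mid z) = \lambda _0(t) \exp ( \beta (t) z ),\) testing \(\beta (\cdot ) \equiv 0\) covers alternatives in which the hazard ratio changes over time, including crossing hazards.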

When it comes to testing equivalence of hazards including crossing hazards, Bayesian methods have scarcely been used. Although Kalbfleisch (1978), Hjort (1990), and Kim (2006) turned to Bayesian methodology for estimation of the hazard or survival function, and Kim et al. (2011) proposed a Bayesian test for monotone hazards, to the best of our knowledge there are only a few studies on testing equivalence of hazards including crossing hazards.

*p* value, but the construction of the test procedure is easy and the interpretation of this test is straightforward.

In addition, the theoretical results of Kim (2012) imply consistency of this Bayesian test using only the partial likelihood when we use a prior \(\pi (\beta )\) under \(M_1\) supported on the class of functions that are absolutely bounded and spanned by the B-spline basis functions (obviously the prior for \(\beta\) under \(M_0\) is a Dirac measure at 0). Under regularity conditions and prior masses of *q* and \(1-q \ (0< q <1)\) for the models \(M_0\) and \(M_1\), respectively, Kim (2012) shows that we can take \(\mathcal {F}\) to be the function class in which all derivatives of order 0 to \(p \ ( \in \mathbb {N} )\) are absolutely bounded on a compact set of the time axis.

In this paper, we construct the Bayesian test based on the results of Kim (2012). We first describe the model, the data, and the test, and then present the priors and posteriors for the Bayesian test. Simulation studies and a real data analysis follow. Concluding remarks and discussion are presented in the last section.

## Model and Bayesian test

*I* are distribution functions and an indicator function, respectively. Here \((C_i, \delta _i)\) is the random vector of the censoring variable and the censoring indicator, and \(z_i\) is a group indicator. We also assume that for some \(0< \tau <\infty ,\) \(G(t-) = G(t)\) on \(t \in [0, \tau )\) and \(G(\tau ) = 1.\) Note that we have no ties among the uncensored failure times \(X_i\), and the observed \(X_i\)s are bounded by \(\tau\).

*p*th \(( p \in \mathbb {N} )\) derivative of \(\beta\).

## Partial likelihood, priors and posteriors for the test

*d* with equally spaced knots and \(\eta \in \{ 0, 1\}\). See de Boor (2001) and Lyche and Mørken (2008) for details of the B-spline; Fig. 1 shows the B-spline basis functions of degree 1 and 2. We can use \(d \ge p-1\) to approximate functions in \(\mathcal {F}_{p, M}.\) Priors are then put on \(\eta \in \{ 0, 1\}\) and the \(\gamma _l\)s, and we let \(a_n = \left[ ( n / \log n )^{1/(2p+1)} \right]\) for the consistent Bayesian test (Kim 2012). Given the posterior probability of \(\eta =1\), we can test \(\beta (\cdot ) \equiv 0\).
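To make the B-spline construction concrete, here is a minimal Python sketch using the Cox-de Boor recursion on equally spaced knots. Several details are assumptions made for illustration: the expansion \(\beta (t) = \eta \sum _l \gamma _l B_l(t)\), the knot layout on \([0, \tau )\), the reading of the bracket in \(a_n\) as the integer part, and all function names.

```python
import math


def uniform_knots(num_basis, degree, tau=1.0):
    """Equally spaced knots giving `num_basis` B-splines of `degree` on [0, tau)."""
    h = tau / (num_basis - degree)
    return [(j - degree) * h for j in range(num_basis + degree + 1)]


def bspline(l, degree, knots, t):
    """Value at t of the l-th B-spline of the given degree (Cox-de Boor recursion)."""
    if degree == 0:
        return 1.0 if knots[l] <= t < knots[l + 1] else 0.0
    left = (t - knots[l]) / (knots[l + degree] - knots[l])
    right = (knots[l + degree + 1] - t) / (knots[l + degree + 1] - knots[l + 1])
    return (left * bspline(l, degree - 1, knots, t)
            + right * bspline(l + 1, degree - 1, knots, t))


def beta(t, gamma, degree, knots, eta=1):
    """Assumed form beta(t) = eta * sum_l gamma_l B_l(t); eta = 0 gives beta = 0."""
    return eta * sum(g * bspline(l, degree, knots, t) for l, g in enumerate(gamma))


def a_n(n, p):
    """Integer part of (n / log n)^(1 / (2p + 1)); reading [.] as floor is an assumption."""
    return math.floor((n / math.log(n)) ** (1.0 / (2 * p + 1)))


knots = uniform_knots(num_basis=5, degree=2)
value = beta(0.3, gamma=[0.5, -0.2, 0.1, 0.4, -0.3], degree=2, knots=knots)
```

On \([0, \tau )\) the basis functions sum to one, so setting all \(\gamma _l\) equal yields a constant \(\beta\), i.e., a proportional hazards alternative, while unequal coefficients allow the log hazard ratio to change sign over time.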

*L* are hyperparameters.
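For orientation, the log partial likelihood of a Cox model with a time-varying coefficient can be sketched as below. This is the standard form for data without ties, \(\sum _{i: \delta _i = 1} \{ \beta (X_i) z_i - \log \sum _{j: X_j \ge X_i} \exp ( \beta (X_i) z_j ) \}\), and is assumed here rather than copied from the paper's own display; the function name and the callable `beta` argument are hypothetical.

```python
import math


def log_partial_likelihood(beta, times, events, z):
    """Log partial likelihood for hazard lambda_0(t) * exp(beta(t) * z).

    beta: callable t -> float; times: observed X_i;
    events: 1 = failure, 0 = censored; z: group indicators (0 or 1).
    Assumes no ties among the uncensored failure times.
    """
    total = 0.0
    for i, (x_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored observations contribute only through risk sets
        b = beta(x_i)
        # Sum over the risk set {j : X_j >= X_i}.
        risk = sum(math.exp(b * z_j) for x_j, z_j in zip(times, z) if x_j >= x_i)
        total += b * z[i] - math.log(risk)
    return total


# Under the null model (beta identically 0) every subject at risk is equally
# likely to fail, so only the risk-set sizes matter.
null_value = log_partial_likelihood(lambda t: 0.0, [1, 2, 3], [1, 1, 1], [0, 1, 0])
```

Because the baseline hazard cancels in each ratio, only \(\beta (\cdot )\) needs a prior, which is what allows the test to work with the partial likelihood alone.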

## Simulation studies

### Simple setups and results

*p* values (we reject the null hypothesis of equivalence of hazards if the *p* value is not greater than 0.05). The proposed test was implemented with five B-spline basis functions.

Table 1 Results from various setups and tests

Model | Proposed | Y&P | F&H (1/2) | Renyi (1/2) | Log rank |
---|---|---|---|---|---|
\(M_0\) | 0.01 | 0.04 | 0.02/0.04 | 0.03/0.02 | 0.04 |
\(M_1\) | 1.00 | 0.86 | 0.78/0.72 | 0.71/0.67 | 0.86 |
\(M_2\) | 0.97 | 0.90 | 0.47/0.94 | 0.39/0.93 | 0.79 |

All numbers in Table 1 represent the rejection ratio of equivalence of hazard functions from 100 replications. For the Fleming and Harrington tests, 1 and 2 mean that we use \(\hat{S}(t)\) and \((1-\hat{S}(t))\) as weights, respectively, where \(\hat{S}(t)\) is the pooled estimator of the survival function. For the Renyi tests, 1 and 2 mean giving more weight to early and to late differences, respectively. As shown in Table 1, the power of the log rank test is high when the proportional hazards assumption holds. The Fleming and Harrington test behaves similarly to the log rank test under the proportional hazards assumption, while its performance varies when the ratio changes over time; it is very sensitive to weight selection. The Renyi tests behave similarly to the Fleming and Harrington tests, with slightly lower power. The Yang and Prentice test performs well across a range of scenarios because it theoretically covers wider models than the proportional hazards model. It is also interesting to note that the proposed test performs well in the various simulation conditions, particularly when the ratio of hazard functions is far from 1, even when the ratio is not continuous.
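The two Fleming and Harrington weight choices can be illustrated with a short Python sketch that computes the pooled Kaplan–Meier estimate and evaluates both weights at each failure time. Using the left-continuous value \(\hat{S}(t-)\), the usual convention for weighted log rank tests, is an assumption here, as is the function name.

```python
def fleming_harrington_weights(times, events):
    """Return [(t, S(t-), 1 - S(t-))] at each distinct failure time.

    times: observed X_i pooled over both groups; events: 1 = failure, 0 = censored.
    S is the pooled Kaplan-Meier estimate, evaluated just before each failure time.
    """
    surv = 1.0
    weights = []
    for t in sorted({x for x, d in zip(times, events) if d == 1}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, d in zip(times, events) if x == t and d == 1)
        weights.append((t, surv, 1.0 - surv))  # weight 1 (early), weight 2 (late)
        surv *= 1.0 - deaths / at_risk  # Kaplan-Meier update after time t
    return weights


w = fleming_harrington_weights([1, 2, 3, 4], [1, 1, 0, 1])
```

The first weight starts at 1 and decreases, emphasising early differences, while the second starts at 0 and increases, emphasising late differences; this is why the two versions can behave so differently when the hazard ratio changes over time.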

### Crossing and diverging hazards

To examine numerical properties and testing powers, we increase data size in combination with varied censoring rates (e.g., \(n=50, 100,\) and 200 with censoring rates of 0.30, 0.50, and 0.70). We contrived simulation schemes similar to the previous section, and summarized simulation results in Table 2. The numbers in the table represent the rejection ratio of hazard function equivalence from 100 replications.

Above all, Table 2 clearly shows that larger data sizes and lower censoring rates improve performance. Note that the performance of the Fleming and Harrington tests and the Renyi tests still depends on the weights: they perform best with appropriate weights, whereas a wrong choice of weights tends to result in fairly poor performance. Moreover, we found that the proposed test performs at least as well as the log rank and Yang and Prentice tests in every scenario when the censoring rate is low (0.30). However, at the high censoring rate (0.70) the proposed test generally loses power relative to the other tests when the data size is relatively large.

Since our test is based on B-spline basis functions and is therefore non-parametric in nature, its power is strongly associated with the data size. In high censoring environments, uncensored observations are rare, which can reduce the efficiency of the proposed test.

Table 2 Results from various setups and tests

Data size and model | Censoring rate | Proposed | Y&P | F&H (1/2) | Renyi (1/2) | Log rank |
---|---|---|---|---|---|---|
\(n = 50\) | | | | | | |
\(M_3\) | 0.30 | 0.94 | 0.69 | 0.27/0.89 | 0.18/0.82 | 0.52 |
\(M_3\) | 0.50 | 0.59 | 0.34 | 0.10/0.63 | 0.06/0.53 | 0.26 |
\(M_3\) | 0.70 | 0.22 | 0.10 | 0.03/0.26 | 0.02/0.22 | 0.07 |
\(M_4\) | 0.30 | 0.98 | 0.88 | 0.72/0.93 | 0.62/0.90 | 0.85 |
\(M_4\) | 0.50 | 0.68 | 0.69 | 0.53/0.70 | 0.46/0.64 | 0.66 |
\(M_4\) | 0.70 | 0.50 | 0.36 | 0.29/0.41 | 0.23/0.36 | 0.34 |
\(n = 100\) | | | | | | |
\(M_3\) | 0.30 | 1.00 | 0.96 | 0.51/1.00 | 0.43/0.98 | 0.80 |
\(M_3\) | 0.50 | 0.84 | 0.73 | 0.29/0.91 | 0.23/0.85 | 0.61 |
\(M_3\) | 0.70 | 0.22 | 0.24 | 0.10/0.50 | 0.10/0.44 | 0.19 |
\(M_4\) | 0.30 | 1.00 | 1.00 | 0.93/1.00 | 0.91/1.00 | 0.99 |
\(M_4\) | 0.50 | 0.96 | 0.94 | 0.82/0.96 | 0.80/0.96 | 0.93 |
\(M_4\) | 0.70 | 0.56 | 0.66 | 0.60/0.67 | 0.51/0.61 | 0.65 |
\(n = 200\) | | | | | | |
\(M_3\) | 0.30 | 1.00 | 1.00 | 0.74/1.00 | 0.61/1.00 | 1.00 |
\(M_3\) | 0.50 | 0.95 | 0.98 | 0.37/1.00 | 0.29/1.00 | 0.88 |
\(M_3\) | 0.70 | 0.31 | 0.43 | 0.09/0.84 | 0.18/0.78 | 0.29 |
\(M_4\) | 0.30 | 1.00 | 1.00 | 1.00/1.00 | 1.00/1.00 | 1.00 |
\(M_4\) | 0.50 | 1.00 | 1.00 | 0.98/1.00 | 0.96/1.00 | 1.00 |
\(M_4\) | 0.70 | 0.79 | 0.98 | 0.88/0.94 | 0.83/0.93 | 0.95 |

### Remark

Each MCMC chain in the simulations had a size of 200, obtained after a burn-in of 1000 iterations and thinning by 25. Fig. 7 in the Appendix shows the posterior of \(\eta\) in one replication. The cumulative means became stable as the posterior sample became larger, implying that the estimates of \(\mathrm{P} ( \eta =1 | D_{1:n})\) are stable. In addition, Fig. 8 in the Appendix displays the posterior mean of \(\eta\) for 100 replications when the data size is 100 and the censoring rate is 0.50. This result supports the stability of the Bayes estimates.
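The bookkeeping behind this remark can be sketched in Python. The sketch assumes that thinning is applied after the burn-in, so that a retained sample of 200 corresponds to 1000 + 200 × 25 = 6000 raw iterations; the draws of \(\eta\) below are placeholder Bernoulli values, not output of the paper's sampler, and the function names are hypothetical.

```python
import random


def thin_chain(draws, burn_in=1000, thin=25):
    """Drop the burn-in, then keep every `thin`-th draw."""
    return draws[burn_in::thin]


def cumulative_means(values):
    """Running means, used to check that the estimate of P(eta = 1 | data) settles."""
    total = 0.0
    means = []
    for k, v in enumerate(values, start=1):
        total += v
        means.append(total / k)
    return means


rng = random.Random(1)
# Placeholder draws of eta in {0, 1}; a real chain would come from the MCMC sampler.
raw_draws = [1 if rng.random() < 0.6 else 0 for _ in range(1000 + 200 * 25)]
kept = thin_chain(raw_draws)
running = cumulative_means(kept)
posterior_prob_eta1 = running[-1]  # estimate of P(eta = 1 | data)
```

Plotting `running` against the sample index gives the kind of cumulative-mean trace described above; a flat tail suggests that the retained sample is large enough.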

## Real data analysis

We consider the data set of Yang and Prentice (2005), available in the R package YPmodel. The data set includes 90 patients, half of whom were treated with chemotherapy and the other half with chemotherapy combined with radiation therapy. There were two censored observations in the former group and six in the latter. Yang and Prentice (2005) showed that the Kaplan–Meier estimates (Kaplan and Meier 1958) of the survival functions cross near 1000 days (see the *x*-axis in Fig. 5). This is strong evidence for crossing hazards. In addition, the Fleming and Harrington tests (1/2) give *p* values of 0.04 and 0.16, respectively, and the Renyi tests (1/2) give *p* values of 0.01 and 0.30, respectively. This also supports the possibility of crossing hazards.

The posterior probability of \(\eta =1\) from the proposed test is 0.62, and the Yang and Prentice test gives a *p* value of 0.03, whereas the log rank test gives a *p* value over 0.25. Taken together, these results show that the proposed test performs well: both the proposed test and the Yang and Prentice test identify non-equivalence of the hazard functions, while the log rank test fails to detect it.

## Conclusions

We showed that the Bayesian test works well for testing equivalence of hazard functions, especially when crossing hazards appear. Bayesian tests commonly suffer from computational complexity or inconsistency. Even so, we can construct a consistent Bayesian test via the B-spline basis functions. However, we also found that the selection of *p* and of the number of B-spline basis functions remains controversial. Using P-splines or putting priors on *p* in \(a_n\) can be considered further, possibly giving better performance in high censoring environments.

## Declarations

### Authors’ contributions

We constructed a Bayesian test for equality of hazard functions. It is applicable in various circumstances, including crossing hazards and other departures from proportionality, and thus improves on the log rank test, which is limited under non-proportionality. Both authors read and approved the final manuscript.

### Acknowledgements

Lee’s research was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (No. 2012-005741).

### Competing interests

The authors declare that they have no competing interests.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- Andersen PK, Borgan O, Gill RD, Keiding N (1993) Statistical models based on counting processes. Springer, New York
- Andersen PK, Gill RD (1982) Cox’s regression model for counting processes: a large sample study. Ann Stat 10(4):1100–1120
- Chauvel C, O’Quigley J (2014) Tests for comparing estimated survival functions. Biometrika 101(3):535–552
- Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B 34(2):187–220
- de Boor C (2001) A practical guide to splines. Springer, New York
- Belitser E, Serra P (2014) Adaptive priors based on splines with random knots. Bayesian Anal 9(4):859–882
- Sharef E, Strawderman RL, Ruppert D, Cowen M, Halasyamani L (2010) Bayesian adaptive B-spline estimation in proportional hazards frailty models. Electron J Stat 4:606–642
- Fleming TR, Harrington DP (2005) Counting processes and survival analysis. Wiley, Hoboken
- Gehan EA (1965) A generalized two-sample Wilcoxon test for doubly censored data. Biometrika 52(3):650–653
- Gill R (1980) Censoring and stochastic integrals. Stat Neerl 34(2):124
- Harrington DP, Fleming TR, Gill R (1982) A class of rank test procedures for censored survival data. Biometrika 69(3):553–566
- Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109
- Hess KR (1994) Assessing time-by-covariate interactions in proportional hazards regression models using cubic spline functions. Stat Med 13(10):1045–1062
- Hjort NL (1990) Nonparametric Bayes estimators based on beta processes for life history data. Ann Stat 18(3):1259–1294
- Kalbfleisch JD (1978) Non-parametric Bayesian analysis of survival time data. J R Stat Soc Ser B 40(2):214–221
- Kaplan E, Meier P (1958) Nonparametric estimation from incomplete observations. J Am Stat Assoc 53(282):457–481
- Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90(430):773–795
- Kim Y (2006) The Bernstein–von Mises theorem for the proportional hazard model. Ann Stat 34(4):1678–1700
- Kim G (2012) Posterior contraction rate of the proportional hazards model having a nonparametric link and its applications. PhD thesis, Seoul National University, Department of Statistics
- Kim Y, Lee JY (2003) Bayesian bootstrap for proportional hazards models. Ann Stat 31(6):1905–1922
- Kim Y, Park JK, Kim G (2011) Bayesian analysis for monotone hazard ratio. Lifetime Data Anal 17(2):302–320
- Lyche T, Mørken K (2008) Spline Methods (draft). Department of Informatics, Center of Mathematics for Applications, University of Oslo, Oslo
- Mantel N (1966) Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep Part 1 50(3):163–170
- Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equations of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092
- Muggeo VMR, Tagliavia M (2010) A flexible approach to crossing hazards problem. Stat Med 29(18):1947–1957
- Peto R, Peto J (1972) Asymptotically efficient rank invariant test procedures (with discussion). J R Stat Soc Ser A 135(2):185–206
- Prentice RL (1978) Linear rank test with right censored data. Biometrika 65(1):291–298
- Rényi A (1953) On the theory of order statistics. Acta Math Hung 4(3–4):191–231
- Schein PD (1982) A comparison of combination chemotherapy and combined modality therapy for locally advanced gastric carcinoma. Cancer 49(9):1771–1777
- Verweij JM, van Houwelingen HC (1995) Time-dependent effects of fixed covariates in Cox regression. Biometrics 51:1088–1108
- Yang S, Prentice R (2005) Semiparametric analysis of short-term and long-term hazard ratio with two-sample survival data. Biometrika 92(1):1–17