A gentle introduction to group sequential design
Source:vignettes/GentleIntroductionToGSD.Rmd
GentleIntroductionToGSD.Rmd
Introduction
This article is intended to give a gentle mathematical and statistical introduction to group sequential design. We also provide relatively simple examples from the literature to explain clinical applications. There is no programming shown, but by accessing the source for the article all required programming can be accessed; substantial commenting is provided in the source in the hope that users can understand how to implement the concepts developed here. Hopefully, the few mathematical and statistical concepts introduced will not discourage those wishing to understand some underlying concepts for group sequential design.
A group sequential design enables repeated analysis of an endpoint for a clinical trial to enable possible early stopping of a trial for either a positive result, for futility, or for a safety issue. This approach can
- limit exposure risk to patients and clinical trial investment past the time where known unacceptable safety risks have been established for the endpoint of interest,
- limit investment in a trial where interim results suggest further evaluation for a positive efficacy finding is futile, or
- accelerate the availability of a highly effective treatment by enabling early approval following an early positive finding.
Examples of outcomes that might be considered include:
- a continuous outcome such as change from baseline at some fixed follow-up time in the HAM-D depression score,
- absolute or difference or risk ratio for a response rate (e.g., in oncology) or failure rate for a binary (yes/no) outcome, and
- a hazard ratio for a time-to-event out such such as time-to-death or disease progression in an oncology trial or for time until a cardiovascular event (death, myocardial infarction or unstable angina).
Examples of the above include:
- a new treatment for major depression where an interim analysis of a continuous outcome stopped the trial for futility (Binneman et al. (2008)),
- a new treatment for patients with unstable angina undergoing balloon angioplasty with a positive interim finding for a binary outcome of death, myocardial infarction or urgent repeat intervention within 30 days (The CAPTURE Investigators (1997)), and
- a new treatment for patients with lung cancer based on a positive interim finding for time-to-death (Gandhi et al. (2018)).
Group sequential design framework
We assume
- A two-arm clinical trial with a control and experimental group.
- There are k analyses planned for some integer k> 1.
- There is a natural parameter \delta describing the underlying treatment difference with an estimate that has an asymptotically normal and efficient estimate \hat\delta_j with variance \sigma_j^2 and corresponding statistical information \mathcal{I}_j=1/\sigma_j^2, at analysis j=1,2,\ldots,k. A positive value favoring experimental treatment and negative value favoring control. We assume a consistent estimate \hat\sigma_j^2 of \sigma_j^2, j=1,2,\ldots,k.
- The information fraction is defined as t_j=\mathcal{I}_i/\mathcal{I}_j at analysis j=1,\ldots,k.
- Correlations between estimates at different analyses are \text{Corr}(\hat\delta_i,\hat\delta_j)=\sqrt{\mathcal{I}_i/\mathcal{I}_j}=\sqrt{t_j} for 1\le i\le j\le k.
- There is a test test Z_j\approx\hat\delta_j/\hat{\sigma}^2_j.
For a time-to-event outcome, \delta would typically represent the logarithm of the hazard ratio for the control group versus the experimental group. For a difference in response rates, \delta would represent the underlying response rates. For a continuous outcome such as the HAM-D, we would examine the difference in change from baseline at a milestone time point (e.g., at 6 weeks as in Binneman et al. (2008)). For j=1,\ldots,k, the tests Z_j are asymptotically multivariate normal with correlations as above, and for i=1,\ldots,k have \text{Cov}(Z_i,Z_j)=\text{Corr}(\hat\delta_i,\hat\delta_j) and E(Z_j)=\delta\sqrt{I_j}.
This multivariate asymptotic normal distribution for Z_1,\ldots,Z_k is referred to as the canonical form by Jennison and Turnbull (2000) who have also summarized much of the surrounding literature.
Bounds for testing
One-sided testing
We assume that the primary test the null hypothesis H_{0}: \delta=0 against the alternative H_{1}: \delta = \delta_1 for a fixed effect size \delta_1 > 0 which represents a benefit of experimental treatment compared to control. We assume further that there is interest in stopping early if there is good evidence to reject one hypothesis in favor of the other. For i=1,2,\ldots,k-1, interim cutoffs l_{i}< u_{i} are set; final cutoffs l_{k}\leq u_{k} are also set. For i=1,2,\ldots,k, the trial is stopped at analysis i to reject H_{0} if l_{j}<Z_{j}< u_{j}, j=1,2,\dots,i-1 and Z_{i}\geq u_{i}. If the trial continues until stage i, H_{0} is not rejected at stage i, and Z_{i}\leq l_{i} then H_{1} is rejected in favor of H_{0}, i=1,2,\ldots,k. Thus, 3k parameters define a group sequential design: l_{i}, u_{i}, and \mathcal{I}_{i}, i=1,2,\ldots,k. Note that if l_{k}< u_{k} there is the possibility of completing the trial without rejecting H_{0} or H_{1}. We will often restrict l_{k}= u_{k} so that one hypothesis is rejected.
We begin with a one-sided test. In this case there is no interest in stopping early for a lower bound and thus l_i= -\infty, i=1,2,\ldots,k. The probability of first crossing an upper bound at analysis i, i=1,2,\ldots,k, is
\alpha_{i}^{+}(\delta)=P_{\delta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{Z_{j}< u_{j}\}\}
The Type I error is the probability of ever crossing the upper bound when \delta=0. The value \alpha^+_{i}(0) is commonly referred to as the amount of Type I error spent at analysis i, 1\leq i\leq k. The total upper boundary crossing probability for a trial is denoted in this one-sided scenario by
\alpha^+(\delta) \equiv \sum_{i=1}^{k}\alpha^+_{i}(\delta)
and the total Type I error by \alpha^+(0). Assuming \alpha^+(0)=\alpha the design will be said to provide a one-sided group sequential test at level \alpha.
Asymmetric two-sided testing
With both lower and upper bounds for testing and any real value \delta representing treatment effect we denote the probability of crossing the upper boundary at analysis i without previously crossing a bound by
\alpha_{i}(\delta)=P_{\delta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{ l_{j}<Z_{j}< u_{j}\}\},
i=1,2,\ldots,k. The total probability of crossing an upper bound prior to crossing a lower bound is denoted by
\alpha(\delta)\equiv\sum_{i=1}^{k}\alpha_{i}(\delta).
Next, we consider analogous notation for the lower bound. For i=1,2,\ldots,k denote the probability of crossing a lower bound at analysis i without previously crossing any bound by \beta_{i}(\delta)=P_{\delta}\{\{Z_{i}\leq l_{i}\}\cap_{j=1}^{i-1}\{ l_{j} <Z_{j}< u_{j}\}\}. The total lower boundary crossing probability in this case is written as \beta(\delta)= {\sum\limits_{i=1}^{k}} \beta_{i}(\delta).
When a design has final bounds equal (l_k=u_k), \beta(\delta_1) is the Type II error which is equal to 1 minus the power of the design. In this case, \beta_i(\delta) is referred to as the \beta-spending at analysis i, i=1,\ldots,k.
Spending function design
Type I error is most often defined with \alpha_i^+(0), i=1,\ldots,k. This is referred to as non-binding Type I error since any lower bound is ignored in the calculation. This means that if a trial is continued in spite of a lower bound being crossed at an interim analysis that Type I error is still controlled at the design \alpha-level. For Phase III trials used for approvals of new treatments, non-binding Type I error calculation is generally expected by regulators.
For any given 0<\alpha<1 we define a non-decreasing \alpha-spending function f(t; \alpha) for t\geq 0 with \alpha\left( 0\right) =0 and for t\geq 1, f( t; \alpha) =\alpha. Letting t_0=0, we set \alpha_j(0) for j=1,\ldots,k through the equation \alpha^+_{j}(0) = f(t_j;\alpha)-f(t_{j-1}; \alpha). Assuming an asymmetric lower bound, we similarly use a \beta-spending function and to set \beta-spending at analysis j=1,\ldots, k as: \beta_{j}(\delta_1) = g(t_j;\delta_1, \beta) - g(t_{j-1};\delta_1, \beta).
In the following example, the function \Phi() represents the cumulative distribution function for the standard normal distribution function (i.e., mean 0, standard deviation 1). The major depression study of Binneman et al. (2008) considered above used the Lan and DeMets (1983) spending function approximating an O’Brien-Fleming bound for a single interim analysis half way through the trial with
f(t; \alpha) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(\alpha/2)}{\sqrt{t}}\right) \right).
g(t; \beta) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(\beta/2)}{\sqrt{t}}\right) \right).
delta1 <- 3 # Treatment effect, alternate hypothesis
delta0 <- 0 # Treatment effect, null hypothesis
ratio <- 1 # Randomization ratio (experimental / control)
sd <- 7.5 # Standard deviation for change in HAM-D score
alpha <- 0.1 # 1-sided Type I error
beta <- 0.17 # Targeted Type II error (1 - targeted power)
k <- 2 # Number of planned analyses
test.type <- 4 # Asymmetric bound design with non-binding futility bound
timing <- .5 # information fraction at interim analyses
sfu <- sfLDOF # O'Brien-Fleming spending function for alpha-spending
sfupar <- 0 # Parameter for upper spending function
sfl <- sfLDOF # O'Brien-Fleming spending function for beta-spending
sflpar <- 0 # Parameter for lower spending function
delta <- 0
endpoint <- "normal"
# Derive normal fixed design sample size
n <- nNormal(
delta1 = delta1,
delta0 = delta0,
ratio = ratio,
sd = sd,
alpha = alpha,
beta = beta
)
# Derive group sequential design based on parameters above
x <- gsDesign(
k = k,
test.type = test.type,
alpha = alpha,
beta = beta,
timing = timing,
sfu = sfu,
sfupar = sfupar,
sfl = sfl,
sflpar = sflpar,
delta = delta, # Not used since n.fix is provided
delta1 = delta1,
delta0 = delta0,
endpoint = "normal",
n.fix = n
)
# Convert sample size at each analysis to integer values
x <- toInteger(x)
#> toInteger: rounding done to nearest integer since ratio was not specified as postive integer .
The planned design used \alpha=0.1,
one-sided and Type II error 17% (83% power) with an interim analysis at
50% of the final planned observations. This leads to Type I \alpha-spending of 0.02 and \beta-spending of 0.052 at the planned
interim. An advantage of the spending function approach is that bounds
can be adjusted when the number of observations at analyses are
different than planned. The actual observations for experimental versus
control at the analysis were 59 as opposed to the planned 67, which
resulted in interim spending fraction t_1= 0.4403. With the Lan-DeMets spending
function to approximate O’Brien-Fleming bounds this results in \alpha-spending of 0.0132
(P(Cross) if delta=0
row in Efficacy column) and \beta-spending of 0.0386
(P(Cross) if delta=3
row in Futility column). We note that
the Z-value and 1-sided p-values in the table below correspond exactly
and either can be used for evaluation of statistical significance for a
trial result. The rows labeled ~delta at bound
are
approximations that describe approximately what treatment difference is
required to cross a bound; these should not be used for a formal
evaluation of whether a bound has been crossed. The O’Brien-Fleming
spending function is generally felt to provide conservative bounds for
stopping at interim analysis. Most of the error spending is reserved for
the final analysis in this example. The futility bound only required a
small trend in the wrong direction to stop the trial; a nominal p-value
of 0.77 was observed which crossed the futility bound, stopping the
trial since this was greater than the futility p-value bound of 0.59.
Finally, we note that at the final analysis, the cumulative probability
for P(Cross) if delta=0
is less than the planned \alpha=0.10. This probability represents
\alpha(0) which excludes the
probability of crossing the lower bound at the interim analysis and the
final analysis. The value of the non-binding Type I error is still \alpha^+(0) = 0.10.
# Updated alpha is unchanged
alphau <- 0.1
# Updated sample size at each analysis
n.I <- c(59, 134)
# Updated number of analyses
ku <- length(n.I)
# Information fraction is used for spending
usTime <- n.I / x$n.I[x$k]
lsTime <- usTime
# Update design based on actual interim sample size and planned final sample size
xu <- gsDesign(
k = ku,
test.type = test.type,
alpha = alphau,
beta = x$beta,
sfu = sfu,
sfupar = sfupar,
sfl = sfl,
sflpar = sflpar,
n.I = n.I,
maxn.IPlan = x$n.I[x$k],
delta = x$delta,
delta1 = x$delta1,
delta0 = x$delta0,
endpoint = endpoint,
n.fix = n,
usTime = usTime,
lsTime = lsTime
)
# Summarize bounds
gsBoundSummary(xu, Nname = "N", digits = 4, ddigits = 2, tdigits = 1)
#> Analysis Value Efficacy Futility
#> IA 1: 44% Z 2.2209 -0.2304
#> N: 59 p (1-sided) 0.0132 0.5911
#> ~delta at bound 4.3370 -0.4500
#> P(Cross) if delta=0 0.0132 0.4089
#> P(Cross) if delta=3 0.2468 0.0386
#> Final Z 1.3047 1.3047
#> N: 134 p (1-sided) 0.0960 0.0960
#> ~delta at bound 1.6907 1.6907
#> P(Cross) if delta=0 0.0965 0.9035
#> P(Cross) if delta=3 0.8350 0.1650