2  Group sequential design theory and notation in brief

We begin by defining the distribution theory for the joint set of statistics used for testing in a group sequential design. While the primary purpose of the gsDesign package is to design group sequential trials, computing boundary crossing probabilities for a given design is the essential building block, and we cover it next. Finally, we discuss the expected sample size for a group sequential design.

2.1 Distributional assumptions

We illustrate the distribution theory with a sequence of normal random variates. Let \(X_{1}, X_{2},\ldots\) be independent and identically distributed normal random variables with mean \(\delta\) and variance \(\sigma^2\). For some positive integer \(k\), let \(n_{1}<n_{2}<\cdots<n_{k}\) represent fixed sample sizes where data will be analyzed and inference surrounding \(\delta\) will be examined. This is referred to as a group sequential design. The first \(k-1\) analyses are referred to as interim analyses, while the \(k^{th}\) analysis is referred to as the final analysis. For \(i=1,2,\ldots,k\) consider estimating \(\delta\) with

\[ \hat{\delta}_{i}=\sum_{j=1}^{n_{i}} X_{j}/n_i \tag{2.1}\]

The variance of \(\hat{\delta}_i\) is \(\sigma^2/n_i\) and the corresponding statistical information for \(\hat{\delta}_i\) is the reciprocal of the variance,

\[ \mathcal{I}_{i}\equiv n_{i}/\sigma^2 \tag{2.2}\]

for \(i=1,2,\ldots,k\). The value \(\delta\) will be referred to as the natural parameter for this problem. We next define the standardized parameter \(\theta=\delta/\sigma\) and its corresponding estimates (assuming \(\sigma\) known) as

\[ \hat{\theta}_{i}=\sum_{j=1}^{n_{i}} X_{j}/(\sigma n_i) \tag{2.3}\]

which has variance \(1/n_i\) for \(i=1,2,\ldots,k\).

Next, for \(i=1,2,\ldots,k\) we consider the test statistic

\[ Z_i=\sqrt{n_i}\hat{\theta}_i=\sqrt{\mathcal{I}_i}\,\hat{\delta}_i. \tag{2.4}\]

The test statistics \(Z_1, Z_2, \ldots, Z_k\) follow a multivariate normal distribution with, for \(1\leq j\leq i\leq k\),

\[ E\{Z_{i}\}=\sqrt{n_i}\delta/\sigma\equiv \sqrt{n_i}\theta \tag{2.5}\]

and

\[ Cov(Z_{j},Z_{i})=\sqrt{n_{j}/n_{i}}, \tag{2.6}\]

or, equivalently,

\[ E\{Z_{i}\}=\delta\sqrt{\mathcal{I}_{i}} \tag{2.7}\]

and

\[ Cov(Z_{j},Z_{i})=\sqrt{\mathcal{I}_{j}/\mathcal{I}_{i}}. \tag{2.8}\]

Given the above formulations, we will tend to think in terms of the natural parameter \(\delta\) and statistical information when considering estimation problems. When considering sample size, on the other hand, it is often convenient to think in terms of the standardized parameter \(\theta\).
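To make this joint distribution concrete, the following minimal simulation sketch (not part of the package; \(\sigma = 1\) and an arbitrary \(\theta = 0.2\) are assumed here) checks Equation 2.5 and Equation 2.6 empirically:

set.seed(1)
n <- c(100, 200, 300) # cumulative sample sizes n_i
theta <- 0.2 # standardized effect; sigma = 1, so delta = theta
nsim <- 10000
z <- t(replicate(nsim, {
  x <- rnorm(max(n), mean = theta)
  cumsum(x)[n] / sqrt(n) # Z_i = sum of first n_i observations / sqrt(n_i)
}))
round(colMeans(z), 2) # approximately sqrt(n) * theta: 2.00, 2.83, 3.46
round(cor(z), 2) # approximately sqrt(n_j / n_i) for j < i
round(sqrt(outer(n, n, pmin) / outer(n, n, pmax)), 2) # theoretical correlations

The empirical means and correlations should agree with the theoretical values to within simulation error.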

Jennison and Turnbull (Jennison and Turnbull 2000) refer to Equation 2.7 and Equation 2.8 as the canonical form and present several types of outcomes for which test statistics for comparing treatment groups take this form asymptotically. Specific examples in this manual consider 2-arm trials with binary or time-to-event outcomes. Note that when \(\delta=0\), the multivariate normal distribution expressed in Equation 2.7 and Equation 2.8 (or Equation 2.5 and Equation 2.6) depends only on \(\mathcal{I}_i/\mathcal{I}_k=n_i/n_k\), \(i = 1, 2, \ldots, k-1\). Scharfstein, Tsiatis and Robins (Scharfstein, Tsiatis, and Robins 1997) provided a unifying theory to justify the above distribution theory when efficient test statistics are applied to test a single parameter in a parametric or semi-parametric model; they noted many cases where test statistics have been shown to be efficient, and they analyzed and then simulated a study applying longitudinal data analysis. The application of the canonical distribution to a wide variety of problems is also discussed by Jennison and Turnbull (Jennison and Turnbull 2000), among others. In this manual we consider binomial endpoint studies and time-to-event endpoint studies as our primary examples.

Computational methods for the gsDesign package related to the above multivariate normal distribution are described in Chapter 19 of Jennison and Turnbull (Jennison and Turnbull 2000) and are not provided here. Note that other software programs such as EAST and the University of Wisconsin software use this distributional assumption for group sequential design computations.

2.2 Hypotheses and testing

We assume that the primary test is of the null hypothesis \(H_{0}\): \(\theta=0\) against the alternative \(H_{1}\): \(\theta=\theta_1\) for a fixed standardized effect size \(\theta_1>0\). Corresponding to \(\theta_1\) we have \(\delta_1=\theta_1\sigma\). We will work with \(\theta\) most often. The values of \(\theta\) and \(\delta\) will be referred to as standardized and natural parameter treatment effects, respectively. We have arbitrarily assumed that \(\theta, \delta>0\) represents a treatment benefit and will refer to this case as a positive treatment effect. We assume further that there is interest in stopping early if there is good evidence to reject one hypothesis in favor of the other. For \(i=1,2,\ldots,k-1\), interim cutoffs \(l_{i}< u_{i}\) are set; final cutoffs \(l_{k}\leq u_{k}\) are also set. For \(i=1,2,\ldots,k\), the trial is stopped at analysis \(i\) to reject \(H_{0}\) if \(l_{j}<Z_{j}< u_{j}\) for \(j=1,2,\dots,i-1\) and \(Z_{i}\geq u_{i}\). If the trial continues to analysis \(i\) without rejecting \(H_0\) and \(Z_{i}\leq l_{i}\), then \(H_{1}\) is rejected in favor of \(H_{0}\), \(i=1,2,\ldots,k\). Thus, \(3k\) parameters define a group sequential design: \(l_{i}\), \(u_{i}\), and \(\mathcal{I}_{i}\), \(i=1,2,\ldots,k\). Note that if \(l_{k}< u_{k}\) there is the possibility of completing the trial without rejecting either \(H_{0}\) or \(H_{1}\). We will often restrict \(l_{k}= u_{k}\) so that exactly one hypothesis is rejected; a sketch of the decision rule in code follows.
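The stopping rule just described can be written directly in R. The helper below (gs_decision() is a hypothetical name for illustration, not a gsDesign function) simply walks through the analyses applying the cutoffs:

# Hypothetical sketch of the group sequential decision rule above
gs_decision <- function(z, l, u) {
  k <- length(z)
  for (i in seq_len(k)) {
    if (z[i] >= u[i]) {
      return(list(analysis = i, decision = "Reject H0"))
    }
    if (z[i] <= l[i]) {
      return(list(analysis = i, decision = "Reject H1"))
    }
  }
  # Reachable only when l_k < u_k: trial ends with no hypothesis rejected
  list(analysis = k, decision = "No hypothesis rejected")
}
# Continue at the first two analyses, reject H0 at the final analysis
gs_decision(z = c(1.0, 1.5, 2.2), l = c(-2, -1, 1.96), u = c(3.4, 2.4, 1.96))

With \(l_k=u_k\), as in this example, the final step always rejects one hypothesis; the last line of the function is reachable only when \(l_k<u_k\).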

2.3 Boundary crossing probabilities: gsProbability()

2.3.1 One-sided testing

We begin with a one-sided test. In this case there is no interest in stopping early for a lower bound and thus \(l_i= -\infty\), \(i=1,2,\ldots,k\). The probability of first crossing an upper bound at analysis \(i\), \(i=1,2,\ldots,k\), is

\[ \alpha_{i}^{+}(\theta)=P_{\theta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{Z_{j}< u_{j}\}\} \]

The Type I error is the probability of ever crossing the upper bound when \(\theta=0\). The value \(\alpha^+_{i}(0)\) is commonly referred to as the amount of Type I error spent at analysis \(i\), \(1\leq i\leq k\). The total upper boundary crossing probability for a trial is denoted in this one-sided scenario by

\[ \alpha^+(\theta) \equiv \sum_{i=1}^{k}\alpha^+_{i}(\theta) \]

and the total Type I error by \(\alpha^+(0)\). Assuming \(\alpha^+(0)=\alpha\) the design will be said to provide a one-sided group sequential test at level \(\alpha\).

As an example, assume \(k = 5\), \(n_i = 100 \times i\), and \(u_i = \Phi^{-1}(0.975) \approx 1.96\), \(i = 1, 2, 3, 4, 5\). Thus, we are testing at a nominal 0.025 level at each of five equally spaced analyses. The function gsProbability() is a simple group sequential calculator that computes the probability of crossing the bounds at each analysis, as follows. gsProbability() requires a lower bound; we let \(l_i= -20\), \(1 \le i\le 5\), to make the probability of crossing a lower bound effectively 0.

library(gsDesign)
library(tibble)
library(gt)

k <- 5
x <- gsProbability(
  k = k,
  theta = 0,
  n.I = 1:k * 100,
  a = array(-20, k),
  b = array(qnorm(0.975), k)
)
tibble(
  Analysis = 1:x$k,
  N = x$n.I,
  "Upper bound" = x$upper$bound,
  "Nominal test level" = pnorm(
    x$upper$bound,
    lower.tail = FALSE
  ),
  "alpha+[i](0)" = x$upper$prob[, 1],
  "Cumulative alpha" = cumsum(x$upper$prob[, 1])
) |>
  gt() |>
  fmt_number(columns = 3:6, decimals = 3) |>
  tab_header(
    title = "Multiplicity problem with repeated testing"
  ) |>
  tab_options(data_row.padding = px(1))
Multiplicity problem with repeated testing

| Analysis |   N | Upper bound | Nominal test level | alpha+[i](0) | Cumulative alpha |
|---------:|----:|------------:|-------------------:|-------------:|-----------------:|
|        1 | 100 |       1.960 |              0.025 |        0.025 |            0.025 |
|        2 | 200 |       1.960 |              0.025 |        0.017 |            0.042 |
|        3 | 300 |       1.960 |              0.025 |        0.012 |            0.054 |
|        4 | 400 |       1.960 |              0.025 |        0.009 |            0.063 |
|        5 | 500 |       1.960 |              0.025 |        0.008 |            0.071 |

Users of gsProbability() may want to look at the full list of items output, since we have provided only a simple summary above. The help file for the routine details what is returned, which can help you produce simple summary tables such as the one above. The table shows that after only the first two equally spaced analyses, each tested at the \(\alpha = 0.025\) level, the cumulative Type I error has already increased to \(0.042\). At 5 equally spaced analyses, this increases to \(0.071\), nearly triple the nominal level used for testing at each analysis. These calculations are based on the multivariate normal distribution above.
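As a cross-check, the same error rates can be computed by direct numerical integration of the multivariate normal distribution. The sketch below assumes the mvtnorm package is available (gsProbability() uses its own numerical integration, not mvtnorm) and applies the correlation structure of Equation 2.6:

library(mvtnorm)

n <- 1:5 * 100
u <- rep(qnorm(0.975), 5)
corr <- sqrt(outer(n, n, pmin) / outer(n, n, pmax)) # Cov(Z_j, Z_i) = sqrt(n_j / n_i)
# alpha+_i(0) = P{Z_i >= u_i and Z_j < u_j for all j < i} under theta = 0
alpha_plus <- sapply(1:5, function(i) {
  if (i == 1) {
    return(pnorm(u[1], lower.tail = FALSE))
  }
  pmvnorm(
    lower = c(rep(-Inf, i - 1), u[i]),
    upper = c(u[1:(i - 1)], Inf),
    corr = corr[1:i, 1:i]
  )[1]
})
round(cumsum(alpha_plus), 3)

The cumulative sums should reproduce the Cumulative alpha column above (0.025, 0.042, 0.054, 0.063, 0.071) up to numerical integration error.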

In the above code, the call to gsProbability() returned a value into x, which was then summarized in the table. Checking class(x) shows that x is of class gsProbability:

class(x)
#> [1] "gsProbability"

The names of components are:

names(x)
#> [1] "k"       "theta"   "n.I"     "lower"   "upper"   "en"     
#> [7] "r"       "overrun"

Briefly, k is the number of analyses, theta is a vector of standardized effect sizes, and n.I is a vector containing \(\mathcal{I}_i\), \(i=1,2,\ldots,k\). The value r is a positive integer input to gsProbability() that defines how fine a grid is used for numerical integration when computing boundary crossing probabilities; it is returned unchanged from the input and will normally be left at its default of 18. The values of en and overrun will be discussed below in Section 2.4. This leaves the lists lower and upper, which have identical structure. We examine upper by

x$upper
#> $bound
#> [1] 1.959964 1.959964 1.959964 1.959964 1.959964
#> 
#> $prob
#>             [,1]
#> [1,] 0.025000000
#> [2,] 0.016558911
#> [3,] 0.012070163
#> [4,] 0.009460567
#> [5,] 0.007769166

to see that it contains a vector bound holding the values of \(u_i\), and a matrix prob of upper boundary crossing probabilities, where prob[i, j] corresponds to analysis \(i\) and the effect size theta[j], for i = 1, 2, ..., k and j from 1 to the length of theta. Further documentation is in the online help file displayed using help(gsProbability).

A Bonferroni adjustment maintains a total Type I error of \(\leq 0.025\). For \(1 \leq i\leq 5\) this uses the upper bound \(u_i=\Phi^{-1}(1-0.025/5)\). Substituting qnorm(1 - 0.025 / 5) for qnorm(0.975) in the above call to gsProbability() yields an upper bound of 2.576 (nominal \(p = 0.005\)) and total Type I error of \(0.016\), as shown below. Thus, the Bonferroni adjustment is overly conservative.

k <- 5
x <- gsProbability(
  k = k,
  theta = 0,
  n.I = 1:k * 100,
  a = array(-20, k),
  b = array(qnorm(1 - 0.025 / k), k)
)
tibble(
  Analysis = 1:x$k,
  N = x$n.I,
  "Upper bound" = x$upper$bound,
  "Nominal test level" = pnorm(
    x$upper$bound,
    lower.tail = FALSE
  ),
  "alpha+[i](0)" = x$upper$prob[, 1],
  "Cumulative alpha" = cumsum(x$upper$prob[, 1])
) |>
  gt() |>
  fmt_number(columns = 3:6, decimals = 3) |>
  tab_header(
    title = paste0(
      "Conservative control of Type I error ",
      "using a Bonferroni correction"
    )
  ) |>
  tab_options(data_row.padding = px(1))
Conservative control of Type I error using a Bonferroni correction

| Analysis |   N | Upper bound | Nominal test level | alpha+[i](0) | Cumulative alpha |
|---------:|----:|------------:|-------------------:|-------------:|-----------------:|
|        1 | 100 |       2.576 |              0.005 |        0.005 |            0.005 |
|        2 | 200 |       2.576 |              0.005 |        0.004 |            0.009 |
|        3 | 300 |       2.576 |              0.005 |        0.003 |            0.012 |
|        4 | 400 |       2.576 |              0.005 |        0.002 |            0.014 |
|        5 | 500 |       2.576 |              0.005 |        0.002 |            0.016 |

The simple Bonferroni correction does not take advantage of the known correlations between tests and over-corrects the Type I error. This illustrates the rationale for finding bounds that control the total Type I error at a desired level such as \(\alpha = 0.025\) without being overly conservative. We now illustrate bound calculation with the gsDesign() function to derive bounds controlling Type I error under the multivariate normal distribution noted above. We do not fully explain here all of the code options used to limit the output to what is essential. We note initially that we input the value n.fix as the sample size for a fixed design with no interim analyses; the resulting sample sizes at interim and final analyses are inflated so that power is retained at the same level as that fixed design.

k <- 5
x <- gsDesign(
  # Number of analyses
  k = k,
  # Superiority testing only,
  test.type = 1,
  # Equal Z-values for each bound
  sfu = "Pocock",
  # Equally spaced analysis timing
  timing = 1,
  # Assume fixed design would require N = k * 100
  n.fix = k * 100
)
x |>
  gsBoundSummary(
    exclude = c(
      "B-value", "Spending", "CP",
      "CP H1", "PP", "P(Cross) if delta=1",
      "~delta at bound"
    )
  ) |>
  gt() |>
  tab_header(
    title = "Pocock bounds for equally spaced analyses"
  ) |>
  tab_footnote("Now alpha = 0.025 is fully utilized.") |>
  tab_footnote(
    "Nominal p-value at bound",
    locations = cells_body(rows = 0:4 * 3 + 2, columns = 2)
  ) |>
  tab_footnote(
    paste0(
      "Null hypothesis probability of Type I error ",
      "at or before current analyses"
    ),
    cells_body(rows = 1:5 * 3, columns = 2)
  ) |>
  tab_options(data_row.padding = px(1))
Pocock bounds for equally spaced analyses

| Analysis  | Value                | Efficacy |
|-----------|----------------------|---------:|
| IA 1: 20% | Z                    |   2.4132 |
| N: 121    | p (1-sided)¹         |   0.0079 |
|           | P(Cross) if delta=0² |   0.0079 |
| IA 2: 40% | Z                    |   2.4132 |
| N: 242    | p (1-sided)¹         |   0.0079 |
|           | P(Cross) if delta=0² |   0.0138 |
| IA 3: 60% | Z                    |   2.4132 |
| N: 362    | p (1-sided)¹         |   0.0079 |
|           | P(Cross) if delta=0² |   0.0183 |
| IA 4: 80% | Z                    |   2.4132 |
| N: 483    | p (1-sided)¹         |   0.0079 |
|           | P(Cross) if delta=0² |   0.0219 |
| Final     | Z                    |   2.4132 |
| N: 604    | p (1-sided)¹         |   0.0079 |
|           | P(Cross) if delta=0² |   0.0250 |

Now alpha = 0.025 is fully utilized.
¹ Nominal p-value at bound
² Null hypothesis probability of Type I error at or before current analyses

We now see that although we cannot test repeatedly at a nominal \(0.025\) level, we can test at a nominal \(0.0079\) level at each analysis and control the Type I error at \(\alpha = 0.025\), without resorting to the overly conservative Bonferroni nominal level of \(0.005\). Note also that the maximum sample size of 604 reflects the inflation over the fixed-design sample size of 500 required to retain power. A quick numerical check of the bound follows.
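As a check (a sketch, not part of the manual's original code), we can feed the constant Pocock bound from the table, 2.4132, back into gsProbability(). Since the null distribution depends only on the information fractions \(n_i/n_k\) (as noted in Section 2.1), equally spaced analyses of any size give the same answer:

xp <- gsProbability(
  k = 5,
  theta = 0,
  n.I = 1:5 * 100, # only the fractions n_i / n_k matter under theta = 0
  a = array(-20, 5), # effectively no lower bound
  b = array(2.4132, 5) # constant Pocock bound from the table above
)
sum(xp$upper$prob[, 1]) # total Type I error, approximately 0.025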

2.3.2 Two-sided testing

With both lower and upper bounds for testing and any real value \(\theta\) representing treatment effect we denote the probability of crossing the upper boundary at analysis \(i\) without previously crossing a bound by

\[ \alpha_{i}(\theta)=P_{\theta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{ l_{j}<Z_{j}< u_{j}\}\}, \]

\(i = 1, 2, \ldots, k.\) The total probability of crossing an upper bound prior to crossing a lower bound is denoted by

\[ \alpha(\theta)\equiv\sum_{i=1}^{k}\alpha_{i}(\theta). \]

Next, we consider analogous notation for the lower bound. For \(i=1,2,\ldots,k\) denote the probability of crossing a lower bound at analysis \(i\) without previously crossing any bound by

\[ \beta_{i}(\theta)=P_{\theta}\{\{Z_{i}\leq l_{i}\}\cap_{j=1}^{i-1}\{ l_{j} <Z_{j}< u_{j}\}\}. \]

For symmetric testing for analysis \(i\) we would have \(l_i= - u_i\) and \(\beta_i(0)=\alpha_i(0)\), \(i=1,2,\ldots,k\). The total lower boundary crossing probability for a trial is denoted by

\[ \beta(\theta)\equiv\sum_{i=1}^{k}\beta_{i}(\theta). \]

To extend the one-sided example using repeated testing at a \(0.025\) level to two-sided testing at the \(0.05\) level, try the commands

b <- array(qnorm(0.975), 3)
x2 <- gsProbability(
  k = 3,
  theta = 0,
  n.I = c(100, 200, 300),
  a = -b,
  b = b
)
x2
#>                Lower bounds   Upper bounds
#>   Analysis  N    Z   Nominal p  Z   Nominal p
#>          1 100 -1.96     0.025 1.96     0.025
#>          2 200 -1.96     0.025 1.96     0.025
#>          3 300 -1.96     0.025 1.96     0.025
#> 
#> Boundary crossing probabilities and expected sample size assume
#> any cross stops the trial
#> 
#> Upper boundary
#>           Analysis
#>   Theta     1      2      3  Total  E{N}
#>       0 0.025 0.0166 0.0121 0.0536 286.7
#> 
#> Lower boundary
#>           Analysis
#>   Theta     1      2      3  Total
#>       0 0.025 0.0166 0.0121 0.0536

The fact that a lower bound can be crossed before crossing an upper bound means that after the first interim analysis the upper boundary crossing probability here should be slightly less than it was for the corresponding one-sided computation; comparing the two-sided total of 0.0536 above with the one-sided cumulative Type I error of 0.054 after three analyses in the first table shows that the difference is small. Recall also that x now contains the Pocock design derived above; because gsDesign() evaluates boundary crossing probabilities under both the null and alternative hypotheses, x$upper$prob has one column per \(\theta\)-value, with the second column summing to the design power of 0.9:

x$upper$prob
#>             [,1]       [,2]
#> [1,] 0.007906997 0.20587421
#> [2,] 0.005855842 0.26025148
#> [3,] 0.004509295 0.20860438
#> [4,] 0.003655118 0.14019500
#> [5,] 0.003072747 0.08507493

Group sequential designs most often employ more stringent bounds at early interim analyses than at later ones. We modify the above example to demonstrate this.

b <- qnorm(0.975) / sqrt(c(1, 2, 3) / 3)
b
#> [1] 3.394757 2.400456 1.959964
x2b <- gsProbability(
  k = 3,
  theta = 0,
  n.I = c(100, 200, 300),
  a = -b,
  b = b
)
x2b
#>                Lower bounds   Upper bounds
#>   Analysis  N    Z   Nominal p  Z   Nominal p
#>          1 100 -3.39    0.0003 3.39    0.0003
#>          2 200 -2.40    0.0082 2.40    0.0082
#>          3 300 -1.96    0.0250 1.96    0.0250
#> 
#> Boundary crossing probabilities and expected sample size assume
#> any cross stops the trial
#> 
#> Upper boundary
#>           Analysis
#>   Theta     1     2      3  Total  E{N}
#>       0 3e-04 0.008 0.0195 0.0279 298.3
#> 
#> Lower boundary
#>           Analysis
#>   Theta     1     2      3  Total
#>       0 3e-04 0.008 0.0195 0.0279

By setting the interim boundaries to be substantially higher than \(\Phi^{-1}(0.975) = 1.96\) we have drastically reduced the excess Type I error caused by multiple testing while still testing at the nominal 0.05 (2-sided) level at the final analysis. Thus, minimal adjustment to the final boundary should be required when employing such a strategy. Precise control of Type I error when using either equal bounds or adjusting relative sizes of bounds using the square root of sample size is discussed further in Section 6.2.

2.4 Expected sample size

We denote the sample size at analysis \(i\) by \(n_i\), \(i=1,2,\ldots,k\) and the sample size at the time a boundary is crossed by \(N\). The expected sample size is

\[ E_\theta \{N\}=\sum_{i=1}^k n_i\times(\alpha_i(\theta)+\beta_i(\theta)). \]

Values of \(E_\theta \{N\}\) corresponding to the \(\theta\)-values input to gsProbability() in theta are output in the vector en returned by that function; when \(l_k<u_k\), reaching the final analysis without crossing either bound contributes an additional \(n_k\) times that probability to the expectation. In the two two-sided examples above we had only the single element 0 in theta, and the expected sample sizes rounded to 287 and 298, respectively; these were labeled E{N} in the printed output. Since the probability of crossing a boundary at an interim analysis was small, the trial usually proceeds to the final analysis with 300 observations. We consider an additional \(\theta\)-value to demonstrate that the average sample number can be substantially smaller than the maximum sample size:

x2c <- gsProbability(
  k = 3,
  theta = c(0, 0.3),
  n.I = c(100, 200, 300),
  a = -b,
  b = b
)
x2c
#>                Lower bounds   Upper bounds
#>   Analysis  N    Z   Nominal p  Z   Nominal p
#>          1 100 -3.39    0.0003 3.39    0.0003
#>          2 200 -2.40    0.0082 2.40    0.0082
#>          3 300 -1.96    0.0250 1.96    0.0250
#> 
#> Boundary crossing probabilities and expected sample size assume
#> any cross stops the trial
#> 
#> Upper boundary
#>           Analysis
#>   Theta      1      2      3  Total  E{N}
#>     0.0 0.0003 0.0080 0.0195 0.0279 298.3
#>     0.3 0.3465 0.6209 0.0320 0.9994 168.6
#> 
#> Lower boundary
#>           Analysis
#>   Theta     1     2      3  Total
#>     0.0 3e-04 0.008 0.0195 0.0279
#>     0.3 0e+00 0.000 0.0000 0.0000

Thus, assuming a positive treatment effect, the average sample number was 169 compared to 298 when there was no treatment difference.
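As a final sketch (not part of the manual's code), we can recompute en from the boundary crossing probabilities returned in x2c, including the contribution from reaching the final analysis without crossing a bound:

n.I <- c(100, 200, 300)
p_stop <- x2c$upper$prob + x2c$lower$prob # stopping probabilities by analysis and theta
p_none <- 1 - colSums(p_stop) # probability no bound is ever crossed
as.vector(n.I %*% p_stop) + n.I[3] * p_none # approximately 298.3 and 168.6
x2c$en

Both computations should agree, matching the E{N} values in the printed output above.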