1  Introduction

1.1 Overview

The gsDesign package is intended to provide a flexible set of tools for designing and analyzing group sequential trials. There are other adaptive methods that can also be supported using this underlying toolset. This manual is intended as an introduction to gsDesign. Many users may just want to apply basic, standard design methods. Others will be interested in applying the toolset to very ambitious adaptive designs. We try to give some orientation to each of these sets of users, and to distinguish between the material needed by each. For those looking for a particularly simple approach to using gsDesign, a web-based Shiny app is available at https://rinpharma.shinyapps.io/gsdesign/.

The remainder of this overview provides a quick review of topics covered in this manual. The introduction continues with some basic theory behind group sequential design to provide background for the routines. There is no attempt to fully develop the theory for statistical tests or for group sequential design in general since many statisticians will already be familiar with these and there are excellent texts available such as Jennison and Turnbull (2000) and Proschan, Lan, and Wittes (2006).

The introduction continues with a simple outline of the main routines provided in the gsDesign package, followed by motivational examples that will be used later in the manual. Basic sample size calculations are shown for 2-arm binomial outcome trials using the nBinomial() function and for 2-arm time-to-event endpoint trials using nSurvival(); both superiority and non-inferiority trials are considered.

Further material is arranged by topic in subsequent sections. Chapter 2 provides a minimal background in asymptotic probability theory for group sequential testing. The basic calculations involve computing boundary crossing probabilities for correlated normal random variables. We demonstrate the gsProbability() routine to compute boundary crossing probabilities and expected sample size for group sequential designs.

Setting boundaries for group sequential designs, particularly using spending functions, is the main point of emphasis in the gsDesign package. Section 4.1 through Section 8.5 of the manual present the design and evaluation of group sequential trials using the gsDesign() routine.

Default parameters for gsDesign() are demonstrated for the motivational examples in Section 4.1. Basic computations for group sequential designs using boundary families and error spending are provided in Chapter 6. The primary discussion of Wang-Tsiatis boundary families (Wang and Tsiatis 1987), including O’Brien-Fleming (O’Brien and Fleming 1979) and Pocock (Pocock 1977) designs, is provided in Section 6.2.

Next, we proceed to a short discussion in Chapter 7 of gsDesign() parameters for setting Type I and II error rates and the number and timing of analyses. The section also explains how to use a measure of treatment effect to size trials, with specific discussion of event-based computations for trials with time-to-event analyses.

The basics of standard spending functions are provided in Chapter 8. Subsections defining spending functions and spending function families are followed by a description of how to use the built-in standard Hwang-Shih-DeCani (Hwang, Shih, and De Cani 1990) and power (Kim and DeMets 1987) spending functions in gsDesign(). Section 8.4 shows how to reset timing of interim analyses using gsDesign().

The final section on spending functions is Section 8.5 which presents details of how spending functions are defined for gsDesign() and other advanced topics that will probably not be needed by many users. The section will be of use to those interested in investigational spending functions and optimized spending function choice. Spending function families by Anderson and Clark (2010) providing additional flexibility to standard one-parameter spending functions are detailed as part of a comprehensive list of built-in spending functions. This is followed by examples of how to derive optimal designs and how to implement new spending functions.

Next comes Chapter 9 on the basic analysis of group sequential trials. This includes computing stagewise and repeated \(p\)-values as well as repeated confidence intervals.

Conditional power and B-values are presented in Chapter 10. These are methods used for evaluating interim trends in a group sequential design, but may also be used to adapt a trial design at an interim analysis using the methods of Müller and Schäfer (2001). The routine gsCP() provides the basis for applying these adaptive methods.

We end with a discussion of Bayesian computations in Chapter 11. The gsDesign package can quite simply be used with decision theoretic methods to derive optimal designs. We also apply Bayesian computations to update the probability of success of a trial based on knowing that a bound has not been crossed, but without knowledge of unblinded treatment results.

Future extensions of the manual could further discuss implementation of information-based designs and additional adaptive design topics.

1.2 Quick start: installation and online help

This brief section is meant to get you up and going. The package is most easily downloaded and installed in R from the CRAN website using:

install.packages("gsDesign")

Since the package is released on a regular basis, there is usually little value in installing the most recent development version from GitHub using:

remotes::install_github("keaven/gsDesign")

After installation, attach the gsDesign package with:

library(gsDesign)

Online help can be obtained by entering the following on the command line:

help(gsDesign)

The many help topics covered there should provide sufficient information for day-to-day use without this document, particularly if you generally prefer not to use a manual. The same information, plus additional vignettes providing longer-form descriptions of usage, is available at https://keaven.github.io/gsDesign/.

1.3 Package testing

While there are no guarantees provided, we note that extensive unit testing has been written to ensure package quality. At the time of this writing, code coverage is at 82%.

1.4 The primary routines in the gsDesign package

As an overview to the R package, a handful of R functions provide the basic computations needed to design and evaluate many group sequential clinical trials:

  1. The gsDesign() function provides sample size and boundaries for a group sequential design based on treatment effect, spending functions for boundary crossing probabilities, and relative timing of each analysis. Standard and user-specified spending functions may be used. In addition to spending function designs, the family of Wang-Tsiatis designs—including O’Brien-Fleming and Pocock designs—are also available.
  2. The gsSurv() function extends the gsDesign() function to design group sequential trials for time-to-event endpoints.
  3. The gsProbability() function computes boundary crossing probabilities and expected sample size of a design for arbitrary user-specified treatment effects, bounds, and interim analysis sample sizes.
  4. The gsCP() function computes the conditional probability of future boundary crossing given a result at an interim analysis. Related functions support other conditional and predictive power calculations that can be used for interim descriptions of trial status or for adaptive design. For instance, the ssrCP() function supports sample size re-estimation directly for 2-stage trials.

We note that a full reference for all functions, organized by topic, is easily available in the reference section of the package documentation site.

The package design strategy should make its tools useful both as an everyday tool for simple group sequential design and as a research tool for a wide variety of group sequential design problems. print() and plot() functions are available for both gsDesign() and gsProbability(). The gsBoundSummary() function provides a formatted table suitable for incorporation in a protocol, summarizing many boundary characteristics, including conditional power, treatment effect approximations, and B-values. It pairs particularly well with the gt package for publishing tables.
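
The workflow just described can be sketched as follows; everything here uses package defaults other than explicitly setting the number of analyses:

```r
library(gsDesign)

# Group sequential design with 3 equally spaced analyses;
# default Type I/II error rates and spending functions apply.
x <- gsDesign(k = 3)

x                  # print() method: textual design summary
plot(x)            # boundary plot
gsBoundSummary(x)  # formatted boundary table for a protocol
```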

The most extensive set of supportive routines enables design and evaluation of binomial endpoint trials and time-to-event endpoint trials. For binomial endpoints, we use the Farrington and Manning (1990) method for sample size estimation in nBinomial() and the corresponding Miettinen and Nurminen (1985) method for testing, confidence intervals, and simulation. We also implement the Lachin and Foulkes (1986) sample size method for survival studies. The examples we present apply these methods to group sequential trial design for binomial and time-to-event endpoints.

Functions are set up to be called directly from the R command line. Default arguments and output for gsDesign() and gsSurv() are included to make initial use simple. Sufficient options are available, however, to make the routine very flexible. Guided use not requiring looking up function arguments is provided by the Shiny interface at https://rinpharma.shinyapps.io/gsdesign/.

Simple examples provide the best overall motivation for group sequential design. This manual does not attempt to comprehensively delineate all that the gsDesign package may accomplish. The intent is to include enough detail to demonstrate a variety of approaches to group sequential design that provide the user with a useful tool and the understanding of ways that it may be applied and extended. Examples that will reappear throughout the manual are introduced here.

1.5 The CAPTURE trial: binary endpoint example

The CAPTURE Investigators (1997) presented the results of a randomized trial in patients with unstable angina who required treatment with angioplasty, an invasive procedure in which a balloon is inflated in one or more coronary arteries to reduce blockages. In the process of opening a coronary artery, the balloon can injure the artery, which may lead to thrombotic complications. Standard treatment at the time the trial was run included heparin and aspirin before and during angioplasty to reduce thrombotic complications, as measured by the primary composite endpoint of myocardial infarction, recurrent urgent coronary intervention, and death over the course of 30 days. This trial compared standard therapy to the same therapy plus abciximab, a platelet inhibitor. While the original primary analysis used a logrank statistic to compare treatment groups, for this presentation we will consider the outcome binary. Approximately 15% of patients in the control group were expected to experience a primary endpoint, but rates from 7.5% to 20% could not be ruled out. There was an expectation that the experimental treatment would reduce incidence of the primary endpoint by at least 1/3, but possibly by as much as 1/2 or 2/3. Since a 1/3 reduction was felt to be conservative, the trial was planned to have 80% power. Given these various possibilities, the desirable sample size for a trial with a fixed design had over a 10-fold range, from 202 to 2942; see the table below.

library(gsDesign)   # nBinomial()
library(knitr)      # kable()
library(kableExtra) # kable_styling() and the %>% pipe

n <- NULL
p <- c(0.075, 0.1, 0.15, 0.2) # candidate control group event rates
for (p1 in p) {
  n <- rbind(
    n,
    ceiling(
      nBinomial(
        p1 = p1,
        p2 = p1 * c(2 / 3, 1 / 2, 1 / 3),
        beta = 0.2
      ) / 2
    ) * 2
  )
}
tb <- data.frame(p * 100, n)
names(tb) <- c(
  "Control rate (%)",
  "1/3 reduction",
  "1/2 reduction",
  "2/3 reduction"
)
tb %>%
  kable(
    caption = paste0(
      "Fixed design sample size possibilities ",
      "for the CAPTURE trial by control group event rate ",
      "and relative treatment effect."
    )
  ) %>%
  kable_styling()
Fixed design sample size possibilities for the CAPTURE trial by control group event rate and relative treatment effect.

Control rate (%)  1/3 reduction  1/2 reduction  2/3 reduction
             7.5           2942           1184            596
            10.0           2158            870            438
            15.0           1372            556            282
            20.0            980            398            202

The third line in the above table can be generated using the call

nBinomial(
  p1 = 0.15,
  p2 = 0.15 * c(2 / 3, 1 / 2, 1 / 3),
  beta = 0.2
)
#> [1] 1371.1937  554.9067  280.1902

and rounding the results up to the nearest even number. The function nBinomial() in the gsDesign package is designed to be a flexible tool for deriving sample size for two-arm binomial trials for both superiority and non-inferiority. Type help(nBinomial) at the command prompt to see background on sample size, simulation, testing, and confidence interval routines for fixed (non-group sequential) binomial trials. These routines will be used with this and other examples throughout the manual.

1.6 A time-to-event endpoint in a cancer trial

As a second example we consider comparing a new treatment to a standard treatment for a cancer trial. Lachin and Foulkes (Lachin and Foulkes 1986) provide a method of computing sample size assuming the following distributions are known:

  • The time to a primary endpoint in each treatment group.
  • The time until dropout in each group.
  • Enrollment over time.

Statistical testing is performed using the logrank test statistic. The methods allow different assumptions in different strata. Enrollment time and total study duration are assumed fixed, and the sample size and number of events required during those periods, respectively, to achieve a desired power and Type I error are computed. Here we apply the simplest form of this method, assuming an exponential distribution in each case with no stratification. The routine can be used to derive the sample size and number of events required. This routine works with failure rates rather than distribution medians or dropout rates per year. An exponential distribution with failure rate \(\lambda\) has cumulative probability of failure at or before time \(t\) of

\[ F(t)=1-e^{-\lambda t}. \]

If the cumulative failure rate is known to be \(p_0\) at time \(t_0\), then the value of \(\lambda\) is

\[ \lambda= -\ln(1-p_0) / t_0. \]
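
For example, a median corresponds to \(p_0 = 0.5\), so a median of 6 months yields

\[ \lambda = -\ln(1-0.5)/6 = \ln(2)/6 \approx 0.1155 \]

with \(t\) measured in months, the control group rate used below.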

We assume for the trial of interest that the primary endpoint is the time from randomization until the first of disease progression or death (progression free survival or PFS). Patients on the standard treatment are assumed to have an exponential failure rate with a median PFS of 6 months, yielding \(\lambda_C = \ln(2)/6 = 0.1155\) with \(t\) measured in months. The trial is to be powered at 90% to detect a reduction in the hazard rate for PFS of 30% (HR = 0.7) in the experimental group compared to standard treatment. This yields an experimental group failure rate of \(0.7 \times \lambda_C = 0.0809\). Patients are assumed to drop out at a rate of 5% per year of follow-up which implies an exponential rate \(\eta = -\ln(0.95)/12 = 0.00427\). Enrollment is assumed to be uniform over 30 months with patients followed for a minimum of 6 months, yielding a total study time of 36 months.

The function nSurv() computes sample size using the Lachin and Foulkes (Lachin and Foulkes 1986) method:

x <- nSurv(
  lambdaC = log(2) / 6,
  alpha = 0.025,
  beta = 0.1,
  eta = -log(0.95) / 12,
  hr = 0.7,
  T = 36,
  minfup = 6
)

This returns a total sample size x$n of 416.26, which is a continuous value. Generally, you will want to round up to the nearest even integer with

n <- ceiling(x$n / 2) * 2
n
#> [1] 418

The target number of events to power the trial is rounded up to the nearest integer:

events <- ceiling(x$d)
events
#> [1] 330

Thus, 418 patients and 330 events are sufficient to obtain 90% power with a 2.5% one-sided Type I error. A major issue with this type of study is that many experimental cancer therapies have toxic side-effects and, at the same time, do not provide benefit. For such drugs, it is desirable to minimize the number of patients exposed to the experimental regimen and further to minimize the duration of exposure for those who are exposed. Thus, it is highly desirable to do an early evaluation of data to stop the trial if no treatment benefit is emerging during the course of the trial. Such an evaluation must be carefully planned to 1) avoid an unplanned impact on the power of the study, and 2) allow a realistic assessment of the emerging treatment effect.
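
As a rough cross-check on the event count, the widely used Schoenfeld approximation for the number of events with 1:1 randomization, one-sided \(\alpha = 0.025\), and 90% power gives

\[ d \approx \frac{4(z_{0.975}+z_{0.90})^2}{(\ln \mathrm{HR})^2} = \frac{4(1.96+1.28)^2}{(\ln 0.7)^2} \approx 330, \]

in close agreement with the Lachin and Foulkes calculation; the two methods use slightly different variance approximations.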

1.7 A non-inferiority study for a new drug

The nBinomial() function presented above was specifically designed to work for noninferiority trial design as well as superiority designs. We consider a new treatment that is to be compared to a standard that has a successful treatment rate of 67.7%. An absolute margin of 7% is considered an acceptable noninferiority margin. The trial is to be powered at 90% with 2.5% Type I error (one-sided) using methods presented by Farrington and Manning (Farrington and Manning 1990). The function call nBinomial(p1 = 0.677, p2 = 0.677, delta0 = 0.07) shows that a fixed sample size of 1874 is adequate for this purpose. There are some concerns about these assumptions, however. First, the control group event rate may be incorrect. As the following code using event rates from 0.55 to 0.75 demonstrates, the required sample size may range from 1600 to over 2100.

p <- c(0.55, 0.6, 0.65, 0.7, 0.75)
ceiling(nBinomial(p1 = p, p2 = p, delta0 = 0.07))
#> [1] 2117 2054 1948 1800 1611

More importantly, if the experimental group therapy does not work quite as well as control, there is a considerable dropoff in power to demonstrate non-inferiority. Thus, there may be value in planning an interim futility analysis to stop the trial if the success rate with experimental therapy is trending substantially worse than with control.
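
To illustrate this sensitivity, one might rerun the sample size calculation with the experimental success rate set modestly below control; the 2% dropoff used here is illustrative, not from the original design:

```r
library(gsDesign)

# Same 7% non-inferiority margin as above, but assume the
# experimental arm succeeds 2% less often than the 67.7% control:
ceiling(nBinomial(p1 = 0.677, p2 = 0.657, delta0 = 0.07))
```

The required sample size increases substantially over the 1874 computed when the two rates are assumed equal.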

1.8 A diabetes outcomes trial example

Current regulatory standards for chronic therapies of diabetes require ensuring that a new drug in a treatment class does not have substantially inferior cardiovascular outcomes compared to an approved treatment or treatments (Center for Drug Evaluation and Research 2008). While we do not claim the designs for this example presented here would be acceptable to regulators, the specifics of the guidance provide a nice background for the use of the gsDesign package to derive group sequential designs that fit a given problem. The initial reason for presenting this example is that there is likely to be a genuine public health interest in showing any of the following for the two treatment arms compared:

  • The two treatment arms are similar (equivalence).
  • One arm is similar to or better than the other (non-inferiority).
  • Either arm is superior to the other (2-sided testing of no difference).

The example is somewhat simplified here. We assume patients with diabetes have a risk of a cardiovascular event of about 1.5% per year and a 15% dropout rate per year. If each arm has the same cardiovascular risk as the other, we would like to have 90% power to rule out a hazard ratio of 1.3 in either direction. Type I error if one arm has an elevated hazard ratio of 1.3 compared to the other should be 2.5% if one-sided. The trial is to enroll in 2 years and have a minimum follow-up of 4 years, leading to a total study time of 6 years. The sample size routine nSurv() is set up to handle this by making the null hypothesis a hazard ratio of 1.3 (hr0 = 1.3 below) and the alternate hypothesis a hazard ratio of 1 (hr = 1 below) to reflect equivalence. Our assumed rate for both groups under the alternate hypothesis, \(\lambda = -\ln(1 - 0.015)\), is what we want to drive the sample size.

x <- nSurv(
  lambdaC = -log(1 - 0.015),
  hr0 = 1.3,
  hr = 1,
  eta = -log(0.85),
  alpha = 0.025,
  beta = 0.1,
  T = 6,
  minfup = 4
)
n <- ceiling(x$n / 2) * 2
d <- ceiling(x$d)
cat(paste("Sample size:", n, "Events:", d, "\n"))
#> Sample size: 12362 Events: 617

We note that the power for this sample size has been verified by simulation. This can be done with the simtrial package as follows; the code is not executed here. This verification uses the Schoenfeld approximation for the variance, since the simtrial::simfix() function was not set up to save Cox model standard errors. One thousand simulations estimated power at 90.7% when the planned minimum of 4 years of follow-up was obtained.

library(simtrial)
library(dplyr)

xx <- simfix(
  nsim = 1000,
  sampleSize = 12362,
  targetEvents = 617,
  totalDuration = 6,
  enrollRates = tibble::tibble(
    duration = 2,
    rate = 12362 / 2
  ),
  failRates = tibble::tibble(
    Stratum = "All",
    duration = 6,
    failRate = -log(1 - 0.015),
    hr = 1,
    dropoutRate = -log(0.85)
  ),
  timingType = 3
)
xx %>%
  mutate(se = sqrt(4 / Events)) %>%
  summarize(
    Events = mean(Events),
    Duration = mean(Duration),
    Power = mean(lnhr + qnorm(0.975) * se < log(1.3))
  )
#>   Events Duration Power
#>   617.53 6.000395 0.907

Generally, a confidence interval for the hazard ratio of experimental to control is used to express treatment differences at the end of this type of trial. A confidence interval will rule out the specified treatment differences in a manner consistent with testing if, for example, the same proportional hazards regression model is used for both the Wald test and the corresponding confidence interval. The terminology of “control” and “experimental” is generally inappropriate when both therapies are approved. However, for this example it is generally the case that a new therapy is being compared to an established one, and there may be some asymmetry when considering the direction of inference. Various questions arise concerning early stopping in a trial of this nature:

  • While it would be desirable to stop early if the new therapy has a significantly lower cardiovascular event rate, a minimum amount of follow-up may be valuable to ensure longer-term safety and general acceptance of the results.
  • If a trend emerges in favor of the experimental treatment, it will likely be possible to demonstrate non-inferiority prior to being able to demonstrate superiority. If the trial remains blinded until superiority is demonstrated or until the final planned analysis, full acceptance of a useful new therapy may be delayed. As noted above, the value of long-term safety data may be more important than an early stop based on a “short-term” endpoint.
  • From a sponsor’s standpoint, it may be desirable to stop the trial if it becomes futile to demonstrate the experimental therapy is non-inferior to control; that is, there is an interim trend favoring control. However, if both treatment groups represent marketed products then from a public health standpoint it may be desirable to continue the trial to demonstrate a statistically significant advantage for the control treatment.