December 6, 2021



  • 3:30: Introduction and background theory (30 minutes)

  • 4:00: Proportional hazards applications with Shiny app (25 minutes)

  • 4:25: Intro to non-proportional hazards (NPH; 5 minutes)

  • 4:30: Software and piecewise model (15 minutes)

  • 4:45: Average hazard ratio (AHR; 20 minutes)

  • 5:05: Break (10 minutes)

  • 5:15: NPH design with logrank test (25 minutes)

  • 5:40: Weighted logrank and combination tests (40 minutes)

  • 6:20: Summary and questions (10 minutes)


  • All opinions expressed are those of the presenters and not Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA.

  • Some slides need to be scrolled down to see the full content.


Course resources

  • Book at
    • Instructions there to install software, download repository at
    • Directories there we will use:
      • data/: contains design files for examples; also simulation results
      • vignettes/: reports produced by Shiny app to summarize designs
      • simulation/: R code and simulation data for the last part of course


Group sequential design

  • Analyze a trial repeatedly at planned intervals
  • Group of data added at each analysis
  • Group sequential design derives boundaries and sample size to
    • Control Type I error
    • Ensure power
    • Stop early for futility or efficacy finding
  • Takes advantage of the correlation among group sequential test statistics

Independent increments process - group sequential design

  • Asymptotic normal assumption works well for most trials
  • Scharfstein et al (1997) demonstrated \(Z = (Z_1, \dots, Z_K)\) is asymptotically normal with independent increments.
  • We extend the canonical distribution notation of Jennison and Turnbull (2000):
    • \(Z_k\) is the test statistic for treatment effect at analysis \(k=1,\ldots,K\)
    • \((Z_1,\ldots,Z_K)\) is multivariate normal
    • \(E(Z_k) = \theta_k \sqrt{I_k}, k=1,\ldots,K\)
    • \(\hbox{Cov}(Z_{k_1}, Z_{k_2})=\sqrt{I_{k_1}/I_{k_2}},\) \(1\le k_1\le k_2\le K\)
  • \(I_k\) is the Fisher information for \(\theta_k, k=1,\ldots, K\).
  • For most of this training \(\theta_k=-E(\log(HR_k))\)
  • Simulation can be used to examine accuracy of normal approximation
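As suggested in the last bullet, the canonical distribution is easy to examine by simulation. The sketch below (Python rather than the course's R tools; the information levels and \(\theta\) are assumed purely for illustration) builds \(Z_k = S_k/\sqrt{I_k}\) from independent normal score increments and checks the mean \(\theta\sqrt{I_k}\) and correlation \(\sqrt{I_1/I_2}\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical statistical information at K = 2 analyses; theta = -E(log HR).
info = np.array([50.0, 100.0])   # I_1, I_2 (assumed for illustration)
theta = 0.3
n_sim = 200_000

# Score statistics accumulate as independent normal increments:
# increment k has mean theta * (I_k - I_{k-1}) and variance I_k - I_{k-1}.
d_info = np.diff(info, prepend=0.0)
inc = rng.normal(loc=theta * d_info, scale=np.sqrt(d_info), size=(n_sim, 2))
score = np.cumsum(inc, axis=1)
z = score / np.sqrt(info)        # Z_k = S_k / sqrt(I_k)

print(z.mean(axis=0))            # ~ theta * sqrt(I_k) = [2.12, 3.00]
print(np.corrcoef(z.T)[0, 1])    # ~ sqrt(I_1 / I_2) = 0.707
```

The independent-increments structure is what makes the cumulative sum construction valid: each new group of data contributes an increment independent of the past.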

Assumptions for time-to-event endpoints

  • Lachin and Foulkes (1986) piecewise model used to approximate arbitrary enrollment, survival, dropout patterns
    • Fixed enrollment and follow-up period
      • increase enrollment rates to obtain power
    • Proportional hazards: \(\theta_k=\theta\) (constant)
    • \(I_k\) approximately proportional to event count (Schoenfeld 1981)
  • We generalize to average hazard ratio (AHR; Mukhopadhyay et al (2020)) for non-proportional hazards (NPH) tested with logrank (\(\theta_k\) varies with \(k\))
  • Generalized to cover weighted logrank as in Yung and Liu (2020) in final section today
    • Fixed design only for RMST
    • Combination tests with multiple weighted logrank tests have more complex correlation structure (Karrison and others (2016))

Testing bounds

  • Bounds \(-\infty \le a_k \le b_k \le \infty\) for \(k=1,\dots,K\)
  • Null hypothesis \(H_0:\) \(\theta_k=0, k=1,\ldots,K\)
  • Alternate hypothesis \(H_1:\) \(\theta_k > 0\) for some \(k=1,\ldots,K\)
  • Actions at analysis \(k=1,\ldots,K\):
    • Reject \(H_0\) at analysis \(k\) if \(Z_k\ge b_k\)
    • Do not reject \(H_0\) and consider stopping if \(Z_k<a_k\), \(k<K\)
    • Continue trial if \(a_k\le Z_k< b_k\), \(k<K\)
  • Bounds are generally considered advisory for stopping trial, not binding

Boundary crossing probabilities

  • Upper boundary crossing probabilities
    • \(u_k(\theta) = \text{Pr}_\theta(\{Z_k \ge b_k\} \cap_{j=1}^{k-1} \{a_j \le Z_j < b_j\})\)
  • Lower boundary crossing probabilities
    • \(l_k(\theta) = \text{Pr}_\theta (\{Z_k < a_k\} \cap_{j=1}^{k-1} \{a_j \le Z_j < b_j\})\)
  • Null hypothesis: 1-sided Type I error
    • \(a_k = -\infty\) for all \(k\) generally used for Type I error
      • Non-binding lower bound
    • \(\alpha = \sum_{k=1}^{K} u_k(0) = \sum_{k=1}^{K} \text{Pr}(\{Z_k \ge b_k\} \cap_{j=1}^{k-1} \{a_j \le Z_j < b_j\} \mid H_0)\)
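The decomposition \(\alpha=\sum_{k=1}^K u_k(0)\) can be checked by simulating standard normal increments under \(H_0\). A Monte Carlo sketch in Python (illustrative only, not the course's R tools); the bounds are O'Brien-Fleming-type Z-score bounds for four equally spaced looks, assumed here for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

K = 4
b = np.array([4.05, 2.86, 2.34, 2.02])   # illustrative O'Brien-Fleming-type bounds
n_sim = 400_000

# Under H0 with equal information increments, Z_k = S_k / sqrt(k),
# where S_k is a cumulative sum of iid N(0, 1) increments.
z = np.cumsum(rng.normal(size=(n_sim, K)), axis=1) / np.sqrt(np.arange(1, K + 1))

# u_k(0): probability of first crossing the upper bound at analysis k
# (a_k = -infinity, i.e., non-binding lower bound).
crossed = z >= b
first = np.where(crossed.any(axis=1), crossed.argmax(axis=1), K)
u = np.array([(first == k).mean() for k in range(K)])
print(u.round(4), u.sum().round(4))      # total ~ 0.025 one-sided
```

The sum of the \(u_k(0)\) recovers the one-sided Type I error; most of it is spent at the later analyses because the early bounds are so large.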

Boundary crossing probabilities (cont.)

  • Alternate hypothesis: Type II error \(\beta= 1 - \hbox{power}\)

    Bound restrictions: \[-\infty\le a_k<b_k,\ k=1,\ldots,K-1,\qquad a_K\le b_K.\] If \(a_K = b_K\), the total Type II error is \[\beta = \sum_{k=1}^{K} l_k(\theta) = \sum_{k=1}^{K} \text{Pr}(\{Z_k < a_k\} \cap_{j=1}^{k-1} \{a_j \le Z_j < b_j\}\mid H_1)\]

Symmetric bounds

Test each treatment for superiority vs the other

Usually not of interest in pharmaceutical industry

Futility bounds

Give up if the experimental arm is not trending in favor vs. control?

  • Ethics of continuing trial if unlikely to show superiority?
  • Excess risk of crossing lower bound too soon?

Asymmetric 2-sided testing

Give up if experimental arm trending worse than control

Sample size

  • Given the boundary computation, the sample size derivation solves for the enrollment rate
  • Fixing relative enrollment rates, dropout rates, and trial duration enables the Lachin and Foulkes (1986) and our AHR methods
  • Under proportional hazards, you can also fix max enrollment rate and solve for trial duration (Kim and Tsiatis 1990)
    • You can easily create scenarios here for which there is no solution
    • Error message: "An error has occurred. Check your logs or contact the app author for clarification."
    • Need to adjust parameters until a solution is found or use Lachin and Foulkes (1986)

Boundary types

  • Approaches to calculate decision boundary:

    • The error spending approach: specify boundary crossing probabilities at each analysis. This is most commonly done with the error spending function approach (Lan and DeMets 1983).

    • The boundary family approach: specify how big boundary values should be relative to each other and adjust these relative values by a constant multiple to control overall error rates. The commonly applied boundary family include:

      • Haybittle-Peto boundary (Haybittle (1971), Peto et al (1977))
      • Wang-Tsiatis boundary (Wang and Tsiatis 1987)
        • Pocock boundary (Pocock 1977)
        • O’Brien and Fleming boundary (O’Brien and Fleming 1979)
      • Slud and Wei (1982) and Fleming et al (1984) set fixed IA boundary crossing probabilities
        • Hybrid of boundary family/spending approach

Boundary families

Boundary family - Haybittle-Peto boundary

  • Main idea:

    • Interim Z-score boundary: 3
    • Final Z-score boundary: 1.96 (slight \(\alpha\) inflation)

Boundary family - Haybittle-Peto (cont’d)

  • Modified Haybittle-Peto procedure 1:

    Bonferroni adjustment:

    • For the first \(K-1\) analyses, the nominal significance level is set at 0.001;
    • For the final analysis, the nominal significance level is set at \(0.05 - 0.001 \times (K-1)\);
      • More generous final bound if you adjust for test correlations
  • Advantages:

    • Avoid type I error inflation
    • Does not require equally spaced analyses
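A minimal sketch of the Bonferroni accounting in Python (\(K=4\) assumed purely for illustration):

```python
# Bonferroni accounting for the modified Haybittle-Peto procedure:
# overall alpha = 0.05 split across K analyses (K = 4 assumed for illustration).
alpha, K = 0.05, 4
interim_p = 0.001                        # nominal level at each of the first K - 1 analyses
final_p = alpha - interim_p * (K - 1)    # nominal level at the final analysis
print(interim_p, round(final_p, 3))      # 0.001 0.047
```

Since the nominal levels sum to exactly \(\alpha\), Bonferroni guarantees no Type I error inflation regardless of the analysis timing.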

Boundary family - Wang-Tsiatis bounds

  • Definition:

    For 2-sided testing, Wang and Tsiatis (1987) defined the boundary function for the \(k\)-th look as \[ \Gamma(\alpha, K, \Delta) k^{\Delta - 0.5}, \] where \(\Gamma(\alpha, K, \Delta)\) is a constant chosen so that the level of significance is equal to \(\alpha\).

  • Two special cases:

    • \(\Delta = 0.5\): Pocock bounds
    • \(\Delta = 0\): O’Brien-Fleming bounds.
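The two special cases are easy to illustrate numerically. A Python sketch using the tabled constants for \(K = 4\) equally spaced looks at two-sided \(\alpha = 0.05\) (Pocock constant 2.361; the O'Brien-Fleming constant is \(2.024\sqrt{4}=4.048\), so the final bound is 2.024):

```python
import numpy as np

# Wang-Tsiatis boundary: b_k = Gamma * k ** (Delta - 0.5), k = 1, ..., K.
# Gamma values are the tabled constants for K = 4, two-sided alpha = 0.05
# (Pocock: 2.361; O'Brien-Fleming: 2.024 * sqrt(4) = 4.048).
def wang_tsiatis(gamma, delta, K):
    k = np.arange(1, K + 1)
    return gamma * k ** (delta - 0.5)

pocock = wang_tsiatis(2.361, 0.5, 4)   # flat bound: 2.361 at every look
obf = wang_tsiatis(4.048, 0.0, 4)      # decreasing: ~[4.05, 2.86, 2.34, 2.02]
print(pocock)
print(obf.round(2))
```

With \(\Delta=0.5\) the exponent vanishes and the bound is flat; with \(\Delta=0\) the bound decays as \(k^{-1/2}\), i.e., a flat boundary on the B-value scale.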

Wang-Tsiatis example - Pocock boundary

For 2-sided testing, the Pocock procedure rejects at the \(k\)-th of \(K\) equally spaced looks if \[|Z_k| > c_P(K),\] where \(c_P(K)\) is a fixed constant given \(K\) such that \(\text{Pr}(\cup_{k=1}^{K} \{|Z_k| > c_P(K)\}) = \alpha\).

Wang-Tsiatis example - Pocock boundary (cont’d)

  • Example:
Total number of looks (\(K\)) \(\alpha = 0.01\) \(\alpha = 0.05\) \(\alpha = 0.1\)
1 2.576 1.960 1.645
2 2.772 2.178 1.875
4 2.939 2.361 2.067
8 3.078 2.512 2.225
\(\infty\) \(\infty\) \(\infty\) \(\infty\)

We will reject \(H_0\) if \(|Z(k/4)| > 2.361\) for \(k = 1,2,3,4\), with \(k=4\) the final analysis.

  • Weakness:

    • Overly aggressive interim bounds

    • High price paid at the end of the trial:

      • With \(\alpha = 0.05\) and \(K=4\) analyses, the absolute value of the Z-score must exceed 2.361 to be declared significant at any analysis, including the final one (normally 1.96).
      • \(Z=2.361\) translates to a two-tailed nominal p-value of \(2(1 − \Phi(2.361)) = 0.018\).
    • \(c_P(K) \to +\infty\) as \(K \to + \infty\).

    • Requires equally spaced looks.
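Both the Pocock constant and the nominal p-value above can be checked numerically. A Monte Carlo sketch in Python (illustrative only, not the course's R tools):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

K, c = 4, 2.361                    # tabled Pocock constant for K = 4, alpha = 0.05
n_sim = 400_000

# Equal information: Z_k is a scaled cumulative sum of iid N(0, 1) increments.
z = np.cumsum(rng.normal(size=(n_sim, K)), axis=1) / np.sqrt(np.arange(1, K + 1))
reject = (np.abs(z) >= c).any(axis=1)

print(reject.mean().round(3))      # ~ 0.05: overall two-sided Type I error
print((2 * norm.sf(c)).round(3))   # 0.018: nominal two-tailed p-value per look
```

The simulation confirms that the flat bound 2.361 controls the overall two-sided error at 0.05, while each individual look operates at the much stricter nominal level 0.018.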

Wang-Tsiatis example - O’Brien-Fleming boundary

  • Early stage: very conservative \(\Rightarrow\) large boundary at the beginning;
  • Final stage: nominal value close to the overall value of the design \(\Rightarrow \approx 1.96\).
  • Regulators generally like these bounds

Wang-Tsiatis example - O’Brien-Fleming boundary (cont’d)

Total number of looks (\(K\)) \(\alpha = 0.01\) \(\alpha = 0.05\) \(\alpha = 0.1\)
1 2.576 1.960 1.645
2 2.580 1.977 1.678
4 2.609 2.024 1.733
8 2.648 2.072 1.786
16 2.684 2.114 1.830
\(\infty\) 2.807 2.241 1.960


  • The tabled value 2.024 (for \(K=4\), \(\alpha=0.05\)) is the flat B-value boundary.
  • The flat B-value boundary can be easily transformed into decreasing Z-score boundary by \(Z(t) = B(t)/\sqrt{t}\):
    • \(2.024/\sqrt{1/4} = 4.05\)
    • \(2.024/\sqrt{2/4} = 2.86\)
    • \(2.024/\sqrt{3/4} = 2.34\)
    • \(2.024/\sqrt{4/4} = 2.02\)

Boundary families - summary

Boundary families - summary (cont’d)

  • Haybittle-Peto
    • Boundary: 3 at the first \(K-1\) interim analyses and 1.96 at the final analysis
    • Advantages: simple to implement
  • Pocock
    • Boundary: a constant Z-score boundary
    • Disadvantages: (1) requires the same level of evidence for early and late looks at the data, so it pays a larger price at the final analysis; (2) requires equally spaced looks
  • O’Brien-Fleming
    • Boundary: constant B-value boundary, i.e., a steeply decreasing Z-score boundary
    • Advantages: pays a smaller price at the final analysis
    • Disadvantages: too conservative in the early stages?

Spending function boundaries

Spending function

Lan-DeMets spending functions to approximate boundary families

Hwang-Shih-DeCani (gamma) spending functions

What is spending time?

  • Information fraction, Lan and DeMets (1983)
    • Time-to-event: fraction of planned final events in analysis
    • Normal or binomial: fraction of planned final sample size in analysis
    • Usually expected by regulators
  • Calendar fraction, Lan and DeMets (1989)
    • Fraction of trial planned calendar duration at analysis
  • Minimum of planned and actual information fraction
    • Probably not advised yet
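Both spending function families named above have simple closed forms. A Python sketch with one-sided \(\alpha = 0.025\) (the O'Brien-Fleming-like form follows Lan and DeMets (1983); the gamma family follows Hwang et al (1990)):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.025   # one-sided Type I error

# Lan-DeMets spending function approximating an O'Brien-Fleming bound
def obf_spend(t, alpha=alpha):
    return 2 * norm.sf(norm.ppf(1 - alpha / 2) / np.sqrt(t))

# Hwang-Shih-DeCani (gamma family) spending function
def hsd_spend(t, gamma=-4, alpha=alpha):
    return alpha * (1 - np.exp(-gamma * t)) / (1 - np.exp(-gamma))

t = np.array([0.25, 0.5, 0.75, 1.0])   # spending time (e.g., information fraction)
print(obf_spend(t).round(5))   # cumulative alpha spent at each analysis
print(hsd_spend(t).round(5))   # gamma = -4 roughly mimics O'Brien-Fleming
```

Each function is increasing in spending time \(t\) and reaches the full \(\alpha\) at \(t=1\); the boundary at each analysis is derived so that cumulative boundary crossing probability matches the spending function.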

General methods and proportional hazards

Proportional hazards approach

Lachin and Foulkes (1986) method

  • Sample size and power derivation
  • Time-to-event endpoint
  • 2-arm trial
  • Logrank (or Cox model coefficient) to test treatment effect
  • Constant treatment effect over time (proportional hazards or PH)
  • Number of events drives power, regardless of study duration

Shiny app for proportional hazards

Metastatic oncology example

  • KEYNOTE 189 trial (Gandhi et al (2018))
    • Endpoints: progression free survival (PFS) and overall survival (OS) in patients
    • Indication: previously untreated metastatic non-small cell lung cancer (NSCLC)
    • Treatments: chemotherapy +/- pembrolizumab
    • Randomized 2:1 to an add-on of pembrolizumab or placebo
    • Type I error (1-sided): 0.025 familywise error rate (FWER) split between PFS (\(\alpha=0.0095\)) and OS (\(\alpha=0.0155\))
    • The graphical method for \(\alpha\)-control in group sequential design of Maurer and Bretz (2013) was used

Key aspects of the design as documented in the protocol accompanying Gandhi et al (2018).

Metastatic oncology: OS design approximation (continued)

  • \(\alpha=0.0155\)
  • Control group survival: exponential median=13 months
  • Exponential dropout rate of 0.133% per month
  • 90% power to detect a hazard ratio (HR) of 0.70025
  • 2:1 randomization, experimental:control
  • Enrollment over 1 year
  • While not specified in the protocol, we have further assumed:
    • Trial duration: 35 months.
    • Observed deaths of 240, 332 and 416 at the 3 planned analyses
    • A one-sided bound using the Lan and DeMets (1983) spending function approximating an O’Brien-Fleming bound.

Cardiovascular outcomes reduction

  • AFCAPS/TEXCAPS trial: use of lovastatin to reduce cardiovascular outcomes
  • Design described in Downs et al (1997)
  • Results reported in Downs et al (1998)
  • Reproduction here not exact mainly due to choice of Lachin and Foulkes (1986)
    • Little difference between methods
    • Efficacy bounds will be exactly as proposed

Cardiovascular outcomes: key parameters

  • 5 years minimum follow-up of all patients enrolled
  • Interim analyses after 0.375 and 0.75 of final planned event count has accrued
  • 2-sided bound using the Hwang et al (1990) spending function with parameter \(\gamma = -4\) to approximate an O’Brien-Fleming bound
  • We arbitrarily set the following parameters to match design:
    • Power of 90% for a hazard ratio of 0.6921846; this is slightly different from the 0.7 hazard ratio suggested in Downs et al (1997)
    • Enrollment duration of 1/2 year with constant enrollment.
    • An exponential failure rate of 0.01131 per year which is nearly identical to the annual failure rate of 0.01125.
    • An exponential dropout rate of 0.004 per year which is nearly identical to the annual dropout rate of 0.00399.

Cardiovascular outcomes non-inferiority: EXAMINE trial

  • Indication: treatment of diabetes
  • Treatments: DPP4 inhibitor alogliptin compared to placebo
  • Primary endpoint: major adverse cardiovascular events (MACE)
  • Objective: establish non-inferiority
  • Results in White et al (2013)
  • Design in White et al (2011)
  • We approximate the design and primary analysis evaluation here.
  • Software and design assumptions not completely clear; not exact design reproduction

EXAMINE trial: Key assumptions

  • Primary analysis: stratified Cox model for MACE
    • 1-sided repeated confidence interval for HR at each analysis.
    • Analysis to rule out HR > 1.3, but also tests superiority.
    • Analyses planned after 550, 600, 650 MACE events.
    • O’Brien-Fleming-like spending function Lan and DeMets (1983).
    • 2.5% Type I error.
    • Approximately 91% power.
    • 3.5% annual MACE event rate.
    • Uniform enrollment over 2 years.
    • 4.75 years trial duration.
    • 1% annual loss-to-follow-up rate.
    • Software: EAST 5 (Cytel).

Cure model

Poisson mixture cure model we consider:

\[S(t)= \exp(-\theta (1 - \exp(-\lambda t))).\]

Note that:

  • \(1-\exp(-\lambda t)\) is the CDF for an exponential distribution
    • can be replaced by an arbitrary continuous CDF.
  • As \(t\rightarrow \infty\), \(S(t)\rightarrow\exp(-\theta)\) (cure rate).
  • Model useful when historical data suggests plateau in survival.
  • PH model: experimental survival \(S_E(t)=S_C(t)^{HR}\).
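A small Python sketch of the cure model, with \(\theta\), \(\lambda\), and HR assumed purely for illustration:

```python
import numpy as np

theta, lam, hr = 1.2, 0.1, 0.7     # hypothetical parameters

def s_control(t):
    # Poisson mixture cure model: S(t) = exp(-theta * (1 - exp(-lambda * t)))
    return np.exp(-theta * (1 - np.exp(-lam * t)))

def s_experimental(t):
    # Proportional hazards: S_E(t) = S_C(t) ** HR
    return s_control(t) ** hr

cure_rate = np.exp(-theta)               # limit of S(t) as t -> infinity
print(round(cure_rate, 4))               # 0.3012
print(round(float(s_control(1e6)), 4))   # plateau: matches the cure rate
```

The plateau at \(\exp(-\theta)\) is what slows event accumulation late in the trial, which is why the cure model motivates the calendar spending discussion below.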

Survival model assumptions

More details in book.

Cure model: Expected event accumulation over time

  • Event accumulation over time can be very sensitive to many trial design assumptions.
  • Generally, we are trying to mimic a slowing of event accumulation over time.
  • Assume 18 month enrollment with 6-month ramp-up.

Expected event accrual over time

Potential advantages/disadvantages of calendar spending

  • Quite possible that event rate design assumptions are incorrect
  • Good to ensure duration of follow-up is adequate to evaluate tail behavior/plateau in survival
  • Ensures adequate follow-up to estimate relevant parts of survival curve
  • Limits trial to relevant clinical and practical duration
  • May be underpowered if events accrue slowly
  • Probably less useful for high-risk endpoints (e.g., metastatic cancer)
  • Regulatory resistance?


Average hazard ratio approach

Delayed effect

  • Recurrent head and neck squamous cell carcinoma
  • Pembrolizumab vs Standard of Care (SOC)
  • 1:1 randomization, N=495
  • Primary endpoint: OS
  • 90% power for HR=0.7, 1-sided \(\alpha=0.025\)
  • Design (Am 11; protocol) and results
    • Proceeded to final analysis (388 deaths; 340 planned)
      • 1-sided nominal p-value for OS: 0.0161
  • 2 interim analyses planned (144 and 216 deaths; IF = 0.42, 0.64)
    • IF = information fraction
    • Efficacy: Hwang-Shih-DeCani (HSD) spending, \(\gamma=-4\)
    • Futility: non-binding, \(\beta\)-spending, HSD, \(\gamma = -16\)
KEYNOTE 040: Overall Survival