December 6, 2021

Welcome

Outline

  • 3:30: Introduction and background theory (30 minutes)

  • 4:00: Proportional hazards applications with Shiny app (25 minutes)

  • 4:25: Intro to non-proportional hazards (NPH; 5 minutes)

  • 4:30: Software and piecewise model (15 min)

  • 4:45: Average hazard ratio (AHR; 20 minutes)

  • 5:05: Break (10 minutes)

  • 5:15: NPH design with logrank test (25 minutes)

  • 5:40: Weighted logrank and combination tests (40 minutes)

  • 6:20: Summary and questions (10 minutes)

Disclaimer

  • All opinions expressed are those of the presenters and not Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA.

  • Some slides need to be scrolled down to see the full content.

Software

Course resources

  • Book at https://keaven.github.io/gsd-deming/
    • Instructions there to install software, download repository at https://github.com/keaven/gsd-deming
    • Directories there we will use:
      • data/: contains design files for examples; also simulation results
      • vignettes/: reports produced by Shiny app to summarize designs
      • simulation/: R code and simulation data for the last part of course

Background

Group sequential design

  • Analyze a trial repeatedly at planned intervals
  • Group of data added at each analysis
  • Group sequential design derives boundaries and sample size to
    • Control Type I error
    • Ensure power
    • Stop early for futility or efficacy finding
  • Takes advantage of the correlation among the group sequential tests

Independent increments process - group sequential design

  • Asymptotic normal assumption works well for most trials
  • Scharfstein et al (1997) demonstrated \(Z = (Z_1, \dots, Z_K)\) is asymptotically normal with independent increments.
  • We extend the canonical distribution notation of Jennison and Turnbull (2000):
    • \(Z_k\) is the test statistic for treatment effect at analysis \(k=1,\ldots,K\)
    • \((Z_1,\ldots,Z_K)\) is multivariate normal
    • \(E(Z_k) = \theta_k \sqrt{I_k}, k=1,\ldots,K\)
    • \(\hbox{Cov}(Z_{k_1}, Z_{k_2})=\sqrt{I_{k_1}/I_{k_2}},\) \(1\le k_1\le k_2\le K\)
  • \(I_k\) is the Fisher information for \(\theta_k, k=1,\ldots, K\).
  • For most of this training \(\theta_k=-E(\log(HR_k))\)
  • Simulation can be used to examine the accuracy of the normal approximation; a small code sketch of this canonical form follows
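The sketch below builds the mean vector and correlation matrix of \((Z_1,\ldots,Z_K)\) under the canonical form above; the information levels and effect size are hypothetical values chosen purely for illustration.

# Canonical form sketch: hypothetical information levels at K = 3 analyses
info <- c(25, 50, 75) # assumed Fisher information I_1, I_2, I_3
theta <- 0.35         # assumed effect size, e.g., -log(HR)

# E(Z_k) = theta * sqrt(I_k)
meanZ <- theta * sqrt(info)
# Corr(Z_j, Z_k) = sqrt(I_j / I_k) for j <= k
corrZ <- outer(info, info, function(i, j) sqrt(pmin(i, j) / pmax(i, j)))
round(meanZ, 2)
round(corrZ, 3)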

Assumptions for time-to-event endpoints

  • Lachin and Foulkes (1986) piecewise model used to approximate arbitrary enrollment, survival, dropout patterns
    • Fixed enrollment and follow-up period
      • increase enrollment rates to obtain power
    • Proportional hazards: \(\theta_k=\theta\) (constant)
    • \(I_k\) is approximately proportional to the event count (Schoenfeld (1981))
  • We generalize to average hazard ratio (AHR; Mukhopadhyay et al (2020)) for non-proportional hazards (NPH) tested with logrank (\(\theta_k\) varies with \(k\))
  • Generalized to cover weighted logrank as in Yung and Liu (2020) in final section today
    • Fixed design only for RMST
    • Combination tests with multiple weighted logrank tests have more complex correlation structure (Karrison and others (2016))

Testing bounds

  • Bounds \(-\infty \le a_k \le b_k \le \infty\) for \(k=1,\dots,K\)
  • Null hypothesis \(H_0:\) \(\theta_k=0, k=1,\ldots,K\)
  • Alternate hypothesis \(H_1:\) \(\theta_k > 0\) for some \(k=1,\ldots,K\)
  • Actions at analysis \(k=1,\ldots,K\):
    • Reject \(H_0\) at analysis \(k\) if \(Z_k\ge b_k\)
    • Do not reject \(H_0\) and consider stopping if \(Z_k<a_k\), \(k<K\)
    • Continue trial if \(a_k\le Z_k< b_k\), \(k<K\)
  • Bounds are generally considered advisory for stopping trial, not binding

Boundary crossing probabilities

  • Upper boundary crossing probabilities
    • \(u_k(\theta) = \text{Pr}_\theta(\{Z_k \ge b_k\} \cap_{j=1}^{k-1} \{a_j \le Z_j < b_j\})\)
  • Lower boundary crossing probabilities
    • \(l_k(\theta) = \text{Pr}_\theta (\{Z_k < a_k\} \cap_{j=1}^{k-1} \{a_j \le Z_j < b_j\})\)
  • Null hypothesis: 1-sided Type I error
    • \(a_k = -\infty\) for all \(k\) generally used for Type I error
      • Non-binding lower bound
    • \(\alpha = \sum_{k=1}^{K} u_k(0) = \sum_{k=1}^{K} \text{Pr}(\{Z_k \ge b_k\} \cap_{j=1}^{k-1} \{a_j \le Z_j < b_j\} \mid H_0)\)

Boundary crossing probabilities (cont.)

  • Alternate hypothesis: Type II error \(\beta= 1 - \hbox{power}\)

    \[-\infty\le a_k<b_k,\quad k=1,\ldots,K-1, \qquad a_K\le b_K\]

    If \(a_K = b_K\), then the total Type II error is \[\beta = \sum_{k=1}^{K} l_k = \sum_{k=1}^{K} \text{Pr}(\{Z_k < a_k\} \cap_{j=1}^{k-1} \{a_j \le Z_j < b_j\}\mid H_1)\]
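A numeric sketch of these boundary crossing probabilities using gsDesign::gsProbability(); the bounds, information levels, and \(\theta\) values below are hypothetical and chosen for illustration only.

library(gsDesign)

# Hypothetical 3-analysis design: assumed efficacy bounds b_k, futility bounds a_k
# with a_K = b_K, and assumed statistical information n.I at each analysis
b <- c(3.47, 2.45, 2.00)
a <- c(0, 1, 2.00)
n.I <- c(30, 60, 90)

x <- gsProbability(k = 3, theta = c(0, 0.35), n.I = n.I, a = a, b = b)
# u_k(theta): upper boundary crossing probabilities (rows = analyses, columns = theta)
x$upper$prob
# l_k(theta): lower boundary crossing probabilities
x$lower$prob
# Since a_K = b_K here, Type II error = sum of l_k under the alternative (2nd column)
sum(x$lower$prob[, 2])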

Symmetric bounds

Test each treatment for superiority vs the other

Usually not of interest in pharmaceutical industry

Futility bounds

Give up if the experimental arm is not trending favorably versus control?

  • Ethics of continuing trial if unlikely to show superiority?
  • Excess risk of crossing lower bound too soon?

Asymmetric 2-sided testing

Give up if experimental arm trending worse than control

Sample size

  • Given boundary computation, sample size solves for enrollment rate
  • Fixing relative enrollment rates, dropout rates, and trial duration enables the Lachin and Foulkes (1986) and our AHR methods
  • Under proportional hazards, you can also fix max enrollment rate and solve for trial duration (Kim and Tsiatis 1990)
    • You can easily create scenarios here for which there is no solution
    • Error message: “An error has occurred. Check your logs or contact the app author for clarification.”
    • Need to adjust parameters until a solution is found or use Lachin and Foulkes (1986)

Boundary types

  • Approaches to calculate decision boundary:

    • The error spending approach: specify boundary crossing probabilities at each analysis. This is most commonly done with the error spending function approach (Lan and DeMets 1983).

    • The boundary family approach: specify how big boundary values should be relative to each other and adjust these relative values by a constant multiple to control overall error rates. The commonly applied boundary family include:

      • Haybittle-Peto boundary (Haybittle (1971), Peto et al (1977))
      • Wang-Tsiatis boundary (Wang and Tsiatis 1987)
        • Pocock boundary (Pocock 1977)
        • O’Brien and Fleming boundary (O’Brien and Fleming 1979)
      • Slud and Wei (1982) and Fleming et al (1984) set fixed IA boundary crossing probabilities
        • Hybrid of boundary family/spending approach

Boundary families

Boundary family - Haybittle-Peto boundary

  • Main idea:

    • Interim Z-score boundary: 3
    • Final Z-score boundary: 1.96 (slight \(\alpha\) inflation)

Boundary family - Haybittle-Peto (cont’d)

  • Modified Haybittle-Peto procedure 1:

    Bonferroni adjustment:

    • For the first \(K-1\) analyses, the significance level is set at 0.001;
    • For the final analysis, the significance level is set at \(0.05 - 0.001 \times (K-1)\);
      • More generous final bound if you adjust for test correlations
  • Advantages:

    • Avoid type I error inflation
    • Does not require equally spaced analyses

Boundary family - Wang-Tsiatis bounds

  • Definition:

    For 2-sided testing, Wang and Tsiatis (1987) defined the boundary function for the \(k\)-th look as \[ \Gamma(\alpha, K, \Delta) k^{\Delta - 0.5}, \] where \(\Gamma(\alpha, K, \Delta)\) is a constant chosen so that the level of significance is equal to \(\alpha\).

  • Two special cases:

    • \(\Delta = 0.5\): Pocock bounds
    • \(\Delta = 0\): O’Brien-Fleming bounds.
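A hedged sketch of these special cases; the gsDesign package accepts sfu = "Pocock", "OF", or "WT" (with sfupar giving \(\Delta\)) for the classical boundary families, and the \(\alpha\), \(\beta\), and number of analyses below are chosen for illustration.

library(gsDesign)

# Classical boundary families for K = 4 equally spaced looks,
# 2-sided symmetric testing with 0.025 per side (2-sided alpha = 0.05)
pocock <- gsDesign(k = 4, test.type = 2, alpha = 0.025, beta = 0.1, sfu = "Pocock")
obf <- gsDesign(k = 4, test.type = 2, alpha = 0.025, beta = 0.1, sfu = "OF")
wt <- gsDesign(k = 4, test.type = 2, alpha = 0.025, beta = 0.1, sfu = "WT", sfupar = 0.25)

round(pocock$upper$bound, 3) # constant Z bounds (Delta = 0.5), about 2.36
round(obf$upper$bound, 3)    # steeply decreasing Z bounds (Delta = 0)
round(wt$upper$bound, 3)     # intermediate bounds for Delta = 0.25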

Wang-Tsiatis example - Pocock boundary

For 2-sided testing, the Pocock procedure rejects at the \(k\)-th of \(K\) equally spaced looks if \[|Z_k| > c_P(K),\] where \(c_P(K)\) is fixed given \(K\) such that \(\text{Pr}(\cup_{k=1}^{K} |Z_k| > c_P(K)) = \alpha\).

Wang-Tsiatis example - Pocock boundary (cont’d)

  • Example:
Total number of looks (K) \(\alpha = 0.01\) \(\alpha = 0.05\) \(\alpha = 0.1\)
1 2.576 1.960 1.645
2 2.772 2.178 1.875
4 2.939 2.361 2.067
8 3.078 2.512 2.225
\(\infty\) \(\infty\) \(\infty\) \(\infty\)

We will reject \(H_0\) if \(|Z(k/4)| > 2.361\) for \(k = 1,2,3,4\) (final analysis).

  • Weakness:

    • Overly aggressive interim bounds

    • High price for the end of the trial.

      • With \(\alpha = 0.05\) and \(K=4\) analyses, the absolute value of the z-score must exceed 2.361 to be declared significant at any analysis, including the final one (compared with 1.96 for a fixed design).
      • Z=2.361 translates to a two-tailed “nominal p-value”: \(2(1 − \Phi(2.361))\) = 0.018.
    • \(c_P(K) \to +\infty\) as \(K \to + \infty\).

    • Requires equally spaced looks.

Wang-Tsiatis example - O’Brien-Fleming boundary

  • Early stage: very conservative \(\Rightarrow\) large boundary at the beginning;
  • Final stage: nominal value close to the overall value of the design \(\Rightarrow \approx 1.96\).
  • Regulators generally like these bounds

Wang-Tsiatis example - O’Brien-Fleming boundary (cont’d)

Total number of looks (K) \(\alpha = 0.01\) \(\alpha = 0.05\) \(\alpha = 0.1\)
1 2.576 1.960 1.645
2 2.580 1.977 1.678
4 2.609 2.024 1.733
8 2.648 2.072 1.786
16 2.684 2.114 1.830
\(\infty\) 2.807 2.241 1.960

Example:

  • The tabled value 2.024 is the flat B-value boundary.
  • The flat B-value boundary is easily transformed into a decreasing Z-score boundary via \(Z(t) = B(t)/\sqrt{t}\):
    • \(2.024/\sqrt{1/4} = 4.05\)
    • \(2.024/\sqrt{2/4} = 2.86\)
    • \(2.024/\sqrt{3/4} = 2.34\)
    • \(2.024/\sqrt{4/4} = 2.02\)
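A quick check of this conversion (simple arithmetic added here for convenience):

t <- (1:4) / 4
round(2.024 / sqrt(t), 2) # 4.05 2.86 2.34 2.02, matching the values above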

Boundary families - summary

Boundary families - summary (cont’d)

  • Haybittle-Peto
    • Boundary: Z = 3 at the first \(K-1\) (interim) analyses and 1.96 at the final analysis
    • Advantages: simple to implement
    • Disadvantages: slight \(\alpha\) inflation unless adjusted
  • Pocock
    • Boundary: a constant decision boundary for the Z-score
    • Disadvantages: (1) requires the same level of evidence for early and late looks at the data, so it pays a larger price at the final analysis; (2) requires equally spaced looks
  • O’Brien-Fleming
    • Boundary: constant B-value boundaries, i.e., a steep decrease in Z-score boundaries
    • Advantages: pays a smaller price at the final analysis
    • Disadvantages: too conservative in the early stages?

Spending function boundaries

Spending function

Lan-DeMets spending functions to approximate boundary families

Hwang-Shih-DeCani (gamma) spending functions
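As a minimal sketch (the information fractions are chosen for illustration), the cumulative \(\alpha\) spent by these two spending function families can be evaluated directly with gsDesign:

library(gsDesign)

t <- c(0.25, 0.5, 0.75, 1) # assumed information fractions
# Lan-DeMets spending function approximating O'Brien-Fleming bounds
sfLDOF(alpha = 0.025, t = t)$spend
# Hwang-Shih-DeCani (gamma) spending function; gamma = -4 is O'Brien-Fleming-like
sfHSD(alpha = 0.025, t = t, param = -4)$spend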

What is spending time?

  • Information fraction, Lan and DeMets (1983)
    • Time-to-event: fraction of planned final events in analysis
    • Normal or binomial: fraction of planned final sample size in analysis
    • Usually expected by regulators
  • Calendar fraction, Lan and DeMets (1989)
    • Fraction of trial planned calendar duration at analysis
  • Minimum of planned and actual information fraction
    • Probably not advised yet

General methods and proportional hazards

Proportional hazards approach

Lachin and Foulkes (1986) method

  • Sample size and power derivation
  • Time-to-event endpoint
  • 2-arm trial
  • Logrank (or Cox model coefficient) to test treatment effect
  • Constant treatment effect over time (proportional hazards or PH)
  • Number of events drives power, regardless of study duration

Shiny app for proportional hazards

Metastatic oncology example

  • KEYNOTE 189 trial (Gandhi et al (2018))
    • Endpoints: progression free survival (PFS) and overall survival (OS) in patients
    • Indication: previously untreated metastatic non-small cell lung cancer (NSCLC)
    • Treatments: chemotherapy +/- pembrolizumab
    • Randomized 2:1 to an add-on of pembrolizumab or placebo
    • Type I error (1-sided): 0.025 familywise error rate (FWER) split between PFS (\(\alpha=0.0095\)) and OS (\(\alpha=0.0155\))
    • Graphical method for \(\alpha\)-control in group sequential design (Maurer and Bretz (2013)) used

Key aspects of the design as documented in the protocol accompanying Gandhi et al (2018).

Metastatic oncology: OS design approximation (continued)

  • \(\alpha=0.0155\)
  • Control group survival: exponential median=13 months
  • Exponential dropout rate of 0.133% per month
  • 90% power to detect a hazard ratio (HR) of 0.70025
  • 2:1 randomization, experimental:control
  • Enrollment over 1 year
  • While not specified in the protocol, we have further assumed:
    • Trial duration: 35 months.
    • Observed deaths of 240, 332 and 416 at the 3 planned analyses
    • A one-sided bound using the Lan and DeMets (1983) spending function approximating an O’Brien-Fleming bound; a sketch of this design approximation follows.
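As noted in the last bullet, a sketch of this OS design approximation follows. It mirrors the gsSurv() call used later in this training; the object name KN189os is ours, and since the protocol's exact software inputs are not available, treat the call as illustrative only.

library(gsDesign)

KN189os <- gsSurv(
  k = 3,                      # 3 planned analyses
  test.type = 1,              # 1-sided efficacy testing
  alpha = 0.0155,             # alpha allocated to OS
  beta = 0.1,                 # 90% power
  timing = c(240, 332) / 416, # interim timing from planned death counts
  sfu = sfLDOF,               # Lan-DeMets O'Brien-Fleming spending approximation
  lambdaC = log(2) / 13,      # control: exponential with 13-month median
  hr = 0.70025,               # hazard ratio to detect
  eta = 0.00133,              # 0.133% dropout per month
  gamma = 1,                  # relative enrollment rate (scaled to achieve power)
  R = 12,                     # enrollment over 1 year
  T = 35,                     # 35-month trial duration
  minfup = 23,                # minimum follow-up = 35 - 12 months
  ratio = 2                   # 2:1 experimental:control randomization
)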

Cardiovascular outcomes reduction

  • AFCAPS/TEXCAPS trial: use of lovastatin to reduce cardiovascular outcomes
  • Design described in Downs et al (1997)
  • Results reported in Downs et al (1998)
  • Reproduction here not exact mainly due to choice of Lachin and Foulkes (1986)
    • Little difference between methods
    • Efficacy bounds will be exactly as proposed

Cardiovascular outcomes: key parameters

  • 5 years minimum follow-up of all patients enrolled
  • Interim analyses after 0.375 and 0.75 of final planned event count has accrued
  • 2-sided bound using the Hwang et al (1990) spending function with parameter \(\gamma = -4\) to approximate an O’Brien-Fleming bound
  • We arbitrarily set the following parameters to match design:
    • Power of 90% for a hazard ratio of 0.6921846; this is slightly different than the 0.7 hazard ratio suggested in Downs et al (1997)
    • Enrollment duration of 1/2 year with constant enrollment.
    • An exponential failure rate of 0.01131 per year which is nearly identical to the annual failure rate of 0.01125.
    • An exponential dropout rate of 0.004 per year which is nearly identical to the annual dropout rate of 0.00399.

Cardiovascular outcomes non-inferiority: EXAMINE trial

  • Indication: treatment of diabetes
  • Treatments: DPP4 inhibitor alogliptin compared to placebo
  • Primary endpoint: major cardiovascular outcomes (MACE)
  • Objective: establish non-inferiority
  • Results in White et al (2013)
  • Design in White et al (2011)
  • We approximate the design and primary analysis evaluation here.
  • Software and design assumptions not completely clear; not exact design reproduction

EXAMINE trial: Key assumptions

  • Primary analysis: stratified Cox model for MACE
    • 1-sided repeated confidence interval for HR at each analysis.
    • Analysis to rule out HR > 1.3, but also tests superiority.
    • Analyses planned after 550, 600, 650 MACE events.
    • O’Brien-Fleming-like spending function Lan and DeMets (1983).
    • 2.5% Type I error.
    • Approximately 91% power.
    • 3.5% annual MACE event rate.
    • Uniform enrollment over 2 years.
    • 4.75 years trial duration.
    • 1% annual loss-to-follow-up rate.
    • Software: EAST 5 (Cytel).

Cure model

Poisson mixture cure model we consider:

\[S(t)= \exp(-\theta (1 - \exp(-\lambda t))).\]

Note that:

  • \(1-\exp(-\lambda t)\) is the CDF for an exponential distribution
    • can be replaced by arbitrary continuous CDF.
  • As \(t\rightarrow \infty\), \(S(t)\rightarrow\exp(-\theta)\) (cure rate).
  • Model useful when historical data suggests plateau in survival.
  • PH model: experimental survival \(S_E(t)=S_C(t)^{HR}\).
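A small sketch of this cure model; the values of \(\theta\), \(\lambda\), and the hazard ratio below are assumptions for illustration.

# Poisson mixture cure model survival; plateaus at exp(-theta) as t grows
cureSurv <- function(t, theta, lambda) exp(-theta * (1 - exp(-lambda * t)))

t <- 0:60                                            # months
sC <- cureSurv(t, theta = 1.2, lambda = log(2) / 12) # control arm (assumed values)
sE <- sC^0.7                                         # experimental arm under PH, HR = 0.7

exp(-1.2)   # cure rate implied by theta = 1.2
tail(sC, 1) # control survival at 60 months, approaching the cure rate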

Survival model assumptions

More details in book.

Cure model: Expected event accumulation over time

  • Event accumulation over time can be very sensitive to many trial design assumptions.
  • Generally, we are trying to mimic a slowing of event accumulation over time.
  • Assume 18 month enrollment with 6-month ramp-up.

Expected event accrual over time

Potential advantages/disadvantages of calendar spending

  • Quite possible that event rate design assumptions incorrect
  • Good to ensure duration of follow-up is adequate to evaluate tail behavior/plateau in survival
  • Ensures adequate follow-up to estimate relevant parts of survival curve
  • Limits trial to relevant clinical and practical duration
  • May be underpowered if events accrue slowly
  • Probably less useful for high-risk endpoints (e.g., metastatic cancer)
  • Regulatory resistance?

Exercise

Average hazard ratio approach

Delayed effect

  • Recurrent head and neck squamous cell carcinoma
  • Pembrolizumab vs Standard of Care (SOC)
  • 1:1 randomization, N=495
  • Primary endpoint: OS
  • 90% power for HR=0.7, 1-sided \(\alpha=0.025\)
  • Design (Am 11; protocol) and results
    • Proceeded to final analysis (388 deaths; 340 planned)
      • 1-sided nominal p-value for OS: 0.0161
  • 2 interim analyses planned (144 and 216 deaths; IF = 0.42, 0.64)
    • IF = information fraction
    • Efficacy: Hwang-Shih-DeCani (HSD) spending, \(\gamma=-4\)
    • Futility: non-binding, \(\beta\)-spending, HSD, \(\gamma = -16\)
KEYNOTE 040: Overall Survival

Cohen et al (2019)

Delayed effect not just in oncology

Cholesterol lowering and mortality Miettinen et al (1997)

Scandinavian Simvastatin Survival Study

Proportional hazards

  • First-line metastatic non-small cell lung cancer (NSCLC)
  • Chemo + Pembrolizumab vs Chemo
  • 2:1 randomization, N=616
  • Primary endpoints
    • Progression free survival (PFS)
    • Overall survival (OS)
  • Potential for delayed OS effect not realized
  • Median follow-up: 10.5 months
  • Design (Am 7; protocol)
  • Stopped at first IA (235 deaths; IF = 0.56)
  • Plan for final analysis: 416 deaths
  • Adjusted 1-sided p-value for OS: 0.00559
  • O’Brien-Fleming spending
KEYNOTE 189

Gandhi et al (2018)

Crossing survival curves

  • Recurrent advanced gastric or gastro-oesophageal junction cancer
  • Pembrolizumab vs Paclitaxel
  • 1:1 randomization, N=360 planned (N=395 actual) in CPS \(\ge\) 1 population
  • Primary endpoints:
    • OS in PD-L1 CPS \(\ge\) 1
      • 91% power for HR=0.67, 1-sided \(\alpha=0.0215\)
    • PFS in PD-L1 CPS \(\ge\) 1
  • Design (Am 11; protocol) and results in CPS \(\ge\) 1
    • Proceeded to final analysis (326 deaths; 290 planned)
    • 1 IA planned (240 deaths; IF = 0.83)
      • Efficacy: Hwang-Shih-DeCani (HSD) spending, \(\gamma=-4\)
KEYNOTE 061: Overall Survival in CPS \(\ge\) 1

Shitara et al (2018)

  • No futility bound
  • 1-sided nominal p-value for OS: 0.0421 (threshold: p=0.0135)
  • Post hoc FH(\(\rho=1,\gamma=1\)): p=0.0009

Summary of issues

  • What is the impact of (potentially) delayed treatment effect on trial design and analysis?
  • Test statistics used (logrank for this section; next section suggests alternatives)
  • Sample size vs duration of follow-up
  • Timing of analyses
  • Futility bounds
  • Updating bounds
  • Multiplicity adjustments

Software installation

simtrial overview

  • Time-to-event trial simulation
  • Piecewise model
  • Logrank and weighted logrank analysis
  • Simulating fixed and group sequential design
    • Also potential to simulate adaptive design
  • Reverse engineered datasets from NPH Working Group
  • Validation near completion
    • Thanks to AP colleagues and Amin Shirazi

simtrial functions

  • Generating simulated datasets
    • simfix(): Simulation of fixed sample size design for time-to-event endpoint
    • simfix2simPWSurv(): Conversion of enrollment and failure rates from simfix() to simPWSurv() format
    • simPWSurv(): Simulate a stratified time-to-event outcome randomized trial
  • Cutting data for analysis
    • cutData(): Cut a dataset for analysis at a specified date
    • cutDataAtCount(): Cut a dataset for analysis at a specified event count
    • getCutDateForCount(): Get date at which an event count is reached
  • Analysis functions
    • tenFH(): Fleming-Harrington weighted logrank tests
    • tenFHcorr(): Fleming-Harrington weighted logrank tests plus correlations
    • tensurv(): Process survival data into counting process format
    • pMaxCombo(): MaxCombo p-value
    • pwexpfit(): Piecewise exponential survival estimation
    • wMB(): Magirr and Burman modestly weighted logrank tests
  • Lower level functions
    • fixedBlockRand(): Permuted fixed block randomization
    • rpwenroll(): Generate piecewise exponential enrollment
    • rpwexp(): The piecewise exponential distribution

simtrial: reverse engineered datasets

From NPH Working Group

  • Ex1delayedEffect: Time-to-event data example 1 for non-proportional hazards working group
  • Ex2delayedEffect: Time-to-event data example 2 for non-proportional hazards working group
  • Ex3curewithph: Time-to-event data example 3 for non-proportional hazards working group
  • Ex4belly: Time-to-event data example 4 for non-proportional hazards working group
  • Ex5widening: Time-to-event data example 5 for non-proportional hazards working group
  • Ex6crossing: Time-to-event data example 6 for non-proportional hazards working group
  • MBdelayed: Simulated survival dataset with delayed treatment effect

gsDesign2

  • AHR(): Average hazard ratio under non-proportional hazards (test version)
  • eAccrual(): Piecewise constant expected accrual
  • eEvents_df(): Expected events observed under piecewise exponential model
  • ppwe(): Estimate piecewise exponential cumulative distribution function
  • s2pwe(): Approximate survival distribution with piecewise exponential distribution
  • tEvents(): Predict time at which a targeted event count is achieved

We will focus on AHR(), ppwe(), and s2pwe() in this training.

gsdmvn

Power and design functions extending the Jennison and Turnbull (2000) computational model to non-constant treatment effects. Partial list of functions:

  • Design and power under average hazard ratio model
    • gs_power_ahr(): Power computation
    • gs_design_ahr(): Design computations
  • Bound support
    • gs_b(): direct input of bounds
    • gs_spending_bound(): spending function bounds
  • Other tests
    • Design and power with weighted logrank and MaxCombo will be discussed in next section.

The piecewise model

Introducing the piecewise model

  • Simple model to approximate arbitrary patterns of
    • Enrollment: piecewise constant enrollment rates
    • Failure rates: piecewise exponential
    • Dropout rates: piecewise exponential
  • Combined tools for designing and evaluating designs
    • Asymptotic approach using average hazard ratio (AHR)
    • Simulation tools to confirm asymptotic approximations
    • No requirement for proportional hazards
    • Stick with logrank for this section

Piecewise constant enrollment

Set up piecewise constant enrollment rates

enrollRates <- tibble::tribble(
  ~duration, ~rate,
  # 5/month for 6 months
  6, 5,
  # 20/month until enrollment complete
  6, 20
)

Get enrollment times for 150 observations

set.seed(123)
Month <- simtrial::rpwenroll(
  n = 150,
  enrollRates = enrollRates
)

Question

  • Should we assume all enrollment
    • by targeted time?
    • random, expected enrollment matches target at targeted time?
  • simtrial package assumes the latter
    • Also used in AHR approach here
  • npsurvSS package assumes the former
    • Used for weighted logrank and combination tests in next section

Failure rates

Note: under exponential distribution, median (\(m\)) and failure rate (\(\lambda\)) related:

\[ \begin{align} m &= \log(2)/\lambda \\ \lambda &= \log(2)/m \end{align} \]

Specify failure rates and dropout rates in same table

# Control: exponential with 15 month median
# HR: 1 for 4 months, 0.6 thereafter
failRates <- tibble::tribble(
  ~Stratum, ~duration, ~failRate, ~hr, ~dropoutRate,
  "All", 4, log(2) / 15, 1, .001,
  "All", 100, log(2) / 15, 0.6, .001
)

Piecewise exponential approximation

library(dplyr)
library(gt)

# Log-logistic survival function, S(x) = 1 / (1 + (x / alpha)^beta)
dloglogis <- function(x, alpha = 1, beta = 4) {
  1 / (1 + (x / alpha)^beta)
}
times10 <- c(seq(1 / 3, 1, 1 / 3), 2, 3)
# Use s2pwe() to generate piecewise approximation
gsDesign2::s2pwe(
  times = times10,
  survival = dloglogis(times10, alpha = .5, beta = 4)
) %>%
  gt() %>%
  fmt_number(columns = 1:2, decimals = 3)
duration rate
0.333 0.541
0.333 3.736
0.333 4.223
1.000 2.716
1.000 1.619
  • Approximating log-logistic distribution plotted above using piecewise model here
  • Can approximate any survival distribution

Approximating using piecewise model

Break (10 minutes)

Average hazard ratio

Overview

  • Definition of average hazard ratio
  • Literature review
  • Asymptotic theory for group sequential design
  • Verification by simulation

Average hazard ratio (AHR)

  • Geometric mean hazard ratio (Mukhopadhyay et al (2020))
  • Exponentiate: average \(\log(\hbox{HR})\) weighted by expected events per interval
Interval HR -ln(HR) Expected Events
0-4 1.0 0.00 d1
>4 0.6 0.51 d2

\[\hbox{AHR} = \exp\left( \frac{d_1 \log(1) + d_2 \log(0.6)}{d_1 + d_2}\right)\]
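A small numeric sketch of this formula; the expected event counts d1 and d2 are hypothetical.

d <- c(d1 = 50, d2 = 150) # assumed expected events in months 0-4 and > 4
hr <- c(1, 0.6)
AHR <- exp(sum(d * log(hr)) / sum(d))
AHR # about 0.68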

Approach to asymptotic theory

  • Assume the hazard rate \(\lambda_m\) is constant on the interval \((t_{m-1},t_m]\)
  • Assume total time (all observations) in interval is \(T_m\) (total time on test)
  • Assume event count in interval is \(D_m\)
  • Hazard rate estimate in this interval: \[\hat{\lambda}_m= D_m/T_m\]
  • Asymptotic approximation (delta method; see Lu Tian slides) \[\log(\hat\lambda_m)\sim \hbox{Normal}(\log(\lambda_m),\sigma^2=1/D_m).\]

Log hazard ratio estimate

  • Now consider hazard ratio for two treatment arms, \(j=0,1\) (0=control, 1=experimental)
  • Same exponential model
  • Log hazard ratio in interval \[\beta_m=\log(\lambda_{1m}/\lambda_{0m})=\log(\lambda_{1m}) - \log(\lambda_{0m})\]
  • Asymptotic approximation \[\hat{\beta}_m\sim \hbox{Normal}(\beta_m,1/D_{0m}+ 1/D_{1m})\]
  • Now we replace \(D_{0m}, D_{1m}\) with their expected values \(E(D_{0m}), E(D_{1m})\)

Inverse variance weighted \(\beta\)

  • Inverse variance weighting for interval \(m=1,2,\ldots,M\) \[ w_m=\frac{(1/E(D_{0m})+ 1/E(D_{1m}))^{-1}}{\sum_{j=1}^M(1/E(D_{0j})+ 1/E(D_{1j}))^{-1}} \] \[ \beta=\sum_{m=1}^M w_m\beta_m \]
  • Asymptotic approximation

\[ \hat{\beta}\sim \hbox{Normal}(\beta,\mathcal{I}^{-1}) \] where \[ \mathcal{I}=\sum_{m=1}^M (1/E(D_{0m}) + 1/E(D_{1m}))^{-1} \]

  • Under the null hypothesis, this is like the Schoenfeld (1981) approximation \[ \mathcal{I}=\xi\sum_{m=1}^M E(D_{0m}) \] where \(\xi=1/2\) for 1:1 randomization
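A small numeric sketch of the inverse variance weighting above; the expected event counts per interval and the interval hazard ratios are hypothetical.

ED0 <- c(30, 70)         # assumed expected control events by interval
ED1 <- c(30, 55)         # assumed expected experimental events by interval
beta_m <- log(c(1, 0.6)) # interval log hazard ratios

v_m <- 1 / ED0 + 1 / ED1        # variance of each interval estimate
w_m <- (1 / v_m) / sum(1 / v_m) # inverse variance weights
beta <- sum(w_m * beta_m)       # weighted log(AHR)
info <- sum(1 / v_m)            # statistical information
c(AHR = exp(beta), info = info)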

AHR over time

  • Constant enrollment rate, 12 month targeted enrollment
  • Exponential dropout, 0.001 per month
  • Control: exponential, median = 15 months
  • HR
    • 1 in months 0-4
    • 0.6 thereafter

Power by AHR

  • Schoenfeld (1981) approximation works well plugging in average hazard ratio
  • Use gsDesign::nEvents()
Events <- 332
# if beta is NULL and n= # of events,
# power is computed instead of events required
Power <- gsDesign::nEvents(n = Events, beta = NULL, hr = c(.6, .7, .8))

Power by AHR

Assume 332 events

AHR as estimand

  • Some argue this is a bad idea
    • e.g., hazards of hazard ratios (Hernán (2010))
  • Pro’s
    • Estimated by Cox regression
    • AHR concept makes more clear what this is
    • Logrank is widely-accepted corresponding test
    • Both asymptotic approximations and simulation supported
      • This includes group sequential design
    • Easy to approximate arbitrary enrollment, failure and dropout patterns
  • Cautions
    • No single estimand sufficiently describes NPH differences
    • Early interim analysis (futility, efficacy) should anticipate possible reduced effect

Expected accrual of endpoints

AHR asymptotic approximation

Used the following code to get AHR and information at specified times:

library(tibble)

analysisTimes <- c(12, 20, 28, 36)
sampleSize <- 500
enrollRates <- tibble(Stratum = "All", duration = 12, rate = sampleSize / 12)
failRates <- tibble(
  Stratum = "All",
  duration = c(4, 100),
  failRate = log(2) / 15,
  hr = c(1, .6),
  dropoutRate = 0.001
)
ahr <- gsDesign2::AHR(
  enrollRates = enrollRates,
  failRates = failRates,
  totalDuration = analysisTimes,
  ratio = 1
)

Simulation code considerations

  • Many options for cutting data!
    • Time-based
    • Event-based
    • Max of time- and event-based
    • Limit final trial duration
  • Note that
    • Code here is detailed, but (hopefully) not complicated
    • AHR changes if event-based
    • Number of events variable if time-based

Simulation code

library(simtrial)
library(survival)
library(dplyr)

# Transform to simPWSurv() format
x <- simfix2simPWSurv(failRates)
nsim <- 10000 # number of simulations
# Set up matrix for simulation results
results <- matrix(0, nrow = nsim * 4, ncol = 6)
colnames(results) <- c("Sim", "Analysis", "Events", "beta", "var", "logrank")
ii <- 0 # index for results row
for (sim in 1:nsim) {
  # Simulate a trial
  ds <- simPWSurv(
    n = sampleSize,
    enrollRates = enrollRates,
    failRates = x$failRates,
    dropoutRates = x$dropoutRates
  )
  for (j in seq_along(analysisTimes)) {
    # Cut data at specified analysis times
    # Use cutDataAtCount to cut at event count
    dsc <- ds %>% cutData(analysisTimes[j])
    ii <- ii + 1
    results[ii, 1] <- sim
    results[ii, 2] <- j
    results[ii, 3] <- sum(dsc$event)
    cox <- coxph(Surv(tte, event) ~ Treatment, data = dsc)
    results[ii, 4] <- as.numeric(cox$coefficients)
    results[ii, 5] <- as.numeric(cox$var)
    # Logrank test
    Z <- dsc %>%
      tensurv(txval = "Experimental") %>%
      tenFH(rg = tibble::tibble(rho = 0, gamma = 0))
    results[ii, 6] <- as.numeric(Z$Z)
  }
}

Simulation summary

Simulation summary based on 10k replications

results <- tibble::as_tibble(results)
simsum <- results %>%
  group_by(Analysis) %>%
  summarize(
    # Geometric mean HR (exponentiated mean Cox coefficient) and mean event count
    AHR = exp(mean(beta)), Events = mean(Events),
    # info: inverse of the empirical variance of the Cox coefficient across simulations;
    # info0: null hypothesis approximation, events / 4 (1:1 randomization)
    info = 1 / var(beta), info0 = Events / 4
  )

Comparison of asymptotics vs simulation

Asymptotic Approximation

Time AHR Events info info0
12.00 0.84 107.39 26.37 26.85
20.00 0.74 207.90 50.67 51.97
28.00 0.70 279.10 68.23 69.78
36.00 0.68 331.29 81.38 82.82

10k simulations

AHR Events info info0
0.84 107.16 25.48 26.79
0.74 207.77 49.74 51.94
0.70 278.93 67.29 69.73
0.68 331.13 80.11 82.78
  • Simulations represent the truth (up to Monte Carlo error) since they sample from the actual distribution
  • Note that asymptotic info0 seems a better approximation of simulation than info
  • Using both info0 and info in design will make sample size a little more conservative
    • Not as conservative as simulation info

Distribution of Cox coefficient

library(ggplot2)

ggplot(results, aes(x = factor(Analysis), y = beta)) +
  geom_violin() +
  ggtitle("Distribution of Cox Coefficient by Analysis") +
  xlab("Analysis") +
  ylab("Cox coefficient")

Variability of results

9 simulations

Question: Do you really want to adapt sample size based on an early interim estimate of treatment effect?

Asymptotic approximation

  • Use of Tsiatis (1982) (also extends to weighted logrank)
  • Statistical information proportional to expected event counts as in Schoenfeld (1981)
  • Natural parameter: \(\log(\hbox{AHR})\)
  • Statistical information still proportional to number of events
  • Correlation still computed based on statistical information
  • Extension of Jennison and Turnbull (2000) calculations to non-constant effect size over time

Asymptotic distribution simplified

Statistical information at analysis: \(\mathcal{I}_k\), \(1\le k\le K\)

Proportion of final information at analysis \(k\): \(t_k =\mathcal{I}_k / \mathcal{I}_K\)

\[Z_k\sim \hbox{Normal}(\sqrt{\mathcal{I}_k} \theta(t_k),1)\] Multivariate normal with correlations for \(1\le j\le k\le K\):

\[\hbox{Corr}(Z_j,Z_k)=\sqrt{t_j/t_k}\]

General theory to AHR translation

  • Previously, the Cox coefficient \(\hat\beta\) was shown to approximate a weighted average of interval log hazard ratios, i.e., \(\log(\hbox{AHR})\)
  • How do we translate to extended canonical form with \(\theta(t_k)\), \(\mathcal{I_k}\), \(1\le k\le K\)?
    • Assume piecewise model
    • Fix some \(k\) in \(1,2,\ldots,K\)
      • Fix calendar time of analysis \(k\) relative to opening of enrollment; can solve for this if event-based spending fraction desired
      • Compute \(\mathcal{I_k}\) as before based on expected events
      • Compute \(t_k=\mathcal{I}_k/\mathcal{I}_K\)
      • Compute expected Cox coefficient \(\beta\) (\(=\log(AHR)\)) as before and set \(\theta(t_k)=\beta\)
  • We now have canonical form for the piecewise model

Group sequential design with spending bounds

Recalling assumptions

analysisTimes <- c(12, 20, 28, 36)
sampleSize <- 500

enrollRates

Stratum duration rate
All 12 41.7

failRates

Stratum duration failRate hr dropoutRate
All 4 0.046 1.0 0.001
All 100 0.046 0.6 0.001

Interim and final timing, effect size, information

ahr

Time AHR Events info info0
12 0.84 107.4 26.4 26.8
20 0.74 207.9 50.7 52.0
28 0.70 279.1 68.2 69.8
36 0.68 331.3 81.4 82.8

Information fraction for interim analyses

## [1] 0.3241690 0.6275343 0.8424726
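These information fractions appear to be based on expected event counts from the AHR() call above (an assumption on our part); they can be reproduced and stored for use as timing in the gsSurv() call below:

timing <- ahr$Events[1:3] / ahr$Events[4]
timing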

One-sided design

One-sided design with proportional hazards (gsDesign)

PH1sided <- gsDesign::gsSurv( # Derive Group Sequential Design
  k = 4, # Number of analyses (interim + final)
  test.type = 1, # use this for 1-sided testing
  alpha = 0.025, # 1-sided Type I error
  beta = 0.1, # Type II error (1 - power)
  timing = timing, # Information fraction for interims
  sfu = sfLDOF, # O'Brien-Fleming spending approximation
  lambdaC = failRates$failRate, # Piecewise control failure rates
  hr = ahr$AHR[4], # Used final analysis AHR
  eta = failRates$dropoutRate, # Piecewise exponential dropout rates
  gamma = enrollRates$rate, # Relative enrollment
  R = enrollRates$duration, # Duration of piecewise enrollment rates
  S = failRates$duration[1], # Duration of piecewise failure rates (K-1)
  T = max(analysisTimes), # Study duration
  minfup = max(analysisTimes) - sum(enrollRates$duration), # Minimum follow-up
  ratio = 1 # Experimental:Control randomization ratio
)

One-sided design with gsSurv()

Analysis Value Efficacy
IA 1: 32% Z 3.7670
N: 444 p (1-sided) 0.0001
Events: 97 ~HR at bound 0.4636
Month: 13 P(Cross) if HR=1 0.0001
P(Cross) if HR=0.68 0.0289
IA 2: 63% Z 2.6020
N: 444 p (1-sided) 0.0046
Events: 186 ~HR at bound 0.6828
Month: 21 P(Cross) if HR=1 0.0047
P(Cross) if HR=0.68 0.4999
IA 3: 84% Z 2.2209
N: 444 p (1-sided) 0.0132
Events: 250 ~HR at bound 0.7549
Month: 28 P(Cross) if HR=1 0.0146
P(Cross) if HR=0.68 0.7916
Final Z 2.0453
N: 444 p (1-sided) 0.0204
Events: 297 ~HR at bound 0.7885
Month: 36 P(Cross) if HR=1 0.0250
P(Cross) if HR=0.68 0.9000

One-sided design with gsSurv()

cat(summary(PH1sided))

One-sided group sequential design with 4 analyses, time-to-event outcome with sample size 444 and 297 events required, 90 percent power, 2.5 percent (1-sided) Type I error to detect a hazard ratio of 0.68. Enrollment and total study durations are assumed to be 12 and 36 months, respectively. Efficacy bounds derived using a Lan-DeMets O’Brien-Fleming approximation spending function with none = 1.

One-sided design with gs_design_ahr()

library(gsdmvn)
# Spending function setup
upar <- list(sf = gsDesign::sfLDOF, total_spend = 0.025)
NPH1sided <- gs_design_ahr(
  enrollRates = enrollRates,
  failRates = failRates,
  ratio = 1, alpha = .025, beta = 0.1,
  # Information fraction not required (but available!)
  analysisTimes = analysisTimes,
  # Function to enable spending bound
  upper = gs_spending_bound,
  # Spending function and parameters used
  upar = upar,
  # Lower bound fixed at -infinity
  lower = gs_b, # allows input of fixed bound
  # With gs_b, just enter values for bounds
  lpar = rep(-Inf, 4)
)

One-sided design with gs_design_ahr()

Analysis Bound Time N Events Z Probability AHR theta info info0
1 Upper 12 464.3 99.7 3.7670 0.0019 0.840 0.175 24.5 24.9
2 Upper 20 464.3 193.0 2.6020 0.3024 0.738 0.304 47.0 48.3
3 Upper 28 464.3 259.2 2.2209 0.7329 0.700 0.357 63.3 64.8
4 Upper 36 464.3 307.6 2.0453 0.9000 0.683 0.381 75.6 76.9
  • You will want to round up events and sample size!
  • Interim boundary crossing probability much lower than with PH bounds
    • IA 2 crossing probability 50% under PH
  • Sample size larger than for PH (N=444, 297 events)

Symmetric bounds

Symmetric bounds

  • Moving on to lower bounds
  • Common practice is to use binding upper and lower bounds
  • Here we use two one-sided tests for \(\alpha-\)spending
  • Upper bound \(b_k(\alpha)\) and lower bound \(a_k(\alpha)= - b_k(\alpha)\) defined so that \[ \begin{align} f(s_k,\alpha)-f(s_{k-1},\alpha) &= P_0\left(\{Z_{k}\geq b_{k}(\alpha)\}\cap_{j=1}^{k-1}\{-b_{j}(\alpha)< Z_{j}< b_{j}(\alpha)\}\right)\\ &= P_0\left(\{Z_{k}\le -b_{k}(\alpha)\}\cap_{j=1}^{k-1}\{-b_{j}(\alpha)< Z_{j}< b_{j}(\alpha)\}\right) \end{align} \]

Symmetric design with gsSurv()

Analysis Value Efficacy Futility
IA 1: 32% Z 3.7670 -3.7670
N: 444 p (1-sided) 0.0001 0.0001
Events: 97 ~HR at bound 0.4636 2.1569
Month: 13 P(Cross) if HR=1 0.0001 0.0001
P(Cross) if HR=0.68 0.0289 0.0000
IA 2: 63% Z 2.6020 -2.6020
N: 444 p (1-sided) 0.0046 0.0046
Events: 186 ~HR at bound 0.6828 1.4646
Month: 21 P(Cross) if HR=1 0.0047 0.0047
P(Cross) if HR=0.68 0.4999 0.0000
IA 3: 84% Z 2.2209 -2.2209
N: 444 p (1-sided) 0.0132 0.0132
Events: 250 ~HR at bound 0.7549 1.3246
Month: 28 P(Cross) if HR=1 0.0146 0.0146
P(Cross) if HR=0.68 0.7916 0.0000
Final Z 2.0453 -2.0453
N: 444 p (1-sided) 0.0204 0.0204
Events: 297 ~HR at bound 0.7885 1.2682
Month: 36 P(Cross) if HR=1 0.0250 0.0250
P(Cross) if HR=0.68 0.9000 0.0000

Symmetric design with gs_design_ahr()

library(gsdmvn)

# Spending function and parameters for both bounds
par <- list(sf = gsDesign::sfLDOF, total_spend = 0.025)
NPHsymmetric <- gs_design_ahr(
  enrollRates = enrollRates,
  failRates = failRates,
  ratio = 1, alpha = .025, beta = 0.1,
  # Information fraction not required (but available!)
  analysisTimes = analysisTimes,
  # Function to enable spending bound
  upper = gs_spending_bound,
  lower = gs_spending_bound,
  # Spending function and parameters used
  upar = par,
  lpar = par,
  binding = TRUE, # set lower bound to binding
  h1_spending = FALSE
)

Symmetric design with gs_design_ahr()

Analysis Bound Time N Events Z Probability AHR theta info info0
1 Upper 12 464.3 99.7 3.7670 0.0019 0.840 0.175 24.5 24.9
2 Upper 20 464.3 193.0 2.6020 0.3024 0.738 0.304 47.0 48.3
3 Upper 28 464.3 259.2 2.2209 0.7329 0.700 0.357 63.3 64.8
4 Upper 36 464.3 307.6 2.0453 0.9000 0.683 0.381 75.6 76.9
1 Lower 12 464.3 99.7 −3.7670 0.0000 0.840 0.175 24.5 24.9
2 Lower 20 464.3 193.0 −2.6020 0.0000 0.738 0.304 47.0 48.3
3 Lower 28 464.3 259.2 −2.2209 0.0000 0.700 0.357 63.3 64.8
4 Lower 36 464.3 307.6 −2.0453 0.0000 0.683 0.381 75.6 76.9

Asymmetric bounds

Asymmetric bounds

  • Use non-binding upper bound with spending function \(f_1(s,\alpha)\)
  • Set lower boundary crossing probabilities under the null hypothesis; may assume a binding lower bound with spending function \(f_2(s,\gamma)\) for some chosen \(0 < \gamma \le 1-\alpha\)
  • Set bounds to satisfy \[ \begin{align} f_1(s_k,\alpha)-f_1(s_{k-1},\alpha) &= P_0\left(\{Z_{k}\geq b_{k}(\alpha)\}\cap_{j=1}^{k-1}\{Z_{j}< b_{j}(\alpha)\}\right)\\ f_2(s_k,\gamma)-f_2(s_{k-1},\gamma) &= P_\theta\left(\{Z_{k}< a_{k}(\gamma)\}\cap_{j=1}^{k-1}\{a_{j}(\gamma)\le Z_{j}< b_{j}(\alpha)\}\right) \end{align} \]

\(\beta\)-spending bounds

  • Use non-binding upper bound with spending function \(f_1(s,\alpha)\)
  • Set lower boundary crossing probabilities under the alternate hypothesis \(\theta(t_k)\) (denoted \(\theta\) below)
  • Assume spending function \(f_2(s,\gamma)\) for \(\gamma\) set to Type II error; power \(=1-\gamma\)
  • Set bounds to satisfy

\[ \begin{align} f_1(s_k,\alpha)-f_1(s_{k-1},\alpha) &= P_0\left(\{Z_{k}\geq b_{k}(\alpha)\}\cap_{j=1}^{k-1}\{Z_{j}< b_{j}(\alpha)\}\right)\\ f_2(s_k,\gamma)-f_2(s_{k-1},\gamma) &= P_\theta\left(\{Z_{k}< a_{k}(\gamma)\}\cap_{j=1}^{k-1}\{a_{j}(\gamma)\le Z_{j}< b_{j}(\alpha)\}\right) \end{align} \]

  • Generally, sample size set so that \(a_K=b_K\)

\(\beta\)-spending design with gsSurv()

Analysis Value Efficacy Futility
IA 1: 32% Z 3.7670 -0.2503
N: 476 p (1-sided) 0.0001 0.5988
Events: 104 ~HR at bound 0.4767 1.0505
Month: 13 P(Cross) if HR=1 0.0001 0.4012
P(Cross) if HR=0.68 0.0338 0.0143
IA 2: 63% Z 2.6020 0.8440
N: 476 p (1-sided) 0.0046 0.1993
Events: 201 ~HR at bound 0.6922 0.8875
Month: 21 P(Cross) if HR=1 0.0047 0.8103
P(Cross) if HR=0.68 0.5385 0.0393
IA 3: 84% Z 2.2209 1.5151
N: 476 p (1-sided) 0.0132 0.0649
Events: 269 ~HR at bound 0.7626 0.8312
Month: 28 P(Cross) if HR=1 0.0144 0.9414
P(Cross) if HR=0.68 0.8185 0.0687
Final Z 2.0453 2.0453
N: 476 p (1-sided) 0.0204 0.0204
Events: 319 ~HR at bound 0.7953 0.7953
Month: 36 P(Cross) if HR=1 0.0225 0.9775
P(Cross) if HR=0.68 0.9000 0.1000

\(\beta\)-spending design with gs_design_ahr()

library(gsdmvn)
# Spending function setup
upar <- list(sf = gsDesign::sfLDOF, total_spend = 0.025)
lpar <- list(sf = gsDesign::sfHSD, total_spend = .1, param = -2)
NPHasymmetric <- gs_design_ahr(
  enrollRates = enrollRates,
  failRates = failRates,
  ratio = 1, alpha = .025, beta = 0.1,
  # Information fraction not required (but available!)
  analysisTimes = analysisTimes,
  # Function to enable spending bound
  upper = gs_spending_bound,
  lower = gs_spending_bound,
  # Spending function and parameters used
  upar = upar, lpar = lpar
)

\(\beta\)-spending design with gs_design_ahr()

Analysis Bound Time N Events Z Probability AHR theta info info0
1 Upper 12 501.8 107.8 3.7670 0.0021 0.840 0.175 26.5 26.9
2 Upper 20 501.8 208.6 2.6020 0.3318 0.738 0.304 50.9 52.2
3 Upper 28 501.8 280.1 2.2209 0.7660 0.700 0.357 68.5 70.0
4 Upper 36 501.8 332.5 2.0453 0.9000 0.683 0.381 81.7 83.1
1 Lower 12 501.8 107.8 −1.2899 0.0143 0.840 0.175 26.5 26.9
2 Lower 20 501.8 208.6 0.3054 0.0387 0.738 0.304 50.9 52.2
3 Lower 28 501.8 280.1 1.3340 0.0681 0.700 0.357 68.5 70.0
4 Lower 36 501.8 332.5 2.0453 0.1000 0.683 0.381 81.7 83.1

Design with interims at specified times

  • Not easily done with gsDesign::gsSurv() or the gsDesign package more generally
  • Futility only at interim 1
    • Look for p=0.05 in the wrong direction
  • Efficacy only AFTER interim 1
  • This is a variation on asymmetric design
  • We will use information fraction instead of calendar times of analysis

Design with interims at specified times

# Spending function setup
upar <- list(sf = gsDesign::sfLDOF, total_spend = 0.025)
lpar <- c(qnorm(.05), rep(-Inf, 3))
NPHskip <- gs_design_ahr(
  enrollRates = enrollRates,
  failRates = failRates,
  ratio = 1, alpha = .025, beta = 0.1,
  # Information fraction not required (but available!)
  analysisTimes = analysisTimes,
  # Upper spending bound
  upper = gs_spending_bound, upar = upar,
  # Skip first efficacy analysis
  test_upper = c(FALSE, TRUE, TRUE, TRUE),
  # Spending function and parameters used
  lower = gs_b, lpar = lpar
)

Design with interims at specified times

Analysis Bound Time N Events Z Probability AHR theta info info0
1 Lower 12 467.6 100.4 −1.6449 0.0060 0.840 0.175 24.7 25.1
2 Upper 20 467.6 194.4 2.5999 0.3057 0.738 0.304 47.4 48.6
3 Upper 28 467.6 261.0 2.2207 0.7359 0.700 0.357 63.8 65.3
4 Upper 36 467.6 309.8 2.0452 0.9000 0.683 0.381 76.1 77.5

Review of multiplicity issues

Type I error

All probabilities are under null hypothesis

  • Nominal p-value: probability of rejecting for a specific test conditioning on nothing else
  • Repeated p-value: \(\alpha\)-level at which a group sequential test would be rejected at a given analysis for a specific hypothesis
  • Sequential p-value: \(\alpha\)-level at which a group sequential test would be rejected for all analyses performed
    • Both sequential and repeated p-values can be computed from interim data
  • FWER-adjusted p-value: given a set of hypotheses with fixed and/or group sequential designs, the FWER at which an individual hypothesis would be rejected at one of the analyses performed

Other tests

Outline

We consider alternative tests for group sequential design.

  • Review fixed design
    • Weighted logrank test
    • MaxCombo test
    • Illustration using npsurvSS
  • Group sequential design with weighted logrank test
    • Under a given boundary
    • Boundary calculation
    • Illustration using gsdmvn
  • Group sequential design with MaxCombo test
    • Under a given boundary
    • Boundary calculation
    • Illustration using gsdmvn

Fixed design

For simplicity, we made a few key assumptions.

  • Balanced design (1:1 randomization ratio).
  • 1-sided test.
  • Local alternative: variance under null and alternative are approximately equal.
  • Accrual distribution: Piecewise uniform.
  • Survival distribution: piecewise exponential.
  • Loss to follow-up: exponential.
  • No stratification.
  • No cure fraction.

The fixed design part largely follows the concept described in Yung and Liu (2019).

Notation

  • \(\alpha\): Type I error
  • \(\beta\): Type II error; power is \(1 - \beta\)
  • \(z_\alpha\): upper \(\alpha\) percentile of standard normal distribution
  • \(z_\beta\): upper \(\beta\) percentile of standard normal distribution

We considered a 1-sided test with type I error at \(\alpha=0.025\) and \(1-\beta=80\%\) power.

z_alpha <- abs(qnorm(0.025))
z_alpha
## [1] 1.959964
z_beta <- abs(qnorm(0.2))
z_beta
## [1] 0.8416212

Sample size calculation

  • \(\theta\): effect size
  • \(n\): total sample size
  • \(Z\): test statistic, asymptotically normal
    • Under null hypothesis: \(Z \sim \mathcal{N}(0, \sigma_0^2)\)
    • Under alternative hypothesis: \(Z \sim \mathcal{N}(\sqrt{n}\theta, \sigma_1^2)\)

By assuming a local alternative, we have

\[\sigma_0^2 \approx \sigma_1^2 = \sigma^2\]

In this simplified case, the sample size can be calculated as

\[ n = \frac{4 (z_{\alpha}+z_{\beta})^{2}}{\theta^2} \]
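A quick check of this formula, interpreting \(\theta\) as \(-\log(\hbox{HR})\) as in the earlier sections (the constant 4 reflects 1:1 randomization); with a hazard ratio of 0.7, 1-sided \(\alpha=0.025\), and 80% power this gives roughly 247, matching the familiar Schoenfeld-type required event count for a time-to-event endpoint.

z_alpha <- qnorm(1 - 0.025)
z_beta <- qnorm(1 - 0.2)
theta <- -log(0.7) # theta interpreted as -log(HR)
4 * (z_alpha + z_beta)^2 / theta^2 # about 247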

Examples

Sample size calculation under non-proportional hazards

Fixed design with weighted logrank test

Fixed design with MaxCombo test

Group sequential design with weighted logrank test

Similar to the fixed design, we can define the test statistics for weighted logrank test using counting process formula

\[ Z_k=\sqrt{\frac{n_{0}+n_{1}}{n_{0}n_{1}}}\int_{0}^{t_k}w(t)\frac{\overline{Y}_{0}(t)\overline{Y}_{1}(t)}{\overline{Y}_{0}(t)+\overline{Y}_{1}(t)}\left\{ \frac{d\overline{N}_{1}(t)}{\overline{Y}_{1}(t)}-\frac{d\overline{N}_{0}(t)}{\overline{Y}_{0}(t)}\right\} \]

Note that the only difference from the fixed design is that the test statistic uses data accrued up to time \(t_k\) at the \(k\)-th interim analysis. A small weighted logrank sketch using simtrial follows.
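As flagged above, a hedged sketch: a single simulated trial analyzed with Fleming-Harrington weighted logrank tests, reusing enrollRates, failRates, and x <- simfix2simPWSurv(failRates) from the simulation code earlier in this training; the 300-event cut and the (rho, gamma) choices are illustrative only.

library(simtrial)
library(dplyr)

set.seed(2021)
oneTrial <- simPWSurv(
  n = 500,
  enrollRates = enrollRates,
  failRates = x$failRates,
  dropoutRates = x$dropoutRates
) %>%
  cutDataAtCount(300) # cut for analysis at 300 events

# Standard logrank FH(0, 0) and FH(0, 0.5), which down-weights early events
oneTrial %>%
  tensurv(txval = "Experimental") %>%
  tenFH(rg = tibble::tibble(rho = c(0, 0), gamma = c(0, 0.5)))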

Group sequential design with weighted logrank test

Group sequential design with MaxCombo test

Summary

Where to start for NPH?

  • Understand your control group
    • Get simple assumptions to approximate published results?
  • Understand AHR
    • Consider treatment effect that is clinically meaningful, but conservative
    • Consider a treatment effect delay that is intermediate
    • Understand likely enrollment pattern and dropout rate
    • Plot AHR, underlying survival differences and expected accrual of patients and events

Where to start for NPH?

  • Follow-up (trial duration)
    • Get to the good part of the AHR curve (it will plateau under the above assumptions)
    • Ensure follow-up is long enough to characterize the tail of the survival curve
    • Require both minimum follow-up and sufficient events for final analysis

Simplest design approach

  • Start with AHR from above evaluation
  • Design under proportional hazards assuming constant AHR
  • Ensure early futility bounds are not too aggressive
    • No futility (risk that data monitoring committee will overrule)
    • \(\beta\)-spending: be conservative for early bounds
    • Asymmetric 2-sided test; e.g., Pocock-like bound with moderate total boundary crossing probability
  • Efficacy bounds: O’Brien-Fleming always acceptable to regulators
  • This is easy to do with gsDesign Shiny app!

More advanced needs: gsdmvn

  • For primary analysis, may wish to stick with logrank test
    • Well-accepted by regulatory agencies
  • Futility bounds and power take into account changing effect over time
    • NPH will also impact expected timing of analyses
  • More easily allows interim timing based on calendar time
  • More options for setting boundaries
    • Fixed futility bounds
    • Haybittle-Peto bounds
    • Futility or efficacy bounds can be eliminated from selected analyses

Less conventional tests

  • RMST has been heavily promoted
    • For delayed effect, power is generally no better than logrank
    • However, some advocate these for primary analysis
  • Sensitivity analyses with better power assuming delayed effect
    • Weighted logrank with FH(0, 0.5) or Magirr-Burman test
    • MaxCombo test at final analysis only
  • There can be substantial power gains

Thank you!

  • Feedback and questions are welcome!
  • You may wish to submit issues at the GitHub repositories, but e-mail is also OK.


References

[1] Scharfstein, D. O., Tsiatis, A. A. and Robins, J. M. (1997). Semiparametric efficiency and its implication on the design and analysis of group-sequential studies. Journal of the American Statistical Association 92 1342–50.

[2] Jennison, C. and Turnbull, B. W. (2000). Group sequential methods with applications to clinical trials. Chapman; Hall/CRC, Boca Raton, FL.

[3] Lachin, J. M. and Foulkes, M. A. (1986). Evaluation of sample size and power for analyses of survival with allowance for nonuniform patient entry, losses to follow-up, noncompliance, and stratification. Biometrics 42 507–19.

[4] Schoenfeld, D. (1981). The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika 68 316–9.

[5] Mukhopadhyay, P., Huang, W., Metcalfe, P., Öhrn, F., Jenner, M. and Stone, A. (2020). Statistical and practical considerations in designing of immuno-oncology trials. Journal of Biopharmaceutical Statistics 1–7.

[6] Yung, G. and Liu, Y. (2020). Sample size and power for the weighted log-rank test and Kaplan-Meier based tests with allowance for nonproportional hazards. Biometrics 76 939–50.

[7] Karrison, T. G. and others. (2016). Versatile tests for comparing survival curves based on weighted log-rank statistics. Stata Journal 16 678–90.

[8] Kim, K. and Tsiatis, A. A. (1990). Study duration for clinical trials with survival response and early stopping rule. Biometrics 81–92.

[9] Lan, K. K. G. and DeMets, D. L. (1983). Discrete sequential boundaries for clinical trials. Biometrika 70 659–63.

[10] Haybittle, J. (1971). Repeated assessment of results in clinical trials of cancer treatment. The British Journal of Radiology 44 793–7.

[11] Peto, R., Pike, M. C., Armitage, P., Breslow, N. E., Cox, D., Howard, S., Mantel, N., McPherson, K., Peto, J. and Smith, P. (1977). Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. Analysis and examples. British Journal of Cancer 35 1–39.

[12] Wang, S. K. and Tsiatis, A. A. (1987). Approximately optimal one-parameter boundaries for group sequential trials. Biometrics 193–9.

[13] Pocock, S. J. (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika 64 191–9.

[14] O’Brien, P. C. and Fleming, T. R. (1979). A multiple testing procedure for clinical trials. Biometrics 549–56.

[15] Slud, E. and Wei, L. (1982). Two-sample repeated significance tests based on the modified wilcoxon statistic. Journal of the American Statistical Association 77 862–8.

[16] Fleming, T. R., Harrington, D. P. and O’Brien, P. C. (1984). Designs for group sequential tests. Controlled Clinical Trials 5 348–61.

[17] Lan, K. K. G. and DeMets, D. L. (1989). Group sequential procedures: Calendar versus information time. Statistics in Medicine 8 1191–8.

[18] Gandhi, L., Rodrı́guez-Abreu, D., Gadgeel, S., Esteban, E., Felip, E., De Angelis, F., Domine, M., Clingan, P., Hochmair, M. J., Powell, S. F. and others. (2018). Pembrolizumab plus chemotherapy in metastatic non–small-cell lung cancer. New England Journal of Medicine 378 2078–92.

[19] Maurer, W. and Bretz, F. (2013). Multiple testing in group sequential trials using graphical approaches. Statistics in Biopharmaceutical Research 5 311–20.

[20] Downs, J. R., Beere, P. A., Whitney, E., Clearfield, M., Weis, S., Rochen, J., Stein, E. A., Shapiro, D. R., Langendorfer, A. and Gotto Jr, A. M. (1997). Design & rationale of the Air Force/Texas Coronary Atherosclerosis Prevention Study (AFCAPS/TexCAPS). The American Journal of Cardiology 80 287–93.

[21] Downs, J. R., Clearfield, M., Weis, S., Whitney, E., Shapiro, D. R., Beere, P. A., Langendorfer, A., Stein, E. A., Kruyer, W., Gotto Jr, A. M. and others. (1998). Primary prevention of acute coronary events with lovastatin in men and women with average cholesterol levels: Results of AFCAPS/TexCAPS. Journal of the American Medical Association 279 1615–22.

[22] Hwang, I. K., Shih, W. J. and De Cani, J. S. (1990). Group sequential designs using a family of Type I error probability spending functions. Statistics in Medicine 9 1439–45.

[23] White, W. B., Cannon, C. P., Heller, S. R., Nissen, S. E., Bergenstal, R. M., Bakris, G. L., Perez, A. T., Fleck, P. R., Mehta, C. R., Kupfer, S. and others. (2013). Alogliptin after acute coronary syndrome in patients with type 2 diabetes. New England Journal of Medicine 369 1327–35.

[24] White, W. B., Bakris, G. L., Bergenstal, R. M., Cannon, C. P., Cushman, W. C., Fleck, P., Heller, S., Mehta, C., Nissen, S. E., Perez, A. and others. (2011). EXamination of cArdiovascular outcoMes with alogliptIN versus standard of carE in patients with type 2 diabetes mellitus and acute coronary syndrome (EXAMINE): A cardiovascular safety study of the dipeptidyl peptidase 4 inhibitor alogliptin in patients with type 2 diabetes with acute coronary syndrome. American Heart Journal 162 620–6.

[25] Cohen, E. E., Soulières, D., Le Tourneau, C., Dinis, J., Licitra, L., Ahn, M.-J., Soria, A., Machiels, J.-P., Mach, N., Mehra, R. and others. (2019). Pembrolizumab versus methotrexate, docetaxel, or cetuximab for recurrent or metastatic head-and-neck squamous cell carcinoma (KEYNOTE-040): A randomised, open-label, phase 3 study. The Lancet 393 156–67.

[26] Miettinen, T. A., Pyörälä, K., Olsson, A. G., Musliner, T. A., Cook, T. J., Faergeman, O., Berg, K., Pedersen, T., Kjekshus, J. and the Scandinavian Simvastatin Survival Study Group. (1997). Cholesterol-lowering therapy in women and elderly patients with myocardial infarction or angina pectoris: Findings from the Scandinavian Simvastatin Survival Study (4S). Circulation 96 4211–8.

[27] Shitara, K., Özgüroğlu, M., Bang, Y.-J., Di Bartolomeo, M., Mandalà, M., Ryu, M.-H., Fornaro, L., Olesiński, T., Caglevic, C., Chung, H. C. and others. (2018). Pembrolizumab versus paclitaxel for previously treated, advanced gastric or gastro-oesophageal junction cancer (KEYNOTE-061): A randomised, open-label, controlled, phase 3 trial. The Lancet 392 123–33.

[28] Hernán, M. A. (2010). The hazards of hazard ratios. Epidemiology 21 13.

[29] Tsiatis, A. A. (1982). Repeated significance testing for a general class of statistics used in censored survival analysis. Journal of the American Statistical Association 77 855–61.

[30] Yung, G. and Liu, Y. (2019). Sample size and power for the weighted log-rank test and Kaplan-Meier based tests with allowance for nonproportional hazards. Biometrics.