September 1, 2022, Bremen

Philosophy

“It is more fun to talk with someone who doesn’t use long, difficult words but rather short, easy words like ‘What about lunch?’” — A. A. Milne, Winnie-the-Pooh

Co-authors

  • Yilong Zhang, Meta Platforms, Inc.
  • Nan Xiao, Merck & Co., Inc.
  • Yujie Zhao, Merck & Co., Inc.

Disclaimer

These materials do not represent corporate thoughts of Merck & Co., Inc., Rahway, NJ, USA and its affiliates, or Meta Platforms, Inc.

Keaven Anderson takes responsibility for any errors.

Overview

  • Grammar of group sequential design is an ongoing initiative
    • Enabling new and potentially extensive capabilities
    • Simplified command structure
  • Current status
    • Many capabilities implemented
    • Progress on grammar
  • Today’s presentation
    • Fixed design example
    • Introduction to grammar priorities
  • Group sequential bound grammar coming later

Aim is to support specific design innovations coming into common use

New capabilities represent opinionated priorities

Opinionated selection of new features

  • Non-proportional hazards asymptotics for group sequential design
    • Weighted logrank and combination test design: Roychoudhury et al. (2021); Magirr and Jiménez (2022)
  • Many more bound options
    • Spending other than information-based
    • One or both of efficacy and futility bound at each analysis
  • Time-to-event endpoint with parallel simulation
  • Graphical multiplicity and parametric design (Anderson et al. 2022)
    • Account for group sequential and multiple hypothesis correlations
    • Multiple experimental arms (e.g., MAMS)
    • Intersecting populations
  • Designs for stratified populations
    • Binomial or time-to-event outcomes with differing treatment effects (Mehrotra and Railkar 2000)

Some Merck Shiny innovations

Some Merck packages and repositories

  • gsDesign ongoing updates for PH
  • gsDesign2 for non-proportional hazards
    • Asymptotic design (gsdmvn now being merged into gsDesign2)
    • Simulation (simtrial)
  • Graphical multiplicity with group sequential design (Anderson et al. 2022)

There are also many non-Merck packages

Packages used here

# New grammar and capabilities
library(gsdmvn)    # To be combined with gsDesign2
library(gsDesign2)

# Standalone time-to-event simulation
library(simtrial)

# Supported since 2007
library(gsDesign)

# tidyverse packages
library(tibble)
library(gt)
library(dplyr)

Piecewise model

Set up input parameters

Enrollment rates (piecewise constant, fixed total duration for NPH approach)

Stratum  duration  rate
All      18        20

Failure and dropout rates (piecewise constant, piecewise hazard ratio)

Stratum  duration  failRate    hr   dropoutRate
All      4         0.05776227  1.0  0.001
All      100       0.05776227  0.6  0.001

  • Simple to express to collaborators
    • Median control survival: 12 months
    • 4 month delay in benefit, HR = 0.6 thereafter
  • Able to approximate arbitrary enrollment, failure and dropout rates
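The two input tables above can be written down as data frames. The sketch below uses base R `data.frame()` so that it runs without extra packages; the slides load tibble, so `tibble::tibble()` with the same arguments is presumably what was used. The column names and values are taken from the tables; the rest is an assumption.

```r
# Enrollment: 20 patients per month for 18 months, single stratum
enrollRates <- data.frame(Stratum = "All", duration = 18, rate = 20)

# Failure and dropout: control median of 12 months gives hazard log(2) / 12;
# hazard ratio is 1 for the first 4 months and 0.6 afterwards
failRates <- data.frame(
  Stratum = "All",
  duration = c(4, 100),
  failRate = log(2) / 12,
  hr = c(1, 0.6),
  dropoutRate = 0.001
)

# Quick checks on what the inputs imply
control_median <- log(2) / failRates$failRate[1]             # months
total_enrolled <- sum(enrollRates$duration * enrollRates$rate)
```

The second `duration` of 100 months simply extends the last piece beyond any planned analysis time.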

Other input parameters

# Study duration in months
studyDuration <- 36
# Experimental / Control randomization ratio
ratio <- 1
# 1-sided Type I error
alpha <- 0.025
# Type II error (power may be a bad argument choice)
beta <- .1

Desire is to make grammar flexible, simple, consistent

  • Rigorous interface (API) for constructing and validating inputs with sensible defaults
  • Composable operations for creating pipe-friendly workflows
  • Unified underlying data representation for extensibility
  • Work is underway; starting efforts with R7 (Wickham, 2022)

R7 thoughts

  • Test version for time-to-event group sequential design written in R7
    • Old version in S3
    • We have also tested Q7 as another OOP system
    • R7 is in alpha mode (very early)
    • Our initial experience (~500 lines of code) is very promising
  • OOP: object oriented programming
    • Rigorous input checking at time of design construction
    • Checking gives more useful error detection and feedback
    • Enable extensibility
  • Grammar
    • Opportunity to reconsider grammar standards from 2009 (Revolution Computing)
    • Concise and clear naming conventions
    • Expanding capabilities
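To illustrate what "rigorous input checking at time of design construction" can look like, here is a minimal sketch in base R S3 rather than R7, whose API was still changing. The constructor `new_enroll_rates()` and the generic `total_n()` are hypothetical names invented for this example; they are not part of gsDesign2.

```r
# Hypothetical constructor: validate inputs once, when the object is built
new_enroll_rates <- function(stratum = "All", duration, rate) {
  stopifnot(is.character(stratum), is.numeric(duration), is.numeric(rate))
  if (length(duration) != length(rate)) {
    stop("`duration` and `rate` must have the same length", call. = FALSE)
  }
  if (any(duration <= 0)) stop("`duration` must be positive", call. = FALSE)
  if (any(rate < 0)) stop("`rate` must be non-negative", call. = FALSE)
  structure(
    data.frame(Stratum = stratum, duration = duration, rate = rate),
    class = c("enroll_rates", "data.frame")
  )
}

# Hypothetical generic: downstream code can rely on a validated object
total_n <- function(x) UseMethod("total_n")
total_n.enroll_rates <- function(x) sum(x$duration * x$rate)

er <- new_enroll_rates(duration = 18, rate = 20)
n_planned <- total_n(er)

# An invalid input fails immediately with a targeted message
bad <- tryCatch(
  new_enroll_rates(duration = -1, rate = 20),
  error = function(e) conditionMessage(e)
)
```

Because validation happens in the constructor, methods such as `total_n()` do not need to repeat the checks.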

Average hazard ratio

Easy to describe expected effect over time

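# ggplot2 supplies ggplot(), aes(), geom_line(), ggtitle(), scale_x_continuous()
library(ggplot2)

# Average hazard ratio at each analysis time (months) under the piecewise model
AHR(
  enrollRates = enrollRates,
  failRates = failRates,
  totalDuration = c(.01, seq(4, 4.5, .1), 5:36),
  ratio = 1
) %>%
  ggplot(aes(x = Time, y = AHR)) +
  geom_line() +
  ggtitle("Geometric mean for hazard ratio by Cox model") +
  scale_x_continuous(breaks = seq(0, 36, 12))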
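The average hazard ratio can also be worked out by hand for this scenario. The sketch below is in base R and assumes that the AHR is the geometric mean of the piecewise hazard ratios weighted by the expected events in each piece, summed over both arms; that weighting is an assumption of this example rather than a statement of the gsDesign2 internals. All function names are hypothetical.

```r
lambda <- log(2) / 12   # control hazard (12-month median)
eta <- 0.001            # dropout hazard
delay <- 4              # months before the benefit starts
hr_late <- 0.6          # hazard ratio after the delay
enroll_duration <- 18   # months of enrollment
enroll_rate <- 20       # patients per month, both arms combined

# Probability of an observed event in each piece for f months of follow-up;
# hr2 is the hazard ratio applied after the delay (1 for control)
p_event <- function(f, hr2) {
  a1 <- lambda + eta
  a2 <- lambda * hr2 + eta
  early <- lambda / a1 * (1 - exp(-a1 * pmin(f, delay)))
  late <- exp(-a1 * delay) * lambda * hr2 / a2 *
    (1 - exp(-a2 * pmax(f - delay, 0)))
  list(early = early, late = late)
}

# Expected events per arm by piece at calendar time t_cal,
# integrating over uniformly distributed entry times
expected_events <- function(t_cal, hr2) {
  entry_end <- min(t_cal, enroll_duration)
  early <- integrate(function(e) p_event(t_cal - e, hr2)$early, 0, entry_end)$value
  late <- integrate(function(e) p_event(t_cal - e, hr2)$late, 0, entry_end)$value
  enroll_rate / 2 * c(early = early, late = late)
}

# Event-weighted geometric mean of the piecewise hazard ratios
ahr_at <- function(t_cal) {
  ev <- expected_events(t_cal, 1) + expected_events(t_cal, hr_late)
  c(events = sum(ev), AHR = exp(sum(ev * log(c(1, hr_late))) / sum(ev)))
}

res36 <- ahr_at(36)
res03 <- ahr_at(3)   # before the delay ends, every event has HR = 1
```

Before month 4 the AHR is exactly 1; it then falls toward 0.6 as a growing share of events occurs after the delay.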


Using asymptotic calculations to examine design robustness

What questions can we answer without simulation?

  • What is the impact of different HR?
  • What is the impact of delayed effect or crossing survival?
  • What is the impact of strata prevalence?
  • What is the impact of enrollment?
  • What is the effect of follow-up duration?
  • What is the impact of interim analysis (IA) timing?
  • What is the impact of adding interim analysis?
  • What is the impact of different futility bounds?
  • What is the impact of different statistical tests?
  • What is the impact of alpha allocation?
  • What is the impact of incorporating parametric tests accounting for correlations?

Example for scenario and test method comparison

  • Fixed sample size only
  • 18 months expected enrollment
  • 4 different failure rate scenarios
    • All have doubling of 36 month survival
  • 9 different statistical tests

Get sample size for logrank using average hazard ratio

Method: AHR = average hazard ratio for NPH (Mukhopadhyay et al. 2020)

x <- fixed_design(
  x = "AHR", alpha = alpha, power = 1 - beta, ratio = 1,
  enrollRates = enrollRates, failRates = failRates,
  studyDuration = studyDuration
)
x %>% summary() %>% as_gt()
Fixed Design under AHR Method [1]
Design  N        Events    Time  Bound     alpha  Power
AHR     463.078  324.7077  36    1.959964  0.025  0.9
[1] Power computed with average hazard ratio method.
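Two pieces of arithmetic sit behind a table like this. The Bound is the standard normal quantile for the one-sided alpha, and the required number of events can be approximated with the Schoenfeld formula. The sketch below swaps in that textbook approximation; it is not the variance calculation gsDesign2 uses, so it will not reproduce the Events column exactly. The function name `schoenfeld_events()` and the hazard ratio of 0.7 are illustrative assumptions, not values from the slides.

```r
# Schoenfeld approximation to the events needed for a logrank test
# hr: assumed (average) hazard ratio; ratio: experimental / control allocation
schoenfeld_events <- function(hr, alpha = 0.025, beta = 0.1, ratio = 1) {
  z_alpha <- qnorm(1 - alpha)
  z_beta <- qnorm(1 - beta)
  p <- ratio / (1 + ratio)   # fraction randomized to experimental
  (z_alpha + z_beta)^2 / (p * (1 - p) * log(hr)^2)
}

# Efficacy bound on the Z scale for one-sided alpha = 0.025
bound <- qnorm(1 - 0.025)

# Illustrative only: events needed if the average hazard ratio were 0.7
events_needed <- schoenfeld_events(hr = 0.7)
```

The closer the average hazard ratio is to 1, the more events are needed, which is why delaying the treatment effect inflates the sample size.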

Other methods available: Lachin and Foulkes (Lachin and Foulkes 1986), Fleming-Harrington (Harrington and Fleming 1982), MaxCombo (Karrison et al. 2016; Roychoudhury et al. 2021), Modestly Weighted Logrank (Magirr and Burman 2019), Milestone difference, RMST. Many of these are implemented by the npsurvSS package (Yung and Liu 2019).

Verify asymptotic power approximation by simulation

Compare power of tests under 4 month effect delay scenario
Design                                       N      Events  Time  Bound     alpha   Power   Simulated power [1]  Simulated alpha
Average hazard ratio                         463.1  324.7   36    1.959964  0.0250  0.9000  0.8960               0.0253
Lachin and Foulkes                           463.1  328.9   36    1.959964  0.0250  0.9060  NA                   NA
Fleming-Harrington FH(0, 0) (logrank)        463.1  324.7   36    1.959964  0.0250  0.9029  0.8971               0.0226
Fleming-Harrington FH(0, 0.5)                463.1  324.7   36    1.959964  0.0250  0.9584  0.9533               0.0260
MaxCombo: logrank, FH(0, 0.5)                463.1  324.7   36    1.959964  0.0250  0.9565  0.9415               0.0255
MaxCombo: logrank, FH(0, 0.5), FH(0.5, 0.5)  463.1  324.7   36    1.959964  0.0250  0.9585  0.9455               0.0276
Modestly weighted LR: tau = 4                463.1  324.7   36    1.959964  0.0250  0.9198  0.9180               0.0233
Modestly weighted LR: tau = 12               463.1  324.7   36    1.959964  0.0250  0.9449  0.9383               0.0215
Modestly weighted LR: tau = 18               463.1  324.7   36    1.959964  0.0250  0.9486  0.9404               0.0234
RMST: tau = 36                               463.1  324.7   36    1.959964  0.0250  0.8760  0.8883               0.0277
[1] Simulated power and alpha are based on 10,000 simulations.
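A check of this kind can be sketched without simtrial. The example below simulates the 4-month delay scenario in base R and uses `survdiff()` from the survival package for the logrank test. It is a stand-in for the slides' simulation, not the code that produced the table: it uses 500 replicates instead of 10,000, rounds N up to 232 per arm, and draws entry times uniformly over 18 months. The function name `simulate_trial()` is hypothetical.

```r
library(survival)  # recommended package shipped with R; provides survdiff()

set.seed(2022)
lambda <- log(2) / 12   # control hazard (12-month median)

# Simulate one trial and return the signed logrank Z statistic
simulate_trial <- function(n_per_arm = 232) {
  n <- 2 * n_per_arm
  arm <- rep(c(0, 1), each = n_per_arm)   # 0 = control, 1 = experimental
  entry <- runif(n, 0, 18)                # calendar time of enrollment
  # Piecewise exponential: hazard lambda for 4 months, then lambda * hr
  early <- rexp(n, lambda)
  late <- 4 + rexp(n, lambda * ifelse(arm == 1, 0.6, 1))
  event_time <- ifelse(early < 4, early, late)
  dropout <- rexp(n, 0.001)
  followup <- 36 - entry                  # analysis at calendar month 36
  time <- pmin(event_time, dropout, followup)
  status <- as.integer(event_time <= pmin(dropout, followup))
  fit <- survdiff(Surv(time, status) ~ arm)
  # Negative Z means fewer experimental-arm events than expected under the null
  (fit$obs[2] - fit$exp[2]) / sqrt(fit$var[2, 2])
}

z <- replicate(500, simulate_trial())
power_hat <- mean(z < qnorm(0.025))   # one-sided test at alpha = 0.025
```

With only 500 replicates the Monte Carlo standard error is a little over 0.01, so the estimate should land near the logrank power in the table but will not match it to three decimals.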

Comparing different tests for robust power

Scenarios considered

Strong null addresses Magirr and Burman (2019), Freidlin and Korn (2019)

Check for robust power by test and scenario

  • Scenarios focused on long-term benefit, not short-term trade-offs
  • Logrank (average hazard ratio) and RMST lose considerable power with delayed benefit
  • Many weighted logrank and combination tests retain good power across scenarios
    • Modestly weighted logrank may need to down-weight for much longer than effect delay

Strong null: Need to control \(\alpha=0.025\) across entire null space

Scenario: Strong null
Test                                         alpha
Logrank                                      0.0029
Fleming-Harrington FH(0, 0.5)                0.0163
MaxCombo: logrank, FH(0, 0.5)                0.0163
MaxCombo: logrank, FH(0, 0.5), FH(0.5, 0.5)  0.0166
MaxCombo: logrank, FH(0, 1)                  0.0344
MaxCombo: logrank, FH(0, 1), FH(1, 1)        0.0366
Modestly weighted LR: tau = 4                0.0043
Modestly weighted LR: tau = 12               0.0098
Modestly weighted LR: tau = 18               0.0132
RMST: tau = 36                               0.0022
Milestone: tau = 24                          0.0135
Milestone: tau = 30                          0.0203


  • Excess Type I error with too much early down-weighting (FH(0,1))
    • In spite of 18 month enrollment
  • Type I error well-controlled by less down-weighting
    • e.g., Modestly Weighted logrank (Magirr and Burman 2019) or FH(0, 0.5)-based (Roychoudhury et al. 2021) tests

Summary

  • Description of useful new design features
  • Initial grammar demonstrating
    • Ease of use
    • Broad applications to implement and compare designs
  • Work is ongoing for both grammar and more features
  • See Mukhopadhyay et al. (2022) for a systematic review of logrank vs RMST vs MaxCombo for 8+ years of immunotherapy trials in oncology

Thank you

References

Anderson, Keaven M, Zifang Guo, Jing Zhao, and Linda Z Sun. 2022. “A Unified Framework for Weighted Parametric Group Sequential Design.” Biometrical Journal.

Freidlin, Boris, and Edward L Korn. 2019. “Methods for Accommodating Nonproportional Hazards in Clinical Trials: Ready for the Primary Analysis?” Journal of Clinical Oncology 37 (35): 3455.

Harrington, David P, and Thomas R Fleming. 1982. “A Class of Rank Test Procedures for Censored Survival Data.” Biometrika 69 (3): 553–66.

Karrison, Theodore G et al. 2016. “Versatile Tests for Comparing Survival Curves Based on Weighted Log-Rank Statistics.” Stata Journal 16 (3): 678–90.

Lachin, John M., and Mary A. Foulkes. 1986. “Evaluation of Sample Size and Power for Analyses of Survival with Allowance for Nonuniform Patient Entry, Losses to Follow-up, Noncompliance, and Stratification.” Biometrics 42: 507–19.

Magirr, Dominic, and Carl-Fredrik Burman. 2019. “Modestly Weighted Logrank Tests.” Statistics in Medicine 38 (20): 3782–90.

Magirr, Dominic, and José L Jiménez. 2022. “Design and Analysis of Group-Sequential Clinical Trials Based on a Modestly Weighted Log-Rank Test in Anticipation of a Delayed Separation of Survival Curves: A Practical Guidance.” Clinical Trials 19 (2): 201–10.

Mehrotra, Devan V, and Radha Railkar. 2000. “Minimum Risk Weights for Comparing Treatments in Stratified Binomial Trials.” Statistics in Medicine 19 (6): 811–25.

Mukhopadhyay, Pralay, Wenmei Huang, Paul Metcalfe, Fredrik Öhrn, Mary Jenner, and Andrew Stone. 2020. “Statistical and Practical Considerations in Designing of Immuno-Oncology Trials.” Journal of Biopharmaceutical Statistics 30 (6): 1130–46.

Mukhopadhyay, Pralay, Jiabu Ye, Keaven M Anderson, Satrajit Roychoudhury, Eric H Rubin, Susan Halabi, and Richard J Chappell. 2022. “Log-Rank Test Vs MaxCombo and Difference in Restricted Mean Survival Time Tests for Comparing Survival Under Nonproportional Hazards in Immuno-Oncology Trials: A Systematic Review and Meta-Analysis.” JAMA Oncology.

Roychoudhury, Satrajit, Keaven M Anderson, Jiabu Ye, and Pralay Mukhopadhyay. 2021. “Robust Design and Analysis of Clinical Trials with Nonproportional Hazards: A Straw Man Guidance from a Cross-Pharma Working Group.” Statistics in Biopharmaceutical Research, 1–15. https://doi.org/10.1080/19466315.2021.1874507.

Yung, Godwin, and Yi Liu. 2019. “Sample Size and Power for the Weighted Log-Rank Test and Kaplan-Meier Based Tests with Allowance for Nonproportional Hazards.” Biometrics.