Grammar of Group Sequential Design

September 1, 2022, Bremen

Philosophy

“It is more fun to talk with someone who doesn’t use long, difficult words but rather short, easy words like ‘What about lunch?’” — A. A. Milne, Winnie-the-Pooh

Co-authors

Yilong Zhang, Meta Platforms, Inc.
Nan Xiao, Merck & Co., Inc.
Yujie Zhao, Merck & Co., Inc.

Disclaimer

These materials do not represent corporate thoughts of Merck & Co., Inc., Rahway, NJ, USA and its affiliates, or Meta Platforms, Inc.

Keaven Anderson takes responsibility for any errors.

Overview

Grammar of group sequential design is an ongoing initiative
- Enabling new and potentially extensive capabilities
- Simplified command structure
Current status
- Many capabilities implemented
- Progress on grammar
Today’s presentation
- Fixed design example
- Introduction to grammar priorities
Group sequential bound grammar coming later

Aim is to support specific design innovations coming into common use

New capabilities represent opinionated priorities

Opinionated selection of new features

Non-proportional hazards asymptotics for group sequential design
- Weighted logrank and combination test design: Roychoudhury et al. (2021); Magirr and Jiménez (2022)
Many more bound options
- Spending other than information-based
- One or both of efficacy and futility bound at each analysis
Time-to-event endpoint with parallel simulation
Graphical multiplicity and parametric design (Anderson et al. 2022)
- Account for group sequential and multiple hypothesis correlations
- Multiple experimental arms (e.g., MAMS)
- Intersecting populations
Designs for stratified populations
- Binomial or time-to-event outcomes with differing treatment effects (Mehrotra and Railkar 2000)

Some Merck Shiny innovations

gsDesign
- Design and design update with gsDesign package
- Save and reproduce designs
- Design code and documentation generation (RMarkdown)
- Video introduction at https://www.youtube.com/watch?v=8uZRuvzma9M
gMCPLite
- Recent work by Yalin Zhu, Xuan Deng
- Graphical multiplicity with ggplot2 graphics https://merck.github.io/gMCPLite/
  - Removes RJava to make lightweight package
- Package at https://cran.r-project.org/web/packages/gMCPLite/index.html
  - Post issues at https://github.com/Merck/gMCPLite
- Ongoing work includes Shiny interface

Some Merck packages and repositories

gsDesign ongoing updates for PH
gsDesign2 for non-proportional hazards
- Asymptotic design (gsdmvn now being merged into gsDesign2)
- Simulation (simtrial)
Graphical multiplicity with group sequential design (Anderson et al. 2022)
- https://github.com/zifangguo/WPGSD_Supp_Programs

There are also many non-Merck packages

Packages used here

# New grammar and capabilities
library(gsdmvn)    # To be combined with gsDesign2
library(gsDesign2)

# Standalone time-to-event simulation
library(simtrial)

# Supported since 2007
library(gsDesign)

# tidyverse packages
library(tibble)
library(gt)
library(dplyr)
library(ggplot2)   # Needed for the AHR plot

Piecewise model

Set up input parameters

Enrollment rates (piecewise constant, fixed total duration for NPH approach)

Stratum	duration	rate
All	18	20

Failure and dropout rates (piecewise constant, piecewise hazard ratio)

Stratum	duration	failRate	hr	dropoutRate
All	4	0.05776227	1.0	0.001
All	100	0.05776227	0.6	0.001

Simple to express to collaborators
- Median control survival: 12 months
- 4 month delay in benefit, HR = 0.6 thereafter
Able to approximate arbitrary enrollment, failure and dropout rates
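As a minimal sketch, the two input tables above could be built as follows. Column names are copied from the tables, and the control failure rate is derived from the 12-month median as log(2) / 12. Base data frames are used here to keep the example dependency-free; `tibble()` accepts the same arguments. Whether the design functions accept a plain data frame is an assumption not checked here.

```r
# Enrollment: 20 patients per month for 18 months (single stratum)
enrollRates <- data.frame(Stratum = "All", duration = 18, rate = 20)

# A 12-month control median implies an exponential failure rate of log(2) / 12
control_rate <- log(2) / 12

# Failure and dropout: HR = 1 for the first 4 months, HR = 0.6 afterwards
failRates <- data.frame(
  Stratum = "All",
  duration = c(4, 100),
  failRate = control_rate,
  hr = c(1, 0.6),
  dropoutRate = 0.001
)

# Total enrollment implied by the enrollment table (before any rescaling)
n_enrolled <- sum(enrollRates$duration * enrollRates$rate)
```

The long second interval (100 months) simply ensures the late hazard ratio covers any follow-up time the design could reach.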

Other input parameters

# Study duration in months
studyDuration <- 36
# Experimental / Control randomization ratio
ratio <- 1
# 1-sided Type I error
alpha <- 0.025
# Type II error (power may be a bad argument choice)
beta <- .1

Desire is to make grammar flexible, simple, consistent

Rigorous interface (API) for constructing and validating inputs with sensible defaults
Composable operations for creating pipe-friendly workflows
Unified underlying data representation for extensibility
Work is underway; starting efforts with R7 (Wickham, 2022)

R7 thoughts

Test version for time-to-event group sequential design written in R7
- Old version in S3
- We have also tested Q7 as another OOP system
- R7 is in alpha mode (very early)
- Our initial experience (~500 lines of code) is very promising
OOP: object oriented programming
- Rigorous input checking at time of design construction
- Checking gives more useful error detection and feedback
- Enable extensibility
Grammar
- Opportunity to reconsider grammar standards from 2009 (Revolution Computing)
- Concise and clear naming conventions
- Expanding capabilities
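The idea of checking inputs at construction time can be illustrated with a small sketch. It uses base R S3 rather than R7, and every name in it (`new_enroll_rates`, `expected_n`) is hypothetical and not part of gsDesign2.

```r
# Hypothetical constructor: validate once, when the object is created
new_enroll_rates <- function(duration, rate, stratum = "All") {
  stopifnot(
    is.numeric(duration), is.numeric(rate),
    length(duration) == length(rate),
    all(duration > 0), all(rate >= 0)
  )
  structure(
    data.frame(Stratum = stratum, duration = duration, rate = rate),
    class = c("enroll_rates", "data.frame")
  )
}

# Hypothetical generic that can rely on the object already being valid
expected_n <- function(x) UseMethod("expected_n")
expected_n.enroll_rates <- function(x) sum(x$duration * x$rate)

er <- new_enroll_rates(duration = 18, rate = 20)
n <- expected_n(er)

# Invalid input fails at construction, not deep inside a later calculation
bad <- tryCatch(
  new_enroll_rates(duration = -1, rate = 20),
  error = function(e) "rejected"
)
```

Because validation happens in the constructor, downstream methods stay short and errors surface where the mistake was made.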

Average hazard ratio

Easy to describe expected effect over time

AHR(
  enrollRates = enrollRates,
  failRates = failRates,
  totalDuration = c(.01, seq(4, 4.5, .1), 5:36),
  ratio = 1
) %>%
  ggplot(aes(x = Time, y = AHR)) +
  geom_line() +
  ggtitle("Geometric mean for hazard ratio by Cox model") +
  scale_x_continuous(breaks = seq(0, 36, 12))
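To make the quantity being plotted concrete, here is a rough base R sketch of an average hazard ratio under the piecewise model above. It assumes the AHR is a geometric mean of the interval hazard ratios weighted by expected events in each interval, with uniform enrollment over 18 months. It illustrates the idea only and should not be read as the gsDesign2 implementation; all function names are made up for this example.

```r
lambda <- log(2) / 12  # control hazard (12-month median)
eta <- 0.001           # dropout hazard
delay <- 4             # months during which HR = 1
hr_late <- 0.6         # HR after the delay

# Probability that a patient with follow-up f has an event in the early
# interval [0, delay) or the late interval [delay, f); `hr` applies only late
p_event <- function(f, hr) {
  r1 <- lambda + eta
  r2 <- lambda * hr + eta
  early <- lambda / r1 * (1 - exp(-r1 * pmin(f, delay)))
  late <- exp(-r1 * delay) * (lambda * hr) / r2 *
    (1 - exp(-r2 * pmax(f - delay, 0)))
  list(early = early, late = late)
}

# Expected events per interval at a calendar analysis time, integrating
# over uniform enrollment and summing control and experimental arms
expected_events <- function(analysis_time, enroll_dur = 18, rate = 20) {
  per_arm <- rate / 2
  total <- function(part) {
    integrate(
      function(u) {
        f <- analysis_time - u
        per_arm * (p_event(f, 1)[[part]] + p_event(f, hr_late)[[part]])
      },
      lower = 0, upper = min(analysis_time, enroll_dur)
    )$value
  }
  c(early = total("early"), late = total("late"))
}

# Events-weighted geometric mean of the interval hazard ratios
ahr_sketch <- function(analysis_time) {
  d <- expected_events(analysis_time)
  exp(sum(d * log(c(1, hr_late))) / sum(d))
}

ahr_by_time <- sapply(c(4, 12, 24, 36), ahr_sketch)
```

Under these assumptions the value is 1 until the delay ends and then moves toward 0.6 as late events accumulate, which is the shape the plot is meant to convey.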

Using asymptotic calculations to examine design robustness

What questions can we answer without simulation?

What is the impact of different HR?
What is the impact of delayed effect or crossing survival?
What is the impact of strata prevalence?
What is the impact of enrollment?
What is the effect of follow-up duration?
What is the impact of interim analysis (IA) timing?
What is the impact of adding interim analysis?
What is the impact of different futility bounds?
What is the impact of different statistical tests?
What is the impact of alpha allocation?
What is the impact of incorporating parametric tests accounting for correlations?

Example for scenario and test method comparison

Fixed sample size only
18 months expected enrollment
4 different failure rate scenarios
- All have doubling of 36 month survival
9 different statistical tests

Get sample size for logrank using average hazard ratio

Method: AHR = average hazard ratio for NPH (Mukhopadhyay et al. 2020)

x <- fixed_design(
  x = "AHR", alpha = alpha, power = 1 - beta, ratio = 1,
  enrollRates = enrollRates, failRates = failRates,
  studyDuration = studyDuration
)
x %>% summary() %>% as_gt()

Fixed Design under AHR Method¹
Design	N	Events	Time	Bound	alpha	Power
AHR	463.078	324.7077	36	1.959964	0.025	0.9
¹ Power computed with average hazard ratio method.
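For intuition about how events, effect size and power relate, the textbook Schoenfeld approximation can be sketched in base R. This is a stand-in for illustration, not the calculation `fixed_design()` performs, and the hazard ratio of 0.7 is an assumed value chosen only for the example. The bound in the table is the one-sided normal critical value, `qnorm(1 - alpha)`.

```r
alpha <- 0.025
beta <- 0.1

# Critical value for a one-sided test at level alpha
bound <- qnorm(1 - alpha)

# Schoenfeld approximation: events needed for power 1 - beta,
# where ratio is the experimental / control randomization ratio
events_schoenfeld <- function(hr, alpha, beta, ratio = 1) {
  p <- ratio / (1 + ratio)
  (qnorm(1 - alpha) + qnorm(1 - beta))^2 / (p * (1 - p) * log(hr)^2)
}

# Inverse relationship: approximate power for a given number of events
power_schoenfeld <- function(events, hr, alpha, ratio = 1) {
  p <- ratio / (1 + ratio)
  pnorm(sqrt(events * p * (1 - p)) * abs(log(hr)) - qnorm(1 - alpha))
}

# Assumed (illustrative) average hazard ratio of 0.7
d <- events_schoenfeld(hr = 0.7, alpha = alpha, beta = beta)
pw <- power_schoenfeld(d, hr = 0.7, alpha = alpha)
```

Plugging an average hazard ratio into a proportional hazards formula like this is the basic idea behind the AHR method; the package additionally derives expected events and timing from the enrollment and failure rate inputs.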

Other methods available: Lachin and Foulkes (Lachin and Foulkes 1986), Fleming-Harrington (Harrington and Fleming 1982), MaxCombo (Karrison et al. 2016; Roychoudhury et al. 2021), Modestly Weighted Logrank (Magirr and Burman 2019), Milestone difference, RMST. Many of these are implemented in the npsurvSS package (Yung and Liu 2019).

Verify asymptotic power approximation by simulation

Compare power of tests under 4 month effect delay scenario
Design	N	Events	Time	Bound	alpha	Power	Simulated power¹	Simulated alpha
Average hazard ratio	463.1	324.7	36	1.959964	0.0250	0.9000	0.8960	0.0253
Lachin and Foulkes	463.1	328.9	36	1.959964	0.0250	0.9060	NA	NA
Fleming-Harrington FH(0, 0) (logrank)	463.1	324.7	36	1.959964	0.0250	0.9029	0.8971	0.0226
Fleming-Harrington FH(0, 0.5)	463.1	324.7	36	1.959964	0.0250	0.9584	0.9533	0.0260
MaxCombo: logrank, FH(0, 0.5)	463.1	324.7	36	1.959964	0.0250	0.9565	0.9415	0.0255
MaxCombo: logrank, FH(0, 0.5), FH(0.5, 0.5)	463.1	324.7	36	1.959964	0.0250	0.9585	0.9455	0.0276
Modestly weighted LR: tau = 4	463.1	324.7	36	1.959964	0.0250	0.9198	0.9180	0.0233
Modestly weighted LR: tau = 12	463.1	324.7	36	1.959964	0.0250	0.9449	0.9383	0.0215
Modestly weighted LR: tau = 18	463.1	324.7	36	1.959964	0.0250	0.9486	0.9404	0.0234
RMST: tau = 36	463.1	324.7	36	1.959964	0.0250	0.8760	0.8883	0.0277
¹ Simulated power and alpha are based on 10,000 simulations.
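When comparing the simulated and asymptotic columns, it helps to keep the Monte Carlo error of 10,000 simulations in mind. A short sketch using the usual binomial standard error (the helper name `mc_se` is made up for this example):

```r
# Binomial Monte Carlo standard error of an estimated proportion
mc_se <- function(p, n_sim) sqrt(p * (1 - p) / n_sim)

n_sim <- 10000

# Standard error near the target power of 0.9
se_power <- mc_se(0.9, n_sim)

# Standard error near the nominal alpha of 0.025
se_alpha <- mc_se(0.025, n_sim)

# Approximate 95% range for simulated alpha if the true value is 0.025
alpha_range <- 0.025 + c(-1, 1) * qnorm(0.975) * se_alpha
```

With 10,000 simulations the standard error is 0.003 for a power near 0.9, and simulated Type I error between roughly 0.022 and 0.028 is compatible with a true value of 0.025.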

Comparing different tests for robust power

Scenarios considered

Strong null addresses Magirr and Burman (2019), Freidlin and Korn (2019)

Check for robust power by test and scenario

Scenarios focused on long-term benefit, not short-term trade-offs

Logrank (average hazard ratio) and RMST lose considerable power with delayed benefit

Many weighted logrank and combination tests retain good power across scenarios
- Modestly weighted logrank may need to down-weight for much longer than effect delay

Strong null: Need to control \(\alpha=0.025\) across entire null space

Test	alpha
Strong null
Logrank	0.0029
Fleming-Harrington FH(0, 0.5)	0.0163
MaxCombo: logrank, FH(0, 0.5)	0.0163
MaxCombo: logrank, FH(0, 0.5), FH(0.5, 0.5)	0.0166
MaxCombo: logrank, FH(0, 1)	0.0344
MaxCombo: logrank, FH(0, 1), FH(1, 1)	0.0366
Modestly weighted LR: tau = 4	0.0043
Modestly weighted LR: tau = 12	0.0098
Modestly weighted LR: tau = 18	0.0132
RMST: tau = 36	0.0022
Milestone: tau = 24	0.0135
Milestone: tau = 30	0.0203

Excess Type I error with too much early down-weighting (FH(0,1))
- In spite of 18 month enrollment
Type I error well-controlled by less down-weighting
- e.g., modestly weighted logrank (Magirr and Burman 2019) or FH(0, 0.5)-based (Roychoudhury et al. 2021) tests
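The amount of early down-weighting can be seen by evaluating the weight functions directly. The sketch below uses the standard Fleming-Harrington weight S^rho (1 - S)^gamma and the modestly weighted logrank weight 1 / max(S(t), S(tau)). For illustration it assumes an exponential survival curve with a 12-month median in place of the pooled Kaplan-Meier estimate a real analysis would use.

```r
# Fleming-Harrington FH(rho, gamma) weight as a function of survival S
fh_weight <- function(S, rho, gamma) S^rho * (1 - S)^gamma

# Modestly weighted logrank weight: grows as 1 / S until tau, then is capped
mwlr_weight <- function(S, S_tau) 1 / pmax(S, S_tau)

# Illustrative survival curve (assumption): exponential, 12-month median
surv <- function(t) exp(-log(2) / 12 * t)

times <- c(1, 4, 12, 36)
S <- surv(times)

weights <- data.frame(
  time = times,
  logrank = fh_weight(S, 0, 0),
  fh_0_05 = fh_weight(S, 0, 0.5),
  fh_0_1 = fh_weight(S, 0, 1),
  mwlr_12 = mwlr_weight(S, surv(12))
)

# Weight at 36 months relative to weight at 1 month
late_vs_early <- sapply(weights[-1], function(w) w[4] / w[1])
```

FH(0, 1) gives events in the first month almost no weight relative to late events, FH(0, 0.5) is less extreme, and the modestly weighted logrank weight never drops below 1.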

Summary

Description of useful new design features
Initial grammar demonstrating
- Ease of use
- Broad applications to implement and compare designs
Work is ongoing for both grammar and more features
See Mukhopadhyay et al. (2022) for a systematic review of logrank vs RMST vs MaxCombo for 8+ years of immunotherapy trials in oncology

Thank you

Email: Keaven_Anderson@merck.com

References

Anderson, Keaven M, Zifang Guo, Jing Zhao, and Linda Z Sun. 2022. “A Unified Framework for Weighted Parametric Group Sequential Design.” Biometrical Journal.

Freidlin, Boris, and Edward L Korn. 2019. “Methods for Accommodating Nonproportional Hazards in Clinical Trials: Ready for the Primary Analysis?” Journal of Clinical Oncology 37 (35): 3455.

Harrington, David P, and Thomas R Fleming. 1982. “A Class of Rank Test Procedures for Censored Survival Data.” Biometrika 69 (3): 553–66.

Karrison, Theodore G et al. 2016. “Versatile Tests for Comparing Survival Curves Based on Weighted Log-Rank Statistics.” Stata Journal 16 (3): 678–90.

Lachin, John M., and Mary A. Foulkes. 1986. “Evaluation of Sample Size and Power for Analyses of Survival with Allowance for Nonuniform Patient Entry, Losses to Follow-up, Noncompliance, and Stratification.” Biometrics 42: 507–19.

Magirr, Dominic, and Carl-Fredrik Burman. 2019. “Modestly Weighted Logrank Tests.” Statistics in Medicine 38 (20): 3782–90.

Magirr, Dominic, and José L Jiménez. 2022. “Design and Analysis of Group-Sequential Clinical Trials Based on a Modestly Weighted Log-Rank Test in Anticipation of a Delayed Separation of Survival Curves: A Practical Guidance.” Clinical Trials 19 (2): 201–10.

Mehrotra, Devan V, and Radha Railkar. 2000. “Minimum Risk Weights for Comparing Treatments in Stratified Binomial Trials.” Statistics in Medicine 19 (6): 811–25.

Mukhopadhyay, Pralay, Wenmei Huang, Paul Metcalfe, Fredrik Öhrn, Mary Jenner, and Andrew Stone. 2020. “Statistical and Practical Considerations in Designing of Immuno-Oncology Trials.” Journal of Biopharmaceutical Statistics 30 (6): 1130–46.

Mukhopadhyay, Pralay, Jiabu Ye, Keaven M Anderson, Satrajit Roychoudhury, Eric H Rubin, Susan Halabi, and Richard J Chappell. 2022. “Log-Rank Test Vs MaxCombo and Difference in Restricted Mean Survival Time Tests for Comparing Survival Under Nonproportional Hazards in Immuno-Oncology Trials: A Systematic Review and Meta-Analysis.” JAMA Oncology.

Roychoudhury, Satrajit, Keaven M Anderson, Jiabu Ye, and Pralay Mukhopadhyay. 2021. “Robust Design and Analysis of Clinical Trials with Nonproportional Hazards: A Straw Man Guidance from a Cross-Pharma Working Group.” Statistics in Biopharmaceutical Research, 1–15. https://doi.org/10.1080/19466315.2021.1874507.

Yung, Godwin, and Yi Liu. 2019. “Sample Size and Power for the Weighted Log-Rank Test and Kaplan-Meier Based Tests with Allowance for Nonproportional Hazards.” Biometrics.
