“It is more fun to talk with someone who doesn’t use long, difficult words but rather short, easy words like ‘What about lunch?’” — A. A. Milne, *Winnie-the-Pooh*

September 1, 2022, Bremen

- Yilong Zhang, Meta Platforms, Inc.
- Nan Xiao, Merck & Co., Inc.
- Yujie Zhao, Merck & Co., Inc.

These materials do not represent corporate thoughts of Merck & Co., Inc., Rahway, NJ, USA and its affiliates, or Meta Platforms, Inc.

Keaven Anderson takes responsibility for any errors.

- Grammar of group sequential design is an ongoing initiative
  - Enabling new and potentially extensive capabilities
  - Simplified command structure
- Current status
  - Many capabilities implemented
  - Progress on grammar
- Today’s presentation
  - Fixed design example
  - Introduction to grammar priorities
- Group sequential bound grammar coming later

The aim is to support specific design innovations coming into common use (opinionated priorities).

- Non-proportional hazards asymptotics for group sequential design
  - Weighted logrank and combination test design: Roychoudhury et al. (2021); Magirr and Jiménez (2022)
- Many more bound options
  - Spending other than information-based
  - One or both of efficacy and futility bound at each analysis
- Time-to-event endpoint with parallel simulation
- Graphical multiplicity and parametric design (Anderson et al. 2022)
  - Account for group sequential and multiple hypothesis correlations
  - Multiple experimental arms (e.g., MAMS)
  - Intersecting populations
- Designs for stratified populations
  - Binomial or time-to-event outcomes with differing treatment effects (Mehrotra and Railkar 2000)

- **gsDesign**
  - Design and design update with **gsDesign** package
  - Save and reproduce designs
  - Design code and documentation generation (RMarkdown)
  - Video introduction at https://www.youtube.com/watch?v=8uZRuvzma9M
- **gMCPLite**
  - Recent work by Yalin Zhu, Xuan Deng
  - Graphical multiplicity with **ggplot2** graphics: https://merck.github.io/gMCPLite/
  - Removes rJava to make a lightweight package
  - Package at https://cran.r-project.org/web/packages/gMCPLite/index.html
  - Post issues at https://github.com/Merck/gMCPLite
  - Ongoing work includes Shiny interface
- **gsDesign** ongoing updates for PH
- **gsDesign2** for non-proportional hazards
  - Asymptotic design (**gsdmvn** now being merged into **gsDesign2**)
  - Simulation (**simtrial**)
- Graphical multiplicity with group sequential design (Anderson et al. 2022)

There are also many non-Merck packages

```r
# New grammar and capabilities
library(gsdmvn)
# To be combined with gsDesign2
library(gsDesign2)
# Standalone time-to-event simulation
library(simtrial)
# Supported since 2007
library(gsDesign)
# tidyverse packages
library(tibble)
library(gt)
library(dplyr)
```

Enrollment rates (piecewise constant, fixed total duration for NPH approach)

Stratum | duration | rate |
---|---|---|
All | 18 | 20 |

Failure and dropout rates (piecewise constant, piecewise hazard ratio)

Stratum | duration | failRate | hr | dropoutRate |
---|---|---|---|---|
All | 4 | 0.05776227 | 1.0 | 0.001 |
All | 100 | 0.05776227 | 0.6 | 0.001 |

- Simple to express to collaborators
  - Median control survival: 12 months
  - 4 month delay in benefit, HR = 0.6 thereafter

- Able to approximate arbitrary enrollment, failure and dropout rates
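
The rate tables above translate directly into R inputs. The following is a minimal sketch, assuming the column names shown in the tables are the ones the design functions expect. It uses base `data.frame()` to stay dependency-free; with **tibble** loaded, `tibble()` with the same columns would be the more natural choice. The control failure rate follows from the 12-month median under an exponential model.

```r
# Enrollment: 20 patients per month for 18 months, single stratum
enrollRates <- data.frame(Stratum = "All", duration = 18, rate = 20)

# Control failure rate implied by a 12-month exponential median: log(2) / median
controlMedian <- 12
controlRate <- log(2) / controlMedian

# HR = 1 for the first 4 months (delay), HR = 0.6 thereafter;
# the 100-month second interval simply extends past the end of the study
failRates <- data.frame(
  Stratum = "All",
  duration = c(4, 100),
  failRate = controlRate,
  hr = c(1, 0.6),
  dropoutRate = 0.001
)

print(enrollRates)
print(failRates)
```

The computed `failRate` reproduces the 0.05776227 shown in the failure rate table.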

```r
# Study duration in months
studyDuration <- 36
# Experimental / Control randomization ratio
ratio <- 1
# 1-sided Type I error
alpha <- 0.025
# Type II error (power may be a bad argument choice)
beta <- .1
```

- Rigorous interface (API) for constructing and validating inputs with sensible defaults
- Composable operations for creating pipe-friendly workflows
- Unified underlying data representation for extensibility
- Work is underway; starting efforts with R7 (Wickham, 2022)

- Test version for time-to-event group sequential design written in R7
- Old version in S3
- We have also tested Q7 as another OOP system
- R7 is in *alpha* mode (very early)
- Our initial experience (~500 lines of code) is very promising

- OOP: object oriented programming
  - Rigorous input checking at time of design construction
  - Checking gives more useful error detection and feedback
  - Enable extensibility
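
To make "input checking at time of design construction" concrete, here is a minimal sketch using base R S3 as a stand-in. It is not the R7 test version described in these slides; the constructor `new_enroll_rates()` and the class name `enroll_rates` are hypothetical names chosen for illustration.

```r
# Hypothetical constructor: inputs are validated when the object is built,
# so mistakes surface at construction rather than deep inside a calculation
new_enroll_rates <- function(duration, rate, stratum = "All") {
  if (!is.numeric(duration) || any(duration <= 0)) {
    stop("`duration` must be positive numbers (months).", call. = FALSE)
  }
  if (!is.numeric(rate) || any(rate < 0)) {
    stop("`rate` must be non-negative numbers (patients per month).", call. = FALSE)
  }
  if (length(duration) != length(rate)) {
    stop("`duration` and `rate` must have the same length.", call. = FALSE)
  }
  structure(
    list(stratum = stratum, duration = duration, rate = rate),
    class = "enroll_rates"
  )
}

# A class-specific print method is a small example of extensibility
print.enroll_rates <- function(x, ...) {
  cat(
    "Enrollment over", sum(x$duration), "months,",
    sum(x$duration * x$rate), "patients expected\n"
  )
  invisible(x)
}

ok <- new_enroll_rates(duration = 18, rate = 20)
print(ok)

# An invalid input fails immediately with a readable message
bad <- tryCatch(
  new_enroll_rates(duration = 18, rate = -5),
  error = function(e) conditionMessage(e)
)
print(bad)
```

Because the check lives in the constructor, every downstream function can assume a valid object and skip its own defensive checks.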

- Grammar
  - Opportunity to reconsider grammar standards from 2009 (Revolution Computing)
  - Concise and clear naming conventions
  - Expanding capabilities

Easy to describe expected effect over time

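```r
library(ggplot2) # needed for ggplot(), aes(), geom_line()

# Average hazard ratio over a grid of analysis times, plotted against time
AHR(
  enrollRates = enrollRates,
  failRates = failRates,
  totalDuration = c(.01, seq(4, 4.5, .1), 5:36),
  ratio = 1
) %>%
  ggplot(aes(x = Time, y = AHR)) +
  geom_line() +
  ggtitle("Geometric mean for hazard ratio by Cox model") +
  scale_x_continuous(breaks = seq(0, 36, 12))
```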
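
The plot title describes the average hazard ratio as a geometric mean. One common formulation, assumed here rather than taken from the slides, weights the log hazard ratio in each piecewise interval by the expected number of events in that interval. The sketch below illustrates that idea only; `weighted_geometric_hr()` is a hypothetical helper and the event counts are made up.

```r
# Event-weighted geometric mean of piecewise hazard ratios
# events: expected events in each interval; hr: hazard ratio in each interval
weighted_geometric_hr <- function(events, hr) {
  stopifnot(length(events) == length(hr), all(events >= 0), all(hr > 0))
  exp(sum(events * log(hr)) / sum(events))
}

# Made-up event counts: 100 events during the 4-month delay (HR = 1),
# 200 events afterwards (HR = 0.6)
ahr <- weighted_geometric_hr(events = c(100, 200), hr = c(1, 0.6))
print(ahr)
```

As follow-up lengthens, a larger share of events falls in the HR = 0.6 interval, so the average moves from 1 towards 0.6, which is the shape the plotted curve is expected to show.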

- What is the impact of different HR?
- What is the impact of delayed effect or crossing survival?
- What is the impact of strata prevalence?
- What is the impact of enrollment?
- What is the effect of follow-up duration?
- What is the impact of interim analysis (IA) timing?
- What is the impact of adding interim analysis?
- What is the impact of different futility bounds?
- What is the impact of different statistical tests?
- What is the impact of alpha allocation?
- What is the impact of incorporating parametric tests accounting for correlations?

- Fixed sample size only
- 18 months expected enrollment
- 4 different failure rate scenarios
  - All have doubling of 36 month survival

- 9 different statistical tests
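
As a rough check of the 36-month survival in the delayed-effect scenario defined by the rate tables earlier, the piecewise exponential model can be evaluated by hand. This is a sketch under simplifying assumptions: dropout is ignored, only the one scenario with a 4-month delay and HR = 0.6 thereafter is covered, and `cum_hazard()` is a hypothetical helper.

```r
# Cumulative hazard at time t for piecewise-constant rates
# duration: interval lengths; rate: hazard rate in each interval
cum_hazard <- function(t, duration, rate) {
  starts <- c(0, cumsum(duration))[seq_along(duration)]
  exposure <- pmin(pmax(t - starts, 0), duration) # time spent in each interval
  sum(exposure * rate)
}

controlRate <- log(2) / 12 # 12-month control median
duration <- c(4, 100)
hr <- c(1, 0.6)

surv_control <- exp(-cum_hazard(36, duration, rep(controlRate, 2)))
surv_experimental <- exp(-cum_hazard(36, duration, controlRate * hr))

print(c(control = surv_control, experimental = surv_experimental))
print(surv_experimental / surv_control)
```

Under these assumptions the experimental arm has roughly twice the control arm's 36-month survival, in line with the scenario description.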

Method: AHR = average hazard ratio for NPH (Mukhopadhyay et al. 2020)

```r
x <- fixed_design(
  x = "AHR",
  alpha = alpha,
  power = 1 - beta,
  ratio = 1,
  enrollRates = enrollRates,
  failRates = failRates,
  studyDuration = studyDuration
)
x %>%
  summary() %>%
  as_gt()
```

Fixed Design under AHR Method^{1}

Design | N | Events | Time | Bound | alpha | Power |
---|---|---|---|---|---|---|
AHR | 463.078 | 324.7077 | 36 | 1.959964 | 0.025 | 0.9 |

^{1} Power computed with average hazard ratio method.
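
The Bound column is the one-sided standard normal critical value for alpha = 0.025. For orientation, the sketch below also evaluates Schoenfeld's classical approximation for the number of events needed by a logrank test under proportional hazards. That formula is a familiar benchmark only, not the AHR calculation behind the table, and `schoenfeld_events()` is a hypothetical helper name.

```r
alpha <- 0.025 # 1-sided Type I error
beta <- 0.1    # Type II error

# One-sided critical value on the Z scale
bound <- qnorm(1 - alpha)
print(bound)

# Schoenfeld approximation: events required under a constant hazard ratio
# ratio: experimental / control randomization ratio
schoenfeld_events <- function(hr, alpha, beta, ratio = 1) {
  (1 + ratio)^2 / ratio * (qnorm(1 - alpha) + qnorm(1 - beta))^2 / log(hr)^2
}

events_hr06 <- schoenfeld_events(hr = 0.6, alpha = alpha, beta = beta)
events_hr07 <- schoenfeld_events(hr = 0.7, alpha = alpha, beta = beta)
print(c(hr_0.6 = events_hr06, hr_0.7 = events_hr07))
```

A weaker constant hazard ratio needs more events. A delayed effect dilutes the average effect relative to the eventual HR of 0.6, which is why the design needs more events than a constant HR of 0.6 would suggest.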

Other methods available: Lachin and Foulkes (1986), Fleming-Harrington (Harrington and Fleming 1982), MaxCombo (Karrison et al. 2016; Roychoudhury et al. 2021), modestly weighted logrank (Magirr and Burman 2019), milestone difference, and RMST. Many of these are implemented using the **npsurvSS** package (Yung and Liu 2019).

Compare power of tests under 4 month effect delay scenario

Design | N | Events | Time | Bound | alpha | Power | Simulated power^{1} | Simulated alpha |
---|---|---|---|---|---|---|---|---|
Average hazard ratio | 463.1 | 324.7 | 36 | 1.959964 | 0.0250 | 0.9000 | 0.8960 | 0.0253 |
Lachin and Foulkes | 463.1 | 328.9 | 36 | 1.959964 | 0.0250 | 0.9060 | NA | NA |
Fleming-Harrington FH(0, 0) (logrank) | 463.1 | 324.7 | 36 | 1.959964 | 0.0250 | 0.9029 | 0.8971 | 0.0226 |
Fleming-Harrington FH(0, 0.5) | 463.1 | 324.7 | 36 | 1.959964 | 0.0250 | 0.9584 | 0.9533 | 0.0260 |
MaxCombo: logrank, FH(0, 0.5) | 463.1 | 324.7 | 36 | 1.959964 | 0.0250 | 0.9565 | 0.9415 | 0.0255 |
MaxCombo: logrank, FH(0, 0.5), FH(0.5, 0.5) | 463.1 | 324.7 | 36 | 1.959964 | 0.0250 | 0.9585 | 0.9455 | 0.0276 |
Modestly weighted LR: tau = 4 | 463.1 | 324.7 | 36 | 1.959964 | 0.0250 | 0.9198 | 0.9180 | 0.0233 |
Modestly weighted LR: tau = 12 | 463.1 | 324.7 | 36 | 1.959964 | 0.0250 | 0.9449 | 0.9383 | 0.0215 |
Modestly weighted LR: tau = 18 | 463.1 | 324.7 | 36 | 1.959964 | 0.0250 | 0.9486 | 0.9404 | 0.0234 |
RMST: tau = 36 | 463.1 | 324.7 | 36 | 1.959964 | 0.0250 | 0.8760 | 0.8883 | 0.0277 |

^{1} Simulated power and alpha are based on 10,000 simulations.
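
With 10,000 simulations, the simulated power and alpha carry Monte Carlo error, which helps when comparing them with the asymptotic values. The sketch below uses the standard binomial standard error; `mc_se()` is a hypothetical helper name.

```r
# Monte Carlo standard error of a simulated proportion
mc_se <- function(p, n_sim) sqrt(p * (1 - p) / n_sim)

n_sim <- 10000
se_power <- mc_se(0.90, n_sim)  # near the targeted 90% power
se_alpha <- mc_se(0.025, n_sim) # near the nominal 1-sided alpha

print(c(power = se_power, alpha = se_alpha))

# Approximate 95% range for simulated alpha when the true level is 0.025
print(0.025 + c(-1, 1) * qnorm(0.975) * se_alpha)
```

Differences between simulated and nominal values that are small relative to these standard errors are compatible with simulation noise alone.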

Strong null addresses Magirr and Burman (2019), Freidlin and Korn (2019)

- Scenarios focused on long-term benefit, not short-term trade offs
- Logrank (average hazard ratio) and RMST lose considerable power with delayed benefit
- Many weighted logrank and combination tests retain good power across scenarios
- Modestly weighted logrank may need to down-weight for much longer than effect delay

Test | alpha |
---|---|
**Strong null** | |
Logrank | 0.0029 |
Fleming-Harrington FH(0, 0.5) | 0.0163 |
MaxCombo: logrank, FH(0, 0.5) | 0.0163 |
MaxCombo: logrank, FH(0, 0.5), FH(0.5, 0.5) | 0.0166 |
MaxCombo: logrank, FH(0, 1) | 0.0344 |
MaxCombo: logrank, FH(0, 1), FH(1, 1) | 0.0366 |
Modestly weighted LR: tau = 4 | 0.0043 |
Modestly weighted LR: tau = 12 | 0.0098 |
Modestly weighted LR: tau = 18 | 0.0132 |
RMST: tau = 36 | 0.0022 |
Milestone: tau = 24 | 0.0135 |
Milestone: tau = 30 | 0.0203 |

- Excess Type I error with too much early down-weighting (FH(0,1))
  - In spite of 18 month enrollment
- Type I error well-controlled by less down-weighting
  - e.g., modestly weighted logrank (Magirr and Burman 2019) or FH(0,0.5)-based (Roychoudhury et al. 2021) tests
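
To see what "early down-weighting" means, the weight functions behind these tests can be compared directly. The sketch below uses the standard definitions: the Fleming-Harrington FH(rho, gamma) weight is S(t-)^rho (1 - S(t-))^gamma, and the modestly weighted logrank weight of Magirr and Burman (2019) is 1 / max(S(t-), S(tau)), where S is the pooled survival estimate. The survival values and the choice S(tau) = 0.5 are illustrative only.

```r
# Fleming-Harrington FH(rho, gamma) weight at pooled survival s = S(t-)
fh_weight <- function(s, rho, gamma) s^rho * (1 - s)^gamma

# Modestly weighted logrank weight: 1 / max(S(t-), S(tau))
mwlr_weight <- function(s, s_tau) 1 / pmax(s, s_tau)

# Illustrative pooled survival values, from early to late follow-up
s <- c(0.95, 0.75, 0.50, 0.25)

weights <- data.frame(
  survival = s,
  logrank = fh_weight(s, 0, 0),     # FH(0, 0): constant weight
  fh_0_half = fh_weight(s, 0, 0.5), # sqrt(1 - S)
  fh_0_1 = fh_weight(s, 0, 1),      # 1 - S: strongest early down-weighting here
  mwlr = mwlr_weight(s, s_tau = 0.5)
)
print(weights)
```

FH(0, 1) gives early events weights close to zero, while the modestly weighted test never lets any weight fall below 1, which fits the pattern in the Type I error table.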

- Description of useful new design features
- Initial grammar demonstrating
  - Ease of use
  - Broad applications to implement and compare designs

- Work is ongoing for both grammar and more features
- See Mukhopadhyay et al. (2022) for a systematic review of logrank vs RMST vs MaxCombo for 8+ years of immunotherapy trials in oncology

*Email:* Keaven_Anderson@merck.com

Anderson, Keaven M, Zifang Guo, Jing Zhao, and Linda Z Sun. 2022. “A Unified Framework for Weighted Parametric Group Sequential Design.” *Biometrical Journal*.

Freidlin, Boris, and Edward L Korn. 2019. “Methods for Accommodating Nonproportional Hazards in Clinical Trials: Ready for the Primary Analysis?” *Journal of Clinical Oncology* 37 (35): 3455.

Harrington, David P, and Thomas R Fleming. 1982. “A Class of Rank Test Procedures for Censored Survival Data.” *Biometrika* 69 (3): 553–66.

Karrison, Theodore G et al. 2016. “Versatile Tests for Comparing Survival Curves Based on Weighted Log-Rank Statistics.” *Stata Journal* 16 (3): 678–90.

Lachin, John M., and Mary A. Foulkes. 1986. “Evaluation of Sample Size and Power for Analyses of Survival with Allowance for Nonuniform Patient Entry, Losses to Follow-up, Noncompliance, and Stratification.” *Biometrics* 42: 507–19.

Magirr, Dominic, and Carl-Fredrik Burman. 2019. “Modestly Weighted Logrank Tests.” *Statistics in Medicine* 38 (20): 3782–90.

Magirr, Dominic, and José L Jiménez. 2022. “Design and Analysis of Group-Sequential Clinical Trials Based on a Modestly Weighted Log-Rank Test in Anticipation of a Delayed Separation of Survival Curves: A Practical Guidance.” *Clinical Trials* 19 (2): 201–10.

Mehrotra, Devan V, and Radha Railkar. 2000. “Minimum Risk Weights for Comparing Treatments in Stratified Binomial Trials.” *Statistics in Medicine* 19 (6): 811–25.

Mukhopadhyay, Pralay, Wenmei Huang, Paul Metcalfe, Fredrik Öhrn, Mary Jenner, and Andrew Stone. 2020. “Statistical and Practical Considerations in Designing of Immuno-Oncology Trials.” *Journal of Biopharmaceutical Statistics* 30 (6): 1130–46.

Mukhopadhyay, Pralay, Jiabu Ye, Keaven M Anderson, Satrajit Roychoudhury, Eric H Rubin, Susan Halabi, and Richard J Chappell. 2022. “Log-Rank Test Vs MaxCombo and Difference in Restricted Mean Survival Time Tests for Comparing Survival Under Nonproportional Hazards in Immuno-Oncology Trials: A Systematic Review and Meta-Analysis.” *JAMA Oncology*.

Roychoudhury, Satrajit, Keaven M Anderson, Jiabu Ye, and Pralay Mukhopadhyay. 2021. “Robust Design and Analysis of Clinical Trials with Nonproportional Hazards: A Straw Man Guidance from a Cross-Pharma Working Group.” *Statistics in Biopharmaceutical Research*, 1–15. https://doi.org/10.1080/19466315.2021.1874507.

Yung, Godwin, and Yi Liu. 2019. “Sample Size and Power for the Weighted Log-Rank Test and Kaplan-Meier Based Tests with Allowance for Nonproportional Hazards.” *Biometrics*.