March 8, 2024

Abstract

We consider group sequential design supporting multiple objectives, possibly including complexities such as non-proportional hazards due to a delayed treatment effect. Because off-the-shelf software has limited capabilities, we have developed multiple open-source R packages to support complex designs. The gsDesign2 package supports design and analysis under non-proportional hazards assumptions. For testing multiple hypotheses using graphical multiplicity methods, the gMCPLite package supports graphs and provides examples of how to analyze trials with multiple objectives. For trials testing hypotheses in multiple overlapping populations, or testing multiple experimental treatment groups versus a common control, the wpgsd package takes advantage of known correlations to test hypotheses efficiently. Finally, the simtrial package is a fast simulation tool for time-to-event endpoints. In addition to describing key package capabilities, we describe our processes to justify use in a regulatory-compliant environment.

Overview of design challenges considered

  • Group sequential design using spending bounds (gsDesign)
  • Testing multiple hypotheses in group sequential design
    • gMCPLite: part of gMCP plus ggplot2 graphics functionality
    • wpgsd package
      • MAMS (multi-arm-multi-stage)
      • Related populations (e.g., biomarker+, overall)
  • Non-proportional hazards with group sequential design (gsDesign2)
  • Fast simulation for time-to-event endpoints: simtrial

Hex sticker wall

Our design packages

| Package | Topic | Shiny interface | GitHub | Documentation | Unit test coverage |
|---|---|---|---|---|---|
| gsDesign | Design | https://rinpharma.shinyapps.io/gsdesign/ | https://github.com/keaven/gsDesign | https://keaven.github.io/gsDesign/ | 75% |
| gsDesign2 | NPH | Under construction | https://github.com/Merck/gsDesign2 | https://merck.github.io/gsDesign2/ | 77% |
| simtrial | TTE Simulation | ? | https://github.com/Merck/simtrial | https://merck.github.io/simtrial/ | 83% |
| gMCPLite | Graphical Multiplicity | https://rinpharma.shinyapps.io/gmcp/ | https://github.com/Merck/gMCPLite | https://merck.github.io/gMCPLite/ | 76% |
| wpgsd | Bounds for Correlated Testing | Not currently planned | https://github.com/Merck/wpgsd | https://merck.github.io/wpgsd/ | 79% |

Graphical multiplicity

  • Graph drawn with ggplot2; initially built with the Shiny app at https://rinpharma.shinyapps.io/gmcp/, which generated the gMCPLite code
  • \(\alpha\) is divided equally between the biomarker-positive (B+) and overall populations
  • Group sequential design and \(\alpha\)-reallocation available using Maurer and Bretz (2013)
  • Accounting for correlated tests, we can relax Maurer-Bretz bounds using Anderson et al. (2022) and the wpgsd package
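The \(\alpha\)-reallocation idea can be illustrated with the usual update rule for graphical procedures. The sketch below is plain Python, independent of gMCPLite; the function name `reject` is hypothetical, and the transition weights of 1 between the two hypotheses are an assumption (the slides only state the equal initial split).

```python
def reject(weights, g, j):
    """Update a multiplicity graph after hypothesis j is rejected.

    weights: local alpha weights, one per hypothesis (sum to at most 1)
    g: transition matrix; g[j][l] is the share of hypothesis j's weight
       passed to hypothesis l when j is rejected
    Returns (new_weights, new_g) with hypothesis j removed from the graph.
    """
    m = len(weights)
    new_w = [0.0] * m
    new_g = [[0.0] * m for _ in range(m)]
    for l in range(m):
        if l == j:
            continue
        # Hypothesis l inherits its share of the rejected hypothesis' weight
        new_w[l] = weights[l] + weights[j] * g[j][l]
        for k in range(m):
            if k in (j, l):
                continue
            denom = 1.0 - g[l][j] * g[j][l]
            if denom > 0:
                new_g[l][k] = (g[l][k] + g[l][j] * g[j][k]) / denom
    return new_w, new_g


# H1 = B+ population, H2 = overall population, alpha split equally.
weights = [0.5, 0.5]
g = [[0.0, 1.0], [1.0, 0.0]]  # assumed: all weight passes to the other hypothesis

w_after, g_after = reject(weights, g, 0)  # reject H1 (B+)
print(w_after)  # weight now available to H2
```

Under these assumptions, rejecting the B+ hypothesis lets the overall-population hypothesis be tested at the full \(\alpha\).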

Designs

gsDesign and its Shiny app

  • The gsDesign package supports group sequential clinical trial design, largely as presented by Jennison and Turnbull (1999). An easy-to-use web interface enables use without coding and generates code to reproduce a design; it is being enhanced on an ongoing basis to support more features.
  • Initial overall survival (OS) design for the B+ group assumes
    • \(\alpha= 0.01\), 90% power
    • Median control OS = 12 months
    • Hazard ratio (HR) for experimental treatment = 0.6
    • 12-month enrollment with a 6-month ramp-up
    • Analyses planned at 16, 26, and 36 months
    • O’Brien-Fleming-like spending bound
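As a rough plausibility check on these assumptions, the Schoenfeld approximation gives the number of events a fixed (single-analysis) design would need. This is a plain-Python sketch, not the gsDesign calculation; 1:1 randomization is an assumption, and the function name is hypothetical.

```python
from math import log
from statistics import NormalDist


def schoenfeld_events(hr, alpha, power, ratio=1.0):
    """Approximate events needed for a one-sided logrank test (fixed design).

    ratio is experimental:control allocation; 1.0 (assumed here) means 1:1.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    # (1 + ratio)^2 / ratio equals 4 under 1:1 randomization
    return (1 + ratio) ** 2 / ratio * (z_alpha + z_beta) ** 2 / log(hr) ** 2


events = schoenfeld_events(hr=0.6, alpha=0.01, power=0.90)
print(round(events, 1))
```

The result is just under 200 events, slightly below the 201 final events in the group sequential design, which is the expected direction since interim analyses with O’Brien-Fleming-like spending inflate the fixed-design requirement only a little.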

Design Using Calendar Timing
gsDesign::gsSurvCalendar()

| Analysis | Value | Efficacy |
|---|---|---|
| IA 1: 45% | p (1-sided) | 0.0001 |
| N: 284 | ~HR at bound | 0.4648 |
| Events: 91 | P(Cross) if HR=1 | 0.0001 |
| Month: 16 | P(Cross) if HR=0.6 | 0.1134 |
| IA 2: 79% | p (1-sided) | 0.0037 |
| N: 284 | ~HR at bound | 0.6542 |
| Events: 159 | P(Cross) if HR=1 | 0.0038 |
| Month: 26 | P(Cross) if HR=0.6 | 0.7118 |
| Final | p (1-sided) | 0.0088 |
| N: 284 | ~HR at bound | 0.7156 |
| Events: 201 | P(Cross) if HR=1 | 0.0100 |
| Month: 36 | P(Cross) if HR=0.6 | 0.9003 |

Information- or calendar-based spending supported
Table often incorporated directly into protocol
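Two columns of the table can be checked by hand. The sketch below is plain Python, not the gsDesign implementation. It assumes the "O’Brien-Fleming-like" bound is the Lan-DeMets spending function with information-based spending, and it uses the approximation Var(log HR) ≈ 4/events under 1:1 randomization for the "~HR at bound" column; both are assumptions, and the function names are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist

norm = NormalDist()


def ld_obf_spend(t, alpha):
    """Cumulative one-sided alpha spent at information fraction t
    (Lan-DeMets O'Brien-Fleming-like spending function)."""
    return 2 * (1 - norm.cdf(norm.inv_cdf(1 - alpha / 2) / sqrt(t)))


def hr_at_bound(p_nominal, events):
    """Approximate HR needed to cross a bound with one-sided nominal p-value
    p_nominal, using Var(log HR) ~ 4 / events (1:1 randomization assumed)."""
    z = norm.inv_cdf(1 - p_nominal)
    return exp(-2 * z / sqrt(events))


events = [91, 159, 201]                       # from the table
fractions = [d / events[-1] for d in events]  # information fractions
spent = [round(ld_obf_spend(t, 0.01), 4) for t in fractions]
print(spent)  # compare with the P(Cross) if HR=1 rows

hr_ia2 = hr_at_bound(0.0037, 159)
hr_final = hr_at_bound(0.0088, 201)
print(round(hr_ia2, 3), round(hr_final, 3))  # compare with ~HR at bound
```

The first interim is left out of the HR check because its nominal p-value is shown rounded to 0.0001, which is too coarse to invert reliably.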

gsDesign2

  • Introduction: The goal of gsDesign2 is to enable fixed or group sequential design under non-proportional hazards, including hazard ratios that change over time and/or differ between strata. It adds substantial flexibility on top of what gsDesign provides (Zhao et al. (2023)).
  • Reproducing the same design as gsDesign with a different sample size method: gs_design_ahr().
  • Results are close to, but slightly different from, gsDesign.
    • Does this mean one is wrong?
    • We will check using simulation!

Design for B+ Population Using gs_design_ahr()
AHR approximations of ~HR at bound

| Analysis | Time | N | Events | AHR | Information fraction | Bound | Z | Nominal p [1] | ~HR at bound [2] | Cumulative boundary crossing probability: alternate hypothesis | Cumulative boundary crossing probability: null hypothesis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15.9 | 278 | 89 | 0.6 | 0.44 | Efficacy | 3.67 | 0.0001 | 0.4520 | 0.1088 | 0.0001 |
| 2 | 25.8 | 278 | 155 | 0.6 | 0.77 | Efficacy | 2.69 | 0.0036 | 0.6453 | 0.6865 | 0.0036 |
| 3 | 36.2 | 278 | 198 | 0.6 | 1 | Efficacy | 2.37 | 0.0089 | 0.7121 | 0.9016 | 0.0100 |

[1] One-sided p-value for experimental vs. control treatment. Value < 0.5 favors experimental, > 0.5 favors control.
[2] Approximate hazard ratio to cross bound.

gsDesign vs. gsDesign2

Feature gsDesign gsDesign2
Nonproportional hazards ✔️
Allow skipping bound at an analysis ✔️
Integer-based sample size/event count ✔️ ✔️
Alternates to logrank for survival analysis ✔️
Calendar-based timing/spending ✔️ ✔️
HR bounds for futility ✔️
Stratified design for binomial ✔️
Shiny interface ✔️ Under construction
Maturity ✔️

Simulations

simtrial

  • simtrial: fast, extensible clinical trial simulation framework for time-to-event endpoints.
    • Backend based on data.table
  • For each simulation
    • Generate data: sim_pw_surv()
    • For each analysis
      • Cut the data for analysis: create_cutting(); function factory allowing complex rules
      • Create Z-value test for analysis: various tests available; logrank shown here
  • Across simulations: summarize trial outcomes
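The generate, cut, test, summarize loop above can be sketched in a few lines. This is plain Python rather than simtrial, and it is deliberately simplified: uniform enrollment with no ramp-up, exponential survival, no dropout, 1:1 allocation, and only the final analysis (201 events, nominal p = 0.0088 from the gsDesign table). All function names are hypothetical.

```python
import random
from math import log, sqrt
from statistics import NormalDist


def simulate_trial(n, enroll_months, median_control, hr, rng):
    """Return (calendar event time, entry time, arm) per subject; arm 1 = experimental."""
    rate_control = log(2) / median_control
    subjects = []
    for i in range(n):
        arm = i % 2                            # alternate arms for 1:1 allocation
        entry = rng.uniform(0, enroll_months)  # simplified: uniform enrollment
        rate = rate_control * (hr if arm == 1 else 1.0)
        subjects.append((entry + rng.expovariate(rate), entry, arm))
    return subjects


def cut_at_event_count(subjects, target_events):
    """Cut at the calendar time of the target_events-th event.
    Returns (follow-up time, event indicator, arm) for enrolled subjects."""
    cut_time = sorted(s[0] for s in subjects)[target_events - 1]
    data = []
    for event_cal, entry, arm in subjects:
        if entry >= cut_time:
            continue  # not yet enrolled at the cut
        data.append((min(event_cal, cut_time) - entry, event_cal <= cut_time, arm))
    return data


def logrank_z(data):
    """Logrank Z statistic; positive values favor the experimental arm."""
    data = sorted(data, key=lambda r: r[0])
    at_risk = [sum(1 for r in data if r[2] == a) for a in (0, 1)]
    o_minus_e, var, i = 0.0, 0.0, 0
    while i < len(data):
        t, j = data[i][0], i
        deaths, leaving = [0, 0], [0, 0]
        while j < len(data) and data[j][0] == t:  # group tied times
            leaving[data[j][2]] += 1
            if data[j][1]:
                deaths[data[j][2]] += 1
            j += 1
        n_tot, d_tot = sum(at_risk), sum(deaths)
        if d_tot > 0 and n_tot > 1:
            o_minus_e += deaths[1] - d_tot * at_risk[1] / n_tot
            var += (d_tot * at_risk[0] * at_risk[1] * (n_tot - d_tot)
                    / (n_tot ** 2 * (n_tot - 1)))
        at_risk = [at_risk[0] - leaving[0], at_risk[1] - leaving[1]]
        i = j
    return -o_minus_e / sqrt(var)


def simulate_power(n_sim, z_bound, hr=0.6, seed=2024):
    """Proportion of simulated trials whose final logrank Z crosses z_bound."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_sim):
        subjects = simulate_trial(284, 12, 12, hr, rng)
        if logrank_z(cut_at_event_count(subjects, 201)) >= z_bound:
            crossings += 1
    return crossings / n_sim


power = simulate_power(n_sim=300, z_bound=NormalDist().inv_cdf(1 - 0.0088))
print(power)
```

Because the interim analyses and the enrollment ramp-up are omitted, the estimate is only expected to land near the design power, not to reproduce the simulation results reported for simtrial.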

10k simulations for design power
Requiring event count and minimum follow-up helpful
| Cut criteria | Power | 95% CI |
|---|---|---|
| Event count | 88.6% | (88%, 89.2%) |
| Event count and minimum follow-up | 94.4% | (93.9%, 94.9%) |
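The width of these intervals follows from the number of simulations. A minimal sketch, assuming "10k" means 10,000 simulations, a normal-approximation (Wald) interval, and the reported power taken as the exact simulated proportion; the function name is hypothetical and the interval method actually used for the slides is not stated.

```python
from math import sqrt
from statistics import NormalDist


def power_ci(power_hat, n_sim, level=0.95):
    """Normal-approximation confidence interval for a simulated power estimate."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(power_hat * (1 - power_hat) / n_sim)
    return power_hat - half_width, power_hat + half_width


for p in (0.886, 0.944):
    lower, upper = power_ci(p, 10_000)
    print(f"{p:.1%}: ({lower:.1%}, {upper:.1%})")
```

With 10,000 simulations the half-width is about 0.5 to 0.6 percentage points, so the difference between the two cut criteria is far larger than simulation error.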

Non-proportional hazards

  • For the biomarker-negative (B-) population, assume a delayed effect.
    • First 3 months: HR = 1.
    • Thereafter: HR = 0.7.
    • B- is 30% of the overall population.
  • Sample size is determined by the B+ design already derived.
  • Spending for both B+ and overall determined by B+ information fraction.
  • Power computed here using gs_power_ahr().
  • Design assumes stratified analysis (B+, B-).
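To see why the average hazard ratio (AHR) changes across analyses under these assumptions, it helps to view it as a geometric mean of the piecewise hazard ratios weighted by expected events. The sketch below is plain Python and reflects our reading of that idea, not the gsDesign2 code; the event counts are hypothetical and are not taken from the design.

```python
from math import exp, log


def average_hazard_ratio(pieces):
    """Event-weighted geometric mean of hazard ratios.

    pieces: iterable of (expected_events, hazard_ratio), one entry per
    stratum and time interval.
    """
    total = sum(d for d, _ in pieces)
    return exp(sum(d * log(hr) for d, hr in pieces) / total)


# Hypothetical expected event counts (illustration only)
pieces = [
    (70, 0.6),  # B+ stratum, constant HR
    (10, 1.0),  # B- stratum, first 3 months: no effect
    (20, 0.7),  # B- stratum, after month 3
]
ahr = average_hazard_ratio(pieces)
print(round(ahr, 3))
```

As follow-up lengthens, a larger share of B- events falls in the period with HR = 0.7, which pulls the AHR down over time.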

Overall Population Design
AHR approximations of ~HR at bound

| Analysis | Time | N | Events | AHR | Information fraction | Bound | Z | Nominal p [1] | ~HR at bound [2] | Cumulative boundary crossing probability: alternate hypothesis | Cumulative boundary crossing probability: null hypothesis [3] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15.9 | 398 | 130 | 0.65 | 0.45 | Efficacy | 3.63 | 0.0001 | 0.5227 | 0.1304 | 0.0001 |
| 2 | 25.9 | 398 | 225 | 0.63 | 0.78 | Efficacy | 2.67 | 0.0038 | 0.6971 | 0.7993 | 0.0038 |
| 3 | 36 | 398 | 284 | 0.62 | 1 | Efficacy | 2.37 | 0.0088 | 0.7530 | 0.9617 | 0.0100 |

[1] One-sided p-value for experimental vs. control treatment. Value < 0.5 favors experimental, > 0.5 favors control.
[2] Approximate hazard ratio to cross bound.
[3] \(\alpha\)-spending determined by B+ information fraction.

wpgsd

Weighted parametric group sequential design

  • WPGSD; Anderson et al. (2022)
  • Takes advantage of the known correlation structure in constructing efficacy bounds
  • Controls family-wise Type I error (FWER) for a group sequential design.
  • Correlation may be due to:
    • common observations in nested populations
    • overlapping populations
    • common control arm.

Counting events that occur in intersection hypotheses
H1 H2 Analysis Event
1 1 1 89
1 1 2 155
1 1 3 198
1 2 1 89
1 2 2 155
1 2 3 198
2 2 1 130
2 2 2 225
2 2 3 284

Correlation Matrix and Bounds

Correlation matrix

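Correlations Between Tests

| | H1_A1 | H2_A1 | H1_A2 | H2_A2 | H1_A3 | H2_A3 |
|---|---|---|---|---|---|---|
| H1_A1 | 1.00 | 0.83 | 0.76 | 0.63 | 0.67 | 0.56 |
| H2_A1 | 0.83 | 1.00 | 0.63 | 0.76 | 0.55 | 0.68 |
| H1_A2 | 0.76 | 0.63 | 1.00 | 0.83 | 0.88 | 0.74 |
| H2_A2 | 0.63 | 0.76 | 0.83 | 1.00 | 0.73 | 0.89 |
| H1_A3 | 0.67 | 0.55 | 0.88 | 0.73 | 1.00 | 0.83 |
| H2_A3 | 0.56 | 0.68 | 0.74 | 0.89 | 0.83 | 1.00 |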
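The correlation matrix follows directly from the event counts in the preceding table: the correlation between two tests is the number of events they share divided by the square root of the product of their own event counts. A plain-Python sketch (not the wpgsd code; helper names are hypothetical), with H1 = B+ and H2 = overall:

```python
from math import sqrt

# Events by hypothesis pair at analyses 1 to 3, from the event count table.
events = {
    (1, 1): [89, 155, 198],   # H1 (B+)
    (1, 2): [89, 155, 198],   # events common to H1 and H2
    (2, 2): [130, 225, 284],  # H2 (overall)
}


def common_events(i, j, k):
    """Events shared by hypotheses i and j at analysis k (0-based)."""
    return events[(min(i, j), max(i, j))][k]


def correlation(i, k, j, l):
    """Correlation between the test of hypothesis i at analysis k and the
    test of hypothesis j at analysis l (analyses are 0-based)."""
    shared = common_events(i, j, min(k, l))
    return shared / sqrt(common_events(i, i, k) * common_events(j, j, l))


# Order matches the matrix: H1_A1, H2_A1, H1_A2, H2_A2, H1_A3, H2_A3
labels = [(h, a) for a in range(3) for h in (1, 2)]
matrix = [[round(correlation(i, k, j, l), 2) for (j, l) in labels]
          for (i, k) in labels]

for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

Because the B+ population is nested in the overall population, every B+ event is shared, which is why the correlations between H1 and H2 at the same analysis are as high as 0.83.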

Bounds

Analysis Hypotheses H1 H2
1 H1 0.00083 NA
1 H1, H2 0.00048 0.00048
1 H2 NA 0.00083
2 H1 0.011 NA
2 H1, H2 0.0069 0.0069
2 H2 NA 0.011
3 H1 0.022 NA
3 H1, H2 0.014 0.014
3 H2 NA 0.022

Group Sequential Bound Comparison: Bonferroni vs. Parametric

Usual group sequential calculation

Bonferroni Bounds
Adjusted only for correlations between analyses
Analysis Hypotheses H1 H2
1 H1 0.00083 NA
1 H1, H2 0.00019 0.00019
1 H2 NA 0.00083
2 H1 0.011 NA
2 H1, H2 0.0047 0.0047
2 H2 NA 0.011
3 H1 0.022 NA
3 H1, H2 0.011 0.011
3 H2 NA 0.022
Bounds expressed as nominal p-values

WPGSD

Weighted Parametric GSD
Adjusted for correlations between analyses and hypotheses
Analysis Hypotheses H1 H2
1 H1 0.00083 NA
1 H1, H2 0.00048 0.00048
1 H2 NA 0.00083
2 H1 0.011 NA
2 H1, H2 0.0069 0.0069
2 H2 NA 0.011
3 H1 0.022 NA
3 H1, H2 0.014 0.014
3 H2 NA 0.022
Bounds expressed as nominal p-values
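The gain from using the correlation can be seen in a single-analysis simplification. With two equally weighted hypotheses, Bonferroni tests each at \(\alpha/2\), while the parametric bound solves P(Z1 > c or Z2 > c) = \(\alpha\) for correlated test statistics. The sketch below is plain Python, not the wpgsd algorithm; one-sided \(\alpha\) = 0.025 is an assumption, the correlation 0.83 is the H1_A1/H2_A1 entry of the matrix, and the function names are hypothetical. It does not reproduce the tables, which also involve spending across analyses.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()


def prob_both_below(c, rho, n_steps=1000, lower=-8.0):
    """P(Z1 <= c, Z2 <= c) for a standard bivariate normal with correlation
    rho (|rho| < 1), integrating over Z1 with Simpson's rule."""
    h = (c - lower) / n_steps
    scale = sqrt(1 - rho ** 2)

    def integrand(x):
        # density of Z1 at x times P(Z2 <= c | Z1 = x)
        return norm.pdf(x) * norm.cdf((c - rho * x) / scale)

    total = integrand(lower) + integrand(c)
    for s in range(1, n_steps):
        total += (4 if s % 2 else 2) * integrand(lower + s * h)
    return total * h / 3


def parametric_nominal_p(alpha, rho):
    """Common nominal p-value bound for two equally weighted hypotheses so
    that the probability of any rejection under the null equals alpha."""
    lo, hi = 0.0, 8.0
    for _ in range(60):  # bisection on the common critical value
        mid = (lo + hi) / 2
        if 1 - prob_both_below(mid, rho) > alpha:
            lo = mid
        else:
            hi = mid
    return 1 - norm.cdf(hi)


alpha, rho = 0.025, 0.83
bonferroni_p = alpha / 2
parametric_p = parametric_nominal_p(alpha, rho)
print(bonferroni_p, round(parametric_p, 4))
```

The parametric bound lies between the Bonferroni bound and the full \(\alpha\), moving toward \(\alpha\) as the correlation increases; this is the same direction as the relaxation seen in the intersection rows of the two tables.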

Summary

  • Validated R-based packages for complex group sequential design
  • Substantial documentation and examples available
  • Key features include:
    • Non-proportional hazards
    • Specialized spending and other bounds
    • Liberalizing bounds by incorporating known correlations
    • Summary output formatted for formal documents

Thank you

References

Anderson, Keaven M, Zifang Guo, Jing Zhao, and Linda Z Sun. 2022. “A Unified Framework for Weighted Parametric Group Sequential Design.” Biometrical Journal 64 (7): 1219–39.

Jennison, Christopher, and Bruce W Turnbull. 1999. Group Sequential Methods with Applications to Clinical Trials. CRC Press.

Maurer, Willi, and Frank Bretz. 2013. “Multiple Testing in Group Sequential Trials Using Graphical Approaches.” Statistics in Biopharmaceutical Research 5 (4): 311–20.

Zhao, Yujie, Yilong Zhang, Larry Leon, and Keaven M Anderson. 2023. “Group Sequential Design Under Non-Proportional Hazards.” arXiv Preprint arXiv:2312.01723.