Supporting Complex Group Sequential Designs

March 8, 2024

Abstract

We consider group sequential designs supporting multiple objectives and possibly including complexities such as non-proportional hazards due to a delayed treatment effect. Given that off-the-shelf software has limited capabilities, we support multiple open source R packages for complex designs. The gsDesign2 package supports design and analysis under non-proportional hazards assumptions. For testing multiple hypotheses using graphical multiplicity methods, the gMCPLite package supports graphs and provides examples of how to analyze trials with multiple objectives. For trials testing hypotheses for multiple overlapping populations or testing multiple experimental treatment groups versus a common control, the wpgsd package takes advantage of known correlations to test hypotheses efficiently. Finally, the simtrial package is a fast simulation tool for time-to-event endpoints. In addition to describing key package capabilities, we describe our processes to justify use in a regulatory-compliant environment.

Overview of design challenges considered

Group sequential design using spending bounds (gsDesign)
- Supporting Shiny app: https://rinpharma.shinyapps.io/gsdesign/
Testing multiple hypotheses in group sequential design
- gMCPLite: part of gMCP plus ggplot2 graphics functionality
  - Supporting Shiny app: https://rinpharma.shinyapps.io/gmcp/
- wpgsd package
  - MAMS (multi-arm-multi-stage)
  - Related populations (e.g., biomarker+, overall)
Non-proportional hazards with group sequential design (gsDesign2)
Fast simulation for time-to-event endpoints: simtrial

Our design packages

Package	Topic	Shiny interface	GitHub / Documentation	Unit test coverage
gsDesign	Design	https://rinpharma.shinyapps.io/gsdesign/	https://github.com/keaven/gsDesign https://keaven.github.io/gsDesign/	75%
gsDesign2	NPH	Under construction	https://github.com/Merck/gsDesign2 https://merck.github.io/gsDesign2/	77%
simtrial	TTE Simulation	?	https://github.com/Merck/simtrial https://merck.github.io/simtrial/	83%
gMCPLite	Graphical Multiplicity	https://rinpharma.shinyapps.io/gmcp/	https://github.com/Merck/gMCPLite https://merck.github.io/gMCPLite/	76%
wpgsd	Bounds for Correlated Testing	Not currently planned	https://github.com/Merck/wpgsd https://merck.github.io/wpgsd/	79%

Graphical multiplicity

Graph using ggplot2 initially built with https://rinpharma.shinyapps.io/gmcp/ which generated code for gMCPLite
Dividing \(\alpha\) equally between biomarker positive (B+) and overall population
Group sequential design and \(\alpha\)-reallocation available using Maurer and Bretz (2013)
Accounting for correlated tests, we can relax Maurer-Bretz bounds using Anderson et al. (2022) and the wpgsd package
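
The equal split and the reallocation step can be sketched with the standard update rule for graphical multiplicity procedures. This is a base R illustration, not the gMCPLite API. The graph itself is not reproduced in the text, so the transition weights of 1 between the two hypotheses are an assumption.

```r
# Initial graph: alpha weights split equally between H1 (B+) and H2 (overall).
# Assumption: each hypothesis passes all of its weight to the other when rejected.
w <- c(H1 = 0.5, H2 = 0.5)
g <- matrix(c(0, 1,
              1, 0), nrow = 2, byrow = TRUE,
            dimnames = list(names(w), names(w)))

# Remove rejected hypothesis j and reallocate its weight along the graph edges.
update_graph <- function(w, g, j) {
  keep <- setdiff(seq_along(w), j)
  w_new <- w[keep] + unname(w[j]) * g[j, keep]
  g_new <- g[keep, keep, drop = FALSE]
  for (l in keep) {
    for (k in keep) {
      denom <- 1 - g[l, j] * g[j, l]
      g_new[names(w)[l], names(w)[k]] <-
        if (l != k && denom > 0) (g[l, k] + g[l, j] * g[j, k]) / denom else 0
    }
  }
  list(w = w_new, g = g_new)
}

after_h1 <- update_graph(w, g, j = 1)
after_h1$w  # H2 now carries the full weight
```

Under these assumptions, the intersection of H1 and H2 is tested with weights (0.5, 0.5), and each hypothesis alone is tested with weight 1 once the other has been rejected.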

Designs

gsDesign and its Shiny app

The gsDesign package supports group sequential clinical trial design, largely as presented by Jennison and Turnbull (1999). An easy-to-use web interface enables use without coding and generates code to reproduce a design; it is being enhanced on an ongoing basis to support more features.
Initial OS design for B+ group assumes
- \(\alpha= 0.01\), 90% power
- Median control OS = 12 months
- Hazard ratio (HR) for experimental treatment = 0.6
- 12 month enrollment with 6 month ramp-up
- Analyses planned at 16, 26, and 36 months
- O’Brien-Fleming-like spending bound
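
As a rough cross-check of these assumptions, the control hazard rate and the event count for a fixed (single-analysis) design can be computed in base R. The Schoenfeld approximation with 1:1 randomization is used here as an assumption; it is a sanity check rather than the calculation gsDesign performs.

```r
alpha <- 0.01          # one-sided Type I error
power <- 0.90
hr <- 0.6              # hazard ratio under the alternate hypothesis
median_control <- 12   # months

# Exponential control hazard implied by the median
lambda_control <- log(2) / median_control

# Schoenfeld approximation for a fixed design with 1:1 randomization
events_fixed <- 4 * (qnorm(1 - alpha) + qnorm(power))^2 / log(hr)^2

round(lambda_control, 4)  # 0.0578 per month
ceiling(events_fixed)     # 200 events
```

A group sequential design with an O’Brien-Fleming-like bound typically needs only slightly more events than the fixed design.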

Design using calendar timing: gsDesign::gsSurvCalendar()
Analysis	Value	Efficacy
IA 1: 45%	p (1-sided)	0.0001
N: 284	~HR at bound	0.4648
Events: 91	P(Cross) if HR=1	0.0001
Month: 16	P(Cross) if HR=0.6	0.1134
IA 2: 79%	p (1-sided)	0.0037
N: 284	~HR at bound	0.6542
Events: 159	P(Cross) if HR=1	0.0038
Month: 26	P(Cross) if HR=0.6	0.7118
Final	p (1-sided)	0.0088
N: 284	~HR at bound	0.7156
Events: 201	P(Cross) if HR=1	0.0100
Month: 36	P(Cross) if HR=0.6	0.9003
Information- or calendar-based spending supported
Table often incorporated directly into protocol
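
Two columns of the table above can be approximated by hand. The sketch below assumes the Lan-DeMets O’Brien-Fleming-like spending function with information fraction taken as events divided by final events, and the Schoenfeld approximation for the hazard ratio at a bound under 1:1 randomization. The function names are illustrative and are not gsDesign functions.

```r
# Cumulative alpha spent at information fraction t (Lan-DeMets O'Brien-Fleming-like)
sf_obf_like <- function(alpha, t) 2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(t)))

# Approximate HR at a bound with nominal one-sided p-value p after d events
hr_at_bound <- function(p, d) exp(-2 * qnorm(1 - p) / sqrt(d))

events <- c(91, 159, 201)                # events at IA 1, IA 2, final (from the table)
nominal_p <- c(0.0001, 0.0037, 0.0088)   # rounded nominal p-values (from the table)

# Cumulative probability of crossing under HR = 1
round(sf_obf_like(alpha = 0.01, t = events / max(events)), 4)  # 0.0001 0.0038 0.0100

# Approximate HR at each bound
round(hr_at_bound(nominal_p, events), 4)
```

The spending values match the "P(Cross) if HR=1" rows. The hazard ratios are close to the "~HR at bound" rows but not identical, because the nominal p-values entered here are rounded to four decimals.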

gsDesign2

Introduction: The goal of gsDesign2 is to enable fixed or group sequential design under non-proportional hazards, including changing hazard ratios over time and/or between strata. It adds substantial flexibility beyond what gsDesign provides (Zhao et al. 2023).
Reproducing the same design as gsDesign with a different sample size method: gs_design_ahr().
Results close, but slightly different from gsDesign.
- Does this mean one is wrong?
- We will check using simulation!

Design for B+ Population Using gs_design_ahr()
AHR approximations of ~HR at bound
Bound	Z	Nominal p¹	~HR at bound²	Cumulative boundary crossing probability: alternate hypothesis	Cumulative boundary crossing probability: null hypothesis
Analysis: 1 Time: 15.9 N: 278 Event: 89 AHR: 0.6 Information fraction: 0.44
Efficacy	3.67	0.0001	0.4520	0.1088	0.0001
Analysis: 2 Time: 25.8 N: 278 Event: 155 AHR: 0.6 Information fraction: 0.77
Efficacy	2.69	0.0036	0.6453	0.6865	0.0036
Analysis: 3 Time: 36.2 N: 278 Event: 198 AHR: 0.6 Information fraction: 1
Efficacy	2.37	0.0089	0.7121	0.9016	0.0100
¹ One-sided p-value for experimental vs control treatment. Value < 0.5 favors experimental, > 0.5 favors control.
² Approximate hazard ratio to cross bound.

gsDesign vs. gsDesign2

Feature	gsDesign	gsDesign2
Nonproportional hazards	❌	✔️
Allow skipping bound at an analysis	❌	✔️
Integer-based sample size/event count	✔️	✔️
Alternates to logrank for survival analysis	❌	✔️
Calendar-based timing/spending	✔️	✔️
HR bounds for futility	❌	✔️
Stratified design for binomial	❌	✔️
Shiny interface	✔️	Under construction
Maturity	✔️	❌

Simulations

simtrial

simtrial: fast, extensible clinical trial simulation framework for time-to-event endpoints.
- Backend based on data.table
For each simulation
- Generate data: sim_pw_surv()
- For each analysis
  - Cut the data for analysis: create_cutting(); function factory allowing complex rules
  - Create Z-value test for analysis: various tests available; logrank shown here
Across simulations: summarize trial outcomes
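
The simulation loop above can be sketched without simtrial, using base R and the survival package. This is a simplified stand-in and does not use simtrial's functions. It assumes exponential survival, uniform enrollment over 12 months with no ramp-up, 1:1 allocation, and a single analysis cut at a target event count tested at one-sided alpha = 0.01. The function name and its defaults are illustrative.

```r
library(survival)

# Simulate one trial; return the logrank Z at the data cut (positive favors experimental).
sim_one_trial <- function(n = 284, hr = 0.6, median_control = 12,
                          enroll_duration = 12, target_events = 201) {
  # Generate data
  arm <- rep(c(0, 1), length.out = n)
  enroll <- runif(n, 0, enroll_duration)
  tte <- rexp(n, rate = log(2) / median_control * ifelse(arm == 1, hr, 1))
  calendar <- enroll + tte

  # Cut the data at the calendar time of the target event
  cut_time <- sort(calendar)[target_events]
  dat <- data.frame(
    arm = arm,
    time = pmin(calendar, cut_time) - enroll,
    event = as.integer(calendar <= cut_time)
  )
  dat <- dat[dat$time > 0, ]  # drop subjects not yet enrolled at the cut

  # Logrank Z-value: (expected - observed) events in the experimental arm
  fit <- survdiff(Surv(time, event) ~ arm, data = dat)
  (fit$exp[2] - fit$obs[2]) / sqrt(fit$var[2, 2])
}

# Across simulations: summarize trial outcomes
set.seed(123)
z <- replicate(500, sim_one_trial())
mean(z >= qnorm(1 - 0.01))  # simulated power of the single-analysis test
```

A group sequential version would repeat the cut and test steps at each planned analysis and compare each Z-value with its bound.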

10k simulations for design power
Requiring event count and minimum follow-up helpful
Cut criteria	Power (95% CI)
Event count	88.6% (88%, 89.2%)
Event count and minimum follow-up	94.4% (93.9%, 94.9%)
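
The reported intervals are consistent with a normal-approximation (Wald) interval for a proportion based on 10,000 simulations. The interval method is an assumption here; the source does not state which one was used.

```r
n_sim <- 10000
power_hat <- c(event_count = 0.886, event_count_and_followup = 0.944)

# Monte Carlo standard error of a simulated proportion
se <- sqrt(power_hat * (1 - power_hat) / n_sim)

ci <- cbind(lower = power_hat - qnorm(0.975) * se,
            upper = power_hat + qnorm(0.975) * se)
round(100 * ci, 1)
# event_count:              88.0 89.2
# event_count_and_followup: 93.9 94.9
```

With 10,000 simulations the half-width is about 0.6 percentage points at 88.6% power, so the difference between the two cut criteria is well beyond simulation error.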

Non-proportional hazards

For the biomarker-negative (B-) population, assume a delayed effect.
- First 3 months: HR = 1
- Thereafter: HR = 0.7
- B- is 30% of the overall population
Sample size determined by B+ population already derived.
Spending for both B+ and overall determined by B+ information fraction.
Power computed here using gs_power_ahr().
Design assumes stratified analysis (B+, B-).
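
To see why the average hazard ratio (AHR) changes with follow-up under a delayed effect, the sketch below weights log(HR) by expected events in each interval for the B- population alone. It is a simplified illustration, not the gsDesign2 computation. It assumes a 12-month control median in B- (not stated in the source) and that all subjects enter at time 0. It will not reproduce the AHR values in the overall-population design, which also blend in the B+ stratum and account for enrollment.

```r
# AHR for a delayed effect: HR = 1 up to `delay`, then `late_hr`.
# Weights are expected events per control/experimental pair in each interval.
ahr_delayed <- function(followup, control_median = 12, delay = 3, late_hr = 0.7) {
  lambda <- log(2) / control_median
  t_early <- pmin(followup, delay)
  t_late <- pmax(followup - delay, 0)

  # Events before the delay: identical hazard in both arms
  early <- 2 * (1 - exp(-lambda * t_early))

  # Events after the delay among those still at risk
  at_risk <- exp(-lambda * t_early)
  late <- at_risk * ((1 - exp(-lambda * t_late)) +
                       (1 - exp(-lambda * late_hr * t_late)))

  # log(1) = 0, so only late events contribute to the weighted log hazard ratio
  exp(late * log(late_hr) / (early + late))
}

ahr <- ahr_delayed(c(2, 16, 26, 36))  # follow-up in months
round(ahr, 2)
```

The AHR equals 1 while follow-up is shorter than the delay and then moves toward 0.7 as later events accumulate. It never reaches 0.7 because the early events remain in the average.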

Overall Population Design
AHR approximations of ~HR at bound
Bound	Z	Nominal p¹	~HR at bound²	Cumulative boundary crossing probability: alternate hypothesis	Cumulative boundary crossing probability: null hypothesis³
Analysis: 1 Time: 15.9 N: 398 Event: 130 AHR: 0.65 Information fraction: 0.45
Efficacy	3.63	0.0001	0.5227	0.1304	0.0001
Analysis: 2 Time: 25.9 N: 398 Event: 225 AHR: 0.63 Information fraction: 0.78
Efficacy	2.67	0.0038	0.6971	0.7993	0.0038
Analysis: 3 Time: 36 N: 398 Event: 284 AHR: 0.62 Information fraction: 1
Efficacy	2.37	0.0088	0.7530	0.9617	0.0100
¹ One-sided p-value for experimental vs. control treatment. Value < 0.5 favors experimental, > 0.5 favors control.
² Approximate hazard ratio to cross bound.
³ alpha-spending determined by B+ information fraction.

wpgsd

Weighted parametric group sequential design

WPGSD; Anderson et al. (2022)
Takes advantage of the known correlation structure in constructing efficacy bounds
Controls family-wise Type I error (FWER) for a group sequential design.
Correlation may be due to:
- common observations in nested populations
- overlapping populations
- common control arm.

Counting events that occur in intersection hypotheses
H1	H2	Analysis	Event
1	1	1	89
1	1	2	155
1	1	3	198
1	2	1	89
1	2	2	155
1	2	3	198
2	2	1	130
2	2	2	225
2	2	3	284

Correlation Matrix and Bounds

Correlation matrix

Correlations Between Tests
	H1_A1	H2_A1	H1_A2	H2_A2	H1_A3	H2_A3
H1_A1	1.00	0.83	0.76	0.63	0.67	0.56
H2_A1	0.83	1.00	0.63	0.76	0.55	0.68
H1_A2	0.76	0.63	1.00	0.83	0.88	0.74
H2_A2	0.63	0.76	0.83	1.00	0.73	0.89
H1_A3	0.67	0.55	0.88	0.73	1.00	0.83
H2_A3	0.56	0.68	0.74	0.89	0.83	1.00
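
The correlations follow from the event counts listed earlier: each entry is the number of events shared by two tests divided by the square root of the product of their event counts. The base R sketch below illustrates that relationship and does not use the wpgsd API. It assumes shared events are the B+ events at the earlier of the two analyses, or the overall events when both tests are H2.

```r
events_h1 <- c(89, 155, 198)    # B+ (H1) events at analyses 1-3
events_h2 <- c(130, 225, 284)   # overall (H2) events at analyses 1-3

# Test ordering: H1_A1, H2_A1, H1_A2, H2_A2, H1_A3, H2_A3
tests <- expand.grid(hyp = 1:2, analysis = 1:3)
labels <- paste0("H", tests$hyp, "_A", tests$analysis)
n_events <- ifelse(tests$hyp == 1,
                   events_h1[tests$analysis],
                   events_h2[tests$analysis])

corr <- matrix(NA_real_, 6, 6, dimnames = list(labels, labels))
for (i in 1:6) {
  for (j in 1:6) {
    k <- min(tests$analysis[i], tests$analysis[j])
    # Shared events at the earlier analysis: overall only if both tests are H2
    shared <- if (tests$hyp[i] == 2 && tests$hyp[j] == 2) events_h2[k] else events_h1[k]
    corr[i, j] <- shared / sqrt(n_events[i] * n_events[j])
  }
}
round(corr, 2)
```

Rounded to two decimals, this agrees with the correlation matrix shown above.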

Bounds

Analysis	Hypotheses	H1	H2
1	H1	0.00083	NA
1	H1, H2	0.00048	0.00048
1	H2	NA	0.00083
2	H1	0.011	NA
2	H1, H2	0.0069	0.0069
2	H2	NA	0.011
3	H1	0.022	NA
3	H1, H2	0.014	0.014
3	H2	NA	0.022

Group Sequential Bound Comparison: Bonferroni vs. Parametric

Usual group sequential calculation

Bonferroni Bounds
Adjusted only for correlations between analyses
Analysis	Hypotheses	H1	H2
1	H1	0.00083	NA
1	H1, H2	0.00019	0.00019
1	H2	NA	0.00083
2	H1	0.011	NA
2	H1, H2	0.0047	0.0047
2	H2	NA	0.011
3	H1	0.022	NA
3	H1, H2	0.011	0.011
3	H2	NA	0.022
Bounds expressed as nominal p-values
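
At the first analysis, a Bonferroni bound is the cumulative alpha spent for the weight assigned to the hypothesis. The sketch below makes several assumptions that are inferred from the table values rather than stated in the source: a total one-sided alpha of 0.025, Lan-DeMets O’Brien-Fleming-like spending, and the B+ information fraction 89/198 for both hypotheses.

```r
# Cumulative alpha spent at information fraction t (Lan-DeMets O'Brien-Fleming-like)
sf_obf_like <- function(alpha, t) 2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(t)))

total_alpha <- 0.025     # assumption inferred from the table
info_frac_1 <- 89 / 198  # B+ information fraction at analysis 1

# Weight 1: a hypothesis tested alone. Weight 0.5: each hypothesis in the H1, H2 intersection.
weights <- c(alone = 1, intersection = 0.5)
first_bound <- sf_obf_like(total_alpha * weights, info_frac_1)
signif(first_bound, 2)
```

Under these assumptions the results are close to the first-analysis rows of the Bonferroni table: about 0.00083 for a hypothesis tested alone and about 0.0002 for each hypothesis in the intersection. Later analyses also depend on the correlation between analyses, so they cannot be read directly from the spending function.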

WPGSD

Weighted Parametric GSD
Adjusted for correlations between analyses and hypotheses
Analysis	Hypotheses	H1	H2
1	H1	0.00083	NA
1	H1, H2	0.00048	0.00048
1	H2	NA	0.00083
2	H1	0.011	NA
2	H1, H2	0.0069	0.0069
2	H2	NA	0.011
3	H1	0.022	NA
3	H1, H2	0.014	0.014
3	H2	NA	0.022
Bounds expressed as nominal p-values

Summary

Validated R-based packages for complex group sequential design
Substantial documentation and examples available
Key features include:
- Non-proportional hazards
- Specialized spending and other bounds
- Liberalizing bounds by incorporating known correlations
- Summary output formatted for formal documents

Thank you

Email: keaven_anderson@merck.com

Slides: https://keaven.github.io/talks/

References

Anderson, Keaven M, Zifang Guo, Jing Zhao, and Linda Z Sun. 2022. “A Unified Framework for Weighted Parametric Group Sequential Design.” Biometrical Journal 64 (7): 1219–39.

Jennison, Christopher, and Bruce W Turnbull. 1999. Group Sequential Methods with Applications to Clinical Trials. CRC Press.

Maurer, Willi, and Frank Bretz. 2013. “Multiple Testing in Group Sequential Trials Using Graphical Approaches.” Statistics in Biopharmaceutical Research 5 (4): 311–20.

Zhao, Yujie, Yilong Zhang, Larry Leon, and Keaven M Anderson. 2023. “Group Sequential Design Under Non-Proportional Hazards.” arXiv Preprint arXiv:2312.01723.