04. Data layout for recurrent event analyses

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
suppressWarnings({
  suppressMessages({
    library(revents)
    library(dplyr)
    library(gtsummary)
    library(reda)
    library(reReg)
    library(tibble)
    library(tidyr)
    library(ggplot2)
    library(kableExtra)
  })
})

Data layout for modelling recurrent events

The data layout is an essential first step before modelling recurrent events, and it is mainly determined by the assumptions of the model.

Starting from the raw data (df_revents), we need to derive the columns required for the analysis of recurrent events.

# Load the simulated data
data("df_revents", package = "revents")
df_revents <- revents::df_revents

As a first step, we derive the time interval of each observation (TSTART, TSTOP and TGAP) and the sequence number of each record within a patient (SEVENT). The following code snippet shows how to derive these columns.

# Derive the interval columns and the event sequence for each patient
final_relapses <- df_revents %>%
  dplyr::select(id, time, status) %>%
  group_by(id) %>%
  arrange(id, time) %>%
  mutate(TSTART = lag(time, default = 0),
         TSTOP = time,
         TGAP = TSTOP - TSTART,
         SEVENT = row_number()) %>%
  mutate(PARAMCD = "ARR") %>%
  dplyr::select(id, PARAMCD, TSTART, TSTOP, TGAP, status, SEVENT) %>%
  rename(STATUS = status)
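
To make the derivation concrete, the same steps can be run on a tiny data set. This is a minimal sketch on hypothetical toy data (the object names toy and toy_intervals and all values are made up for illustration), assuming one input row per event or censoring time:

```r
library(dplyr)

# Hypothetical toy data: patient 1 has two events and is censored at 3.0,
# patient 2 has no events and is censored at 3.0
toy <- tibble(
  id     = c(1, 1, 1, 2),
  time   = c(0.5, 1.2, 3.0, 3.0),
  status = c(1, 1, 0, 0)
)

toy_intervals <- toy %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(
    TSTART = lag(time, default = 0),  # previous time within the patient, 0 for the first row
    TSTOP  = time,                    # end of the interval
    TGAP   = TSTOP - TSTART,          # length of the interval
    SEVENT = row_number()             # position of the row within the patient
  ) %>%
  ungroup()
```

Note that SEVENT numbers every row, so the final censoring row of a patient also receives a sequence number.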

Next, the remaining variables are joined from the original dataset by id. The following code snippet shows the join.

# Select the adjustment variables from the original data
df_filter <- df_revents %>%
  dplyr::select(id, Age, Sex, mstype, Race, Time_since_diagnosis)

# Perform the left join
data_layout_1 <- final_relapses %>%
  left_join(df_filter, by = "id", relationship = "many-to-many") %>%
  group_by(id) %>%
  arrange(id, TSTOP) %>%
  rename(ID = id,
         AGE = Age,
         SEX = Sex,
         DISEASE_COURSE = mstype,
         RACE = Race,
         TIME_SINCE_DIAGNOSIS = Time_since_diagnosis) %>%
  distinct()
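
An alternative to a many-to-many join followed by distinct() is to reduce the covariates to one row per id before joining. The following is a minimal sketch on hypothetical toy data (toy, covariates and joined are illustrative names), assuming the baseline covariates are constant within each id; the guard stops if they are not:

```r
library(dplyr)

# Hypothetical long-format input with a baseline covariate repeated on every row
toy <- tibble(
  id     = c(1, 1, 2),
  time   = c(0.5, 3.0, 3.0),
  status = c(1, 0, 0),
  Age    = c(40, 40, 35)
)

# One row per id; duplicated ids here would mean a covariate varies within a patient
covariates <- toy %>%
  distinct(id, Age)
stopifnot(anyDuplicated(covariates$id) == 0)

# With unique ids on the right-hand side, the join is many-to-one
joined <- toy %>%
  select(id, time, status) %>%
  left_join(covariates, by = "id", relationship = "many-to-one")
```

Declaring the relationship as many-to-one makes dplyr raise an error if the covariate table unexpectedly contains repeated ids.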

The resulting layout is the one typically used in survival-based models. These models differ mainly in the time scale used (CT, TT and GT) and in how the risk of successive events is defined. In the conditional and marginal rate-based models, a subject is assumed not to be at risk for a subsequent event until the current event has ended. In the marginal hazards models (WLW and LWA), by contrast, each participant is simultaneously at risk for every event from the beginning of the study.

Data Layout: Conditional and marginal rate models

To illustrate these differences, Table 1 displays the data layout for conditional and marginal rate-based models. ID is a unique patient identifier. TSTART and TSTOP delimit the time interval of each observation, and TGAP is the length of that interval (TSTOP - TSTART). STATUS (0 or 1) indicates whether an event occurs at the end of the interval: it equals 1 if an event is observed at TSTOP and 0 if TSTOP is a right-censoring time. SEVENT records the event sequence for each patient, which is necessary for stratified models such as PWP and WLW. DISEASE_COURSE defines the patient’s group, which in this study is RRMS or SPMS. In data layout 1, patients without relapses have a single line, whereas patients with at least one event have more than one line, with the last line corresponding to the time of right-censoring.

Table 1. Data layout 1: conditional and marginal rate models.

kbl(data_layout_1[1:13,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

ID	PARAMCD	TSTART	TSTOP	TGAP	STATUS	SEVENT	AGE	SEX	DISEASE_COURSE	RACE	TIME_SINCE_DIAGNOSIS
1	ARR	0.0000000	0.2526189	0.2526189	1	1	41.32254	Male	RRMS	White	6.099722
1	ARR	0.2526189	0.4496878	0.1970689	1	2	41.32254	Male	RRMS	White	6.099722
1	ARR	0.4496878	2.5320097	2.0823219	1	3	41.32254	Male	RRMS	White	6.099722
1	ARR	2.5320097	10.0000000	7.4679903	0	4	41.32254	Male	RRMS	White	6.099722
2	ARR	0.0000000	0.2332276	0.2332276	1	1	31.44477	Female	RRMS	White	3.371547
2	ARR	0.2332276	1.1621344	0.9289069	1	2	31.44477	Female	RRMS	White	3.371547
2	ARR	1.1621344	1.5062723	0.3441379	1	3	31.44477	Female	RRMS	White	3.371547
2	ARR	1.5062723	2.2481090	0.7418367	1	4	31.44477	Female	RRMS	White	3.371547
2	ARR	2.2481090	2.8411302	0.5930211	1	5	31.44477	Female	RRMS	White	3.371547
2	ARR	2.8411302	3.0169426	0.1758125	1	6	31.44477	Female	RRMS	White	3.371547
2	ARR	3.0169426	3.8417178	0.8247752	1	7	31.44477	Female	RRMS	White	3.371547
2	ARR	3.8417178	8.0386405	4.1969226	1	8	31.44477	Female	RRMS	White	3.371547
2	ARR	8.0386405	10.0000000	1.9613595	0	9	31.44477	Female	RRMS	White	3.371547
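
The interval columns in Table 1 can feed a survival response in two forms: the (TSTART, TSTOP] interval on the study time scale, or the gap time TGAP since the previous row. The sketch below only builds the two response objects on hypothetical toy values; it assumes the survival package, which this article does not load, and it does not fit any model:

```r
library(survival)

# Hypothetical rows in the shape of data layout 1
toy <- data.frame(
  ID     = c(1, 1, 1, 2),
  TSTART = c(0, 0.25, 0.45, 0),
  TSTOP  = c(0.25, 0.45, 10, 10),
  STATUS = c(1, 1, 0, 0)
)
toy$TGAP <- toy$TSTOP - toy$TSTART

# Interval form: each row is at risk from TSTART to TSTOP
y_interval <- with(toy, Surv(TSTART, TSTOP, STATUS))

# Gap-time form: the clock restarts at zero after each row
y_gap <- with(toy, Surv(TGAP, STATUS))
```

Which form is appropriate depends on the time scale assumed by the chosen model.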

Data Layout: Marginal WLW/LWA models

Data layout 2, for the WLW and LWA models, is arranged so that every participant has the same number of entries (Table 2). Each ID has as many lines as the maximum value of SEVENT found in data layout 1 (max_SEVENT in the code below). In the output shown in Table 2, each participant has 11 lines.

# Step 1: Get max event sequence
max_SEVENT <- max(data_layout_1$SEVENT)

# Step 2: Create all combinations of ID and SEVENT
combinations <- expand.grid(ID = unique(data_layout_1$ID), SEVENT = 1:max_SEVENT)

# Step 3: Merge, then carry the last observed row forward into the added rows
data_layout_2 <- 
  combinations %>%
  left_join(data_layout_1, by = c("ID", "SEVENT")) %>%
  arrange(ID, SEVENT) %>%
  mutate(
    IMPUTED = is.na(STATUS)  # TRUE for rows added by the expansion
  ) %>%
  # TGAP is not needed in this layout; IMPUTED is dropped here as well
  # (remove it from this select() to keep the flag)
  dplyr::select(-TGAP, -IMPUTED) %>%
  group_by(ID) %>%
  # Within each ID, fill missing values downwards with the last observed value;
  # this covers the interval and status columns as well as the static covariates
  fill(PARAMCD, TSTART, TSTOP, STATUS, SEVENT, DISEASE_COURSE, SEX, AGE, RACE,
       TIME_SINCE_DIAGNOSIS
  ) %>%
  ungroup()

Table 2. Data layout 2: WLW and LWA marginal hazards models.

kbl(data_layout_2[1:22,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

ID	SEVENT	PARAMCD	TSTART	TSTOP	STATUS	AGE	SEX	DISEASE_COURSE	RACE	TIME_SINCE_DIAGNOSIS
1	1	ARR	0.0000000	0.2526189	1	41.32254	Male	RRMS	White	6.099722
1	2	ARR	0.2526189	0.4496878	1	41.32254	Male	RRMS	White	6.099722
1	3	ARR	0.4496878	2.5320097	1	41.32254	Male	RRMS	White	6.099722
1	4	ARR	2.5320097	10.0000000	0	41.32254	Male	RRMS	White	6.099722
1	5	ARR	2.5320097	10.0000000	0	41.32254	Male	RRMS	White	6.099722
1	6	ARR	2.5320097	10.0000000	0	41.32254	Male	RRMS	White	6.099722
1	7	ARR	2.5320097	10.0000000	0	41.32254	Male	RRMS	White	6.099722
1	8	ARR	2.5320097	10.0000000	0	41.32254	Male	RRMS	White	6.099722
1	9	ARR	2.5320097	10.0000000	0	41.32254	Male	RRMS	White	6.099722
1	10	ARR	2.5320097	10.0000000	0	41.32254	Male	RRMS	White	6.099722
1	11	ARR	2.5320097	10.0000000	0	41.32254	Male	RRMS	White	6.099722
2	1	ARR	0.0000000	0.2332276	1	31.44477	Female	RRMS	White	3.371547
2	2	ARR	0.2332276	1.1621344	1	31.44477	Female	RRMS	White	3.371547
2	3	ARR	1.1621344	1.5062723	1	31.44477	Female	RRMS	White	3.371547
2	4	ARR	1.5062723	2.2481090	1	31.44477	Female	RRMS	White	3.371547
2	5	ARR	2.2481090	2.8411302	1	31.44477	Female	RRMS	White	3.371547
2	6	ARR	2.8411302	3.0169426	1	31.44477	Female	RRMS	White	3.371547
2	7	ARR	3.0169426	3.8417178	1	31.44477	Female	RRMS	White	3.371547
2	8	ARR	3.8417178	8.0386405	1	31.44477	Female	RRMS	White	3.371547
2	9	ARR	8.0386405	10.0000000	0	31.44477	Female	RRMS	White	3.371547
2	10	ARR	8.0386405	10.0000000	0	31.44477	Female	RRMS	White	3.371547
2	11	ARR	8.0386405	10.0000000	0	31.44477	Female	RRMS	White	3.371547
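
The same expansion can be reproduced on a small example. This is a minimal sketch on hypothetical toy data that uses tidyr::complete() in place of expand.grid() plus a join; it is an illustrative alternative, not the code used to produce Table 2:

```r
library(dplyr)
library(tidyr)

# Hypothetical layout-1 rows: patient 1 has two rows, patient 2 has one
toy <- tibble(
  ID     = c(1, 1, 2),
  SEVENT = c(1L, 2L, 1L),
  TSTART = c(0, 0.5, 0),
  TSTOP  = c(0.5, 3, 3),
  STATUS = c(1, 0, 0)
)

toy_wlw <- toy %>%
  # Add the missing ID x SEVENT combinations up to the largest SEVENT
  complete(ID, SEVENT = seq_len(max(SEVENT))) %>%
  group_by(ID) %>%
  # Carry the last observed interval and status into the added rows
  fill(TSTART, TSTOP, STATUS, .direction = "down") %>%
  ungroup()
```

After the expansion every patient has the same number of rows, and the row added for patient 2 repeats that patient's censoring interval with STATUS equal to 0.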

Data Layout: Count-based models

In classical count-based models such as Poisson regression, each participant contributes one record, which includes the number of events as the main outcome and the total length of follow-up (Table 3). In data layout 3, ID is the unique identifier of the individual, DISEASE_COURSE the type of disease, COUNT the total number of events, and LENGHT.TIME the total length of follow-up since study start.

data_layout_3 <- data_layout_1 %>%
  group_by(ID) %>%
  reframe(
    DISEASE_COURSE = first(DISEASE_COURSE),
    COUNT = as.numeric(sum(STATUS == 1)), # total number of events 
    LENGHT.TIME = last(TSTOP),
    AGE = first(AGE),
    SEX = first(SEX),
    RACE = first(RACE),
    TIME_SINCE_DIAGNOSIS = first(TIME_SINCE_DIAGNOSIS))

Table 3. Data layout 3: count-based models.

kbl(data_layout_3[1:10,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

ID	DISEASE_COURSE	COUNT	LENGHT.TIME	AGE	SEX	RACE	TIME_SINCE_DIAGNOSIS
1	RRMS	3	10	41.32254	Male	White	6.0997219
2	RRMS	8	10	31.44477	Female	White	3.3715470
3	RRMS	3	10	42.89048	Male	No white	2.8732107
4	RRMS	1	10	42.24095	Female	White	1.8018853
5	RRMS	1	10	39.89355	Male	No white	3.3193525
6	RRMS	4	10	33.25618	Female	White	0.3114489
7	RRMS	3	10	36.04294	Female	White	1.8779915
8	RRMS	0	10	44.49765	Male	White	0.5070449
9	RRMS	1	10	46.73466	Female	White	1.3793769
10	RRMS	4	10	39.21081	Female	White	0.5543679
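
In a count-based model, the follow-up length usually enters as an exposure through a log offset. The following is a minimal sketch on hypothetical toy data (the names toy, toy_counts and FOLLOW_UP are illustrative), showing that an intercept-only Poisson model with a log offset recovers the pooled event rate:

```r
library(dplyr)

# Hypothetical layout-1 rows for three patients
toy <- tibble(
  ID     = c(1, 1, 1, 2, 3, 3),
  TSTOP  = c(0.5, 1.2, 3, 3, 1, 4),
  STATUS = c(1, 1, 0, 0, 1, 0)
)

# One record per patient: number of events and length of follow-up
toy_counts <- toy %>%
  group_by(ID) %>%
  summarise(
    COUNT     = sum(STATUS == 1),
    FOLLOW_UP = max(TSTOP),
    .groups   = "drop"
  )

# Intercept-only Poisson regression with follow-up as exposure
fit <- glm(COUNT ~ 1 + offset(log(FOLLOW_UP)), family = poisson, data = toy_counts)

# Estimated rate per unit of time; equals sum(COUNT) / sum(FOLLOW_UP)
pooled_rate <- unname(exp(coef(fit)))
```

Covariates such as DISEASE_COURSE would be added to the right-hand side of the formula in the same way.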

David Herman
