Skip to contents

Data layout modelling recurrent events

Data layout in recurrent events constitute an essential first step before modelling and it is mainly determined by model’s assumptions.

Starting from our first raw data (df_revents), we need to derive the necessary columns for the analysis of recurrent events.

# Load the simulated data
data("df_revents", package = "revents")
df_revents <- revents::df_revents

First step, it is necessary to get the time interval (TSTART, TSTOP and TGAP) for each observation as well as the sequence of the event and the total number of events experienced by each patient. The following code snippet shows how to derive these columns.

final_relapses <- df_revents %>%
  dplyr::select(id, time, status) %>%
  #Pivot dataset
  pivot_longer(!id, names_to = "STATUS", values_to = "TIME")%>%
  #There were some duplicates because of the merge so I deleted them
  distinct()%>%
  #Define the censoring flag
  mutate(STATUS = ifelse(STATUS == "time", 1, 0)) %>%
  #Define tstart and tstop
  group_by(id) %>%
  arrange(id,TIME) %>%
  mutate(TSTART = replace_na(lag(TIME),0)) %>%
  rename(TSTOP = TIME) %>%  
  # remove rows where TSTART == TSTOP == 0 ó 1
  filter(!(TSTART == TSTOP & TSTART %in% c(0, 1))) %>%
  mutate(PARAMCD = "ARR", # parameter code "Anualized relapse rate"
         TGAP = TSTOP - TSTART, # time between events
         SEVENT = row_number()) %>% # sequence number of event per id
  dplyr::select(id, PARAMCD, TSTART, TSTOP, TGAP, STATUS, SEVENT) 

Following that, the rest of variables needs to be joined from the original dataset by id. The following code snippet shows how to join the variables.

# joining adjusting variables from original df
df_filter <- df_revents %>%
  dplyr::select(id, Age, Sex, mstype, Race, Time_since_diagnosis)

# Perform the left join
data_layout_1 <- final_relapses %>%
  left_join(df_filter, by = "id", relationship = "many-to-many") %>%
  group_by(id) %>%
  arrange(id, TSTOP) %>%
  rename(ID = id,
         AGE = Age,
         SEX = Sex,
         DISEASE_COURSE = mstype,
         RACE = Race,
         TIME_SINCE_DIAGNOSIS = Time_since_diagnosis) %>%
  distinct()

The type of layout obtained is the typical layout used in survival-based models. It can be essentially differentiated on the time scale used (CT, TT and GT) and the occurrence of successive events. In the conditional and marginals rate base models, a subject is assumed not to be at risk for a subsequent event until the current event has finished. However, in the marginal hazards models (WLW and LWA) this assumption is different, as each participant is simultaneously at risk for the occurrence of any event from the beginning of the study.

Data Layout: Conditional and marginal rates models

For illustrate these differences, in the Table 1 is displayed the data layout for conditionals and marginals rate-based models. The variable Id is a unique patient identifier. Tstart and Tstop represent the time interval of each observation while Tgap the difference of time between observations. Event (0 or 1) represents whether an event occurs at the end of the time interval. If an event has been observed at time Tstop, Event is equal to 1. If Tstop is a right censoring time Event is equal to 0. Sevent records the event sequence for each patient, which is necessary for stratified models such as PWP and WLW. Disease course defines the patient’s group that in this study is RRMS or SPMS. In data layout 1, patients without relapses have only 1 line, whereas patients with at least one event have more than 1 line, with the last line corresponding to the time of right-censoring.

Table 1. Data layout 1: conditionals and marginals rate models.

kbl(data_layout_1[1:10,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ID PARAMCD TSTART TSTOP TGAP STATUS SEVENT AGE SEX DISEASE_COURSE RACE TIME_SINCE_DIAGNOSIS
1 ARR 0.0000000 0.2526189 0.2526189 1 1 41.32254 Male RRMS White 6.099722
1 ARR 0.2526189 0.4496878 0.1970689 1 2 41.32254 Male RRMS White 6.099722
1 ARR 0.4496878 1.0000000 0.5503122 0 3 41.32254 Male RRMS White 6.099722
1 ARR 1.0000000 2.5320097 1.5320097 1 4 41.32254 Male RRMS White 6.099722
1 ARR 2.5320097 10.0000000 7.4679903 1 5 41.32254 Male RRMS White 6.099722
2 ARR 0.0000000 0.2332276 0.2332276 1 1 31.44477 Female RRMS White 3.371547
2 ARR 0.2332276 1.0000000 0.7667724 0 2 31.44477 Female RRMS White 3.371547
2 ARR 1.0000000 1.1621344 0.1621344 1 3 31.44477 Female RRMS White 3.371547
2 ARR 1.1621344 1.5062723 0.3441379 1 4 31.44477 Female RRMS White 3.371547
2 ARR 1.5062723 2.2481090 0.7418367 1 5 31.44477 Female RRMS White 3.371547

Data Layout: Marginal WLW/LWA models

Data layout 2 for WLW and LWA should be arranged as each participant have the same number of entries (Table 2). That means that each id has many lines as the maximum number of events that could be observed. In this example, maximum number of Sevent was defined Sevent was 7.

# Find the maximum SEVENT value in your data
max_SEVENT <- max(data_layout_1$SEVENT)

# Create a data frame with all combinations of USUBJID and SEVENT
combinations <- expand.grid(ID = unique(data_layout_1$ID), SEVENT = 1:max_SEVENT)

# Merge the original data frame with all combinations of USUBJID and SEVENT
data_layout_2 <- 
  merge(combinations, data_layout_1,
      by = c("ID", "SEVENT"),
      all.x = TRUE
    ) %>%
    arrange(ID, SEVENT) %>%
  # select only necessary variables (TGAP are not necessary in this layout)
    dplyr::select(-TGAP) %>%
  # fill in missing values by repeating the value of the last cell
    fill(PARAMCD, TSTART, TSTOP, STATUS, SEVENT, DISEASE_COURSE, SEX, AGE, RACE,
      TIME_SINCE_DIAGNOSIS
    )

Table 2. Data layout 2: WLW and LWA marginals hazards models.

kbl(data_layout_2[1:20,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ID SEVENT PARAMCD TSTART TSTOP STATUS AGE SEX DISEASE_COURSE RACE TIME_SINCE_DIAGNOSIS
1 1 ARR 0.0000000 0.2526189 1 41.32254 Male RRMS White 6.099722
1 2 ARR 0.2526189 0.4496878 1 41.32254 Male RRMS White 6.099722
1 3 ARR 0.4496878 1.0000000 0 41.32254 Male RRMS White 6.099722
1 4 ARR 1.0000000 2.5320097 1 41.32254 Male RRMS White 6.099722
1 5 ARR 2.5320097 10.0000000 1 41.32254 Male RRMS White 6.099722
1 6 ARR 2.5320097 10.0000000 1 41.32254 Male RRMS White 6.099722
1 7 ARR 2.5320097 10.0000000 1 41.32254 Male RRMS White 6.099722
1 8 ARR 2.5320097 10.0000000 1 41.32254 Male RRMS White 6.099722
1 9 ARR 2.5320097 10.0000000 1 41.32254 Male RRMS White 6.099722
1 10 ARR 2.5320097 10.0000000 1 41.32254 Male RRMS White 6.099722
1 11 ARR 2.5320097 10.0000000 1 41.32254 Male RRMS White 6.099722
1 12 ARR 2.5320097 10.0000000 1 41.32254 Male RRMS White 6.099722
2 1 ARR 0.0000000 0.2332276 1 31.44477 Female RRMS White 3.371547
2 2 ARR 0.2332276 1.0000000 0 31.44477 Female RRMS White 3.371547
2 3 ARR 1.0000000 1.1621344 1 31.44477 Female RRMS White 3.371547
2 4 ARR 1.1621344 1.5062723 1 31.44477 Female RRMS White 3.371547
2 5 ARR 1.5062723 2.2481090 1 31.44477 Female RRMS White 3.371547
2 6 ARR 2.2481090 2.8411302 1 31.44477 Female RRMS White 3.371547
2 7 ARR 2.8411302 3.0169426 1 31.44477 Female RRMS White 3.371547
2 8 ARR 3.0169426 3.8417178 1 31.44477 Female RRMS White 3.371547

Data Layout: Count-based models

In classical count-based models such as Poisson regression, each participant contributes one record, which includes the number of events as main outcome and total length of follow-up (Table 3). For Data layout 3, Id represents the unique identifier of the individual, Disease Course the type of disease, Count the total number of observations during the total Length Time since study start.

data_layout_3 <- data_layout_1 %>%
    group_by(ID) %>%
    reframe(
      DISEASE_COURSE = first(DISEASE_COURSE),
      COUNT = as.numeric(sum(STATUS == 1)), # total number of events 
      LENGHT.TIME = last(TSTOP),
      AGE = first(AGE),
      SEX = first(SEX),
      RACE = first(RACE),
      TIME_SINCE_DIAGNOSIS = first(TIME_SINCE_DIAGNOSIS))

Table 3. Data layout 3: count-based models.

kbl(data_layout_3[1:10,]) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ID DISEASE_COURSE COUNT LENGHT.TIME AGE SEX RACE TIME_SINCE_DIAGNOSIS
1 RRMS 4 10 41.32254 Male White 6.0997219
2 RRMS 9 10 31.44477 Female White 3.3715470
3 RRMS 4 10 42.89048 Male No white 2.8732107
4 RRMS 2 10 42.24095 Female White 1.8018853
5 RRMS 2 10 39.89355 Male No white 3.3193525
6 RRMS 5 10 33.25618 Female White 0.3114489
7 RRMS 4 10 36.04294 Female White 1.8779915
8 RRMS 1 10 44.49765 Male White 0.5070449
9 RRMS 2 10 46.73466 Female White 1.3793769
10 RRMS 5 10 39.21081 Female White 0.5543679