04. Data layout for recurrent event analyses
David Herman
database_layout.Rmd
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
suppressWarnings({
suppressMessages({
library(revents)
library(dplyr)
library(gtsummary)
library(reda)
library(reReg)
library(tibble)
library(tidyr)
library(ggplot2)
library(kableExtra)
})
})
Data layout for modelling recurrent events
Arranging the data in the right layout is an essential first step before modelling recurrent events, and the layout required is mainly determined by the assumptions of the model.
Starting from the raw data (df_revents), we derive the columns needed for the analysis of recurrent events.
# Load the simulated data
data("df_revents", package = "revents")
df_revents <- revents::df_revents
The first step is to derive the time interval (TSTART, TSTOP and TGAP) for each observation, together with the sequence number of each event (SEVENT); the total number of events experienced by each patient can then be counted from the STATUS column. The following code snippet shows how to derive these columns.
final_relapses <- df_revents %>%
  dplyr::select(id, time, status) %>%
  # Pivot to long format: one row per id and source column
  pivot_longer(!id, names_to = "STATUS", values_to = "TIME") %>%
  # Drop the duplicated rows that were introduced by the merge
  distinct() %>%
  # Define the event/censoring flag: 1 for rows that came from `time`, 0 otherwise
  mutate(STATUS = ifelse(STATUS == "time", 1, 0)) %>%
  # Define TSTART and TSTOP within each patient
  group_by(id) %>%
  arrange(id, TIME) %>%
  mutate(TSTART = replace_na(lag(TIME), 0)) %>%
  rename(TSTOP = TIME) %>%
  # Remove zero-length rows where TSTART == TSTOP and both equal 0 or 1
  filter(!(TSTART == TSTOP & TSTART %in% c(0, 1))) %>%
  mutate(PARAMCD = "ARR",            # parameter code: annualized relapse rate
         TGAP = TSTOP - TSTART,      # time between events
         SEVENT = row_number()) %>%  # sequence number of event per id
  dplyr::select(id, PARAMCD, TSTART, TSTOP, TGAP, STATUS, SEVENT)
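The interval logic above can be checked on a small example. The sketch below uses hypothetical data (not taken from df_revents) and derives TSTART, TSTOP, TGAP and SEVENT with lag(), plus a per-patient event total. The names toy_long, toy_layout and NEVENTS are assumptions introduced only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per event or censoring time.
# Patient 1 has events at t = 2 and t = 5 and is censored at t = 8;
# patient 2 has no events and is censored at t = 6.
toy_long <- tibble(
  id     = c(1, 1, 1, 2),
  TIME   = c(5, 2, 8, 6),   # deliberately unsorted
  STATUS = c(1, 1, 0, 0)
)

toy_layout <- toy_long %>%
  group_by(id) %>%
  arrange(TIME, .by_group = TRUE) %>%
  mutate(
    TSTART  = replace_na(lag(TIME), 0),  # previous stop time, 0 for the first row
    TSTOP   = TIME,
    TGAP    = TSTOP - TSTART,            # length of the interval
    SEVENT  = row_number(),              # sequence number within patient
    NEVENTS = sum(STATUS)                # total events per patient (illustrative)
  ) %>%
  ungroup() %>%
  dplyr::select(id, TSTART, TSTOP, TGAP, STATUS, SEVENT, NEVENTS)

print(toy_layout)
```

Sorting inside each patient before calling lag() matters: lag() uses the current row order, so unsorted times would give wrong start times.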
Next, the remaining variables are joined from the original dataset by id. The following code snippet shows how to join them.
# Select the adjustment variables from the original data
df_filter <- df_revents %>%
  dplyr::select(id, Age, Sex, mstype, Race, Time_since_diagnosis)
# Left join by id, rename the columns and drop duplicated rows
data_layout_1 <- final_relapses %>%
  left_join(df_filter, by = "id", relationship = "many-to-many") %>%
  group_by(id) %>%
  arrange(id, TSTOP) %>%
  rename(ID = id,
         AGE = Age,
         SEX = Sex,
         DISEASE_COURSE = mstype,
         RACE = Race,
         TIME_SINCE_DIAGNOSIS = Time_since_diagnosis) %>%
  distinct()
The resulting layout is the one typically used in survival-based models. Layouts differ mainly in the time scale used (CT, TT and GT) and in how the risk of successive events is handled. In the conditional and marginal rate-based models, a subject is assumed not to be at risk for a subsequent event until the current event has finished. In the marginal hazards models (WLW and LWA), by contrast, each participant is simultaneously at risk for every event from the beginning of the study.
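To make the difference in risk intervals concrete, the sketch below builds the intervals for one hypothetical patient under three at-risk conventions: intervals that start at the previous event, intervals reset to zero after each event (gap time), and intervals that all start at study entry (the marginal hazards view). The data and the column names (COND_*, GAP_*, MARG_*) are assumptions for illustration only and are not part of the revents package.

```r
library(dplyr)

# Hypothetical patient: events at t = 2 and t = 5, censored at t = 8
toy <- tibble(
  ID     = 1,
  SEVENT = 1:3,
  TIME   = c(2, 5, 8),
  STATUS = c(1, 1, 0)
)

scales <- toy %>%
  group_by(ID) %>%
  arrange(SEVENT, .by_group = TRUE) %>%
  mutate(
    # At risk for event k only after event k - 1: (previous time, current time]
    COND_START = lag(TIME, default = 0),
    COND_STOP  = TIME,
    # Gap time: the clock restarts at 0 after each event
    GAP_START  = 0,
    GAP_STOP   = TIME - lag(TIME, default = 0),
    # Marginal hazards view: at risk for every event from study entry
    MARG_START = 0,
    MARG_STOP  = TIME
  ) %>%
  ungroup()

print(scales)
```

The stop times are the same in the first and third conventions; only the start of the risk interval changes, which is what the two layouts in this article encode.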
Data Layout: Conditional and marginal rates models
To illustrate these differences, Table 1 displays the data layout for conditional and marginal rate-based models. ID is a unique patient identifier. TSTART and TSTOP delimit the time interval of each observation, and TGAP is the length of that interval (TSTOP - TSTART). STATUS (0 or 1) indicates whether an event occurs at the end of the time interval: it equals 1 if an event was observed at time TSTOP and 0 if TSTOP is a right-censoring time. SEVENT records the event sequence for each patient, which is necessary for stratified models such as PWP and WLW. DISEASE_COURSE defines the patient's group, which in this study is RRMS or SPMS. In data layout 1, patients without relapses have only one line, whereas patients with at least one event have more than one line, with the last line corresponding to the time of right-censoring.
Table 1. Data layout 1: conditional and marginal rate models.
kbl(data_layout_1[1:10,]) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ID | PARAMCD | TSTART | TSTOP | TGAP | STATUS | SEVENT | AGE | SEX | DISEASE_COURSE | RACE | TIME_SINCE_DIAGNOSIS |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ARR | 0.0000000 | 0.2526189 | 0.2526189 | 1 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | ARR | 0.2526189 | 0.4496878 | 0.1970689 | 1 | 2 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | ARR | 0.4496878 | 1.0000000 | 0.5503122 | 0 | 3 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | ARR | 1.0000000 | 2.5320097 | 1.5320097 | 1 | 4 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | ARR | 2.5320097 | 10.0000000 | 7.4679903 | 1 | 5 | 41.32254 | Male | RRMS | White | 6.099722 |
2 | ARR | 0.0000000 | 0.2332276 | 0.2332276 | 1 | 1 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | ARR | 0.2332276 | 1.0000000 | 0.7667724 | 0 | 2 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | ARR | 1.0000000 | 1.1621344 | 0.1621344 | 1 | 3 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | ARR | 1.1621344 | 1.5062723 | 0.3441379 | 1 | 4 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | ARR | 1.5062723 | 2.2481090 | 0.7418367 | 1 | 5 | 31.44477 | Female | RRMS | White | 3.371547 |
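The properties described for data layout 1 can be verified programmatically. The following is a minimal sketch, assuming the column names of Table 1; check_layout() is a hypothetical helper written for this illustration, and the toy data are made up.

```r
library(dplyr)

# Hypothetical helper: one summary row per patient with basic layout checks
check_layout <- function(df) {
  df %>%
    group_by(ID) %>%
    arrange(TSTART, .by_group = TRUE) %>%
    summarise(
      n_rows     = n(),
      n_events   = sum(STATUS == 1),
      # TGAP must equal the interval length
      gap_ok     = all(abs(TGAP - (TSTOP - TSTART)) < 1e-8),
      # each interval starts where the previous one stopped (0 for the first)
      contiguous = all(TSTART == lag(TSTOP, default = 0)),
      # SEVENT runs 1, 2, ..., n within each patient
      sevent_ok  = all(SEVENT == seq_len(n())),
      .groups = "drop"
    )
}

# Toy data in the shape of data layout 1 (values are made up)
toy_layout_1 <- tibble(
  ID     = c(1, 1, 1, 2),
  TSTART = c(0, 2, 5, 0),
  TSTOP  = c(2, 5, 8, 6),
  TGAP   = c(2, 3, 3, 6),
  STATUS = c(1, 1, 0, 0),
  SEVENT = c(1L, 2L, 3L, 1L)
)

layout_checks <- check_layout(toy_layout_1)
print(layout_checks)
```

A patient without events should show n_rows equal to 1 and n_events equal to 0, as patient 2 does here.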
Data Layout: Marginal WLW/LWA models
Data layout 2, for the WLW and LWA models, is arranged so that every participant has the same number of entries (Table 2). That is, each ID has as many lines as the maximum number of events that could be observed. In this example, that maximum is the largest value of SEVENT in data layout 1 (max_SEVENT in the code that follows); Table 2 lists 12 entries for ID 1.
# Find the maximum SEVENT value in the data
max_SEVENT <- max(data_layout_1$SEVENT)
# Create a data frame with all combinations of ID and SEVENT
combinations <- expand.grid(ID = unique(data_layout_1$ID), SEVENT = 1:max_SEVENT)
# Merge all combinations of ID and SEVENT with the original data frame
data_layout_2 <-
  merge(combinations, data_layout_1,
        by = c("ID", "SEVENT"),
        all.x = TRUE
  ) %>%
  arrange(ID, SEVENT) %>%
  # drop TGAP, which is not needed in this layout
  dplyr::select(-TGAP) %>%
  # fill in missing values by carrying the last observed row forward
  fill(PARAMCD, TSTART, TSTOP, STATUS, SEVENT, DISEASE_COURSE, SEX, AGE, RACE,
       TIME_SINCE_DIAGNOSIS
  )
Table 2. Data layout 2: WLW and LWA marginal hazards models.
kbl(data_layout_2[1:20,]) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ID | SEVENT | PARAMCD | TSTART | TSTOP | STATUS | AGE | SEX | DISEASE_COURSE | RACE | TIME_SINCE_DIAGNOSIS |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | ARR | 0.0000000 | 0.2526189 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 2 | ARR | 0.2526189 | 0.4496878 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 3 | ARR | 0.4496878 | 1.0000000 | 0 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 4 | ARR | 1.0000000 | 2.5320097 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 5 | ARR | 2.5320097 | 10.0000000 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 6 | ARR | 2.5320097 | 10.0000000 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 7 | ARR | 2.5320097 | 10.0000000 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 8 | ARR | 2.5320097 | 10.0000000 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 9 | ARR | 2.5320097 | 10.0000000 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 10 | ARR | 2.5320097 | 10.0000000 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 11 | ARR | 2.5320097 | 10.0000000 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
1 | 12 | ARR | 2.5320097 | 10.0000000 | 1 | 41.32254 | Male | RRMS | White | 6.099722 |
2 | 1 | ARR | 0.0000000 | 0.2332276 | 1 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | 2 | ARR | 0.2332276 | 1.0000000 | 0 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | 3 | ARR | 1.0000000 | 1.1621344 | 1 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | 4 | ARR | 1.1621344 | 1.5062723 | 1 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | 5 | ARR | 1.5062723 | 2.2481090 | 1 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | 6 | ARR | 2.2481090 | 2.8411302 | 1 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | 7 | ARR | 2.8411302 | 3.0169426 | 1 | 31.44477 | Female | RRMS | White | 3.371547 |
2 | 8 | ARR | 3.0169426 | 3.8417178 | 1 | 31.44477 | Female | RRMS | White | 3.371547 |
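The padding step can also be written with tidyr::complete() and a grouped fill(). The sketch below is an alternative under stated assumptions, not the code used to produce Table 2: the data are hypothetical, OBSERVED is a helper column introduced here, and padded rows are explicitly set to STATUS = 0, following the usual WLW convention that events a patient never experienced are treated as censored. This matters when a patient's last observed row is an event, because carrying STATUS forward would otherwise repeat that event.

```r
library(dplyr)
library(tidyr)

# Hypothetical data in the shape of data layout 1.
# Patient 2 has a single row that ends in an event.
toy <- tibble(
  ID     = c(1, 1, 1, 2),
  SEVENT = c(1L, 2L, 3L, 1L),
  TSTART = c(0, 2, 5, 0),
  TSTOP  = c(2, 5, 8, 6),
  STATUS = c(1, 1, 0, 1)
)

max_k <- max(toy$SEVENT)

wlw <- toy %>%
  mutate(OBSERVED = TRUE) %>%
  # one row per ID and event number; new rows are flagged as not observed
  complete(ID, SEVENT = seq_len(max_k), fill = list(OBSERVED = FALSE)) %>%
  arrange(ID, SEVENT) %>%
  group_by(ID) %>%
  # carry the last observed interval forward within each patient
  fill(TSTART, TSTOP, STATUS, .direction = "down") %>%
  # padded rows are censored, whatever the carried-forward value was
  mutate(STATUS = ifelse(OBSERVED, STATUS, 0)) %>%
  ungroup() %>%
  dplyr::select(-OBSERVED)

print(wlw)
```

Grouping by ID before fill() guarantees that values are never carried across patients, independently of the row order.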
Data Layout: Count-based models
In classical count-based models such as Poisson regression, each participant contributes one record, which includes the number of events as the main outcome and the total length of follow-up (Table 3). In data layout 3, ID is the unique identifier of the individual, DISEASE_COURSE the type of disease, COUNT the total number of events, and LENGHT.TIME the total length of follow-up since study start.
data_layout_3 <- data_layout_1 %>%
  group_by(ID) %>%
  reframe(
    DISEASE_COURSE = first(DISEASE_COURSE),
    COUNT = as.numeric(sum(STATUS == 1)), # total number of events
    LENGHT.TIME = last(TSTOP),
    AGE = first(AGE),
    SEX = first(SEX),
    RACE = first(RACE),
    TIME_SINCE_DIAGNOSIS = first(TIME_SINCE_DIAGNOSIS))
Table 3. Data layout 3: count-based models.
kbl(data_layout_3[1:10,]) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
ID | DISEASE_COURSE | COUNT | LENGHT.TIME | AGE | SEX | RACE | TIME_SINCE_DIAGNOSIS |
---|---|---|---|---|---|---|---|
1 | RRMS | 4 | 10 | 41.32254 | Male | White | 6.0997219 |
2 | RRMS | 9 | 10 | 31.44477 | Female | White | 3.3715470 |
3 | RRMS | 4 | 10 | 42.89048 | Male | No white | 2.8732107 |
4 | RRMS | 2 | 10 | 42.24095 | Female | White | 1.8018853 |
5 | RRMS | 2 | 10 | 39.89355 | Male | No white | 3.3193525 |
6 | RRMS | 5 | 10 | 33.25618 | Female | White | 0.3114489 |
7 | RRMS | 4 | 10 | 36.04294 | Female | White | 1.8779915 |
8 | RRMS | 1 | 10 | 44.49765 | Male | White | 0.5070449 |
9 | RRMS | 2 | 10 | 46.73466 | Female | White | 1.3793769 |
10 | RRMS | 5 | 10 | 39.21081 | Female | White | 0.5543679 |
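As an illustration of how this layout feeds a count-based model, the sketch below fits a Poisson regression with the log of the follow-up time as an offset, so that the coefficients are interpreted as log event rates. The data frame toy_counts is hypothetical (the values are made up and only reuse the column names of data layout 3), and the model with a single covariate is an assumption chosen to keep the example small.

```r
# Hypothetical one-row-per-patient data with the column names of data layout 3
toy_counts <- data.frame(
  ID             = 1:6,
  DISEASE_COURSE = factor(c("RRMS", "RRMS", "RRMS", "SPMS", "SPMS", "SPMS")),
  COUNT          = c(4, 9, 4, 2, 2, 5),
  LENGHT.TIME    = rep(10, 6)
)

# Crude event rate per patient: events per unit of follow-up
toy_counts$RATE <- toy_counts$COUNT / toy_counts$LENGHT.TIME

# Poisson regression for the event rate; the offset accounts for follow-up length
fit <- glm(
  COUNT ~ DISEASE_COURSE + offset(log(LENGHT.TIME)),
  family = poisson(link = "log"),
  data = toy_counts
)

# Estimated rate in the reference group (RRMS) and the SPMS vs RRMS rate ratio
rate_rrms  <- exp(coef(fit)[["(Intercept)"]])
rate_ratio <- exp(coef(fit)[["DISEASE_COURSESPMS"]])

print(c(rate_rrms = rate_rrms, rate_ratio = rate_ratio))
```

With a single categorical covariate, the fitted rate in each group equals the total number of events divided by the total follow-up time in that group, which gives a quick way to sanity-check the fit.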