Coding Sample, v1
Overview
Aim: to produce a report analyzing a dataset, in which there is an independent variable of interest but several outcome variables of interest. Outcome variables of interest include:
Outcome Variable | Description |
---|---|
os_intervention_months |
median OS in intervention arm only |
os_diff |
difference in OS between intervention and control arm |
pfs_intervention |
median PFS in intervention arm only |
pfs_diff |
difference in PFS between intervention and control arm |
Script schema: The script imports and cleans the dataset. It then creates a function for each segment of reporting (figures, tables, statistical analysis) then applies each function to each of the outcome variable of interest
Comments describe code on the following line/chunk
Configure environment & load packages
Current R version:
[1] "R version 4.2.2 (2022-10-31 ucrt)"
Package versions in use:
Package Version
1 broom 1.0.5
2 flextable 0.9.3
3 janitor 2.2.0
4 ggpubr 0.6.0
5 ggtext 0.1.2
6 xlsx 0.6.5
7 lubridate 1.9.2
8 forcats 1.0.0
9 stringr 1.5.0
10 dplyr 1.1.3
11 purrr 1.0.2
12 readr 2.1.4
13 tidyr 1.3.0
14 tibble 3.2.1
15 ggplot2 3.4.3
16 tidyverse 2.0.0
Import & clean data
Code
#reads data from excel sheet
data <- read.xlsx2("data_og.xlsx", sheetName = "og") |>
#makes the variable names more compatible for manipulation
janitor::clean_names()
dat1 <- data |>
#removes variables not in use
select(rr_intervention, os_intervention_months, pfs_intervention, os_control_months, pfs_contnrol, type) |>
#transforms variables into numerical format
mutate(across(c(rr_intervention, os_intervention_months, pfs_intervention, os_control_months, pfs_contnrol), as.numeric)) |>
#categorizes independent variable
mutate(rr_cat = case_when(
rr_intervention < 10 ~ "<10",
rr_intervention >= 10 & rr_intervention < 20 ~ "10-20",
rr_intervention >= 20 & rr_intervention < 30 ~ "20-30",
rr_intervention >= 30 ~ "30+",
T ~ "other"
)) |>
#removes rows with NA/missing for the main independent variable
filter(rr_cat != "other") |>
#factorizes categorized variables
mutate(rr_cat = factor(rr_cat, levels = c("<10", "10-20", "20-30", "30+")),
#calculates new variables
os_diff = os_intervention_months - os_control_months,
pfs_diff = pfs_intervention - pfs_contnrol)
# creates dictionary for color use in later code
col_labs <- c(
"Overall"="#17162F",
"Single" = "#FFB248",
"Combination" = "#C76058"
)
# creates another dictionary for color use in later code
col2 <- rev(c(
"#BF4904",
"#D97925",
"#F28705",
"#F29F05"
))
Functions
Function for figure 1
Objective: show the distribution of each outcome variable, stratified by category. This function will return a figure for each outcome variable of interest.
Code
get_figure1 <- function(outcome_var) {
# Calculates stats test for variable
stat_test <- kruskal.test(get(outcome_var) ~ rr_cat, data = dat1)
# Calculates labels for plot
med_lab <- dat1 |>
summarize(median = round(median(get(outcome_var), na.rm = T), 1),
maxz = max(get(outcome_var)),
.by = rr_cat)
plot <- dat1 |>
#maps variables to figure
ggplot(aes(x = rr_cat, y = get(outcome_var), color = rr_cat)) +
#adds and formats dots
geom_jitter(alpha = 0.7, width = 0.15) +
#add & formats boxplots
geom_boxplot(outlier.shape = NA, size = 0.5, fill = NA) +
#adds & formats in blue line for overall median
geom_hline(yintercept = median(dat1[[outcome_var]], na.rm = T),
color = "#1261A6",
linetype = 2,
size = 0.6,
alpha = 0.55) +
#adds in label for each group
geom_label(data = med_lab, aes(x = rr_cat, y = median, color = rr_cat, label = median),
position = position_nudge(x = -0.4)) +
#formats the x-axis
scale_x_discrete(expand = expansion(add = c(0.75, 0))) +
#maps the correct colors to categories
scale_color_manual(values = col2,
aesthetics = "color") +
#adds in the p-value from the stats test
labs(x = "Overall response rate (%)",
caption = paste0("*p-value = ", round(stat_test$p.value, 2))) +
#applies the theme
theme_minimal() +
#cleans up theme formatting
theme(plot.subtitle = element_markdown(),
legend.position = "none",
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(linewidth = 0.5),
panel.grid.minor.y = element_blank())
return(plot)
}
Function for summary table
Objective: show summary statistics for each outcome variable, stratified by category. This function will return a table for each outcome variable of interest.
Code
get_sum_table <- function(outcome_var) {
dat1 |>
#orders table by previously set factor
arrange(rr_cat) |>
#renames the variable for easier manipulation
mutate(val = get(outcome_var)) |>
#gets summary statistics
summarize(median = median(val, na.rm = T),
percentile.25 = quantile(val, 0.25, na.rm = T),
percentile.75 = quantile(val, 0.75, na.rm = T),
min = min(val, na.rm= T),
max = max(val, na.rm = T),
.by = rr_cat) |>
#turns into a MS word table
flextable::as_flextable() |>
#deletes footer
flextable::delete_part(part = "footer")
}
Function for Figure 2
Objective: show the relationship between the continuous independent variable and the outcome variable of interest, stratified by the variable type
.
Code
get_figure2 <- function(outcome_var) {
dat1 |>
#maps variables to figure
ggplot(aes(y = get(outcome_var), x = rr_intervention, color = type)) +
#adds in dots
geom_point() +
#adds in linear regression, stratified by the type variable
geom_smooth(method = lm, se = FALSE) +
#maps colors from dictionary
scale_color_manual(values = col_labs,
aesthetics = "color") +
#adds x-axis title
labs(x = "Overall response rate (%)",
#adds in header displaying the corresponding colors
title = glue::glue("<span style = 'color:{col_labs[2]};'>Single agent</span> and <span style = 'color:{col_labs[3]};'>Combined</span> drugs")) +
#adds in equation labels
stat_regline_equation( aes(label = ..eq.label..)) +
#applies theme
theme_minimal() +
#cleans up theme
theme(plot.title = element_markdown(),
legend.position = "none")
}
Function for Model 1
Objective: Provide linear regression details from the model shown in Figure 2
Code
get_model1 <- function(outcome_var) {
model_plot <- dat1 |>
#groups data by stratifying variable, 'type'
group_by(type) |>
#Runs linear regression for each stratified group for outcome variable
do(mod = lm(get(outcome_var) ~ rr_intervention, data = .)) |>
#Extracts the models as a list, with the stratifying variable (type) as the name
pull(mod, type) |>
#applies the 'summary' function for each model
lapply(summary)
return(model_plot)
}
Function for Model 2
Objective: Output of the linear regression as an interaction between the 2 non-outcome variables (rr_intervention & type)
Function for Model 3
Objective: Output of the linear regression with the 2 non-outcome variables (rr_intervention & type)
Outputs by outcome variables
Outcome A: OS in intervention
Figure 1A
Table 1A
Figure 2A
Model 1A
Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)
Residuals:
Min 1Q Median 3Q Max
-18.286 -6.598 -1.609 3.649 36.093
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.23294 4.38502 1.649 0.107753
rr_intervention 0.35956 0.09156 3.927 0.000373 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.19 on 36 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.2999, Adjusted R-squared: 0.2804
F-statistic: 15.42 on 1 and 36 DF, p-value: 0.0003729
Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)
Residuals:
Min 1Q Median 3Q Max
-21.997 -6.312 -4.375 7.802 16.495
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1206 12.0306 0.426 0.683
rr_intervention 0.5215 0.2243 2.325 0.053 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.17 on 7 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.4358, Adjusted R-squared: 0.3552
F-statistic: 5.408 on 1 and 7 DF, p-value: 0.05297
Model 2A
Call:
lm(formula = get(outcome_var) ~ rr_intervention * type, data = dat1)
Residuals:
Min 1Q Median 3Q Max
-21.997 -6.520 -1.683 4.817 36.093
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1206 10.5356 0.486 0.6294
rr_intervention 0.5215 0.1964 2.655 0.0111 *
typeSingle 2.1124 11.4645 0.184 0.8547
rr_intervention:typeSingle -0.1619 0.2179 -0.743 0.4614
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.54 on 43 degrees of freedom
(21 observations deleted due to missingness)
Multiple R-squared: 0.3703, Adjusted R-squared: 0.3264
F-statistic: 8.43 on 3 and 43 DF, p-value: 0.000161
Model 3A
Outcome B: Difference in OS
Figure 1B
Table 1B
Figure 2B
Model 1B
Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)
Residuals:
Min 1Q Median 3Q Max
-8.9410 -3.3640 0.6332 4.5754 7.5453
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.85950 3.03207 1.603 0.133
rr_intervention -0.05421 0.06616 -0.819 0.427
Residual standard error: 5.801 on 13 degrees of freedom
(41 observations deleted due to missingness)
Multiple R-squared: 0.04911, Adjusted R-squared: -0.02403
F-statistic: 0.6714 on 1 and 13 DF, p-value: 0.4273
Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)
Residuals:
Min 1Q Median 3Q Max
-17.414 -8.564 3.266 8.395 15.060
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -27.8169 14.7229 -1.889 0.1077
rr_intervention 0.5786 0.2610 2.217 0.0685 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.07 on 6 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.4503, Adjusted R-squared: 0.3587
F-statistic: 4.915 on 1 and 6 DF, p-value: 0.06848
Model 2B
Call:
lm(formula = get(outcome_var) ~ rr_intervention * type, data = dat1)
Residuals:
Min 1Q Median 3Q Max
-17.4138 -5.0971 0.6332 7.1972 15.0605
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -27.8169 9.8829 -2.815 0.01106 *
rr_intervention 0.5786 0.1752 3.303 0.00374 **
typeSingle 32.6764 10.8949 2.999 0.00737 **
rr_intervention:typeSingle -0.6328 0.2017 -3.137 0.00543 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.774 on 19 degrees of freedom
(45 observations deleted due to missingness)
Multiple R-squared: 0.3712, Adjusted R-squared: 0.2719
F-statistic: 3.739 on 3 and 19 DF, p-value: 0.02885
Model 3B
Outcome C: PFS in intervention
Figure 1C
Table 1C
Figure 2C
Model 1C
Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)
Residuals:
Min 1Q Median 3Q Max
-7.5663 -3.0900 -0.8781 1.7883 13.0154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.85104 1.85648 -0.458 0.649
rr_intervention 0.25995 0.03705 7.016 1.38e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.607 on 42 degrees of freedom
(12 observations deleted due to missingness)
Multiple R-squared: 0.5396, Adjusted R-squared: 0.5286
F-statistic: 49.22 on 1 and 42 DF, p-value: 1.381e-08
Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)
Residuals:
Min 1Q Median 3Q Max
-4.0836 -2.9441 0.1177 2.4055 5.0262
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.37055 4.00052 0.093 0.9292
rr_intervention 0.22876 0.06968 3.283 0.0168 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.462 on 6 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.6424, Adjusted R-squared: 0.5828
F-statistic: 10.78 on 1 and 6 DF, p-value: 0.01676
Model 2C
Call:
lm(formula = get(outcome_var) ~ rr_intervention * type, data = dat1)
Residuals:
Min 1Q Median 3Q Max
-7.5663 -3.0569 -0.6676 1.9185 13.0154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.37055 5.17570 0.072 0.9432
rr_intervention 0.22876 0.09014 2.538 0.0145 *
typeSingle -1.22158 5.48151 -0.223 0.8246
rr_intervention:typeSingle 0.03119 0.09708 0.321 0.7494
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.48 on 48 degrees of freedom
(16 observations deleted due to missingness)
Multiple R-squared: 0.5531, Adjusted R-squared: 0.5252
F-statistic: 19.8 on 3 and 48 DF, p-value: 1.709e-08
Model 3C
Outcome D: Difference in PFS
Figure 1D
Table 1D
Figure 2D
Model 1D
Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)
Residuals:
Min 1Q Median 3Q Max
-3.8632 -1.9693 -0.7168 0.7825 7.7097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.10732 2.12029 1.466 0.168
rr_intervention 0.04548 0.04326 1.051 0.314
Residual standard error: 3.174 on 12 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.08432, Adjusted R-squared: 0.008016
F-statistic: 1.105 on 1 and 12 DF, p-value: 0.3139
Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)
Residuals:
1 3 4 7 9 10 11
-0.7935 1.3842 1.3842 0.9509 -1.3956 -1.1472 -0.3830
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.933633 2.353093 2.522 0.0531 .
rr_intervention 0.004479 0.038660 0.116 0.9123
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.325 on 5 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.002677, Adjusted R-squared: -0.1968
F-statistic: 0.01342 on 1 and 5 DF, p-value: 0.9123
Model 2D
Call:
lm(formula = get(outcome_var) ~ rr_intervention * type, data = dat1)
Residuals:
Min 1Q Median 3Q Max
-3.8632 -1.1995 -0.7079 0.9509 7.7097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.933633 4.905736 1.210 0.243
rr_intervention 0.004479 0.080599 0.056 0.956
typeSingle -2.826315 5.241180 -0.539 0.597
rr_intervention:typeSingle 0.041001 0.088957 0.461 0.651
Residual standard error: 2.762 on 17 degrees of freedom
(47 observations deleted due to missingness)
Multiple R-squared: 0.1117, Adjusted R-squared: -0.04509
F-statistic: 0.7123 on 3 and 17 DF, p-value: 0.558