Coding Sample, v1

Author

Jordan Tuia

Overview

Aim: to produce a report analyzing a dataset, in which there is an independent variable of interest but several outcome variables of interest. Outcome variables of interest include:

Outcome Variable Description
os_intervention_months median OS in intervention arm only
os_diff difference in OS between intervention and control arm
pfs_intervention median PFS in intervention arm only
pfs_diff difference in PFS between intervention and control arm

Script schema: The script imports and cleans the dataset. It then creates a function for each segment of reporting (figures, tables, statistical analysis) then applies each function to each of the outcome variable of interest

Note

Comments describe code on the following line/chunk

Code
#this codes explains
this.line.of.code = "explained by comment"

Configure environment & load packages

Current R version:

[1] "R version 4.2.2 (2022-10-31 ucrt)"
Code
library(tidyverse)
library(xlsx)
library(ggtext)
library(ggpubr)
library(janitor)
library(flextable)
library(broom)

Package versions in use:

     Package Version
1      broom   1.0.5
2  flextable   0.9.3
3    janitor   2.2.0
4     ggpubr   0.6.0
5     ggtext   0.1.2
6       xlsx   0.6.5
7  lubridate   1.9.2
8    forcats   1.0.0
9    stringr   1.5.0
10     dplyr   1.1.3
11     purrr   1.0.2
12     readr   2.1.4
13     tidyr   1.3.0
14    tibble   3.2.1
15   ggplot2   3.4.3
16 tidyverse   2.0.0

Import & clean data

Code
#reads data from excel sheet
data <- read.xlsx2("data_og.xlsx", sheetName = "og") |> 
  #makes the variable names more compatible for manipulation
  janitor::clean_names() 


dat1 <- data |> 
  #removes variables not in use
  select(rr_intervention, os_intervention_months, pfs_intervention, os_control_months, pfs_contnrol, type) |> 
  #transforms variables into numerical format
  mutate(across(c(rr_intervention, os_intervention_months, pfs_intervention, os_control_months, pfs_contnrol), as.numeric)) |> 
  #categorizes independent variable
  mutate(rr_cat = case_when(
           rr_intervention < 10 ~ "<10", 
           rr_intervention >= 10 & rr_intervention < 20 ~ "10-20", 
           rr_intervention >= 20 & rr_intervention < 30 ~ "20-30", 
           rr_intervention >= 30 ~ "30+", 
           T ~ "other"
         ))  |> 
  #removes rows with NA/missing for the main independent variable
   filter(rr_cat != "other") |> 
  #factorizes categorized variables
  mutate(rr_cat = factor(rr_cat, levels = c("<10", "10-20", "20-30", "30+")), 
         #calculates new variables
         os_diff = os_intervention_months - os_control_months, 
         pfs_diff = pfs_intervention - pfs_contnrol) 


# creates dictionary for color use in later code
col_labs <- c(
  "Overall"="#17162F", 
  "Single" = "#FFB248", 
  "Combination" = "#C76058"
)

# creates another dictionary for color use in later code
col2 <- rev(c(
  "#BF4904", 
  "#D97925", 
  "#F28705", 
  "#F29F05"
))

Functions

Function for figure 1

Objective: show the distribution of each outcome variable, stratified by category. This function will return a figure for each outcome variable of interest.

Code
get_figure1 <- function(outcome_var) {
   
  # Calculates stats test for variable
  stat_test <- kruskal.test(get(outcome_var) ~ rr_cat, data = dat1)
  
  # Calculates labels for plot
  med_lab <- dat1 |> 
    summarize(median = round(median(get(outcome_var), na.rm = T), 1),
              maxz = max(get(outcome_var)), 
              .by = rr_cat)
  
  
  
  plot <- dat1 |> 
    #maps variables to figure
    ggplot(aes(x = rr_cat, y = get(outcome_var), color = rr_cat)) +
    #adds and formats dots
    geom_jitter(alpha = 0.7, width = 0.15) +
    #add & formats boxplots
    geom_boxplot(outlier.shape = NA, size = 0.5, fill = NA) +
    #adds & formats in blue line for overall median 
    geom_hline(yintercept = median(dat1[[outcome_var]], na.rm = T), 
               color = "#1261A6", 
               linetype = 2, 
               size = 0.6, 
               alpha = 0.55) +
    #adds in label for each group
    geom_label(data = med_lab, aes(x = rr_cat, y = median, color = rr_cat, label = median), 
               position = position_nudge(x = -0.4)) +
    #formats the x-axis
    scale_x_discrete(expand = expansion(add = c(0.75, 0))) +
    #maps the correct colors to categories
    scale_color_manual(values = col2, 
                        aesthetics = "color") +
    #adds in the p-value from the stats test
    labs(x = "Overall response rate (%)",
         caption = paste0("*p-value = ", round(stat_test$p.value, 2))) +
    #applies the theme
    theme_minimal() +
    #cleans up theme formatting
    theme(plot.subtitle = element_markdown(), 
          legend.position = "none", 
          panel.grid.major.x = element_blank(), 
          panel.grid.major.y = element_line(linewidth = 0.5),
          panel.grid.minor.y = element_blank())
  
  
  return(plot)

}

Function for summary table

Objective: show summary statistics for each outcome variable, stratified by category. This function will return a table for each outcome variable of interest.

Code
get_sum_table <- function(outcome_var) {
  dat1 |> 
    #orders table by previously set factor
    arrange(rr_cat) |> 
    #renames the variable for easier manipulation
    mutate(val = get(outcome_var)) |> 
    #gets summary statistics
    summarize(median = median(val, na.rm = T),
              percentile.25 = quantile(val, 0.25, na.rm = T),
              percentile.75 = quantile(val, 0.75, na.rm = T),
              min = min(val, na.rm= T),
              max = max(val, na.rm = T), 
              .by = rr_cat) |> 
    #turns into a MS word table
    flextable::as_flextable() |> 
    #deletes footer
    flextable::delete_part(part = "footer") 
  
}

Function for Figure 2

Objective: show the relationship between the continuous independent variable and the outcome variable of interest, stratified by the variable type.

Code
get_figure2 <- function(outcome_var) {
  dat1 |> 
    #maps variables to figure
    ggplot(aes(y = get(outcome_var), x = rr_intervention, color = type)) +
    #adds in dots
    geom_point() +
    #adds in linear regression, stratified by the type variable
    geom_smooth(method = lm, se = FALSE) +
    #maps colors from dictionary
    scale_color_manual(values = col_labs, 
                       aesthetics = "color") +
    #adds x-axis title 
    labs(x = "Overall response rate (%)", 
         #adds in header displaying the corresponding colors
         title = glue::glue("<span style = 'color:{col_labs[2]};'>Single agent</span> and <span style = 'color:{col_labs[3]};'>Combined</span> drugs")) +
    #adds in equation labels
    stat_regline_equation( aes(label = ..eq.label..)) +
    #applies theme
    theme_minimal() +
    #cleans up theme
    theme(plot.title = element_markdown(), 
          legend.position = "none")
  
  
}

Function for Model 1

Objective: Provide linear regression details from the model shown in Figure 2

Code
get_model1 <- function(outcome_var) {
  model_plot <- dat1 |>
    #groups data by stratifying variable, 'type'
    group_by(type) |>
    #Runs linear regression for each stratified group for outcome variable
    do(mod = lm(get(outcome_var) ~ rr_intervention, data = .)) |>
    #Extracts the models as a list, with the stratifying variable (type) as the name
    pull(mod, type) |>
    #applies the 'summary' function for each model
    lapply(summary)  
  
  return(model_plot)

  
}

Function for Model 2

Objective: Output of the linear regression as an interaction between the 2 non-outcome variables (rr_intervention & type)

Code
get_model2 <- function(outcome_var) {
  model2 <- lm(get(outcome_var) ~  rr_intervention * type, data = dat1)
return(summary(model2))

}

Function for Model 3

Objective: Output of the linear regression with the 2 non-outcome variables (rr_intervention & type)

Code
get_model3 <- function(outcome_var) {

model3 <- lm(get(outcome_var) ~  rr_intervention + type, data = dat1)

return(model3)

}

Outputs by outcome variables

Outcome A: OS in intervention

Figure 1A

Code
get_figure1("os_intervention_months") +
  ylab("OS in intervention (months)")

Table 1A

Code
get_sum_table("os_intervention_months")

rr_cat

median

percentile.25

percentile.75

min

max

factor

numeric

numeric

numeric

numeric

numeric

<10

10.4

8.8

12.0

7.1

13.7

10-20

17.7

14.4

26.2

8.3

48.0

20-30

13.9

11.3

17.7

8.6

24.0

30+

21.7

16.5

34.1

8.7

59.2

Figure 2A

Code
get_figure2("os_intervention_months")+
  ylab("OS in intervention (months)")

Model 1A

Code
get_model1("os_intervention_months")$Single

Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.286  -6.598  -1.609   3.649  36.093 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.23294    4.38502   1.649 0.107753    
rr_intervention  0.35956    0.09156   3.927 0.000373 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.19 on 36 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.2999,    Adjusted R-squared:  0.2804 
F-statistic: 15.42 on 1 and 36 DF,  p-value: 0.0003729
Code
get_model1("os_intervention_months")$Combination

Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.997  -6.312  -4.375   7.802  16.495 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)       5.1206    12.0306   0.426    0.683  
rr_intervention   0.5215     0.2243   2.325    0.053 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.17 on 7 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.4358,    Adjusted R-squared:  0.3552 
F-statistic: 5.408 on 1 and 7 DF,  p-value: 0.05297

Model 2A

Code
get_model2("os_intervention_months")

Call:
lm(formula = get(outcome_var) ~ rr_intervention * type, data = dat1)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.997  -6.520  -1.683   4.817  36.093 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)  
(Intercept)                  5.1206    10.5356   0.486   0.6294  
rr_intervention              0.5215     0.1964   2.655   0.0111 *
typeSingle                   2.1124    11.4645   0.184   0.8547  
rr_intervention:typeSingle  -0.1619     0.2179  -0.743   0.4614  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.54 on 43 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.3703,    Adjusted R-squared:  0.3264 
F-statistic:  8.43 on 3 and 43 DF,  p-value: 0.000161

Model 3A

Code
get_model3("os_intervention_months")

Call:
lm(formula = get(outcome_var) ~ rr_intervention + type, data = dat1)

Coefficients:
    (Intercept)  rr_intervention       typeSingle  
         11.691            0.390           -5.783  

Outcome B: Difference in OS

Figure 1B

Code
get_figure1("os_diff") + 
   ylab( "Difference in OS (months)")

Table 1B

Code
get_sum_table("os_diff")

rr_cat

median

percentile.25

percentile.75

min

max

factor

numeric

numeric

numeric

numeric

numeric

<10

1.2

0.9

1.5

0.6

1.8

10-20

-2.8

-18.8

4.5

-34.7

11.7

20-30

4.3

4.3

4.3

4.3

4.3

30+

4.4

2.0

9.4

-8.3

16.7

Figure 2B

Code
get_figure2("os_diff")+ 
   ylab( "Difference in OS (months)")

Model 1B

Code
get_model1("os_diff")$Single

Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9410 -3.3640  0.6332  4.5754  7.5453 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.85950    3.03207   1.603    0.133
rr_intervention -0.05421    0.06616  -0.819    0.427

Residual standard error: 5.801 on 13 degrees of freedom
  (41 observations deleted due to missingness)
Multiple R-squared:  0.04911,   Adjusted R-squared:  -0.02403 
F-statistic: 0.6714 on 1 and 13 DF,  p-value: 0.4273
Code
get_model1("os_diff")$Combination

Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.414  -8.564   3.266   8.395  15.060 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)     -27.8169    14.7229  -1.889   0.1077  
rr_intervention   0.5786     0.2610   2.217   0.0685 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.07 on 6 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.4503,    Adjusted R-squared:  0.3587 
F-statistic: 4.915 on 1 and 6 DF,  p-value: 0.06848

Model 2B

Code
get_model2("os_diff")

Call:
lm(formula = get(outcome_var) ~ rr_intervention * type, data = dat1)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.4138  -5.0971   0.6332   7.1972  15.0605 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)   
(Intercept)                -27.8169     9.8829  -2.815  0.01106 * 
rr_intervention              0.5786     0.1752   3.303  0.00374 **
typeSingle                  32.6764    10.8949   2.999  0.00737 **
rr_intervention:typeSingle  -0.6328     0.2017  -3.137  0.00543 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.774 on 19 degrees of freedom
  (45 observations deleted due to missingness)
Multiple R-squared:  0.3712,    Adjusted R-squared:  0.2719 
F-statistic: 3.739 on 3 and 19 DF,  p-value: 0.02885

Model 3B

Code
get_model3("os_diff")

Call:
lm(formula = get(outcome_var) ~ rr_intervention + type, data = dat1)

Coefficients:
    (Intercept)  rr_intervention       typeSingle  
        -2.2578           0.1014           0.9155  

Outcome C: PFS in intervention

Figure 1C

Code
get_figure1("pfs_intervention") +
  ylab("PFS in intervention (months)")

Table 1C

Code
get_sum_table("pfs_intervention")

rr_cat

median

percentile.25

percentile.75

min

max

factor

numeric

numeric

numeric

numeric

numeric

<10

3.1

2.5

3.6

2.0

4.2

10-20

5.5

5.5

5.6

5.5

5.6

20-30

6.1

4.2

7.3

2.3

7.3

30+

11.1

8.5

15.5

3.5

34.0

Figure 2C

Code
get_figure2("pfs_intervention")+
  ylab("PFS in intervention (months)")

Model 1C

Code
get_model1("pfs_intervention")$Single

Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5663 -3.0900 -0.8781  1.7883 13.0154 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.85104    1.85648  -0.458    0.649    
rr_intervention  0.25995    0.03705   7.016 1.38e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.607 on 42 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared:  0.5396,    Adjusted R-squared:  0.5286 
F-statistic: 49.22 on 1 and 42 DF,  p-value: 1.381e-08
Code
get_model1("pfs_intervention")$Combination

Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0836 -2.9441  0.1177  2.4055  5.0262 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)      0.37055    4.00052   0.093   0.9292  
rr_intervention  0.22876    0.06968   3.283   0.0168 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.462 on 6 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.6424,    Adjusted R-squared:  0.5828 
F-statistic: 10.78 on 1 and 6 DF,  p-value: 0.01676

Model 2C

Code
get_model2("pfs_intervention")

Call:
lm(formula = get(outcome_var) ~ rr_intervention * type, data = dat1)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5663 -3.0569 -0.6676  1.9185 13.0154 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)  
(Intercept)                 0.37055    5.17570   0.072   0.9432  
rr_intervention             0.22876    0.09014   2.538   0.0145 *
typeSingle                 -1.22158    5.48151  -0.223   0.8246  
rr_intervention:typeSingle  0.03119    0.09708   0.321   0.7494  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.48 on 48 degrees of freedom
  (16 observations deleted due to missingness)
Multiple R-squared:  0.5531,    Adjusted R-squared:  0.5252 
F-statistic:  19.8 on 3 and 48 DF,  p-value: 1.709e-08

Model 3C

Code
get_model3("pfs_intervention")

Call:
lm(formula = get(outcome_var) ~ rr_intervention + type, data = dat1)

Coefficients:
    (Intercept)  rr_intervention       typeSingle  
        -1.0995           0.2557           0.4481  

Outcome D: Difference in PFS

Figure 1D

Code
get_figure1("pfs_diff") + 
   ylab( "Difference in PFS (months)")

Table 1D

Code
get_sum_table("pfs_diff")

rr_cat

median

percentile.25

percentile.75

min

max

factor

numeric

numeric

numeric

numeric

numeric

<10

2.7

2.7

2.7

2.7

2.7

10-20

1.7

1.7

1.7

1.7

1.7

20-30

3.4

3.4

3.4

3.4

3.4

30+

5.6

4.9

7.2

1.7

13.0

Figure 2D

Code
get_figure2("pfs_diff")+ 
   ylab("Difference in PFS (months)")

Model 1D

Code
get_model1("pfs_diff")$Single

Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8632 -1.9693 -0.7168  0.7825  7.7097 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      3.10732    2.12029   1.466    0.168
rr_intervention  0.04548    0.04326   1.051    0.314

Residual standard error: 3.174 on 12 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.08432,   Adjusted R-squared:  0.008016 
F-statistic: 1.105 on 1 and 12 DF,  p-value: 0.3139
Code
get_model1("pfs_diff")$Combination

Call:
lm(formula = get(outcome_var) ~ rr_intervention, data = .)

Residuals:
      1       3       4       7       9      10      11 
-0.7935  1.3842  1.3842  0.9509 -1.3956 -1.1472 -0.3830 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)     5.933633   2.353093   2.522   0.0531 .
rr_intervention 0.004479   0.038660   0.116   0.9123  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.325 on 5 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.002677,  Adjusted R-squared:  -0.1968 
F-statistic: 0.01342 on 1 and 5 DF,  p-value: 0.9123

Model 2D

Code
get_model2("pfs_diff")

Call:
lm(formula = get(outcome_var) ~ rr_intervention * type, data = dat1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8632 -1.1995 -0.7079  0.9509  7.7097 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                 5.933633   4.905736   1.210    0.243
rr_intervention             0.004479   0.080599   0.056    0.956
typeSingle                 -2.826315   5.241180  -0.539    0.597
rr_intervention:typeSingle  0.041001   0.088957   0.461    0.651

Residual standard error: 2.762 on 17 degrees of freedom
  (47 observations deleted due to missingness)
Multiple R-squared:  0.1117,    Adjusted R-squared:  -0.04509 
F-statistic: 0.7123 on 3 and 17 DF,  p-value: 0.558

Model 3D

Code
get_model3("pfs_diff")

Call:
lm(formula = get(outcome_var) ~ rr_intervention + type, data = dat1)

Coefficients:
    (Intercept)  rr_intervention       typeSingle  
        3.93193          0.03814         -0.49483