Replications for introduction to g-methods: standard linear regression (part 0)

4 minute read

Published:

This tutorial series aim to replicate g-methods explained in this paper by Naimi, A. I., Cole, S. R., and Kennedy, E. H (2017)1 using R. Originally, the paper used SAS to demonstrate g-methods.

In this tutoiral, we will focus on replicating the data generation and estimates from the standard linear regression model to understand the crude differences between treated and control groups.

The empirical setting is to treat HIV with a therapy regimen (\(A\)) in two time periods (\(t = 0, t = 1\)). Additionally, we measure the time-varying confounder, HIV viral load (\(Z\)) at times \(t = 0\) and \(t = 1\). Note that this time-varying confounder is measured prior to the treatment administered at each time period. Also, we assume \(Z\) at time 0 is 1 (high, bad health condition) for all subjects. Our outcome is the CD4 count (cells/mm\(^3\)) observed at \(t = 2\).

Thus, we have:

Under the identifying assumptions described in the paper, we will estimate the average casual effect of always taking treatment (\(a_0 = 1, a_1 = 1\)), compared to never taking treatment (\(a_0 = 0, a_1 = 0\)) in both time periods. For notation, we are using subscripts to indicate time periods.

# Read packages 
library(tidyverse)
library(ascii)

We will begin with the standard linear regression to peak the crude differences between always-treated and never-treated in their outcome.

# Replication

# Data in Table 1 
# A0: Treatment at time 0 (for HIV)
# Z1: HIV viral load measurement prior to time 1 (Higher: Worse health condition)
# A1: Treatment at time 1
# Y: Mean of CD4 counts 
# N: Number of subjects 


a0 <- c(0, 0, 0, 0, 1, 1, 1, 1)
z1 <- c(0, 0, 1, 1, 0, 0, 1, 1)
a1 <- c(0, 1, 0, 1, 0, 1, 0, 1)
y <- c(87.288, 112.107, 119.654, 144.842, 105.282, 130.184, 137.720, 162.832)
n <- c(209271, 93779, 60657, 136293, 134781, 60789, 93903, 210527)

data <- data.frame(a0, z1, a1, y, n)

Replicates for Table 1

a0z1a1yn
00087.288209271
001112.10793779
010119.65460657
011144.842136293
100105.282134781
101130.18460789
110137.72093903
111162.832210527

Now, we fit the data with standard linear regression models to replicate Table 2.

Standard Linear Regressions

# Section 1
# Standard linear regression models in Table 2 
# Not considering standard error estimation 

# Crude model 
lm_1 <- lm(y ~ a0 + a1 + a0*a1, data = data, weights = n)

# Z-Adjusted model 
lm_2 <- lm(y ~ a0 + a1 + a0*a1 + z1, data = data, weights = n)

# A0 only model 
lm_3 <- lm(y ~ a0, data = data, weights = n)

# Z1-Adjusted A0 Only Model
lm_4 <- lm(y ~ a0 + z1, data = data, weights = n)

# A1 Only Model
lm_5 <- lm(y ~ a1, data = data, weights = n)

# Z1-Adjusted A1 Only Model
lm_6 <- lm(y ~ a1 + z1, data = data, weights = n)

We now check the crude difference in mean CD4 count for the always treated compared to the never treated. In the first model, \(\hat{\beta} = 60.96\) as reported in the paper. In model two, \(\hat{\beta} = 43.01\), which is slightly larger than as it is reported. This estimate is the \(Z_1\)-adjusted difference in mean CD4 count for the same contrast. The results from other four models are reported in Table below.

In the next post, we will move onto fitting g-methods to the data in Table 1, beginning with g-formula.

table2 <- data.frame(model = c("m1", "m2", "m3", "m4", "m5", "m6"), 
                     beta1 = c(lm_1$coefficients["a0"] + lm_1$coefficients["a1"] + lm_1$coefficients["a0:a1"], 
                               lm_2$coefficients["a0"] + lm_2$coefficients["a1"] + lm_2$coefficients["a0:a1"],
                               lm_3$coefficients["a0"],
                               lm_4$coefficients["a0"],
                               lm_5$coefficients["a1"],
                               lm_6$coefficients["a1"])) %>% 
  mutate(beta1 = round(beta1, 2))

Replicates for Table 2

modelbeta1
m160.96
m243.01
m327.08
m418.03
m538.91
m625.01
  1. Naimi, A. I., Cole, S. R., & Kennedy, E. H. (2017). An introduction to g methods. International journal of epidemiology, 46(2), 756-762.