RMarkdown enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents that can help you communicate your data analysis and research findings more effectively.
Linear regression models (also known as “Ordinary Least Squares” or OLS models) allow us to determine whether a change in the value of one variable is associated with a change in the value of another variable. In other words, if I make a 1-unit change in X, how much does Y change? In fact, linear regression resembles the algebraic equation for a straight line (Y = mX + b, where m is the slope, X is the variable that is changing, and b is the Y-intercept). In biostatistics, we use linear regression models to test the association between two or more variables when the outcome is a continuous data type.
In many textbooks, the linear regression model is called the “Ordinary Least Squares” or OLS model because it minimizes the squared errors (i.e., the distances from the best-fit line). There are several conditions that need to be satisfied in order for us to use the results from a linear regression model. These include:

* The association between X and Y is linear
* The observations are independent
* The variance of the residuals is constant across the fitted values (homoscedasticity)
* The residuals are normally distributed

However, the linear regression model is pretty robust to violations of these assumptions; hence, its popularity. Regardless, it is good practice to test them. Moreover, the model is very easy to interpret, as this article will demonstrate.
The structural form of a linear regression model:
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
Typical notations of the linear regression model include:

* y_i : the outcome (dependent variable) for subject i
* β0 : the intercept, the expected value of Y when X is equal to zero
* β1 : the regression coefficient (slope) for the predictor, the change in Y for a 1-unit change in X
* x_i : the predictor (independent variable) for subject i
* ε_i : the error (residual) term for subject i
The UC Irvine Machine Learning Repository contains many datasets that are used to train and validate machine learning models. You can learn more about their work at their website.
This tutorial will use the Pima Indians Diabetes Dataset, which you can download from this GitHub location. The data was originally posted on the UC Irvine Machine Learning Repository, but it can also be found on Kaggle. The Pima Indians Diabetes Dataset originated from the National Institute of Diabetes and Digestive and Kidney Diseases as part of a study to predict diabetes among adult females (21 years old and older) of Pima Indian heritage.
You will need to install and load the following packages for this tutorial, including ggplot2 and predict3d.
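If any of these packages are not already installed, something along the following lines should work (a sketch; the GitHub repository name for the development version of predict3d is an assumption on my part):

### Install the packages used in this tutorial (only needs to be run once)
install.packages(c("ggplot2", "devtools", "predict3d", "psych", "dplyr",
                   "gtsummary", "DescTools", "nortest", "lmtest", "sandwich"))

### Optionally, the development version of predict3d can be installed from GitHub
### (repository name assumed)
# devtools::install_github("cardiomoon/predict3d")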
### Load the libraries
library("ggplot2")
library("devtools")
library("predict3d")
library("psych")
library("dplyr")
library("gtsummary")
library("DescTools")
library("nortest")
library("lmtest")
library("sandwich")
Load the data from the GitHub site. You can use the knitr::kable function to display the first few rows as a table.
### Load the data from GitHub (the URL below is a placeholder; substitute the raw CSV link for the dataset)
diabetes.data <- read.csv("<raw GitHub URL to the Pima Indians Diabetes CSV>", header = TRUE)

### Preview the first few rows
knitr::kable(head(diabetes.data))
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|
6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
Once the data have been loaded, take a look at the summary statistics. You can do this using the describeBy function, which is part of the psych package.
knitr::kable( describeBy(diabetes.data) %>% round(2) )
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Pregnancies | 1 | 768 | 3.85 | 3.37 | 3.00 | 3.46 | 2.97 | 0.00 | 17.00 | 17.00 | 0.90 | 0.14 | 0.12 |
Glucose | 2 | 768 | 120.89 | 31.97 | 117.00 | 119.38 | 29.65 | 0.00 | 199.00 | 199.00 | 0.17 | 0.62 | 1.15 |
BloodPressure | 3 | 768 | 69.11 | 19.36 | 72.00 | 71.36 | 11.86 | 0.00 | 122.00 | 122.00 | -1.84 | 5.12 | 0.70 |
SkinThickness | 4 | 768 | 20.54 | 15.95 | 23.00 | 19.94 | 17.79 | 0.00 | 99.00 | 99.00 | 0.11 | -0.53 | 0.58 |
Insulin | 5 | 768 | 79.80 | 115.24 | 30.50 | 56.75 | 45.22 | 0.00 | 846.00 | 846.00 | 2.26 | 7.13 | 4.16 |
BMI | 6 | 768 | 31.99 | 7.88 | 32.00 | 31.96 | 6.82 | 0.00 | 67.10 | 67.10 | -0.43 | 3.24 | 0.28 |
DiabetesPedigreeFunction | 7 | 768 | 0.47 | 0.33 | 0.37 | 0.42 | 0.25 | 0.08 | 2.42 | 2.34 | 1.91 | 5.53 | 0.01 |
Age | 8 | 768 | 33.24 | 11.76 | 29.00 | 31.54 | 10.38 | 21.00 | 81.00 | 60.00 | 1.13 | 0.62 | 0.42 |
Outcome | 9 | 768 | 0.35 | 0.48 | 0.00 | 0.31 | 0.00 | 0.00 | 1.00 | 1.00 | 0.63 | -1.60 | 0.02 |
There are 768 subjects and nine variables. We can see that the average number of pregnancies is 3.85 with a standard deviation (SD) of 3.37. We also see that the average age of the sample was 33.24 years (SD, 11.76), the average BMI was 31.99 (SD, 7.88), and the average glucose level was 120.89 (SD, 31.97).
Let’s suppose our main research question is to determine whether Age is associated with Glucose level. We will set Glucose level as our dependent variable and Age as our independent variable (or predictor of interest).

Visualize the association between Age and Glucose level

Let’s look at how Age is related to Glucose level by plotting their relationship using ggplot. There appears to be a positive relationship: as Age increases, Glucose level also increases.
ggplot(diabetes.data, aes(x = Age, y = Glucose)) + geom_point() + stat_smooth()
Now, we can construct a linear regression model with Glucose level as the dependent variable and Age as the independent variable (or predictor of interest). We will use the lm() function with Glucose level as the Y variable and Age as the X variable. The formula for a regression model in R uses the ~ symbol. For example, if we want to regress Glucose level on Age, we use the notation Glucose ~ Age.
By using the lm() function, we can construct the linear regression model: lm(Glucose ~ Age, data = diabetes.data). We can create an object that will contain this linear model; I called this object linear.model1. We generate the 95% confidence interval (CI) by using the confint() function.
Here is how we put all of this together in R:
### Linear regression model (Y = Glucose, X = Age)
linear.model1 <- lm(Glucose ~ Age, data = diabetes.data)
summary(linear.model1)
Call:
lm(formula = Glucose ~ Age, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-126.453  -20.849   -3.058   18.304   86.159 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.08016    3.34095   29.06  < 2e-16 ***
Age          0.71642    0.09476    7.56 1.15e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.86 on 766 degrees of freedom
Multiple R-squared:  0.06944,  Adjusted R-squared:  0.06822 
F-statistic: 57.16 on 1 and 766 DF,  p-value: 1.15e-13
The lm() function does not generate 95% CI, so you will need to use the confint() function.
### Generate the 95% CI
confint(linear.model1)
                 2.5 %      97.5 %
(Intercept) 90.5216601 103.6386585
Age          0.5304001   0.9024361
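For intuition, the 95% CI is simply the point estimate plus or minus the t critical value multiplied by the standard error. Here is a small sketch (not part of the original tutorial) that reproduces the Age interval by hand:

### Reproduce the 95% CI for Age by hand: estimate +/- t(0.975, df) * SE
est <- coef(summary(linear.model1))["Age", "Estimate"]
se  <- coef(summary(linear.model1))["Age", "Std. Error"]
est + c(-1, 1) * qt(0.975, df.residual(linear.model1)) * se
### Approximately 0.530 to 0.902, matching confint()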
Make sure to have the library gtsummary loaded to create tables from the regression output.
We are interested in the coefficients. To make interpreting the output easier, we can create a table to visualize the critical elements.
#### Present the output in a table
#### (the tbl_regression() call below is reconstructed; it matches the table shown)
model1 <- tbl_regression(linear.model1, intercept = TRUE)
model1 %>%
  as_gt() %>%
  gt::tab_header("Table 2. Linear regression model output (Glucose ~ Age)") %>%
  gt::tab_options(table.align = 'left')
Table 2. Linear regression model output (Glucose ~ Age)
Characteristic | Beta | 95% CI 1 | p-value |
---|---|---|---|
(Intercept) | 97 | 91, 104 | |
Age | 0.72 | 0.53, 0.90 | |
1 CI = Confidence Interval
The Intercept denotes the value of Y when X is equal to zero. In this case, it is the Glucose level on the linear plot when Age is equal to zero, which is 97.08 units of Glucose.
When possible, present the coefficient’s point estimate and the 95% CI. For example, a 1-year increase in Age is associated with a 0.72-unit increase in Glucose level (95% CI: 0.53, 0.90).
The Age coefficient denotes the change in Glucose level for a one-unit increase in Age. In other words, a 1-year increase in Age is associated with a 0.72-unit increase in Glucose level. Since the 95% CI is between 0.53 units and 0.90 units of Glucose, it does not include zero, so this association is statistically significant. We can also look at the p-value of the Age coefficient to determine whether this is statistically significant (<0.0001). However, it is preferable to present the 95% CI when describing the association between X and Y.
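To make this interpretation concrete, here is a small sketch (not part of the original tutorial) that uses the fitted model to predict Glucose levels at two ages one year apart; the difference between the two predictions is the Age coefficient:

### Predicted Glucose level at Age 30 versus Age 31
predict(linear.model1, newdata = data.frame(Age = c(30, 31)))
### The two predictions differ by about 0.72 units, the Age coefficient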
The Adjusted R-squared denotes the proportion of the variance in the outcome that is explained by the linear regression model. In other words, the current linear regression model explains about 6.8% of the variance in Glucose level.
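If you want to pull these quantities out programmatically (for example, to report them in inline RMarkdown text), a minimal sketch:

### Extract the R-squared and Adjusted R-squared from the model summary
summary(linear.model1)$r.squared
summary(linear.model1)$adj.r.squared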
We can plot the linear form of the model against the actual data using the ggPredict() function.
ggPredict(linear.model1, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5)
Here is a version without the scatterplot.
ggPredict(linear.model1, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)
Let’s add to our current model by including a confounder. We have a variable called Pregnancies, but this provides the number of past pregnancies. We want to create a new variable, pregnancy.history, that is dichotomous (0 = no history of pregnancy and 1 = history of pregnancy). To do this, we need to subset the data and create rules: anyone with 1 or more pregnancies will be coded as 1, and anyone with no history of pregnancy will be coded as 0.
diabetes.data$pregnancy.history[diabetes.data$Pregnancies == 0] = 0
diabetes.data$pregnancy.history[diabetes.data$Pregnancies > 0] = 1

table(diabetes.data$pregnancy.history)
  0   1 
111 657 
We see that there are 657 women who have a history of pregnancy and 111 women with no history of pregnancy.
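As an aside, the same indicator can be created in a single step; a minimal sketch using base R’s ifelse() (dplyr’s mutate() would work just as well):

### Equivalent one-line recoding of pregnancy history
diabetes.data$pregnancy.history <- ifelse(diabetes.data$Pregnancies > 0, 1, 0)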
A directed acyclic graph (DAG) illustrating Pregnancy History as a confounder of the Age to Glucose relationship is shown below.
knitr::include_graphics("confounder.png")
In our linear regression model, we have the following structural form:
\[ E[\text{Glucose}_i \mid \text{Age}_i] = \beta_0 + \beta_1 \text{Age}_i + \epsilon_i \]
where E[Glucose_i | Age_i] denotes the expected Glucose level for subject i given the Age of subject i.
But we can include a confounder pregnancy.history:
\[ E[\text{Glucose}_i \mid \text{Age}_i, \text{PregnancyHistory}_i] = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{PregnancyHistory}_i + \epsilon_i \]
where E[Glucose_i | Age_i, PregnancyHistory_i] denotes the expected Glucose level for subject i given the Age of subject i, controlling for the Pregnancy History of subject i.
All you need to do to add another variable to the regression model is to use the + symbol. For example, lm(Glucose ~ Age + pregnancy.history, data = diabetes.data).
Using the lm() function, we can add pregnancy.history to the linear regression model:
Here is how we put all of this together in R:
### Linear regression model (Y = Glucose, X1 = Age, X2 = Pregnancy History)
linear.model2 <- lm(Glucose ~ Age + pregnancy.history, data = diabetes.data)
summary(linear.model2)
Call:
lm(formula = Glucose ~ Age + pregnancy.history, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-125.715  -20.546   -2.991   17.316   87.734 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       102.00752    3.95100  25.818  < 2e-16 ***
Age                 0.76050    0.09638   7.891 1.04e-14 ***
pregnancy.history  -7.47264    3.22137  -2.320   0.0206 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.77 on 765 degrees of freedom
Multiple R-squared:  0.07594,  Adjusted R-squared:  0.07352 
F-statistic: 31.43 on 2 and 765 DF,  p-value: 7.592e-14
We can present the model output in a table.
#### Present the output in a table
#### (the tbl_regression() call below is reconstructed in the same way as model1 above)
model2 <- tbl_regression(linear.model2, intercept = TRUE)
model2 %>%
  as_gt() %>%
  gt::tab_header("Table 3. Linear regression model output with confounder (Glucose ~ Age + Pregnancy History)") %>%
  gt::tab_options(table.align = 'left')
Table 3. Linear regression model output with confounder (Glucose ~ Age + Pregnancy History)
Characteristic | Beta | 95% CI 1 | p-value |
---|---|---|---|
(Intercept) | 102 | 94, 110 | |
Age | 0.76 | 0.57, 0.95 | |
pregnancy.history | -7.5 | -14, -1.1 | 0.021 |
1 CI = Confidence Interval
You can see that the Age coefficient is slightly different from our first model. It is 0.76 with a 95% CI of 0.57, 0.95. Compare this to the previous model’s result, which was 0.72; 95% CI: 0.53, 0.90.
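One informal way to gauge how much the adjustment matters is to compute the percent change in the Age coefficient between the two models (a sketch, not part of the original tutorial; a change of more than roughly 10% is sometimes used as an informal benchmark for meaningful confounding):

### Percent change in the Age coefficient after adjusting for pregnancy history
(coef(linear.model2)["Age"] - coef(linear.model1)["Age"]) / coef(linear.model1)["Age"] * 100
### Roughly a 6% change in this example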
#### Compare the crude and adjusted models side by side
#### (the tbl_merge() call below is reconstructed; it matches the merged table shown)
tbl_merge(tbls = list(model1, model2), tab_spanner = c("Model 1", "Model 2")) %>%
  as_gt() %>%
  gt::tab_header("Table 4. Comparison between linear regression models [Model 1 (crude) v. Model 2 (adjusted)]") %>%
  gt::tab_options(table.align = 'left')
Table 4. Comparison between linear regression models [Model 1 (crude) v. Model 2 (adjusted)]

Characteristic | Beta (Model 1) | 95% CI 1 (Model 1) | p-value (Model 1) | Beta (Model 2) | 95% CI 1 (Model 2) | p-value (Model 2) |
---|---|---|---|---|---|---|
(Intercept) | 97 | 91, 104 | | 102 | 94, 110 | |
Age | 0.72 | 0.53, 0.90 | | 0.76 | 0.57, 0.95 | |
pregnancy.history | | | | -7.5 | -14, -1.1 | 0.021 |

1 CI = Confidence Interval
Another way to compare the crude and adjusted models is with the modelsummary package. The tidy() function from the broom package is also useful for inspecting the coefficient estimates.

library(modelsummary)
library(broom)

### Crude model (Glucose ~ Age)
summary(linear.model1)
Call:
lm(formula = Glucose ~ Age, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-126.453  -20.849   -3.058   18.304   86.159 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.08016    3.34095   29.06  < 2e-16 ***
Age          0.71642    0.09476    7.56 1.15e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.86 on 766 degrees of freedom
Multiple R-squared:  0.06944,  Adjusted R-squared:  0.06822 
F-statistic: 57.16 on 1 and 766 DF,  p-value: 1.15e-13
### Adjusted model (Glucose ~ Age + pregnancy.history)
summary(linear.model2)
Call:
lm(formula = Glucose ~ Age + pregnancy.history, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-125.715  -20.546   -2.991   17.316   87.734 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       102.00752    3.95100  25.818  < 2e-16 ***
Age                 0.76050    0.09638   7.891 1.04e-14 ***
pregnancy.history  -7.47264    3.22137  -2.320   0.0206 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.77 on 765 degrees of freedom
Multiple R-squared:  0.07594,  Adjusted R-squared:  0.07352 
F-statistic: 31.43 on 2 and 765 DF,  p-value: 7.592e-14
tidy(linear.model1)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
1 (Intercept)   97.1      3.34       29.1  1.02e-125
2 Age            0.716    0.0948      7.56 1.15e- 13
tidy(linear.model2)
# A tibble: 3 × 5
  term              estimate std.error statistic   p.value
1 (Intercept)        102.       3.95      25.8   3.36e-106
2 Age                  0.760    0.0964     7.89  1.04e- 14
3 pregnancy.history   -7.47     3.22      -2.32  2.06e-  2
modelsummary(list("Simple Regression" = linear.model1, "Multiple Regression" = linear.model2))
 | Simple Regression | Multiple Regression |
---|---|---|
(Intercept) | 97.080 | 102.008 |
 | (3.341) | (3.951) |
Age | 0.716 | 0.760 |
 | (0.095) | (0.096) |
pregnancy.history | | −7.473 |
 | | (3.221) |
Num.Obs. | 768 | 768 |
R2 | 0.069 | 0.076 |
R2 Adj. | 0.068 | 0.074 |
AIC | 7451.3 | 7447.9 |
BIC | 7465.2 | 7466.5 |
Log.Lik. | −3722.636 | −3719.945 |
F | 57.160 | 31.434 |
RMSE | 30.82 | 30.71 |
Model 1 is considered the crude or unadjusted model. Model 2 is the adjusted model because it adjusts for the Pregnancy History confounder. Notice that the unadjusted β1 is 0.72, which is slightly lower than the adjusted β1 of 0.76.
Additionally, the Adjusted R-squared is higher in Model 2 (7.35%) compared to Model 1 (6.82%). This means that Model 2 explains slightly more of the variance in Glucose level than Model 1.
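If you want a formal test of whether adding pregnancy.history improves the model fit, one option (not shown in the original tutorial) is a partial F-test comparing the nested models:

### Partial F-test comparing the crude and adjusted models
anova(linear.model1, linear.model2)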
Let’s plot Model 2’s results.
ggPredict(linear.model2, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)
We can see that the group with a history of pregnancy has lower predicted Glucose levels than the group without a history of pregnancy. This makes sense when you look at the pregnancy.history coefficient. It is -7.5, which means that a history of pregnancy is associated with a 7.5-unit decrease in Glucose level (95% CI: -13.80, -1.15) compared to no history of pregnancy, controlling for Age. Therefore, across the whole range of Age, the group with a history of pregnancy has Glucose levels that are 7.5 units lower than the group without a history of pregnancy. You can see this on the plot: the two regression lines do not cross and remain parallel across the range of Age, and both show a positive association between Age and Glucose level.
It is good practice to look at the residuals of the regression model to make sure that the assumptions hold.
Recall the assumptions for the linear regression model:

* The association between X and Y is linear
* The observations are independent
* The variance of the residuals is constant across the fitted values (homoscedasticity)
* The residuals are normally distributed
We can test whether the variance of the residuals is constant across the fitted values (homoscedasticity). Let’s use the crude linear regression model from our previous example:
summary(linear.model1)
Call:
lm(formula = Glucose ~ Age, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-126.453  -20.849   -3.058   18.304   86.159 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.08016    3.34095   29.06  < 2e-16 ***
Age          0.71642    0.09476    7.56 1.15e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.86 on 766 degrees of freedom
Multiple R-squared:  0.06944,  Adjusted R-squared:  0.06822 
F-statistic: 57.16 on 1 and 766 DF,  p-value: 1.15e-13
confint(linear.model1)
                 2.5 %      97.5 %
(Intercept) 90.5216601 103.6386585
Age          0.5304001   0.9024361
We can plot the residuals and see if they are associated with increasing values of the expected or predicted Glucose level. If there is no association, we should expect to see a uniform distribution across all ranges of the expected values of Glucose.
plot(linear.model1$res ~ linear.model1$fitted)
Reviewing the residual plot along the fitted values, there doesn’t appear to be any evidence of heteroskedasticity. Upon visual inspection, we can see that the residuals are uniform across the expected Glucose level range (also called the “fitted” values). We can verify this visual inspection by performing the Breusch-Pagan test of heteroskedasticity. We will need to install and load the lmtest package and use the bptest() function.
bptest(linear.model1)
	studentized Breusch-Pagan test

data:  linear.model1
BP = 2.4585, df = 1, p-value = 0.1169
According to the BP test results, the p-value is 0.1169, which means that we fail to reject the null hypothesis that the variance of the residuals is constant. In other words, there is no association between the residuals of the model and the predicted values generated from the model.
If, however, there was heteroskedasticity, we could address this by estimating robust standard errors. For linear regression models, the most common method is to use the Huber-White sandwich estimation method. To do this, we need to use the sandwich package and the coeftest() function. (Note: In practice, I default to the robust standard errors rather than let the lm() function estimate these for me.)
### Huber-White sandwich estimation
### (reconstructed call; the exact vcovHC type used originally is not shown, and vcovHC() defaults to type = "HC3")
robust1 <- coeftest(linear.model1, vcov = vcovHC(linear.model1))
robust1
t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 97.080159   3.265127  29.732 < 2.2e-16 ***
Age          0.716418   0.095573   7.496  1.82e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(robust1)
                 2.5 %      97.5 %
(Intercept) 90.6704993 103.4898193
Age          0.5288023   0.9040339
Notice that the standard errors are slightly different from the ones estimated in the previous model. In the previous model, the 95% CI for Age was between 0.530 and 0.902. In the model with the robust standard errors, the 95% CI was between 0.529 and 0.904. The differences are trivial in this example, but they could matter when the 95% CI is close to the null value.
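To see the two sets of intervals for Age side by side, a quick sketch:

### Compare the model-based and robust 95% CIs for the Age coefficient
rbind(model_based = confint(linear.model1)["Age", ],
      robust      = confint(robust1)["Age", ])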
We can also evaluate whether the residuals are normally distributed by generating a histogram and a Q-Q plot. The histogram has a slight left skew, and the tails of the Q-Q plot deviate from the reference line.
#### Set up the matrix
par(mfrow = c(1, 2))

#### Histogram of the residuals
hist(linear.model1$res)

#### QQ-plot of the residuals against the QQ line
qqnorm(linear.model1$res); qqline(linear.model1$res, col = "2", lwd = 1, lty = 1)
We can also formally test the normality of the residuals. Common tests of normality include the Shapiro-Wilk test, the Jarque-Bera test, and the Kolmogorov-Smirnov (Lilliefors) test. The code for each is provided below. Despite the differences in their output, the conclusions are all the same: the residuals are not normally distributed. Even so, the linear regression model is pretty robust to violations of this assumption, and you can still conclude that there is an association between Age and Glucose level based on these findings.
#### Test normality using Shapiro-Wilk's test
shapiro.test(linear.model1$res)

	Shapiro-Wilk normality test

data:  linear.model1$res
W = 0.97467, p-value = 2.838e-10
#### Test normality using Jarque Bera test
JarqueBeraTest(linear.model1$res, robust = FALSE) ### Does not use robust method

	Jarque Bera Test

data:  linear.model1$res
X-squared = 25.565, df = 2, p-value = 2.81e-06
#### Test normality using the Kolmogorov-Smirnov (Lilliefors) test
lillie.test(linear.model1$res)

	Lilliefors (Kolmogorov-Smirnov) normality test

data:  linear.model1$res
D = 0.065397, p-value = 2.941e-08
Linear regression models are useful for understanding the relationship between a predictor variable and an outcome variable when the outcome is continuous. Additionally, you can add confounders to the regression model to control for their effects. Once you control for confounders, you should compare the results with the crude model to see how the relationship between the predictor of interest and the outcome changes. Finally, after reviewing the results of the linear regression model, it is good practice to look at the residuals and verify that the assumptions of homoscedasticity and normality continue to hold.
I would like to acknowledge Mark Bounthavong for providing the link to the dataset used in this tutorial together with the code and output interpretation. You can find out more about his work here: https://rpubs.com/mbounthavong/linear_regression_using_R.
The gtsummary package is great at merging outcomes from regression models into publication quality tables. Daniel D. Sjoberg authored the gtsummary package with instructions on his website. You can also use external functions by converting the gtsummary table into an object using as_gt(); instructions can be found here.