RMarkdown enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents that can help you communicate your data analysis and research findings more effectively.
Linear regression models (also known as “Ordinary Least Squares” or OLS models) allow us to determine whether a change in the value of one variable is associated with a change in the value of another variable. In other words, if I make a 1-unit change in X, how much does Y change? In fact, linear regression resembles the algebraic equation for a straight line (Y = mX + b, where m is the slope, X is the variable that is changing, and b is the Y-intercept). In biostatistics, we use linear regression models to test the association between two or more variables when the outcome is a continuous data type.
In many textbooks, the linear regression model is called the “Ordinary Least Squares” or OLS model because it minimizes the squared errors (i.e., the distances from the best-fit line). There are several conditions that need to be satisfied in order for us to use the results from a linear regression model. These include:

* The association between X and Y is linear
* The observations are independent
* The variance of the residuals is constant across the fitted values (homoscedasticity)
* The residuals are normally distributed

However, the linear regression model is pretty robust to violations of these assumptions; hence, its popularity. Regardless, it is good practice to test them. Moreover, the model is very easy to interpret, as this article will demonstrate.
The structural form of a linear regression model:
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
Typical notations of the linear regression model include:

* y_i : the outcome (dependent variable) for subject i
* β0 : the intercept, the expected value of Y when X is equal to zero
* β1 : the regression coefficient (slope) for the predictor, the change in Y for a 1-unit change in X
* x_i : the predictor (independent variable) for subject i
* ε_i : the error (residual) term for subject i
The UC Irvine Machine Learning Repository contains many datasets that are used to train and validate machine learning models. You can learn more about their work at their website.
This tutorial will use the Pima Indians Diabetes Dataset, which you can download from this GitHub location. The data was originally posted on the UC Irvine Machine Learning Repository, but it can also be found on Kaggle. The Pima Indians Diabetes Dataset originated from the National Institute of Diabetes and Digestive and Kidney Diseases as part of a study to predict diabetes among adult females (21 years old and older) of Pima Indian heritage.
You will need to install and load the following packages for this tutorial, including ggplot2 and predict3d.
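If any of these packages are not already installed, something along the following lines should work (a sketch; the GitHub repository name for the development version of predict3d is an assumption on my part):

### Install the packages used in this tutorial (only needs to be run once)
install.packages(c("ggplot2", "devtools", "predict3d", "psych", "dplyr",
                   "gtsummary", "DescTools", "nortest", "lmtest", "sandwich"))

### Optionally, the development version of predict3d can be installed from GitHub
### (repository name assumed)
# devtools::install_github("cardiomoon/predict3d")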
### Load the libraries
library("ggplot2")
library("devtools")
library("predict3d")
library("psych")
library("dplyr")
library("gtsummary")
library("DescTools")
library("nortest")
library("lmtest")
library("sandwich")
Load the data from the GitHub site. You can use the knitr::kable function to display the first few rows as a table.
### Load the data from GitHub (the URL below is a placeholder; substitute the raw CSV link for the dataset)
diabetes.data <- read.csv("<raw GitHub URL to the Pima Indians Diabetes CSV>", header = TRUE)

### Preview the first few rows
knitr::kable(head(diabetes.data))
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|
6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
Once the data have been loaded, take a look at the summary statistics. You can do this using the describeBy function, which is part of the psych package.
knitr::kable( describeBy(diabetes.data) %>% round(2) )
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Pregnancies | 1 | 768 | 3.85 | 3.37 | 3.00 | 3.46 | 2.97 | 0.00 | 17.00 | 17.00 | 0.90 | 0.14 | 0.12 |
Glucose | 2 | 768 | 120.89 | 31.97 | 117.00 | 119.38 | 29.65 | 0.00 | 199.00 | 199.00 | 0.17 | 0.62 | 1.15 |
BloodPressure | 3 | 768 | 69.11 | 19.36 | 72.00 | 71.36 | 11.86 | 0.00 | 122.00 | 122.00 | -1.84 | 5.12 | 0.70 |
SkinThickness | 4 | 768 | 20.54 | 15.95 | 23.00 | 19.94 | 17.79 | 0.00 | 99.00 | 99.00 | 0.11 | -0.53 | 0.58 |
Insulin | 5 | 768 | 79.80 | 115.24 | 30.50 | 56.75 | 45.22 | 0.00 | 846.00 | 846.00 | 2.26 | 7.13 | 4.16 |
BMI | 6 | 768 | 31.99 | 7.88 | 32.00 | 31.96 | 6.82 | 0.00 | 67.10 | 67.10 | -0.43 | 3.24 | 0.28 |
DiabetesPedigreeFunction | 7 | 768 | 0.47 | 0.33 | 0.37 | 0.42 | 0.25 | 0.08 | 2.42 | 2.34 | 1.91 | 5.53 | 0.01 |
Age | 8 | 768 | 33.24 | 11.76 | 29.00 | 31.54 | 10.38 | 21.00 | 81.00 | 60.00 | 1.13 | 0.62 | 0.42 |
Outcome | 9 | 768 | 0.35 | 0.48 | 0.00 | 0.31 | 0.00 | 0.00 | 1.00 | 1.00 | 0.63 | -1.60 | 0.02 |
There are 768 subjects and nine variables. We can see that the average number of pregnancies is 3.85 with a standard deviation (SD) of 3.37. We also see that the average age of the sample was 33.24 years (SD, 11.76), the average BMI was 31.99 (SD, 7.88), and the average glucose level was 120.89 (SD, 31.97).
Let’s suppose our main research question is to determine whether Age is associated with Glucose level. We will set Glucose level as our dependent variable and Age as our independent variable (or predictor of interest).

Visualize the association between Age and Glucose level

Let’s look at how Age is related to Glucose level by plotting their relationship using ggplot. There appears to be a positive relationship: as Age increases, Glucose level also increases.
ggplot(diabetes.data, aes(x = Age, y = Glucose)) + geom_point() + stat_smooth()
Now, we can construct a linear regression model with Glucose level as the dependent variable and Age as the independent variable (or predictor of interest). We will use the lm() function with Glucose level as the Y variable and Age as the X variable. The formula for a regression model in R uses the ~ symbol. For example, if we want to regress Glucose level on Age, we use the notation Glucose ~ Age.
By using the lm() function, we can construct the linear regression model: lm(Glucose ~ Age, data = diabetes.data). We can create an object that will contain this linear model; I called this object linear.model1. We generate the 95% confidence interval (CI) by using the confint() function.
Here is how we put all of this together in R:
### Linear regression model (Y = Glucose, X = Age)
linear.model1 <- lm(Glucose ~ Age, data = diabetes.data)
summary(linear.model1)
Call:
lm(formula = Glucose ~ Age, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-126.453  -20.849   -3.058   18.304   86.159 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.08016    3.34095   29.06  < 2e-16 ***
Age          0.71642    0.09476    7.56 1.15e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.86 on 766 degrees of freedom
Multiple R-squared:  0.06944,  Adjusted R-squared:  0.06822 
F-statistic: 57.16 on 1 and 766 DF,  p-value: 1.15e-13
The lm() function does not generate 95% CI, so you will need to use the confint() function.
### Generate the 95% CI
confint(linear.model1)
                 2.5 %      97.5 %
(Intercept) 90.5216601 103.6386585
Age          0.5304001   0.9024361
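For intuition, the 95% CI is simply the point estimate plus or minus the t critical value multiplied by the standard error. Here is a small sketch (not part of the original tutorial) that reproduces the Age interval by hand:

### Reproduce the 95% CI for Age by hand: estimate +/- t(0.975, df) * SE
est <- coef(summary(linear.model1))["Age", "Estimate"]
se  <- coef(summary(linear.model1))["Age", "Std. Error"]
est + c(-1, 1) * qt(0.975, df.residual(linear.model1)) * se
### Approximately 0.530 to 0.902, matching confint()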
Make sure to have the library gtsummary loaded to create tables from the regression output.
We are interested in the coefficients. To make interpreting the output easier, we can create a table to visualize the critical elements.
#### Present the output in a table
#### (the tbl_regression() call below is reconstructed; it matches the table shown)
model1 <- tbl_regression(linear.model1, intercept = TRUE)
model1 %>%
  as_gt() %>%
  gt::tab_header("Table 2. Linear regression model output (Glucose ~ Age)") %>%
  gt::tab_options(table.align = 'left')
Table 2. Linear regression model output (Glucose ~ Age)
Characteristic | Beta | 95% CI 1 | p-value |
---|---|---|---|
(Intercept) | 97 | 91, 104 | |
Age | 0.72 | 0.53, 0.90 | |
1 CI = Confidence Interval
The Intercept denotes the value of Y when X is equal to zero. In this case, it is the Glucose level on the linear plot when Age is equal to zero, which is 97.08 units of Glucose.
When possible, present the coefficient’s point estimate and the 95% CI. For example, a 1-year increase in Age is associated with a 0.72-unit increase in Glucose level (95% CI: 0.53, 0.90).
The Age coefficient denotes the change in Glucose level for a one-unit increase in Age. In other words, a 1-year increase in Age is associated with a 0.72-unit increase in Glucose level. Since the 95% CI is between 0.53 units and 0.90 units of Glucose, it does not include zero, so this association is statistically significant. We can also look at the p-value of the Age coefficient to determine whether this is statistically significant (<0.0001). However, it is preferable to present the 95% CI when describing the association between X and Y.
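To make this interpretation concrete, here is a small sketch (not part of the original tutorial) that uses the fitted model to predict Glucose levels at two ages one year apart; the difference between the two predictions is the Age coefficient:

### Predicted Glucose level at Age 30 versus Age 31
predict(linear.model1, newdata = data.frame(Age = c(30, 31)))
### The two predictions differ by about 0.72 units, the Age coefficient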
The Adjusted R-squared denotes the proportion of the variance in the outcome that is explained by the linear regression model. In other words, the current linear regression model explains about 6.8% of the variance in Glucose level.
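If you want to pull these quantities out programmatically (for example, to report them in inline RMarkdown text), a minimal sketch:

### Extract the R-squared and Adjusted R-squared from the model summary
summary(linear.model1)$r.squared
summary(linear.model1)$adj.r.squared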
We can plot the linear form of the model against the actual data using the ggPredict() function.
ggPredict(linear.model1, digits = 1, show.point = TRUE, se = TRUE, xpos = 0.5)
Here is a version without the scatterplot.
ggPredict(linear.model1, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)
Let’s add to our current model by including a confounder. We have a variable called Pregnancies, but this provides the number of past pregnancies. We want to create a new variable, pregnancy.history, that is dichotomous (0 = no history of pregnancy and 1 = history of pregnancy). To do this, we need to subset the data and create rules: anyone with 1 or more pregnancies will be coded as 1, and anyone with no history of pregnancy will be coded as 0.
diabetes.data$pregnancy.history[diabetes.data$Pregnancies == 0] = 0
diabetes.data$pregnancy.history[diabetes.data$Pregnancies > 0] = 1

table(diabetes.data$pregnancy.history)
  0   1 
111 657 
We see that there are 657 women who have a history of pregnancy and 111 women with no history of pregnancy.
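As an aside, the same indicator can be created in a single step; a minimal sketch using base R’s ifelse() (dplyr’s mutate() would work just as well):

### Equivalent one-line recoding of pregnancy history
diabetes.data$pregnancy.history <- ifelse(diabetes.data$Pregnancies > 0, 1, 0)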
A directed acyclic graph (DAG) illustrating Pregnancy History as a confounder of the Age to Glucose relationship is shown below.
knitr::include_graphics("confounder.png")
In our linear regression model, we have the following structural form:
\[ E[\text{Glucose}_i \mid \text{Age}_i] = \beta_0 + \beta_1 \text{Age}_i + \epsilon_i \]
where E[Glucose_i | Age_i] denotes the expected Glucose level for subject i given the Age of subject i.
But we can include a confounder pregnancy.history:
\[ E[\text{Glucose}_i \mid \text{Age}_i, \text{PregnancyHistory}_i] = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{PregnancyHistory}_i + \epsilon_i \]
where E[Glucose_i | Age_i, PregnancyHistory_i] denotes the expected Glucose level for subject i given the Age of subject i, controlling for the Pregnancy History of subject i.
All you need to do to add another variable to the regression model is to use the + symbol. For example, lm(Glucose ~ Age + pregnancy.history, data = diabetes.data).
Using the lm() function, we can add pregnancy.history to the linear regression model:
Here is how we put all of this together in R:
### Linear regression model (Y = Glucose, X1 = Age, X2 = Pregnancy History)
linear.model2 <- lm(Glucose ~ Age + pregnancy.history, data = diabetes.data)
summary(linear.model2)
Call:
lm(formula = Glucose ~ Age + pregnancy.history, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-125.715  -20.546   -2.991   17.316   87.734 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       102.00752    3.95100  25.818  < 2e-16 ***
Age                 0.76050    0.09638   7.891 1.04e-14 ***
pregnancy.history  -7.47264    3.22137  -2.320   0.0206 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.77 on 765 degrees of freedom
Multiple R-squared:  0.07594,  Adjusted R-squared:  0.07352 
F-statistic: 31.43 on 2 and 765 DF,  p-value: 7.592e-14
We can present the model output in a table.
#### Present the output in a table
#### (the tbl_regression() call below is reconstructed in the same way as model1 above)
model2 <- tbl_regression(linear.model2, intercept = TRUE)
model2 %>%
  as_gt() %>%
  gt::tab_header("Table 3. Linear regression model output with confounder (Glucose ~ Age + Pregnancy History)") %>%
  gt::tab_options(table.align = 'left')
Table 3. Linear regression model output with confounder (Glucose ~ Age + Pregnancy History)
Characteristic | Beta | 95% CI 1 | p-value |
---|---|---|---|
(Intercept) | 102 | 94, 110 | |
Age | 0.76 | 0.57, 0.95 | |
pregnancy.history | -7.5 | -14, -1.1 | 0.021 |
1 CI = Confidence Interval
You can see that the Age coefficient is slightly different from our first model. It is 0.76 with a 95% CI of 0.57, 0.95. Compare this to the previous model’s result, which was 0.72; 95% CI: 0.53, 0.90.
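One informal way to gauge how much the adjustment matters is to compute the percent change in the Age coefficient between the two models (a sketch, not part of the original tutorial; a change of more than roughly 10% is sometimes used as an informal benchmark for meaningful confounding):

### Percent change in the Age coefficient after adjusting for pregnancy history
(coef(linear.model2)["Age"] - coef(linear.model1)["Age"]) / coef(linear.model1)["Age"] * 100
### Roughly a 6% change in this example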
#### Compare the crude and adjusted models side by side
#### (the tbl_merge() call below is reconstructed; it matches the merged table shown)
tbl_merge(tbls = list(model1, model2), tab_spanner = c("Model 1", "Model 2")) %>%
  as_gt() %>%
  gt::tab_header("Table 4. Comparison between linear regression models [Model 1 (crude) v. Model 2 (adjusted)]") %>%
  gt::tab_options(table.align = 'left')
Table 4. Comparison between linear regression models [Model 1 (crude) v. Model 2 (adjusted)]

Characteristic | Beta (Model 1) | 95% CI 1 (Model 1) | p-value (Model 1) | Beta (Model 2) | 95% CI 1 (Model 2) | p-value (Model 2) |
---|---|---|---|---|---|---|
(Intercept) | 97 | 91, 104 | | 102 | 94, 110 | |
Age | 0.72 | 0.53, 0.90 | | 0.76 | 0.57, 0.95 | |
pregnancy.history | | | | -7.5 | -14, -1.1 | 0.021 |

1 CI = Confidence Interval
Another way to compare the crude and adjusted models is with the modelsummary package. The tidy() function from the broom package is also useful for inspecting the coefficient estimates.

library(modelsummary)
library(broom)

### Crude model (Glucose ~ Age)
summary(linear.model1)
Call:
lm(formula = Glucose ~ Age, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-126.453  -20.849   -3.058   18.304   86.159 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.08016    3.34095   29.06  < 2e-16 ***
Age          0.71642    0.09476    7.56 1.15e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.86 on 766 degrees of freedom
Multiple R-squared:  0.06944,  Adjusted R-squared:  0.06822 
F-statistic: 57.16 on 1 and 766 DF,  p-value: 1.15e-13
### Adjusted model (Glucose ~ Age + pregnancy.history)
summary(linear.model2)
Call:
lm(formula = Glucose ~ Age + pregnancy.history, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-125.715  -20.546   -2.991   17.316   87.734 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       102.00752    3.95100  25.818  < 2e-16 ***
Age                 0.76050    0.09638   7.891 1.04e-14 ***
pregnancy.history  -7.47264    3.22137  -2.320   0.0206 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.77 on 765 degrees of freedom
Multiple R-squared:  0.07594,  Adjusted R-squared:  0.07352 
F-statistic: 31.43 on 2 and 765 DF,  p-value: 7.592e-14
tidy(linear.model1)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
1 (Intercept)   97.1      3.34       29.1  1.02e-125
2 Age            0.716    0.0948      7.56 1.15e- 13
tidy(linear.model2)
# A tibble: 3 × 5
  term              estimate std.error statistic   p.value
1 (Intercept)        102.       3.95      25.8   3.36e-106
2 Age                  0.760    0.0964     7.89  1.04e- 14
3 pregnancy.history   -7.47     3.22      -2.32  2.06e-  2
modelsummary(list("Simple Regression" = linear.model1, "Multiple Regression" = linear.model2))
 | Simple Regression | Multiple Regression |
---|---|---|
(Intercept) | 97.080 | 102.008 |
 | (3.341) | (3.951) |
Age | 0.716 | 0.760 |
 | (0.095) | (0.096) |
pregnancy.history | | −7.473 |
 | | (3.221) |
Num.Obs. | 768 | 768 |
R2 | 0.069 | 0.076 |
R2 Adj. | 0.068 | 0.074 |
AIC | 7451.3 | 7447.9 |
BIC | 7465.2 | 7466.5 |
Log.Lik. | −3722.636 | −3719.945 |
F | 57.160 | 31.434 |
RMSE | 30.82 | 30.71 |
Model 1 is considered the crude or unadjusted model. Model 2 is the adjusted model because it adjusts for the Pregnancy History confounder. Notice that the unadjusted β1 is 0.72, which is slightly lower than the adjusted β1 of 0.76.
Additionally, the Adjusted R-squared is higher in Model 2 (7.35%) compared to Model 1 (6.82%). This means that Model 2 explains slightly more of the variance in Glucose level than Model 1.
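If you want a formal test of whether adding pregnancy.history improves the model fit, one option (not shown in the original tutorial) is a partial F-test comparing the nested models:

### Partial F-test comparing the crude and adjusted models
anova(linear.model1, linear.model2)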
Let’s plot Model 2’s results.
ggPredict(linear.model2, digits = 1, show.point = FALSE, se = TRUE, xpos = 0.5)
We can see that the group with a history of pregnancy has lower predicted Glucose levels than the group without a history of pregnancy. This makes sense when you look at the pregnancy.history coefficient. It is -7.5, which means that a history of pregnancy is associated with a 7.5-unit decrease in Glucose level (95% CI: -13.80, -1.15) compared to no history of pregnancy, controlling for Age. Therefore, across the whole range of Age, the group with a history of pregnancy has Glucose levels that are 7.5 units lower than the group without a history of pregnancy. You can see this on the plot: the two regression lines do not cross and remain parallel across the range of Age, and both show a positive association between Age and Glucose level.
It is good practice to look at the residuals of the regression model to make sure that the assumptions hold.
Recall the assumptions for the linear regression model:

* The association between X and Y is linear
* The observations are independent
* The variance of the residuals is constant across the fitted values (homoscedasticity)
* The residuals are normally distributed
We can test whether the variance of the residuals is constant across the fitted values (homoscedasticity). Let’s use the crude linear regression model from our previous example:
summary(linear.model1)
Call:
lm(formula = Glucose ~ Age, data = diabetes.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-126.453  -20.849   -3.058   18.304   86.159 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.08016    3.34095   29.06  < 2e-16 ***
Age          0.71642    0.09476    7.56 1.15e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.86 on 766 degrees of freedom
Multiple R-squared:  0.06944,  Adjusted R-squared:  0.06822 
F-statistic: 57.16 on 1 and 766 DF,  p-value: 1.15e-13
confint(linear.model1)
                 2.5 %      97.5 %
(Intercept) 90.5216601 103.6386585
Age          0.5304001   0.9024361
We can plot the residuals and see if they are associated with increasing values of the expected or predicted Glucose level. If there is no association, we should expect to see a uniform distribution across all ranges of the expected values of Glucose.
plot(linear.model1$res ~ linear.model1$fitted)
Reviewing the residual plot along the fitted values, there doesn’t appear to be any evidence of heteroskedasticity. Upon visual inspection, we can see that the residuals are uniform across the expected Glucose level range (also called the “fitted” values). We can verify this visual inspection by performing the Breusch-Pagan test of heteroskedasticity. We will need to install and load the lmtest package and use the bptest() function.
bptest(linear.model1)
	studentized Breusch-Pagan test

data:  linear.model1
BP = 2.4585, df = 1, p-value = 0.1169
According to the BP test results, the p-value is 0.1169, which means that we fail to reject the null hypothesis that the variance of the residuals is constant. In other words, there is no association between the residuals of the model and the predicted values generated from the model.
If, however, there was heteroskedasticity, we could address this by estimating robust standard errors. For linear regression models, the most common method is to use the Huber-White sandwich estimation method. To do this, we need to use the sandwich package and the coeftest() function. (Note: In practice, I default to the robust standard errors rather than let the lm() function estimate these for me.)
### Huber-White sandwich estimation
### (reconstructed call; the exact vcovHC type used originally is not shown, and vcovHC() defaults to type = "HC3")
robust1 <- coeftest(linear.model1, vcov = vcovHC(linear.model1))
robust1
t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 97.080159   3.265127  29.732 < 2.2e-16 ***
Age          0.716418   0.095573   7.496  1.82e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(robust1)
                 2.5 %      97.5 %
(Intercept) 90.6704993 103.4898193
Age          0.5288023   0.9040339
Notice that the standard errors are slightly different from the ones estimated in the previous model. In the previous model, the 95% CI for Age was between 0.530 and 0.902. In the model with the robust standard errors, the 95% CI was between 0.529 and 0.904. The differences are trivial in this example, but they could matter when the 95% CI is close to the null value.
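To see the two sets of intervals for Age side by side, a quick sketch:

### Compare the model-based and robust 95% CIs for the Age coefficient
rbind(model_based = confint(linear.model1)["Age", ],
      robust      = confint(robust1)["Age", ])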
We can also evaluate whether the residuals are normally distributed by generating a histogram and a Q-Q plot. The histogram has a slight left skew, and the tails of the Q-Q plot deviate from the reference line.
#### Set up the matrix
par(mfrow = c(1, 2))

#### Histogram of the residuals
hist(linear.model1$res)

#### QQ-plot of the residuals against the QQ line
qqnorm(linear.model1$res); qqline(linear.model1$res, col = "2", lwd = 1, lty = 1)
We can also formally test the normality of the residuals. Common tests of normality include the Shapiro-Wilk test, the Jarque-Bera test, and the Kolmogorov-Smirnov (Lilliefors) test. The code for each is provided below. Despite the differences in their output, the conclusions are all the same: the residuals are not normally distributed. Even so, the linear regression model is pretty robust to violations of this assumption, and you can still conclude that there is an association between Age and Glucose level based on these findings.
#### Test normality using Shapiro-Wilk's test
shapiro.test(linear.model1$res)

	Shapiro-Wilk normality test

data:  linear.model1$res
W = 0.97467, p-value = 2.838e-10
#### Test normality using Jarque Bera test
JarqueBeraTest(linear.model1$res, robust = FALSE) ### Does not use robust method

	Jarque Bera Test

data:  linear.model1$res
X-squared = 25.565, df = 2, p-value = 2.81e-06
#### Test normality using the Kolmogorov-Smirnov (Lilliefors) test
lillie.test(linear.model1$res)

	Lilliefors (Kolmogorov-Smirnov) normality test

data:  linear.model1$res
D = 0.065397, p-value = 2.941e-08
Linear regression models are useful for understanding the relationship between a predictor variable and an outcome variable when the outcome is continuous. Additionally, you can add confounders to the regression model to control for their effects. Once you control for confounders, you should compare the results with the crude model to see how the relationship between the predictor of interest and the outcome changes. Finally, after reviewing the results of the linear regression model, it is good practice to look at the residuals and verify that the assumptions of homoscedasticity and normality continue to hold.
I would like to acknowledge Mark Bounthavong for providing the link to the dataset used in this tutorial together with the code and output interpretation. You can find out more about his work here: https://rpubs.com/mbounthavong/linear_regression_using_R.
The gtsummary package is great at merging outcomes from regression models into publication quality tables. Daniel D. Sjoberg authored the gtsummary package with instructions on his website. You can also use external functions by converting the gtsummary table into an object using as_gt(); instructions can be found here.