14 Linear and Logistic Regression


Christopher Meaney, Mateen Shaikh, Andrea S. Doria, and Rahim Moineddin


Learning Objectives


• Introduce simple linear regression, multiple linear regression, and logistic regression.


• Discuss the parameterization of each of the above models and interpretation of estimated regression coefficients.


• Introduce Pearson’s and Spearman’s correlation.


Introduction


The purpose of this chapter is to introduce readers to commonly encountered regression models in biomedical research. In discussing regression, we touch on the relationship between the simple linear regression model and Pearson’s measure of correlation, and we briefly introduce the nonparametric Spearman’s correlation coefficient. Finally, the logistic regression model is discussed as a means of assessing covariate effects on a binary outcome variable. We demonstrate the computational implementation of these statistical models using R, and we provide illustrative examples of each method using the cartilage degeneration data set introduced in previous chapters.


Simple Linear Regression


Simple linear regression seeks to describe the dependence between a single outcome variable (y) and a single covariate (x). The outcome variable (y) is sometimes also referred to as the response or dependent variable. The covariate (x) is sometimes referred to as the independent, predictor, or explanatory variable. For example, in the cartilage degeneration data set introduced in Chapter 13, simple linear regression could be used to investigate the association between the extent of cartilage degeneration (our outcome variable) and age (our covariate). In particular, we are interested in assessing the extent to which the mean level of cartilage degeneration changes with age. In this particular example, both the outcome variable and the covariate are measured on a continuous scale; however, covariates may be of any type when handled appropriately, as discussed later in the chapter. When the outcome variable is not continuous (or not normally distributed), linear regression is not an appropriate regression model. Depending on the measurement scale of the outcome variable, an extended class of regression models, generalized linear models (GLMs), may be more appropriate. A thorough treatment of GLMs is given in McCullagh and Nelder.1 We explore one particular class of GLM suitable for binary outcome data: the logistic regression model. We do not cover other GLM classes, which have important applications for modelling count-based responses, categorical responses, and so on.
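As a brief, self-contained preview of the modelling syntax used in this chapter, the sketch below fits a linear model and a logistic GLM in R using the built-in mtcars data (not the cartilage data set); the variables mpg, wt, and am here are simply stand-ins for a continuous outcome, a covariate, and a binary outcome.

# Illustrative sketch using R's built-in mtcars data, not the cartilage data:
# lm() fits a linear model for a continuous outcome, while glm() with
# family = binomial fits a logistic regression for a binary (0/1) outcome.
fit_linear   <- lm(mpg ~ wt, data = mtcars)
fit_logistic <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit_linear)
summary(fit_logistic)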


The primary reason for fitting a simple linear regression model to data is to explore and estimate the linear dependence between covariate and outcome. Considering the cartilage degeneration example, we may use linear regression to investigate the following research question: Is there a linear relationship between a patient’s age (predictor variable) and area of degeneration (outcome variable), as in Fig. 14.1?


The simple linear regression model can be represented in a manner such that a particular value of the slope and a particular value of the intercept define the line (the overall linear trend) that passes as close as possible to all of the points of the scatterplot shown in the left panel of Fig. 14.2.


plot(x = X$age, y = X$cart2, xlab = "Age (years)",
     ylab = "Cartilage Degeneration (squared mm)")

Obviously, many possible straight lines will fit the observed data reasonably well (and even more could fit the data poorly). The goal is to find the best-fitting line for the observed data. If every data point fell on a single line, then every point (xi, yi) would satisfy the equation y = β0 + β1x for some value of the intercept, β0, and some value of the slope, β1. In practice, observed data points rarely fall exactly along the fitted regression line (see Fig. 14.2). The vertical distance between an observed data point and the model-fitted value is called the residual (or error) and is denoted ei. The linear model can therefore be written as:



yi = β0 + β1xi + ei


where yi represents the outcome variable (for subjects i = 1,…, n), and xi represents the covariate value (again, for subjects i = 1,…, n). The ei term represents the error term (or residual) for subject i, and the parameters β0 and β1 are the intercept and slope, respectively. Isolating the error term yields:


ei = yi − (β0 + β1xi)


Therefore, the error term is the difference between the observed value of the outcome (yi) and the corresponding value on the line (β0 + β1xi), as illustrated in the right panel of Fig. 14.2. In the simple linear regression model, the parameters are chosen so as to minimize the sum of the squared differences between the observed data and the fitted model values (i.e., they minimize the sum of the squared residuals).


Since the true values of the parameters in the population are unknown, they are estimated using the information contained in the observed sample of data, yielding regression parameter estimates denoted β̂0 and β̂1. Because the regression parameter estimates are chosen to minimize the sum of squared deviations between the observed data points and the model fit, the fitting algorithm for linear regression is called “least squares.” That is, linear regression parameter estimates minimize a least squares objective function.2,3
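As a concrete illustration of least squares, the short sketch below computes the closed-form estimates of the slope and intercept directly and compares them with lm(); it assumes the data frame X, with columns age and cart2, used elsewhere in this chapter.

# Closed-form least squares estimates for simple linear regression
# (assumes the data frame X with columns age and cart2 used in this chapter)
b1 <- sum((X$age - mean(X$age)) * (X$cart2 - mean(X$cart2))) /
      sum((X$age - mean(X$age))^2)        # slope estimate
b0 <- mean(X$cart2) - b1 * mean(X$age)    # intercept estimate
c(intercept = b0, slope = b1)
coef(lm(X$cart2 ~ X$age))                 # agrees with the estimates above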


Interpreting the Model


In R, linear model estimates are easily obtained with the lm command.

> summary(lm(X$cart2~X$age))

Call:
lm(formula = X$cart2~X$age)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9924 -0.8880 -0.1599  0.7259  4.1834

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.59651    1.48001  -3.106  0.00358 **
X$age        0.15054    0.02678   5.620 1.89e-06 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.341 on 38 degrees of freedom
Multiple R-squared:  0.4539,    Adjusted R-squared:  0.4396
F-statistic: 31.59 on 1 and 38 DF,  p-value: 1.889e-06


The least squares estimates of the slope and intercept parameters are 0.15 and −4.6, respectively. The units of the intercept are the same as the units of the outcome, which is mm2 in this case. The units of the slope, however, are a ratio of the outcome’s units to the covariate’s units; in this case, mm2/year. The slope parameter from the simple linear regression model denotes the expected change in the mean of the outcome variable for a unit increase in the covariate. In our example, a one-year increase in age is estimated to increase the mean level of cartilage degeneration by 0.15 mm2. The intercept in a simple linear regression model can be interpreted as the expected mean of the outcome when the covariate is held at zero. Depending on the measurement scale of the covariate, this may or may not make sense for a particular application. In our example, the model implies that the mean level of cartilage degeneration for someone of age zero is −4.59 mm2. This illustrates one danger of extrapolating the model, or making conclusions beyond the scope of the data: the ages of patients in the data set do not extend down to 0 years, so extending the line that far to draw conclusions is dangerous. One common way to give meaning to the intercept is to center the predictor around its mean by subtracting the predictor’s mean from each predictor value, x̃i = xi − x̄, and using the values of x̃i rather than xi in the model. The slope remains unchanged by this transformation of the covariate. The intercept still measures the mean value of the outcome when the (transformed) covariate is held at zero; however, under this transformation, zero represents the mean age. So, in this example, the intercept now represents the expected outcome for an individual whose age equals the mean age observed in the sample.
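A minimal sketch of this mean-centering in R, again assuming the data frame X with columns age and cart2:

# Centre age around its sample mean so that the intercept is interpretable
X$age_centered <- X$age - mean(X$age)
summary(lm(X$cart2 ~ X$age_centered))
# The slope is unchanged; the intercept now estimates the mean level of
# cartilage degeneration for a patient at the average observed age.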


Inference


The previous sections introduced the simple linear regression model and discussed parameter estimation and model interpretation. Here we consider two aspects of inference for simple linear regression: confidence interval estimation and hypothesis testing.


In the context of the linear regression model, we are interested in performing inference on the intercept and slope parameters. In particular, we are primarily interested in testing whether the value of a parameter equals some specific quantity. For example, focusing on the slope parameter, one is typically interested in testing:

H0: β1 = 0
Ha: β1 ≠ 0

Normal sampling theory can be used to show that, under the null hypothesis, the associated test statistic follows an exact t-distribution (with n − 2 degrees of freedom, as the regression model includes two parameter estimates). The t-test statistic is given below:



t = β̂1 / SE(β̂1)

T-based confidence intervals can be obtained by inverting the aforementioned test statistic, yielding an interval of the form:



β̂1 ± t(n−2, 1−α/2) × SE(β̂1)

Based on the output from the simple linear model, β̂1 = 0.1505 and SE(β̂1) = 0.0268. The critical value for a 95% confidence interval on 38 degrees of freedom (because n = 40) is t = 2.024. Using the estimated slope parameter, its corresponding standard error, and the critical value of the t38, 0.975 distribution, we can construct confidence intervals and perform hypothesis tests. Using the t-based approach defined above, we estimate a 95% confidence interval for the slope parameter to be (0.096, 0.205). The corresponding test statistic is 5.62, with an associated p-value of 1.89 × 10−6. The p-value is less than the traditional α = 0.05 level for declaring statistical significance, and the confidence interval does not include zero. These two facts lead us to reject the null hypothesis that β1 is zero. We conclude that a one-year increase in age results in a statistically significant 0.15 mm2 increase in expected cartilage degeneration.
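The same interval can be obtained in R, either with confint() or by applying the t-based formula directly; the sketch below assumes the lm() fit shown earlier.

# 95% confidence intervals for the regression coefficients
fit <- lm(X$cart2 ~ X$age)
confint(fit, level = 0.95)

# Equivalently, by hand for the slope:
b1  <- coef(summary(fit))["X$age", "Estimate"]
se1 <- coef(summary(fit))["X$age", "Std. Error"]
b1 + c(-1, 1) * qt(0.975, df = 38) * se1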


Pearson’s and Spearman’s Correlation Coefficients


A quantity describing the extent of linear dependence between two variables is Pearson’s correlation coefficient. Pearson’s correlation coefficient assumes that bivariate data exist for a set of n individuals; that is, we observe pairs of measurements (x1, y1),…, (xn, yn) on a sample of n subjects/items. The linear correlation between x and y as measured by Pearson’s correlation coefficient is expressed as:

r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²]

where x̄ and ȳ denote the sample means of x and y, respectively, and the sums run over i = 1,…, n.
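In R, both coefficients can be computed with cor(); a brief sketch, again assuming the data frame X with columns age and cart2:

# Pearson's (default) and Spearman's correlation between age and
# cartilage degeneration
cor(X$age, X$cart2)                           # Pearson
cor(X$age, X$cart2, method = "spearman")      # Spearman
cor.test(X$age, X$cart2)                      # Pearson estimate with test and CI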

