2.4 Simple linear regression


SCRIPT 2.4 Use this to follow along on this page and for the Practice exercises below.

 

Overview

“The general rule is straightforward but has surprising consequences: whenever the correlation between two scores is imperfect, there will be regression to the mean.” - Daniel Kahneman, Thinking, Fast and Slow

One of the most common and powerful tools in the statistical toolbox is linear regression. The concept and basic toolset were created by Francis Galton in the course of investigating the heritable basis of resemblance between parents and their children (e.g. height). Exemplary of one of the greatest traditions in science, a scientist identified a problem, created a tool to solve the problem, and then immediately shared the tool for the greater good. This is a slight digression from our purposes here, but you can learn more about it here:

 

Contents

2.4.1 The question of simple regression

2.4.2 Data and assumptions

2.4.3 Graphing

2.4.4 Test and alternatives

2.4.5 Practice exercises

 

 

2.4.1 The question of simple regression

The essential motivation for simple linear regression is to relate the value of a numeric variable to that of another variable. There may be several objectives to the analysis, typically including: describing the strength and direction of the relationship, testing whether the slope differs from zero, and predicting the value of y for a given value of x.

 

A few definitions

Equation 1 is the classic EQUATION for a linear regression model. (NB, EQUATION is emphasized here to make a sharp distinction between the equation representing the statistical model, and the R formula that we will use to implement it)

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad \text{(Equation 1)}$$

Equation 2 is our assumption for the residual error

$$\varepsilon_i \sim N(0, \sigma^2) \quad \text{(Equation 2)}$$

Equation 3 is our sum of squares (SS) error for the residuals

$$SS_{\text{residual}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{(Equation 3)}$$

Equation 4 is our estimate of the slope

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{(Equation 4)}$$

Equation 5 is our estimate of the intercept

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad \text{(Equation 5)}$$
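
To make Equations 4 and 5 concrete, here is a minimal sketch (not part of the original script) that computes the slope and intercept "by hand" for a small set of made-up x and y values and checks them against lm(). The values are hypothetical and only for illustration.

# Hypothetical data for illustration only
x <- c(1.2, 2.3, 3.1, 4.8, 5.5)  # predictor
y <- c(2.0, 3.9, 4.2, 7.1, 8.0)  # dependent variable

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) # Equation 4 (slope)
b0 <- mean(y) - b1 * mean(x)                                    # Equation 5 (intercept)

c(intercept = b0, slope = b1)
coef(lm(y ~ x)) # should match the hand calculation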

 

 

 

2.4.2 Data and assumptions

We will explore the simple regression model in R using the Kaggle fish market dataset.

 

# Download the fish data .xlsx file linked above and load it into R
# (I named my data object "fish") 
# Try this:

names(fish)
table(fish$Species)

# slice out the rows for Perch

fish$Species=="Perch" #just a reminder
perch <- fish[fish$Species=="Perch" , ]
head(perch)
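
If you are unsure how to get the downloaded .xlsx file into R in the first place, here is one possible loading sketch. It assumes the readxl package is installed and that the file was saved as "fish.xlsx" in your working directory (both the package choice and the file name are assumptions, so adjust to match your own download).

# install.packages("readxl") # if needed
library(readxl)
fish <- read_excel("fish.xlsx") # hypothetical file name - use your own
fish <- as.data.frame(fish)     # plain data frame, so the indexing above behaves as shown
str(fish)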

 

The principal assumptions of simple linear regression are:

- the relationship between the dependent variable and the predictor is linear
- the observations (and hence the residuals) are independent of one another
- the residuals follow a Gaussian (normal) distribution
- the residuals have constant variance across the fitted values (homoscedasticity)

 

 

 

2.4.3 Graphing

The traditional way to graph a simple linear regression is with a scatterplot, with the dependent variable on the y axis and the predictor variable on the x axis. The regression equation above can be used to estimate the line of best fit for the sample data, which gives the predicted value of y for any value of x. Thus, prediction is one of the functions of regression (for example, predicting the value of y at a given value of x if further data were collected). This line of best fit is usually drawn on plots representing the regression.

The simple regression function in R is lm() (for linear model). We will use it to estimate the regression coefficients and the line of best fit.

# Try this:
# A simple regression of perch Width as the predictor variable (x)
# and Height as the dependent (y) variable

# First make a plot
plot(y = perch$Height, x = perch$Width,
     ylab = "Height", xlab = "Width",
     main = "My perch regression plot",
     pch = 20, col = "blue", cex = 1)

# Does it look like there is a strong linear relationship?
# (it looks very strong to me)

# In order to draw on the line of best fit we must calculate the regression

?lm # NB the formula argument...

# We usually would store the model output in an object

mylm <- lm(formula = Height ~ Width, # read y "as a function of" x 
           data =  perch)
mylm # NB the intercept (0.30), and the slope (1.59)

# We use the abline() function to draw the regression line onto our plot
# NB the reg argument, which takes our fitted model object

?abline
abline(reg = mylm) # Not bad

# Some people like to summarize the regression equation on their plot
# We can do that with the text() function
# y = intercept + slope * x
?text
text(x = 3,    # x axis placement
     y = 11,   # y axis placement
     labels = "y = 0.30 + (1.59) * x")
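
Since prediction was mentioned above as one of the purposes of regression, here is a brief sketch (not part of the original script) of using the fitted model to predict Height for new Width values; the new widths chosen are hypothetical.

# Predict Height for some new (hypothetical) Width values
newdat <- data.frame(Width = c(3, 4, 5))
predict(mylm, newdata = newdat)                          # point predictions
predict(mylm, newdata = newdat, interval = "prediction") # with 95% prediction intervals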

 

 

Testing the assumptions

The data scientist must take responsibility for the assumptions of their analyses, and for validating the statistical model. A basic part of Exploratory Data Analysis (EDA) is to formally test and visualise the assumptions. We will briefly do this in a few ways. Before we begin, it is important to acknowledge that this part of the analysis is subjective and subtle, which is to say that it is hard to perform without practice. As much as we might wish that Null Hypothesis Significance Testing were totally objective, the opposite is true, and the practice of data analysis requires experience.

Here, we will specifically test two of the assumptions mentioned above: Gaussian residual distribution and homoscedasticity. We will examine both graphically, and we will also formally test the assumption of Gaussian residuals.

To start with, let’s explicitly visualize the residuals. This step might be unusual in a standard exploration of regression assumptions, but for our purposes here it serves to make explicit what the residuals actually are.

## Test assumptions ####
# Try this:

# Test Gaussian residuals

# Make our plot and regression line again
plot(y = perch$Height, x = perch$Width,
     ylab = "Height", xlab = "Width",
     main = "My perch RESIDUAL plot",
     pch = 20, col = "blue", cex = 1)
abline(reg = mylm)

# We can actually "draw on" the magnitude of residuals
arrows(x0 = perch$Width,
       x1 = perch$Width,
       y0 = predict(mylm), # start residual line on PREDICTED values
       y1 = predict(mylm) + residuals(mylm), # length of residual
       length = 0) # makes arrowhead length zero (or it looks weird here)
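
As a quick numeric check of what the arrows represent (a sketch, not in the original script), the residuals are simply the observed values minus the fitted (predicted) values.

# residuals = observed - predicted
head(residuals(mylm))
head(perch$Height - predict(mylm)) # should match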

 

Closer look at the residual distribution

Remember how we visually examine distributions? With a frequency histogram and possibly a q-q plot right? Here we will do those for a peek, but we will also add a formal, objective test of deviation from normality. This part of exploratory data analysis is subtle and requires experience (i.e. it is hard), and there are many approaches. Our methods here are a starting point.

 

# residual distribution
# Try this:

library(car) # for qqPlot()

par(mfrow = c(1,2)) # Print graphs into 1x2 grid (row,column)

hist(residuals(mylm), main = "")
qqPlot(residuals(mylm))

par(mfrow = c(1,1)) # Set back to 1x1

Diagnosis

 

It is your job as a data scientist to be skeptical of data, assumptions, and conclusions. Do not pussyfoot around this.

It is not good enough to merely make these diagnostic graphs robotically; the whole point is to judge whether the assumptions have been violated. This is important (and, remember, hard) because if the assumptions are not met, the statistical model that depends on them is unlikely to be valid. Here, we can look a little closer at the histogram and the expected Gaussian distribution, and we can also perform a formal statistical test to help us decide.

## Gussie up the histogram ####

# Make a new histogram
hist(residuals(mylm), 
     xlim = c(-2, 2), ylim = c(0,.9),
     main = "",
     prob = T) # We want probability density this time (not frequency)

# Add a density line to just help visualize "where the data are"
lines(                       # lines() function
  density(residuals(mylm)),   # density() function
  col = "green4", lty = 1, lwd = 3) # Mere vanity

# NB curve() below generates its own x values across the plot's x range

# Draw on theoretical Gaussian for our residual parameters
curve(dnorm(x, mean = mean(residuals(mylm)),
            sd = sd(residuals(mylm))),
      add = T,
      col = "blue", lty = 3, lwd = 3) # mere vanity

# Draw on expected mean
abline(v = 0, # vertical line at the EXPECTED resid. mean = 0
       col = "red", lty = 2, lwd = 3) # mere vanity

# Add legend
legend(x = .6, y = .9,
       legend = c("Our residuals", "Gaussian", "Mean"),
       lty = c(1,3,2),
       col = c("green4", "blue","red"), lwd = c(3,3,3))

Diagnosis

 

Finally, let’s perform a statistical test of whether there is evidence that our residuals deviate from a Gaussian distribution. There are many options for this, but in the interest of brevity we will consider only one here for illustration. We will (somewhat arbitrarily) use the Shapiro-Wilk test for normality.

Side note: Tests like this are a bit atypical within the NHST framework. Usually when we perform a statistical test, we have a hypothesis WE BELIEVE TO BE TRUE that there is a difference (say, between the regression slope and zero, or perhaps between two means in a different kind of test), and we test it against the null hypothesis of NO DIFFERENCE. When we perform such a test and examine the p-value, we compare the p-value to our chosen alpha value.

 

The tyranny of the p-value

The rule we traditionally use is that we reject the null of no difference if our calculated p-value is lower than our chosen alpha (usually 0.05**). When testing assumptions, as here, the null of no difference is the hypothesis we actually believe to be true, yet we still typically use the 0.05 alpha threshold. In this case, when p > 0.05, we take it as a lack of evidence that there is a difference. NB this is slightly different from constituting EVIDENCE that there is NO DIFFERENCE!

**The good old p-value is sometimes misinterpreted, or relied on “too heavily”. Read more about this important idea in Altman and Krzywinski 2017.

 

## Shapiro test ####
# Try this:
  
shapiro.test(residuals(mylm))

R output

 

Reporting the test of assumptions

Reporting the evidence that the assumptions underlying statistical tests have been checked and are “OK” is often understated, even though it is a very important part of the practice of statistics. Based on the results of our Shapiro-Wilk test, we might report our findings in this way (in a Methods section), prior to reporting the results of our regression (in the Results section):

We found no evidence that our assumption of Gaussian residual distribution was violated (Shapiro-Wilk: W = 0.97, n = 56, p = 0.14).
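
If you prefer to pull those numbers out of R rather than reading them off the console, a small sketch (assuming the mylm object from above) is:

# Extract the values used in the report above
sw <- shapiro.test(residuals(mylm))
sw$statistic            # W
sw$p.value              # p
length(residuals(mylm)) # n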

 

Diagnostic plots and heteroscedasticity

Although heteroscedasticity is challenging to pronounce and spell (help pronouncing it here; strong opinion about spelling it here), the concept is simple: the variance of the residuals should be constant across the predicted values, and heteroscedasticity is the violation of that assumption. We usually examine this visually, which is easy to do in R.

## Heteroscedasticity ####

# Try this:
plot(y = residuals(mylm), x = fitted(mylm),
     pch = 16, cex = .8) 

# There is a lot hidden inside our regression object
summary(mylm)$sigma # Voila: The residual standard error

(uci <- summary(mylm)$sigma*1.96) # upper 95% CI limit
(lci <- -summary(mylm)$sigma*1.96) # lower 95% CI limit

# Add lines for mean and upper and lower 95% CI
abline(h = c(0, uci, lci),
       lwd = c(2,2,2),
       lty = c(2,3,3),
       col = c("blue", "red", "red"))

 

What we are looking for in this graph, ideally, is an even spread of residuals across the x axis, which represents our fitted values. Remember, the fitted values here are the predicted perch Height for each observed Width, and each data point is the residual for a single observation of perch Height. The horizontal blue line at zero marks the predicted value itself, so the vertical distance between each data point and that line is the residual error.

We are also looking for an absence of any systematic pattern in the residuals, which might suggest a lack of independence.
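
If you would like a formal test to complement the visual check, one option (a sketch, not part of the original text) is the ncvTest() function from the car package, which we already loaded above; it tests for non-constant error variance.

# Formal test of non-constant variance (heteroscedasticity)
library(car)
ncvTest(mylm) # a small p-value would suggest heteroscedasticity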

 

Diagnosis

 

 

2.4.4 Test and alternatives

You have examined your data and tested the assumptions of simple linear regression, and you are happy to proceed. Let’s look at the main results of the regression.

## Regression results ####
# Try this:

# Full results summary
summary(mylm)

R Output

This full results summary is important to understand (NB the summary() function produces different output depending on the class() of the object passed to it).

 

Reporting results

A typical way to report the results of our regression model might be: We found a significant linear relationship for Width predicting Height in perch (regression: R-squared = 0.97, df = 1,54, P < 0.0001). Of course, this would be accompanied by an appropriate graph if important and relevant in the context of other results.

As usual, reporting copied and pasted results that have not been summarised appropriately is regarded as very poor practice, even for beginning students.
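
One way to avoid copying and pasting raw console output is to extract the quantities you report directly from the summary object; a small sketch (using the mylm object from above) is:

# Pull out the pieces typically reported
mysum <- summary(mylm)
mysum$r.squared    # R-squared
mysum$coefficients # estimates, standard errors, t-values, p-values
mysum$fstatistic   # F statistic and its degrees of freedom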

 

Alternatives to regression

There are actually many alternatives to simple linear regression in case our data do not conform to the assumptions. Some of these are quite advanced and beyond the scope of this bootcamp (like weighted regression, or explicitly modelling the variance in some way). The most reasonable solutions to try first would be data transformation or, if it were adequate merely to demonstrate a relationship between the variables, Spearman rank correlation. A final alternative of intermediate difficulty might be nonparametric regression, as implemented in Kendall-Theil-Siegel nonparametric regression.
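
As a starting point for those alternatives, here are two brief sketches: a Spearman rank correlation using base R, and a Kendall-Theil-Siegel fit using the mblm package (the choice of mblm here is an assumption, not something prescribed by the text).

# Spearman rank correlation (base R): demonstrates a monotonic relationship
cor.test(perch$Width, perch$Height, method = "spearman")

# Kendall-Theil-Siegel nonparametric regression via the mblm package
# install.packages("mblm") # if needed
library(mblm)
mymblm <- mblm(Height ~ Width, perch)
summary(mymblm)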

 

 

 

2.4.5 Practice exercises

For the following exercises, we continue to use the fish dataset.

 

1 Test whether the assumption of Gaussian residuals holds for the R formula Weight ~ Length1 for perch in the fish dataset. Describe the evidence for why or why not; show your code.

 

2 Perform the regression for Weight ~ Height for the species Bream. Assess whether the residuals fit the Gaussian assumption. Present any graphical tests or other results and your conclusion in the scientific style.

 

3 For the analysis in #2 above present the results of your linear regression (if the residuals fit the Gaussian assumption) or a Spearman rank correlation (if they did not).

 

4 Plot perch$Weight ~ perch$Length2. The relationship is obviously not linear but curved. Devise and execute a solution to enable the use of linear regression, possibly by transforming the data. Show any relevant code and briefly explain your results and conclusions.

 

5 Explore the data for perch and describe the covariance of all of the morphological, numeric variables using all relevant means, while being as concise as possible. Show your code.

 

6 Write a plausible practice question involving the exploration or analysis of regression. Make use of the fish data from any species except for Perch.