7  Linear Regression

Setup Code (Click to Expand)
# packages needed to run the code in this section
# !pip install pandas numpy matplotlib seaborn statsmodels

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from matplotlib import rc

# data path
path = '../data/'
file_name = 'deprivation.csv'

# import data
df = pd.read_csv(f"{path}{file_name}")

# set plot style
sns.set_style('whitegrid')

# set plot font
rc('font',**{'family':'sans-serif','sans-serif':['Arial']})

# set plot colour palette
colours = ['#1C355E', '#00A499', '#005EB8']
sns.set_palette(sns.color_palette(colours))

Linear regression is a cornerstone of data science and statistical inference, and is almost certainly the most common statistical method in use today (t-tests might run it close in industry). Its popularity is owed to its flexibility and simplicity: it can be used to model a wide range of relationships in a variety of contexts, and it is easy both to fit and to interpret. If there is one statistical method that every data scientist and analyst should be familiar with, it is linear regression (followed closely by logistic regression, which is closely related).

Not only is linear regression a phenomenally useful tool in a data scientist’s arsenal, but it is also a fantastic way to learn the fundamentals of statistical inference. It is a natural starting point for learning about regression analysis more broadly, because so many regression models build on it. Once you have a good understanding of linear regression, you can move on to more advanced regression models, such as logistic regression, Poisson regression, and so on.

For a really good, accessible introduction to linear regression, I’d highly recommend Josh Starmer’s StatQuest video on the topic.

This tutorial will introduce you to linear regression in Python, using data pulled from two different sources. The first data source is Public Health England’s Fingertips API. The data included in this tutorial relates to health outcomes (life expectancy at birth) and deprivation (Indices of Multiple Deprivation (IMD) score and decile). The second data source is the Office for National Statistics’ Health Index, which is a composite measure of the health of the population in England. The Health Index is derived from a wide range of indicators related to three health domains: Healthy People (health outcomes), Healthy Lives (health-related behaviours and personal circumstances), and Healthy Places (wider drivers of health that relate to the places people live). The data included in this tutorial comes from two composite sub-domains in the Healthy Lives domain, which measure the physiological and behavioural risk factors that are associated with poor health outcomes. Both data sources are aggregated at the upper-tier local authority level, and the data covers the years 2015 to 2018.

7.1 Exploratory Data Analysis

# data summary
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   area_code            583 non-null    object 
 1   year                 583 non-null    int64  
 2   life_expectancy      583 non-null    float64
 3   imd_score            583 non-null    float64
 4   imd_decile           583 non-null    int64  
 5   physiological_score  583 non-null    float64
 6   behavioural_score    583 non-null    float64
dtypes: float64(4), int64(2), object(1)
memory usage: 32.0+ KB
# view first five rows
df.head()
   area_code  year  life_expectancy  imd_score  imd_decile  physiological_score  behavioural_score
0  E06000001  2015        78.738395     35.037           1                 89.9               84.9
1  E06000001  2016        79.073210     35.037           1                 88.8               84.4
2  E06000001  2017        79.080000     35.037           1                 84.8               82.7
3  E06000001  2018        78.810000     35.037           1                 84.6               81.0
4  E06000002  2015        77.787260     40.460           1                 96.4               80.3
df.describe()
              year  life_expectancy   imd_score  imd_decile  physiological_score  behavioural_score
count   583.000000       583.000000  583.000000  583.000000           583.000000         583.000000
mean   2016.497427        81.167021   23.182043    5.442539            99.015952          99.395026
std       1.118223         1.612812    8.043172    2.827766            10.012470           8.698003
min    2015.000000        76.535000    5.846000    1.000000            78.200000          72.400000
25%    2015.500000        80.011722   17.016000    3.000000            92.100000          93.250000
50%    2016.000000        81.210000   22.965000    5.000000            97.900000         100.600000
75%    2017.000000        82.229125   28.104000    8.000000           104.500000         105.900000
max    2018.000000        86.050000   45.039000   10.000000           125.900000         120.000000
# imd
sns.scatterplot(x='life_expectancy', y='imd_score', data=df)
sns.despine()
plt.show()

# physiological risk factors
sns.scatterplot(x='life_expectancy', y='physiological_score', data=df)
sns.despine()
plt.show()

# behavioural risk factors
sns.scatterplot(x='life_expectancy', y='behavioural_score', data=df)
sns.despine()
plt.show()

# correlations with life expectancy
df.corr(numeric_only=True).loc[:,'life_expectancy']
year                  -0.005660
life_expectancy        1.000000
imd_score             -0.837260
imd_decile             0.813546
physiological_score    0.581565
behavioural_score      0.887986
Name: life_expectancy, dtype: float64
# correlation matrix
corr = df[[
    'life_expectancy', 'imd_score',
    'physiological_score', 'behavioural_score']].corr()
corr
                     life_expectancy  imd_score  physiological_score  behavioural_score
life_expectancy             1.000000  -0.837260              0.581565           0.887986
imd_score                  -0.837260   1.000000             -0.378271          -0.807654
physiological_score         0.581565  -0.378271              1.000000           0.587748
behavioural_score           0.887986  -0.807654              0.587748           1.000000
# plot correlation matrix
sns.heatmap(corr,
    cmap=sns.diverging_palette(220, 10, as_cmap=True),
    vmin=-1.0, vmax=1.0,
    square=True, annot=True)
plt.show()

There are strong correlations between life expectancy and the three independent variables. However, the correlation between IMD score and the two risk factor variables is also relatively strong, and in the case of behavioural risk factors it is very strong. This could be an issue, because highly correlated explanatory variables (multicollinearity) can make the individual coefficient estimates unstable and harder to interpret.
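
One common way to quantify this concern is to calculate variance inflation factors (VIFs) for the explanatory variables. The sketch below uses statsmodels’ variance_inflation_factor on the three variables we are about to model; a VIF above roughly 5–10 is often taken as a sign of problematic collinearity.

# variance inflation factors for the explanatory variables
from statsmodels.stats.outliers_influence import variance_inflation_factor

# include a constant so the VIFs are calculated against the same design
# matrix the regression will use (the VIF for the constant can be ignored)
predictors = sm.add_constant(
    df[['imd_score', 'physiological_score', 'behavioural_score']])

vif = pd.Series(
    [variance_inflation_factor(predictors.values, i)
     for i in range(predictors.shape[1])],
    index=predictors.columns)

print(vif)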

7.2 Linear Regression

# create model
model = sm.OLS.from_formula(
    'life_expectancy ~ imd_score + physiological_score + behavioural_score',
    data=df)

# fit model
results = model.fit()

# print summary
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.844
Model:                            OLS   Adj. R-squared:                  0.843
Method:                 Least Squares   F-statistic:                     1042.
Date:                Fri, 16 Feb 2024   Prob (F-statistic):          7.23e-233
Time:                        11:52:46   Log-Likelihood:                -564.29
No. Observations:                 583   AIC:                             1137.
Df Residuals:                     579   BIC:                             1154.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              71.5915      0.623    114.851      0.000      70.367      72.816
imd_score              -0.0775      0.006    -13.583      0.000      -0.089      -0.066
physiological_score     0.0238      0.003      7.141      0.000       0.017       0.030
behavioural_score       0.0907      0.006     15.020      0.000       0.079       0.103
==============================================================================
Omnibus:                      171.707   Durbin-Watson:                   0.765
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              611.061
Skew:                           1.343   Prob(JB):                    2.04e-133
Kurtosis:                       7.236   Cond. No.                     3.36e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.36e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

7.3 Evaluating and Interpreting Regression Models

When we fit a regression model, we have to consider both how well the model fits the data (evaluation) and the influence that each explanatory variable is having on the outcome (interpretation). The evaluation process is making sure that the model is doing a good enough job that we can reasonably draw some conclusions from it, while the interpretation process is understanding the conclusions we can draw. It’s important to understand the purpose that both of these processes serve, and to understand that the evaluation process is just kicking the tires, while what we are ultimately concerned with is the interpretation.

There are several elements to consider when interpreting the results of a linear regression model. Broadly speaking, the focus should be on the magnitude and the precision of the model coefficients. The magnitude of the coefficients indicates the strength of the association between the explanatory variables and the outcome variable, while the precision of the coefficients indicates the uncertainty in the estimated coefficients. We use the regression model coefficients to understand the magnitude of the effect, and standard errors (and confidence intervals) to understand the precision of the effect.

7.3.1 Model Evaluation

There are a number of ways we might evaluate model performance, but the starting point should be the model’s \(R^2\) and the p-values of the model coefficients. These can help us consider whether the model fits the data and whether we have enough data to trust the conclusions we can draw from each explanatory variable.

7.3.1.1 \(R^2\)

The \(R^2\) value is the proportion of the variance in the outcome that is explained by the model. It is often used as a measure of the goodness of fit of the model, and by extension the extent to which the specified model explains the outcome, but caution should be applied when using \(R^2\) this way, because it can be misleading and doesn’t always tell the whole story. One of the main problems with \(R^2\) in an inferential context is that adding more variables to the model will never decrease (and will almost always increase) the \(R^2\) value, even if the new variables are not actually related to the outcome. Adjusted \(R^2\) values account for this problem, but they don’t resolve every issue with \(R^2\), so it is important to treat \(R^2\) as just one of a number of indicators of the model’s fit. A high \(R^2\) doesn’t necessarily mean that the model is a good fit for the data. A particularly low \(R^2\), on the other hand, is a good indication that the model is not a good fit, and at the very least it is good reason to be cautious about the results of the analysis and to take a closer look at the model and the data to understand why the \(R^2\) is so low.

Defining high/low \(R^2\) values is dependent on the context of the analysis. The theory behind the model should inform our expectations about the \(R^2\) value, and we should also consider the number of explanatory variables in the model (adding variables to a regression will increase the \(R^2\)). If the theory suggests that the model should explain a small amount of the variance in the outcome, and the model has only a few explanatory variables, then a low \(R^2\) value might not be a cause for concern.

The Adjusted \(R^2\) value for the above regression model is 0.843, which means that the model explains 84.3% of the observed variance in life expectancy in our dataset. This is a relatively high \(R^2\) value.
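
Both values can also be read directly from the fitted results object (a quick check, using the results object fitted above):

# r-squared and adjusted r-squared from the fitted model
print(round(results.rsquared, 3))
print(round(results.rsquared_adj, 3))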

7.3.1.2 P-Values

P-values are the probability of observing a coefficient at least as large (in absolute terms) as the estimated coefficient, given that the null hypothesis is true (i.e. that the true coefficient is equal to zero and no effect is present). This can be a little tricky to understand at first, but a slightly crude way of thinking about it is that the p-value measures how surprising the observed coefficient would be if the explanatory variable really had no effect on the outcome.

P-values are used to test the statistical significance of the coefficients. If the p-value is less than the significance level, then the coefficient is statistically significant, and we can reject the null hypothesis that the coefficient is equal to zero. If the p-value is greater than the significance level, then the coefficient is not statistically significant, and we cannot reject the null hypothesis. The significance level is often set at 0.05, which means that we are willing to accept a 5% chance of incorrectly rejecting the null hypothesis; in other words, if the true coefficient were zero, we would expect to see a result this extreme less than 5% of the time. However, the significance level is a subjective decision, and can be set at any value. For example, if the significance level is set at 0.01, then we are willing to accept a 1% chance of incorrectly rejecting the null hypothesis.
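
The p-values reported in the regression summary can be pulled out of the results object and compared against whichever significance level we choose (a short sketch using the fitted model from above):

# coefficient p-values, compared against a 0.05 significance level
alpha = 0.05
print(results.pvalues)
print(results.pvalues < alpha)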

In recent years p-values (and the wider concept of statistical significance) have been the subject of a great deal of discussion in the scientific community, because there is a belief among many practitioners that a lot of scientific research places too much weight on the p-value, treating it as the most important factor when interpreting a regression model. I think this perspective is correct, but I think dismissing p-values entirely is a mistake too. The real problem is not with p-values themselves, but with the goal of an analysis being the identification of a statistically significant effect.

I think the best way to think about p-values is the position set out by Gelman, Hill, and Vehtari (2020). Asking whether our coefficient is distinguishable from zero is the wrong question. As analysts, we should typically know enough about the context we are studying to build regression models with explanatory variables that will have some effect. The question is how much. A statistically insignificant model coefficient is less a sign that a variable has no effect, and more a sign that there is insufficient data to detect a meaningful effect. If our model coefficients are not statistically significant, it is simply a good indication that we need more data. If the model coefficients are statistically significant, then our focus should be on interpreting the coefficients, not on the p-value itself.

7.3.2 Model Interpretation

Model interpretation is the process of understanding the influence that each explanatory variable has on the outcome. This involves understanding the direction, the magnitude, and the precision of the model coefficients. The coefficients themselves will tell us about direction and magnitude, while the standard errors and confidence intervals will tell us about precision.

7.3.2.1 Model Coefficients

The model coefficients are the estimated values of the regression coefficients, or the estimated change in the outcome variable for a one unit change in the explanatory variable, while holding all other explanatory variables constant. The intercept is the estimated value of the outcome variable when all of the explanatory variables are equal to zero (or the baseline values if the variables are transformed).
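
Using the coefficients from the regression summary above, the fitted model can be written out as:

\[
\widehat{\text{life expectancy}} = 71.59 \; - \; 0.078 \times \text{IMD score} \; + \; 0.024 \times \text{physiological score} \; + \; 0.091 \times \text{behavioural score}
\]

Here the intercept (71.59) is the estimated life expectancy when all three scores are zero, a combination well outside the observed range of the data (the minimum behavioural score in the dataset is 72.4), which is why the intercept on its own is rarely of much substantive interest.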

A regression’s coefficients are the most important part of interpreting the model. They specify the association between the explanatory variables and the outcome, telling us both the direction and the magnitude of the association. The direction of the association is indicated by the sign of the coefficient (positive = outcome increases with the explanatory variable, negative = outcome decreases with the explanatory variable). The magnitude of the association is indicated by the size of the coefficient (the larger the coefficient, the stronger the association). If an association runs in the opposite direction to what we expect, this indicates a problem either with the theory or with the model. If the magnitude of the effect is too small to be practically significant, then either the explanatory variable doesn’t really matter for the outcome, or the model is not capturing the true relationship between the explanatory variables and the outcome. However, while direction is relatively easy to interpret, magnitude is not, because what counts as a practically significant effect is dependent on the context of the analysis.
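
One way to make the magnitude concrete is to scale a coefficient by a change that is meaningful in context. The sketch below (using the fitted model from above) pulls out the coefficients and estimates the difference in life expectancy associated with a ten-point difference in IMD score, holding the other variables constant; with a coefficient of roughly -0.078, that works out at around three-quarters of a year.

# estimated coefficients from the fitted model
print(results.params)

# estimated difference in life expectancy (in years) associated with a
# ten-point difference in IMD score, holding the other variables constant
print(results.params['imd_score'] * 10)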

7.3.2.2 Standard Errors

The standard error of a coefficient is an estimate of how much that coefficient would vary from sample to sample, and so it is a measure of the precision of the estimated association between the explanatory variable and the outcome. The larger the standard error, the less precise the coefficient estimate. Precisely estimated coefficients give us more confidence that the association between the explanatory variables and the outcome is real, and not just a result of random variation in the data. Defining what counts as ‘precise’ in any analysis, much like the coefficient magnitude, depends on the context of the analysis and the scale of the variables, so there is no universal threshold. A useful rule of thumb is to judge each standard error relative to its coefficient: a coefficient that is roughly two or more standard errors away from zero is unlikely to be the product of random variation alone, which is what the 95% confidence intervals in the regression summary express.
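
The standard errors and 95% confidence intervals can be extracted from the results object directly (again using the fitted model from above):

# standard errors and 95% confidence intervals for the coefficients
print(results.bse)
print(results.conf_int(alpha=0.05))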

It is important to recognise that a precise coefficient alone cannot be treated as evidence of a causal effect between the explanatory variable and the outcome, but it is, at least, a good indication that the explanatory variable is a good predictor of the outcome. In order to establish a causal relationship, we may consider methods that are more robust to the presence of confounding or omitted variables, the gold standard being randomised controlled trials, but with a strong theoretical basis and a good understanding of the data, we can make a good case for a causal relationship based on observational data.

7.4 Next Steps

7.5 Resources