7  Linear Regression

Setup Code (Click to Expand)
# packages needed to run the code in this section
# !pip install pandas numpy matplotlib seaborn statsmodels

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from matplotlib import rc

# data path
path = '../data/'
file_name = 'deprivation.csv'

# import data
df = pd.read_csv(f"{path}{file_name}")

# set plot style
sns.set_style('whitegrid')

# set plot font
rc('font',**{'family':'sans-serif','sans-serif':['Arial']})

# set plot colour palette
colours = ['#1C355E', '#00A499', '#005EB8']
sns.set_palette(sns.color_palette(colours))

Linear regression is a cornerstone of data science and statistical inference, and is almost certainly the most common statistical method in use today (t-tests might run it close in industry). Its popularity is owed to its flexibility and simplicity: it can be used to model a wide range of relationships in a variety of contexts, and it is easy both to fit and to interpret. If there is one statistical method that every data scientist and analyst should be familiar with, it is linear regression (followed closely by logistic regression, which is closely related).

Not only is linear regression a phenomenally useful tool in a data scientist’s arsenal, but it is also a fantastic way to learn the fundamentals of statistical inference. It is a natural starting point for learning about regression analysis more broadly, because so many regression models build on it. Once you have a good understanding of linear regression, you can move on to more advanced regression models, such as logistic regression, Poisson regression, and so on.

For a really good, accessible introduction to linear regression, I’d highly recommend Josh Starmer’s StatQuest video on the topic.

This tutorial will introduce you to linear regression in Python, using data pulled from two different sources. The first data source is Public Health England’s Fingertips API. The data included in this tutorial relates to health outcomes (life expectancy at birth) and deprivation (Indices of Multiple Deprivation (IMD) score and decile). The second data source is the Office for National Statistics’ Health Index, which is a composite measure of the health of the population in England. The Health Index is derived from a wide range of indicators related to three health domains: Healthy People (health outcomes), Healthy Lives (health-related behaviours and personal circumstances), and Healthy Places (wider drivers of health that relate to the places people live). The data included in this tutorial comes from two composite sub-domains in the Healthy Lives domain, which measure the physiological and behavioural risk factors that are associated with poor health outcomes. Both data sources are aggregated at the upper-tier local authority level, and the data covers the years 2015 to 2018.

7.1 Exploratory Data Analysis

# data summary
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   area_code            583 non-null    object 
 1   year                 583 non-null    int64  
 2   life_expectancy      583 non-null    float64
 3   imd_score            583 non-null    float64
 4   imd_decile           583 non-null    int64  
 5   physiological_score  583 non-null    float64
 6   behavioural_score    583 non-null    float64
dtypes: float64(4), int64(2), object(1)
memory usage: 32.0+ KB
# view first five rows
df.head()
   area_code  year  life_expectancy  imd_score  imd_decile  physiological_score  behavioural_score
0  E06000001  2015        78.738395     35.037           1                 89.9               84.9
1  E06000001  2016        79.073210     35.037           1                 88.8               84.4
2  E06000001  2017        79.080000     35.037           1                 84.8               82.7
3  E06000001  2018        78.810000     35.037           1                 84.6               81.0
4  E06000002  2015        77.787260     40.460           1                 96.4               80.3
df.describe()
              year  life_expectancy   imd_score  imd_decile  physiological_score  behavioural_score
count   583.000000       583.000000  583.000000  583.000000           583.000000         583.000000
mean   2016.497427        81.167021   23.182043    5.442539            99.015952          99.395026
std       1.118223         1.612812    8.043172    2.827766            10.012470           8.698003
min    2015.000000        76.535000    5.846000    1.000000            78.200000          72.400000
25%    2015.500000        80.011722   17.016000    3.000000            92.100000          93.250000
50%    2016.000000        81.210000   22.965000    5.000000            97.900000         100.600000
75%    2017.000000        82.229125   28.104000    8.000000           104.500000         105.900000
max    2018.000000        86.050000   45.039000   10.000000           125.900000         120.000000
# imd
sns.scatterplot(x='life_expectancy', y='imd_score', data=df)
sns.despine()
plt.show()

# physiological risk factors
sns.scatterplot(x='life_expectancy', y='physiological_score', data=df)
sns.despine()
plt.show()

# behavioural risk factors
sns.scatterplot(x='life_expectancy', y='behavioural_score', data=df)
sns.despine()
plt.show()

# correlations with life expectancy
df.corr(numeric_only=True).loc[:,'life_expectancy']
year                  -0.005660
life_expectancy        1.000000
imd_score             -0.837260
imd_decile             0.813546
physiological_score    0.581565
behavioural_score      0.887986
Name: life_expectancy, dtype: float64
# correlation matrix
corr = df[[
    'life_expectancy', 'imd_score',
    'physiological_score', 'behavioural_score']].corr()
corr
                     life_expectancy  imd_score  physiological_score  behavioural_score
life_expectancy             1.000000  -0.837260              0.581565           0.887986
imd_score                  -0.837260   1.000000             -0.378271          -0.807654
physiological_score         0.581565  -0.378271              1.000000           0.587748
behavioural_score           0.887986  -0.807654              0.587748           1.000000
# plot correlation matrix
sns.heatmap(corr,
    cmap=sns.diverging_palette(220, 10, as_cmap=True),
    vmin=-1.0, vmax=1.0,
    square=True, annot=True)
plt.show()

There are strong correlations between life expectancy and the three independent variables. However, the correlation between IMD score and the two risk factor variables is also relatively strong, and in the case of behavioural risk factors it is very strong. This could be an issue, because highly correlated explanatory variables (multicollinearity) can make the individual coefficient estimates unstable and harder to interpret.
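
One common way to quantify this concern is to calculate variance inflation factors (VIFs) for the explanatory variables. The sketch below uses statsmodels’ variance_inflation_factor on the three variables we are about to model; a VIF above roughly 5–10 is often taken as a sign of problematic collinearity.

# variance inflation factors for the explanatory variables
from statsmodels.stats.outliers_influence import variance_inflation_factor

# include a constant so the VIFs are calculated against the same design
# matrix the regression will use (the VIF for the constant can be ignored)
predictors = sm.add_constant(
    df[['imd_score', 'physiological_score', 'behavioural_score']])

vif = pd.Series(
    [variance_inflation_factor(predictors.values, i)
     for i in range(predictors.shape[1])],
    index=predictors.columns)

print(vif)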

7.2 Linear Regression

# create model
model = sm.OLS.from_formula(
    'life_expectancy ~ imd_score + physiological_score + behavioural_score',
    data=df)

# fit model
results = model.fit()

# print summary
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        life_expectancy   R-squared:                       0.844
Model:                            OLS   Adj. R-squared:                  0.843
Method:                 Least Squares   F-statistic:                     1042.
Date:                Fri, 16 Feb 2024   Prob (F-statistic):          7.23e-233
Time:                        11:52:46   Log-Likelihood:                -564.29
No. Observations:                 583   AIC:                             1137.
Df Residuals:                     579   BIC:                             1154.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              71.5915      0.623    114.851      0.000      70.367      72.816
imd_score              -0.0775      0.006    -13.583      0.000      -0.089      -0.066
physiological_score     0.0238      0.003      7.141      0.000       0.017       0.030
behavioural_score       0.0907      0.006     15.020      0.000       0.079       0.103
==============================================================================
Omnibus:                      171.707   Durbin-Watson:                   0.765
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              611.061
Skew:                           1.343   Prob(JB):                    2.04e-133
Kurtosis:                       7.236   Cond. No.                     3.36e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.36e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

7.3 Evaluating and Interpreting Regression Models

When we fit a regression model, we have to consider both how well the model fits the data (evaluation) and the influence that each explanatory variable is having on the outcome (interpretation). The evaluation process is making sure that the model is doing a good enough job that we can reasonably draw some conclusions from it, while the interpretation process is understanding the conclusions we can draw. It’s important to understand the purpose that both of these processes serve, and to understand that the evaluation process is just kicking the tires, while what we are ultimately concerned with is the interpretation.

There are several elements to consider when interpreting the results of a linear regression model. Broadly speaking, the focus should be on the magnitude and the precision of the model coefficients. The magnitude of the coefficients indicates the strength of the association between the explanatory variables and the outcome variable, while the precision of the coefficients indicates the uncertainty in the estimated coefficients. We use the regression model coefficients to understand the magnitude of the effect, and standard errors (and confidence intervals) to understand the precision of the effect.

7.3.1 Model Evaluation

There are a number of ways we might evaluate model performance, but the starting point should be the model’s \(R^2\) and the p-values of the model coefficients. These can help us consider whether the model fits the data and whether we have enough data to trust the conclusions we can draw from each explanatory variable.

7.3.1.1 \(R^2\)

The \(R^2\) value is the proportion of the variance in the outcome that is explained by the model. It is often used as a measure of the goodness of fit of the model, and by extension the extent to which the specified model explains the outcome, but caution should be applied when using \(R^2\) this way, because it can be misleading and doesn’t always tell the whole story. One of the main problems with \(R^2\) in an inferential context is that adding more variables to the model will never decrease (and will almost always increase) the \(R^2\) value, even if the new variables are not actually related to the outcome. Adjusted \(R^2\) values account for this problem, but they don’t resolve every issue with \(R^2\), so it is important to treat \(R^2\) as just one of a number of indicators of the model’s fit. A high \(R^2\) doesn’t necessarily mean that the model is a good fit for the data. A particularly low \(R^2\), on the other hand, is a good indication that the model is not a good fit, and at the very least it is good reason to be cautious about the results of the analysis and to take a closer look at the model and the data to understand why the \(R^2\) is so low.

Defining high/low \(R^2\) values is dependent on the context of the analysis. The theory behind the model should inform our expectations about the \(R^2\) value, and we should also consider the number of explanatory variables in the model (adding variables to a regression will increase the \(R^2\)). If the theory suggests that the model should explain a small amount of the variance in the outcome, and the model has only a few explanatory variables, then a low \(R^2\) value might not be a cause for concern.

The Adjusted \(R^2\) value for the above regression model is 0.843, which means that the model explains 84.3% of the observed variance in life expectancy in our dataset. This is a relatively high \(R^2\) value.
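
Both values can also be read directly from the fitted results object (a quick check, using the results object fitted above):

# r-squared and adjusted r-squared from the fitted model
print(round(results.rsquared, 3))
print(round(results.rsquared_adj, 3))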

7.3.1.2 P-Values

P-values are the probability of observing a coefficient at least as large (in absolute terms) as the estimated coefficient, given that the null hypothesis is true (i.e. that the true coefficient is equal to zero and no effect is present). This can be a little tricky to understand at first, but a slightly crude way of thinking about it is that the p-value measures how surprising the observed coefficient would be if the explanatory variable really had no effect on the outcome.

P-values are used to test the statistical significance of the coefficients. If the p-value is less than the significance level, then the coefficient is statistically significant, and we can reject the null hypothesis that the coefficient is equal to zero. If the p-value is greater than the significance level, then the coefficient is not statistically significant, and we cannot reject the null hypothesis. The significance level is often set at 0.05, which means that we are willing to accept a 5% chance of incorrectly rejecting the null hypothesis; in other words, if the true coefficient were zero, we would expect to see a result this extreme less than 5% of the time. However, the significance level is a subjective decision, and can be set at any value. For example, if the significance level is set at 0.01, then we are willing to accept a 1% chance of incorrectly rejecting the null hypothesis.
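
The p-values reported in the regression summary can be pulled out of the results object and compared against whichever significance level we choose (a short sketch using the fitted model from above):

# coefficient p-values, compared against a 0.05 significance level
alpha = 0.05
print(results.pvalues)
print(results.pvalues < alpha)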

In recent years p-values (and the wider concept of statistical significance) have been the subject of a great deal of discussion in the scientific community, because there is a belief among many practitioners that a lot of scientific research places too much weight on the p-value, treating it as the most important factor when interpreting a regression model. I think this perspective is correct, but I think dismissing p-values entirely is a mistake too. The real problem is not with p-values themselves, but with the goal of an analysis being the identification of a statistically significant effect.

I think the best way to think about p-values is the position set out by Gelman, Hill, and Vehtari (2020). Asking whether our coefficient is distinguishable from zero is the wrong question. As analysts, we should typically know enough about the context we are studying to build regression models with explanatory variables that will have some effect. The question is how much. A statistically insignificant model coefficient is less a sign that a variable has no effect, and more a sign that there is insufficient data to detect a meaningful effect. If our model coefficients are not statistically significant, it is simply a good indication that we need more data. If the model coefficients are statistically significant, then our focus should be on interpreting the coefficients, not on the p-value itself.

7.3.2 Model Interpretation

Model interpretation is the process of understanding the influence that each explanatory variable has on the outcome. This involves understanding the direction, the magnitude, and the precision of the model coefficients. The coefficients themselves will tell us about direction and magnitude, while the standard errors and confidence intervals will tell us about precision.

7.3.2.1 Model Coefficients

The model coefficients are the estimated values of the regression coefficients, or the estimated change in the outcome variable for a one unit change in the explanatory variable, while holding all other explanatory variables constant. The intercept is the estimated value of the outcome variable when all of the explanatory variables are equal to zero (or the baseline values if the variables are transformed).
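
Using the coefficients from the regression summary above, the fitted model can be written out as:

\[
\widehat{\text{life expectancy}} = 71.59 \; - \; 0.078 \times \text{IMD score} \; + \; 0.024 \times \text{physiological score} \; + \; 0.091 \times \text{behavioural score}
\]

Here the intercept (71.59) is the estimated life expectancy when all three scores are zero, a combination well outside the observed range of the data (the minimum behavioural score in the dataset is 72.4), which is why the intercept on its own is rarely of much substantive interest.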

A regression’s coefficients are the most important part of interpreting the model. They specify the association between the explanatory variables and the outcome, telling us both the direction and the magnitude of the association. The direction of the association is indicated by the sign of the coefficient (positive = outcome increases with the explanatory variable, negative = outcome decreases with the explanatory variable). The magnitude of the association is indicated by the size of the coefficient (the larger the coefficient, the stronger the association). If an association runs in the opposite direction to what we expect, this indicates a problem either with the theory or with the model. If the magnitude of the effect is too small to be practically significant, then either the explanatory variable doesn’t really matter for the outcome, or the model is not capturing the true relationship between the explanatory variables and the outcome. However, while direction is relatively easy to interpret, magnitude is not, because what counts as a practically significant effect is dependent on the context of the analysis.
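
One way to make the magnitude concrete is to scale a coefficient by a change that is meaningful in context. The sketch below (using the fitted model from above) pulls out the coefficients and estimates the difference in life expectancy associated with a ten-point difference in IMD score, holding the other variables constant; with a coefficient of roughly -0.078, that works out at around three-quarters of a year.

# estimated coefficients from the fitted model
print(results.params)

# estimated difference in life expectancy (in years) associated with a
# ten-point difference in IMD score, holding the other variables constant
print(results.params['imd_score'] * 10)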

7.3.2.2 Standard Errors

The standard error of a coefficient is an estimate of how much that coefficient would vary from sample to sample, and so it is a measure of the precision of the estimated association between the explanatory variable and the outcome. The larger the standard error, the less precise the coefficient estimate. Precisely estimated coefficients give us more confidence that the association between the explanatory variables and the outcome is real, and not just a result of random variation in the data. Defining what counts as ‘precise’ in any analysis, much like the coefficient magnitude, depends on the context of the analysis and the scale of the variables, so there is no universal threshold. A useful rule of thumb is to judge each standard error relative to its coefficient: a coefficient that is roughly two or more standard errors away from zero is unlikely to be the product of random variation alone, which is what the 95% confidence intervals in the regression summary express.
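
The standard errors and 95% confidence intervals can be extracted from the results object directly (again using the fitted model from above):

# standard errors and 95% confidence intervals for the coefficients
print(results.bse)
print(results.conf_int(alpha=0.05))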

It is important to recognise that a precise coefficient alone cannot be treated as evidence of a causal effect between the explanatory variable and the outcome, but it is, at least, a good indication that the explanatory variable is a good predictor of the outcome. In order to establish a causal relationship, we may consider methods that are more robust to the presence of confounding or omitted variables, the gold standard being randomised controlled trials, but with a strong theoretical basis and a good understanding of the data, we can make a good case for a causal relationship based on observational data.

7.4 Next Steps

7.5 Resources