5  Exploratory Data Analysis

Setup Code
# packages needed to run the code in this section
# !pip install pandas numpy matplotlib seaborn skimpy

# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from matplotlib import rc
from skimpy import skim

# data path
path = '../data/'
file_name = 'heart_disease.csv'

# import data
df = pd.read_csv(f"{path}{file_name}")

# get categorical columns
cat_cols = ['sex', 'fasting_bs', 'resting_ecg', 'angina', 'heart_disease']
for col in cat_cols:
    df[col] = df[col].astype('category')

# set plot style
sns.set_style('whitegrid')

# set plot font
rc('font',**{'family':'sans-serif','sans-serif':['Arial']})

# set plot colour palette
colours = ['#1C355E', '#00A499', '#005EB8']
sns.set_palette(sns.color_palette(colours))

Exploratory data analysis (EDA) is the process of inspecting, visualising, and summarising a dataset. It is the first step in any data science project, yet its importance is often overlooked. Without exploring the data, it is difficult to know how to structure an analysis or a model, or whether the data is suitable for the task at hand. EDA is a critical step in the data science workflow, and it pays to be thorough and methodical. While EDA is often the most time-consuming part of an analysis, taking the time to explore the data properly can save time in the long run.

EDA is an iterative process. In this tutorial, we will use the pandas and seaborn packages to explore a dataset containing information about heart disease. We will start by inspecting the data itself, to get a sense of the structure and the components of the dataset, and to identify any data quality issues (such as missing values). We will then compute summary statistics to get a better understanding of the distribution and central tendency of the variables that are relevant to the analysis. Finally, we will use data visualisations to explore specific variables in more detail, and to identify any interesting relationships between variables.

5.1 Inspecting the Data

The first step when doing EDA is to inspect the data itself and get an idea of the structure of the dataset, the variable types, and the typical values of each variable. This gives a better understanding of exactly what data is being used and informs decisions both about the next steps in the exploratory process and any modelling choices.

We can use the head() and info() methods to get a sense of the structure of the data. The head() method returns the first five rows of the data, and the info() method returns a summary of the data, including the number of rows, the number of columns, the column names, and the data type of each column. The info() method is particularly useful for identifying missing values, as it reports the number of non-null values in each column; if that count is less than the number of rows, the column contains missing values.
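Because info() only reports non-null counts, a quick complementary check is isna().sum(), which returns the number of missing values per column directly. A small sketch using a stand-in frame (on the tutorial data, the equivalent call is simply df.isna().sum()):

```python
import pandas as pd

# stand-in frame with a deliberately missing value;
# in the tutorial this check would run on df
demo = pd.DataFrame({'age': [40, 49, None], 'sex': ['M', 'F', 'M']})

# count missing values per column
print(demo.isna().sum())
```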

In addition to these two methods, we can also use the nunique() method to count the number of unique values in each column, which helps identify categorical variables, and we can use the unique() method to get a list of the unique values in a column.

df.head()
age sex resting_bp cholesterol fasting_bs resting_ecg max_hr angina heart_peak_reading heart_disease
0 40 M 140 289 0 Normal 172 N 0.0 0
1 49 F 160 180 0 Normal 156 N 1.0 1
2 37 M 130 283 0 ST 98 N 0.0 0
3 48 F 138 214 0 Normal 108 Y 1.5 1
4 54 M 150 195 0 Normal 122 N 0.0 0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   age                 918 non-null    int64   
 1   sex                 918 non-null    category
 2   resting_bp          918 non-null    int64   
 3   cholesterol         918 non-null    int64   
 4   fasting_bs          918 non-null    category
 5   resting_ecg         918 non-null    category
 6   max_hr              918 non-null    int64   
 7   angina              918 non-null    category
 8   heart_peak_reading  918 non-null    float64 
 9   heart_disease       918 non-null    category
dtypes: category(5), float64(1), int64(4)
memory usage: 41.1 KB
# count unique values in each column
df.nunique()
age                    50
sex                     2
resting_bp             67
cholesterol           222
fasting_bs              2
resting_ecg             3
max_hr                119
angina                  2
heart_peak_reading     53
heart_disease           2
dtype: int64
# unique values of the outcome variable
df.heart_disease.unique()
[0, 1]
Categories (2, int64): [0, 1]
# unique values of a continuous explanatory variable
df.cholesterol.unique()
array([289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, 164, 204,
       234, 273, 196, 201, 248, 267, 223, 184, 288, 215, 209, 260, 468,
       188, 518, 167, 224, 172, 186, 254, 306, 250, 177, 227, 230, 294,
       264, 259, 175, 318, 216, 340, 233, 205, 245, 194, 270, 213, 365,
       342, 253, 277, 202, 297, 225, 246, 412, 265, 182, 218, 268, 163,
       529, 100, 206, 238, 139, 263, 291, 229, 307, 210, 329, 147,  85,
       269, 275, 179, 392, 466, 129, 241, 255, 276, 282, 338, 160, 156,
       272, 240, 393, 161, 228, 292, 388, 166, 247, 331, 341, 243, 279,
       198, 249, 168, 603, 159, 190, 185, 290, 212, 231, 222, 235, 320,
       187, 266, 287, 404, 312, 251, 328, 285, 280, 192, 193, 308, 219,
       257, 132, 226, 217, 303, 298, 256, 117, 295, 173, 315, 281, 309,
       200, 336, 355, 326, 171, 491, 271, 274, 394, 221, 126, 305, 220,
       242, 347, 344, 358, 169, 181,   0, 236, 203, 153, 316, 311, 252,
       458, 384, 258, 349, 142, 197, 113, 261, 310, 232, 110, 123, 170,
       369, 152, 244, 165, 337, 300, 333, 385, 322, 564, 239, 293, 407,
       149, 199, 417, 178, 319, 354, 330, 302, 313, 141, 327, 304, 286,
       360, 262, 325, 299, 409, 174, 183, 321, 353, 335, 278, 157, 176,
       131], dtype=int64)

5.2 Summary Statistics

Summary statistics are a quick and easy way to get a sense of the distribution and central tendency of the variables in the dataset. We can use the describe() method to get a quick overview of the numeric columns in the dataset, including the row count, the mean and standard deviation, the minimum and maximum values, and the quartiles of each variable.

# summary of the data
df.describe()
age resting_bp cholesterol max_hr heart_peak_reading
count 918.000000 918.000000 918.000000 918.000000 918.000000
mean 53.510893 132.396514 198.799564 136.809368 0.887364
std 9.432617 18.514154 109.384145 25.460334 1.066570
min 28.000000 0.000000 0.000000 60.000000 -2.600000
25% 47.000000 120.000000 173.250000 120.000000 0.000000
50% 54.000000 130.000000 223.000000 138.000000 0.600000
75% 60.000000 140.000000 267.000000 156.000000 1.500000
max 77.000000 200.000000 603.000000 202.000000 6.200000

While describe() is pretty effective, the skimpy package can provide a more detailed summary of the data, using the skim() function. If you are looking for a single function that captures the entire process of inspecting the data and computing summary statistics, skim() is the function for the job, giving you a wealth of information about the dataset as a whole and about each variable in the data.

Another package that provides similar functionality is ydata-profiling, which can be used to generate a report containing a summary of the data, including the data types, missing values, and summary statistics. The ydata-profiling package is particularly useful for generating a report that can be shared with others, as it can be exported as an HTML file. However, it is a bit more resource-intensive than skimpy, so we will stick with skimpy for this tutorial.

# more detailed summary of the data
skim(df)
skimpy summary

Data Summary                  Data Types              Categorical Variables
Number of rows      918       category     5          sex
Number of columns    10       int32        4          fasting_bs
                              float64      1          resting_ecg
                                                      angina
                                                      heart_disease

number
column_name          NA   NA %   mean    sd    p0   p25   p50   p75   p100   hist
age                   0      0     54   9.4    28    47    54    60     77   ▁▃▅▇▅▁
resting_bp            0      0    130    19     0   120   130   140    200   ▇▆▁
cholesterol           0      0    200   110     0   170   220   270    600   ▃▂▇▁
max_hr                0      0    140    25    60   120   140   160    200   ▃▇▇▆▁
heart_peak_reading    0      0   0.89   1.1  -2.6     0   0.6   1.5    6.2   ▇▆▃

category
column_name          NA   NA %   ordered   unique
sex                   0      0   False          2
fasting_bs            0      0   False          2
resting_ecg           0      0   False          3
angina                0      0   False          2
heart_disease         0      0   False          2

If we want to examine a particular variable, the mean(), median(), quantile(), min(), and max() methods return the same information as describe(), one statistic at a time. We can also get a sense of dispersion by computing the standard deviation or variance of a variable, using the std() and var() methods respectively.

# mean & median age
df.age.mean(), df.age.median()
(53.510893246187365, 54.0)
# min and max age
df.age.min(), df.age.max()
(28, 77)
# dispersion of age
df.age.std(), df.age.var()
(9.43261650673201, 88.9742541630732)

Finally, we can use the value_counts() method to count the number of observations in each category of a discrete variable.

# heart disease count
df['heart_disease'].value_counts()
heart_disease
1    508
0    410
Name: count, dtype: int64
# resting ecg count
df['resting_ecg'].value_counts()
resting_ecg
Normal    552
LVH       188
ST        178
Name: count, dtype: int64
# angina
df['angina'].value_counts()
angina
N    547
Y    371
Name: count, dtype: int64
# cholesterol
df['cholesterol'].value_counts()
cholesterol
0      172
254     11
223     10
220     10
230      9
      ... 
392      1
316      1
153      1
466      1
131      1
Name: count, Length: 222, dtype: int64

We can also use the groupby() method to get the counts of each category in a categorical variable, grouped by another categorical variable.

df.groupby(['resting_ecg'])['heart_disease'].value_counts()
resting_ecg  heart_disease
LVH          1                106
             0                 82
Normal       1                285
             0                267
ST           1                117
             0                 61
Name: count, dtype: int64

In addition to the counts, we can also get the proportions of each category using the normalize=True argument.

df.groupby(['resting_ecg'])['heart_disease'].value_counts(normalize=True).round(3)
resting_ecg  heart_disease
LVH          1                0.564
             0                0.436
Normal       1                0.516
             0                0.484
ST           1                0.657
             0                0.343
Name: proportion, dtype: float64

5.3 Data Visualisation

While inspecting the data directly and using summary statistics to describe it is a good first step, data visualisation is a more effective way to explore the data. It allows us to quickly identify patterns and relationships in the data, and to identify any data quality issues that might not be immediately obvious without a visual representation of the data.

When using data visualisation for exploratory purposes, the intent is generally to visualise the way data is distributed, both within and between variables. This can be done using a variety of different types of plots, including histograms, bar charts, box plots, scatter plots, and line plots. How variables are distributed can tell us a lot about the variable itself, and how variables are distributed relative to each other can tell us a lot about the potential relationship between the variables.

In this tutorial, we will use the matplotlib and seaborn packages to create a series of data visualisations to explore the data in more detail. The seaborn package is a high-level data visualisation library that is built on top of matplotlib. Although data visualisation in Python is not as straightforward as it is in R, seaborn makes it much easier to create good quality and informative plots.

5.3.1 Visualising Data Distributions

The first step in the exploratory process is to visualise the distributions of key variables in the dataset. This allows us to get a sense of the typical values and central tendency of each variable, and to identify any outliers or other data quality issues.

5.3.1.1 Continuous Distributions

For continuous variables, we can use histograms to visualise the distribution of the data. We can use the histplot() function to create a histogram of a continuous variable. The binwidth argument allows us to specify the width of the bins in the histogram.

# age distribution
sns.histplot(data=df, x='age', binwidth=5)
sns.despine()
plt.show()

# max hr distribution
sns.histplot(data=df, x='max_hr', binwidth=10)
sns.despine()
plt.show()

# cholesterol distribution
sns.histplot(data=df, x='cholesterol', binwidth=25)
sns.despine()
plt.show()

# cholesterol distribution
sns.histplot(data=df.loc[df.cholesterol!=0], x='cholesterol', binwidth=25)
sns.despine()
plt.show()

The inflated number of zero values in the cholesterol distribution suggests that there may be a data quality issue that needs addressing.
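One way to quantify the problem before deciding how to handle it is to count the zero readings directly. A sketch using a stand-in frame (on the tutorial data, the same expression applies to df):

```python
import pandas as pd

# stand-in frame; in the tutorial this would be (df['cholesterol'] == 0).sum()
demo = pd.DataFrame({'cholesterol': [0, 289, 180, 0, 214]})

# count biologically implausible zero cholesterol readings
n_zero = (demo['cholesterol'] == 0).sum()
print(f"{n_zero} of {len(demo)} rows ({n_zero / len(demo):.0%}) have cholesterol == 0")
```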

5.3.1.2 Discrete Distributions

We can use bar plots to visualise the distribution of discrete variables. We can use the countplot() function to create a bar plot of a discrete variable.

# heart disease distribution
sns.countplot(data=df, x='heart_disease')
sns.despine()
plt.show()

# sex distribution
sns.countplot(data=df, x='sex')
sns.despine()
plt.show()

# angina distribution
sns.countplot(data=df, x='angina')
sns.despine()
plt.show()

5.3.2 Comparing Distributions

There are a number of ways to compare the distributions of multiple variables. Bar plots can be used to visualise two discrete variables, while histograms and box plots are useful for comparing the distribution of a continuous variable across the groups of a discrete variable, and scatter plots are particularly useful for comparing the distribution of two continuous variables.

5.3.2.1 Visualising Multiple Discrete Variables

Bar plots are an effective way to visualise the observed relationship (or at least association) between a discrete explanatory variable and a discrete outcome (whether binary, ordinal, or categorical). We can use the countplot() function to create bar plots, and the hue argument to split the bars by a particular variable and display them in different colours.

# heart disease by sex
sns.countplot(data=df, x='heart_disease', hue='sex')
sns.despine()
plt.show()

# heart disease by resting ecg
sns.countplot(data=df, x='heart_disease', hue='resting_ecg')
sns.despine()
plt.show()

# angina
sns.countplot(data=df, x='heart_disease', hue='angina')
sns.despine()
plt.show()

# fasting bs
sns.countplot(data=df, x='heart_disease', hue='fasting_bs')
sns.despine()
plt.show()

5.3.2.2 Visualising A Continuous Variable Across Discrete Groups

Histograms and box plots are useful for comparing the distribution of a continuous variable across the groups of a discrete variable.

5.3.2.2.1 Histogram Plots

We can use the histplot() function to create a histogram of a continuous variable. The hue argument allows us to split the histogram by a particular variable and display them in different colours, while the multiple argument allows us to specify how the histograms should be displayed. The multiple argument can be set to stack to stack the histograms on top of each other, or dodge to display the histograms side-by-side.

# age distribution by heart disease
sns.histplot(data=df, x='age', hue='heart_disease', binwidth=5, multiple='dodge')
sns.despine()
plt.show()

# cholesterol
sns.histplot(data=df, x='cholesterol', hue='heart_disease', binwidth=25, multiple='dodge')
sns.despine()
plt.show()

# filter zero values
sns.histplot(
    data=df.loc[df.cholesterol!=0],
    x='cholesterol',
    hue='heart_disease',
    binwidth=25,
    multiple='dodge')

sns.despine()
plt.show()

The fact that there is a significantly larger proportion of positive heart disease cases among the zero cholesterol values further demonstrates the need to address this data quality issue.
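To put a number on that imbalance, we can compare the heart disease proportions inside and outside the zero-cholesterol group. A sketch using a stand-in frame (swap in df to run the same check on the tutorial data):

```python
import pandas as pd

# stand-in frame; in the tutorial, replace demo with df
demo = pd.DataFrame({
    'cholesterol':   [0, 0, 0, 289, 180, 214],
    'heart_disease': [1, 1, 0, 0, 1, 0],
})

# heart disease proportions within the zero-cholesterol rows
print(demo.loc[demo.cholesterol == 0, 'heart_disease'].value_counts(normalize=True))

# ...and within the remaining rows, for comparison
print(demo.loc[demo.cholesterol != 0, 'heart_disease'].value_counts(normalize=True))
```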

5.3.2.2.2 Box Plots

Box plots visualise the characteristics of a continuous distribution across discrete groups. We can use the boxplot() function to create box plots, and the hue argument to split the box plots by a particular variable and display them in different colours.

However, while box plots can be very useful, they are not always the most effective way of visualising this information, as explained in Cedric Scherer's article on box plot alternatives. This guide uses box plots for the sake of simplicity, but it is worth considering other options when visualising distributions.

# age & heart disease
sns.boxplot(data=df, x='heart_disease', y='age')
sns.despine()
plt.show()

# age & heart disease, split by sex
sns.boxplot(data=df, x='heart_disease', y='age', hue='sex')
sns.despine()
plt.show()

# max hr & heart disease
sns.boxplot(data=df, x='heart_disease', y='max_hr')
sns.despine()
plt.show()

# max hr & heart disease, split by sex
sns.boxplot(data=df, x='heart_disease', y='max_hr', hue='sex')
sns.despine()
plt.show()

5.3.2.3 Visualising Multiple Continuous Variables

Scatter plots are an effective way to visualise how two continuous variables vary together. We can use the scatterplot() function to create scatter plots, and the hue argument to split the scatter plots by a particular variable and display them in different colours.

# age & resting bp
sns.scatterplot(data=df, x='age', y='resting_bp')
sns.despine()
plt.show()

# age & resting bp
sns.scatterplot(data=df.loc[df.resting_bp!=0], x='age', y='resting_bp')
sns.despine()
plt.show()

sns.scatterplot(data=df.loc[df.cholesterol!=0], x='age', y='cholesterol')
sns.despine()
plt.show()

sns.scatterplot(data=df, x='age', y='max_hr')
sns.despine()
plt.show()

The scatter plot of age and resting blood pressure highlights another data quality issue: an observation with a resting blood pressure of zero that needs to be removed.
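A sketch of removing that observation, using a stand-in frame (on the tutorial data, the same filter drops the zero resting_bp row from df):

```python
import pandas as pd

# stand-in frame; in the tutorial, replace demo with df
demo = pd.DataFrame({'age': [55, 61, 48], 'resting_bp': [140, 0, 130]})

# keep only rows with a plausible (non-zero) resting blood pressure
demo_clean = demo.loc[demo['resting_bp'] != 0].copy()
print(len(demo), '->', len(demo_clean))
```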

If there appears to be an association between the two continuous variables you have plotted, as is the case with age and maximum heart rate in the plot above, you can add a regression line to visualise the strength of that association. The regplot() function creates a scatter plot with a regression line, and the ci argument specifies whether to display the confidence interval around the line.

# age & max hr
sns.regplot(data=df, x='age', y='max_hr', ci=None)
sns.despine()
plt.show()

You can also include discrete variables by assigning the discrete groups different colours in the scatter plot; if you add regression lines to these plots, separate lines will be fitted to each group. This is useful for visualising how the association between the two continuous variables varies across the discrete groups.

The lmplot() function can be used to create scatter plots with regression lines, and the hue argument can be used to split the scatter plots by a particular variable and display them in different colours.

# age & resting bp, split by heart disease
sns.scatterplot(data=df.loc[df.resting_bp!=0], x='age', y='resting_bp', hue='heart_disease')
sns.despine()
plt.show()

# age & cholesterol, split by heart disease (with regression line)
sns.lmplot(
    data=df.loc[df.cholesterol!=0],
    x='age', y='cholesterol',
    hue='heart_disease',
    ci=None,
    height = 7,
    aspect=1.3)

plt.show()

# age & max hr, split by heart disease (with regression line)
sns.lmplot(
    data=df,
    x='age', y='max_hr',
    hue='heart_disease',
    ci=None,
    height = 7,
    aspect=1.3)

plt.show()

5.4 Next Steps

There are many more visualisation techniques that you can use to explore your data. You can find plenty of inspiration for different approaches to visualising data in the seaborn and matplotlib documentation. There are also a number of other Python libraries that can be used to create visualisations, including plotly, bokeh, and altair.

The next step in the data science process is to build a model to either explain or predict the outcome variable, heart disease. The exploratory work done here can help inform the choice of model and the choice of variables used to build it. It will also inform the data cleaning, particularly the handling of the zero values in the cholesterol and resting blood pressure variables, to ensure that the model is built on the best possible data.
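One possible cleaning approach, sketched below on a stand-in frame, is to recode the implausible zeros as missing values rather than dropping the rows outright, which keeps the other columns of those rows available for modelling (whether this is appropriate depends on the imputation strategy chosen later):

```python
import numpy as np
import pandas as pd

# stand-in frame; in the tutorial, the same recoding would apply to df
demo = pd.DataFrame({'cholesterol': [0, 289, 180], 'resting_bp': [140, 0, 130]})

# treat zero readings as missing rather than as real measurements
cleaned = demo.replace({'cholesterol': {0: np.nan}, 'resting_bp': {0: np.nan}})
print(cleaned.isna().sum())
```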

5.5 Resources

There is a wealth of resources available to help you learn more about data visualisation, and while the resources for producing visualisations in R are more extensive, there are still a number of good resources for producing visualisations in Python.

While the following resources are R-based, they are still useful for learning about data visualisation principles: