Setup Code (Click to Expand)
# packages needed to run the code in this section
# !pip install pyjanitor pandas polars fingertips_py
import re
import pandas as pd
import polars as pl
from janitor import clean_names
# import fingertips_py as ftp
# packages needed to run the code in this section
# !pip install pyjanitor pandas polars fingertips_py
import re
import pandas as pd
import polars as pl
from janitor import clean_names
# import fingertips_py as ftp
The first step in any analysis is to find the right data for the problem at hand, and to transform that data so that it is structured appropriately for the method of analysis. Finding the right data can be challenging, sometimes requiring getting data from multiple sources, and even then, the data might be a mess, with missing values, duplicates, formatting that makes it difficult to work with in R or Python, and a myriad of other problems. The process of transforming data to make it suitable for analysis is often referred to as data wrangling, or data munging. The data wrangling process can take many forms, from simple data cleaning to more complex transformations like combining multiple data sets, or creating new variables to enrich the data.
The Python ecosystem for data wrangling is not quite as rich as in R, but there is one library that is extremely dominant (pandas) and one library that is seeking to disrupt it (polars).
We will use Public Health England’s Fingertips API (for which there are packages available in R and Python) as the data source for this guide, as it represents a good opportunity to give a brief overview of this data and how to use it. Fingertips is a repository of public health data on a wide range of topics. The data included in this tutorial relates to the wider determinants of health.
First, we can look at what is available.
Unfortunately, the fingertips_py
package does seem to be a little less mature than the R equivalent, and there are a lot of errors running the functions we would need to import the data. While this is the case, we’ll take a little shortcut, importing the data from a CSV that I’ve stored in the guides GitHub repository, using the Fingertips R package.
= pd.read_csv("https://raw.githubusercontent.com/NHS-South-Central-and-West/data-science-guides/main/data/phof.csv", low_memory=False) phof_raw
phof_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264329 entries, 0 to 264328
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 IndicatorID 264329 non-null int64
1 IndicatorName 264329 non-null object
2 ParentCode 261229 non-null object
3 ParentName 261229 non-null object
4 AreaCode 264329 non-null object
5 AreaName 264329 non-null object
6 AreaType 264329 non-null object
7 Sex 264329 non-null object
8 Age 264329 non-null object
9 CategoryType 0 non-null float64
10 Category 0 non-null float64
11 Timeperiod 264329 non-null object
12 Value 249986 non-null float64
13 LowerCI95.0limit 243173 non-null float64
14 UpperCI95.0limit 243173 non-null float64
15 LowerCI99.8limit 56885 non-null float64
16 UpperCI99.8limit 56885 non-null float64
17 Count 187440 non-null float64
18 Denominator 198367 non-null float64
19 Valuenote 28742 non-null object
20 RecentTrend 42091 non-null object
21 ComparedtoEnglandvalueorpercentiles 264329 non-null object
22 Comparedtopercentiles 264329 non-null object
23 TimeperiodSortable 264329 non-null int64
24 Newdata 72025 non-null object
25 Comparedtogoal 38885 non-null object
26 Timeperiodrange 264329 non-null object
dtypes: float64(9), int64(2), object(16)
memory usage: 54.5+ MB
phof_raw.describe()
IndicatorID | CategoryType | Category | Value | LowerCI95.0limit | UpperCI95.0limit | LowerCI99.8limit | UpperCI99.8limit | Count | Denominator | TimeperiodSortable | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 264329.000000 | 0.0 | 0.0 | 249986.000000 | 243173.000000 | 243173.000000 | 56885.000000 | 56885.000000 | 1.874400e+05 | 1.983670e+05 | 2.643290e+05 |
mean | 72371.514560 | NaN | NaN | 72.092017 | 65.526956 | 81.604363 | 122.884874 | 165.406557 | 4.023386e+04 | 1.390792e+07 | 2.015224e+07 |
std | 30934.529754 | NaN | NaN | 276.950447 | 253.887549 | 311.174288 | 452.040131 | 622.510687 | 1.124600e+06 | 1.361739e+09 | 4.862563e+04 |
min | 10301.000000 | NaN | NaN | -36.400000 | -72.200000 | -5.200000 | 0.000000 | 0.000000 | -1.800000e+02 | 2.544061e-01 | 2.000000e+07 |
25% | 41001.000000 | NaN | NaN | 12.398315 | 8.300000 | 16.790000 | 22.773030 | 33.147850 | 6.200000e+01 | 7.900000e+02 | 2.012000e+07 |
50% | 90634.000000 | NaN | NaN | 40.859605 | 35.579380 | 50.900000 | 58.024690 | 69.074819 | 4.850000e+02 | 4.798000e+03 | 2.016000e+07 |
75% | 92517.000000 | NaN | NaN | 75.900000 | 72.400000 | 79.700120 | 83.092446 | 90.801542 | 4.164250e+03 | 1.181570e+05 | 2.019000e+07 |
max | 93982.000000 | NaN | NaN | 9110.634850 | 8473.979610 | 9782.239270 | 8122.734550 | 10179.121160 | 1.942292e+08 | 2.894740e+11 | 2.023000e+07 |
phof_raw.head()
IndicatorID | IndicatorName | ParentCode | ParentName | AreaCode | AreaName | AreaType | Sex | Age | CategoryType | ... | Count | Denominator | Valuenote | RecentTrend | ComparedtoEnglandvalueorpercentiles | Comparedtopercentiles | TimeperiodSortable | Newdata | Comparedtogoal | Timeperiodrange | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90362 | A01a - Healthy life expectancy at birth | NaN | NaN | E92000001 | England | England | Male | All ages | NaN | ... | NaN | NaN | NaN | NaN | Not compared | Not compared | 20090000 | NaN | NaN | 3y |
1 | 90362 | A01a - Healthy life expectancy at birth | NaN | NaN | E92000001 | England | England | Female | All ages | NaN | ... | NaN | NaN | NaN | NaN | Not compared | Not compared | 20090000 | NaN | NaN | 3y |
2 | 90362 | A01a - Healthy life expectancy at birth | E92000001 | England | E06000001 | Hartlepool | Counties & UAs (from Apr 2023) | Male | All ages | NaN | ... | NaN | NaN | NaN | NaN | Worse | Not compared | 20090000 | NaN | NaN | 3y |
3 | 90362 | A01a - Healthy life expectancy at birth | E92000001 | England | E06000002 | Middlesbrough | Counties & UAs (from Apr 2023) | Male | All ages | NaN | ... | NaN | NaN | NaN | NaN | Worse | Not compared | 20090000 | NaN | NaN | 3y |
4 | 90362 | A01a - Healthy life expectancy at birth | E92000001 | England | E06000003 | Redcar and Cleveland | Counties & UAs (from Apr 2023) | Male | All ages | NaN | ... | NaN | NaN | NaN | NaN | Worse | Not compared | 20090000 | NaN | NaN | 3y |
5 rows × 27 columns
With the raw data imported, we are now able to start the process of cleaning and wrangling the data into a format that is ready for analysis.
First, we will clean the column names using the pyjanitor’s clean_names
function. This will convert all column names to lower case and remove any spaces and special characters. This will make it easier to work with the data in the future.
= (
phof_raw
phof_raw
.pipe(clean_names) )
We’ve got a total of 168 indicators, which means some of the indicators from the Public Health Outcomes Framework are not available for upper tier local authorities. We can filter the data to just the indicators we are interested in (as well as filtering any rows where the value is missing), using the panda’s loc
function. We can also use the group_by
and mean
functions to get the mean value of the indicators across all areas in the data. Below is an example:
(
phof_raw'indicatorid'].isin([90366, 20101, 11601]))]
.loc[(phof_raw[='value')
.dropna(subset'indicatorname')['value']
.groupby(
.mean() )
indicatorname
A01b - Life expectancy at birth 80.199513
B16 - Utilisation of outdoor space for exercise or health reasons 16.216824
C04 - Low birth weight of term babies 2.870756
Name: value, dtype: float64
The data pulled from the Fingertips API has a number of idiosyncracies that need addressing when wrangling the data. For example, the timeperiodsortable
column refers to the year that the data relates to, and it contains four superfluous zeros at the end of each year (i.e, 20230000). We can use the pandas assign
function to create a year
column which strips out the zeros from the timeperiodsortable
column by dividing it by 10000, and dropping the decimal place by converting year to integer (using the astype(int)
function).
# filter for the indicators of interest and wrangle data into tidy structure
(
phof_raw
.loc[('indicatorid'].isin([90366, 20101, 11601]))
phof_raw[& (phof_raw['areacode'] == 'E92000001'
)]=lambda x: (x['timeperiodsortable'] / 10000).astype(int))
.assign(year'indicatorname', 'areacode', 'year'], as_index=False)['value']
.groupby([
.mean() )
indicatorname | areacode | year | value | |
---|---|---|---|---|
0 | A01b - Life expectancy at birth | E92000001 | 2001 | 78.448740 |
1 | A01b - Life expectancy at birth | E92000001 | 2002 | 78.700330 |
2 | A01b - Life expectancy at birth | E92000001 | 2003 | 78.987650 |
3 | A01b - Life expectancy at birth | E92000001 | 2004 | 79.358295 |
4 | A01b - Life expectancy at birth | E92000001 | 2005 | 79.614160 |
5 | A01b - Life expectancy at birth | E92000001 | 2006 | 79.817360 |
6 | A01b - Life expectancy at birth | E92000001 | 2007 | 80.082390 |
7 | A01b - Life expectancy at birth | E92000001 | 2008 | 80.355270 |
8 | A01b - Life expectancy at birth | E92000001 | 2009 | 80.744790 |
9 | A01b - Life expectancy at birth | E92000001 | 2010 | 80.982525 |
10 | A01b - Life expectancy at birth | E92000001 | 2011 | 81.154860 |
11 | A01b - Life expectancy at birth | E92000001 | 2012 | 81.276175 |
12 | A01b - Life expectancy at birth | E92000001 | 2013 | 81.283525 |
13 | A01b - Life expectancy at birth | E92000001 | 2014 | 81.337425 |
14 | A01b - Life expectancy at birth | E92000001 | 2015 | 81.339375 |
15 | A01b - Life expectancy at birth | E92000001 | 2016 | 81.419660 |
16 | A01b - Life expectancy at birth | E92000001 | 2017 | 81.565000 |
17 | A01b - Life expectancy at birth | E92000001 | 2018 | 81.270000 |
18 | A01b - Life expectancy at birth | E92000001 | 2021 | 80.748915 |
19 | B16 - Utilisation of outdoor space for exercis... | E92000001 | 2011 | 14.015031 |
20 | B16 - Utilisation of outdoor space for exercis... | E92000001 | 2012 | 15.329390 |
21 | B16 - Utilisation of outdoor space for exercis... | E92000001 | 2013 | 17.130101 |
22 | B16 - Utilisation of outdoor space for exercis... | E92000001 | 2014 | 17.907098 |
23 | B16 - Utilisation of outdoor space for exercis... | E92000001 | 2015 | 17.917663 |
24 | C04 - Low birth weight of term babies | E92000001 | 2006 | 3.018037 |
25 | C04 - Low birth weight of term babies | E92000001 | 2007 | 2.912639 |
26 | C04 - Low birth weight of term babies | E92000001 | 2008 | 2.894468 |
27 | C04 - Low birth weight of term babies | E92000001 | 2009 | 2.919269 |
28 | C04 - Low birth weight of term babies | E92000001 | 2010 | 2.853172 |
29 | C04 - Low birth weight of term babies | E92000001 | 2011 | 2.844595 |
30 | C04 - Low birth weight of term babies | E92000001 | 2012 | 2.798094 |
31 | C04 - Low birth weight of term babies | E92000001 | 2013 | 2.822148 |
32 | C04 - Low birth weight of term babies | E92000001 | 2014 | 2.864100 |
33 | C04 - Low birth weight of term babies | E92000001 | 2015 | 2.774469 |
34 | C04 - Low birth weight of term babies | E92000001 | 2016 | 2.786173 |
35 | C04 - Low birth weight of term babies | E92000001 | 2017 | 2.819327 |
36 | C04 - Low birth weight of term babies | E92000001 | 2018 | 2.864822 |
37 | C04 - Low birth weight of term babies | E92000001 | 2019 | 2.902918 |
38 | C04 - Low birth weight of term babies | E92000001 | 2020 | 2.856152 |
39 | C04 - Low birth weight of term babies | E92000001 | 2021 | 2.774210 |
We can also remove the ID that is appended to the beginning of each indicator name. We can use str.replace
to remove this part of the string from the indicator name, using a regular expression to remove everything up to and including the dash (and the space immediately after).
(
phof_raw'indicatorid'].isin([90366, 20101, 11601]))]
.loc[(phof_raw[
.assign(= lambda x: x['indicatorname']
indicatorname str.replace('^[^-]+- ', '', regex=True))
. )
indicatorid | indicatorname | parentcode | parentname | areacode | areaname | areatype | sex | age | categorytype | ... | count | denominator | valuenote | recenttrend | comparedtoenglandvalueorpercentiles | comparedtopercentiles | timeperiodsortable | newdata | comparedtogoal | timeperiodrange | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2984 | 90366 | Life expectancy at birth | NaN | NaN | E92000001 | England | England | Male | All ages | NaN | ... | NaN | NaN | NaN | NaN | Not compared | Not compared | 20010000 | NaN | NaN | 3y |
2985 | 90366 | Life expectancy at birth | NaN | NaN | E92000001 | England | England | Female | All ages | NaN | ... | NaN | NaN | NaN | NaN | Not compared | Not compared | 20010000 | NaN | NaN | 3y |
2986 | 90366 | Life expectancy at birth | E92000001 | England | E06000001 | Hartlepool | Counties & UAs (from Apr 2023) | Male | All ages | NaN | ... | NaN | NaN | NaN | NaN | Not compared | Not compared | 20010000 | NaN | NaN | 3y |
2987 | 90366 | Life expectancy at birth | E92000001 | England | E06000002 | Middlesbrough | Counties & UAs (from Apr 2023) | Male | All ages | NaN | ... | NaN | NaN | NaN | NaN | Not compared | Not compared | 20010000 | NaN | NaN | 3y |
2988 | 90366 | Life expectancy at birth | E92000001 | England | E06000003 | Redcar and Cleveland | Counties & UAs (from Apr 2023) | Male | All ages | NaN | ... | NaN | NaN | NaN | NaN | Not compared | Not compared | 20010000 | NaN | NaN | 3y |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103354 | 20101 | Low birth weight of term babies | E92000001 | England | E10000029 | Suffolk | Counties & UAs (from Apr 2023) | Persons | >=37 weeks gestational age at birth | NaN | ... | 169.0 | 6124.0 | NaN | No significant change | Similar | Not compared | 20210000 | NaN | NaN | 1y |
103355 | 20101 | Low birth weight of term babies | E92000001 | England | E10000030 | Surrey | Counties & UAs (from Apr 2023) | Persons | >=37 weeks gestational age at birth | NaN | ... | 277.0 | 11536.0 | NaN | No significant change | Better | Not compared | 20210000 | NaN | NaN | 1y |
103356 | 20101 | Low birth weight of term babies | E92000001 | England | E10000031 | Warwickshire | Counties & UAs (from Apr 2023) | Persons | >=37 weeks gestational age at birth | NaN | ... | 128.0 | 5474.0 | NaN | No significant change | Better | Not compared | 20210000 | NaN | NaN | 1y |
103357 | 20101 | Low birth weight of term babies | E92000001 | England | E10000032 | West Sussex | Counties & UAs (from Apr 2023) | Persons | >=37 weeks gestational age at birth | NaN | ... | 138.0 | 7854.0 | NaN | No significant change | Better | Not compared | 20210000 | NaN | NaN | 1y |
103358 | 20101 | Low birth weight of term babies | E92000001 | England | E10000034 | Worcestershire | Counties & UAs (from Apr 2023) | Persons | >=37 weeks gestational age at birth | NaN | ... | 117.0 | 5069.0 | NaN | No significant change | Better | Not compared | 20210000 | NaN | NaN | 1y |
8847 rows × 27 columns
Finally, the data is structured in a wide format, with each indicator having its own column. This is not a tidy structure, as each column should represent a variable, and each row should represent an observation. We can use the pivot_table
function to convert the data into a tidy structure, with each indicator having its own row. Pandas will turn areacode
and timeperiodsortable
into the index of the dataframe, and in order to convert these variables back to columns we use reset_index
and rename_axis(None, axis=1)
. We can also use the rename
function to rename the columns to something more meaningful, and the dropna
function to remove any rows where the value is missing.
(
phof_raw'indicatorid'].isin([90366, 20101, 11601]))]
.loc[(phof_raw[
.groupby('indicatorname', 'areacode', 'timeperiodsortable'],
[=False
as_index'value']
)[
.mean()
.pivot_table(=['areacode', 'timeperiodsortable'],
index='indicatorname',
columns='value'
values
)
.rename(={
columns'A01b - Life expectancy at birth': 'life_expectancy',
'C04 - Low birth weight of term babies': 'low_birth_weight',
'B16 - Utilisation of outdoor space for exercise or health reasons': 'outdoor_space'
})
.reset_index()None, axis=1)
.rename_axis(
.dropna() )
areacode | timeperiodsortable | life_expectancy | outdoor_space | low_birth_weight | |
---|---|---|---|---|---|
11 | E06000001 | 20120000 | 79.607655 | 21.131038 | 3.222749 |
12 | E06000001 | 20130000 | 79.075110 | 9.837764 | 2.555911 |
13 | E06000001 | 20140000 | 78.846375 | 18.596376 | 2.989691 |
14 | E06000001 | 20150000 | 78.738395 | 11.314412 | 1.794616 |
31 | E06000002 | 20110000 | 78.344225 | 16.792305 | 3.448276 |
... | ... | ... | ... | ... | ... |
3104 | E92000001 | 20110000 | 81.154860 | 14.015031 | 2.844595 |
3105 | E92000001 | 20120000 | 81.276175 | 15.329390 | 2.798094 |
3106 | E92000001 | 20130000 | 81.283525 | 17.130101 | 2.822148 |
3107 | E92000001 | 20140000 | 81.337425 | 17.907098 | 2.864100 |
3108 | E92000001 | 20150000 | 81.339375 | 17.917663 | 2.774469 |
667 rows × 5 columns
We can combine all of the above steps into a single function, which we can then apply to the raw data to get the data into a tidy structure.
# function to wrangle data into tidy structure
def wrangle_phof_data(data):
"""
Returns a tidy, formatted DataFrame with Public Health England data
"""
= (
phof_data
data'indicatorid'].isin([90366, 20101, 11601]))]
.loc[(data[=lambda x: (x['timeperiodsortable'] / 10000).astype(int))
.assign(year
.groupby('indicatorname', 'areacode', 'year'],
[=False
as_index'value']
)[
.mean()
.pivot_table(=['areacode', 'year'],
index='indicatorname',
columns='value'
values
)
.rename(={
columns'A01b - Life expectancy at birth': 'life_expectancy',
'C04 - Low birth weight of term babies': 'low_birth_weight',
'B16 - Utilisation of outdoor space for exercise or health reasons': 'outdoor_space'
})
.reset_index()None, axis=1)
.rename_axis(
.dropna()
)
return phof_data
= wrangle_phof_data(phof_raw)
phof phof.head()
areacode | year | life_expectancy | outdoor_space | low_birth_weight | |
---|---|---|---|---|---|
11 | E06000001 | 2012 | 79.607655 | 21.131038 | 3.222749 |
12 | E06000001 | 2013 | 79.075110 | 9.837764 | 2.555911 |
13 | E06000001 | 2014 | 78.846375 | 18.596376 | 2.989691 |
14 | E06000001 | 2015 | 78.738395 | 11.314412 | 1.794616 |
31 | E06000002 | 2011 | 78.344225 | 16.792305 | 3.448276 |
We have successfully imported, cleaned, and wrangled Fingertips data into a tidy structure. We can now move on to the next step of the data science process, which is to explore the data and perform some analysis.
We could save the data to a CSV file at this point, using the pandas to_csv
function (for example, phof.to_csv('data/phof.csv', index=False)
), or we could perform the analysis on the dataframe object we have created.
This is just one example of the data wrangling process, but in reality there is a vast array of ways you might transform data in the process of preparing it for an analysis. There is no set recipe for this process, but the goal is to transform data from whatever raw form it arrives in to something structured (typically this means tidy data) that lends itself to analysis.