Setup Code
# packages needed to run the code in this section
# !pip install pyjanitor pandas polars fingertips_py

import re

import pandas as pd
import polars as pl

from janitor import clean_names

# import fingertips_py as ftp

The first step in any analysis is to find the right data for the problem at hand and to transform that data so that it is structured appropriately for the method of analysis. Finding the right data can be challenging, sometimes requiring data from multiple sources, and even then the data might be a mess, with missing values, duplicates, formatting that makes it difficult to work with in R or Python, and a myriad of other problems. The process of transforming data to make it suitable for analysis is often referred to as data wrangling, or data munging. It can take many forms, from simple data cleaning to more complex transformations such as combining multiple data sets or creating new variables to enrich the data.

4.1 Python Libraries for Data Wrangling

The Python ecosystem for data wrangling is not quite as rich as R's, but one library is overwhelmingly dominant (pandas) and another is seeking to disrupt it (polars). The examples in this guide use pandas, with a brief polars comparison included later in this section for reference.

4.2 Fingertips Data

We will use Public Health England’s Fingertips API (for which there are packages available in both R and Python) as the data source for this guide, as it gives us a good opportunity to briefly introduce this data and how to use it. Fingertips is a repository of public health data covering a wide range of topics. The data used in this tutorial relates to the wider determinants of health.

First, we can look at what is available.

Unfortunately, the fingertips_py package seems to be a little less mature than its R equivalent, and the functions we would need to import the data throw a number of errors. Instead, we will take a small shortcut and import the data from a CSV stored in the guide’s GitHub repository, which was produced using the Fingertips R package.

phof_raw = pd.read_csv("https://raw.githubusercontent.com/NHS-South-Central-and-West/data-science-guides/main/data/phof.csv", low_memory=False)
phof_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264329 entries, 0 to 264328
Data columns (total 27 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   IndicatorID                          264329 non-null  int64  
 1   IndicatorName                        264329 non-null  object 
 2   ParentCode                           261229 non-null  object 
 3   ParentName                           261229 non-null  object 
 4   AreaCode                             264329 non-null  object 
 5   AreaName                             264329 non-null  object 
 6   AreaType                             264329 non-null  object 
 7   Sex                                  264329 non-null  object 
 8   Age                                  264329 non-null  object 
 9   CategoryType                         0 non-null       float64
 10  Category                             0 non-null       float64
 11  Timeperiod                           264329 non-null  object 
 12  Value                                249986 non-null  float64
 13  LowerCI95.0limit                     243173 non-null  float64
 14  UpperCI95.0limit                     243173 non-null  float64
 15  LowerCI99.8limit                     56885 non-null   float64
 16  UpperCI99.8limit                     56885 non-null   float64
 17  Count                                187440 non-null  float64
 18  Denominator                          198367 non-null  float64
 19  Valuenote                            28742 non-null   object 
 20  RecentTrend                          42091 non-null   object 
 21  ComparedtoEnglandvalueorpercentiles  264329 non-null  object 
 22  Comparedtopercentiles                264329 non-null  object 
 23  TimeperiodSortable                   264329 non-null  int64  
 24  Newdata                              72025 non-null   object 
 25  Comparedtogoal                       38885 non-null   object 
 26  Timeperiodrange                      264329 non-null  object 
dtypes: float64(9), int64(2), object(16)
memory usage: 54.5+ MB
phof_raw.describe()
IndicatorID CategoryType Category Value LowerCI95.0limit UpperCI95.0limit LowerCI99.8limit UpperCI99.8limit Count Denominator TimeperiodSortable
count 264329.000000 0.0 0.0 249986.000000 243173.000000 243173.000000 56885.000000 56885.000000 1.874400e+05 1.983670e+05 2.643290e+05
mean 72371.514560 NaN NaN 72.092017 65.526956 81.604363 122.884874 165.406557 4.023386e+04 1.390792e+07 2.015224e+07
std 30934.529754 NaN NaN 276.950447 253.887549 311.174288 452.040131 622.510687 1.124600e+06 1.361739e+09 4.862563e+04
min 10301.000000 NaN NaN -36.400000 -72.200000 -5.200000 0.000000 0.000000 -1.800000e+02 2.544061e-01 2.000000e+07
25% 41001.000000 NaN NaN 12.398315 8.300000 16.790000 22.773030 33.147850 6.200000e+01 7.900000e+02 2.012000e+07
50% 90634.000000 NaN NaN 40.859605 35.579380 50.900000 58.024690 69.074819 4.850000e+02 4.798000e+03 2.016000e+07
75% 92517.000000 NaN NaN 75.900000 72.400000 79.700120 83.092446 90.801542 4.164250e+03 1.181570e+05 2.019000e+07
max 93982.000000 NaN NaN 9110.634850 8473.979610 9782.239270 8122.734550 10179.121160 1.942292e+08 2.894740e+11 2.023000e+07
phof_raw.head()
IndicatorID IndicatorName ParentCode ParentName AreaCode AreaName AreaType Sex Age CategoryType ... Count Denominator Valuenote RecentTrend ComparedtoEnglandvalueorpercentiles Comparedtopercentiles TimeperiodSortable Newdata Comparedtogoal Timeperiodrange
0 90362 A01a - Healthy life expectancy at birth NaN NaN E92000001 England England Male All ages NaN ... NaN NaN NaN NaN Not compared Not compared 20090000 NaN NaN 3y
1 90362 A01a - Healthy life expectancy at birth NaN NaN E92000001 England England Female All ages NaN ... NaN NaN NaN NaN Not compared Not compared 20090000 NaN NaN 3y
2 90362 A01a - Healthy life expectancy at birth E92000001 England E06000001 Hartlepool Counties & UAs (from Apr 2023) Male All ages NaN ... NaN NaN NaN NaN Worse Not compared 20090000 NaN NaN 3y
3 90362 A01a - Healthy life expectancy at birth E92000001 England E06000002 Middlesbrough Counties & UAs (from Apr 2023) Male All ages NaN ... NaN NaN NaN NaN Worse Not compared 20090000 NaN NaN 3y
4 90362 A01a - Healthy life expectancy at birth E92000001 England E06000003 Redcar and Cleveland Counties & UAs (from Apr 2023) Male All ages NaN ... NaN NaN NaN NaN Worse Not compared 20090000 NaN NaN 3y

5 rows × 27 columns

With the raw data imported, we are now able to start the process of cleaning and wrangling the data into a format that is ready for analysis.

4.3 Data Wrangling

First, we will clean the column names using pyjanitor’s clean_names function, which converts all column names to lower case and removes spaces and special characters. This will make the data easier to work with from here on.

phof_raw = (
    phof_raw
    .pipe(clean_names)
)
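
To make the effect concrete, here is a minimal illustration on a throwaway DataFrame. The column names below are made up for the example; in our data, names such as IndicatorID and TimeperiodSortable become indicatorid and timeperiodsortable, which is how they are referenced in the code that follows.

# a minimal illustration of clean_names on made-up column names
example = pd.DataFrame(columns=['IndicatorID', 'Area Name', 'Time Period'])
example.pipe(clean_names).columns.tolist()
# expected: ['indicatorid', 'area_name', 'time_period']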

We’ve got a total of 168 indicators, which means some of the indicators from the Public Health Outcomes Framework are not available for upper tier local authorities. We can filter the data to just the indicators we are interested in (as well as filtering any rows where the value is missing), using the panda’s loc function. We can also use the group_by and mean functions to get the mean value of the indicators across all areas in the data. Below is an example:

(
  phof_raw
  .loc[(phof_raw['indicatorid'].isin([90366, 20101, 11601]))]
  .dropna(subset='value')
  .groupby('indicatorname')['value']
  .mean()
)
indicatorname
A01b - Life expectancy at birth                                      80.199513
B16 - Utilisation of outdoor space for exercise or health reasons    16.216824
C04 - Low birth weight of term babies                                 2.870756
Name: value, dtype: float64
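
For comparison, the same summary could be sketched in polars, converting the existing pandas DataFrame with pl.from_pandas. This is only an illustration of the polars API (note that older polars releases spell group_by as groupby); the rest of this guide sticks with pandas.

# the same summary sketched in polars, converting from the pandas DataFrame
(
  pl.from_pandas(phof_raw)
  .filter(pl.col('indicatorid').is_in([90366, 20101, 11601]))
  .drop_nulls(subset='value')
  .group_by('indicatorname')  # spelled .groupby() in older polars releases
  .agg(pl.col('value').mean())
)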

The data pulled from the Fingertips API has a number of idiosyncrasies that need addressing when wrangling the data. For example, the timeperiodsortable column refers to the year that the data relates to, and it contains four superfluous zeros at the end of each year (for example, 20230000). We can use the pandas assign function to create a year column that strips out the zeros by dividing timeperiodsortable by 10000 and then dropping the decimal place by converting the result to an integer (using astype(int)).

# filter for the indicators of interest and wrangle data into tidy structure
(
  phof_raw
  .loc[(
    phof_raw['indicatorid'].isin([90366, 20101, 11601])) 
    & (phof_raw['areacode'] == 'E92000001' 
    )]
  .assign(year=lambda x: (x['timeperiodsortable'] / 10000).astype(int))
  .groupby(['indicatorname', 'areacode', 'year'], as_index=False)['value']
  .mean()
)
indicatorname areacode year value
0 A01b - Life expectancy at birth E92000001 2001 78.448740
1 A01b - Life expectancy at birth E92000001 2002 78.700330
2 A01b - Life expectancy at birth E92000001 2003 78.987650
3 A01b - Life expectancy at birth E92000001 2004 79.358295
4 A01b - Life expectancy at birth E92000001 2005 79.614160
5 A01b - Life expectancy at birth E92000001 2006 79.817360
6 A01b - Life expectancy at birth E92000001 2007 80.082390
7 A01b - Life expectancy at birth E92000001 2008 80.355270
8 A01b - Life expectancy at birth E92000001 2009 80.744790
9 A01b - Life expectancy at birth E92000001 2010 80.982525
10 A01b - Life expectancy at birth E92000001 2011 81.154860
11 A01b - Life expectancy at birth E92000001 2012 81.276175
12 A01b - Life expectancy at birth E92000001 2013 81.283525
13 A01b - Life expectancy at birth E92000001 2014 81.337425
14 A01b - Life expectancy at birth E92000001 2015 81.339375
15 A01b - Life expectancy at birth E92000001 2016 81.419660
16 A01b - Life expectancy at birth E92000001 2017 81.565000
17 A01b - Life expectancy at birth E92000001 2018 81.270000
18 A01b - Life expectancy at birth E92000001 2021 80.748915
19 B16 - Utilisation of outdoor space for exercis... E92000001 2011 14.015031
20 B16 - Utilisation of outdoor space for exercis... E92000001 2012 15.329390
21 B16 - Utilisation of outdoor space for exercis... E92000001 2013 17.130101
22 B16 - Utilisation of outdoor space for exercis... E92000001 2014 17.907098
23 B16 - Utilisation of outdoor space for exercis... E92000001 2015 17.917663
24 C04 - Low birth weight of term babies E92000001 2006 3.018037
25 C04 - Low birth weight of term babies E92000001 2007 2.912639
26 C04 - Low birth weight of term babies E92000001 2008 2.894468
27 C04 - Low birth weight of term babies E92000001 2009 2.919269
28 C04 - Low birth weight of term babies E92000001 2010 2.853172
29 C04 - Low birth weight of term babies E92000001 2011 2.844595
30 C04 - Low birth weight of term babies E92000001 2012 2.798094
31 C04 - Low birth weight of term babies E92000001 2013 2.822148
32 C04 - Low birth weight of term babies E92000001 2014 2.864100
33 C04 - Low birth weight of term babies E92000001 2015 2.774469
34 C04 - Low birth weight of term babies E92000001 2016 2.786173
35 C04 - Low birth weight of term babies E92000001 2017 2.819327
36 C04 - Low birth weight of term babies E92000001 2018 2.864822
37 C04 - Low birth weight of term babies E92000001 2019 2.902918
38 C04 - Low birth weight of term babies E92000001 2020 2.856152
39 C04 - Low birth weight of term babies E92000001 2021 2.774210

We can also remove the ID that is appended to the beginning of each indicator name. We can use str.replace to remove this part of the string from the indicator name, using a regular expression to remove everything up to and including the dash (and the space immediately after).
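
As a quick check of the pattern before applying it to the whole column, we can test it on a single indicator name using the re module imported in the setup code.

# test the pattern on a single indicator name
re.sub(r'^[^-]+- ', '', 'A01b - Life expectancy at birth')
# returns 'Life expectancy at birth'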

(
  phof_raw
  .loc[(phof_raw['indicatorid'].isin([90366, 20101, 11601]))]
  .assign(
    indicatorname = lambda x: x['indicatorname']
                              .str.replace('^[^-]+- ', '', regex=True))
)
indicatorid indicatorname parentcode parentname areacode areaname areatype sex age categorytype ... count denominator valuenote recenttrend comparedtoenglandvalueorpercentiles comparedtopercentiles timeperiodsortable newdata comparedtogoal timeperiodrange
2984 90366 Life expectancy at birth NaN NaN E92000001 England England Male All ages NaN ... NaN NaN NaN NaN Not compared Not compared 20010000 NaN NaN 3y
2985 90366 Life expectancy at birth NaN NaN E92000001 England England Female All ages NaN ... NaN NaN NaN NaN Not compared Not compared 20010000 NaN NaN 3y
2986 90366 Life expectancy at birth E92000001 England E06000001 Hartlepool Counties & UAs (from Apr 2023) Male All ages NaN ... NaN NaN NaN NaN Not compared Not compared 20010000 NaN NaN 3y
2987 90366 Life expectancy at birth E92000001 England E06000002 Middlesbrough Counties & UAs (from Apr 2023) Male All ages NaN ... NaN NaN NaN NaN Not compared Not compared 20010000 NaN NaN 3y
2988 90366 Life expectancy at birth E92000001 England E06000003 Redcar and Cleveland Counties & UAs (from Apr 2023) Male All ages NaN ... NaN NaN NaN NaN Not compared Not compared 20010000 NaN NaN 3y
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
103354 20101 Low birth weight of term babies E92000001 England E10000029 Suffolk Counties & UAs (from Apr 2023) Persons >=37 weeks gestational age at birth NaN ... 169.0 6124.0 NaN No significant change Similar Not compared 20210000 NaN NaN 1y
103355 20101 Low birth weight of term babies E92000001 England E10000030 Surrey Counties & UAs (from Apr 2023) Persons >=37 weeks gestational age at birth NaN ... 277.0 11536.0 NaN No significant change Better Not compared 20210000 NaN NaN 1y
103356 20101 Low birth weight of term babies E92000001 England E10000031 Warwickshire Counties & UAs (from Apr 2023) Persons >=37 weeks gestational age at birth NaN ... 128.0 5474.0 NaN No significant change Better Not compared 20210000 NaN NaN 1y
103357 20101 Low birth weight of term babies E92000001 England E10000032 West Sussex Counties & UAs (from Apr 2023) Persons >=37 weeks gestational age at birth NaN ... 138.0 7854.0 NaN No significant change Better Not compared 20210000 NaN NaN 1y
103358 20101 Low birth weight of term babies E92000001 England E10000034 Worcestershire Counties & UAs (from Apr 2023) Persons >=37 weeks gestational age at birth NaN ... 117.0 5069.0 NaN No significant change Better Not compared 20210000 NaN NaN 1y

8847 rows × 27 columns

Finally, the data is structured in a long format, with each indicator spread across its own rows. This is not a tidy structure for our purposes, as each column should represent a variable and each row an observation (here, an area and time period). We can use the pivot_table function to reshape the data so that each indicator gets its own column. Pandas will turn areacode and timeperiodsortable into the index of the dataframe, and to convert these variables back into columns we use reset_index and rename_axis(None, axis=1). We can also use the rename function to give the indicator columns more meaningful names, and the dropna function to remove any rows where a value is missing.

(
  phof_raw
  .loc[(phof_raw['indicatorid'].isin([90366, 20101, 11601]))]
  .groupby(
    ['indicatorname', 'areacode', 'timeperiodsortable'], 
    as_index=False
    )['value']
  .mean()
  .pivot_table(
    index=['areacode', 'timeperiodsortable'],
    columns='indicatorname',
    values='value'
    )
  .rename(
    columns={
      'A01b - Life expectancy at birth': 'life_expectancy',
      'C04 - Low birth weight of term babies': 'low_birth_weight', 
      'B16 - Utilisation of outdoor space for exercise or health reasons': 'outdoor_space'
      })
  .reset_index()
  .rename_axis(None, axis=1)
  .dropna()
)
areacode timeperiodsortable life_expectancy outdoor_space low_birth_weight
11 E06000001 20120000 79.607655 21.131038 3.222749
12 E06000001 20130000 79.075110 9.837764 2.555911
13 E06000001 20140000 78.846375 18.596376 2.989691
14 E06000001 20150000 78.738395 11.314412 1.794616
31 E06000002 20110000 78.344225 16.792305 3.448276
... ... ... ... ... ...
3104 E92000001 20110000 81.154860 14.015031 2.844595
3105 E92000001 20120000 81.276175 15.329390 2.798094
3106 E92000001 20130000 81.283525 17.130101 2.822148
3107 E92000001 20140000 81.337425 17.907098 2.864100
3108 E92000001 20150000 81.339375 17.917663 2.774469

667 rows × 5 columns

We can combine all of the above steps into a single function, which we can then apply to the raw data to get the data into a tidy structure.

# function to wrangle data into tidy structure
def wrangle_phof_data(data):
  """
  Returns a tidy, formatted DataFrame with Public Health England data
  """
  phof_data = (
    data
    .loc[(data['indicatorid'].isin([90366, 20101, 11601]))]
    .assign(year=lambda x: (x['timeperiodsortable'] / 10000).astype(int))
    .groupby(
      ['indicatorname', 'areacode', 'year'], 
      as_index=False
      )['value']
    .mean()
    .pivot_table(
      index=['areacode', 'year'],
      columns='indicatorname',
      values='value'
      )
    .rename(
      columns={
        'A01b - Life expectancy at birth': 'life_expectancy',
        'C04 - Low birth weight of term babies': 'low_birth_weight', 
        'B16 - Utilisation of outdoor space for exercise or health reasons': 'outdoor_space'
        })
    .reset_index()
    .rename_axis(None, axis=1)
    .dropna()
    )
  
  return phof_data

phof = wrangle_phof_data(phof_raw)
phof.head()
areacode year life_expectancy outdoor_space low_birth_weight
11 E06000001 2012 79.607655 21.131038 3.222749
12 E06000001 2013 79.075110 9.837764 2.555911
13 E06000001 2014 78.846375 18.596376 2.989691
14 E06000001 2015 78.738395 11.314412 1.794616
31 E06000002 2011 78.344225 16.792305 3.448276

4.4 Next Steps

We have successfully imported, cleaned, and wrangled Fingertips data into a tidy structure. We can now move on to the next step of the data science process, which is to explore the data and perform some analysis.

We could save the data to a CSV file at this point, using the pandas to_csv function (for example, phof.to_csv('data/phof.csv', index=False)), or we could perform the analysis on the dataframe object we have created.
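
As a minimal sketch of that option (the file path is illustrative):

# write the tidy data to a CSV for later use (the path is illustrative)
phof.to_csv('data/phof_tidy.csv', index=False)
# ...and read it back in a later session
phof = pd.read_csv('data/phof_tidy.csv')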

This is just one example of the data wrangling process, but in reality there is a vast array of ways you might transform data in the process of preparing it for an analysis. There is no set recipe for this process, but the goal is to transform data from whatever raw form it arrives in to something structured (typically this means tidy data) that lends itself to analysis.

4.5 Resources