4 Wrangling Data

Setup Code (Click to Expand)

# packages needed to run the code in this section
# !pip install pyjanitor pandas polars fingertips_py

import re

import pandas as pd
import polars as pl

from janitor import clean_names

# import fingertips_py as ftp

The first step in any analysis is to find the right data for the problem at hand, and to transform that data so that it is structured appropriately for the method of analysis. Finding the right data can be challenging, sometimes requiring getting data from multiple sources, and even then, the data might be a mess, with missing values, duplicates, formatting that makes it difficult to work with in R or Python, and a myriad of other problems. The process of transforming data to make it suitable for analysis is often referred to as data wrangling, or data munging. The data wrangling process can take many forms, from simple data cleaning to more complex transformations like combining multiple data sets, or creating new variables to enrich the data.

4.1 Python Libraries for Data Wrangling

The Python ecosystem for data wrangling is not quite as rich as in R, but there is one library that is extremely dominant (pandas) and one library that is seeking to disrupt it (polars).

4.2 Fingertips Data

We will use Public Health England’s Fingertips API (for which there are packages available in R and Python) as the data source for this guide, as it represents a good opportunity to give a brief overview of this data and how to use it. Fingertips is a repository of public health data on a wide range of topics. The data included in this tutorial relates to the wider determinants of health.

First, we can look at what is available.

Unfortunately, the fingertips_py package does seem to be a little less mature than the R equivalent, and there are a lot of errors running the functions we would need to import the data. While this is the case, we’ll take a little shortcut, importing the data from a CSV that I’ve stored in the guides GitHub repository, using the Fingertips R package.

phof_raw = pd.read_csv("https://raw.githubusercontent.com/NHS-South-Central-and-West/data-science-guides/main/data/phof.csv", low_memory=False)

phof_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264329 entries, 0 to 264328
Data columns (total 27 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   IndicatorID                          264329 non-null  int64  
 1   IndicatorName                        264329 non-null  object 
 2   ParentCode                           261229 non-null  object 
 3   ParentName                           261229 non-null  object 
 4   AreaCode                             264329 non-null  object 
 5   AreaName                             264329 non-null  object 
 6   AreaType                             264329 non-null  object 
 7   Sex                                  264329 non-null  object 
 8   Age                                  264329 non-null  object 
 9   CategoryType                         0 non-null       float64
 10  Category                             0 non-null       float64
 11  Timeperiod                           264329 non-null  object 
 12  Value                                249986 non-null  float64
 13  LowerCI95.0limit                     243173 non-null  float64
 14  UpperCI95.0limit                     243173 non-null  float64
 15  LowerCI99.8limit                     56885 non-null   float64
 16  UpperCI99.8limit                     56885 non-null   float64
 17  Count                                187440 non-null  float64
 18  Denominator                          198367 non-null  float64
 19  Valuenote                            28742 non-null   object 
 20  RecentTrend                          42091 non-null   object 
 21  ComparedtoEnglandvalueorpercentiles  264329 non-null  object 
 22  Comparedtopercentiles                264329 non-null  object 
 23  TimeperiodSortable                   264329 non-null  int64  
 24  Newdata                              72025 non-null   object 
 25  Comparedtogoal                       38885 non-null   object 
 26  Timeperiodrange                      264329 non-null  object 
dtypes: float64(9), int64(2), object(16)
memory usage: 54.5+ MB

phof_raw.describe()

	IndicatorID	CategoryType	Category	Value	LowerCI95.0limit	UpperCI95.0limit	LowerCI99.8limit	UpperCI99.8limit	Count	Denominator	TimeperiodSortable
count	264329.000000	0.0	0.0	249986.000000	243173.000000	243173.000000	56885.000000	56885.000000	1.874400e+05	1.983670e+05	2.643290e+05
mean	72371.514560	NaN	NaN	72.092017	65.526956	81.604363	122.884874	165.406557	4.023386e+04	1.390792e+07	2.015224e+07
std	30934.529754	NaN	NaN	276.950447	253.887549	311.174288	452.040131	622.510687	1.124600e+06	1.361739e+09	4.862563e+04
min	10301.000000	NaN	NaN	-36.400000	-72.200000	-5.200000	0.000000	0.000000	-1.800000e+02	2.544061e-01	2.000000e+07
25%	41001.000000	NaN	NaN	12.398315	8.300000	16.790000	22.773030	33.147850	6.200000e+01	7.900000e+02	2.012000e+07
50%	90634.000000	NaN	NaN	40.859605	35.579380	50.900000	58.024690	69.074819	4.850000e+02	4.798000e+03	2.016000e+07
75%	92517.000000	NaN	NaN	75.900000	72.400000	79.700120	83.092446	90.801542	4.164250e+03	1.181570e+05	2.019000e+07
max	93982.000000	NaN	NaN	9110.634850	8473.979610	9782.239270	8122.734550	10179.121160	1.942292e+08	2.894740e+11	2.023000e+07

phof_raw.head()

	IndicatorID	IndicatorName	ParentCode	ParentName	AreaCode	AreaName	AreaType	Sex	Age	CategoryType	...	Count	Denominator	Valuenote	RecentTrend	ComparedtoEnglandvalueorpercentiles	Comparedtopercentiles	TimeperiodSortable	Newdata	Comparedtogoal	Timeperiodrange
0	90362	A01a - Healthy life expectancy at birth	NaN	NaN	E92000001	England	England	Male	All ages	NaN	...	NaN	NaN	NaN	NaN	Not compared	Not compared	20090000	NaN	NaN	3y
1	90362	A01a - Healthy life expectancy at birth	NaN	NaN	E92000001	England	England	Female	All ages	NaN	...	NaN	NaN	NaN	NaN	Not compared	Not compared	20090000	NaN	NaN	3y
2	90362	A01a - Healthy life expectancy at birth	E92000001	England	E06000001	Hartlepool	Counties & UAs (from Apr 2023)	Male	All ages	NaN	...	NaN	NaN	NaN	NaN	Worse	Not compared	20090000	NaN	NaN	3y
3	90362	A01a - Healthy life expectancy at birth	E92000001	England	E06000002	Middlesbrough	Counties & UAs (from Apr 2023)	Male	All ages	NaN	...	NaN	NaN	NaN	NaN	Worse	Not compared	20090000	NaN	NaN	3y
4	90362	A01a - Healthy life expectancy at birth	E92000001	England	E06000003	Redcar and Cleveland	Counties & UAs (from Apr 2023)	Male	All ages	NaN	...	NaN	NaN	NaN	NaN	Worse	Not compared	20090000	NaN	NaN	3y

5 rows × 27 columns

With the raw data imported, we are now able to start the process of cleaning and wrangling the data into a format that is ready for analysis.

4.3 Data Wrangling

First, we will clean the column names using the pyjanitor’s clean_names function. This will convert all column names to lower case and remove any spaces and special characters. This will make it easier to work with the data in the future.

phof_raw = (
    phof_raw
    .pipe(clean_names)
)

We’ve got a total of 168 indicators, which means some of the indicators from the Public Health Outcomes Framework are not available for upper tier local authorities. We can filter the data to just the indicators we are interested in (as well as filtering any rows where the value is missing), using the panda’s loc function. We can also use the group_by and mean functions to get the mean value of the indicators across all areas in the data. Below is an example:

(
  phof_raw
  .loc[(phof_raw['indicatorid'].isin([90366, 20101, 11601]))]
  .dropna(subset='value')
  .groupby('indicatorname')['value']
  .mean()
)

indicatorname
A01b - Life expectancy at birth                                      80.199513
B16 - Utilisation of outdoor space for exercise or health reasons    16.216824
C04 - Low birth weight of term babies                                 2.870756
Name: value, dtype: float64

The data pulled from the Fingertips API has a number of idiosyncracies that need addressing when wrangling the data. For example, the timeperiodsortable column refers to the year that the data relates to, and it contains four superfluous zeros at the end of each year (i.e, 20230000). We can use the pandas assign function to create a year column which strips out the zeros from the timeperiodsortable column by dividing it by 10000, and dropping the decimal place by converting year to integer (using the astype(int) function).

# filter for the indicators of interest and wrangle data into tidy structure
(
  phof_raw
  .loc[(
    phof_raw['indicatorid'].isin([90366, 20101, 11601])) 
    & (phof_raw['areacode'] == 'E92000001' 
    )]
  .assign(year=lambda x: (x['timeperiodsortable'] / 10000).astype(int))
  .groupby(['indicatorname', 'areacode', 'year'], as_index=False)['value']
  .mean()
)

	indicatorname	areacode	year	value
0	A01b - Life expectancy at birth	E92000001	2001	78.448740
1	A01b - Life expectancy at birth	E92000001	2002	78.700330
2	A01b - Life expectancy at birth	E92000001	2003	78.987650
3	A01b - Life expectancy at birth	E92000001	2004	79.358295
4	A01b - Life expectancy at birth	E92000001	2005	79.614160
5	A01b - Life expectancy at birth	E92000001	2006	79.817360
6	A01b - Life expectancy at birth	E92000001	2007	80.082390
7	A01b - Life expectancy at birth	E92000001	2008	80.355270
8	A01b - Life expectancy at birth	E92000001	2009	80.744790
9	A01b - Life expectancy at birth	E92000001	2010	80.982525
10	A01b - Life expectancy at birth	E92000001	2011	81.154860
11	A01b - Life expectancy at birth	E92000001	2012	81.276175
12	A01b - Life expectancy at birth	E92000001	2013	81.283525
13	A01b - Life expectancy at birth	E92000001	2014	81.337425
14	A01b - Life expectancy at birth	E92000001	2015	81.339375
15	A01b - Life expectancy at birth	E92000001	2016	81.419660
16	A01b - Life expectancy at birth	E92000001	2017	81.565000
17	A01b - Life expectancy at birth	E92000001	2018	81.270000
18	A01b - Life expectancy at birth	E92000001	2021	80.748915
19	B16 - Utilisation of outdoor space for exercis...	E92000001	2011	14.015031
20	B16 - Utilisation of outdoor space for exercis...	E92000001	2012	15.329390
21	B16 - Utilisation of outdoor space for exercis...	E92000001	2013	17.130101
22	B16 - Utilisation of outdoor space for exercis...	E92000001	2014	17.907098
23	B16 - Utilisation of outdoor space for exercis...	E92000001	2015	17.917663
24	C04 - Low birth weight of term babies	E92000001	2006	3.018037
25	C04 - Low birth weight of term babies	E92000001	2007	2.912639
26	C04 - Low birth weight of term babies	E92000001	2008	2.894468
27	C04 - Low birth weight of term babies	E92000001	2009	2.919269
28	C04 - Low birth weight of term babies	E92000001	2010	2.853172
29	C04 - Low birth weight of term babies	E92000001	2011	2.844595
30	C04 - Low birth weight of term babies	E92000001	2012	2.798094
31	C04 - Low birth weight of term babies	E92000001	2013	2.822148
32	C04 - Low birth weight of term babies	E92000001	2014	2.864100
33	C04 - Low birth weight of term babies	E92000001	2015	2.774469
34	C04 - Low birth weight of term babies	E92000001	2016	2.786173
35	C04 - Low birth weight of term babies	E92000001	2017	2.819327
36	C04 - Low birth weight of term babies	E92000001	2018	2.864822
37	C04 - Low birth weight of term babies	E92000001	2019	2.902918
38	C04 - Low birth weight of term babies	E92000001	2020	2.856152
39	C04 - Low birth weight of term babies	E92000001	2021	2.774210

We can also remove the ID that is appended to the beginning of each indicator name. We can use str.replace to remove this part of the string from the indicator name, using a regular expression to remove everything up to and including the dash (and the space immediately after).

(
  phof_raw
  .loc[(phof_raw['indicatorid'].isin([90366, 20101, 11601]))]
  .assign(
    indicatorname = lambda x: x['indicatorname']
                              .str.replace('^[^-]+- ', '', regex=True))
)

	indicatorid	indicatorname	parentcode	parentname	areacode	areaname	areatype	sex	age	categorytype	...	count	denominator	valuenote	recenttrend	comparedtoenglandvalueorpercentiles	comparedtopercentiles	timeperiodsortable	newdata	comparedtogoal	timeperiodrange
2984	90366	Life expectancy at birth	NaN	NaN	E92000001	England	England	Male	All ages	NaN	...	NaN	NaN	NaN	NaN	Not compared	Not compared	20010000	NaN	NaN	3y
2985	90366	Life expectancy at birth	NaN	NaN	E92000001	England	England	Female	All ages	NaN	...	NaN	NaN	NaN	NaN	Not compared	Not compared	20010000	NaN	NaN	3y
2986	90366	Life expectancy at birth	E92000001	England	E06000001	Hartlepool	Counties & UAs (from Apr 2023)	Male	All ages	NaN	...	NaN	NaN	NaN	NaN	Not compared	Not compared	20010000	NaN	NaN	3y
2987	90366	Life expectancy at birth	E92000001	England	E06000002	Middlesbrough	Counties & UAs (from Apr 2023)	Male	All ages	NaN	...	NaN	NaN	NaN	NaN	Not compared	Not compared	20010000	NaN	NaN	3y
2988	90366	Life expectancy at birth	E92000001	England	E06000003	Redcar and Cleveland	Counties & UAs (from Apr 2023)	Male	All ages	NaN	...	NaN	NaN	NaN	NaN	Not compared	Not compared	20010000	NaN	NaN	3y
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
103354	20101	Low birth weight of term babies	E92000001	England	E10000029	Suffolk	Counties & UAs (from Apr 2023)	Persons	>=37 weeks gestational age at birth	NaN	...	169.0	6124.0	NaN	No significant change	Similar	Not compared	20210000	NaN	NaN	1y
103355	20101	Low birth weight of term babies	E92000001	England	E10000030	Surrey	Counties & UAs (from Apr 2023)	Persons	>=37 weeks gestational age at birth	NaN	...	277.0	11536.0	NaN	No significant change	Better	Not compared	20210000	NaN	NaN	1y
103356	20101	Low birth weight of term babies	E92000001	England	E10000031	Warwickshire	Counties & UAs (from Apr 2023)	Persons	>=37 weeks gestational age at birth	NaN	...	128.0	5474.0	NaN	No significant change	Better	Not compared	20210000	NaN	NaN	1y
103357	20101	Low birth weight of term babies	E92000001	England	E10000032	West Sussex	Counties & UAs (from Apr 2023)	Persons	>=37 weeks gestational age at birth	NaN	...	138.0	7854.0	NaN	No significant change	Better	Not compared	20210000	NaN	NaN	1y
103358	20101	Low birth weight of term babies	E92000001	England	E10000034	Worcestershire	Counties & UAs (from Apr 2023)	Persons	>=37 weeks gestational age at birth	NaN	...	117.0	5069.0	NaN	No significant change	Better	Not compared	20210000	NaN	NaN	1y

8847 rows × 27 columns

Finally, the data is structured in a wide format, with each indicator having its own column. This is not a tidy structure, as each column should represent a variable, and each row should represent an observation. We can use the pivot_table function to convert the data into a tidy structure, with each indicator having its own row. Pandas will turn areacode and timeperiodsortable into the index of the dataframe, and in order to convert these variables back to columns we use reset_index and rename_axis(None, axis=1). We can also use the rename function to rename the columns to something more meaningful, and the dropna function to remove any rows where the value is missing.

(
  phof_raw
  .loc[(phof_raw['indicatorid'].isin([90366, 20101, 11601]))]
  .groupby(
    ['indicatorname', 'areacode', 'timeperiodsortable'], 
    as_index=False
    )['value']
  .mean()
  .pivot_table(
    index=['areacode', 'timeperiodsortable'],
    columns='indicatorname',
    values='value'
    )
  .rename(
    columns={
      'A01b - Life expectancy at birth': 'life_expectancy',
      'C04 - Low birth weight of term babies': 'low_birth_weight', 
      'B16 - Utilisation of outdoor space for exercise or health reasons': 'outdoor_space'
      })
  .reset_index()
  .rename_axis(None, axis=1)
  .dropna()
)

	areacode	timeperiodsortable	life_expectancy	outdoor_space	low_birth_weight
11	E06000001	20120000	79.607655	21.131038	3.222749
12	E06000001	20130000	79.075110	9.837764	2.555911
13	E06000001	20140000	78.846375	18.596376	2.989691
14	E06000001	20150000	78.738395	11.314412	1.794616
31	E06000002	20110000	78.344225	16.792305	3.448276
...	...	...	...	...	...
3104	E92000001	20110000	81.154860	14.015031	2.844595
3105	E92000001	20120000	81.276175	15.329390	2.798094
3106	E92000001	20130000	81.283525	17.130101	2.822148
3107	E92000001	20140000	81.337425	17.907098	2.864100
3108	E92000001	20150000	81.339375	17.917663	2.774469

667 rows × 5 columns

We can combine all of the above steps into a single function, which we can then apply to the raw data to get the data into a tidy structure.

# function to wrangle data into tidy structure
def wrangle_phof_data(data):
  """
  Returns a tidy, formatted DataFrame with Public Health England data
  """
  phof_data = (
    data
    .loc[(data['indicatorid'].isin([90366, 20101, 11601]))]
    .assign(year=lambda x: (x['timeperiodsortable'] / 10000).astype(int))
    .groupby(
      ['indicatorname', 'areacode', 'year'], 
      as_index=False
      )['value']
    .mean()
    .pivot_table(
      index=['areacode', 'year'],
      columns='indicatorname',
      values='value'
      )
    .rename(
      columns={
        'A01b - Life expectancy at birth': 'life_expectancy',
        'C04 - Low birth weight of term babies': 'low_birth_weight', 
        'B16 - Utilisation of outdoor space for exercise or health reasons': 'outdoor_space'
        })
    .reset_index()
    .rename_axis(None, axis=1)
    .dropna()
    )
  
  return phof_data

phof = wrangle_phof_data(phof_raw)
phof.head()

	areacode	year	life_expectancy	outdoor_space	low_birth_weight
11	E06000001	2012	79.607655	21.131038	3.222749
12	E06000001	2013	79.075110	9.837764	2.555911
13	E06000001	2014	78.846375	18.596376	2.989691
14	E06000001	2015	78.738395	11.314412	1.794616
31	E06000002	2011	78.344225	16.792305	3.448276

4.4 Next Steps

We have successfully imported, cleaned, and wrangled Fingertips data into a tidy structure. We can now move on to the next step of the data science process, which is to explore the data and perform some analysis.

We could save the data to a CSV file at this point, using the pandas to_csv function (for example, phof.to_csv('data/phof.csv', index=False)), or we could perform the analysis on the dataframe object we have created.

This is just one example of the data wrangling process, but in reality there is a vast array of ways you might transform data in the process of preparing it for an analysis. There is no set recipe for this process, but the goal is to transform data from whatever raw form it arrives in to something structured (typically this means tidy data) that lends itself to analysis.

4.1 Python Libraries for Data Wrangling

4.2 Fingertips Data

4.3 Data Wrangling

4.4 Next Steps

4.5 Resources