Cleaning data in Python

Table of Contents

Note: pandas is a powerfull open source Python data analysis library that is used for data cleaning. A complete documentation can be found here. Please also check the video of 10-minute tour of pandas from Wes McKinney.

Set up environments

If you are using Python for data analysis, we recommend the Anaconda Scientific Python Distribution. It is completely free, and ideal for processing data and doing predictive analysis and scientific computing. You can get the latest the version of Anaconda at https://www.continuum.io/downloads. For more information about getting started with Anaconda and other supports, please refer to the links under LOOKING FOR MORE at http://continuum.io/downloads.

After you download the Anaconda distribution to your local machine, install it and run spyder in the command line. You will be prompted to a similar interface as below: the left part is a text editor where you can write scripts, and the right part is the Console where your analysis results will show when you run your scripts.

Note: you can personalize the font, background color, theme, etc. through Tools -> Preferences. Also, you’ll find the tutorial helpful if you are new to Spyder.

Data analysis packages in Python

For data analysis in Python, we recommend several packages (all included in the Spyder platform):

  • NumPy: the fundamental package for scientific computing
  • matplotlib: the most popular Python library for producing plots and other 2D data visualizations
  • pandas: a library providing high-performance, easy-to-use data structures and data analysis tools
  • SciPy: a collection of packages addressing a number of different standard problem domains in scientific computing
  • sklearn: machine learning in Python

You’ll also find the book Python for Data Analysis (http://shop.oreilly.com/product/0636920023784.do) helpful.

Cleaning data in Python

Here is a simple example (from loading data till handling missing values) for how to clean data using Python.

Note: the dataset we use here is the Canadian Community Health Survey (CCHS), 2012: Mental Health Component. For how to download this dataset from odesi, please refer to this guide: https://data.library.utoronto.ca/finding-and-working-microdata#4d

Load dataset into Spyder

After you run the program, you can type the below scripts in the console to load dataset into the program. Note that the csv file has to be in the current working directory, otherwise an error will be raised.

We can also see the dimension of the dataset using the shape method (in this case, the dataset has 25113 rows and 586 columns)

We can see the descrptive stat of columns with numeric values, using describe() method:

If say one of the columns which is expected to contain numbers include letters by mistake, you can set them to NaN (missing data) at the time of reading the data.
Suppose the letter included is 'g':
In[8]: cchs = pd.read_csv("cchs2012.csv", header=0, na_values=['g'])

If the data contains missing data indicated by 'NA', you can read the data in as follows:
In[9]: cchs = pd.read_csv("cchs2012.csv", header=0, na_values=['NA'])

You can find all the rows where a specific column holds NaN values as follows:
In[10]: cchs[ pd.isnull(cchs['GEO_PRV'])]

Subset

By observations: e.g. get the first 100 observations (rows)

By variables: e.g. get the first 100 variables (columns)

By both: e.g. get the first 100*100 of the original dataset

By variable name: e.g. get column “VERDATE”, “ADM_RNO”,  and “GEO_PRV”

Drop data

Drop observations: e.g. drop the row with index 1 and 3

Drop variables: e.g. drop variables "GEOGCMA1" and "ADM_N09"

Transform data

Remove duplicates, if any (in this case not)

Replace values: e.g. replace 1 with 7 for the whole dataset (the dimension doesn’t change – only all the 1’s are replaced with 7)

Rename index: e.g. change index from 0 to "person1"

Discretization and binning: set up serveral bins (groups)

Detect and filter outlier: e.g. 5900 -> upper bound

Create new variables

E.g. create a new col “CITIZEN” and make the value for all observations 1

Rename variables

E.g. rename two columns: "ADM_RNO" to "ADM"; "GEO_PRV" to "GEO" (the dimension doesn’t change – only the two column names are changed)

Merge two data sets

E.g. merge on the unique shared key: subset_7 is the first 10 observations with first columns 0- 99; subset_8 is the first 10 observations with column 1, 100, 101 and 102; the column subset_7 and subset_8 share is column 1, based on which we’ll merge the two subsets.

The merged dataset should have dimension 10*103 (column 0-102)

Handle missing values

We use a subset (10*10) of the dataset for this task

Drop missing values: drop the obeservation with missing values

Fill missing values: with column mean

A few last words

This concludes the Cleaning Data in Python tutorials but it’s only the beginning for aspiring Data Analyst. Here we list some resources that you might find helpful along the way:

Enjoy analyzing data!

Technique: 
Utilities: