Cleaning Data in Python


Table of Contents

Note: pandas is a powerful open-source Python data analysis library that is widely used for data cleaning. Complete documentation is available on the pandas website. Please also check the video "10-minute tour of pandas" by Wes McKinney.

Set up environments

If you are using Python for data analysis, we recommend the Anaconda Scientific Python Distribution. It is completely free and ideal for processing data, doing predictive analysis, and scientific computing. You can get the latest version of Anaconda at https://www.continuum.io/downloads. It is available for Windows, macOS, and Linux. For more information about getting started with Anaconda and other support, please refer to the links under LOOKING FOR MORE at http://continuum.io/downloads

After you download the Anaconda distribution to your local machine, install it and open Anaconda Navigator. Launch Spyder for this tutorial. Spyder is an open-source, cross-platform Integrated Development Environment (IDE) for data science. It integrates essential libraries for data analysis, including NumPy, SciPy, and pandas, which we will be using in this tutorial.

You will see an interface similar to the one below: the left part is a text editor where you can write scripts, and the right part is the console where your analysis results appear when you run your scripts. The text editor allows you to write multiple lines of code and run them together. For this tutorial, we will write our code directly in the console. When you hit ENTER after a line of code, the console runs it and you will see the result immediately. Above the console is the Help section. To use this feature, just highlight the code and hit Ctrl+I.

Note: you can personalize the font, background color, theme, etc. through Tools -> Preferences. You may also find Spyder's introductory tutorial helpful if you are new to the IDE.

Data analysis packages in Python

For data analysis in Python, we recommend several libraries (packages). All of these libraries ship with the Anaconda distribution, so you can simply import them and start working:

  • NumPy: the fundamental package for scientific computing
  • matplotlib: the most popular Python library for producing plots and other 2D data visualizations
  • pandas: a library providing high-performance, easy-to-use data structures and data analysis tools
  • SciPy: a collection of packages addressing a number of different standard problem domains in scientific computing
  • sklearn: machine learning in Python

You’ll also find the book Python for Data Analysis (http://shop.oreilly.com/product/0636920023784.do) helpful.

Cleaning data in Python

Here is a simple example (from loading data to handling missing values) of how to clean data using Python.

Note: the dataset we use here is the Canadian Community Health Survey (CCHS), 2012: Mental Health Component. The dataset can be retrieved from odesi (https://odesi.ca/) by searching for CCHS 2012. Make sure you download the Mental Health Component. For further instructions on how to download this dataset from odesi, refer to this guide: http://guides.scholarsportal.info/odesi

Load dataset into Spyder

After you launch Spyder, you can type the scripts below into the console to set up the working directory.

Note: The top part of this image is the Help panel in Spyder. If you highlight a term and press Ctrl+I, the Help panel will show the definition and usage instructions for the command.

[1] Import the operating system interface module (os).

[2] Change the working directory to the folder where your data is saved.

[3] Get the current working directory (getcwd) to confirm that you've successfully changed your directory.
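The three steps above can be sketched as follows. The path used here is only a placeholder (your home folder); substitute the folder where your data actually lives:

```python
import os

# [1] The os module gives access to operating-system functions.

# [2] Change the working directory. Replace the placeholder path below
#     with the folder where your CCHS data is saved.
os.chdir(os.path.expanduser("~"))

# [3] Confirm the change by printing the current working directory.
print(os.getcwd())
```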

You can then input the scripts below in the console to load the dataset into the program. Note that the CSV file has to be in the current working directory, otherwise an error will be raised.

[4] Import the pandas library to work with for this tutorial.

[5] Define the variable "data" and use the read_csv function in pandas to load your CCHS 2012 dataset.
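A sketch of steps [4] and [5]. Because the CCHS file itself cannot be bundled here, a small inline CSV stands in for it; with the real file you would pass its filename (e.g. pd.read_csv("your_cchs_file.csv"), a placeholder name) instead of the StringIO object:

```python
import io

# [4] Import the pandas library.
import pandas as pd

# Stand-in for the CCHS CSV so this snippet runs on its own.
csv_text = "VERDATE,ADM_RNO,GEO_PRV\n20130101,1,35\n20130101,2,24\n"

# [5] Load the dataset into a DataFrame named "data".
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
```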

You can also see the dimensions of the dataset using the shape attribute (in this case, the dataset has 25113 rows and 586 columns).

We can see descriptive statistics for the columns with numeric values using the describe() method:
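Both checks look like this. A tiny stand-in DataFrame replaces the CCHS data here, so the shape printed is (3, 2) rather than the tutorial's (25113, 586):

```python
import pandas as pd

# Small stand-in frame; the real CCHS dataset is 25113 rows x 586 columns.
data = pd.DataFrame({"ADM_RNO": [1, 2, 3], "GEO_PRV": [35, 24, 35]})

print(data.shape)       # (rows, columns)
print(data.describe())  # count, mean, std, min, quartiles, max per numeric column
```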

If a column that is expected to contain numbers includes letters by mistake, you can set those values to NaN (missing data) at the time of reading the data. Suppose the letter included by mistake is 'g':

If the data contains missing values indicated by 'NA', you can read the data in as follows:

You can find all the rows where a specific column holds NaN values as follows (though in this case, the dataset does not contain any NaN values):

Subset

By observations: e.g. get the first 100 observations (rows)


By variables: e.g. get the first 100 variables (columns)


By both: e.g. get the first 100*100 of the original dataset


By variable name: e.g. get column “VERDATE”, “ADM_RNO”,  and “GEO_PRV”

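The four ways of subsetting can be sketched as follows; a small stand-in DataFrame replaces the CCHS data, so the examples take the first few rows/columns rather than the first 100:

```python
import pandas as pd

# Stand-in frame; with the CCHS data the same patterns would use :100.
data = pd.DataFrame({"VERDATE": ["20130101"] * 5,
                     "ADM_RNO": list(range(5)),
                     "GEO_PRV": [35, 24, 35, 12, 46]})

subset_rows = data.iloc[:3]        # by observations: first 3 rows
subset_cols = data.iloc[:, :2]     # by variables: first 2 columns
subset_both = data.iloc[:3, :2]    # by both: first 3x2 block
subset_named = data[["VERDATE", "ADM_RNO", "GEO_PRV"]]  # by variable name
```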

Drop data

Drop observations: e.g. drop the row with index 1 and 3


Drop variables: e.g. drop the columns with headers "GEOGCMA1" and "ADM_N09". Note that in the command, axis=1 means the labels refer to columns, and axis=0 means they refer to rows (the index).

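Both drop operations can be sketched with a stand-in frame that uses the column names from the example:

```python
import pandas as pd

data = pd.DataFrame({"GEOGCMA1": [1, 2, 3, 4],
                     "ADM_N09": [9, 9, 9, 9],
                     "GEO_PRV": [35, 24, 35, 12]})

# Drop the rows with index labels 1 and 3 (axis=0: row labels).
dropped_rows = data.drop([1, 3], axis=0)

# Drop two columns by header (axis=1: column labels).
dropped_cols = data.drop(["GEOGCMA1", "ADM_N09"], axis=1)
```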

Transform data

Remove duplicates, if any (in this case not)

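A minimal sketch of duplicate removal, using a stand-in frame that (unlike the CCHS data) does contain one duplicated row:

```python
import pandas as pd

data = pd.DataFrame({"ADM_RNO": [1, 2, 2, 3],
                     "GEO_PRV": [35, 24, 24, 12]})

# drop_duplicates keeps the first occurrence of each duplicated row.
deduped = data.drop_duplicates()
```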

Replace values: e.g. replace 1 with 7 across the whole dataset (the dimensions don't change; only the 1's are replaced with 7's)

Rename index: e.g. change the index name from "0" to "person1"
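Both transformations can be sketched on a stand-in frame:

```python
import pandas as pd

data = pd.DataFrame({"ADM_RNO": [1, 2], "GEO_PRV": [35, 1]})

# Replace every 1 with 7 across the whole frame; shape is unchanged.
replaced = data.replace(1, 7)

# Rename the index label 0 to "person1"; data values are unchanged.
renamed = data.rename(index={0: "person1"})
```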

Discretization and binning: set up several bins (groups)
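A minimal binning sketch with pd.cut; the ages, bin edges, and labels here are illustrative, not from the CCHS data:

```python
import pandas as pd

ages = pd.Series([18, 25, 37, 52, 68])

# Three bins: (0, 30], (30, 50], (50, 100], each with a label.
bins = [0, 30, 50, 100]
groups = pd.cut(ages, bins, labels=["young", "middle", "senior"])
print(groups)
```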

Detect and filter outliers: e.g. treat 5900 as the upper bound
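One common way to do this (a sketch with made-up values) is to flag values above the bound and then cap them at it:

```python
import pandas as pd

scores = pd.Series([120, 5900, 340, 7200])
upper = 5900                        # treat 5900 as the upper bound

outliers = scores[scores > upper]   # detect values beyond the bound
capped = scores.clip(upper=upper)   # filter: cap everything at the bound
```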

Create new variables

E.g. create a new column "CITIZEN" and set its value to 1 for all observations
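In pandas, assigning a scalar to a new column name broadcasts it to every row (stand-in frame below):

```python
import pandas as pd

data = pd.DataFrame({"ADM_RNO": [1, 2, 3]})

# New column "CITIZEN", value 1 for every observation.
data["CITIZEN"] = 1
```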

Rename variables

E.g. rename two columns: "ADM_RNO" to "ADM" and "GEO_PRV" to "GEO" (the dimensions don't change; only the two column names change)
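The rename can be sketched as a dictionary mapping old names to new ones:

```python
import pandas as pd

data = pd.DataFrame({"ADM_RNO": [1], "GEO_PRV": [35]})

# Map old column names to new ones; dimensions are unchanged.
renamed = data.rename(columns={"ADM_RNO": "ADM", "GEO_PRV": "GEO"})
```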

Merge two data sets

E.g. merge on the unique shared key: subset_7 is the first 10 observations with columns 0-99; subset_8 is the first 10 observations with columns 1, 100, 101, and 102; the column subset_7 and subset_8 share is column 1, on which we'll merge the two subsets.

The merged dataset should have dimensions 10*103 (columns 0-102).
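The same pattern in miniature, with small stand-in frames that share the key column "ADM_RNO" (playing the role of column 1 in the tutorial's subsets); the merged frame keeps the key once plus the non-key columns of both sides:

```python
import pandas as pd

subset_7 = pd.DataFrame({"VERDATE": ["20130101"] * 3,
                         "ADM_RNO": [1, 2, 3]})
subset_8 = pd.DataFrame({"ADM_RNO": [1, 2, 3],
                         "GEO_PRV": [35, 24, 12]})

# Merge on the unique shared key column.
merged = pd.merge(subset_7, subset_8, on="ADM_RNO")
```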

Handle missing values

We use a 10*10 subset of the dataset for this task.

Drop missing values: drop the observations with missing values

Fill missing values: with column mean
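Both options can be sketched on a small stand-in subset containing NaN values:

```python
import numpy as np
import pandas as pd

subset = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                       "B": [4.0, 5.0, np.nan]})

# Drop the observations (rows) that contain any missing value.
dropped = subset.dropna()

# Fill each missing value with the mean of its column.
filled = subset.fillna(subset.mean())
```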

A few last words

This concludes the Cleaning Data in Python tutorial, but it's only the beginning for an aspiring data analyst. Here we list some resources that you might find helpful:

Enjoy analyzing data!

Technique: 
Utilities: