Cleaning Data in Python

SET UP ENVIRONMENT

Software

If you are using Python for data analysis, we recommend the Anaconda Scientific Python Distribution. It is completely free and well suited to processing data, predictive analytics, and scientific computing. You can get the latest version of Anaconda at https://www.continuum.io/downloads. It is available for Windows, macOS, and Linux. For more information about getting started with Anaconda and other support resources, please refer to the links under LOOKING FOR MORE at http://continuum.io/downloads

After you download the Anaconda distribution to your local machine, install it and open Anaconda Navigator.

Launch Spyder for this tutorial. Spyder is an open-source, cross-platform integrated development environment (IDE) for data science. It integrates the essential libraries for data analysis, including NumPy, SciPy, and pandas, which we will be using in this tutorial.

You will see an interface similar to the one below: the left pane is a text editor where you can write scripts, and the right pane contains the console, where your results appear when you run your scripts. The text editor allows you to write multiple lines of code and run them together. For this tutorial, we will write our code directly in the console: after you type a line of code and hit ENTER, the console runs it and shows the result immediately.

You can personalize the font, background colour, theme, etc. through Tools -> Preferences; for example, you can switch to a dark background colour. Also, if you are new to Spyder, you will find the built-in tutorial helpful (Help -> Spyder tutorial).

The top-right pane (above the console) is Spyder's Help pane. You can type in a command to search for its documentation, or you can highlight a command in the editor/console and press Ctrl+I to show the documentation for that command. More functions of this pane will be introduced later in this tutorial.

Data analysis packages in Python

For data analysis in Python, we recommend several libraries (packages). All of these libraries come with the Anaconda distribution, so in Spyder you can simply import them and start working (see the sketch after this list):

  • pandas: a library providing high-performance, easy-to-use data structures and data analysis tools. Complete documentation is available at https://pandas.pydata.org. Please also check the video of Wes McKinney's 10-minute tour of pandas.
  • NumPy: the fundamental package for scientific computing
  • matplotlib: the most popular Python library for producing plots and other 2D data visualizations
  • SciPy: a collection of packages addressing a number of different standard problem domains in scientific computing
  • sklearn: machine learning in Python
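As a quick sketch, the imports typically look like the following (the aliases pd, np, and plt are common conventions, not requirements):

    import pandas as pd               # data structures and analysis tools
    import numpy as np                # numerical arrays and functions
    import matplotlib.pyplot as plt   # plotting
    import scipy                      # scientific computing routines
    import sklearn                    # machine learning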

You’ll also find the book Python for Data Analysis (http://shop.oreilly.com/product/0636920023784.do) helpful.

 

CLEANING DATA IN PYTHON

Here is a simple example, from loading the data to handling missing values, of how to clean data using Python.

Download Dataset

The dataset we use here is the Canadian Community Health Survey, 2012: Mental Health Component. You can access the dataset directly at https://search1.odesi.ca/#/details?uri=%2Fodesi%2Fcchs-82M0013-E-2012-me..., or you can go to odesi.ca, search for CCHS 2012, and look for the record with the Mental Health Component.

After loading the page, click "Explore & Download"

On the new page, find the "Download" button in the top-right corner.

On the download page, set the data format to "Comma Separated Value file" to get a CSV file that Python can work with, check the "Include documentation" box, and click "DOWNLOAD" to download the dataset.

Unzip the downloaded file; you will see a CSV file and a PDF file in the folder. The CSV file contains the data we will be cleaning, and the PDF file is the metadata, which contains the codebook for all variables in this dataset. The codebook helps you understand all the variables and their values better.

Load dataset into Spyder

After you launch Spyder, you can type the scripts below to set up the working directory. In Python, anything after the "#" symbol is treated as a comment.

The first line of code imports the os module, which contains miscellaneous operating system interfaces, including functions to change the working directory and print the current working directory (used in Line [2] and Line [3]). Line [2] changes the working directory to the folder where the files are saved. Line [3] confirms the change by printing the current working directory.
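A sketch of these lines as they would appear in the console (the folder path is a placeholder; substitute the path where you saved your files):

    In [1]: import os                           # miscellaneous operating system interfaces
    In [2]: os.chdir('C:/Users/you/CCHS2012')   # placeholder path: change to your own folder
    In [3]: print(os.getcwd())                  # confirm the current working directory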

In the pane above the console, you can view the files in this directory under the "File Explorer" tab.

You can then enter the scripts below in the console to load the dataset into the program. Note that the CSV file has to be in the current working directory; otherwise an error will be raised.

Line [4] imports the pandas library that we will work with for this tutorial. Line [5] defines the variable "data" and uses the read_csv command in pandas to load your CCHS 2012 dataset into the session.
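A sketch of these two lines (the CSV filename is a placeholder; use the name of the file you unzipped):

    In [4]: import pandas as pd
    In [5]: data = pd.read_csv('cchs_2012_mental_health.csv')   # placeholder filename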

Now, in the pane above the console, click the "Variable Explorer" tab to see the details of each variable defined in the current console session, including its data type and size.

Double-clicking the variable will open a new window displaying the whole dataset.

You can also see the dimensions of the dataset using the shape attribute (in this case, the dataset has 25113 rows and 586 columns):
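For example:

    data.shape   # returns (25113, 586): 25113 rows and 586 columns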

We can see descriptive statistics for the columns with numeric values using the describe command:
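For example:

    data.describe()   # count, mean, std, min, quartiles, and max for numeric columns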

If one of the columns that is expected to contain numbers mistakenly includes letters, you can set those entries to NaN (missing data) at the time of reading the data. Suppose the letter included by mistake is 'g':
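A sketch using read_csv's na_values parameter, which tells pandas to treat the listed strings as missing (the filename is a placeholder):

    data = pd.read_csv('cchs_2012_mental_health.csv', na_values=['g'])   # every 'g' becomes NaN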

For details of how missing data are recorded, refer to the codebook for each variable. Here is an example from the codebook for variable AUD_01, "Drank alcohol in past 12 months": among the responses for this variable, a recorded value of '8' means the respondent refused to answer the question.
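If you want such codes treated as missing, na_values also accepts a dictionary mapping column names to the values to recode; a sketch, assuming you want to treat the refusals in AUD_01 as missing:

    data = pd.read_csv('cchs_2012_mental_health.csv',
                       na_values={'AUD_01': [8]})   # 8 = "refused"; becomes NaN in AUD_01 only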

If the data contain missing values indicated by 'NA', you can read the data in as follows:
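For example (note that recent versions of pandas already treat the string 'NA' as missing by default):

    data = pd.read_csv('cchs_2012_mental_health.csv', na_values=['NA'])   # 'NA' becomes NaN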

You can find all the rows where a specific column holds NaN values as follows (though in this case, the dataset does not contain any NaN values):
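A sketch using a boolean mask with isnull(); the column AUD_01 is used purely as an illustration:

    data[data['AUD_01'].isnull()]   # rows where AUD_01 is NaN (empty for this dataset)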

Subset

By observations: e.g. get the first 100 observations (rows)
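For example:

    subset_1 = data[:100]   # first 100 rows; data.iloc[:100] is equivalent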


By variables: e.g. get the first 100 variables (columns)
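For example:

    subset_2 = data.iloc[:, :100]   # all rows, first 100 columns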


By both: e.g. get the first 100*100 of the original dataset
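For example:

    subset_3 = data.iloc[:100, :100]   # first 100 rows and first 100 columns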


By variable name: e.g. get column “VERDATE”, “ADM_RNO”,  and “GEO_PRV”
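For example:

    subset_4 = data[['VERDATE', 'ADM_RNO', 'GEO_PRV']]   # select three columns by name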


Drop data

Drop observations: e.g. drop the row with index 1 and 3
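For example:

    data_dropped_rows = data.drop([1, 3])   # drops the rows with index labels 1 and 3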


Drop variables: e.g. drop the columns with headers "GEOGCMA1" and "ADM_N09". Note that in the command, axis=1 means the labels refer to columns, and axis=0 means the labels refer to rows (the index).
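For example:

    data_dropped_cols = data.drop(['GEOGCMA1', 'ADM_N09'], axis=1)   # axis=1: drop columns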


Transform data

Remove duplicates, if any (in this case not)
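For example:

    data_no_dups = data.drop_duplicates()   # removes fully duplicated rows, if any exist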


Replace values: e.g. replace 1 with 7 for the whole dataset (the dimension doesn’t change – only all the 1’s are replaced with 7)
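For example:

    data_replaced = data.replace(1, 7)   # every value 1 becomes 7; the shape is unchanged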

Rename index: e.g. change the index name from "0" to "person1"
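For example:

    data_renamed = data.rename(index={0: 'person1'})   # relabel index 0 as 'person1'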

Discretization and binning: set up several bins (groups)

For this command, we tested dividing into groups using variable AUDG06, "Number of drinks per day on days when drank in past 12 months".
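A sketch using pandas' cut function; the bin count of 4 matches the four categories described below:

    bins = pd.cut(data['AUDG06'], 4)   # split into four equal-width bins
    bins.value_counts()                # observations per bin; bin labels show the break values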


At the bottom of the result, pandas displays the break values (bin edges) for the four categories.

Create new variables

E.g. create a new column "CITIZEN" and set its value to 1 for all observations
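For example:

    data['CITIZEN'] = 1   # new column, value 1 for every observation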


In the Variable Explorer you can see that a new column "CITIZEN" has been added at the end of the dataset.

Rename variables

E.g. rename two columns: "ADM_RNO" to "ADM" and "GEO_PRV" to "GEO" (the dimensions don't change; only the two column names are changed)
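For example:

    data = data.rename(columns={'ADM_RNO': 'ADM', 'GEO_PRV': 'GEO'})   # rename two columns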

Merge two data sets

E.g. merge on the unique shared key: subset_7 contains the first 10 observations with columns 0-99; subset_8 contains the first 10 observations with columns 1, 100, 101, and 102. The column subset_7 and subset_8 share is column 1, and we'll merge the two subsets on it.
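A sketch of how these subsets and the merge might be built; note that pd.merge with no 'on' argument joins on all shared columns, which here is just column 1:

    subset_7 = data.iloc[:10, 0:100]                # first 10 rows, columns 0-99
    subset_8 = data.iloc[:10, [1, 100, 101, 102]]   # first 10 rows, columns 1, 100, 101, 102
    merged = pd.merge(subset_7, subset_8)           # joins on the shared column (column 1)
    merged.shape                                    # (10, 103)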

The merged dataset should have dimensions 10*103 (columns 0-102).

A few last words

This concludes the Cleaning Data in Python tutorial, but it's only the beginning for an aspiring data analyst. The resources mentioned throughout this tutorial, such as Python for Data Analysis and the pandas documentation, are good places to continue.

Enjoy analyzing data!
