Although the Map & Data Library is physically closed, we are still available remotely and happy to help. We can conduct consultations using online teleconferencing software. Please feel free to contact us at mdl@library.utoronto.ca or use our help form. We have a number of tutorials available, are still supplying software licenses, and have compiled a list of resources for working with COVID-19 data.

COVID-19: Updates on library services and operations

Cleaning Data in Python

SET UP ENVIRONMENT

Software

If you are using Python for data analysis, we recommend the Anaconda Scientific Python Distribution. It is completely free, and ideal for processing data as well as for performing predictive analysis and scientific computing. You can get the latest version of Anaconda at https://www.anaconda.com/distribution/. It's available for Windows, macOS, and Linux. For more information about getting started with Anaconda and other Anaconda-related support, please refer to https://docs.anaconda.com/anaconda/

After you download the Anaconda distribution to your local machine, install it and open Anaconda Navigator. A number of applications are available by default. For this tutorial, Spyder will be used.

Spyder is an open source cross-platform Integrated Development Environment (IDE) for data science. It integrates essential libraries for data analysis, including NumPy, SciPy, and pandas  all of which will be used in this tutorial.

Launch Spyder.

You will see a similar interface as below: the left part of the application is a text editor. The right part is the console. You can write your code in both these areas. The text editor allows you to write multiple lines of codes, edit them, save them and execute them all together. The console allows the input and execution of (often single lines of) code without the editing or saving functionality.

For this tutorial, we are writing our code directly in the console.

The results of our code will also appear in the console.

If you are brand new to Spyder, the application comes with a helpful tutorial. This can be launched by clicking Help (at the top menu) > Spyder tutorial.

This Help tool can also be accessed by clicking the Help tab (above the console, on the right). This tool allows you to search for different commands or python objects, and get more information and/or instructions about them. Alternatively, you can highlight commands or objects written in the editor/console and type Ctrl + I.

More functions of the Help section will be introduced later in this tutorial.

Lastly, you can personalize the font, background color, theme and other appearance properties of Sypder. This can be done by clicking Tools (at the top menu) > Preferences. Below is an example of a dark background theme in Spyder.

Data analysis packages in Python

For data analysis in Python, we recommend several libraries. A Python library is a collection of functions and methods that allow you to executre complex actions without writing long lines of code. All these libraries are included in the Spyder platform and can simply be imported and used. These include:

  • pandas: a library providing high-performance, easy-to-use data structures and data analysis tools. Complete documentation can be found here (https://pandas.pydata.org/pandas-docs/stable/index.html). You can also check out a 10-minute tour of pandas (https://www.youtube.com/watch?v=_T8LGqJtuGc) from Wes McKinney (https://wesmckinney.com/)
  • NumPy: the fundamental package for scientific computing
  • matplotlib: the most popular Python library for producing graphs and other 2D data visualizations
  • SciPy: a collection of packages addressing a number of different standard problem domains in scientific computing
  • sklearn: a machine learning library

 

CLEANING DATA IN PYTHON

Here is a simple example (from loading data till handling missing values) for how to clean data using Python.

Download Dataset

The dataset we use here is the Canadian Community Health Survey, 2012: Mental Health Component. You can directly access the dataset from here: https://search1.odesi.ca/#/details?uri=%2Fodesi%2Fcchs-82M0013-E-2012-me..., or you can access odesi.ca and search for CCHS 2012, and look for the one with Mental Health Component.

After loading the page, click "Explore & Download"

In the new page, find the "Download" button on the right top corner.

In the download page, select the data format to "Comma Separated Value file" for a csv file that python can work on, check the "Include documentation" box, and click "DOWNLOAD" to download the dataset.

Unzip the files in the folder, you will see a csv file and a PDF file in the folder. The csv file contains the data we will be cleaning, and the PDF file is the metadata contains the codebook for all variables in this dataset. This PDF file helps you to understand all the variables and values better.

Load dataset into Spyder

After you run the program, you can type the scripts below to set up the working directory. In python, codes after "#" symbol are considered as notes for the code.

The first line of code imports the module that contains mischellaneous operating system interaces, including change working directory and print current working directory (Line 2 and Line 3). Line [2] changes the working directory to where the folder path where the files are saved to. Line [3] confirms the working directory by printing it.

Above the console in the Help session, you can take a look of the files loaded in this directory under the "File Explorer" tab.

You can then input the below scripts in the console to load dataset into the program. Note that the csv file has to be in the current working directory, otherwise an error will be raised.

Line [4] imports the pandas library to work with for this tutorial. Line [5] defines the variable "data" and use the read_csv command in pandas to load your CCHS2012 dataset to the system.

Now above the console in the Help section, by clicking on "Variable Explorer" you will be able to see the details of each variable defined in the current console session, including data type and size.

Double-click the variable will open a new window to display the whole dataset.

You can also see the dimension of the dataset using the shape command (in this case, the dataset has 25113 rows and 586 columns)

We can see the descriptive stats of columns with numeric values, using describe command:

If say one of the columns which is expected to contain numbers include letters by mistake, you can set them to NaN (missing data) at the time of reading the data.
Suppose the letter included by mistake is 'g':

For details of how missing data are recorded, refer to the codebook for each variable. Here is an example of a codebook for "Variable AUD_01: Drank alcohol in past 12 months". Among of the responses of this variable, if the value recorded is '8', that means this respondent refused to reply this question.

If the data contains missing data indicated by 'NA', you can read the data in as follows:

You can find all the rows where a specific column holds NaN values as follows. Though in this case, the dataset does not contain any NaN values.

Subset

By observations: e.g. get the first 100 observations (rows)

...                 ...            ...              ...                ...        ...

By variables: e.g. get the first 100 variables (columns)

...                       ...               ...              ...            ...           ...           ...

By both: e.g. get the first 100*100 of the original dataset

...                  ...             ...             ...           ...           ...

By variable name: e.g. get column “VERDATE”, “ADM_RNO”,  and “GEO_PRV”

...            ...                     ...                     ...

Drop data

Drop observations: e.g. drop the row with index 1 and 3

...                       ...               ...           ...               ...               ...        ...

Drop variables: e.g. drop columns with header "GEOGCMA1" and "ADM_N09". Note that in the command, axis=1 means this header is indicating a column, and axis=0 means this header is indicating a row/index.

...                       ...               ...           ...               ...               ...        ...

Transform data

Remove duplicates, if any (in this case not)

...                       ...               ...           ...               ...               ...        ...

Replace values: e.g. replace 1 with 7 for the whole dataset (the dimension doesn’t change – only all the 1’s are replaced with 7)

Rename index: e.g. change index name from "0"to "person1"

Discretization and binning: set up serveral bins (groups)

For this command, we tested dividing groups under the "Variable AUDG06: Number of drinks per day on days when drank in past 12 months"

...

At the bottom of the result, it displays the break values of the dataset for the four categories divided.

Create new variables

E.g. create a new column “CITIZEN” and make the value for all observations 1

...

You can see a new column "CITIZEN" is added in this image at the end of the dataset

Rename variables

E.g. rename two columns: "ADM_RNO" to "ADM"; "GEO_PRV" to "GEO" (the dimension doesn’t change – only the two column names are changed)

Merge two data sets

E.g. merge on the unique shared key: subset_7 is the first 10 observations with first columns 0- 99; subset_8 is the first 10 observations with column 1, 100, 101 and 102; the column subset_7 and subset_8 share is column 1, based on which we’ll merge the two subsets.

The merged dataset should have dimension 10*103 (column 0-102)

A few last words

This concludes the Cleaning Data in Python tutorial, but it’s only the beginning for an aspiring Data Analyst. Here we list some resources that you might find helpful:

Enjoy analyzing data!

Technique: 
Utilities: