Table of Contents
- Set up environments
- Data analysis packages in Python
- Clean data in Python
- Load dataset into Spyder
- Drop data
- Transform data
- Create new variables
- Rename variables
- Merge two datasets
- Handle missing values
- A few last words
Note: pandas is a powerful open source Python data analysis library that is widely used for data cleaning. Complete documentation can be found here. Please also check Wes McKinney's video, a 10-minute tour of pandas.
If you are using Python for data analysis, we recommend the Anaconda Scientific Python Distribution. It is completely free and ideal for processing data, predictive analysis, and scientific computing. You can get the latest version of Anaconda at https://www.continuum.io/downloads. For more information about getting started with Anaconda and other support, please refer to the links under LOOKING FOR MORE at http://continuum.io/downloads.
After you download the Anaconda distribution to your local machine, install it and run Spyder from the command line. You will see an interface similar to the one below: the left pane is a text editor where you can write scripts, and the right pane is the console where your analysis results appear when you run your scripts.
Note: you can personalize the font, background color, theme, etc. through Tools -> Preferences. Also, you’ll find the tutorial helpful if you are new to Spyder.
For data analysis in Python, we recommend several packages (all included in the Spyder platform):
- NumPy: the fundamental package for scientific computing
- matplotlib: the most popular Python library for producing plots and other 2D data visualizations
- pandas: a library providing high-performance, easy-to-use data structures and data analysis tools
- SciPy: a collection of packages addressing a number of different standard problem domains in scientific computing
- sklearn: machine learning in Python
You’ll also find the book Python for Data Analysis (http://shop.oreilly.com/product/0636920023784.do) helpful.
Here is a simple example (from loading data through handling missing values) of how to clean data using Python.
Note: the dataset we use here is the Canadian Community Health Survey (CCHS), 2012: Mental Health Component. For how to download this dataset from odesi, please refer to this guide: https://data.library.utoronto.ca/finding-and-working-microdata#4d
After you run the program, you can type the scripts below into the console to load the dataset into the program. Note that the CSV file must be in the current working directory; otherwise an error will be raised.
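The loading step can be sketched as follows. Since the CCHS file itself is not distributed with this tutorial, the example below reads a tiny in-memory stand-in with a few hypothetical column names taken from later sections; with the real file you would pass the path "cchs2012.csv" instead.

```python
import io
import pandas as pd

# Stand-in for cchs2012.csv: three rows with hypothetical columns.
# The real CCHS file has 25113 rows and 586 columns.
csv_text = """VERDATE,ADM_RNO,GEO_PRV
2012-01-01,1,35
2012-01-01,2,24
2012-01-01,3,35
"""

# With the real file this would be: cchs = pd.read_csv("cchs2012.csv", header=0)
cchs = pd.read_csv(io.StringIO(csv_text), header=0)
print(cchs.head())
```

The `header=0` argument tells pandas that the first row of the file contains the column names.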
We can also see the dimensions of the dataset using the shape attribute (in this case, the dataset has 25113 rows and 586 columns).
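For example, on a small sample DataFrame (the full CCHS file would report (25113, 586)):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.shape)  # (rows, columns) -> (3, 2)
```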
We can see descriptive statistics for the columns with numeric values using the describe() method:
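A minimal sketch on made-up numeric columns:

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40], "score": [1.0, 2.0, 3.0]})
# describe() reports count, mean, std, min, quartiles, and max per numeric column
print(df.describe())
```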
If a column that is expected to contain numbers includes letters by mistake, you can set those values to NaN (missing data) at the time of reading the data.
Suppose the letter included is 'g':
In: cchs = pd.read_csv("cchs2012.csv", header=0, na_values=['g'])
If the data contains missing data indicated by 'NA', you can read the data in as follows:
In: cchs = pd.read_csv("cchs2012.csv", header=0, na_values=['NA'])
You can find all the rows where a specific column holds NaN values as follows:
In: cchs[ pd.isnull(cchs['GEO_PRV'])]
By observations: e.g. get the first 100 observations (rows)
By variables: e.g. get the first 100 variables (columns)
By both: e.g. get the first 100*100 of the original dataset
By variable name: e.g. get column “VERDATE”, “ADM_RNO”, and “GEO_PRV”
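The four subsetting operations above can be sketched as follows, using a hypothetical 5x5 stand-in for the CCHS DataFrame (on the real data you would slice with :100 instead of :3):

```python
import numpy as np
import pandas as pd

# Hypothetical 5x5 stand-in for the CCHS DataFrame
cchs = pd.DataFrame(np.arange(25).reshape(5, 5),
                    columns=["VERDATE", "ADM_RNO", "GEO_PRV", "GEOGCMA1", "ADM_N09"])

first_rows = cchs.iloc[:3]        # by observations: the first 3 rows
first_cols = cchs.iloc[:, :3]     # by variables: the first 3 columns
block = cchs.iloc[:3, :3]         # by both: the first 3x3 block
by_name = cchs[["VERDATE", "ADM_RNO", "GEO_PRV"]]  # by variable name
```

`iloc` selects by integer position, while the bracket syntax with a list of names selects columns by label.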
Drop observations: e.g. drop the row with index 1 and 3
Drop variables: e.g. drop variables "GEOGCMA1" and "ADM_N09"
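Both drop operations use the drop() method; a small sketch on a hypothetical stand-in DataFrame:

```python
import pandas as pd

cchs = pd.DataFrame({"GEOGCMA1": [1, 2, 3, 4],
                     "ADM_N09": [5, 6, 7, 8],
                     "GEO_PRV": [9, 10, 11, 12]})

no_rows = cchs.drop([1, 3])                           # drop observations with index 1 and 3
no_cols = cchs.drop(["GEOGCMA1", "ADM_N09"], axis=1)  # drop the two variables
```

By default drop() works on rows (axis=0); passing axis=1 drops columns instead.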
Remove duplicates, if any (in this case not)
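A sketch with a small DataFrame that does contain a duplicate row:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})
deduped = df.drop_duplicates()  # removes the duplicated second row
```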
Replace values: e.g. replace 1 with 7 for the whole dataset (the dimension doesn’t change – only all the 1’s are replaced with 7)
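For example:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [1, 3]})
replaced = df.replace(1, 7)  # every 1 becomes 7; the shape is unchanged
```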
Rename index: e.g. change index from 0 to "person1"
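A sketch using rename() with an index mapping:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
renamed = df.rename(index={0: "person1"})  # only the label for row 0 changes
```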
Discretization and binning: set up several bins (groups)
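This can be done with pd.cut(); the bin edges and labels below are made up for illustration:

```python
import pandas as pd

ages = pd.Series([5, 15, 25, 35, 45])
bins = [0, 18, 35, 60]  # hypothetical bin edges
groups = pd.cut(ages, bins, labels=["child", "young adult", "adult"])
```

By default the bins are right-inclusive, so 35 falls into the "young adult" bin.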
Detect and filter outlier: e.g. 5900 -> upper bound
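One common approach is to cap values above the bound, sketched here with made-up numbers (5900 is the bound mentioned above):

```python
import pandas as pd

s = pd.Series([100, 200, 8000])
upper = 5900      # upper bound
s[s > upper] = upper  # cap outliers at the bound
```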
E.g. create a new col “CITIZEN” and make the value for all observations 1
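Assigning to a new column name creates the variable; a sketch on a hypothetical two-row stand-in:

```python
import pandas as pd

cchs = pd.DataFrame({"GEO_PRV": [35, 24]})
cchs["CITIZEN"] = 1  # new variable with the value 1 for all observations
```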
E.g. rename two columns: "ADM_RNO" to "ADM"; "GEO_PRV" to "GEO" (the dimension doesn’t change – only the two column names are changed)
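This uses rename() with a columns mapping:

```python
import pandas as pd

cchs = pd.DataFrame({"ADM_RNO": [1], "GEO_PRV": [35]})
renamed = cchs.rename(columns={"ADM_RNO": "ADM", "GEO_PRV": "GEO"})
```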
E.g. merge on the unique shared key: subset_7 contains the first 10 observations with columns 0-99; subset_8 contains the first 10 observations with columns 1, 100, 101, and 102. The column subset_7 and subset_8 share is column 1, on which we'll merge the two subsets.
The merged dataset should have dimensions 10*103 (columns 0-102).
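The same idea at a smaller scale, with two made-up subsets that share only the key column "ADM_RNO":

```python
import pandas as pd

# Hypothetical subsets sharing one key column
subset_a = pd.DataFrame({"ADM_RNO": [1, 2, 3], "GEO_PRV": [35, 24, 35]})
subset_b = pd.DataFrame({"ADM_RNO": [1, 2, 3], "CITIZEN": [1, 1, 0]})

merged = pd.merge(subset_a, subset_b, on="ADM_RNO")
```

The shared key appears once in the result, so the merged frame has 3 rows and 3 columns, just as merging the 10x100 and 10x4 subsets above yields 10x103.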
We use a subset (10*10) of the dataset for this task
Drop missing values: drop the observations with missing values
Fill missing values: with column mean
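Both operations can be sketched on a small made-up frame with one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

dropped = df.dropna()           # drop observations with any missing value
filled = df.fillna(df.mean())   # fill NaN with the column mean (here 2.0)
```

dropna() discards the row containing the NaN, while fillna(df.mean()) keeps all rows and substitutes each column's mean for its missing values.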
This concludes the Cleaning Data in Python tutorial, but it's only the beginning for aspiring data analysts. Here are some resources that you might find helpful along the way:
- Python for Data Analysis: http://shop.oreilly.com/product/0636920023784.do
- Package websites: NumPy: http://www.numpy.org/; matplotlib: http://matplotlib.org/; pandas: http://pandas.pydata.org/; SciPy: http://www.scipy.org/; sklearn: http://scikit-learn.org/stable/
- Getting Started with Python for Data Science: https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
- Google’s Python Class (for people new to Python): https://developers.google.com/edu/python/
- OpenIntro Statistics: https://www.openintro.org/stat/textbook.php
- Last but not least, Google is your friend: the questions you encounter have probably been asked before.
Enjoy analyzing data!