Matching Census Data to Postal Codes using SPSS

 

This guide shows you how to match census data to postal codes, and how to merge them in a SPSS file.  We will select income variables from the 2006 Census and PCCFs from Toronto.

Summary of Steps:

  1. Obtain the census data from CHASS Census Analyzer and save it as a .sav file
  2. Download postal code data through the Map & Data Library website
  3. Merge the two files in SPSS
  4. (Optional) Narrow information to your area of interest

 

Introduction

The Canadian Census does not use postal codes as a geographic unit. If researchers want to match census data to postal codes, they need to use the Postal Code Conversion Files (PCCF). The PCCF is a tool that allows one to combine census data and postal codes in one file (see below). This guide shows you how to match census data to postal codes, by merging two sets of data in a SPSS file. We will select income variables from the 2006 Census for the Toronto region and combine them with the PCCF to obtain an income-related dataset with information down to the postal-code level.

The most appropriate levels of census geographies for linking postal codes are dissemination areas* (DAs), which cover all of Canada, and census tracts (CTs), which occur only in urban areas.

*note that prior to 2001, DAs were called enumeration areas.

The photo below shows the resulting dataset in SPSS after a merge has been performed and extraneous data removed.

 

STEP 1: Obtain the census data from CHASS Census Analyzer and save it as a .sav file

**Please note that part of this step works best in the Firefox browser. If you have it, it is easiest to start this step in this browser. If not, further instructions are provided later on.**

Downloading Data from CHASS

Go to the Map & Data Library homepage and click the link for the CHASS Census Analyzer.

 

To access census data on the DA-level (Dissemination Area), click on the link to Enumeration area/ Dissemination area.

An open browser window containing the CHASS Census Analyzer. Click the "Enumeration area/Dissemination area" link located under "Census Profiles Tables".

CHASS has divided up the census variables into subheadings so as to facilitate selecting the variables. We will be selecting income variables from the 2006 census, so click on Income and earnings and housing and shelter costs under the 2006 heading. 

A browser window with the Profiles of Dissemination Area page. Click "Income and earnings and housing and shelter costs" located under 2006 cumulative.

You can now select the census data and location you are interested in.

In the Census Division window, select Toronto. If you want to add other census divisions, press your Ctrl key and make further selections.

A browser widow with the subsetting selections for Income and earnings and housing and shelter costs and the 2006 census. For census divisions we selected Toronto.

Select the following variables (and/or others of interest to you):
Be sure to check the "DAuid" box as well.

More Subsetting categories. Under "Census Category" we selected: 1. Total income in 2005 of pop. 15 yrs and over. 2. Median Income $. 3. Average Income $. Under "Include with Result" make sure you select: "DAuid". This variable is needed to merge the files.

Under "Select the output format," choose "SPSS". To submit the request for this data, click the "Submit Query" button located on the bottom of the page.

A Browser showing more subsetting options. Under "Select the output format" choose "SPSS". To create the file layout click the "Submit  Query" button located on the bottom of the page.

Once you have submitted your query, the data will be displayed in your browser:

A browser with the results of our query. If you scroll to the bottom the variable labels you will need when formatting you SPSS file are displayed.

It is now time to download the data. This part of our process works most easily in the Firefox browser.

The "Save As" dialog box will pop up. Ensure that the type "All Files" is selected and that the file name ends with the extension ".sav" (you may need to make this change manually). This is the extension used for SPSS data. To save the file click the "Save" button.

Now that you have downloaded your census data and saved it as a .sav file, you will open this file in SPSS to check that it has downloaded correctly and to instruct SPSS on how it should be downloaded. Make sure to keep the browser window open so you can later use it to name and identify your columns.

Go to the location where you have saved your file, and double-click on it. Be patient; SPSS takes a few minutes to load.

 If your default program is not set as SPSS, you may need to specify "Open with" > "SPSS."

Once it is ready, you will see a Text Import Wizard window, as can be seen below.

Importing & Troubleshooting

At the bottom of this window is a box that allows you to preview your data. Scroll through this to check that there is data contained in your file and not just descriptive information, which does happen (especially if you have used a browser other than Firefox). If your data does not seem to be contained in the file you've saved, not to worry! SPSS text that is displayed after the execution of the query in CHASS is completely identical across all browsers. Consider the following instructions between asterisks for more information about how to save your census data from specific browsers:

***  Internet Explorer: Save the data by clicking on "View Source" and then saving it as a .sav from the native text editor in your computer. Chrome, Opera, or other: Copy the text from the browser window into a text editor (for instance, Sublime Text: www.sublimetext.com/) and then save it as a .txt from there. The following videos illustrate how to then make a .txt file into a .sav file in SPSS: On Windows: https://www.youtube.com/watch?v=pCCiVynBoSY  On a Mac: https://www.youtube.com/watch?v=0rPf2uHfzvU  Make sure that you consult the instructions in the tutorial below in the import stage so that your .sav is properly formatted for our use. ***

Once any issues have been resolved, select "Next" to move on to step 2 / 6. Except where specified, keep the SPSS defaults selected during the import stage.

Make sure to select "No" where it asks "Are variable names included at the top of your file." You may also be asked which symbol to use for decimals. "Period" is the convention in English.

Enter the number of the line after the "BEGIN DATA" line in your file.

Step 3 of the Text Import Wizard. This step is important you need to tell SPSS which row the data starts on. You can get this information from the "Data Preview" pain on the bottom of the window. Input the number in the box following the sentence "The first case of data begins on which line number?". Click the "Next" button to move to the next step.

In this step you can specify how you would like certain types of information to be displayed (ie. quotation marks around text) and preview those changes. If unsure of what to do, leave the defaults and continue.

Step 5 involves naming your columns and is critical. Go back to the browser window to see what column displays which data. Then, return to the SPSS Import Wizard to click on that column in the data preview window and change its name in the Variable name box.

Note that SPSS will not allow you to change these names later, and will not let you use spaces in the names of your variables. The naming convention is to use underscores ( _ ) between words instead. Also be aware that the names of columns in SPSS and in the browser might not match up (ie. COL0 corresponds with V1). Just be sure to enter them in order from first to last. Once all have been renamed, select "Next."

Once step 6/6 has been completed, select "Finish" and wait for a Data Editor window to load with your data.

Step 6 of the Text Import Wizard. Click the "Finish" button and your data will be displayed.

The result should look something like the window below.

Dataset Cleanup

Next, we are going to clean up our data by checking for any information in the datasets that is not a part of the datasets themselves. We will begin by scrolling to the end of the Census dataset. You may notice a few extraneous fields here filled with COL0, COL1, and so on. Additional information like this should be removed. They are clearly not observations from our dataset. Select all the rows containing this extra information (even those that may appear blank, in this case rows 3558 and 3565), right click, and select "Clear."

We will then switch into variable view to delete extra variable columns that have been created, with no information in them.

An SPSS data table showing the Census data we downloaded from CHASS. There are empty variable columns. These can be cleaned up in variable view. Click the "Variable View" tab located at the bottom of the screen.

The Variable View. Each variable is represented by a row. Click the numbers corresponding to each variable you are removing and use Ctrl + click to select more than one variable. Remove the variables by right clicking and then selecting "Clear".

Make sure to save your file. Once saved, STEP 1 is complete!

 

 

STEP 2: Download postal code data through the Map and Data Library Website

It is now time to get the Postal Code Conversion Files (PCCF). To do so, go back to the Map & Data Library homepage and enter"PCCF" into the search bar.

A Browser window containing the Map and Data Library, Data Home Page. Find the PCCF file by typing "pccf" in the search box on the right side of the screen and click the "Search" button.

Your search results will be organized into several categories, as can be seen below. Select the "Census of Canada" result which corresponds with the year of the census you are working with.

Once on the page for the appropriate year, select the most recent "SPSS data file." Don't be alarmed if this information is listed under 2006 but found under a subheading with a non-2006 date (ie, May 2011, as can be seen below). This is because the files for different census years are regularly updated. If you would like to better understand the methodology behind PCCFs, you may want to review the reference guide that can also be found in the search results.

On the next page, you will be asked to log on with your UTORid.

Save the data file onto your computer. The file will be zipped and you will need to extract it before use. To unzip, right-click, and use the appropriate "extract" option.

While it used to be that PCCF data consisted of .sps and .txt files when downloaded (more raw, workable files), what we were given in our download, once unzipped, was a .sav file (a ready-to-use SPSS file). Depending on the different census years and which specific results one clicks on, the given formats for PCCF files may vary. A quick search on the internet should provide you with instructions on how to convert .sps and .txt files into a .sav file if required. We will move forward assuming the use of a prepared .sav file.

We will now return to the folder where you extracted and saved the PCCF data file (see below). Good work. STEP 2 is complete!

 

STEP 3: Merging of the two files

Open the PCCF .sav file in SPSS.

If the warning below appears, click "yes" to allow the program to automatically set the width of string variables to the appropriate length.

After a little loading, your dataset should open up. Both a "Data Editor" and "Statistics Viewer" window should be open.

If not already open, open up your census .sav file. To do so, go to File > Open > Data and navigate to the other dataset, selecting open.

Checking your datasets

In order to avoid potential issues, we are going to begin by checking if the DAuid variable has the same "Type" and "Length" in the two different datasets. If these do not correspond, it can make for issues down the road, so it's always good to check. Follow the steps below to do so:

1. Return to the Data Editor window for your PCCF data.

2. Select "Variable View" in the bottom lefthand corner to see a list of all the variables in the dataset.

3. Find the variable named DAuid (which might also be found written in lowercase, ie. dauid) in the "Name" column (given that it is the name of a variable). In the next column over, we can see the "type" of each of our variables. You will notice that in my case, the "type" of dauid is string (meaning text-based). Also take note of the "width" of the dauid variable, which is the maximum number of characters that will be allowed for values in this column.

4. Now, go to our other data set (the census data from CHASS) and check its "type" and width" in the same way by clicking  "Variable View" once more to inspect the properties of the DAuid variable in this other dataset. You will notice that in my case, the "type" of dauid is numeric (meaning based on digits).You will also be able to observe the "width" of the variable, which in my case was 8 in both datasets.

5. It should now be clear that the "different type or length" error can be resolved by making it so that the DAuid variable has the same type in both datasets. We could change the dauid (in PCCF) from string to numeric, or the DAuid (in census data) from numeric to string. While either option could probably work, because we will not be using these variables for numeric analysis beyond sorting, let's do the second of the two. In your census dataset, click on the "Numeric" type in the DAuid row and then on the little blue box with ellipses. In the "Variable Type" window, select string, and press OK.

6. Do one last visual check for each dataset, ensuring that there is no extraneous information at the bottom of the dataset or extra variables in either (You may recall that we did both these things before saving our census dataset the first time). Save your documents.

Sorting Datasets

Next, we are going to sort each dataset into ascending order based on its DAuid column. This step is required to be able to merge the two datasets, which we will get to soon.

I will begin by sorting the Census dataset. To sort a dataset, go to Data > Sort Cases.

In the box that pops up, do the following:

1. Select the DAuid variable from the list. 2. Click the arrow to move it over to the right-hand square. 3. Select "Ascending." 4. Click OK.

You will be presented with the output file for your census data, which simply shows in code what changes were made when you executed the previous step. Now you can move on to sort your second dataset.

Follow the same instructions as above for the PCCF dataset. If the variables show up looking long and unfamiliar, you can change this by right-clicking any variable and selecting "Display Variable Names," allowing you to find dauid more easily.

Perform the same four steps as before.

After having completed this step, you will again be presented with a list like the one below in your Output file. This describes the commands that you have executed thus far through the Data Editor window.

Save both datasets. Merging may fail if you do not.

Merging the Two Datasets

We are now going to try to merge the datasets, based on the DAuid variable.Go back to your PCCF dataset and click on Data > Merge Files > Add Variables.

Use the browse feature to locate the census dataset that you are merging with the PCCF. Click "Open" in this window and continue in the next.

The add variables window.

A window called "Add Variables" should then pop up. Perform the following steps in order to indicate that the two datasets (census data and postal code data) should be matched based on the DAuid variable.

A) Click on DAuid (+) in the "Excluded Variables" box.

B) Select the "Match cases on key variables" check box and make sure the button for "Non-active dataset is keyed table" is selected.

C) Click the bottom right arrow to transfer the DAuid variable to the "Key Variables" box.

If you are presented with a warning at this stage, check the following: a) your DAuid variables have been changed to the same "type" and "width" and this change has been saved for BOTH datasets; and b) your data has been sorted in ascending order according to the dauid variable in BOTH datasets, and saved in BOTH datasets.

D)  (**not pictured below**) Select the "Cases are sorted..." check box and keep the default bubble filled in for "Both files provide cases."

Click OK at the bottom of your "Add Variables" window. You may be given a warning about sorting your datasets: you will remember that we have already sorted each dataset in ascending order based on the dauid variable, so that is fine.

After executing this command, you will be presented with the Output file once again. This may contain a warning (circled in red) along the lines of different cases in the merged datasets are not uniquely identified. In our case, this makes some sense, because PCCF files typically contain several rows that share the same dauid code (many postal codes are within one Dissemination Area [DA]), while the census data downloaded from CHASS contains only one row per DA.

Return to your Data Editor, select Data View, and scroll to the right. Your columns from the CHASS dataset should now be a part of your PCCF file! Check the file to confirm that the variables from each dataset are represented.

These datasets have been merged; save your work and then STEP 3 is complete! You may want to save this file under a new name to indicate to yourself that it contains your merged data.

 

 

STEP 4: (Optional) Narrow information to your area of interest

You might have noticed that the rows at the top of your file do not have any data associated with the census-related columns.You may also have noticed that the PCCF file you are working with is enormous-- mine has over a million rows. The PCCF dataset is so much bigger than your census dataset because 1)  the PCCF dataset contains information for all of Canada, while you might remember that the data we downloaded from CHASS is specific to Toronto and 2) Dissemination Areas (DAs) tend to contain more than one postal code, so although the original census dataset contained one row for each DA, these rows are duplicated for each postal code under that DA when the files are merged.

Depending on the work and analysis you plan to do, it may be useful to narrow your dataset to only the variables of use to you and only the cases that have your data from CHASS associated with them. Limiting the data often allows SPSS to run and load more quickly. A reminder of how to delete variables can be found below, followed by a simple way to get rid of unnecessary cases.

Getting Rid of Unnecessary Variables

A) Begin in Data View. In order to see descriptions of variables beyond their variable names, hover over their title.

B) Once you have determined that a variable will not be relevant to your analysis, right click its title and select "Clear." Repeat as necessary.

C) Note that the same thing can easily be done in Variable View as well, by looking to the "Label" column for descriptors of each variable and selecting "Clear" for the rows (variables) you do not need.

Getting Rid of Unnecessary Cases (Rows)

A) In the Data Editor, go to Data > Select Cases. We are going to select and then extract only the cases (rows) that have information associated with the census data. 

In the window that pops up, click "If condition is satisfied" under Select.

B) We will now construct a query that tells the program how to select cases of interest to us. Using our "DA" or "Province" variable to select the cases we care about makes the most sense, because any of the three income variables pulled from CHASS (and originating from Statistics Canada) might have null values, which could cause us to miss them in the selection. Begin by scrolling through the list of variables to select "DA," then clicking arrow to bring it into the query box at the top.

C) Next, add the "greater than or equal to" and "zero" to the query. This resulting query instructs the software to select the cases that have values greater or equal to zero in the DA column. Because the DA column originates from the census dataset, its value will be "null" (or nothing) for all of the rows not associated with census data, which means we are selecting only the cases that matter to us.

D) Next, we are brought back to the previous window, where we will choose how we will go about isolating the Toronto-specific census data under the "Output" section. There are three options: Filter out cases you have not selected (which essentially strikes them out); Copy the selected cases into a new file; and Delete all the unselected cases.

Which option you choose depends on your aim. If you are wary of replacing the original merged dataset but would like to have a smaller dataset to work with, the copy option is probably safest. After selecting your output option, click "OK" to apply your changes.

Save your new (or pared down) document, and there you have it! Best of luck with your work!

------------------------------------------------------------------------------------------------------------------------------------------------------------------

Data format: 
Utilities: