Table of Contents:
Understanding the Revelio Labs Datasets
Working with the Revelio Labs Datasets
Accessing the Environment
Querying the Datasets
Understanding the Revelio Labs Datasets
Note: Data Dictionaries for the six products listed below can be found here. The University of Toronto does not license the Transitions Dataset.
Workforce Dynamics
This dataset provides an overview of a company's composition. This includes headcounts, inflows, and outflows for every unique position, segmented by job, seniority, geography, salary, education, skills, and gender & ethnicity. Coverage is global, from 2007-2024.
Job Postings
This dataset includes active postings, new postings, removed postings, salaries, and full text of postings for any company, segmented by various employee characteristics (occupation, seniority, geography, keywords, skills, etc.). This dataset is pulled from over 350,000 company websites, all major job boards, and staffing firm job boards. Coverage is global, from 2021-2024. Note that the University of Toronto does not license the aggregate "Job Postings Dynamic" dataset, but only the individual-level data. Please also note that although it is not listed in the data dictionary, these datasets do contain a "description" field with the original job description text, when available.
Sentiment
This dataset includes employee reviews for all companies, with the full text of each review split into positive and negative text. Reviews are mapped to various employee characteristics (occupation, seniority, geography). Coverage is global, from 2008-2024.
Layoff Notices
This dataset includes the posting date and effective date for all layoffs at every company in the United States. Coverage is limited to the US; start dates range from 2008 to 2020 depending on the state, and data extends to 2024.
Individual Level Data
This dataset contains data on the full professional history of individuals, including their roles, education, skills, gender, ethnicity, salary, seniority, and geography. Coverage is global, from 2008-2024.
Company Reference Data
This dataset contains information on companies that are covered by or referenced in the five other datasets listed above.
Working with the Revelio Labs Datasets
Creating an Account
- Get a Compute Canada account
- Opt into the Niagara & Mist service
- Upload an SSH key to CCDB
- Request Access to Revelio
These steps only need to be completed once to gain access, and approval should normally take at most a few days.
1. Get a Compute Canada account
Please visit the Compute Canada Database (CCDB) website and apply for an account (takes a day or two to approve).
Note: Students and postdocs need to be sponsored by their supervisor, who would need to already have a Compute Canada account (or create one first). This is a simple process, requiring the student to complete a form and their supervisor to approve the sponsorship. Computing resources would be shared under the sponsor's allocation. Please contact the Map & Data Library for assistance, or for more information.
2. Opt in to Niagara & Mist service
After the Compute Canada account is approved, you should opt in to the Niagara & Mist service on the CCDB website (or use this direct link). The opt-in will be approved manually after one or two days, which will give you access to the Niagara supercomputer and other SciNet systems, as long as the SSH key has been uploaded to the CCDB website (see next step).
3. Upload an SSH key to CCDB
Next, locate the Manage SSH Keys option on your account page on the CCDB website (or use this direct link) and upload your public SSH key. Instructions on creating SSH key pairs from the SciNet Wiki can help you with this process. This wiki also contains pages with more information on creating SSH key pairs specifically on a Windows machine, or on Mac or Linux machines. The Map & Data Library also provides a quick start tutorial for creating SSH key pairs on a Mac, if you need more help.
4. Request Access to Revelio
Access to the Revelio dataset is by request only. Please contact us to request access.
Note there are other licensed datasets hosted on SciNet, such as the Web of Science PostgreSQL database, that require a separate approval process. Please see this page for more information on accessing those collections.
Accessing the Environment
If working in a high performance computing environment is new to you, we would recommend attending SciNet workshops to learn more, especially their Intro to SciNet/Niagara/Mist workshop (run periodically), or watching a recording of a previous session.
Here are some steps to get you started on a Windows machine:
- To access the environment from a Windows machine, you will need an SSH client. We would recommend MobaXterm, and we will be using it in our tutorial examples
- Once you have installed MobaXterm, start it up
- From the Session menu, select New Session
- Select SSH from the top left
- For the remote host, use this format <computecanadausername>@niagara.scinet.utoronto.ca, substituting in your Compute Canada account username. For example, doej@niagara.scinet.utoronto.ca
- Click on the Advanced SSH settings tab below
- For SSH-browser type, select SCP (enhanced speed)
- Put a checkmark next to Use private key. Click on the blue page icon to browse to the private key you set up when creating your public key for your Compute Canada account
- Then click on OK to connect
- Enter your Compute Canada account password
- You are now connected to the server
- To log out, type
exit
and press Enter. Then press Enter again to close the tab
And the same steps, on a Mac:
- You do not need to install any programs or clients to access the environment from a Mac. Access is via Terminal.
- You will use an SSH key to connect. This requires some initial configuration, but once this is done it is both more secure and more convenient. If you have not already generated a key pair, instructions on how to do so can be found here. More detailed instructions are also available on the SciNet wiki. Remember, you'll need to create a key pair on any system you intend to connect from!
- To login to the remote host, use this command in Terminal:
ssh -i .ssh/myprivatekeyname <computecanadausername>@niagara.scinet.utoronto.ca
The system will prompt you to enter the passphrase for your key. (Note: -i .ssh/myprivatekeyname is only necessary if you are not using the default key filepath and filename. See the complete SSH setup instructions here for more information.)
- You are now connected to the server
- To log out, type
exit
and press Enter. You are now back in your local environment.
Querying the Datasets
All six datasets are in Apache Parquet format. This is a tabular data format, similar to a CSV or Excel file, but capable of handling much larger files. Parquet files are binary, columnar files designed for reading and writing extremely large datasets; they have built-in compression and are optimized for analytics. For these reasons, these files can't be opened or manually inspected by humans. They need to be worked with using big data tools.
Each dataset / directory is composed of many Parquet files, up to 11,000 per folder. The company reference data provides a unique identifier, in the form of a company ID, that is shared across most datasets, allowing for querying across directories.
Important Note: These files are read only. Although they can be queried and explored, new copies or subsets will need to be created during analysis.
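Once PyArrow is installed (see the setup steps below), a file's structure can be inspected without loading any data into memory, which is a useful first step given the file sizes. Here is a minimal sketch, using a file name that appears in the example later in this guide:

import pyarrow.parquet as pq

# Reads only the Parquet footer; no row data is loaded into memory
pf = pq.ParquetFile('individual_user_education_0009_part_00.parquet')
print(pf.schema_arrow)       # column names and data types
print(pf.metadata.num_rows)  # number of rows in this file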
- You will need to use the Linux/UNIX command line in order to navigate the environment. Once logged into SciNet, navigate to the following folder using the cd command to change the current directory:
cd /project/restricted/mdl/revelio
- Type
ls
to view a list of all folders inside the directory. You can then use the cd command to navigate to an individual directory. For example, cd academic_layoff
- Please note: some products contain multiple folders. For example, there are three folders related to the job postings dataset: academic_postings_indeed_individual, academic_postings_linkedin_individual and academic_postings_unified_individual. Please see the vendor documentation if the contents of each folder are unclear. Otherwise, contact us for additional support.
- Please also note: The academic_ prefix does not mean that the data refers only to academic institutions or research centres. The data is comprehensive across employment fields and segments. The prefix denotes only that the data was purchased for academic use.
- You will need to use big data tools to work with this data. If working with Python, PyArrow + Pandas provide excellent support for querying and analyzing Parquet files. You can read more about working with PyArrow and Pandas via their online documentation. Please note that there are other libraries and tools for working with these files. If you'd like to explore other options, this guide is an excellent place to start.
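- For example, Pandas can read a single Parquet file while loading only the columns you need, which keeps memory use manageable on files this size. A minimal sketch (the column names here are illustrative; check the data dictionaries for the actual fields):

import pandas as pd

# Load just two columns from one file rather than the full table
# ('user_id' and 'university_name' are illustrative column names)
df = pd.read_parquet('individual_user_education_0009_part_00.parquet',
                     engine='pyarrow',
                     columns=['user_id', 'university_name'])
print(df.shape)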
- Note: Queries and code run on the login node are for testing purposes only. Once you have compiled and tested your code or workflow on the Niagara login nodes, you will need to submit it as a job to be run on a compute node. Any lengthy process being run on the login node will be automatically stopped.
- If using Python, you will need to load a Python module into your environment. SciNet provides excellent documentation on this. Note that Python 3.11.5 has been installed as a default module and does come pre-loaded with many useful packages, including Pandas. However, as PyArrow is not among them, you will need to follow the steps outlined by SciNet to set up a virtual environment in order to install PyArrow or other packages as needed. See below for an example creation of a new environment, revelioenv, as well as the activation of that environment and installation of PyArrow. This follows the SciNet instructions linked above:
- Type
module load NiaEnv/2019b python/3.11.5
in order to load the default Python module.
- Next type
mkdir ~/.virtualenv
to create a directory in your environment to store your new virtual environment.
- Next type
virtualenv --system-site-packages ~/.virtualenv/[myenvironmentname]env
to create your new environment.
- Next type
source ~/.virtualenv/[myenvironmentname]env/bin/activate
to activate your environment. The name of your environment should now appear to the far left of your screen.
- Finally type
pip install pyarrow
to install the PyArrow module. There is no need to install Pandas, as this is one of the default packages.
- Note: You can install as many modules as needed. You'll need to activate this environment every time you log in, and at the start of all your job scripts. However, the creation of the environment and installation of packages only needs to be done once. The next time you log in, you can skip the creation steps above and simply type
source ~/.virtualenv/[myenvironmentname]env/bin/activate
to activate your environment.
- Once PyArrow is installed, simply type
python3
within your virtual environment to begin writing code. See below for an example of running Python code line by line to examine an individual file within the revelioenv environment.
- Type
python3
to start Python within your environment.
- Next type
import pandas as pd
to import the Pandas module under the shorthand pd.
- Next type
education = pd.read_parquet('individual_user_education_0009_part_00.parquet', engine='pyarrow')
This creates a dataframe (Pandas' version of a spreadsheet) and assigns it to a variable called education, which will allow us to visualize, sort, and query our data as an object within Python.
- Next type
education.head()
The .head() function in Pandas provides a preview of the first 5 rows of our dataframe.
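- Because each directory can contain thousands of Parquet files, it is often more practical to treat the whole folder as a single dataset rather than reading files one by one. A minimal sketch using PyArrow's dataset module ([folder_name], the user_id column, and the filter value are all illustrative):

import pyarrow.dataset as ds

# Treat every Parquet file in the folder as one logical table
dataset = ds.dataset('/project/restricted/mdl/revelio/[folder_name]', format='parquet')

# The filter is applied during the scan, so only matching rows are read into memory
# ('user_id' and 12345 are illustrative)
table = dataset.to_table(filter=ds.field('user_id') == 12345)
education = table.to_pandas()
print(education.shape)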
- Instead of running code line by line, you could also choose to upload a script and run it within the environment. In order to do this on a Windows machine:
- From the MobaXterm interface, you should see a sidebar to the left of your terminal window. Click on the orange globe icon on the far left to open the file explorer tab. This should now list all the files in your personal directory on the Niagara server
- Click on the upload icon at the top (looks like an arrow pointing up)
- You should be prompted to select the file you want to upload from your local computer. Select the file and then click on OK
- Make sure your Python environment is activated, and type
python [nameofyourscript].py
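- For reference, here is what a minimal uploadable script might look like. It simply repeats the line-by-line example above in script form:

# [nameofyourscript].py - a minimal example script
import pandas as pd

# Read one Parquet file and print a preview of the first 5 rows
education = pd.read_parquet('individual_user_education_0009_part_00.parquet', engine='pyarrow')
print(education.head())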
- In order to upload a script on a Mac:
- Open a new Terminal window that is not connected to Niagara (i.e., your local machine), and run the following command:
scp /your/local/directory/[filename and extension] <computecanadausername>@niagara.scinet.utoronto.ca:/home/[firstinitialofyourlastname]/<computecanadausername>/<computecanadausername>
Note: If you are not the Principal Investigator (i.e., your account was sponsored by another user), you'll need to substitute that person's username in place of the first <computecanadausername>, as well as their first initial in [firstinitialofyourlastname]. In this case:
scp /your/local/directory/[filename and extension] <computecanadausername>@niagara.scinet.utoronto.ca:/home/[firstinitialofyoursponsorslastname]/<sponsorscomputecanadausername>/<computecanadausername>
- For example:
scp /Users/user/Documents/SciNet/myfirstpythonscript.py doej@niagara.scinet.utoronto.ca:/home/d/doej/doej
- For example, for a sponsored account (smithp sponsored by doej):
scp /Users/user/Documents/SciNet/myfirstpythonscript.py smithp@niagara.scinet.utoronto.ca:/home/d/doej/smithp
- If prompted, enter your SSH key passphrase
- Once your script has been uploaded, connect to Niagara, activate your Python environment, and type
python [nameofyourscript].py
- Once you have compiled and tested your script on a smaller subset of the data on the login node, you will need to submit it as a job to be run on one of SciNet's compute nodes, per SciNet's instructions. These instructions also include example submission scripts. The output will be written to your $SCRATCH directory.
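- Because the source files are read only, any subsets or results your job produces should be written to a location you own, such as your $SCRATCH directory. A minimal sketch (the input file and subset logic are illustrative; the $SCRATCH environment variable is set automatically on SciNet systems):

import os
import pandas as pd

# Read one file and keep a small, illustrative subset of rows
education = pd.read_parquet('individual_user_education_0009_part_00.parquet', engine='pyarrow')
subset = education.head(1000)

# Write the subset to your $SCRATCH directory
out_path = os.path.join(os.environ['SCRATCH'], 'education_subset.parquet')
subset.to_parquet(out_path, engine='pyarrow')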
- If you would prefer to use a different language to query the data, please see the relevant documentation on SciNet's website. For example, Parquet files can also be queried using C++ or Java.
If you have any questions, feel free to contact us.