Table of Contents:
Understanding the Revelio Labs Datasets
Working with the Revelio Labs Datasets
Accessing the Environment
Querying the Datasets
Understanding the Revelio Labs Datasets
Note: Data Dictionaries for the six products listed below can be found here. The University of Toronto does not license the Transitions Dataset.
Workforce Dynamics
This dataset provides an overview of a company's composition. This includes headcounts, inflows, and outflows for every unique position, segmented by job, seniority, geography, salary, education, skills, and gender & ethnicity. Coverage is global, from 2007-2024.
Job Postings
This dataset includes active postings, new postings, removed postings, salaries, and full text of postings for any company, segmented by various employee characteristics (occupation, seniority, geography, keywords, skills, etc.). This dataset is pulled from over 350,000 company websites, all major job boards, and staffing firm job boards. Coverage is global, from 2021-2024. Note that the University of Toronto does not license the aggregate "Job Postings Dynamic" dataset, but only the individual-level data. Please also note that although it is not listed in the data dictionary, these datasets do contain a "description" field with the original job description text, when available.
Sentiment
This dataset includes employee reviews for all companies, with the full text of each review split into positive and negative text. Reviews are mapped to various employee characteristics (occupation, seniority, geography). Coverage is global, from 2008-2024.
Layoff Notices
This dataset includes the posting date and effective date for all layoffs at every company in the United States. Coverage is limited to the US; start dates range from 2008 to 2020 depending on the state, and data extends to 2024.
Individual Level Data
This dataset contains data on the full professional history of individuals, including their roles, education, skills, gender, ethnicity, salary, seniority, and geography. Coverage is global, from 2008-2024.
Company Reference Data
This dataset contains information on companies that are covered by or referenced in the five other datasets listed above.
Working with the Revelio Labs Datasets
Creating an Account
- Get a Compute Canada account
- Opt into the Niagara & Mist service
- Upload an SSH key to CCDB
- Request Access to Revelio
These steps only need to be completed once to gain access, and approval should normally take at most a few days.
1. Get a Compute Canada account
Please visit the Compute Canada Database (CCDB) website and apply for an account (takes a day or two to approve).
Note: Students and postdocs need to be sponsored by their supervisor, who would need to already have a Compute Canada account (or create one first). This is a simple process, requiring the student to complete a form and their supervisor to approve the sponsorship. Computing resources would be shared under the sponsor's allocation. Please contact the Map & Data Library for assistance, or for more information.
2. Opt in to Niagara & Mist service
After the Compute Canada account is approved, you should opt in to the Niagara & Mist service on the CCDB website (or use this direct link). The opt-in will be approved manually after one or two days, which will give you access to the Niagara supercomputer and other SciNet systems, as long as the SSH key has been uploaded to the CCDB website (see next step).
3. Upload an SSH key to CCDB
Next, locate the Manage SSH Keys option on your account page on the CCDB website (or use this direct link) and upload your public SSH key. Instructions on creating SSH key pairs from the SciNet Wiki can help you with this process. This wiki also contains pages with more information on creating SSH key pairs specifically on a Windows machine, or on Mac or Linux machines. The Map & Data Library also provides a quick start tutorial for creating SSH key pairs on a Mac, if you need more help.
4. Request Access to Revelio
Access to the Revelio dataset is by request only. Please contact us to request access.
Note there are other licensed datasets hosted on SciNet, such as the Web of Science PostgreSQL database, that require a separate approval process. Please see this page for more information on accessing those collections.
Accessing the Environment
If working in a high performance computing environment is new to you, we would recommend attending SciNet workshops to learn more, especially their Intro to SciNet/Niagara/Mist workshop (run periodically), or watching a recording of a previous session.
Here are some steps to get you started on a Windows machine:
- To access the environment from a Windows machine, you will need an SSH client. We would recommend MobaXterm, and we will be using it in our tutorial examples
- Once you have installed MobaXterm, start it up
- From the Session menu, select New Session
- Select SSH from the top left
- For the remote host, use this format <computecanadausername>@niagara.scinet.utoronto.ca, substituting in your Compute Canada account username. For example, doej@niagara.scinet.utoronto.ca
- Click on the Advanced SSH settings tab below
- For SSH-browser type, select SCP (enhanced speed)
- Put a checkmark next to Use private key. Click on the blue page icon to browse to the private key you set up when creating your public key for your Compute Canada account
- Then click on OK to connect
- Enter your Compute Canada account password
- You are now connected to the server
- To log out, type
exit
and press Enter. Then press Enter again to close the tab
And the same steps, on a Mac:
- You do not need to install any programs or clients to access the environment from a Mac. Access is via Terminal.
- You will use an SSH key to connect. This requires some initial configuration, but once this is done it is both more secure and more convenient. If you have not already generated a key pair, instructions on how to do so can be found here. More detailed instructions are also available on the SciNet wiki. Remember, you'll need to create a key pair on any system you intend to connect from!
- To login to the remote host, use this command in Terminal:
ssh -i .ssh/myprivatekeyname <computecanadausername>@niagara.scinet.utoronto.ca
The system will prompt you to enter the passphrase for your key. (Note: -i .ssh/myprivatekeyname is only necessary if you are not using the default key filepath and filename. See the complete SSH setup instructions here for more information.)
- You are now connected to the server
- To log out, type
exit
and press Enter. You are now back in your local environment.
Querying the Datasets
All six datasets are in Apache Parquet format. This is a tabular data format, similar to a CSV or Excel file, but capable of handling much larger files. Parquet files are binary, columnar files designed for reading and writing extremely large datasets; they have built-in compression and are optimized for analytics. For these reasons, these files can't be opened or manually inspected by humans. They need to be worked with using big data tools.
Each dataset / directory is composed of many Parquet files, up to 11,000 per folder. The company reference data provides a unique identifier, in the form of a company ID, that is shared across most datasets, allowing for querying across directories.
Important Note: These files are read only. Although they can be queried and explored, new copies or subsets will need to be created during analysis.
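Once PyArrow is installed (see the setup steps below), a file's structure can be inspected without loading any data into memory, which is a useful first step given the file sizes. Here is a minimal sketch, using a file name that appears in the example later in this guide:

import pyarrow.parquet as pq

# Reads only the Parquet footer; no row data is loaded into memory
pf = pq.ParquetFile('individual_user_education_0009_part_00.parquet')
print(pf.schema_arrow)       # column names and data types
print(pf.metadata.num_rows)  # number of rows in this file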
- You will need to use the Linux/UNIX command line in order to navigate the environment. Once logged into SciNet, navigate to the following folder using the cd command to change the current directory:
cd /project/restricted/mdl/revelio
- Type
ls
to view a list of all folders inside the directory. You can then use the cd command to navigate to an individual directory. For example, cd academic_layoff
- Please note: some products contain multiple folders. For example, there are three folders related to the job postings dataset: academic_postings_indeed_individual, academic_postings_linkedin_individual and academic_postings_unified_individual. Please see the vendor documentation if the contents of each folder are unclear. Otherwise, contact us for additional support.
- Please also note: The academic_ prefix does not mean that the data refers only to academic institutions or research centres. The data is comprehensive across employment fields and segments. The prefix denotes only that the data was purchased for academic use.
- You will need to use big data tools to work with this data. If working with Python, PyArrow + Pandas provide excellent support for querying and analyzing Parquet files. You can read more about working with PyArrow and Pandas via their online documentation. Please note that there are other libraries and tools for working with these files. If you'd like to explore other options, this guide is an excellent place to start.
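- For example, Pandas can read a single Parquet file while loading only the columns you need, which keeps memory use manageable on files this size. A minimal sketch (the column names here are illustrative; check the data dictionaries for the actual fields):

import pandas as pd

# Load just two columns from one file rather than the full table
# ('user_id' and 'university_name' are illustrative column names)
df = pd.read_parquet('individual_user_education_0009_part_00.parquet',
                     engine='pyarrow',
                     columns=['user_id', 'university_name'])
print(df.shape)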
- Note: Queries and code run on the login node are for testing purposes only. Once you have compiled and tested your code or workflow on the Niagara login nodes, you will need to submit it as a job to be run on a compute node. Any lengthy process being run on the login node will be automatically stopped.
- If using Python, you will need to load a Python module into your environment. SciNet provides excellent documentation on this. Note that Python 3.11.5 has been installed as a default module and does come pre-loaded with many useful packages, including Pandas. However, as PyArrow is not among them, you will need to follow the steps outlined by SciNet to set up a virtual environment in order to install PyArrow or other packages as needed. See below for an example creation of a new environment, revelioenv, as well as the activation of that environment and installation of PyArrow. This follows the SciNet instructions linked above:
- Type
module load NiaEnv/2019b python/3.11.5
in order to load the default Python module.
- Next type
mkdir ~/.virtualenv
to create a directory in your environment to store your new virtual environment.
- Next type
virtualenv --system-site-packages ~/.virtualenv/[myenvironmentname]env
to create your new environment.
- Next type
source ~/.virtualenv/[myenvironmentname]env/bin/activate
to activate your environment. The name of your environment should now appear to the far left of your screen.
- Finally type
pip install pyarrow
to install the PyArrow module. There is no need to install Pandas, as this is one of the default packages.
- Note: You can install as many modules as needed. You'll need to activate this environment every time you log in, and at the start of all your job scripts. However, the creation of the environment and installation of packages only needs to be done once. The next time you log in, you can skip the creation steps above and simply type
source ~/.virtualenv/[myenvironmentname]env/bin/activate
to activate your environment.
- Once PyArrow is installed, simply type
python3
within your virtual environment to begin writing code. See below for an example of running Python code line by line to examine an individual file within the revelioenv environment.
- Type
python3
to start Python within your environment.
- Next type
import pandas as pd
to import the Pandas module under the shorthand pd.
- Next type
education = pd.read_parquet('individual_user_education_0009_part_00.parquet', engine='pyarrow')
This creates a dataframe (Pandas' version of a spreadsheet) and assigns it to a variable called education, which will allow us to visualize, sort, and query our data as an object within Python.
- Next type
education.head()
The .head() function in Pandas provides a preview of the first 5 rows of our dataframe.
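- Because each directory can contain thousands of Parquet files, it is often more practical to treat the whole folder as a single dataset rather than reading files one by one. A minimal sketch using PyArrow's dataset module ([folder_name], the user_id column, and the filter value are all illustrative):

import pyarrow.dataset as ds

# Treat every Parquet file in the folder as one logical table
dataset = ds.dataset('/project/restricted/mdl/revelio/[folder_name]', format='parquet')

# The filter is applied during the scan, so only matching rows are read into memory
# ('user_id' and 12345 are illustrative)
table = dataset.to_table(filter=ds.field('user_id') == 12345)
education = table.to_pandas()
print(education.shape)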
- Instead of running code line by line, you could also choose to upload a script and run it within the environment. In order to do this on a Windows machine:
- From the MobaXterm interface, you should see a sidebar to the left of your terminal window. Click on the orange globe icon on the far left to open the file explorer tab. This should now list all the files in your personal directory on the Niagara server
- Click on the upload icon at the top (looks like an arrow pointing up)
- You should be prompted to select the file you want to upload from your local computer. Select the file and then click on OK
- Make sure your Python environment is activated, and type
python [nameofyourscript].py
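- For reference, here is what a minimal uploadable script might look like. It simply repeats the line-by-line example above in script form:

# [nameofyourscript].py - a minimal example script
import pandas as pd

# Read one Parquet file and print a preview of the first 5 rows
education = pd.read_parquet('individual_user_education_0009_part_00.parquet', engine='pyarrow')
print(education.head())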
- In order to upload a script on a Mac:
- Open a new Terminal window that is not connected to Niagara (i.e., your local machine), and run the following command:
scp /your/local/directory/[filename and extension] <computecanadausername>@niagara.scinet.utoronto.ca:/home/[firstinitialofyourlastname]/<computecanadausername>/<computecanadausername>
Note: If you are not the Principal Investigator (i.e., your account was sponsored by another user), you'll need to substitute that person's username in place of the first <computecanadausername>, as well as their first initial in [firstinitialofyourlastname]. In this case:
scp /your/local/directory/[filename and extension] <computecanadausername>@niagara.scinet.utoronto.ca:/home/[firstinitialofyoursponsorslastname]/<sponsorscomputecanadausername>/<computecanadausername>
- For example:
scp /Users/user/Documents/SciNet/myfirstpythonscript.py doej@niagara.scinet.utoronto.ca:/home/d/doej/doej
- For example, for a sponsored account (smithp sponsored by doej):
scp /Users/user/Documents/SciNet/myfirstpythonscript.py smithp@niagara.scinet.utoronto.ca:/home/d/doej/smithp
- If prompted, enter your SSH key passphrase
- Once your script has been uploaded, connect to Niagara, activate your Python environment, and type
python [nameofyourscript].py
- Once you have compiled and tested your script on a smaller subset of the data on the login node, you will need to submit it as a job to be run on one of SciNet's compute nodes, per SciNet's instructions. These instructions also include example submission scripts. The output will be written to your $SCRATCH directory.
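- Because the source files are read only, any subsets or results your job produces should be written to a location you own, such as your $SCRATCH directory. A minimal sketch (the input file and subset logic are illustrative; the $SCRATCH environment variable is set automatically on SciNet systems):

import os
import pandas as pd

# Read one file and keep a small, illustrative subset of rows
education = pd.read_parquet('individual_user_education_0009_part_00.parquet', engine='pyarrow')
subset = education.head(1000)

# Write the subset to your $SCRATCH directory
out_path = os.path.join(os.environ['SCRATCH'], 'education_subset.parquet')
subset.to_parquet(out_path, engine='pyarrow')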
- If you would prefer to use a different language to query the data, please see the relevant documentation on SciNet's website. For example, Parquet files can also be queried using C++ or Java.
If you have any questions, feel free to contact us.