Listed below are the University of Toronto's major Text and Data Mining (TDM) platforms and collections. For help with any of them, please contact Digital Scholarship Services. The one exception is the LDC: for LDC questions, please contact the Map & Data Library.
Text Analysis Tools Comparison
To help you decide which tools meet your text analysis needs, please consult this Text Analysis Tools Comparison Cheat Sheet, which compares four of the tools below: the Digital Scholar Lab, Constellate, TDM Studio (both Visualization and Workbench), and the HathiTrust Research Center (both Algorithms and Data Capsule).
APIs
Application Programming Interfaces, or APIs, are a common way to access large amounts of data.
- Introduction to Text and Data Mining, with many APIs available to University of Toronto community members
- UTSC's Introduction to APIs
- Video: Introduction to Web APIs (captioned video with slides)
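As a rough illustration of what working with a web API can look like, here is a minimal Python sketch. It builds a request URL with query parameters and parses a JSON response. The endpoint, the parameter names (`q`, `page`, `per_page`), and the shape of the response are all hypothetical and chosen only for illustration; each real API documents its own. No network request is made, so the sketch runs as-is.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; real APIs publish their own URLs.
BASE_URL = "https://api.example.org/v1/articles"


def build_request_url(base_url, query, page=1, per_page=100):
    """Return a URL asking the (hypothetical) API for one page of results."""
    params = {"q": query, "page": page, "per_page": per_page}
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_text):
    """Parse a JSON response body and return the title of each record."""
    payload = json.loads(response_text)
    return [record["title"] for record in payload["results"]]


# A made-up response body in the shape this sketch assumes.
SAMPLE_RESPONSE = (
    '{"results": [{"title": "Text Mining Basics"}, {"title": "Corpus Methods"}]}'
)

url = build_request_url(BASE_URL, "text mining", page=2)
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

In practice you would send the URL with an HTTP client and pass the returned text to a function like `extract_titles`. Large result sets are usually retrieved one page at a time, which is why the sketch includes a `page` parameter.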
Constellate
Constellate is a browser-based tool for creating datasets from collections such as JSTOR, and for learning and carrying out text analysis on those datasets. It has a number of tutorials, including well-documented Jupyter notebooks.
- Information on Constellate (including links to additional training)
- Accessing Constellate
- Building a Dataset in Constellate
- View a short demo of Constellate
- Workshop: Constellate: A New Platform for Text Analysis (Nov. 30, 2021): Recording - 55:30 & Setup Instructions (please note that Constellate login instructions have changed slightly since this video was recorded)
Gale Digital Scholar Lab (DSL)
The Gale Digital Scholar Lab is a platform that allows users to discover and create collections of digitized texts from the Gale Historical Collections, run a variety of statistical analyses on them, and visualize the resulting data.
- View a short demo of the Digital Scholar Lab
- Digital Scholar Lab Access Instructions (UTORid required)
- Digital Scholar Lab Tutorial
- Overview of the Digital Scholar Lab’s features (captioned video with slides)
HathiTrust Research Center (HTRC)
The University of Toronto is a member of the HathiTrust Research Center, which allows researchers to run text analysis scripts on the HathiTrust corpus.
- Information on HathiTrust Research Center (including links to additional training)
- HathiTrust Research Center
- HTRC Analytics
- HTRC Documentation
- HTRC Tutorial List
- Demo of HTRC Algorithms
- Demo of HTRC Data Capsule
Linguistic Data Consortium (LDC)
The University of Toronto is a subscriber to the Linguistic Data Consortium, which licenses language corpora and other language resources.
- For more information about the LDC, please visit the LDC website
- Access the University of Toronto’s LDC Holdings (requires UTORid login)
ProQuest TDM Studio
TDM Studio is a web platform for running text analyses on thousands of ProQuest datasets, including, but not limited to, such databases as ProQuest Dissertations & Theses and the New York Times. TDM Studio has two components:
- Visualizations, which allows you to work with 10,000 results entirely through a point-and-click interface with data visualization
- The Workbench, for researchers and their teams coding with R or Python in Jupyter notebooks
If you wish to have access to both components, you must request them separately.
- See our TDM Studio information page for more information and to sign up
- See a demo of TDM Studio - Visualizations
- See a demo of TDM Studio - Workbench
Web of Science (WoS)
The Web of Science (WoS) Raw Data Product includes metadata from over 12,500 journals from around the world in over 250 Science, Social Science and Humanities disciplines. Conference proceedings and book data are also included. Data are available from 1900 onward and, as of 2018, include over 63 million article records and 1 billion cited references.
- The XML has been converted into a PostgreSQL database. You can query the data using SQL statements.
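To give a sense of what querying the data with SQL might look like, here is a minimal Python sketch that counts articles per year for one journal. The table name `articles` and its columns (`title`, `journal`, `pub_year`) are assumptions made for illustration, not the actual WoS database schema, and the rows are made up. An in-memory SQLite database stands in for PostgreSQL so that the example runs without credentials.

```python
import sqlite3

# Stand-in database: in-memory SQLite instead of the real PostgreSQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical schema; the real WoS table and column names may differ.
conn.execute(
    "CREATE TABLE articles ("
    "id INTEGER PRIMARY KEY, title TEXT, journal TEXT, pub_year INTEGER)"
)

# Made-up rows, only so the query has something to return.
conn.executemany(
    "INSERT INTO articles (title, journal, pub_year) VALUES (?, ?, ?)",
    [
        ("Paper 1", "Journal A", 2001),
        ("Paper 2", "Journal A", 2001),
        ("Paper 3", "Journal A", 2002),
        ("Paper 4", "Journal B", 2001),
    ],
)

# Count articles per publication year for a single journal.
QUERY = """
SELECT pub_year, COUNT(*) AS n_articles
FROM articles
WHERE journal = ?
GROUP BY pub_year
ORDER BY pub_year
"""

rows = conn.execute(QUERY, ("Journal A",)).fetchall()
for pub_year, n_articles in rows:
    print(pub_year, n_articles)

conn.close()
```

The SELECT statement itself is standard SQL and should carry over to PostgreSQL once the table and column names are replaced with the real ones. Note that PostgreSQL drivers such as psycopg use `%s` rather than `?` as the parameter placeholder.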