Text and Data Mining Tools Overview

Below are the University of Toronto's major Text and Data Mining (TDM) platforms and collections. For help with any of them, please contact Digital Scholarship Services. Exception: for questions about the LDC, please contact the Map & Data Library.

The University of Toronto Libraries also held a Colloquium on Text & Data Mining (TDM) in Libraries on May 2-3, 2023, in Toronto, Ontario, Canada. Presentation slides and recordings from TDM tool providers and some TDM researchers are available on the colloquium website.

Text Analysis Tools Comparison

To help you decide which tools meet your text analysis needs, consult this Text Analysis Tools Comparison Cheat Sheet, which compares four of the tools below: the Digital Scholar Lab, Constellate (now replaced by JSTOR Text Analysis, see below), TDM Studio (both Visualization and Workbench), and the HathiTrust Research Center (both Algorithms and Data Capsule).

This chart was developed as part of our workshop, Text Analysis Tasting Menu: A Sampling of Available Tools (recording and slides).

APIs

Application Programming Interfaces, or APIs, are a common way to access large amounts of data.

Introduction to Text and Data Mining, with many APIs available to University of Toronto community members
UTSC's Introduction to APIs
Video: Introduction to Web APIs (captioned video with slides)
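
As a rough illustration of the pattern most web APIs share, the sketch below requests results one page at a time and stops when a page comes back empty. The endpoint URL, the query parameters (`q`, `page`, `per_page`), and the `results` key are all hypothetical; a real API's documentation will specify its own names, authentication, and rate limits.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only; substitute a real API's URL.
BASE_URL = "https://api.example.org/v1/articles"


def build_url(query, page, per_page=100):
    """Build the request URL for one page of results."""
    params = {"q": query, "page": page, "per_page": per_page}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_json(url):
    """Download one page and decode its JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def harvest(query, fetch=fetch_json, max_pages=5):
    """Collect records page by page until a page comes back empty."""
    records = []
    for page in range(1, max_pages + 1):
        payload = fetch(build_url(query, page))
        batch = payload.get("results", [])  # "results" key is an assumption
        if not batch:
            break
        records.extend(batch)
    return records
```

Passing the download step in as the `fetch` argument makes it possible to try the paging logic against canned responses before sending any requests to a live service.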

Canadian Intellectual Property Office (CIPO)

The Canadian Intellectual Property Office (CIPO) Patent PostgreSQL Database is a metadata extract from CIPO's IP Horizons XML Databank that contains information on almost 2.5 million patent documents filed in Canada. Data are available from 1870 to the present, and include metadata as well as the full text of patent descriptions and claims information. These documents represent both patent applications and patent grants, as well as patents that have expired. Many patents provide references to equivalent patents filed in other countries, via the World Intellectual Property Organization's (WIPO) Patent Cooperation Treaty (PCT).

Access:

The XML has been converted into a PostgreSQL database. You can query the data through SQL statements.
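
To give a sense of what such a query might look like, here is a minimal sketch of a keyword search limited by filing year. It runs against an in-memory SQLite database as a stand-in so that it is self-contained; the actual data are in PostgreSQL, and the `patents` table, its columns, and the three sample rows are invented for illustration and do not reflect the real schema.

```python
import sqlite3

# Stand-in database: sqlite3 keeps this sketch self-contained. The real
# CIPO data are in PostgreSQL, and the table and column names below are
# hypothetical, not the actual schema. The rows are toy data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patents (patent_number INTEGER, filing_year INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO patents VALUES (?, ?, ?)",
    [
        (1, 1995, "Snow removal apparatus"),
        (2, 2004, "Method for text classification"),
        (3, 2011, "Apparatus for text retrieval"),
    ],
)


def titles_matching(conn, keyword, since_year):
    """Return (number, year, title) rows whose title contains keyword."""
    sql = (
        "SELECT patent_number, filing_year, title FROM patents "
        "WHERE title LIKE ? AND filing_year >= ? "
        "ORDER BY filing_year"
    )
    # Values are passed separately from the SQL so the driver handles quoting.
    return conn.execute(sql, (f"%{keyword}%", since_year)).fetchall()


rows = titles_matching(conn, "text", 2000)
```

Against PostgreSQL the same `SELECT` statement would apply with two differences worth knowing: Python drivers such as psycopg use `%s` rather than `?` as the placeholder, and `LIKE` is case-sensitive there (`ILIKE` is the case-insensitive form).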

Gale Digital Scholar Lab (DSL)

The Gale Digital Scholar Lab is a platform that allows users to discover and create collections of digitized texts from the Gale Historical Collections, run a variety of statistical analyses on them, and visualize the resulting data.

View a short demo of the Digital Scholar Lab
Digital Scholar Lab Access Instructions (UTORid required)
Digital Scholar Lab Tutorial

HathiTrust Research Center (HTRC)

The University of Toronto is a member of the HathiTrust Research Center, which allows researchers to run text analysis scripts on the HathiTrust corpus.

Please note that the HTRC's funding ends at the end of 2026. Data capsules and most other services end by September 2026. See the HTRC transition guide for a timeline.

HathiTrust will continue to provide public domain research datasets to eligible researchers through its dataset request process. It intends to eventually offer plans for accessing copyrighted data, though that service is not yet available.

JSTOR Text Analysis and Constellate

JSTOR offers a Python-based text analysis service; see our JSTOR Text Analysis Support information page for more details and resources.

Previously, JSTOR/ITHAKA also offered text analysis via the now-sunset Constellate platform.

Linguistic Data Consortium (LDC)

The University of Toronto subscribes to the Linguistic Data Consortium, which licenses language corpora and other language resources.

For more information about the LDC, please visit the LDC website
Access the University of Toronto’s LDC Holdings (requires UTORid login)

ProQuest TDM Studio

TDM Studio is a web platform for running text analyses on thousands of ProQuest datasets, including databases such as ProQuest Dissertations & Theses and the New York Times. TDM Studio has two components:

Visualizations, an entirely point-and-click interface for analyzing and visualizing up to 10,000 results
The Workbench, for researchers and their teams coding with R or Python in Jupyter notebooks

If you wish to have access to both components, you must request them separately.

See our TDM Studio information page for more information and to sign up
See a demo of TDM Studio - Visualizations
See a demo of TDM Studio - Workbench

Web of Science (WoS)

The Web of Science (WoS) Raw Data Product includes metadata from over 12,500 journals from around the world in over 250 Science, Social Science and Humanities disciplines. Conference proceedings and book data are also included. Data are available from 1900 and currently include over 63 million article records and 1 billion cited references (as of 2018).

Access:

The XML has been converted into a PostgreSQL database. You can query the data through SQL statements.
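
Citation data lend themselves to joins and aggregates. The sketch below counts how often each article is cited by joining an article table to a cited-reference table. It uses an in-memory SQLite database as a self-contained stand-in for PostgreSQL, and the `articles` and `cited_references` tables, their columns, and the sample rows are all hypothetical rather than the actual WoS schema.

```python
import sqlite3

# Stand-in database with a hypothetical two-table layout and toy rows;
# the real WoS data are in PostgreSQL with their own schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (article_id TEXT PRIMARY KEY, pub_year INTEGER);
    CREATE TABLE cited_references (citing_id TEXT, cited_id TEXT);
    """
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("A", 2001), ("B", 2005), ("C", 2010)],
)
conn.executemany(
    "INSERT INTO cited_references VALUES (?, ?)",
    [("B", "A"), ("C", "A"), ("C", "B")],
)


def citation_counts(conn):
    """Return (article_id, pub_year, times_cited), most cited first."""
    sql = (
        "SELECT a.article_id, a.pub_year, COUNT(r.citing_id) AS times_cited "
        "FROM articles AS a "
        # LEFT JOIN keeps articles that nobody cites, with a count of 0.
        "LEFT JOIN cited_references AS r ON r.cited_id = a.article_id "
        "GROUP BY a.article_id, a.pub_year "
        "ORDER BY times_cited DESC, a.article_id"
    )
    return conn.execute(sql).fetchall()


counts = citation_counts(conn)
```

On a database with tens of millions of records, a query like this would normally be restricted first, for example with a `WHERE` clause on publication year, so that the join works on a manageable subset.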

Text and Data Mining Resources - Books Available through UTL

Book Title	Author(s)	Location at UTL
An introduction to text mining : research design, data collection, and analysis	Ignatow, Gabe; Mihalcea, Rada	Engineering & Computer Science Library, UTSC Storage
Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning	Bengfort, Benjamin; Bilbro, Rebecca	UTSC - The Bridge
Data mining: concepts and techniques	Han, Jiawei; Kamber, Micheline; Pei, Jian	Online Only
Data Mining: The Textbook	Aggarwal, Charu C.	Online Only
Discovering knowledge in data: an introduction to data mining	Larose, Daniel T.; Larose, Chantal D.	Online Only
Pacific-Asia Conference on Knowledge Discovery and Data Mining	Various	Online Only
Practical data mining	Hancock, Monte	Robarts Storage, UTSC Storage
Text Analysis with R: For Students of Literature	Jockers, Matthew L.; Thalken, Rosamond	Online Only
Text mining and visualization : case studies using open-source tools	Hofmann, Markus; Chisholm, Andrew	Online Only
Text Mining: Concepts, Implementation, and Big Data Challenge	Jo, Taeho	Online Only
Text Mining for Information Professionals	Lamba, Manika; Madhusudhan, Margam	Online Only
Text Mining in Practice with R	Kwartler, Ted	Online Only
The text mining handbook : advanced approaches in analyzing unstructured data	Feldman, Ronen; Sanger, James	Engineering & Computer Science Library, UTM
