Table of Contents
What is the HTRC?
What is Text Analysis?
What Tools and Resources are Available for me to Access?
Web-based Text-Analysis Algorithms
HTRC Extracted Features Files
HathiTrust Data API
HathiTrust Bibliographic API
HTRC in-development and beta tools
How Can I Receive Training?
The HathiTrust Research Center (HTRC) is the research arm of HathiTrust. It develops tools and resources that enable text or computational analysis of the HathiTrust corpus. This corpus or digital library includes over 10 million volumes (mostly books and journals), 3 million of which are in the public domain. It covers 400 languages and publication dates from 1500 to the present day, representing a broad variety of subjects.
Text analysis, using the tools available through the HTRC, allows researchers to quickly analyze far more documents than a person could read. Questions that text analysis can help answer include:
- What are these texts about?
- How are these texts connected?
- What emotions (or affects) are found within these texts?
- What names are used in these texts?
- Which of these texts are most similar?
These questions can be addressed using techniques such as finding word frequencies, performing topic modelling, or performing named entity recognition. HTRC documentation can help you learn more about performing these tasks and answering these questions.
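As a minimal sketch of the simplest of these techniques, the Python snippet below counts word frequencies in a short string. The sample sentence, the stop word list, and the function name are illustrative assumptions and do not come from any HTRC tool; real analyses usually need more careful tokenization.

```python
import re
from collections import Counter

def word_frequencies(text, stop_words=frozenset()):
    """Count lowercase word tokens in `text`, skipping any in `stop_words`."""
    # Very rough tokenizer: runs of letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Illustrative input, not drawn from the HathiTrust corpus.
sample = "The whale surfaced. The crew watched the whale dive."
freqs = word_frequencies(sample, stop_words={"the"})
print(freqs.most_common(3))
```

Removing stop words such as "the" before counting keeps very common function words from dominating the results.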
HTRC provides several ways of accessing and analyzing texts. While some require knowledge of the Python and R programming languages, others are point and click. Please note that the options below may include different subsets of the corpus, as well as different levels of access (metadata vs. full text) and data types.
Please note that signing in to either the Digital Library or Analytics will not automatically sign you in to the other, and these accounts are not linked (for example, collections built in your Digital Library account will not automatically be transferred to Analytics).
On the HathiTrust Digital Library, you can build "collections" of texts that can then be transferred into HTRC as "worksets". These worksets can then be analyzed in HTRC using a variety of approaches, either with web-based algorithms or in a Data Capsule environment. Worksets therefore have two primary functions: they are organizational, gathering materials of interest together in one place, and algorithmic.
In addition to importing a collection from HathiTrust, you can also create worksets in HTRC by uploading a file containing HathiTrust volumeIDs, or using the new WorkSet Building Tool (Beta).
Visit the HTRC documentation for more information on how to create a workset.
Access HTRC Algorithms by selecting "Algorithms" on the HTRC home page.
Worksets in HTRC can be analyzed from your browser using a set of algorithms. You must create a workset prior to running any algorithms in HTRC. Currently, these algorithms include Named Entity Recognizer, Token Count and Tag Cloud Creator, and InPhO Topic Model Explorer. Documentation on each tool, including information on the code and libraries behind each one, is available from the HTRC Algorithms page.
Algorithms are limited to working on public-domain content, and worksets of up to 3000 records or 3GB.
Select your algorithm, and then select the workset you would like to run it on. In some cases, you may need to set additional parameters, such as identifying the primary language in your text or providing a list of stop words. Once a job is executed, it will appear as queued in the "Jobs" page accessible from the HTRC Algorithms page. Depending on size, a job may take several hours to complete.
Once complete, the status will change to finished. You will now be able to select the Job Name to visualize the results or download the underlying data.
Note that additional tools exist in Beta, and may be added to Algorithms in future.
Access Data Capsules by selecting "Data Capsules" on the HTRC home page.
HTRC Data Capsules provide a secure, virtual computer for non-consumptive analytical access to the full OCR text of works in the HathiTrust Digital Library. To explore the Capsule environment on a small virtual machine (VM), create a Demo Capsule. To run analysis on large datasets, create a Research Capsule. Creation of a Research Capsule requires you to input information on your technical requirements in terms of VCPUs and RAM, as well as to describe your project and the derived products you intend to export.
Note that access to the full corpus, not just public-domain content, is available in the Capsule Environment. This must be requested via an additional consent form when completing the request to create your Research Capsule.
A Capsule is a collaborative environment, and can be shared with up to five collaborators via their e-mail. Each e-mail must be linked to an HTRC account.
The Capsule Environment
Data capsules are restricted, particularly in limiting how and when the products created by analysis tools leave the capsule. Capsules have two modes, Maintenance and Secure. The environment must be in Secure mode in order to work with HathiTrust Data; while in Maintenance mode, custom scripts and external datasets may be imported. Note that any data products leaving a data capsule must undergo results review prior to release. Once review is complete, data will be released via an e-mail link to all collaborators, which remains active for 12 hours.
Capsule environments come loaded with Jupyter Notebooks and Voyant Tools for running data analysis. All common Python libraries are pre-installed in Jupyter, and others can be added. Command line access is also enabled.
Dataset size is limited to 50,000 records; however, larger datasets are available on request by contacting HTRC at firstname.lastname@example.org.
More information on importing worksets into your Data Capsule, and working with them, can be found in the HTRC Data Capsule Tutorial.
The HTRC Extracted Features Dataset is composed of page-level features for 17.1 million volumes in the HathiTrust Digital Library. It contains non-consumptive features for both public-domain and in-copyright books. These features include part-of-speech tagged term token counts, header/footer identification, marginal character counts, and more.
Files in the dataset are structured as one volume per file. Each file contains both a metadata block and a features block. The metadata section contains basic bibliographic metadata derived from MARC and transformed to Bibframe, while the features section contains all of the unigram tokens, token counts, and other calculated or algorithmically derived data from the HathiTrust volume. Files are divided into JSON arrays for each page, with header, body, and footer sections. Empty values are tagged as NULL.
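To make this layout concrete, here is a minimal Python sketch that totals body token counts across the pages of one volume. The sample record is a hand-written, heavily simplified stand-in: the field names (`features`, `pages`, `body`, `tokenPosCount`) and all values are assumptions for illustration, so check them against a real Extracted Features file before relying on them.

```python
import json
from collections import Counter

# Hypothetical, simplified stand-in for one Extracted Features file.
# Field names and values are assumptions, not copied from the dataset.
SAMPLE_VOLUME = json.loads("""
{
  "id": "example.000001",
  "metadata": {"title": "An Example Volume"},
  "features": {
    "pages": [
      {"seq": "00000001",
       "header": {"tokenPosCount": {"CHAPTER": {"NN": 1}}},
       "body": {"tokenPosCount": {"whale": {"NN": 2}, "swims": {"VBZ": 1}}},
       "footer": null},
      {"seq": "00000002",
       "header": null,
       "body": {"tokenPosCount": {"whale": {"NN": 1}, "ship": {"NN": 3}}},
       "footer": {"tokenPosCount": {"12": {"CD": 1}}}}
    ]
  }
}
""")

def body_token_counts(volume):
    """Sum token counts over the body section of every page in one volume."""
    totals = Counter()
    for page in volume["features"]["pages"]:
        body = page.get("body") or {}  # a null section becomes an empty dict
        for token, pos_counts in body.get("tokenPosCount", {}).items():
            # Collapse the part-of-speech breakdown into one count per token.
            totals[token.lower()] += sum(pos_counts.values())
    return totals

counts = body_token_counts(SAMPLE_VOLUME)
print(counts.most_common())
```

Reading only the body section leaves out running headers and page numbers, which would otherwise inflate the counts of words that repeat on every page.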
Note that because word order is not preserved, Extracted Features cannot be used for analyses that depend on it, such as sentiment analysis. However, you could conduct analyses such as page-level co-occurrences.
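Page-level co-occurrence needs only the set of tokens on each page, which is why it works without word order. The sketch below assumes the per-page token counts have already been loaded into plain Python dictionaries; the data and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical page-level token counts (token -> count), one dict per page.
pages = [
    {"whale": 2, "ship": 1, "sea": 4},
    {"ship": 3, "captain": 1},
    {"whale": 1, "sea": 2, "captain": 1},
]

def page_cooccurrences(pages):
    """For each unordered token pair, count the pages on which both appear."""
    pair_counts = Counter()
    for page in pages:
        # Sorting the tokens gives every pair one canonical (a, b) order.
        for pair in combinations(sorted(page), 2):
            pair_counts[pair] += 1
    return pair_counts

cooc = page_cooccurrences(pages)
print(cooc[("sea", "whale")])  # number of pages containing both tokens
```

Each pair is counted once per page regardless of how often the tokens occur there; weighting by token counts is a possible variation.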
The complete dataset can be downloaded as JSON via rsync (4TB total). Alternatively, you can interact with this data directly via the HTRC Feature Reader Python Library. An excellent guide on working with Extracted Features is available from the Programming Historian.
The Data API allows you to retrieve page images, OCR text for individual pages, and METS metadata for 3 million public-domain volumes (this does not include any content digitized by Google).
There are two methods of accessing the Data API: via a Web client, requiring authentication (users who are not members of a HathiTrust partner institution must sign up for a University of Michigan "Friend" Account), and programmatically using an access key that can be obtained on request. The Data API allows hits up to 10,000 volumes.
Note this API is distinct from the HTRC Data API which is only accessible within the secure Data Capsule environment.
The HathiTrust Bibliographic API allows you to do real-time querying against the HathiTrust collection and to retrieve a limited number of bibliographic records. It can use a variety of common identifiers such as ISBN, ISSN, LCCN and OCLC, as well as HathiTrust identifiers, to retrieve information about any works associated with those identifiers. The API can provide you with brief or full JSON bibliographic records.
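As a rough illustration, the Python sketch below builds a request URL for a single identifier. The base URL and path pattern are assumptions based on the commonly documented form of the API and should be checked against the current HathiTrust documentation; the function name and the OCLC number are illustrative only. The snippet only constructs the URL and does not send a request, and HathiTrust identifiers are left out of this sketch.

```python
from urllib.parse import quote

# Assumed base URL and path pattern; verify against the HathiTrust docs.
BASE_URL = "https://catalog.hathitrust.org/api/volumes"
ID_TYPES = {"isbn", "issn", "lccn", "oclc"}

def bib_api_url(id_type, identifier, detail="brief"):
    """Build a URL for one identifier; `detail` is 'brief' or 'full'."""
    if detail not in {"brief", "full"}:
        raise ValueError(f"detail must be 'brief' or 'full', got {detail!r}")
    if id_type not in ID_TYPES:
        raise ValueError(f"unsupported identifier type: {id_type!r}")
    # Percent-encode the identifier so it is safe inside a URL path.
    return f"{BASE_URL}/{detail}/{id_type}/{quote(str(identifier), safe='')}.json"

url = bib_api_url("oclc", "424023")
print(url)
```

The resulting URL could then be fetched with any HTTP client, and the JSON response parsed with Python's `json` module.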
HTRC continues to develop new tools and resources to support text or computational analysis of the HathiTrust corpus. A list of tools in-development can be found on the HTRC website.
The HathiTrust Research Center maintains an extensive documentation guide at the University of Illinois. This guide includes documentation on each service, as well as step-by-step tutorials on getting started. In some cases, video tutorials and sample scripts are available.
The HTRC also offers occasional workshop series, which are open to all interested attendees. Please see their workshop page for upcoming sessions and more information. Some materials from previous workshops are also available on Google Drive and via the University of Illinois' "train the trainer" curriculum (please note these materials may be slightly out of date).
How do I cite a dataset and tool?
HathiTrust is working to create online documentation around dataset and tool citation. The following advice has been provided in the interim by the HTRC. Examples use the Chicago Manual of Style, but any citation guidelines may be used.
Datasets created from HathiTrust Digital Library collections can be cited as datasets, for example:
Stevens, G. Early American Cookbooks. December 2016. Distributed by the HathiTrust Digital Library. https://babel.hathitrust.org/cgi/mb?a=listis&c=1934413200.
For algorithms, you can simply cite HTRC Analytics as a website, along with the name of the algorithm, for example:
“HTRC Analytics.” Named Entity Recognizer (v2.0). Accessed February 16, 2022. https://analytics.hathitrust.org/algorithms.
HTRC Data Capsules can be cited in a similar manner as Algorithms above, for example:
HTRC Data Capsules. Accessed February 16, 2022. https://analytics.hathitrust.org/capsules.
For its derived datasets, the HTRC tries to provide a citation for each, including a DOI. For example, for the Extracted Features Dataset:
Jacob Jett, Boris Capitanu, Deren Kudeki, Timothy Cole, Yuerong Hu, Peter Organisciak, Ted Underwood, Eleanor Dickson Koehl, Ryan Dubnicek, J. Stephen Downie (2020). The HathiTrust Research Center Extracted Features Dataset (2.0). HathiTrust Research Center. https://doi.org/10.13012/R2TE-C227
If I upload my own data to a Data Capsule, is it secure?
If you require help, either accessing HTRC or with its contents, please feel free to contact us at the Map & Data Library. You can either email email@example.com or reach out with the MDL contact form.
There are a number of resources available at the University of Toronto for learning and working with tools for text analysis, as well as the broader field of text and data mining. If you would like to learn more, see this link.