LibGuides: Text as Data: Finding and Mining: HathiTrust Digital Library and Research Center

HathiTrust Digital Library

The HathiTrust Digital Library is a collaborative repository containing over 17.5 million titles (and counting) digitized by academic and research institutions across the U.S., Canada and Europe. HathiTrust Digital Library specializes in literature and U.S. Government document collections. Cornell University is a member institution, which allows Cornell students, faculty and staff to access books and create collections using their own free personal accounts. Creating a collection with the HathiTrust Digital Library is like creating your own text corpus, which can be used for computational text analysis via the HathiTrust Research Center. Collections can be made public for other researchers to access, as well.

HathiTrust Digital Library facilitates full-view access to items in the public domain and items allowing access through Creative Commons licensing. These items are marked as "Full View" in HathiTrust Digital Library's catalog. In-copyright materials may not be accessible in full view, but their text is fully searchable to users. This approach has been legally deemed transformative and compliant with copyright law (Authors Guild, Inc. v. HathiTrust, 2014). This ruling is especially helpful for computational text analysis research purposes.

Learn how to create a free user account and access items in HathiTrust's Digital Library collections in our HathiTrust research guide.

HathiTrust Research Center Analytics Portal

The HathiTrust Research Center (HTRC) facilitates large-scale computational analysis of the HathiTrust text corpus (including in-copyright materials). Analysis must comply with the Non-Consumptive Use Research Policy.

HTRC believes that the following text cleaning and analysis techniques are non-consumptive:

text extraction
textual analysis and information extraction
linguistic analysis
automated translation
image analysis
file manipulation
OCR correction
indexing and search

The HathiTrust Research Center (HTRC) Analytics portal features several services for facilitating computational text analysis. There is no cost to access the HTRC Analytics portal or to export the results of analyses. This sets HTRC apart from other service providers that place their texts for computational analysis behind paywalls (e.g., ProQuest's TDM Studio or NewsBank's text mining tool).

HathiTrust Research Center Analytics tools

You must be logged in to access the following tools available via the HathiTrust Research Center Analytics portal.

Easy-to-use tools:

HTRC+Bookworm

For a low-stakes analysis, explore the HTRC+Bookworm tool to create line graphs of word use trends within millions of HathiTrust volumes.

Point-and-click web algorithms

These tools can be run on corpora with 3,000 volumes or fewer (limit of 3 GB):

InPHO Topic Model Explorer: Trains unsupervised topic models of your corpus and allows you to export files containing the word-topic and topic-document distributions. You can even create an interactive visualization to accompany the data.
Named Entity Recognizer: Generates a list of the names of people and places, as well as times, percentages and monetary terms in your corpus. Keep in mind that not all the terms of interest may appear unless they are specified.
Token Count and Tag Cloud Creator: Identifies the tokens (words) that occur most often in a corpus and the number of times they occur. You can create a tag cloud visualization of the most frequently occurring words in a corpus, where each word is sized in proportion to the number of times it occurs.
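
To make the idea of token counting and proportional tag cloud sizing concrete, here is a minimal Python sketch. It is not the HTRC algorithm itself: the function names, the simple regular-expression tokenizer and the font-size range are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def token_counts(texts, top_n=5):
    """Count lowercase word tokens across all texts; return the top_n most frequent."""
    counts = Counter()
    for text in texts:
        # Assumption: a token is a run of letters or apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


def cloud_sizes(top_tokens, min_size=10, max_size=40):
    """Map each token to a font size proportional to its count (hypothetical scale)."""
    if not top_tokens:
        return {}
    highest = top_tokens[0][1]  # most_common() sorts by descending count
    return {
        token: max(min_size, round(max_size * n / highest))
        for token, n in top_tokens
    }


if __name__ == "__main__":
    corpus = ["The cat sat on the mat.", "The dog sat."]
    top = token_counts(corpus, top_n=2)
    print(top)
    print(cloud_sizes(top))
```

A real corpus would call for a more careful tokenizer and usually a stop-word list, since very common words such as "the" otherwise dominate the cloud.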

Tools requiring intermediate and advanced technical skills:

Extracted Features Dataset

A derived dataset consisting of metadata and data elements extracted from over 17 million volumes in the HathiTrust Digital Library, including both in-copyright and public domain materials. The features were extracted at the page level for every work in the corpus and include tokens tagged by part of speech, as well as information about headers, footers and page numbers. You can perform word counts and other analyses based on the tagged tokens, which supports essentially any bag-of-words approach to analysis. Keep in mind that the dataset may contain some errors, particularly for works with unclear text or unusual fonts (especially historical documents) that may have been misread during Optical Character Recognition (OCR).

Since programming knowledge is necessary to work with the Extracted Features dataset, there is a learning curve for researchers who do not know how to code. However, there are resources if you'd still like to use the dataset: HathiTrust provides scripts you can run from your computer to work with the Extracted Features dataset (see the HTRC Derived Datasets page) as well as tutorials for Extracted Features Use Cases and Examples.
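
As a rough illustration of the bag-of-words analysis described above, the following Python sketch sums page-level token counts into volume-level totals. The nested dictionary is a hypothetical, simplified stand-in for one volume's features; the key names ("pages", "body", "tokenPosCount") and the part-of-speech labels are assumptions for this example, so consult the HTRC documentation for the actual file schema before adapting it.

```python
from collections import Counter

# Hypothetical, simplified page-level features for a single volume.
sample_volume = {
    "pages": [
        {"body": {"tokenPosCount": {"Whale": {"NN": 2}, "the": {"DT": 3}}}},
        {"body": {"tokenPosCount": {"whale": {"NN": 1},
                                    "swims": {"VBZ": 1},
                                    "the": {"DT": 1}}}},
    ]
}


def bag_of_words(volume, section="body", keep_pos=None):
    """Sum page-level token counts into one case-folded bag of words.

    keep_pos: optional set of part-of-speech tags to keep (None keeps all).
    """
    totals = Counter()
    for page in volume["pages"]:
        token_pos = page.get(section, {}).get("tokenPosCount", {})
        for token, pos_counts in token_pos.items():
            for pos, n in pos_counts.items():
                if keep_pos is None or pos in keep_pos:
                    totals[token.lower()] += n
    return totals


if __name__ == "__main__":
    print(bag_of_words(sample_volume))                   # all tokens
    print(bag_of_words(sample_volume, keep_pos={"NN"}))  # nouns only
```

Because only counts are available and not word order, this kind of aggregation fits frequency-based methods but cannot recover phrases or sentences.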

Data Capsules

Data Capsules are designed for advanced researchers who need flexibility in their workflows. This service allows researchers to work with large datasets on secure, remote desktops. Researchers can import their own code and non-HathiTrust corpora into the Data Capsule environment. The results of analyses can also be exported from the secure environment. Note that an HTRC staff member reviews each export request to ensure that the exported data complies with the non-consumptive use policy. Learn how to access a Data Capsule in the HTRC Data Capsule Tutorial.

HathiTrust Help & News

HathiTrust's help pages

Explore HathiTrust’s help pages if you have general questions about using its services:

A User's Guide to HathiTrust
Getting started with the HathiTrust Research Center
HTRC User Getting Started FAQ

HathiTrust Newsletter

Subscribe to the low-traffic HathiTrust newsletter to stay updated on the latest news and features within HathiTrust Digital Library and HathiTrust Research Center.

Need additional support?

Contact the Digital CoLab or HTRC directly: htrc-help@hathitrust.org.

How Researchers are Using HathiTrust

Explore projects that have used HathiTrust Digital Library's collections and the HTRC Analytics resources:

History

Stevens, G. (2017). New Metadata Recipes for Old Cookbooks: Creating and Analyzing a Digital Collection Using the HathiTrust Research Center Portal. Code4Lib Journal, (37). https://journal.code4lib.org/articles/12548.

Information Science

Murdock J, Allen C, Börner K, Light R, McAlister S, Ravenscroft A, et al. (2017) Multi-level computational methods for interdisciplinary research in the HathiTrust Digital Library. PLoS ONE 12(9): e0184188. https://doi.org/10.1371/journal.pone.0184188

Linguistics

Soto-Corominas, A., De la Rosa, J., & Suárez, J. L. (2018). What Loanwords Tell Us about Spanish (and Spain). Digital Studies/le Champ Numérique, 8(1), 4. DOI: http://doi.org/10.16995/dscn.297

Musicology

Downie, J. S., Bhattacharyya, S., Giannetti, F., Koehl, E. D., & Organisciak, P. (2020). The HathiTrust Digital Library’s potential for musicology research. International Journal on Digital Libraries, 1-16. https://doi.org/10.1007/s00799-020-00283-7

Sample HTRC datasets

See sample literature datasets created and stored in the HathiTrust Research Center:

Ted Underwood, Boris Capitanu, Peter Organisciak, Sayan Bhattacharyya, Loretta Auvil, Colleen Fallaw, J. Stephen Downie (2015). Word Frequencies in English-Language Literature, 1700-1922 (0.2) [Dataset]. HathiTrust Research Center. http://dx.doi.org/10.13012/J8JW8BSJ.

Matthew Wilkens and Guangchen Ruan. “Geographic Locations in English-Language Literature, 1701-2011 (1.0) [Dataset].” HathiTrust Research Center. https://doi.org/10.13012/2K5C-RF13