Purpose of this guide

This guide supports researchers in finding existing text datasets (also called text corpora) or mining text for analysis, particularly in the humanities and interpretive social sciences.

Please contact the Digital CoLab at Olin Library if you have questions regarding text mining, locating text corpora, or preparing for computational text analysis.

Introduction to text analysis and locating text corpora

Computational text analysis is “the process of deriving information by way of statistical pattern learning” from a body of text, often called a corpus (plural: corpora). Text analysis methods are used to find patterns in collections of text that are too large to read and analyze by hand. Those texts can be found in many places, including:

  • Books (digital editions or print copies)
  • Newspaper articles (more information to come!)
  • Social media content (more information to come!)

Some texts are readily available digitally with full-text readable content (e.g., digital collections in HathiTrust). Others must first be scanned and processed with Optical Character Recognition (OCR) software (e.g., physical collections), and others still must be collected programmatically through an API (e.g., tweets from Twitter).
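To make the idea of "finding patterns" in a corpus more concrete, here is a minimal sketch in Python that counts word frequencies across a tiny, made-up corpus. The three sample documents, the `tokenize` helper, and the variable names are illustrative assumptions, not part of any particular tool or of the Digital CoLab's workflow. A real project would load full-text files and would likely use a dedicated text analysis library.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, used only for illustration.
# A real project would read the text from files or a database.
corpus = {
    "doc1": "The whale surfaced. The whale dove again.",
    "doc2": "A ship followed the whale across the sea.",
    "doc3": "The sea was calm, and the ship sailed on.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# term_counts: how often each word occurs across the whole corpus.
# doc_counts: how many documents contain each word at least once.
term_counts = Counter()
doc_counts = Counter()
for text in corpus.values():
    tokens = tokenize(text)
    term_counts.update(tokens)
    doc_counts.update(set(tokens))

# Show the two most frequent words and the number of documents mentioning "whale".
print(term_counts.most_common(2))
print(doc_counts["whale"])
```

Counting words is only the simplest kind of pattern. The same structure (tokenize each text, then aggregate across the corpus) underlies many more sophisticated methods.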

For any texts that you work with, you should also consider copyright and license restrictions, which vary depending on where you collect the texts. For example, some newspapers and articles available in Cornell University Library’s databases restrict what you can text mine from their collections and how you can do it.

The goal of this guide is to help you locate corpora for analysis while navigating these complex considerations.

Is text analysis right for your project?

If you are interested in finding patterns in a large volume of texts, text analysis may be the right methodology for your project. If you instead want to perform a close reading to derive meaning from a body of texts, you might be better off reading and manually coding the texts yourself. If you are unsure whether text analysis is right for your project, contact the Digital CoLab for support.

Note that text analysis is one methodology for exploring a research question. To produce robust research, it is helpful to triangulate the results of any text analysis project with different data sources or methodologies.

Resources on text mining and analysis