Purpose of this guide
This guide is intended to support researchers in finding existing text datasets (also referred to as text corpora) or mining text for analysis, particularly in the humanities and interpretive social sciences.
Please contact the Digital CoLab at Olin Library if you have questions regarding text mining, locating text corpora, or preparing for computational text analysis.
Introduction to text analysis and locating text corpora
Computational text analysis is “the process of deriving information by way of statistical pattern learning” from a body of text, often called a corpus (plural: corpora). Text analysis methodologies are used to find patterns in amounts of text so large that reading and analyzing them by hand would be too time-intensive. Those texts can be found in many places, including:
- Books (digital editions or print copies)
- Newspaper articles (more information to come!)
- Social media content (more information to come!)
Some texts are readily available digitally with full-text readable content (e.g., digital collections in HathiTrust). Others may need to be scanned and processed with Optical Character Recognition (OCR) software (e.g., physical collections), and others still may need to be collected using an API (e.g., tweets from Twitter).
For any texts that you work with, you should also consider copyright and license restrictions, which vary depending on where you collect the texts. For example, some newspapers and articles available in Cornell University Library’s databases restrict what you may text mine from their collections and how you may do so.
The goal of this guide is to help you locate corpora for analysis while navigating these complex considerations.
Is text analysis right for your project?
If you are interested in finding patterns in a large volume of texts, text analysis may be the right methodology for your project. If you instead want to perform a close reading to derive meaning from a large body of texts, you might be better off reading and manually coding the texts yourself. If you are unsure whether text analysis is right for your project, contact the Digital CoLab for support.
Note that text analysis is one methodology for exploring a research question. To produce robust research, it is helpful to triangulate the results of any text analysis project with different data sources or methodologies.
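To give a sense of what "finding patterns" can mean in practice, here is a minimal Python sketch of one of the simplest text analysis techniques: counting how often each word occurs across a corpus, and in how many documents it appears. The three short documents and the function names (`tokenize`, `term_counts`, `document_frequency`) are invented for illustration only; a real project would load its texts from files and would likely need more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for this example.
# In a real project these strings would be read from text files.
corpus = {
    "doc_a": "The whale surfaced. The crew watched the whale.",
    "doc_b": "A storm rose, and the crew lowered the sails.",
    "doc_c": "The whale dove beneath the storm.",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


def document_frequency(documents):
    """Count how many documents each token appears in at least once."""
    doc_freq = Counter()
    for text in documents.values():
        doc_freq.update(set(tokenize(text)))
    return doc_freq


counts = term_counts(corpus)
doc_freq = document_frequency(corpus)

print(counts["whale"])    # total occurrences of "whale" in the corpus
print(doc_freq["whale"])  # number of documents that mention "whale"
```

Counts like these are only a starting point. Very common words such as "the" tend to dominate raw frequencies, which is one reason the results of a text analysis benefit from being checked against other sources or methods.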
Resources on text mining and analysis
- Introduction to Cultural Analytics and Python (Melanie Walsh): This resource, built with Jupyter Book and intended to engage learners with no prior programming experience, is a critical deep dive into what text analysis is and how to perform a variety of text mining and analysis techniques. The workbook features Python code snippets and exercises to put skills into practice. There are also resources for analyzing texts in non-English languages, including Spanish, Chinese, Russian, Portuguese, and Danish.
- Text Analysis with R for Students of Literature (2nd Ed.). ISBN: 9783030396435. Publication Date: 2020. Text Analysis with R provides a practical introduction to computational text analysis using the open-source programming language R. Each chapter builds on its predecessor as readers move from small-scale “microanalysis” of single texts to large-scale “macroanalysis” of text corpora, and each concludes with a set of practice exercises that reinforce and expand upon the chapter lessons. The book’s focus is on making the technical palatable and making the technical useful and immediately gratifying. Text Analysis with R is written with students and scholars of literature in mind but will be applicable to other humanists and social scientists wishing to extend their methodological toolkit to include quantitative and computational approaches to the study of text.
- Text mining: a guidebook for the social sciences. Call Number: H61.3.I395 2017. ISBN: 9781483369358. Publication Date: 2017. A SAGE Publications Research Methods resource, this work surveys various approaches to text mining from social sciences and humanities disciplinary perspectives. It covers the fundamentals of text mining and introduces methods for compiling and analyzing a corpus. Available online and in print editions at Cornell University Library.