LibGuides: Text as Data: Finding and Mining: Home

Purpose of this guide

This guide is intended to support researchers in finding existing text datasets (also referred to as text corpora) or mining text for analysis, particularly in the humanities and interpretive social sciences.

Please contact the Digital CoLab at Olin Library if you have questions regarding text mining, locating text corpora, or preparing for computational text analysis.

Introduction to text analysis and locating text corpora

Computational text analysis is “the process of deriving information by way of statistical pattern learning” from a body of text, often called a corpus (plural: corpora). Text analysis methods are used to find patterns in collections of text that are too large to read and analyze by hand in a reasonable amount of time. Those texts can be found in many places, including:

Books (digital editions or print copies)
Newspaper articles (more information to come!)
Social media content (more information to come!)

Some texts are readily available digitally with full-text readable content (e.g., digital collections in HathiTrust). Other texts, such as physical collections, may need to be scanned and processed with Optical Character Recognition (OCR) software, and others still may need to be collected programmatically through an API (e.g., tweets from Twitter).
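Once texts exist as machine-readable files, whether downloaded from a digital collection or produced by OCR, a common first step is to gather them into a single corpus. The following is a minimal Python sketch, assuming the texts have already been saved as UTF-8 plain-text (.txt) files in one folder. The function name load_corpus and the sample file names are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in `folder` into a dict keyed by file name (without extension)."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.stem] = path.read_text(encoding="utf-8")
    return corpus


# Demonstration only: a temporary folder with two invented files
# stands in for a real collection of digitized texts.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "letter_01.txt").write_text(
        "Dear friend, the harvest was good.", encoding="utf-8"
    )
    Path(tmp, "letter_02.txt").write_text(
        "The winter came early this year.", encoding="utf-8"
    )
    demo_corpus = load_corpus(tmp)

print(sorted(demo_corpus))
```

Keeping each text under its file name makes it easier to trace any later result back to its source document. Real collections may use other encodings or file formats, so this sketch would need adjusting accordingly.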

For any texts that you work with, you should also consider copyright and license restrictions, which vary depending on where you collect the texts. For example, some newspapers and articles available in Cornell University Library’s databases have restrictions on what you can text mine from their collections and how.

The goal of this guide is to help you locate corpora for analysis while navigating these complex considerations.

Is text analysis right for your project?

If you are interested in finding patterns in a large volume of texts, text analysis may be the right methodology for your project. If you want to perform a close reading to derive meaning from a large body of texts, you might be better off using your skills to read and manually code the texts. If you are unsure whether text analysis is right for your project, contact the Digital CoLab for support.
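As a rough illustration of what "finding patterns" can mean at its simplest, the Python sketch below counts word frequencies across a small corpus. The three sample documents and the function names tokenize and word_frequencies are invented for this example, and the tokenizer is deliberately naive (lowercase ASCII letters and apostrophes only). Real projects typically use more careful preprocessing and more sophisticated methods.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration.
corpus = {
    "doc1": "The whale surfaced. The crew watched the whale.",
    "doc2": "A quiet sea, a quiet crew.",
    "doc3": "The sea took the ship.",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(documents):
    """Count how often each word appears across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


frequencies = word_frequencies(corpus)

# Show the three most frequent words and their counts.
for word, count in frequencies.most_common(3):
    print(word, count)
```

Counts like these are only a starting point. Interpreting them still depends on the research question and on knowledge of the texts themselves.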

Note that text analysis is one methodology for exploring a research question. To produce robust research, it is helpful to triangulate the results of any text analysis project with different data sources or methodologies.

Resources on text mining and analysis

Introduction to Cultural Analytics and Python (Melanie Walsh)
In this handy open source digital textbook, Melanie Walsh presents introductory Python tutorials and text analysis methods as quantitative approaches to cultural study. It's intended for learners with no prior programming experience. There are also resources for analyzing texts in non-English languages, including Spanish, Chinese, Russian, Portuguese and Danish.

Text Analysis with R for Students of Literature (2nd Ed.) by Matthew L. Jockers & Rosamond Thalken
ISBN: 9783030396435

Publication Date: 2020

Text Analysis with R provides a practical introduction to computational text analysis using the open source programming language R. Each chapter builds on its predecessor as readers move from small scale “microanalysis” of single texts to large scale “macroanalysis” of text corpora, and each concludes with a set of practice exercises that reinforce and expand upon the chapter lessons. The book’s focus is on making the technical palatable and making the technical useful and immediately gratifying. Text Analysis with R is written with students and scholars of literature in mind but will be applicable to other humanists and social scientists wishing to extend their methodological toolkit to include quantitative and computational approaches to the study of text.

Text As Data by Justin Grimmer; Margaret E. Roberts; Brandon M. Stewart
ISBN: 9780691207551

Publication Date: 2022-03-29

A guide for using computational text analysis to learn about the social world. From social media posts and text messages to digital government documents and archives, researchers are bombarded with a deluge of text reflecting the social world. This textual data gives unprecedented insights into fundamental questions in the social sciences, humanities, and industry. Meanwhile new machine learning tools are rapidly transforming the way science and business are conducted. Text as Data shows how to combine new sources of data, machine learning tools, and social science research design to develop and evaluate new insights. Text as Data is organized around the core tasks in research projects using text--representation, discovery, measurement, prediction, and causal inference. The authors offer a sequential, iterative, and inductive approach to research design. Each research task is presented complete with real-world applications, example methods, and a distinct style of task-focused research. Bridging many divides--computer science and social science, the qualitative and the quantitative, and industry and academia--Text as Data is an ideal resource for anyone wanting to analyze large collections of text in an era when data is abundant and computation is cheap, but the enduring challenges of social science remain. Available in print at Cornell University Library.

Humanities Data Analysis: Case Studies with Python by Folgert Karsdorp; Mike Kestemont; Allen Riddell
ISBN: 9780691172361

Publication Date: 2021-01-12

A practical guide to data-intensive humanities research using the Python programming language. The use of quantitative methods in the humanities and related social sciences has increased considerably in recent years, allowing researchers to discover patterns in a vast range of source materials. Despite this growth, there are few resources addressed to students and scholars who wish to take advantage of these powerful tools. Humanities Data Analysis offers the first intermediate-level guide to quantitative data analysis for humanities students and scholars using the Python programming language. This practical textbook, which assumes a basic knowledge of Python, teaches readers the necessary skills for conducting humanities research in the rapidly developing digital environment. Available as an open access textbook and in print at Cornell University Library.

Text Mining: A Guidebook for the Social Sciences by Gabe Ignatow; Rada Mihalcea
Call Number: H61.3.I395 2017

ISBN: 9781483369358

Publication Date: 2017

A SAGE Publications Research Methods resource, this work surveys various approaches to text mining from social sciences and humanities disciplinary perspectives. It covers the fundamentals of text mining and introduces methods for compiling and analyzing a corpus. Available online and in print editions at Cornell University Library.

Librarian

Iliana Burgos

she/her/ella

Contact:

Emerging Data Practices Librarian,
Digital Scholarship Services
Olin Library