Linguistics corpora

The Linguistics Department at Cornell University features several text analysis resources. Learn more about these resources on the Department of Linguistics’ Language Corpora page.

Linguistic Data Consortium

The Cornell Linguistics Department, in collaboration with the Computer Science and Information Science Departments, provides free access to more than 800 language corpora from the Linguistic Data Consortium (LDC) and other sources. These text, audio, and video databases span more than 60 languages and are available to Cornell researchers and students for faculty-supervised, not-for-profit research. Learn more on the How to Access LDC Corpora page.

CQPweb Server

The CQPweb server was created by the Linguistics Department for conducting fast linguistic searches in very large corpora. It does this by running a query language over corpora that have been indexed in advance and annotated with information such as parts of speech. The interface is well suited to exploratory analysis, such as finding examples of a syntactic or semantic phenomenon you are interested in. There are also advanced modes involving statistics.
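To make the idea of querying a pre-indexed, annotated corpus concrete, here is a minimal toy sketch in Python. It is not how CQPweb is implemented; the corpus, the tag names, and the functions `build_index` and `find_sequence` are all hypothetical and exist only for illustration. The search it performs is roughly what a CQP-style query such as `[pos="JJ"] [pos="NN"]` expresses (an adjective followed by a noun), although the actual tag names depend on the tagset of the corpus you are searching.

```python
from collections import defaultdict

# Hypothetical toy corpus: each token is a (word, part-of-speech tag) pair.
CORPUS = [
    ("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("saw", "VBD"),
    ("a", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]


def build_index(corpus):
    """Map each (attribute, value) pair to the token positions where it occurs."""
    index = defaultdict(list)
    for position, (word, tag) in enumerate(corpus):
        index[("word", word)].append(position)
        index[("pos", tag)].append(position)
    return index


def find_sequence(corpus, index, pattern):
    """Return start positions where consecutive tokens satisfy the pattern.

    pattern is a list of (attribute, value) constraints, one per token,
    where attribute is either "word" or "pos".
    """
    if not pattern:
        return []
    hits = []
    # The index supplies candidates for the first token; the rest are verified.
    for start in index.get(pattern[0], []):
        if start + len(pattern) > len(corpus):
            continue
        matched = True
        for offset, (attribute, value) in enumerate(pattern):
            word, tag = corpus[start + offset]
            actual = word if attribute == "word" else tag
            if actual != value:
                matched = False
                break
        if matched:
            hits.append(start)
    return hits


INDEX = build_index(CORPUS)
MATCHES = find_sequence(CORPUS, INDEX, [("pos", "JJ"), ("pos", "NN")])
for start in MATCHES:
    print(" ".join(word for word, _ in CORPUS[start:start + 2]))
```

Because the index is built once, each query only has to inspect positions where the first constraint already holds instead of scanning every token, which is the general reason indexed corpora can be searched quickly.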

The system also features ready-to-use data, including 2 billion words of New York Times, Associated Press, and Agence France-Presse news text covering the 17 years from 1994 through 2010, extracted from the Annotated English Gigaword corpus.

View an introductory tutorial video on CQPweb to learn more about how the server works.

If you have a particular corpus you'd like to work with, or have additional questions, contact Dr. Mats Rooth, Professor of Linguistics, or Bruce McKee, Systems Administrator.

English Corpora (formerly BYU Corpora)

English Corpora, managed by Brigham Young University and formerly known as BYU Corpora, provides the following corpora for Cornell faculty-supervised, non-profit research:

  • Corpus of American Soaps: With over 100 million words of data from 22,000 transcripts of American soap operas from the early 2000s, this corpus is a useful resource for studying very informal language.

  • Corpus del Español: A resource containing about two billion words of Spanish, taken from about two million web pages from the past three to four years across 21 Spanish-speaking countries.

  • TV Corpus: This corpus contains 325 million words of data in 75,000 TV episodes from the 1950s to the present. Each of the 75,000 episodes is linked to its IMDb entry, which means that you can create Virtual Corpora using extensive metadata such as year, country, series, rating, genre, and plot summary.