Guide to Corpora and Text Analysis

Cornell has more than 880 Linguistic Data Consortium text, audio, and video corpora in more than 60 languages. Visit the LDC website for more information.

In addition, Brigham Young University hosts corpora which are available here. More details below.

Please visit the Department of Linguistics Language Corpora website for more information on how scholars can access these and other corpora.

For guidance on and research resources for text analysis, refer to this guide: Text as Data: Finding and Mining. This guide is intended to support researchers in finding already-existing text datasets (also referred to as text corpora) or mining text for analysis, particularly in the humanities and interpretive social sciences fields.

Text Analysis: Corpora

  • ARTFL
    A Cooperative Project of the Centre National de la Recherche Scientifique and the University of Chicago, ARTFL is a research tool for scholars and students in all areas of French Studies. It evolved from the construction of the dictionary Trésor de la Langue Française: Dictionnaire de la langue du XIXe et du XXe siècle, 1789-1960, publié sous la direction de Paul Imbs, Paris: Éditions du Centre national de la recherche scientifique, 1971-1994. 16 volumes. (Olin Reference ++ PC 2625 .I32)

    At present, ARTFL's main corpus, ARTFL-FRANTEXT, consists of nearly 3,000 texts, ranging from classic works of French literature to various kinds of non-fiction prose and technical writing. The eighteenth, nineteenth and twentieth centuries are about equally represented, with a smaller selection of seventeenth century texts as well as some medieval and Renaissance texts. In addition to FRANTEXT, ARTFL has built hundreds of databases for researchers and students working in specialized disciplines and languages other than French.
  • Corpas Na Gaeilge, see the Electronic Text Center information below.
  • Corpora. Brigham Young University.
    Free online searching of thirteen large corpora--collections of words--from Spanish, Portuguese, and various dialects of English. A table of the titles of the corpora, the number of words in each, and the dates covered is here.
  • Cornell NLP Linguistic Data Resources.
    The Linguistic Data Consortium (LDC) creates, collects, and distributes speech and text databases, lexicons, and other resources for research use, and develops tools to collect and organize linguistic data. It is a membership organization of universities and research laboratories, located at the University of Pennsylvania. Cornell is a member, and data sets can be obtained through the NLP Group. More information is on this wiki page: "This page is meant to list what corpora, software, and other resources we [NLP] have available and where they are." Cornell only.

The Electronic Text Center

Databases in the Electronic Text Center lists the full-text sources available in electronic form (online access, CD-ROMs, DVDs) selected for the Electronic Text Center over the years. Ask the reference staff in 104 Olin if you need assistance or more information.
Example of a corpus available on CD-ROM:

Corpas Na Gaeilge, 1600-1882 = The Irish Language Corpus. Baile Átha Cliath : Acadamh Ríoga na hÉireann, 2004.
(Olin Reference Disk PB 1345 .C67 2004) Shelved in the Electronic Text Center.
A searchable collection of printed texts in Irish, 1600-1882. Includes 705 texts consisting of prose, poetry, folklore, religious works, historical documents, translations, etc. Also includes an index with frequencies, a reverse index, a custom search facility, and an index nominum of 270,000 place and personal names. (With accompanying user's guide in Irish and English).
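
For readers unfamiliar with the two index types mentioned above, the following is a minimal Python sketch of the general idea, not a description of how the CD-ROM itself is built. It assumes that "index with frequencies" means a count of occurrences per word form and that "reverse index" means a word list alphabetized from the end of each word (useful for studying endings and suffixes). The function names and the short sample sentence are hypothetical and are not taken from the corpus.

```python
from collections import Counter
import re


def frequency_index(text):
    """Map each lowercased word form to the number of times it occurs."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


def reverse_index(words):
    """Return the distinct words sorted by their spelling read right to left."""
    return sorted(set(words), key=lambda w: w[::-1])


# Made-up sample sentence for illustration only.
sample = "an bhean agus an fear agus an leanbh"

freq = frequency_index(sample)
rev = reverse_index(freq)

for word, count in freq.most_common():
    print(word, count)
print(rev)
```

Sorting on the reversed spelling groups together words that share an ending, which is why such lists are often used for work on inflection and rhyme.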