Digital text corpora at Cornell University Library

Cornell University Library hosts several spaces online and on-site for accessing digital text collections outside the catalog. 

Because some of these resources come in multimodal formats, you may need help from a Cornell University Library staff member to access some content. Ask a librarian if you have questions.

Historical texts in The Cornell University Library Digital Collections

Cornell University Library's Digital Collections features content in nearly a hundred collections that can be used for research and teaching, including text corpus creation. These resources can be accessed remotely.

Note that text resources may need to be processed with online Optical Character Recognition (OCR) software. Contact a librarian if you have questions about this process.

Learn more about select digital collections in the Digital Rare and Distinctive Collections at Cornell research guide.

Browse the full Digital Collections catalog for content of interest to your project.

Electronic Text Center

The Electronic Text Center at Olin Library (ETC) hosts databases of scholarly indexes, bibliographies, reference materials, and electronic text and multimedia documents. The ETC’s collections feature historical documents and content on disks, which may need particular software or systems to run and read.

The ETC has dedicated computer stations in the basement of Olin Library where you can access the databases that are available only on disk or drive. These computers require library staff support to access. To ask about staff availability for the stations, inquire in person at the Reference Desk or online via chat or email. Depending on the database you would like to work with, accessing the texts for computational analysis may require additional support from a library staff member. Learn more about accessing the Electronic Text Center’s collections.

The following databases are available online via the Electronic Text Center's Online Resources page:

  • The Bible in English: Contains 21 different versions of the English Bible: 13 complete Bibles, 5 texts of the New Testament only, 2 texts of just the Gospels, and William Tyndale's translation of the Pentateuch, Jonah, and New Testament.

  • The Database of African-American Poetry: Covers the works of 54 African-American poets writing in the 18th and 19th centuries and, through their writings, provides a unique portrait of early America.

  • Editions [and Adaptations] of Shakespeare: Contains 11 major editions from the First Folio of 1623 to the Cambridge edition of 1863-6, 28 separate contemporary printings of individual plays and poems, selected apocrypha and related works, 100 adaptations, sequels and burlesques from the 17th, 18th, and 19th centuries, including the whole of Bell's Acting Edition of Shakespeare's Plays (1774).

  • Goethes Werke: Consists of the Weimar Edition of Goethe's works, published between 1887 and 1919 by Hermann Böhlau (and Nachfolger); the Goethes Gespräche, edited by Woldemar Freiherr von Biedermann (Leipzig, 1889-1896); and all the letters discovered since the completion of the Weimar Edition: Goethes Werke, Nachträge zur Weimarer Ausgabe, edited by Paul Raabe (dtv, München, 1990).

  • Oxford English Dictionary, 2nd edition: Presents in alphabetical series the words that have formed the English vocabulary from the time of the earliest records down to the present day. This version of OED is the Second Edition from 1995.

  • Past Masters (Intelex): Consists of over 100 political and philosophical texts ranging from the works of ancient Greece to the early twentieth century. The collection is particularly strong in eighteenth- and nineteenth-century English philosophy.

  • Patrologia Latina Database: Contains 221 volumes and represents a complete electronic version of the first edition of Jacques-Paul Migne's Patrologia Latina (1844-1855 and 1862-1865).

Micrographic collections

Cornell University Library's Micrographic Collections provide access to historical documents available on microfilm and microfiche. Content in micrographic formats can be scanned as high-resolution images and then processed with OCR to make the texts computationally accessible. This is a time-consuming option that may not work for developing large-scale corpora, but it is a helpful alternative if you would like to analyze a corpus that might otherwise be restricted by electronic database licenses.
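
Text produced by OCR usually needs some tidying before it is useful for computational analysis. The sketch below is one possible cleanup pass, using only the Python standard library. It assumes you already have the OCR output as plain text; the function name clean_ocr_text and the specific cleanup rules are illustrative assumptions, not a Cornell University Library workflow.

```python
import re
import unicodedata


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR output into running text (hypothetical helper)."""
    # Normalize Unicode so ligatures such as "ﬁ" become plain "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "micro-\nfilm" -> "microfilm".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Keep blank lines as paragraph breaks, but unwrap single line breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "The micro-\nfilm reel was\nscanned.\n\nNew  para."
print(clean_ocr_text(sample))
```

Two caveats apply. The hyphen rule also joins compounds that were genuinely hyphenated at a line end (for example "well-\nknown" becomes "wellknown"), and NFKC normalization rewrites some historical characters such as the long s. For historical corpora you may prefer to drop or adjust those steps.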

There is no limit on the number of scans a patron can make from our collection. The microfilm readers have impressive and user-friendly features, including options for preferred file formats (e.g., TIFF), DPI resolution, and processing images in color, grayscale, or black and white. There are five micrograph reading stations that operate on a first-come, first-served basis. Their heaviest-trafficked times are during school breaks.
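
When choosing a DPI and color mode, it can help to estimate how much storage a batch of scans will need. The following is a rough sketch, assuming uncompressed images at 24 bits per pixel for color, 8 for grayscale, and 1 for black and white. The page dimensions and the function name are illustrative assumptions; actual files from the readers may be smaller if compression is applied.

```python
import math

# Assumed bit depths for each scan mode (uncompressed).
BITS_PER_PIXEL = {"color": 24, "grayscale": 8, "black_and_white": 1}


def uncompressed_scan_bytes(width_in: float, height_in: float,
                            dpi: int, mode: str) -> int:
    """Approximate size in bytes of one uncompressed scanned image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return math.ceil(pixels * BITS_PER_PIXEL[mode] / 8)


# Example: an 8.5 x 11 inch page image at 300 DPI in each mode.
for mode in BITS_PER_PIXEL:
    size_mb = uncompressed_scan_bytes(8.5, 11, 300, mode) / 1_000_000
    print(f"{mode}: about {size_mb:.1f} MB per page")
```

Multiplying the per-page figure by the number of frames you plan to scan gives an upper-bound estimate for the whole corpus.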