LibGuides: Zotero at Cornell: Optical Character Recognition (OCR)

Optical Character Recognition (OCR)

Cornell University-affiliated researchers can scan text from physical collections at Cornell University Library using Optical Character Recognition (OCR) software. OCR software examines a scanned image or document for text and creates a text-readable digital copy, often as a PDF, Microsoft Word document or Microsoft Excel sheet.

OCR software can be a time-saving alternative to manually transcribing typed text into a word processor. But OCR software is not perfect. Depending on the image quality of the original document and the corresponding scans you make, OCR software may not produce completely accurate renditions of the original document. It also does not work well for handwritten documents.

Learn about the OCR software available at Cornell University Library and how to produce high-quality scans for best OCR results.

OCR software at Cornell University Library

Adobe Acrobat Pro: You can use Adobe Acrobat Pro on any Cornell University Library computer to convert image files into PDFs and perform OCR on the PDF versions of your documents. Acrobat Pro is available only to Cornellians who log in on the public computers in the libraries; you cannot access it by signing in to Adobe with your Cornell email on your own device.
ABBYY FineReader: This is one of the strongest OCR software options we have publicly available. ABBYY FineReader has almost 200 language settings and can produce high-quality results. It can be finicky, and bugs sometimes come up with large PDF files or poor-quality images. ABBYY FineReader is installed on a few library computers, including those in Olin Library B12, the Digital CoLab in Olin Library 703, and a few other library locations listed on this page (under "Assistive Technology in the Library").
Overhead Book Scanner at Mui Ho Fine Arts Library: If you want to scan physical documents with OCR, go to the Mui Ho Fine Arts Library and use the overhead scanner in the lobby. It has a built-in OCR feature, so you can produce a PDF scan with OCR in one step.

If you have more specific questions about your PDFs (each case can be a little different depending on image quality, language, etc.), contact digitalcolab@cornell.edu.

Scanning documents with OCR

Create a digital image of the document. Depending on the size of a document, that may require scanning or taking a high-quality photo of the document. Keep in mind:

The clearer your image is, the better the text output will be
Use a high resolution: 600 dpi is recommended, but 300 dpi works, too
Ensure that there is no blur and that your text is in focus on the image

Process the scanned images and perform OCR. Software available at Cornell University Library, like ABBYY FineReader, can help you edit the page by:

Straightening the text (deskewing)
Removing the curved shape of book pages (de-warping)
Removing speckles (de-speckling)
Cropping and rotating
Adjusting text layout
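
To make one of these cleanup steps concrete, here is a minimal sketch of de-speckling in Python. It is an illustration only, not how ABBYY FineReader works internally. It assumes the page has already been converted to a binary grid (1 for a dark pixel, 0 for background), and the function name `despeckle` and the sample `page` are hypothetical.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated dark pixels removed.

    `image` is a list of rows; 1 means a dark (ink) pixel, 0 means background.
    A dark pixel counts as a speckle when none of its 8 neighbors is dark.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_dark_neighbor = False
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the pixel itself
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1:
                        has_dark_neighbor = True
            if not has_dark_neighbor:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical 4x5 sample: one stray dot and one small stroke.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # isolated pixel: treated as a speckle
    [0, 0, 0, 1, 1],  # part of a stroke: kept
    [0, 0, 0, 1, 0],
]
cleaned_page = despeckle(page)
```

Real scans are grayscale and far larger, so dedicated tools use more robust filters; the sketch only shows the idea of removing marks too small to be part of a letter.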

Evaluate: How accurately did the OCR perform?

If there was a significant amount of error, try rescanning the document. Consider the best practices for OCR and note if there are any limitations with the original document you are working with.
If problems persist, contact the Digital CoLab for support.
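
One common way to put a number on OCR accuracy is the character error rate (CER): the edit distance between a hand-checked transcription and the OCR output, divided by the length of the transcription. The sketch below is one possible way to compute it in Python; the function names and the sample strings are assumptions made for illustration, not part of any library tool.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum single-character edits between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical sample: the OCR engine confused "l" and "i" with the digit "1".
reference_text = "Cornell University Library"
ocr_text = "Corne11 Univers1ty Library"
cer = character_error_rate(reference_text, ocr_text)  # 3 errors over 26 characters
print(f"Character error rate: {cer:.1%}")
```

Transcribing a short sample passage by hand and comparing it this way gives a rough sense of whether a rescan is worth the effort.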

Best practices for OCR

Keep in mind the following tips on successful OCR scanning:

Scan the document at a high resolution (600 dpi is recommended, but 300 dpi works, too)
Scan documents with 10-point text size or larger
Use grayscale instead of black and white
Ensure the lines of text are parallel with the bottom of the page
Avoid screen captures (images created using the “print screen” function), which do not perform well with OCR

OCR software may have trouble reading the document if:

The document was scanned at a low resolution, upside down or sideways
The document has lines, stains or other marks that obscure words
The document’s font spacing is tight and the letters run close together
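
The resolution and text-size tips above are related by simple arithmetic: a typographic point is 1/72 of an inch, so the scan resolution determines how many pixels each line of type receives. The Python sketch below works through that calculation. The function names and the page measurements are assumptions for illustration, and point size describes the nominal height of the font, so individual letters will be somewhat smaller.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def text_height_pixels(point_size, dpi):
    """Approximate pixel height of type of a given point size at a scan resolution."""
    return point_size / POINTS_PER_INCH * dpi


def effective_dpi(pixel_width, physical_width_inches):
    """Resolution of a photo or scan, given the real width of the page it shows."""
    return pixel_width / physical_width_inches


for dpi in (300, 600):
    height = text_height_pixels(10, dpi)
    print(f"10-point text at {dpi} dpi is about {height:.0f} pixels tall")

# Hypothetical example: an image 2550 pixels wide showing an 8.5-inch-wide page.
print(f"Effective resolution: {effective_dpi(2550, 8.5):.0f} dpi")
```

The second function is handy for photos taken with a camera or phone, where the resolution is not a scanner setting and has to be worked out from the image size.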

More OCR software options

Two high-capacity OCR tools to know are Tesseract and ABBYY FineReader. Learn more about their advantages and limitations below:

**Quick reference chart of strengths and limitations of tools**, adapted from a chart created by Cornell University Library staff member Michelle Paolillo.

**Source:** Guidance for Optical Character Recognition (OCR).
| OCR engine | Advantages | Limitations |
| --- | --- | --- |
| Tesseract | Open source (no fee to use); active community of developers; supports Unicode (UTF-8); OCR results for Fraktur (multiple languages) are very good | Has no graphical interface (command line only); input is limited to image file formats like .png and does not take PDF as source; no direct acquisition of scans; has some integrated image improvement features, but images may have to be optimized before OCR via other means |
| ABBYY FineReader | Available at select public computers across Cornell University Library locations (see our Disability Services page, under the "Assistive Technology at the Library" section); graphical interface, easier to learn and use; editor window allows for editing of output; compare documents feature allows for side-by-side comparison; supports Unicode (UTF-8); accepts a wide set of input file formats (including PDF); support for ~190 different languages | Commercially available, requires purchase of a license; throttled throughput (a single license is limited to 5000 pages per month); Fraktur available as a purchased add-on, and without it the engine gives poor results with Fraktur source material |
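
Because Tesseract is command line only, a small script can make it easier to run on a batch of page images. The Python sketch below builds a basic Tesseract command (input image, output base name, `-l` for the language, and the `pdf` output option for a searchable PDF) and runs it only if Tesseract is installed. The helper names and the file name `page_001.png` are hypothetical, and the `eng` language code is an assumption; choose the language data that matches your document.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, language="eng", output_format="pdf"):
    """Return the argument list for a basic Tesseract run.

    Tesseract appends the file extension itself, so `output_base`
    should not include one.
    """
    return ["tesseract", image_path, output_base, "-l", language, output_format]


def run_ocr(image_path, output_base, language="eng", output_format="pdf"):
    """Run Tesseract if it is installed; return the output path, or None if not."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract was not found on this computer
    command = build_tesseract_command(image_path, output_base, language, output_format)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return f"{output_base}.{output_format}"


# Hypothetical file name, used only to show what the command looks like.
command = build_tesseract_command("page_001.png", "page_001")
print(" ".join(command))
```

Calling `run_ocr("page_001.png", "page_001")` on a real image would then write `page_001.pdf` next to it, assuming Tesseract and the chosen language data are installed.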