LibGuides: Text as Data: Finding and Mining: Optical Character Recognition (OCR)

Optical Character Recognition (OCR)

Cornell University-affiliated researchers can scan text from physical collections at Cornell University Library using Optical Character Recognition (OCR) software. OCR software examines a scanned image or document for text and creates a machine-readable digital copy, often as a PDF, Microsoft Word document, or Microsoft Excel sheet.

OCR software can be a time-saving alternative to manually transcribing typed text into a word processor. But OCR software is not perfect. Depending on the image quality of the original document and the corresponding scans you make, OCR software may not produce completely accurate renditions of the original document. It also does not work well for handwritten documents.

Learn about the OCR software available at Cornell University Library and how to produce high-quality scans for best OCR results.

Scraping text from physical documents using OCR

Create a digital image of the document. Depending on the size of a document, that may require scanning or taking a high-quality photo of the document. Keep in mind:

The clearer your image is, the better the text output will be
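Higher resolution gives the OCR software more detail to work with (600 dpi is recommended, but 300 dpi works, too)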
Ensure that there is no blur and that your text is in focus on the image

Process the scanned images and perform OCR. Software available at Cornell University Library, like ABBYY FineReader, can help you edit the page by:

Straightening the text (deskewing)
Removing the curved shape of book pages (de-warping)
Removing speckles (de-speckling)
Cropping and rotating
Adjusting text layout
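
To give a sense of what de-speckling does, here is a toy sketch in Python. It is not how ABBYY FineReader implements the feature. It simply assumes a page already reduced to a grid of 0s (background) and 1s (ink) and clears any ink pixel that has no ink neighbors. The function name `despeckle` and the sample `page` are made up for illustration.

```python
def despeckle(bitmap):
    """Return a copy of a binary bitmap with isolated ink pixels removed.

    bitmap is a list of rows; 1 means ink, 0 means background.
    An ink pixel with no ink in the 8 surrounding cells is treated
    as a speckle and cleared. The input is not modified.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1:
                continue
            has_ink_neighbor = any(
                bitmap[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbor:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical 5 x 6 page: a 2 x 2 block of ink plus two stray specks.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
cleaned_page = despeckle(page)
for row in cleaned_page:
    print(row)
```

The 2 x 2 block survives because its pixels touch each other, while the two stray pixels are cleared. Real tools work on full-resolution images and use more careful rules, so treat this only as a picture of the idea.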

Evaluate: How accurately did the OCR perform?

If there was a significant amount of error, try rescanning the document. Consider the best practices for OCR and note if there are any limitations with the original document you are working with.
If problems persist, contact the Digital CoLab for support.
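
One common way to put a number on OCR accuracy is the character error rate (CER): the number of single-character edits needed to turn the OCR output into a hand-checked transcription, divided by the length of that transcription. The sketch below assumes you have typed a short reference passage yourself. The function names and sample strings are illustrative and do not come from any particular OCR package.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: a hand-typed reference and a flawed OCR result.
reference = "Optical Character Recognition"
ocr_output = "0ptical Charaeter Recognitlon"
cer = character_error_rate(reference, ocr_output)
print(f"Character error rate: {cer:.1%}")
```

Checking a page or two this way before scanning a whole volume can show whether rescanning or image cleanup is worth the effort.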

Best practices for OCR

Keep in mind the following tips on successful OCR scanning:

Scan the document at a high resolution (600 dpi is recommended, but 300 dpi works, too)
Scan documents with 10-point text size or larger
Use grayscale instead of black and white
Ensure the lines of text are parallel with the bottom of the page
Screen captures (images created using the “print screen” function) do not perform well with OCR

OCR software may have trouble reading the document if:

The document was scanned at a low resolution, upside down or sideways
The document has lines, stains or other marks that obscure words
The document’s font spacing is tight and the letters run close together
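
The resolution and text-size tips above are connected by simple arithmetic: a typographic point is 1/72 of an inch, so the nominal height of a line of type in pixels is the point size divided by 72, multiplied by the scan resolution. The sketch below works this out for the resolutions mentioned in this guide. The 96 dpi figure used for a screen capture is an assumption about a typical display, not a value from this guide, and the function name is illustrative.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def text_height_in_pixels(point_size, dpi):
    """Nominal height (the font's point size) expressed in scanned pixels."""
    return point_size / POINTS_PER_INCH * dpi


ASSUMED_SCREEN_DPI = 96  # assumption: a typical screen capture resolution

for label, dpi in (
    ("600 dpi scan", 600),
    ("300 dpi scan", 300),
    ("screen capture (assumed)", ASSUMED_SCREEN_DPI),
):
    height = text_height_in_pixels(10, dpi)
    print(f"10-point text, {label}: about {height:.1f} pixels tall")
```

Under these assumptions, 10-point text gets several times more pixels per character in a 300 or 600 dpi scan than in a screen capture, which is one plausible reason screen captures perform poorly.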

OCR software at Olin Library

You can use Adobe Acrobat Pro on any Cornell University Library computer to make image files into PDFs and perform OCR on the PDF versions of your documents.

There is one ABBYY FineReader-equipped scanning station on Olin Library’s main floor along the east wall near the Olin 107 Reading Room. The software is installed on the computer station with a ScanSnap scanner. The station operates on a first-come, first-served basis. We recommend coming to Olin Library in the morning to use this scanning station. Learn more about the scanners and photocopiers in Olin & Uris Libraries.

More OCR Software options

Two high-capacity OCR engines to know are Tesseract and ABBYY FineReader. Learn more about their advantages and limitations below:

**Quick reference chart of strengths and limitations of tools**, created by Cornell University Library staff member Michelle Paolillo.

**Source:** Guidance for Optical Character Recognition (OCR).

| OCR engine | Advantages | Limitations |
| --- | --- | --- |
| Tesseract | Open source (no fee to use); active community of developers; supports Unicode (UTF-8); OCR results for Fraktur (multiple languages) are very good | Has no graphical interface (command line only); input is limited to image formats and does not take PDF as source; no direct acquisition of scans; has some integrated image improvement features, but images may have to be optimized before OCR by other means |
| ABBYY FineReader | Graphical interface, easier to learn and use; editor window allows for editing of output; compare documents feature allows for side-by-side comparison; supports Unicode (UTF-8); accepts a wide set of input file formats (including PDF); support for ~190 different languages | Commercially available, requires purchase of a license; throttled throughput (single license limited to 5,000 pages per month); Fraktur available as a purchased add-on, and without it the engine gives poor results with Fraktur source material |
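
Because Tesseract is driven from the command line, it is often wrapped in a short script when many page images need processing. The sketch below assumes Tesseract is already installed and on your PATH, and that the English language data (`eng`) is available. The file names and the helper function names are hypothetical. It relies on Tesseract's basic invocation, `tesseract imagename outputbase -l lang`, which writes the recognized text to `outputbase.txt`.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for: tesseract imagename outputbase -l lang."""
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def run_tesseract(image_path, output_base, language="eng"):
    """Run Tesseract on one image and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, language),
        check=True,  # raise CalledProcessError if Tesseract reports a failure
    )
    return Path(f"{output_base}.txt")


# Hypothetical file names; replace with your own scanned page image.
command = build_tesseract_command("page_001.png", "page_001")
print(" ".join(command))
# To actually run OCR (requires Tesseract to be installed):
# text_file = run_tesseract("page_001.png", "page_001")
```

Since Tesseract takes image files rather than PDFs as input, a scanned PDF would first need to be converted to page images with another tool.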