Optical Character Recognition (OCR)
Cornell University-affiliated researchers can scan text from physical collections at Cornell University Library using Optical Character Recognition (OCR) software. OCR software examines a scanned image or document for text and creates a machine-readable digital copy, often as a PDF, a Microsoft Word document or a Microsoft Excel spreadsheet.
OCR software can be a time-saving alternative to manually transcribing typed text into a word processor. But OCR software is not perfect. Depending on the image quality of the original document and the corresponding scans you make, OCR software may not produce completely accurate renditions of the original document. It also does not work well for handwritten documents.
Learn about the OCR software available at Cornell University Library and how to produce high-quality scans for best OCR results.
Extracting text from physical documents using OCR
- Create a digital image of the document. Depending on the size of the document, that may require scanning it or taking a high-quality photo. Keep in mind:
  - The clearer your image is, the better the text output will be
  - Scan at a high resolution (600 dpi is recommended, but 300 dpi works, too)
  - Ensure that there is no blur and that your text is in focus in the image
- Process the scanned images and perform OCR. Software available at Cornell University Library, such as ABBYY FineReader, can help you clean up the page by:
  - Straightening the text (deskewing)
  - Removing the curvature of book pages (de-warping)
  - Removing speckles (de-speckling)
  - Cropping and rotating
  - Adjusting text layout
- Evaluate: How accurately did the OCR perform?
  - If the output contains many errors, try rescanning the document. Review the best practices for OCR and note any limitations of the original document you are working with.
  - If problems persist, contact the Digital CoLab for support.
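One way to put a number on the evaluation step is the character error rate: the number of single-character edits needed to turn the OCR output into a hand-checked transcription, divided by the length of that transcription. The following is a minimal Python sketch, not a feature of any software named here. The function names and sample strings are illustrative assumptions, and it assumes you have manually transcribed a short passage to compare against.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions that turn hypothesis into reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or misread character
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical example: the OCR engine misread two characters.
reference = "Optical Character Recognition"
ocr_output = "Optica1 Character Recogniti0n"
print(f"Character error rate: {character_error_rate(reference, ocr_output):.1%}")
```

A rate near zero suggests the scan is usable as is. What counts as a significant amount of error depends on your project, so treat any cutoff as a judgment call.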
Best practices for OCR
Keep in mind the following tips for successful OCR scanning:
- Scan the document at a high resolution (600 dpi is recommended, but 300 dpi works, too)
- Scan documents with 10-point text size or larger
- Use grayscale instead of black and white
- Ensure the lines of text are parallel with the bottom of the page
- Avoid screen captures (images created using the “print screen” function), which do not perform well with OCR
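To see why resolution and text size matter together, it can help to estimate how many pixels tall a line of type is in the scan. The sketch below is a rough illustration in Python. It relies on the standard definition of a point as 1/72 of an inch; the function names and the sample page dimensions are assumptions chosen for the example. Point size measures the full type body, so actual letter heights are somewhat smaller than the figures it prints.

```python
POINTS_PER_INCH = 72  # one typographic (PostScript) point is 1/72 of an inch


def text_height_in_pixels(point_size: float, dpi: float) -> float:
    """Approximate height, in pixels, of type set at point_size when scanned at dpi."""
    return point_size / POINTS_PER_INCH * dpi


def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Resolution implied by an image's pixel width and the physical page width."""
    return pixel_width / page_width_inches


for dpi in (300, 600):
    height = text_height_in_pixels(10, dpi)
    print(f"10-point text at {dpi} dpi is about {height:.0f} pixels tall")

# Hypothetical photo of an 8.5-inch-wide page that is 1275 pixels across.
print(f"Effective resolution: {effective_dpi(1275, 8.5):.0f} dpi")
```

In this hypothetical case the photo works out to 150 dpi, below the 300 dpi guideline, so rescanning at a higher resolution would be worth trying.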
OCR software may have trouble reading the document if:
- The document was scanned at a low resolution, upside down or sideways
- The document has lines, stains or other marks that obscure words
- The document’s font spacing is tight and the letters run close together
OCR software at Olin Library
You can use Adobe Acrobat Pro on any Cornell University Library computer to convert image files to PDFs and perform OCR on the PDF versions of your documents.
There is one ABBYY FineReader-equipped scanning station on Olin Library’s main floor, along the east wall near the Olin 107 Reading Room. The software is installed on the computer station with a ScanSnap scanner. The station operates on a first-come, first-served basis. We recommend coming to Olin Library in the morning to use this scanning station. Learn more about the scanners and photocopiers in Olin & Uris Libraries.
More OCR software options
Two high-capacity OCR engines to know are Tesseract and ABBYY FineReader. Learn more about their advantages and limitations below:
OCR engine | Advantages | Limitations |
---|---|---|
Tesseract | | |
ABBYY FineReader | | |
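For readers who want to try Tesseract on their own computer, the sketch below shows one way to assemble a command-line call from Python. It assumes a recent Tesseract release is installed and on your PATH (older releases may not support the `--dpi` option). The helper name and the file names are hypothetical. The call itself is left commented out so the snippet runs even without Tesseract present.

```python
import shlex
import subprocess


def build_tesseract_command(image_path, output_base, language="eng",
                            dpi=300, output_format="pdf"):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG --dpi N FORMAT."""
    return [
        "tesseract",
        image_path,         # scanned page image
        output_base,        # Tesseract appends the extension, e.g. ".pdf"
        "-l", language,     # language of the text in the document
        "--dpi", str(dpi),  # resolution of the input image in dpi
        output_format,      # "pdf" for a searchable PDF, "txt" for plain text
    ]


# Hypothetical file names for a page scanned at 600 dpi.
command = build_tesseract_command("page_001.png", "page_001", dpi=600)
print(shlex.join(command))

# Uncomment to run OCR (requires Tesseract to be installed):
# subprocess.run(command, check=True)
```

Building the command as a list, instead of a single string, avoids quoting problems when file names contain spaces.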