Using Acrobat Pro DC for OCR
Overview
Most of us use PDF files all the time without giving them much thought. Everybody can open and view the content, and the format gives us a (mostly) consistent view regardless of device or software. That was Adobe’s mission when they created the PDF file format, according to their website:
That’s why we invented the Portable Document Format (PDF), to present and exchange documents reliably — independent of software, hardware, or operating system. . . .The goal was to enable anyone to capture documents from any application, send electronic versions of these documents anywhere, and view and print them on any machine.
Early PDFs were image-only. Think of a Polaroid snapshot of a page: the file stores a picture of the text and displays it. That means you can read the text, but you can’t select, search, or copy it, because there is no actual text in the file. Here is an example of an image-only PDF that I used in the demonstration video below.
For years, this was the way most PDF files were generated. Many journals, databases, and other online sites used image-only PDFs to digitize their content. Many PDFs are still generated as image-only, especially by free PDF creation tools. In fact, the default setting for our Xerox devices on campus is to generate image-only PDFs when scanning (it is faster).
The Problem
Usually that isn’t a problem in our classroom activities. However, there are some scenarios where image-only PDFs just do not work.
- Accessibility and Screen Readers
There is no text for screen readers to work with, which means people who rely on them cannot read the PDF file. That’s a major accessibility issue.
- Hypothesis (and the Canvas integration)
The Hypothesis integration cannot use image-only PDF documents. There is no text for the students to select and comment on.
- File sizes
Image-only PDFs are comparatively larger files.
How do you know if you have an image-only PDF? Open it up in the PDF reader of your choice and try to select a single word. Can you select one word and copy/paste it into another document?
- YES! Congratulations, you have an accessible PDF. You can stop now.
- NO! Your PDF is not accessible. Follow the steps below under The Solution.
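If you have a whole folder of PDFs to check, the select-a-word test above can be approximated in a script. The following is a minimal Python sketch, not an official Adobe or WFU tool. The function names `looks_image_only` and `extract_page_texts` are made up for illustration, and the second one assumes the third-party pypdf package is installed. It only reports whether any text can be extracted; it does not perform OCR, and it does not judge overall accessibility.

```python
from typing import Iterable, List


def looks_image_only(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True when no page yields at least `min_chars` visible characters.

    `page_texts` holds the text extracted from each page (one string per page).
    Whitespace is ignored, so a page containing only spaces counts as empty.
    A document with no pages at all is also reported as having no text.
    """
    for text in page_texts:
        visible = "".join((text or "").split())  # drop all whitespace
        if len(visible) >= min_chars:
            return False  # found real text, so this is not image-only
    return True


def extract_page_texts(pdf_path: str) -> List[str]:
    """Collect the extracted text of every page (assumes pypdf is installed)."""
    # Imported here so looks_image_only() can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

For example, if `looks_image_only(extract_page_texts("scan.pdf"))` returns `True` (with `scan.pdf` standing in for your own file), the file most likely needs OCR. Treat the result as a hint and confirm by opening the PDF and trying to select a word.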
The Solution
Image-only PDFs have to go through a process called Optical Character Recognition (OCR) to transform the image into actual text. At WFU we have the Adobe Creative Cloud suite. Creative Cloud includes Adobe Acrobat Professional, which has OCR capabilities. All WFU faculty, staff, and students are able to install Adobe Acrobat Professional as part of the Adobe Creative Cloud suite from software.wfu.edu.
After installing Adobe Creative Cloud and Adobe Acrobat Pro (note: this does not work with the free Adobe Reader – it does not have OCR capabilities):
- Open the image-only PDF with Adobe Acrobat Professional
- From the menu, select View > Tools > Scan & OCR > Open
- In the Scan & OCR interface, select the Recognize Text button, then In This File
- Accept the defaults (unless you need to specify a different language or change another setting)
- Wait while the OCR process runs
- Select File > Save As to save a new, OCR’d copy of the PDF
Depending on the number of pages and the amount of text it finds, the OCR process can take anywhere from a few seconds to many minutes. Once it has completed, use File > Save As so the OCR’d file is saved as a separate copy and the original stays unchanged. Here is the OCR version of the PDF from above, as a comparison.
A video walkthrough of this process is at https://youtu.be/xUL29y9dNFY