Checkpoint 08: OCR Validation

When PDFs are created from scanned documents, Optical Character Recognition (OCR) converts images of text into actual text characters. This OCR-generated text must be accurate and properly tagged to be accessible.

What This Means

Many PDFs originate as scanned paper documents. A simple scan creates a PDF containing only images; the text visible in those images is not real, selectable text. OCR technology analyzes these images and attempts to recognize the letters, words, and paragraphs.

For a scanned PDF to be accessible:

OCR must be applied: Converting image-of-text into actual text
OCR must be accurate: The recognized text must correctly match the original
OCR text must be tagged: The recognized text needs proper structure tags

Without proper OCR:

Screen readers see only images, not text
Users cannot search the document
Text cannot be selected or copied
The document is completely inaccessible to many users

Even with OCR, if the recognition is inaccurate, users receive wrong information. A financial report with OCR errors could show "$1,000" as "$7,000" or transpose digits in account numbers.

Why It Matters

Scanned documents are common in business, government, and education:

Legacy documents digitized from paper archives
Signed contracts and legal documents
Historical records and archives
Forms filled out by hand
Documents received via fax or physical mail

For organizations digitizing paper archives, proper OCR is essential:

Legal compliance: Many accessibility laws require text alternatives for images of text
Information accuracy: OCR errors can change meaning and cause serious problems
Searchability: Users need to find information within large document collections
Content reuse: Text must be selectable for copying, quoting, or translation

A scanned PDF without OCR is like a photograph of a book page; you can see it, but a screen reader cannot read it. OCR transforms that photograph into readable text.

Common Violations

The Matterhorn Protocol defines two failure conditions for OCR content. Both require human testing because automated tools cannot verify text accuracy or check that all recognized text is tagged.

08-001: OCR-Generated Text Contains Significant Errors (Human Testing)

What's Wrong: The OCR process has produced text that does not accurately represent the original document. Errors may include:

Misrecognized characters (0 vs O, 1 vs l, m vs rn)
Missing words or paragraphs
Jumbled word order
Incorrect punctuation
Wrong numbers or dates
Garbled special characters
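
The digit and letter confusions listed above can be partly screened for before a human review. The following is a minimal sketch, assuming the PDF text has already been extracted to a plain string; the function name and the set of look-alike letters are illustrative choices, not part of the Matterhorn Protocol. It only narrows down where to look and does not replace human testing.

```python
import re

# Letters that OCR engines often emit in place of the digits 0 and 1
# (an illustrative set, chosen for this sketch).
DIGIT_LOOKALIKES = set("OoIl")


def flag_confusable_tokens(text):
    """Return tokens that mix digits with letters resembling 0 or 1."""
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_digit = any(ch.isdigit() for ch in token)
        has_lookalike = any(ch in DIGIT_LOOKALIKES for ch in token)
        only_digits_or_lookalikes = all(
            ch.isdigit() or ch in DIGIT_LOOKALIKES for ch in token
        )
        # "2O24" qualifies; "B12" and "Oslo" do not.
        if has_digit and has_lookalike and only_digits_or_lookalikes:
            flagged.append(token)
    return flagged


sample = "Invoice 2O24-l7 was issued in 2024 for room B12 in Oslo."
print(flag_confusable_tokens(sample))  # ['2O24', 'l7']
```

A token such as "$70,000" in place of "$10,000" passes this screen untouched, which is why the comparison against the original document remains necessary.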

How to Identify:

Select and copy text from the PDF, paste into a text editor
Compare the extracted text against the visual document
Look for obvious errors, especially in:
- Numbers and dates
- Names and proper nouns
- Technical terms
- Tables and columns
- Text near images or in margins
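
When a trusted transcription of a sample page is available, the comparison step can be assisted with a word-level diff. This is a sketch under that assumption, using Python's standard `difflib` module; `reference` and `ocr_text` are hypothetical inputs (a manually typed passage and the text copied out of the PDF).

```python
import difflib


def word_differences(reference, ocr_text):
    """Return (reference_words, ocr_words) pairs wherever the two texts disagree."""
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    differences = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op != "equal":
            differences.append(
                (" ".join(ref_words[a_start:a_end]),
                 " ".join(ocr_words[b_start:b_end]))
            )
    return differences


reference = "Payment of $10,000 is due to John Smith by 2024"
ocr_text = "Payment of $70,000 is due to Jolm Srnith by 2O24"
for expected, found in word_differences(reference, ocr_text):
    print(f"expected {expected!r}, OCR has {found!r}")
```

The diff reports missing and extra words as well as substitutions, so dropped paragraphs show up as pairs with an empty OCR side.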

Impact Examples:

"$10,000" recognized as "$70,000"
"John Smith" recognized as "Jolm Srnith"
"2024" recognized as "2O24"
Phone numbers with transposed digits
Medical terms misspelled in ways that could be dangerous

Quality Thresholds: While PDF/UA does not define a specific accuracy percentage, significant errors include:

Any error that changes meaning
Errors in critical data (names, numbers, dates)
Patterns of errors that reduce comprehension
Text so garbled it cannot be understood
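
Character error rate (CER) is a metric commonly used to summarize OCR quality; it is not defined by PDF/UA or the Matterhorn Protocol. The sketch below, which assumes a reference transcription is available, computes it with a plain Levenshtein distance and illustrates why a percentage alone cannot decide whether errors are significant.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_text):
    """Edit distance divided by reference length; 0.0 means an exact match."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_text) / len(reference)


rate = character_error_rate("Total due: $10,000", "Total due: $70,000")
print(f"{rate:.1%}")  # one wrong character out of 18
```

Here a single substituted character gives a low error rate, yet it changes the amount by $60,000. This is the kind of error the list above treats as significant regardless of overall accuracy.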

08-002: OCR-Generated Text Is Not Tagged (Human Testing)

What's Wrong: OCR has been applied and text exists in the PDF, but this text is not included in the document's tag structure. The text may be selectable but is invisible to assistive technology navigating by tags.

How to Identify:

Open the Tags panel in Acrobat
Compare tagged content to visible content
Select text on the page and check if it appears in the tag tree
Run accessibility checker looking for untagged content
Use screen reader to verify all text is announced
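
The comparison of tagged content to visible content can be approximated in code once both texts are available as strings. This sketch assumes `page_text` was copied from the page and `tagged_text` was collected from the tag tree by some other means; both names are hypothetical and no PDF library is involved.

```python
from collections import Counter


def words_missing_from_tags(page_text, tagged_text):
    """Return words that occur more often on the page than in the tagged content."""
    page_counts = Counter(page_text.split())
    tagged_counts = Counter(tagged_text.split())
    # Counter subtraction keeps only words with a positive remaining count.
    return dict(page_counts - tagged_counts)


page_text = "Annual Report 2024 Revenue grew 12% Footnote: unaudited figures"
tagged_text = "Annual Report 2024 Revenue grew 12%"
print(words_missing_from_tags(page_text, tagged_text))
```

Any words reported here are candidates for untagged OCR text and should be located in the Tags panel by hand.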

Common Causes:

OCR applied after tagging, without re-tagging
OCR tool that does not create tags
Partial OCR coverage (some pages processed, others not)
Tagged structure does not encompass OCR layer
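
Partial OCR coverage is one cause that lends itself to a quick screen. The sketch below assumes the text of each page has already been extracted into a list of strings (one entry per page); the `min_chars` threshold is an arbitrary illustrative value, and genuinely blank pages will also be reported.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


page_texts = [
    "Chapter 1. Introduction to the archive project.",
    "",
    "   ",
    "Chapter 2. Scanning procedures and quality control.",
]
print(pages_without_text(page_texts))  # [2, 3]
```

Pages flagged this way should be opened and checked visually: if text is visible but none was extracted, OCR was probably skipped for that page.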

How to Fix in Adobe Acrobat

Adobe Acrobat Pro includes OCR capabilities and tools for correcting recognition errors.

Applying OCR to a Scanned Document

Open the scanned PDF in Adobe Acrobat Pro
Go to Tools > Scan & OCR
Click Recognize Text > In This File
Configure settings:
- Language: Select the document's primary language
- Output: Choose "Searchable Image" or "Searchable Image (Exact)"
- Downsample: Set based on quality needs
Click Recognize Text
Wait for processing to complete

OCR Settings for Best Results

In Scan & OCR tool, click Settings
Document Language: Match the document's language exactly
PDF Output Style:
- "Searchable Image": Places invisible text behind image (preserves appearance)
- "Searchable Image (Exact)": Higher fidelity but larger file
- "Editable Text and Images": Converts to editable content (may change appearance)
Downsample To: Applied to the page images after OCR completes; a higher DPI keeps more image detail but produces a larger file

Reviewing and Correcting OCR Errors

After OCR completes, go to Tools > Scan & OCR
Click Recognize Text > Correct Recognized Text
Acrobat highlights suspected errors
For each highlighted word:
- Compare the OCR text to the image
- Click on the text to edit it
- Type the correct text
- Move to next suspect
Review entire document, not just flagged items

Manual Text Correction

For documents with many errors:

Go to Tools > Edit PDF
Click on text areas to edit
Correct errors directly
Be aware this may affect the underlying image layer

Ensuring OCR Text Is Tagged

After OCR, run automatic tagging:
- Go to Tools > Accessibility
- Select Autotag Document
Or use the Reading Order tool:
- Go to Tools > Accessibility > Reading Order
- Draw regions around content
- Assign appropriate tags (text, heading, table, etc.)
Verify tags include OCR text:
- Open Tags panel
- Expand tags and verify content matches OCR text

Re-tagging After OCR

If OCR was applied to an already-tagged document:

Remove existing tags:
- In Tags panel, select root tag
- Right-click > Delete Tag (keep content)
Re-run autotagging
Or manually rebuild tag structure with Reading Order tool

How to Improve OCR Quality

Source Document Quality

Better scans produce better OCR:

Resolution: Scan at 300 DPI minimum; 600 DPI for fine print
Contrast: Ensure good contrast between text and background
Alignment: Keep pages straight; skewed text reduces accuracy
Cleanliness: Remove dust, marks, and stains before scanning
Lighting: Even lighting without shadows or hot spots
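
The resolution guidance above follows from simple arithmetic: one point is 1/72 inch, so scan resolution determines how many pixels are available for each character. The helper names below are illustrative, and the result is the nominal font height (individual glyphs are smaller).

```python
POINTS_PER_INCH = 72


def text_height_px(point_size, dpi):
    """Nominal font height in pixels for a given point size and scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a scanned page measured in inches."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (300, 600):
    print(dpi, "DPI:", text_height_px(6, dpi), "px for 6 pt text,",
          page_pixels(8.5, 11, dpi), "px for a US Letter page")
```

Doubling the resolution doubles the pixel height of fine print but quadruples the number of pixels per page, which is the trade-off between accuracy and file size.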

Pre-Processing Techniques

Before running OCR:

Deskew: Straighten rotated pages
- In Acrobat: Tools > Scan & OCR > Enhance > Deskew
Despeckle: Remove noise and artifacts
- In Acrobat: Tools > Scan & OCR > Enhance > Despeckle
Sharpen: Improve text edge definition
Adjust contrast: Make text darker, background lighter
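
To make the contrast and despeckle steps concrete, here is a toy illustration on a small grid of grayscale values. It is not how Acrobat implements these operations; the fixed threshold of 128 and the "no neighbouring ink" rule are simplifying assumptions for the sketch.

```python
def binarize(gray, threshold=128):
    """Map grayscale values (0-255) to 1 for ink (dark) and 0 for background."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def despeckle(binary):
    """Clear ink pixels that have no ink among their eight neighbours."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            has_neighbour = any(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0  # isolated speck
    return cleaned


gray = [
    [250, 250, 250, 250, 250],
    [250,  20,  30, 250, 250],
    [250,  25, 250, 250, 250],
    [250, 250, 250, 250,  10],  # lone dark pixel in the corner
]
for row in despeckle(binarize(gray)):
    print(row)
```

The connected dark pixels survive while the lone pixel in the corner is cleared. Real tools use more careful rules so that small marks such as periods and diacritics are not erased.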

Language and Font Considerations

Correct language: Always select the right OCR language
Multiple languages: Text outside the selected OCR language is recognized less reliably; review those passages manually
Unusual fonts: Decorative or handwritten fonts reduce accuracy
Small text: Very small text may not be recognized well
Damaged text: Faded or damaged text needs manual review

Testing Your Fix

Manual Text Verification

Sample testing: For long documents, thoroughly check representative sections
Critical content: Always verify:
- Document title and headings
- Names of people and organizations
- Numbers: dates, amounts, phone numbers, IDs
- Tables and data
- Legal or technical terms
Copy-paste test:
- Select all text (Ctrl/Cmd + A)
- Copy and paste into a text editor
- Review for obvious errors and garbled text
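
After the copy-paste test, the critical values can be pulled out of the pasted text so that each one is checked against the page image. This is a sketch with deliberately narrow, US-style patterns chosen for illustration; real documents will need patterns adapted to their own formats.

```python
import re

# Illustrative patterns only: dollar amounts, MM/DD/YYYY dates,
# and NNN-NNN-NNNN phone numbers.
PATTERNS = {
    "amount": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def critical_values(text):
    """Collect amounts, dates, and phone numbers so a reviewer can check each one."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


sample = "Invoice dated 03/15/2024 for $1,250.00. Call 555-867-5309 with questions."
for kind, values in critical_values(sample).items():
    print(kind, values)
```

A value that OCR has mangled (for example a date containing the letter O) will not match its pattern at all, so a lower count than expected is itself a signal to inspect the page.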

Comparison Workflow

Print or display the original image
Read the OCR text aloud or compare side-by-side
Mark any discrepancies
Correct errors in the PDF
Re-verify corrected sections

Tag Verification

Open View > Show/Hide > Navigation Panes > Tags
Expand the tag tree
Click on tags to highlight corresponding content
Verify all visible text is tagged
Check that tagged text matches OCR output

Screen Reader Testing

Open the PDF with NVDA, JAWS, or VoiceOver
Read through the document
Listen for:
- Garbled or nonsensical words (OCR errors)
- Sections of silence where text is visible (untagged content)
- Correct pronunciation of names and terms
Navigate by headings and paragraphs to ensure structure is preserved

Automated Checking

Adobe Acrobat:

Go to Tools > Accessibility > Accessibility Check
Run full check
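Review the "Document" category (including the "Image-only PDF" rule) and the "Page Content" category (including the "Tagged content" rule)
Note: The checker cannot assess OCR accuracy; it reports structural issues only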

PAC (PDF Accessibility Checker):

Run PDF/UA validation
Check for untagged content warnings
Review structure for complete coverage

Validation Checklist

OCR has been applied to all scanned pages
Text is selectable throughout the document
OCR text matches the original with acceptable accuracy
Critical data (names, numbers, dates) is verified correct
All OCR text is included in tag structure
Screen reader can access all text content
Document structure (headings, lists, tables) is preserved
No significant passages are missing or garbled

Working with Different Document Types

Simple Text Documents

Standard letters, memos, reports
Usually high OCR accuracy
Focus on numbers and proper nouns

Complex Layouts

Multi-column documents, newsletters, brochures
May require manual region definition
Check column reading order carefully
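
The expected reading order for a simple multi-column page can be sketched as a sort: assign each text block to a column by its horizontal centre, then read each column top to bottom. The block format `(text, x_left, y_top, width)` and the top-left coordinate origin are assumptions made for this example; PDF page coordinates normally start at the bottom-left.

```python
def column_reading_order(blocks, page_width, columns=2):
    """Sort text blocks column by column, top to bottom within each column.

    Each block is (text, x_left, y_top, width); y grows downward.
    """
    column_width = page_width / columns

    def sort_key(block):
        _text, x_left, y_top, width = block
        centre = x_left + width / 2
        column = min(int(centre // column_width), columns - 1)
        return (column, y_top)

    return [block[0] for block in sorted(blocks, key=sort_key)]


blocks = [
    ("Right column, first paragraph", 320, 50, 260),
    ("Left column, first paragraph", 20, 50, 260),
    ("Left column, second paragraph", 20, 300, 260),
    ("Right column, second paragraph", 320, 300, 260),
]
for text in column_reading_order(blocks, page_width=600):
    print(text)
```

Comparing this expected order with the order in the tag tree highlights pages where OCR read straight across both columns. Full-width headlines and sidebars break the simple column assumption and still need manual region definition.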

Tables and Forms

Tables often have OCR and tagging challenges
Verify cell contents and structure
Check that form labels match fields
Numbers in tables need careful verification
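
Where a table prints its own total, a consistency check can catch digit errors that look plausible on their own. The sketch below assumes the cell texts have been copied out of the PDF as strings in a dollar format; the function names are illustrative.

```python
from decimal import Decimal


def parse_amount(cell):
    """Convert a cell such as '$1,250.00' to a Decimal."""
    return Decimal(cell.replace("$", "").replace(",", "").strip())


def total_matches(cells, stated_total):
    """True when the line items add up to the total printed in the table."""
    return sum(parse_amount(cell) for cell in cells) == parse_amount(stated_total)


original = ["$1,000.00", "$250.00", "$75.50"]
misread = ["$7,000.00", "$250.00", "$75.50"]  # 1 recognized as 7
print(total_matches(original, "$1,325.50"))  # True
print(total_matches(misread, "$1,325.50"))   # False
```

A cell containing a letter in place of a digit makes `Decimal` raise an error, which also points to a recognition problem. The check cannot tell which cell is wrong, and it says nothing about tables without totals, so those still need cell-by-cell verification.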

Historical Documents

Older typefaces may not be recognized well
Faded or damaged text needs manual review
Consider professional transcription services
May need custom OCR training for unusual fonts

Handwritten Content

Standard OCR cannot read handwriting reliably
Consider Intelligent Character Recognition (ICR) tools
Often requires manual transcription
Provide text alternative for handwritten annotations

Additional Resources

OCR Technology

Adobe Acrobat: Scan and OCR
ABBYY FineReader - Professional OCR software
Tesseract OCR - Open-source OCR engine
Google Cloud Vision OCR - Cloud-based OCR

Tools

PAC (PDF Accessibility Checker) - Free PDF/UA validation
NVDA Screen Reader - Free screen reader for testing
veraPDF - Open-source PDF validator

This documentation is based on the Matterhorn Protocol 1.02, the definitive reference for PDF/UA validation. OCR validation requires human testing because automated tools cannot verify text accuracy or determine if OCR coverage is complete. For the most current information, consult the PDF Association and W3C WCAG guidelines.