Checkpoint 08 | Medium Priority | 2 failure conditions

Checkpoint 08: OCR Validation

OCR-generated text must be accurate and properly tagged to ensure scanned documents are accessible.

Related WCAG: 1.1.1, 1.3.1

When PDFs are created from scanned documents, Optical Character Recognition (OCR) converts images of text into actual text characters. This OCR-generated text must be accurate and properly tagged to be accessible.

What This Means

Many PDFs originate as scanned paper documents. A simple scan creates a PDF containing only images; the text visible in those images is not real, selectable text. OCR technology analyzes these images and attempts to recognize the letters, words, and paragraphs.

For a scanned PDF to be accessible:

  1. OCR must be applied: Converting image-of-text into actual text
  2. OCR must be accurate: The recognized text must correctly match the original
  3. OCR text must be tagged: The recognized text needs proper structure tags

Without proper OCR:

  • Screen readers see only images, not text
  • Users cannot search the document
  • Text cannot be selected or copied
  • The document is completely inaccessible to many users

Even with OCR, if the recognition is inaccurate, users receive wrong information. A financial report with OCR errors could show "$1,000" as "$7,000" or transpose digits in account numbers.

Why It Matters

Scanned documents are common in business, government, and education:

  • Legacy documents digitized from paper archives
  • Signed contracts and legal documents
  • Historical records and archives
  • Forms filled out by hand
  • Documents received via fax or physical mail

For organizations digitizing paper archives, proper OCR is essential:

  • Legal compliance: Many accessibility laws require text alternatives for images of text
  • Information accuracy: OCR errors can change meaning and cause serious problems
  • Searchability: Users need to find information within large document collections
  • Content reuse: Text must be selectable for copying, quoting, or translation

A scanned PDF without OCR is like a photograph of a book page; you can see it, but a screen reader cannot read it. OCR transforms that photograph into readable text.

Common Violations

The Matterhorn Protocol defines two failure conditions for OCR content. Both require human testing because automated tools cannot verify text accuracy or check that all recognized text is tagged.

08-001: OCR-Generated Text Contains Significant Errors (Human Testing)

What's Wrong: The OCR process has produced text that does not accurately represent the original document. Errors may include:

  • Misrecognized characters (0 vs O, 1 vs l, m vs rn)
  • Missing words or paragraphs
  • Jumbled word order
  • Incorrect punctuation
  • Wrong numbers or dates
  • Garbled special characters

How to Identify:

  • Select and copy text from the PDF, paste into a text editor
  • Compare the extracted text against the visual document
  • Look for obvious errors, especially in:
    • Numbers and dates
    • Names and proper nouns
    • Technical terms
    • Tables and columns
    • Text near images or in margins
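
The comparison step can be partly mechanized when a trusted transcription of the page exists. The following is a minimal sketch, assuming the pasted OCR text and a manually verified reference are both available as strings; the function name `word_differences` and the sample strings are illustrative, not part of any tool mentioned here. It uses Python's standard `difflib` module to list the word runs where the two texts diverge.

```python
import difflib


def word_differences(reference: str, ocr_text: str) -> list[tuple[str, str]]:
    """Return (expected, recognized) pairs where the OCR text departs from the reference."""
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    differences = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # An empty string on one side means words were dropped or inserted.
            differences.append((" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2])))
    return differences


if __name__ == "__main__":
    reference = "Invoice total: $10,000 due to John Smith in 2024"
    ocr_text = "Invoice total: $70,000 due to Jolm Srnith in 2O24"
    for expected, recognized in word_differences(reference, ocr_text):
        print(f"expected {expected!r}, got {recognized!r}")
```

A diff like this only narrows down where to look; a person still has to check each reported pair against the scanned image.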

Impact Examples:

  • "$10,000" recognized as "$70,000"
  • "John Smith" recognized as "Jolm Srnith"
  • "2024" recognized as "2O24"
  • Phone numbers with transposed digits
  • Misspelled medical terms, which can be dangerous
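
Some of these errors leave a visible trace, such as a letter sitting inside a number ("2O24"). The heuristic below is a rough sketch of how such tokens might be flagged for review; the look-alike set and the name `suspicious_tokens` are assumptions chosen for illustration. It cannot catch plausible-looking errors such as "$70,000" for "$10,000", and it can raise false alarms on legitimate tokens, so it supplements human comparison rather than replacing it.

```python
import re

# Letters commonly confused with the digits 0 and 1 (illustrative, not exhaustive).
LOOKALIKE_LETTERS = "OoIl"

# A token built only from digits, look-alike letters, and numeric punctuation.
NUMERIC_SHAPE = re.compile(rf"^\$?[0-9{LOOKALIKE_LETTERS}.,/:-]+$")


def suspicious_tokens(text: str) -> list[str]:
    """Return number-shaped tokens that contain a look-alike letter."""
    flagged = []
    for token in text.split():
        core = token.strip(".,;:()\"'")  # drop surrounding punctuation
        has_digit = any(ch.isdigit() for ch in core)
        has_lookalike = any(ch in LOOKALIKE_LETTERS for ch in core)
        if has_digit and has_lookalike and NUMERIC_SHAPE.match(core):
            flagged.append(core)
    return flagged


if __name__ == "__main__":
    sample = "Paid $1O,000 on 2O24-01-15 for item B12 by John Smith."
    print(suspicious_tokens(sample))
```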

Quality Thresholds: While PDF/UA does not define a specific accuracy percentage, significant errors include:

  • Any error that changes meaning
  • Errors in critical data (names, numbers, dates)
  • Patterns of errors that reduce comprehension
  • Text so garbled it cannot be understood
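
Because errors in critical data count as significant regardless of overall accuracy, it can help to check numbers separately from the running text. The sketch below assumes a reviewer has typed the critical values from the original into a reference string; `numeric_mismatches` and the regular expression are illustrative choices. It compares the number-like strings in both texts as multisets, so the order in which values appear does not matter.

```python
import re
from collections import Counter

# Digit runs with common separators, e.g. 10,000 or 2024-03-01 or 555-0142.
NUMBER_PATTERN = re.compile(r"\d[\d,./:-]*\d|\d")


def numeric_mismatches(reference: str, ocr_text: str) -> tuple[list[str], list[str]]:
    """Return (missing_from_ocr, unexpected_in_ocr) number strings."""
    ref_counts = Counter(NUMBER_PATTERN.findall(reference))
    ocr_counts = Counter(NUMBER_PATTERN.findall(ocr_text))
    missing = sorted((ref_counts - ocr_counts).elements())
    unexpected = sorted((ocr_counts - ref_counts).elements())
    return missing, unexpected


if __name__ == "__main__":
    reference = "Total $10,000 due 2024-03-01, call 555-0142"
    ocr_text = "Total $70,000 due 2024-03-01, call 555-0124"
    missing, unexpected = numeric_mismatches(reference, ocr_text)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

Any non-empty result points to a value that should be checked against the scan; an empty result only shows that the numbers agree with the reference, not that the surrounding text is correct.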

08-002: OCR-Generated Text Is Not Tagged (Human Testing)

What's Wrong: OCR has been applied and text exists in the PDF, but this text is not included in the document's tag structure. The text may be selectable but is invisible to assistive technology navigating by tags.

How to Identify:

  • Open the Tags panel in Acrobat
  • Compare tagged content to visible content
  • Select text on the page and check if it appears in the tag tree
  • Run the accessibility checker and look for untagged content
  • Use a screen reader to verify that all text is announced

Common Causes:

  • OCR applied after tagging, without re-tagging
  • OCR tool that does not create tags
  • Partial OCR coverage (some pages processed, others not)
  • Tagged structure does not encompass OCR layer
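
A coarse first check for 08-002 is whether the file carries tagging markers at all. In the PDF format, the document catalog of a tagged PDF has a /StructTreeRoot entry and a /MarkInfo dictionary whose /Marked value is true. The sketch below assumes the catalog has already been read into a plain Python mapping with indirect references resolved; how to do that depends on the PDF library and is not shown. The names `tagging_markers` and `may_be_tagged` are illustrative.

```python
from collections.abc import Mapping
from typing import Any


def tagging_markers(catalog: Mapping[str, Any]) -> dict[str, bool]:
    """Report the two catalog entries a tagged PDF is expected to carry."""
    mark_info = catalog.get("/MarkInfo") or {}
    return {
        "has_struct_tree_root": "/StructTreeRoot" in catalog,
        "marked": mark_info.get("/Marked") is True,
    }


def may_be_tagged(catalog: Mapping[str, Any]) -> bool:
    """True only if both markers are present; this is necessary, not sufficient."""
    markers = tagging_markers(catalog)
    return markers["has_struct_tree_root"] and markers["marked"]


if __name__ == "__main__":
    # Hypothetical catalogs written out by hand for illustration.
    tagged = {"/StructTreeRoot": {}, "/MarkInfo": {"/Marked": True}}
    untagged = {"/Pages": {}}
    print(may_be_tagged(tagged), may_be_tagged(untagged))
```

Passing this check does not show that the OCR text is inside the tag tree, so the manual steps listed under How to Identify still apply.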

How to Fix in Adobe Acrobat

Adobe Acrobat Pro includes OCR capabilities and tools for correcting recognition errors.

Applying OCR to a Scanned Document

  1. Open the scanned PDF in Adobe Acrobat Pro
  2. Go to Tools > Scan & OCR
  3. Click Recognize Text > In This File
  4. Configure settings:
    • Language: Select the document's primary language
    • Output: Choose "Searchable Image" or "Searchable Image (Exact)"
    • Downsample: Set based on quality needs
  5. Click Recognize Text
  6. Wait for processing to complete

OCR Settings for Best Results

  1. In the Scan & OCR tool, click Settings
  2. Document Language: Match the document's language exactly
  3. PDF Output Style:
    • "Searchable Image": Places invisible text behind image (preserves appearance)
    • "Searchable Image (Exact)": Higher fidelity but larger file
    • "Editable Text and Images": Converts to editable content (may change appearance)
  4. Downsample To: Higher DPI = better accuracy, larger file

Reviewing and Correcting OCR Errors

  1. After OCR completes, go to Tools > Scan & OCR
  2. Click Recognize Text > Correct Recognized Text
  3. Acrobat highlights suspected errors
  4. For each highlighted word:
    • Compare the OCR text to the image
    • Click on the text to edit it
    • Type the correct text
    • Move to next suspect
  5. Review entire document, not just flagged items

Manual Text Correction

For documents with many errors:

  1. Go to Tools > Edit PDF
  2. Click on text areas to edit
  3. Correct errors directly
  4. Be aware this may affect the underlying image layer

Ensuring OCR Text Is Tagged

  1. After OCR, autotag the document:
    • Go to Tools > Accessibility
    • Select Autotag Document
  2. Or use the Reading Order tool:
    • Go to Tools > Accessibility > Reading Order
    • Draw regions around content
    • Assign appropriate tags (text, heading, table, etc.)
  3. Verify tags include OCR text:
    • Open Tags panel
    • Expand tags and verify content matches OCR text

Re-tagging After OCR

If OCR was applied to an already-tagged document:

  1. Remove existing tags:
    • In Tags panel, select root tag
    • Right-click > Delete Tag (keep content)
  2. Re-run autotagging
  3. Or manually rebuild tag structure with Reading Order tool

How to Improve OCR Quality

Source Document Quality

Better scans produce better OCR:

  1. Resolution: Scan at 300 DPI minimum; 600 DPI for fine print
  2. Contrast: Ensure good contrast between text and background
  3. Alignment: Keep pages straight; skewed text reduces accuracy
  4. Cleanliness: Remove dust, marks, and stains before scanning
  5. Lighting: Even lighting without shadows or hot spots
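
The resolution guideline can be checked for an existing scan with simple arithmetic: a PDF page is measured in points (72 per inch by default), so the effective resolution is the image's pixel size divided by the page size in inches. The sketch below assumes the scanned image fills the whole page; the function names are illustrative.

```python
POINTS_PER_INCH = 72  # default PDF user space: 1 point = 1/72 inch


def effective_dpi(pixel_width: int, pixel_height: int,
                  page_width_pt: float, page_height_pt: float) -> float:
    """Return the lower of the horizontal and vertical resolutions."""
    horizontal = pixel_width / (page_width_pt / POINTS_PER_INCH)
    vertical = pixel_height / (page_height_pt / POINTS_PER_INCH)
    return min(horizontal, vertical)


def meets_minimum(dpi: float, fine_print: bool = False) -> bool:
    """Apply the 300 DPI minimum, or 600 DPI for fine print."""
    return dpi >= (600 if fine_print else 300)


if __name__ == "__main__":
    # A 2550 x 3300 pixel image on a US Letter page (612 x 792 points).
    dpi = effective_dpi(2550, 3300, 612, 792)
    print(dpi, meets_minimum(dpi), meets_minimum(dpi, fine_print=True))
```

For the US Letter example this works out to 300 DPI, which meets the general minimum but not the fine-print recommendation.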

Pre-Processing Techniques

Before running OCR:

  1. Deskew: Straighten rotated pages
    • In Acrobat: Tools > Scan & OCR > Enhance > Deskew
  2. Despeckle: Remove noise and artifacts
    • In Acrobat: Tools > Scan & OCR > Enhance > Despeckle
  3. Sharpen: Improve text edge definition
  4. Adjust contrast: Make text darker, background lighter

Language and Font Considerations

  1. Correct language: Always select the right OCR language
  2. Multiple languages: Process multilingual documents carefully
  3. Unusual fonts: Decorative or handwritten fonts reduce accuracy
  4. Small text: Very small text may not be recognized well
  5. Damaged text: Faded or damaged text needs manual review

Testing Your Fix

Manual Text Verification

  1. Sample testing: For long documents, thoroughly check representative sections
  2. Critical content: Always verify:
    • Document title and headings
    • Names of people and organizations
    • Numbers: dates, amounts, phone numbers, IDs
    • Tables and data
    • Legal or technical terms
  3. Copy-paste test:
    • Select all text (Ctrl/Cmd + A)
    • Copy and paste into a text editor
    • Review for obvious errors and garbled text
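
For sample testing, one possible way to choose pages is a reproducible random sample that always includes the first and last page. This is an assumption about what "representative" could mean, not a rule from PDF/UA, and pages holding critical content should be reviewed regardless of what the sample picks. The name `pages_to_review` is illustrative.

```python
import random


def pages_to_review(page_count: int, sample_size: int, seed: int = 0) -> list[int]:
    """Return 1-based page numbers: first and last page plus a random sample."""
    if page_count <= 0:
        return []
    always = {1, page_count}
    remaining = [p for p in range(1, page_count + 1) if p not in always]
    extra = max(0, min(sample_size - len(always), len(remaining)))
    rng = random.Random(seed)  # fixed seed so the same pages can be re-checked later
    return sorted(always | set(rng.sample(remaining, extra)))


if __name__ == "__main__":
    print(pages_to_review(page_count=120, sample_size=10))
```

Recording the seed alongside the review notes lets a second reviewer regenerate the same page list.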

Comparison Workflow

  1. Print or display the original image
  2. Read the OCR text aloud or compare side-by-side
  3. Mark any discrepancies
  4. Correct errors in the PDF
  5. Re-verify corrected sections

Tag Verification

  1. Open View > Show/Hide > Navigation Panes > Tags
  2. Expand the tag tree
  3. Click on tags to highlight corresponding content
  4. Verify all visible text is tagged
  5. Check that tagged text matches OCR output

Screen Reader Testing

  1. Open the PDF with NVDA, JAWS, or VoiceOver
  2. Read through the document
  3. Listen for:
    • Garbled or nonsensical words (OCR errors)
    • Sections of silence where text is visible (untagged content)
    • Correct pronunciation of names and terms
  4. Navigate by headings and paragraphs to ensure structure is preserved

Automated Checking

Adobe Acrobat:

  1. Go to Tools > Accessibility > Accessibility Check
  2. Run full check
  3. Review the "Document" and "Page Content" categories, which include the image-only PDF and tagged content checks
  4. Note: The checker cannot assess OCR accuracy; it reports only structural issues

PAC (PDF Accessibility Checker):

  1. Run PDF/UA validation
  2. Check for untagged content warnings
  3. Review structure for complete coverage

Validation Checklist

  • OCR has been applied to all scanned pages
  • Text is selectable throughout the document
  • OCR text matches the original with acceptable accuracy
  • Critical data (names, numbers, dates) is verified correct
  • All OCR text is included in tag structure
  • Screen reader can access all text content
  • Document structure (headings, lists, tables) is preserved
  • No significant passages are missing or garbled

Working with Different Document Types

Simple Text Documents

  • Standard letters, memos, reports
  • Usually high OCR accuracy
  • Focus on numbers and proper nouns

Complex Layouts

  • Multi-column documents, newsletters, brochures
  • May require manual region definition
  • Check column reading order carefully

Tables and Forms

  • Tables often have OCR and tagging challenges
  • Verify cell contents and structure
  • Check that form labels match fields
  • Numbers in tables need careful verification

Historical Documents

  • Older typefaces may not be recognized well
  • Faded or damaged text needs manual review
  • Consider professional transcription services
  • May need custom OCR training for unusual fonts

Handwritten Content

  • Standard OCR cannot read handwriting reliably
  • Consider Intelligent Character Recognition (ICR) tools
  • Often requires manual transcription
  • Provide text alternative for handwritten annotations


This documentation is based on the Matterhorn Protocol 1.02, the definitive reference for PDF/UA validation. OCR validation requires human testing because automated tools cannot verify text accuracy or determine if OCR coverage is complete. For the most current information, consult the PDF Association and W3C WCAG guidelines.
