How to Detect Whether a PDF has been OCR'd

ccstone · December 6, 2018, 8:35pm

Hey Folks,

Is there a quick (speedy) and easy way to detect whether or not a PDF has been OCR’d?

The only way I know of presently is to try to extract the text from it, and this can be a trifle slow with large files.

The only way I know of to mitigate that is to extract a single page from the PDF and then extract the text from that page.

I have some command line PDF tools that can test for things like the fonts in a PDF, so I’m wondering if AppleScriptObjC has access to this sort of thing.

TIA.

-Chris

ccstone · December 6, 2018, 9:05pm

Here are links to the executables and a small example.

http://www.xpdfreader.com/download.html

I’ve used Xpdf tools since about 2012 – mostly pdftotext which has a -layout switch that attempts to preserve the document layout unlike other text extraction methods.

I really like this critter, but it can have problems with accented characters – and I’d like to find a solution that didn’t have that problem.

I realize there’s a means to extract text using AppleScriptObjC, but it does nothing to preserve the layout.

Sometimes that’s not a problem, but at other times the layout is key to being able to successfully parse the text.

-Chris

suzume · December 6, 2018, 11:30pm

Not really helping, but PDFs that haven’t been OCR’d are not the only PDFs that don’t let you extract text. As Illustrator specialists will tell you here, as soon as you convert a document’s text to outlines, the resulting PDF won’t let you extract text, because the text has just become an image.

JMichaelTX · December 7, 2018, 12:36am

So, Chris provided a link to the CLI tool that will do the job.

But I’m still interested in knowing whether ASObjC can detect quickly and efficiently whether a PDF has been OCR’d, other than by getting the text of the PDF.

So can ASObjC get the font list, or some other property, to do this job?

Why? I’d rather not have to use a 3rd party tool that end-users will have to download and install properly.

ccstone · December 7, 2018, 12:44am

Hey JM,

Here’s the clunky method. It extracts page one of the PDF to a temp file and then attempts to extract text from it.

On my system it takes 1 minute 15 seconds to scan 502 PDF files with an aggregate file size of 962.3 MB.

That’s not super fast, but it’s not horribly slow either.

I haven’t compared this to the command line utilities yet, but I probably will.

-Chris

----------------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2018/12/06 17:16
# dMod: 2018/12/06 18:39
# Appl: AppleScriptObjC, Finder
# Task: Determine if Selected PDF Files have been OCR'd.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Finder, @Determine, @Selected, @PDF, @Files, @OCR'd
----------------------------------------------------------------
use AppleScript version "2.3.1"
use scripting additions
use framework "Foundation"
use framework "Quartz" -- for PDF stuff
----------------------------------------------------------------

set tempFilePath to (POSIX path of (path to temporary items folder from user domain)) & "temp.pdf"

# Selected files in the Finder are the target.
tell application "Finder"
   set finderSelectionList to selection as alias list
   if length of finderSelectionList = 0 then error "No files were selected in the Finder!"
end tell

# Transform alias list to POSIX Paths.
repeat with i in finderSelectionList
   set contents of i to POSIX path of (contents of i)
end repeat

set nonOcrList to {}

repeat with pdfFilePath in finderSelectionList
   # Extract page 1 of the selected PDF file to a temp file.
   set pdfTempFilePath to (my extractPages:1 thruTo:1 ofPDFDocAt:pdfFilePath usingTempFile:tempFilePath)
   set pdfText to (its pdf2Text:pdfTempFilePath)
   if pdfText = "" then set end of nonOcrList to contents of pdfFilePath
end repeat

nonOcrList

----------------------------------------------------------------
--» HANDLERS
----------------------------------------------------------------
on extractPages:firstPage thruTo:lastPage ofPDFDocAt:posixPath usingTempFile:tempFilePath
   --  make URL of the first PDF
   set inNSURL to current application's class "NSURL"'s fileURLWithPath:posixPath
   -- make PDF document from the URL
   set theDoc to current application's PDFDocument's alloc()'s initWithURL:inNSURL
   -- count the pages
   set pageCount to theDoc's pageCount()
   -- delete pages at end
   if lastPage < pageCount then
      repeat with i from pageCount to (lastPage + 1) by -1
         (theDoc's removePageAtIndex:(i - 1)) -- zero-based indexes
      end repeat
   end if
   -- delete pages at start
   if firstPage > 1 then
      repeat with i from (firstPage - 1) to 1 by -1
         (theDoc's removePageAtIndex:(i - 1)) -- zero-based indexes
      end repeat
   end if
   
   # Write to temporary file.
   theDoc's writeToFile:tempFilePath
   
   return tempFilePath
   
end extractPages:thruTo:ofPDFDocAt:usingTempFile:
----------------------------------------------------------------
on pdf2Text:thePath
   set theText to current application's NSMutableString's |string|()
   set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
   set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
   set theCount to theDoc's pageCount() as integer
   
   repeat with i from 1 to theCount
      set thePage to (theDoc's pageAtIndex:(i - 1))
      (theText's appendString:(thePage's |string|()))
   end repeat
   
   return theText as text
   
end pdf2Text:
----------------------------------------------------------------

ShaneStanley · December 7, 2018, 1:09am

Why not just check the first page in situ? Something like:

if (theDoc's pageAtIndex:0)'s |string|()'s |length|() < 1 then
...

ShaneStanley · December 7, 2018, 3:59am

Not directly. You could do something like get the contents of pages as NSAttributedStrings using attributedString rather than string, and then check the fonts used.
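
A minimal sketch of the attributedString idea, for readers who want to try it. The handler name fontNamesOnFirstPage: is hypothetical, and the sketch assumes that an image-only page yields no font attributes while an OCR'd page yields at least one; that assumption has not been tested against the file sets discussed in this thread.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use framework "AppKit" -- for NSFontAttributeName
use framework "Quartz" -- for PDFDocument
use scripting additions

-- Hypothetical helper: return the distinct font names found in the
-- attributed string of page 1, or {} if the PDF cannot be read.
on fontNamesOnFirstPage:posixPath
	set theURL to current application's NSURL's fileURLWithPath:posixPath
	set theDoc to current application's PDFDocument's alloc()'s initWithURL:theURL
	if theDoc is missing value then return {}
	if (theDoc's pageCount()) < 1 then return {}
	set attrString to (theDoc's pageAtIndex:0)'s attributedString()
	if attrString is missing value then return {}
	set fontNames to current application's NSMutableSet's |set|()
	set theLength to attrString's |length|()
	set i to 0
	-- Walk the string one attribute run at a time.
	repeat while i < theLength
		set {theFont, theRange} to (attrString's attribute:(current application's NSFontAttributeName) atIndex:i effectiveRange:(reference))
		if theFont is not missing value then (fontNames's addObject:(theFont's fontName()))
		set i to (theRange's location) + (theRange's |length|)
	end repeat
	return (fontNames's allObjects()) as list
end fontNamesOnFirstPage:
```

An empty result would then be treated as "probably not OCR'd". Building the attributed string is extra work compared with reading the plain string, so this is likely to be slower.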

ccstone · December 7, 2018, 7:06am

Hey Shane,

Because I didn’t know how.

But that’s certainly more efficient than what I was doing.

What if the given page has no text? You can still get at least one character (\uFFFC), so length of string can be problematic.

Probably if the string is longer than 3-4 characters it’s safe, but I haven’t tested nearly enough to know for certain.

I suppose the only bombproof way would be to page through the document testing for a string of word characters using regex and exit if you find same.

Is there a better test available than regex? One that can test for a string in any language?

-Chris
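
One possible answer to the question above, as a sketch rather than a tested solution: the regular expression engine available through NSString understands Unicode properties, so \p{L} matches a letter in any script. The handler name textHasWords and the choice of minimum run length are assumptions made for illustration.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: true if theText contains a run of at least
-- minLetters consecutive Unicode letters (any language).
on textHasWords(theText, minLetters)
	set theString to current application's NSString's stringWithString:theText
	set thePattern to "\\p{L}{" & minLetters & ",}"
	set theRange to theString's rangeOfString:thePattern options:(current application's NSRegularExpressionSearch)
	-- A failed search returns a range whose length is 0.
	return (|length| of theRange) > 0
end textHasWords
```

A page string made up only of \uFFFC and line breaks contains no letters, so it fails this test regardless of its length, while a short run of real text passes.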

ccstone · December 7, 2018, 8:24am

Hey Folks,

I’m not entirely satisfied with the text-length test in the appended script, but when run it finds all the non-OCR’d PDFs in my 500-file test set in only 25 seconds (about a third of the time taken by the first script).

My first script had a number of false negatives due to \uFFFC characters showing up in non-OCR’d files.

I’d still like a better test than text-length, but at least this script is much less clumsy than the first one.

Thanks Shane!

-Chris

----------------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2018/12/06 17:16
# dMod: 2018/12/07 02:00
# Appl: AppleScriptObjC, Finder
# Task: Determine if Selected PDF Files have been OCR'd.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Finder, @Determine, @Selected, @PDF, @Files, @OCR'd
# Vers: 1.01
----------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
----------------------------------------------------------------

tell application "Finder"
    set finderSelectionList to selection as alias list
    if length of finderSelectionList = 0 then error "No files were selected in the Finder!"
end tell
repeat with theFile in finderSelectionList
    set (contents of theFile) to POSIX path of (contents of theFile)
end repeat

set nonOcrList to {}

repeat with thePath in finderSelectionList
    if (its pdfHasBeenOcrd:thePath) = false then
        set end of nonOcrList to contents of thePath
    end if
end repeat

return nonOcrList

----------------------------------------------------------------
--» HANDLERS
----------------------------------------------------------------
on pdfHasBeenOcrd:thePath
    set theText to current application's NSMutableString's |string|()
    set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
    set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
    set theCount to theDoc's pageCount() as integer
    
    set OcrFlag to false
    
    repeat with i from 1 to theCount
        set thePage to (theDoc's pageAtIndex:(i - 1))'s |string|() as text
        
        # Test for Text Content (I'm not satisfied with this yet). •••••
        if (length of thePage) > 20 then
            set OcrFlag to true
            exit repeat
        end if
        
    end repeat
    
    return OcrFlag
    
end pdfHasBeenOcrd:
----------------------------------------------------------------

ShaneStanley · December 7, 2018, 10:41am

Shouldn’t that be < rather than >?

ccstone · December 7, 2018, 10:45am

Nyet.

I’m setting the threshold to greater than 20 characters.

For now.

I expect I can set it to less, but this is working so far.

-Chris

ShaneStanley · December 7, 2018, 11:06am

But you’re setting OcrFlag to true in that case — shouldn’t it be the other way around?

ccstone · December 7, 2018, 11:13am

No.

The PDF is OCR’d if there are more than 20 characters.

-Chris

ionah · December 7, 2018, 1:27pm

In my case, I had to determine whether some scanned documents had been OCR’d.
I found that every page that contains only the scanned image has a string length of 2.
So for such a document the total length should be double the page count, minus 1 (because of zero-based indexing in Objective-C).

Here is my script:

use AppleScript version "2.4"
use framework "Foundation"
use framework "AppKit"
use framework "Quartz"
use scripting additions

tell application "Finder" to set finderSel to selection as alias list
if finderSel = {} then return beep

set withoutText to {}
repeat with theFile in finderSel
	set theURL to (current application's NSURL's fileURLWithPath:(POSIX path of theFile))
	if (theURL's pathExtension()'s isEqualToString:"pdf") then
		set theDoc to (current application's PDFDocument's alloc()'s initWithURL:theURL)
		set textLength to theDoc's |string|()'s |length|()
		set pageCount to theDoc's pageCount()
		if (pageCount * 2) - 1 = textLength then set end of withoutText to contents of theURL as text
	end if
end repeat

return withoutText
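
The length rule used in the script above can be isolated as a small pure-AppleScript check. This is only a restatement of the rule as ionah describes it; the handler name looksImageOnly is hypothetical, and the rule assumes every page holds exactly one scanned image and nothing else.

```applescript
-- Hypothetical helper: true if a document's total string length matches
-- what image-only pages would produce (2 per page, minus 1 overall).
on looksImageOnly(pageCount, textLength)
	return textLength = (pageCount * 2) - 1
end looksImageOnly
```

For example, a 3-page scan with no text layer is expected to report a length of 5, and a 1-page scan a length of 1; any other length means the rule does not classify the file as image-only.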

JMichaelTX · December 7, 2018, 11:13pm

@ionah, thanks for sharing.

Unfortunately, your script takes 10X as long as Chris’ (@ccstone) last script.
Script Geek Results: 2.374 vs 0.239 min (avg of 5 runs) for 15 files

Well, that’s too bad. I’m thinking that would take longer than Chris’ method – would you agree?

ShaneStanley · December 7, 2018, 11:30pm

Considerably. It’s a trade-off of (potential) accuracy vs speed.

JMichaelTX · December 7, 2018, 11:48pm

Chris, thanks for a great script, but you’ve got a slow Mac.
That’s 0.05 sec/PDF

When I run your script, but also using Shane’s Metadata Lib to search a folder recursively that contains PDFs and other stuff, it took only 0.012 sec/PDF:

Total Time: 2.28 sec
Total PDFs: 184
= 0.012 sec/PDF

PDFs that Need OCR: 23

Here’s the script I used:

property ptyScriptName : "Detect Whether a PDF has been OCRd -- Chris"
property ptyScriptVer : "2.0"
property ptyScriptDate : "2018-12-07"
property ptyScriptAuthor : "Christopher Stone" -- mod by JMichaelTX

(*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PURPOSE:
  • Detect Whether a PDF has been OCRd -- Chris
  
RETURNS:  List of PDFs (Posix Path) that Need to be OCR'd

REQUIRED:
  1.  macOS 10.11.6+

TAGS:  @Lang.AS @Lang.ASObjC @CAT.PDF @type.Example @Auth.Chris

REF:  The following were used in some way in the writing of this script.

  1.  2018-12-07, Christopher Stone, Microsoft Outlook
      Re: How to Detect Whether a PDF has been OCR'd - AppleScript - Late Night Software Ltd.
      geode@thestoneforge.com

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*)
----------------------------------------------------------------
# Auth: Christopher Stone with Mod by JMichaelTX
# dCre: 2018/12/06 17:16
# dMod: 2018/12/07 02:00
# Appl: AppleScriptObjC, Finder
# Task: Determine if Selected PDF Files have been OCR'd.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Finder, @Determine, @Selected, @PDF, @Files, @OCR'd
# Vers: 1.01
----------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions

use script "Metadata Lib"
----------------------------------------------------------------

###set theFolder to choose folder with prompt "Choose Folder for Spotlight Search" ###  path to desktop

set theFolder to "/Users/Shared/Dropbox/Mac Only/Scan Inbox"

--- Search Folder & Sub-Folders to Find All PDF Files ---

set pdfFileList to perform search in folders {theFolder} predicate string "kMDItemContentType == %@" search arguments {"com.adobe.pdf"}

set numPDFs to count of pdfFileList

set nonOcrList to {}

repeat with thePath in pdfFileList
  set pdfOCRdBool to (its pdfHasBeenOcrd:thePath) ### Use Chris' handler
  ##  set pdfOCRdBool to my hasPdfBeenOCRd2(thePath) ### Use MY Shell Script handler
  
  if (pdfOCRdBool) = false then
    set end of nonOcrList to contents of thePath
  end if
end repeat

set numPDFsToOCR to count of nonOcrList

return nonOcrList


----------------------------------------------------------------
--» HANDLERS
----------------------------------------------------------------

on hasPdfBeenOCRd2(pPosixPath)
  
  set PDFPath to pPosixPath
  set cmdStr to "/usr/local/bin/pdffonts " & quoted form of PDFPath
  set fontListStr to do shell script cmdStr
  set numLines to count (paragraphs of fontListStr)
  
  if (numLines > 2) then
    set OCRdBool to true
  else
    set OCRdBool to false
  end if
  
  return OCRdBool
  
end hasPdfBeenOCRd2

on pdfHasBeenOcrd:thePath
  set theText to current application's NSMutableString's |string|()
  set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
  set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
  set theCount to theDoc's pageCount() as integer
  
  set OcrFlag to false
  
  repeat with i from 1 to theCount
    set thePage to (theDoc's pageAtIndex:(i - 1))'s |string|() as text
    
    # Test for Text Content (I'm not satisfied with this yet). •••••
    if (length of thePage) > 20 then
      set OcrFlag to true
      exit repeat
    end if
    
  end repeat
  
  return OcrFlag
  
end pdfHasBeenOcrd:
----------------------------------------------------------------
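
The hasPdfBeenOCRd2 handler above decides by counting lines of pdffonts output. A sketch of that counting step on its own follows, which makes it testable without the command-line tool installed. The handler name countFontRows is hypothetical, and it assumes, as the original handler does, that the first two lines of output are a header and a rule, with one line per font after that.

```applescript
-- Hypothetical helper: count the font rows in text returned by pdffonts,
-- skipping the two header lines and any blank lines.
on countFontRows(fontListStr)
	set theLines to paragraphs of fontListStr
	set rowCount to 0
	repeat with i from 3 to (count theLines)
		if (item i of theLines) is not "" then set rowCount to rowCount + 1
	end repeat
	return rowCount
end countFontRows
```

A result of 0 corresponds to the "needs OCR" case in the script above. Ignoring blank lines guards against a trailing empty line being counted as a font.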

ionah · December 8, 2018, 11:54am

@JMichaelTX, I think you’re misunderstanding my point.

I tried to bring a solution to @ccstone’s concern:

With my script, if your PDFs are scanned docs with a single image area per page, you can be sure of the result.
With this advantage, speed is not relevant…

ComplexPoint · December 8, 2018, 12:38pm

I find DEVONthink a helpful context for this kind of thing:

tell application "DEVONthink Pro"
    kind of item 1 of (selection as list)
    --> 'PDF+Text'  or just 'PDF'
end tell

estockly · December 9, 2018, 5:39pm

I once had to do something very similar. The source of the text in the PDF wasn’t OCR, but we had to look at PDFs and tell if it had been created from an empty template or if it had been created from a partially completed or fully completed document, the difference being the number and lengths of text blocks.

So this is the pure AppleScript quick-and-dirty temporary solution that we started using on Mac OS 8 and kept using until a few years ago; it never once gave us a bad result.

I actually didn’t have a copy of the script here, but it was pretty trivial to rewrite it and test it using SD.
You can drag and drop PDF files into an SD window and execute the open handler, or open a PDF file as text in TextWrangler and execute the run handler.

The first part of the result is the length of text runs. For our purposes, if there were fewer than 20 text blocks it was generated from a blank template; if there were hundreds of text blocks it was generated from a document in progress; if any of the text block lengths were over 200 it was an unfinished document.

--get text from TextWrangler (or replace with BBEdit or any text editor that can read a PDF file)

tell application "TextWrangler" to set myText to text of window 1
set PDFTextInfo to my GetPDFTexts(myText)
display dialog PDFTextInfo as text
--Or Drag&Drop pdf files
on open pdfFiles
   set allPDFInfo to {}
   repeat with thisFile in pdfFiles
      set pdfText to read thisFile
      
      set AppleScript's text item delimiters to {":"}
      set fileName to the last text item of (thisFile as text)
      set PDFTextInfo to my GetPDFTexts(pdfText)
      
      set PDFTextInfo to {fileName} & PDFTextInfo
      set the end of allPDFInfo to PDFTextInfo as text
   end repeat
   display dialog allPDFInfo as text
end open

on GetPDFTexts(myText)
   local allTextLengths, foundTypeBlocks, errorsFound, myText
   set myText to paragraphs of myText
   
   set AppleScript's text item delimiters to {return}
   
   set myText to myText as text
   set AppleScript's text item delimiters to {return & "<< /Type /Font /Subtype /"}
   set typeObjects to the rest of text items of myText
   set allTextLengths to {"Text Lengths: "}
   set foundTypeBlocks to {"", "Type Blocks:"}
   set errorsFound to {"", "Errors:"}
   repeat with thisTypeBlock in typeObjects
      set AppleScript's text item delimiters to {" >>" & return}
      set thisTypeBlock to text item 1 of thisTypeBlock
      try
         set AppleScript's text item delimiters to {"/FirstChar ", "/LastChar"}
         set charRangeItems to text items of thisTypeBlock
         set startChar to item 2 of charRangeItems
         set endChar to word 1 of item 3 of charRangeItems
         set textLength to (endChar as number) - (startChar as number)
         set the end of allTextLengths to textLength
         
      on error errMsg number errNum
         set errorString to ("Error Number: " & errNum as text) & return & return & "Error Message: " & return & return & "\"" & errMsg & "\""
         
         set the end of errorsFound to errorString & return & tab & thisTypeBlock
      end try
      
      set the end of foundTypeBlocks to thisTypeBlock
   end repeat
   set AppleScript's text item delimiters to {return}

   return {allTextLengths, foundTypeBlocks, errorsFound} as text
end GetPDFTexts
--made a small edit, moving the "try" up a few lines