How to Detect Whether a PDF has been OCR'd

Thanks Ed,

I’ll have to play with this.

I started out with a similar idea in mind with my first attempt, but I found some PDFs on my system that didn’t have any font info even though they contained text.

I’ll have to see if I can find them and run your script against them.

In the meantime I’ve altered my script to use a regex test instead of a character count. It’s only a fraction slower and gives me a lot more control of what’s tested for.

-Chris

AppleScript Code
----------------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2018/12/06 17:16
# dMod: 2018/12/15 04:24
# Appl: AppleScriptObjC, Finder
# Task: Determine if Selected PDF Files have been OCR'd.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Finder, @Determine, @Selected, @PDF, @Files, @OCR'd, @RegEx
# Vers: 1.03
----------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
----------------------------------------------------------------

tell application "Finder"
   set finderSelectionList to selection as alias list
   if length of finderSelectionList = 0 then error "No files were selected in the Finder!"
end tell

repeat with theFile in finderSelectionList
   set (contents of theFile) to POSIX path of (contents of theFile)
end repeat

set nonOcrList to {}

repeat with thePath in finderSelectionList
   if (its pdfHasBeenOcrd:thePath) = false then
      set end of nonOcrList to contents of thePath
   end if
end repeat

return nonOcrList

----------------------------------------------------------------
--» HANDLERS
----------------------------------------------------------------
on pdfHasBeenOcrd:thePath
   set theText to current application's NSMutableString's |string|()
   set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
   set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
   set theCount to theDoc's pageCount() as integer
   
   set OcrFlag to false
   
   repeat with i from 1 to theCount
      set thePageText to (theDoc's pageAtIndex:(i - 1))'s |string|() as text
      
      set foundText to (my reMatch:"(?m-s)\\w{4,}.*" inText:thePageText)
      
      if (length of foundText) > 0 then
         set OcrFlag to true
         exit repeat
      end if
      
   end repeat
   
   return OcrFlag
   
end pdfHasBeenOcrd:
----------------------------------------------------------------
on reMatch:findPattern inText:theText
   set theNSString to current application's NSString's stringWithString:theText
   set theOptions to ((current application's NSRegularExpressionDotMatchesLineSeparators) as integer) + ((current application's NSRegularExpressionAnchorsMatchLines) as integer)
   set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:findPattern options:theOptions |error|:(missing value)
   set theFinds to theRegEx's matchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()}
   set theFinds to theFinds as list -- so we can loop through
   set theResult to {} -- we will add to this
   
   repeat with i from 1 to count of items of theFinds
      set theRange to (item i of theFinds)'s range()
      set end of theResult to (theNSString's substringWithRange:theRange) as string
   end repeat
   
   return theResult
   
end reMatch:inText:
----------------------------------------------------------------
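
If only a yes/no answer is needed per page, the regex test can probably be reduced to a single call. The handler below is a minimal sketch, not part of the script above: `textMatches:inText:` is a hypothetical name, and it assumes that `rangeOfString:options:` with `NSRegularExpressionSearch` is an acceptable substitute for collecting every match as `reMatch:inText:` does.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: returns true if findPattern matches anywhere in theText.
-- It stops at the first match instead of building a list of all matches.
on textMatches:findPattern inText:theText
	set theNSString to current application's NSString's stringWithString:theText
	-- The NSRange result is bridged to an AppleScript record {location, length};
	-- a length of 0 means the pattern was not found.
	set theRange to theNSString's rangeOfString:findPattern options:(current application's NSRegularExpressionSearch)
	return (|length| of theRange) > 0
end textMatches:inText:
```

Inside the page loop, `if (my textMatches:"\\w{4,}" inText:thePageText) then` would then replace the `reMatch` call and the length check.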

I know this thread is seeking an ASOC solution for this problem, but I am a pragmatist. If AS can do it with less fuss and ‘just enough’ speed, that’s what I’m doing! This is from a Tips & Tricks blog page for HoudahSpot. That utility uses the Spotlight index for searches, so what they are doing here is certainly something you could mimic in plain vanilla AppleScript using a shell call to search Spotlight. Since it is Spotlight, it is likely to be reasonably fast, but I didn’t test this.

HoudahSpot find PDFs needing OCR

Actually, the only requirement is to find a relatively fast way to determine if a PDF has been OCR’d. I don’t really care what tool is used.

Thanks for this suggestion, and the key point here is that the Spotlight index is being used, rather than getting all of the text from a PDF, then searching it.

So using AppleScript to use the Spotlight index on a given folder would be perfect.

Anyone know how to do that?

Use shell. Command Spotlight - MacWorld
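
As a rough sketch of the shell route, `mdfind` can be called through `do shell script`. The folder path below is a made-up placeholder, and only the content-type clause is shown because that is the part that is safe to rely on. Any `kMDItemTextContent` clause added to the query is something to verify against your own files.

```applescript
use AppleScript version "2.4"
use scripting additions

-- Hypothetical folder; replace with a real path.
set folderPath to "/Users/me/Documents/Scans"

-- Spotlight query in mdfind's own syntax (clauses are joined with && and ||).
set theQuery to "kMDItemContentType == 'com.adobe.pdf'"

-- -onlyin restricts the search to folderPath, but it still descends into subfolders.
set theCommand to "mdfind -onlyin " & quoted form of folderPath & " " & quoted form of theQuery

-- One POSIX path per line; an empty result gives an empty list.
set pdfPathList to paragraphs of (do shell script theCommand)
return pdfPathList
```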

OK, thanks to an old script that @ShaneStanley posted, I was able to cobble together this script. Thanks, Shane. It seems to work, and finds 33 PDFs out of 300 that need to be OCR’d.

It runs in about 1.5 sec.

If anyone can improve on it, or see any issues with it, please reply.

One Question: How can I limit the search to only the specified folder (NOT recursive)?

UPDATED: 2019-01-15 17:45 GMT-6
Fixed the script issue that @ShaneStanley pointed out, so that it now searches for both “.” and “ ”.

property ptyScriptName : "Find PDF Files that Need OCR Using Spotlight Index"
property ptyScriptVer : "1.1"
property ptyScriptDate : "2019-01-15"
property ptyScriptAuthor : "JMichaelTX" -- based on script by Shane Stanley

(*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PURPOSE/METHOD:
  • Instead of extracting all text from a PDF to search, use existing Spotlight index
      to find all PDF files that do NOT contain either a space or a period.


REQUIRED:
  1.  macOS 10.11.6+

REF:  The following were used in some way in the writing of this script.
  1.  
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*)
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

--- GET SELECTED FOLDER, or FINDER WINDOW TARGET ---

tell application "Finder" to set fItemList to (selection as alias list)

if ((count of fItemList) = 0 or (kind of (info for (item 1 of fItemList)) ≠ "folder")) then
  tell application "Finder" to set folderPath to POSIX path of (target of front window as text)
else
  set folderPath to POSIX path of (item 1 of fItemList)
end if

--- SEARCH FOR PDF Files That Do NOT Contain Either (" " OR ".") ---

set filePathList to my searchPath:folderPath searchPredicate:"(kMDItemContentType == 'com.adobe.pdf' AND NOT (kMDItemTextContent CONTAINS '.' AND kMDItemTextContent CONTAINS ' '))" predicateArgs:{"find"}


on searchPath:thePath searchPredicate:predString predicateArgs:argList
  (*
    PURPOSE:  Using Compound Spotlight Search Terms
    DATE:      2017-03-14
    AUTHOR:   Shane Stanley
  *)
  set thePred to current application's NSPredicate's predicateWithFormat:predString argumentArray:argList
  set targetURL to current application's |NSURL|'s fileURLWithPath:thePath
  set theQuery to current application's NSMetadataQuery's new()
  theQuery's setPredicate:thePred
  theQuery's setSearchScopes:{targetURL}
  theQuery's startQuery()
  repeat while theQuery's isGathering() as boolean
    delay 0.01
  end repeat
  theQuery's stopQuery()
  set theCount to theQuery's resultCount()
  set theResults to current application's NSMutableArray's array()
  repeat with i from 1 to theCount
    set aResult to (theQuery's resultAtIndex:(i - 1))
    set thePath to (aResult's valueForAttribute:(current application's NSMetadataItemPathKey))
    (theResults's addObject:thePath)
  end repeat
  return (theResults's sortedArrayUsingSelector:"compare:") as list
end searchPath:searchPredicate:predicateArgs:


You could always use my Metadata Lib. You can look at the code, and see that it actually gets them all then filters the results.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use libVariable : script "Metadata Lib" version "2.0.1"
perform search just in "/Users/shane/Documents/Whatever" predicate string "kMDItemContentType == %@ AND NOT kMDItemTextContent LIKE %@ AND NOT kMDItemTextContent CONTAINS %@" search arguments {"com.adobe.pdf", " ?", "."}

You might also be able to use kMDItemPath in the predicate to filter out unwanted items.

That’s too bad if it means that the search engine is going to search all sub-folders regardless – wasted time if all we want is the parent folder.

Doesn’t look hopeful, given this from
https://developer.apple.com/library/archive/documentation/CoreServices/Reference/MetadataAttributesRef/Reference/CommonAttrs.html

kMDItemPath
Complete path to the file. This value of this attribute can be retrieved, but can’t be used in a query or to sort search results. This attribute can’t be used as a member of the valueListAttrs array parameter for MDQueryCreate or MDQueryCreateSubset.

The Spotlight database is optimized to work that way — the potential time-waster is discarding unwanted results, because we have to use a repeat loop and slower AppleScript code.
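
One way to avoid the AppleScript repeat loop when discarding subfolder results is to let Foundation do the filtering. This is a minimal sketch under assumptions: `filterPaths:toFolder:` is a hypothetical handler name, and it expects the list of POSIX paths that the search handler returns. Spotlight reports resolved paths, so a folder reached through a symlink may not compare equal.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: keep only the paths that sit directly in folderPath.
on filterPaths:pathList toFolder:folderPath
	-- NSURL's path drops any trailing slash, so "/a/b/" becomes "/a/b".
	set parentPath to (current application's |NSURL|'s fileURLWithPath:folderPath)'s |path|()
	-- Compare each path's parent folder with the target folder.
	set thePred to current application's NSPredicate's predicateWithFormat:"stringByDeletingLastPathComponent == %@" argumentArray:{parentPath}
	set allPaths to current application's NSArray's arrayWithArray:pathList
	return (allPaths's filteredArrayUsingPredicate:thePred) as list
end filterPaths:toFolder:
```

For example, `my filterPaths:{"/a/b/c.pdf", "/a/b/d/e.pdf"} toFolder:"/a/b/"` should keep only `"/a/b/c.pdf"`.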

Indeed.

Incidentally, the code you posted above has CONTAINS '.' twice. I think you mean one of them to contain a space, although I found that problematic with CONTAINS (hence my use of LIKE and a wildcard above).


Referring to File Metadata Query Expression Syntax, it seems that CONTAINS is not compliant or, at least, deprecated. Am I wrong?

I have some questions regarding the very interesting script provided by @JMichaelTX:
Is the argumentArray parameter required here? If I understand what I read correctly, Spotlight queries only need the predicate string.
How can I return the file URL instead of the path? I tried NSMetadataItemURLKey after setting valueListAttributes, but it always returns missing value.

Here is what I’ve done, so far:

set needOCR to my searchPath:{"/Some Folder/Path"} searchPredicate:"(kMDItemContentType == 'com.adobe.pdf' AND NOT kMDItemTextContent == ' ' AND NOT kMDItemTextContent == '.' AND NOT kMDItemTextContent == ',')" --predicateArgs:{""}

on searchPath:theTargets searchPredicate:predString --predicateArgs:argList
	set theQuery to current application's NSMetadataQuery's new()
	theQuery's setPredicate:(current application's NSPredicate's predicateWithFormat:predString) --argumentArray:argList)
	theQuery's setSearchScopes:theTargets
	theQuery's startQuery()
	repeat while theQuery's isGathering() as boolean
		delay 0.01
	end repeat
	theQuery's stopQuery()
	set theCount to theQuery's resultCount()
	set theResults to current application's NSMutableArray's array()
	repeat with iQuery from 0 to (theCount - 1)
		set thePath to ((theQuery's resultAtIndex:iQuery)'s valueForAttribute:(current application's NSMetadataItemPathKey))
		set theURL to (current application's NSURL's fileURLWithPath:thePath)
		(theResults's addObject:theURL)
	end repeat
	return theResults
end searchPath:searchPredicate:

No.

You have to use the path.

Thanks for catching that typo. Fixed in above script.

Thanks for pointing that out. I was just using an existing script.

Do your script changes that replace “CONTAINS” actually work as a “contains” test?
It looks like “equals” to me.

set needOCR to my searchPath:{"/Some Folder/Path"} ¬
  searchPredicate:("(kMDItemContentType == 'com.adobe.pdf' " & ¬
  "AND NOT kMDItemTextContent == ' ' " & ¬
  "AND NOT kMDItemTextContent == '.' " & ¬
  "AND NOT kMDItemTextContent == ',')") --predicateArgs:{""}

From your reference, looks like we would need to put the text between asterisks:

kMDItemTextContent == "*paris*"    -- Matches attributes that contain "paris" anywhere within the value. For example, matches “paris” and “comparison”.

@ShaneStanley, I don’t see how to use “LIKE”.

Yep. Contains is implicit in Metadata Query. The keyword here is attribute.


Comparison Syntax

The file metadata query expression syntax is a simplified form of filename globbing familiar to shell users. Queries have the following format:

attribute == value

where attribute is a standard metadata attribute (see “ File Metadata Attributes Reference ”) or a custom metadata attribute defined by an importer.

For example, to query Spotlight for all the files authored by “Steve” the query would look like the following:

kMDItemAuthors ==[c] "Steve"


We can change the last lines to:
For example, to query Spotlight for all the files that contain the name “Steve” [or steve] the query would look like the following:

kMDItemTextContent ==[c] "Steve"

That seems to contradict this section from the
File Metadata Query Expression Syntax

That clearly shows we need to put the target text between asterisks to find the target text anywhere within the Item Content.

The documentation is confusing (and sometimes contradictory). There are two APIs for doing Spotlight searches: the original C-based API, and the subsequent NSMetadataQuery Objective-C API. What the document you’re quoting from doesn’t make clear is that it’s talking about the C-based API. This page:

https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/Predicates/Articles/pSpotlightComparison.html

says in part:

Typically all you have to do to convert the Spotlight query into a predicate format string is make sure the predicate does not start with * (this is not supported by NSMetadataQuery when parsing a predicate). In addition, when you want to use a wildcard, you should use LIKE , as shown in the following example.

The page you quote says:

You cannot use an MDQuery operator as the VALUE of an NSPredicate object "KEY operator VALUE" string. For example, you write an “is-substring-of” expression in Spotlight like this: "myAttribute = '*foo*'" ; in NSPredicate strings you use the contains operator, like this: "myAttribute contains 'foo'" . Spotlight takes glob-like expressions, NSPredicate uses a different operator.

If you use “ * ” as left-hand-side key in a comparison expression, in Spotlight it means “any key in the item” and can only be used with == . You could only use this expression in an NSPredicate object in conjunction with an NSMetadataQuery object.

If I understand correctly, this means: CONTAINS and LIKE must not be used with NSMetadataQuery.
In the following, the left-hand NSMetadataQuery syntaxes are equivalent to the right-hand NSPredicate syntaxes.

	=="*substring*" is equivalent to CONTAINS "substring" OR LIKE "*substring*"
	=="substring*" is equivalent to BEGINSWITH "substring" OR LIKE "substring*"
	=="*substring" is equivalent to ENDSWITH "substring" OR LIKE "*substring"

Honestly, I suspect the best way to find out is to make a test file and see what works. But I wouldn’t be surprised if the behavior is a bit different where the string is a space or “.”.
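
A small harness along these lines could make that experiment quicker. It is only a sketch: the test folder is a made-up path, `countMatchesIn:forPredicate:` is a hypothetical handler, and the predicate variants are just the ones discussed in this thread, with no claim about which of them Spotlight actually honors.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical folder holding at least one OCR'd PDF and one image-only PDF.
set testFolder to "/Users/me/Desktop/OCR Test"

-- Predicate variants to compare; edit this list freely.
set candidateList to {¬
	"kMDItemContentType == 'com.adobe.pdf'", ¬
	"kMDItemContentType == 'com.adobe.pdf' AND kMDItemTextContent CONTAINS '.'", ¬
	"kMDItemContentType == 'com.adobe.pdf' AND kMDItemTextContent LIKE '*.*'", ¬
	"kMDItemContentType == 'com.adobe.pdf' AND kMDItemTextContent LIKE ' ?'"}

set reportLines to {}
repeat with aPredicate in candidateList
	set predString to contents of aPredicate
	try
		set hitCount to (my countMatchesIn:testFolder forPredicate:predString) as text
	on error errMsg
		-- A predicate that cannot be parsed raises an error; record it and move on.
		set hitCount to "error: " & errMsg
	end try
	set end of reportLines to hitCount & tab & predString
end repeat

-- Join the lines into one report: hit count, tab, predicate.
set savedDelims to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set theReport to reportLines as text
set AppleScript's text item delimiters to savedDelims
return theReport

-- Hypothetical helper: runs one Spotlight query and returns the number of hits.
on countMatchesIn:folderPath forPredicate:predString
	set targetURL to current application's |NSURL|'s fileURLWithPath:folderPath
	set theQuery to current application's NSMetadataQuery's new()
	theQuery's setPredicate:(current application's NSPredicate's predicateWithFormat:predString)
	theQuery's setSearchScopes:{targetURL}
	theQuery's startQuery()
	repeat while theQuery's isGathering() as boolean
		delay 0.01
	end repeat
	theQuery's stopQuery()
	return (theQuery's resultCount()) as integer
end countMatchesIn:forPredicate:
```

Comparing each count against the known contents of the test folder shows which variants behave as a “contains” test and which do not.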