I started out with a similar idea in mind with my first attempt, but I found some PDFs on my system that didn’t have any font info even though they contained text.
I’ll have to see if I can find them and run your script against them.
In the meantime I’ve altered my script to use a regex test instead of a character count. It’s only a fraction slower and gives me a lot more control of what’s tested for.
-Chris
AppleScript Code
----------------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2018/12/06 17:16
# dMod: 2018/12/15 04:24
# Appl: AppleScriptObjC, Finder
# Task: Determine if Selected PDF Files have been OCR'd.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Finder, @Determine, @Selected, @PDF, @Files, @OCR'd, @RegEx
# Vers: 1.03
----------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
----------------------------------------------------------------
tell application "Finder"
    set finderSelectionList to selection as alias list
    if length of finderSelectionList = 0 then error "No files were selected in the Finder!"
end tell
repeat with theFile in finderSelectionList
    set (contents of theFile) to POSIX path of (contents of theFile)
end repeat
set nonOcrList to {}
repeat with thePath in finderSelectionList
    if (its pdfHasBeenOcrd:thePath) = false then
        set end of nonOcrList to contents of thePath
    end if
end repeat
return nonOcrList
----------------------------------------------------------------
--» HANDLERS
----------------------------------------------------------------
on pdfHasBeenOcrd:thePath
    set theText to current application's NSMutableString's |string|()
    set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
    set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
    set theCount to theDoc's pageCount() as integer
    set OcrFlag to false
    repeat with i from 1 to theCount
        set thePageText to (theDoc's pageAtIndex:(i - 1))'s |string|() as text
        set foundText to (my reMatch:"(?m-s)\\w{4,}.*" inText:thePageText)
        if (length of foundText) > 0 then
            set OcrFlag to true
            exit repeat
        end if
    end repeat
    return OcrFlag
end pdfHasBeenOcrd:
----------------------------------------------------------------
on reMatch:findPattern inText:theText
    set theNSString to current application's NSString's stringWithString:theText
    set theOptions to ((current application's NSRegularExpressionDotMatchesLineSeparators) as integer) + ((current application's NSRegularExpressionAnchorsMatchLines) as integer)
    set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:findPattern options:theOptions |error|:(missing value)
    set theFinds to theRegEx's matchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()}
    set theFinds to theFinds as list -- so we can loop through
    set theResult to {} -- we will add to this
    repeat with i from 1 to count of items of theFinds
        set theRange to (item i of theFinds)'s range()
        set end of theResult to (theNSString's substringWithRange:theRange) as string
    end repeat
    return theResult
end reMatch:inText:
----------------------------------------------------------------
I know this thread is seeking an ASOC solution for this problem, but I am a pragmatist. If AS can do it with less fuss and ‘just enough’ speed, that’s what I’m doing! This is from a Tips & Tricks blog page for HoudahSpot. That utility uses the Spotlight index for searches, so what they are doing here is certainly something you could mimic in plain vanilla AppleScript using a shell call to search Spotlight. Since it is Spotlight, it is likely to be reasonably fast, but I didn’t test this.
Actually, the only requirement is to find a relatively fast way to determine if a PDF has been OCR’d. I don’t really care what tool is used.
Thanks for this suggestion, and the key point here is that the Spotlight index is being used, rather than getting all of the text from a PDF, then searching it.
So using AppleScript to use the Spotlight index on a given folder would be perfect.
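For what it’s worth, here is a minimal sketch of the shell-call route mentioned above, using the `mdfind` command-line tool. The folder is just a placeholder, and the query only lists PDFs. Any text-content condition would still have to be added and tested, since `mdfind` takes raw Spotlight query syntax rather than an NSPredicate string. Note that `-onlyin` searches subfolders too.

```applescript
-- Placeholder folder (an assumption); point this at the folder you want to search.
set folderPath to POSIX path of (path to documents folder)

-- Raw Spotlight query syntax: match PDF files only.
set theQuery to "kMDItemContentType == 'com.adobe.pdf'"

-- mdfind -onlyin limits the search to a folder, including its subfolders.
set theCommand to "mdfind -onlyin " & quoted form of folderPath & " " & quoted form of theQuery

-- One POSIX path per line of output; an empty result gives an empty list.
set pdfPathList to paragraphs of (do shell script theCommand)
return pdfPathList
```

This needs no frameworks at all, so it runs in any plain AppleScript context.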
OK, thanks to an old script that @ShaneStanley posted, I was able to cobble together this script. Thanks, Shane. It seems to work, and finds 33 PDFs out of 300 that need to be OCR’d.
It runs in about 1.5 sec.
If anyone can improve on it, or see any issues with it, please reply.
One Question: How can I limit the search to only the specified folder (NOT recursive)?
UPDATED: 2019-01-15 17:45 GMT-6
Fixed the script issue that @ShaneStanley pointed out, so that it is searching for both "." and " ".
property ptyScriptName : "Find PDF Files that Need OCR Using Spotlight Index"
property ptyScriptVer : "1.1"
property ptyScriptDate : "2019-01-15"
property ptyScriptAuthor : "JMichaelTX" -- based on script by Shane Stanley
(*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PURPOSE/METHOD:
• Instead of extracting all text from a PDF to search, use existing Spotlight index
to find all PDF files that do NOT contain either a space or a period.
REQUIRED:
1. macOS 10.11.6+
REF: The following were used in some way in the writing of this script.
1.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*)
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
--- GET SELECTED FOLDER, or FINDER WINDOW TARGET ---
tell application "Finder" to set fItemList to (selection as alias list)
if ((count of fItemList) = 0 or (kind of (info for (item 1 of fItemList)) ≠ "folder")) then
    tell application "Finder" to set folderPath to POSIX path of (target of front window as text)
else
    set folderPath to POSIX path of (item 1 of fItemList)
end if
--- SEARCH FOR PDF Files That Do NOT Contain Either (" " OR ".") ---
set filePathList to my searchPath:folderPath searchPredicate:"(kMDItemContentType == 'com.adobe.pdf' AND NOT (kMDItemTextContent CONTAINS '.' AND kMDItemTextContent CONTAINS ' '))" predicateArgs:{"find"}
on searchPath:thePath searchPredicate:predString predicateArgs:argList
    (*
    PURPOSE: Using Compound Spotlight Search Terms
    DATE: 2017-03-14
    AUTHOR: Shane Stanley
    *)
    set thePred to current application's NSPredicate's predicateWithFormat:predString argumentArray:argList
    set targetURL to current application's |NSURL|'s fileURLWithPath:thePath
    set theQuery to current application's NSMetadataQuery's new()
    theQuery's setPredicate:thePred
    theQuery's setSearchScopes:{targetURL}
    theQuery's startQuery()
    repeat while theQuery's isGathering() as boolean
        delay 0.01
    end repeat
    theQuery's stopQuery()
    set theCount to theQuery's resultCount()
    set theResults to current application's NSMutableArray's array()
    repeat with i from 1 to theCount
        set aResult to (theQuery's resultAtIndex:(i - 1))
        set thePath to (aResult's valueForAttribute:(current application's NSMetadataItemPathKey))
        (theResults's addObject:thePath)
    end repeat
    return (theResults's sortedArrayUsingSelector:"compare:") as list
end searchPath:searchPredicate:predicateArgs:
You could always use my Metadata Lib. You can look at the code, and see that it actually gets them all then filters the results.
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
use libVariable : script "Metadata Lib" version "2.0.1"
perform search just in "/Users/shane/Documents/Whatever" predicate string "kMDItemContentType == %@ AND NOT kMDItemTextContent LIKE %@ AND NOT kMDItemTextContent CONTAINS %@" search arguments {"com.adobe.pdf", " ?", "."}
You might also be able to use kMDItemPath in the predicate to filter out unwanted items.
kMDItemPath
Complete path to the file. This value of this attribute can be retrieved, but can’t be used in a query or to sort search results. This attribute can’t be used as a member of the valueListAttrs array parameter for MDQueryCreate or MDQueryCreateSubset.
The Spotlight database is optimized to work that way — the potential time-waster is discarding unwanted results, because we have to use a repeat loop and slower AppleScript code.
Indeed.
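As a rough sketch of that discarding step for the non-recursive case: the handler below keeps only the paths whose parent folder is exactly the search folder. The handler name and the sample paths are made up for illustration. It compares path strings only, so it assumes Spotlight returns paths in the same form as the folder path you pass in (no symlink differences).

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Made-up sample data; in practice this would be the list returned by the search.
set allPaths to {"/Users/me/PDFs/a.pdf", "/Users/me/PDFs/Sub/b.pdf"}
return my pathsDirectlyIn:"/Users/me/PDFs/" fromPaths:allPaths

-- Hypothetical helper: keep only paths that sit directly inside folderPath.
on pathsDirectlyIn:folderPath fromPaths:pathList
    -- Append and remove a dummy component to normalize away any trailing slash.
    set folderNSString to current application's NSString's stringWithString:folderPath
    set folderNSString to (folderNSString's stringByAppendingPathComponent:"x")'s stringByDeletingLastPathComponent()
    set keptPaths to {}
    repeat with aPath in pathList
        set parentPath to (current application's NSString's stringWithString:(contents of aPath))'s stringByDeletingLastPathComponent()
        if ((parentPath's isEqualToString:folderNSString) as boolean) then
            set end of keptPaths to contents of aPath
        end if
    end repeat
    return keptPaths
end pathsDirectlyIn:fromPaths:
```

With the sample data, only the first path has `/Users/me/PDFs` as its parent, so the second one is dropped.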
Incidentally, the code you posted above has CONTAINS '.' twice. I think you mean one of them to contain a space, although I found that problematic with CONTAINS (hence my use of LIKE and a wildcard above).
I have some questions regarding the very interesting script provided by @JMichaelTX:
Is the argumentArray parameter required here? If I understand what I read correctly, Spotlight queries only need the predicate string.
How can I return the file URL instead of the path? I tried NSMetadataItemURLKey after setting valueListAttributes, but it always returns missing value.
Here is what I’ve done, so far:
set needOCR to my searchPath:{"/Some Folder/Path"} searchPredicate:"(kMDItemContentType == 'com.adobe.pdf' AND NOT kMDItemTextContent == ' ' AND NOT kMDItemTextContent == '.' AND NOT kMDItemTextContent == ',')" --predicateArgs:{""}
on searchPath:theTargets searchPredicate:predString --predicateArgs:argList
    set theQuery to current application's NSMetadataQuery's new()
    theQuery's setPredicate:(current application's NSPredicate's predicateWithFormat:predString) --argumentArray:argList)
    theQuery's setSearchScopes:theTargets
    theQuery's startQuery()
    repeat while theQuery's isGathering() as boolean
        delay 0.01
    end repeat
    theQuery's stopQuery()
    set theCount to theQuery's resultCount()
    set theResults to current application's NSMutableArray's array()
    repeat with iQuery from 0 to (theCount - 1)
        set thePath to ((theQuery's resultAtIndex:iQuery)'s valueForAttribute:(current application's NSMetadataItemPathKey))
        set theURL to (current application's NSURL's fileURLWithPath:thePath)
        (theResults's addObject:theURL)
    end repeat
    return theResults
end searchPath:searchPredicate:
Yep. Contains is implicit in Metadata Query. The keyword here is attribute.
Comparison Syntax
The file metadata query expression syntax is a simplified form of filename globbing familiar to shell users. Queries have the following format:
attribute == value
where attribute is a standard metadata attribute (see “ File Metadata Attributes Reference ”) or a custom metadata attribute defined by an importer.
For example, to query Spotlight for all the files authored by “Steve” the query would look like the following:
kMDItemAuthors ==[c] "Steve"
We can reword those last lines as:
For example, to query Spotlight for all the files that contain the name “Steve” [or steve], the query would look like the following:
The documentation is confusing (and sometimes contradictory). There are two APIs for doing Spotlight searches: the original C-based API, and the subsequent NSMetadataQuery Objective-C API. What the document you’re quoting from doesn’t make clear is that it’s talking about the C-based API. This page:
Typically all you have to do to convert the Spotlight query into a predicate format string is make sure the predicate does not start with * (this is not supported by NSMetadataQuery when parsing a predicate). In addition, when you want to use a wildcard, you should use LIKE , as shown in the following example.
You cannot use an MDQuery operator as the VALUE of an NSPredicate object "KEY operator VALUE" string. For example, you write an “is-substring-of” expression in Spotlight like this: "myAttribute = '*foo*'" ; in NSPredicate strings you use the contains operator, like this: "myAttribute contains 'foo'" . Spotlight takes glob-like expressions, NSPredicate uses a different operator.
If you use “ * ” as left-hand-side key in a comparison expression, in Spotlight it means “any key in the item” and can only be used with == . You could only use this expression in an NSPredicate object in conjunction with an NSMetadataQuery object.
If I understand correctly, this means that CONTAINS and LIKE must not be used with NSMetadataQuery.
In the following, the left-hand NSMetadataQuery syntaxes are equivalent to the right-hand NSPredicate syntaxes.
=="*substring*" is equivalent to CONTAINS "substring" OR LIKE "*substring*"
=="substring*" is equivalent to BEGINSWITH "substring" OR LIKE "substring*"
=="*substring" is equivalent to ENDSWITH "substring" OR LIKE "*substring"
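One quick way to check the NSPredicate side of those equivalences is to evaluate the predicates against an in-memory dictionary instead of the Spotlight index. This is only a sketch: the sample text is made up, and it tests plain NSPredicate semantics, not how NSMetadataQuery translates the predicate into a Spotlight query.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Stand-in for a Spotlight item: a dictionary with a single attribute (sample text is made up).
set sampleItem to current application's NSDictionary's dictionaryWithObject:"some substring here" forKey:"kMDItemTextContent"

-- The two NSPredicate spellings of "contains this substring".
set containsPred to current application's NSPredicate's predicateWithFormat:"kMDItemTextContent CONTAINS %@" argumentArray:{"substring"}
set likePred to current application's NSPredicate's predicateWithFormat:"kMDItemTextContent LIKE %@" argumentArray:{"*substring*"}

-- Evaluate both against the stand-in item; both flags should agree.
set containsHit to (containsPred's evaluateWithObject:sampleItem) as boolean
set likeHit to (likePred's evaluateWithObject:sampleItem) as boolean
return {containsHit, likeHit}
```

Swapping the sample text for a single space or a period is an easy way to see whether NSPredicate itself treats those strings differently before involving Spotlight.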
Honestly, I suspect the best way to find out is to make a test file and see what works. But I wouldn’t be surprised if the behavior is a bit different where the string is a space or “.”.