I have the following code that extracts the text from a PDF file, using an Automator workflow saved to the desktop:
set workflowpath to (path to desktop folder as text) & "ExtractPDFtext.workflow" -- From https://discussions.apple.com/thread/6847664
set thePDFfile to (choose file of type {"PDF"} default location (path to desktop folder))
set theCommand to "/usr/bin/automator -i " & (quoted form of (POSIX path of thePDFfile)) & " " & (quoted form of (POSIX path of workflowpath))
set output to do shell script theCommand
set outputTextFile to (path to desktop folder as text) & "Extract Text Output.txt"
set theText to (read file outputTextFile)
How would I use ASObjC to do the same without depending on the Automator .workflow file? Bonus points if it doesn’t need to save an interim text file only to read it back into the script. I can run circles around AppleScript, but I’m at a total loss when it comes to ASObjC.
Shane’s ASObjC book contains a script that does what you want. I modified it to add an option to return the text of a single page, if that’s desired. The script assumes that the PDF contains selectable text; it will not perform OCR.
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
set thePDF to POSIX path of (choose file of type {"pdf"})
set pageNumber to 0 -- set to 0 to get all pages
set theText to getTextFromPDF(thePDF, pageNumber)
on getTextFromPDF(posixPath, pageNumber)
	set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
	set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
	if pageNumber = 0 then
		return (thePDF's |string|()) as text
	else
		return ((thePDF's pageAtIndex:(pageNumber - 1))'s |string|()) as text
	end if
end getTextFromPDF
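The handler above assumes that the file opens as a PDF and that the requested page exists. Below is a minimal sketch of a guarded variant. The handler name getTextFromPDFSafely and the error messages are my own illustrative choices, not from Shane's book. It relies on initWithURL: returning missing value when the document cannot be created, and on PDFDocument's pageCount to check the page number before calling pageAtIndex:.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions

-- Hypothetical guarded variant of getTextFromPDF (name and messages are illustrative)
on getTextFromPDFSafely(posixPath, pageNumber)
	set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
	set theDoc to current application's PDFDocument's alloc()'s initWithURL:theURL
	-- initWithURL: yields missing value when the file cannot be opened as a PDF
	if theDoc is missing value then error ("Could not open as PDF: " & posixPath)
	if pageNumber = 0 then
		set theString to theDoc's |string|()
	else
		-- Check the 1-based page number before asking for the 0-based index
		set pageTotal to (theDoc's pageCount()) as integer
		if pageNumber < 1 or pageNumber > pageTotal then error ("Page " & pageNumber & " is out of range (1 to " & pageTotal & ").")
		set theString to (theDoc's pageAtIndex:(pageNumber - 1))'s |string|()
	end if
	-- The string can be missing value when there is no extractable text
	if theString is missing value then return ""
	return theString as text
end getTextFromPDFSafely
```

It is called the same way as the original, for example: set theText to getTextFromPDFSafely(thePDF, pageNumber).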
The script below is similar to the one in post 2, except that it trims horizontal whitespace (spaces and tabs) from the beginning and end of each line of the extracted text. I ran timing tests using the PDF of Shane’s ASObjC book (159 pages): extracting one page took 28 milliseconds, and extracting the entire PDF took 1.3 seconds.
use framework "Foundation"
use framework "Quartz"
use scripting additions
set pdfFile to POSIX path of (choose file of type {"com.adobe.pdf"})
set pageNumber to 0 -- set to 0 for all pages
set pdfText to getText(pdfFile, pageNumber)
on getText(theFile, pageNumber) -- theFile is POSIX path
	set theFile to current application's |NSURL|'s fileURLWithPath:theFile
	set theDocument to current application's PDFDocument's alloc()'s initWithURL:theFile
	if pageNumber = 0 then
		set theText to theDocument's |string|()
	else
		set theText to (theDocument's pageAtIndex:(pageNumber - 1))'s |string|()
	end if
	set theWhitespace to "(?m)^\\h+|\\h+$"
	return (theText's stringByReplacingOccurrencesOfString:theWhitespace withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}) as text
end getText
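To see what the trimming pattern does without needing a PDF, here is a small sketch that applies the same regular expression to a made-up two-line string. The sample text and variable names are only for illustration. The (?m) flag makes ^ and $ match at the start and end of every line, and \h matches horizontal whitespace such as spaces and tabs.

```applescript
use framework "Foundation"
use scripting additions

-- Made-up sample: leading/trailing spaces on line 1, leading/trailing tabs on line 2
set sampleText to current application's NSString's stringWithString:("  first line  " & linefeed & tab & "second line" & tab)
set thePattern to "(?m)^\\h+|\\h+$"
-- Same replacement call as in getText, applied to the sample string
set trimmedText to (sampleText's stringByReplacingOccurrencesOfString:thePattern withString:"" options:(current application's NSRegularExpressionSearch) range:{0, sampleText's |length|()}) as text
-- trimmedText is now "first line" & linefeed & "second line"
```

Whitespace inside a line (such as the space between "first" and "line") is left alone, because the pattern only matches runs anchored at a line start or line end.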
2024-05-31T18:08:00Z
There is a more succinct way to read all of the text in a selected PDF into an AppleScript variable:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "PDFKit"
use scripting additions
property PDFDocument : a reference to current application's PDFDocument
property NSURL : a reference to current application's NSURL
set thePDF to POSIX path of (choose file of type "PDF") as text
set pdf to PDFDocument's alloc()'s initWithURL:(NSURL's fileURLWithPath:thePDF)
set pdfText to (pdf's selectionForEntireDocument)'s |string|() as text
-- pdfText's writeToFile:"/dev/stdout" atomically:false
return
Saved as pdfStr.scpt, it can be run from the Terminal as:
osascript pdfStr.scpt
Tested in Script Debugger 8.07 on macOS Sonoma 14.5
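For Terminal use, it may be more convenient to pass the PDF path as an argument rather than go through the file dialog. The following is a minimal sketch of that idea, not part of the tested script above. It assumes the script is run with osascript, which passes command-line arguments to the run handler and prints the handler's result to standard output. The usage message and error wording are illustrative.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions

-- Sketch: take the PDF path from the command line instead of a file dialog
on run argv
	if (count of argv) is 0 then error "Usage: osascript pdfStr.scpt /path/to/file.pdf"
	set thePath to item 1 of argv
	set theURL to current application's |NSURL|'s fileURLWithPath:thePath
	set pdf to current application's PDFDocument's alloc()'s initWithURL:theURL
	-- initWithURL: yields missing value when the file cannot be opened as a PDF
	if pdf is missing value then error ("Could not open as PDF: " & thePath)
	set pdfString to pdf's |string|()
	if pdfString is missing value then return ""
	return pdfString as text -- osascript prints this result to stdout
end run
```

Saved under the same name, it would be run as: osascript pdfStr.scpt /path/to/file.pdf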