Getting text of PDF w/ ASObjC?

scottdye · March 16, 2023, 3:40am

I have the following code that extracts the text from a PDF file, using an Automator workflow saved to the desktop:

set workflowpath to (path to desktop folder as text) & "ExtractPDFtext.workflow" -- From https://discussions.apple.com/thread/6847664
set thePDFfile to (choose file of type {"PDF"} default location (path to desktop folder))

set theCommand to "/usr/bin/automator -i " & (quoted form of (POSIX path of thePDFfile)) & " " & (quoted form of (POSIX path of workflowpath))
set output to do shell script theCommand

set outputTextFile to (path to desktop folder as text) & "Extract Text Output.txt"
set theText to (read file outputTextFile)

How would I use ASObjC to do the same, without depending on the Automator .workflow file? Bonus points if it doesn’t need to save an interim text file that is only read back into the script again. I can run circles around AppleScript, but am at a total loss when it comes to ASObjC.

peavine · March 16, 2023, 12:29pm

Shane’s ASObjC book contains a script that does what you want. I modified the script to include an option to return text from a particular page if that’s desired. This script assumes that the PDF contains selectable text; it will not do OCR.

use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions

set thePDF to POSIX path of (choose file of type {"pdf"})
set pageNumber to 0 -- set to 0 to get all pages
set theText to getTextFromPDF(thePDF, pageNumber)

on getTextFromPDF(posixPath, pageNumber)
	set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
	set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
	if pageNumber = 0 then
		return (thePDF's |string|()) as text
	else
		return ((thePDF's pageAtIndex:(pageNumber - 1))'s |string|()) as text
	end if
end getTextFromPDF
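
Because the script assumes selectable text, a guarded variant may be useful. The following is only a sketch, not part of the script from the book: the handler name getTextFromPDFChecked is hypothetical, and it assumes that initWithURL: returns missing value for a file that cannot be read as a PDF and that |string|() may return missing value or an empty string for an image-only PDF.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions

set thePDF to POSIX path of (choose file of type {"pdf"})
set theText to getTextFromPDFChecked(thePDF, 0) -- 0 means all pages
if theText is "" then display dialog "No selectable text was found. The PDF may need OCR."

-- getTextFromPDFChecked is a hypothetical name used only for this sketch
on getTextFromPDFChecked(posixPath, pageNumber)
	set theURL to current application's |NSURL|'s fileURLWithPath:posixPath
	set thePDF to current application's PDFDocument's alloc()'s initWithURL:theURL
	-- assumption: an unreadable or non-PDF file yields missing value
	if thePDF is missing value then error "Could not open " & posixPath & " as a PDF."
	-- reject page numbers outside the document before calling pageAtIndex:
	set pageTotal to (thePDF's pageCount()) as integer
	if pageNumber < 0 or pageNumber > pageTotal then error "Page number must be between 0 and " & pageTotal & "."
	if pageNumber = 0 then
		set theString to thePDF's |string|()
	else
		set theString to (thePDF's pageAtIndex:(pageNumber - 1))'s |string|()
	end if
	-- assumption: a document or page without extractable text may give missing value
	if theString is missing value then return ""
	return theString as text
end getTextFromPDFChecked
```

An empty result does not prove that the PDF is scanned, but it is a reasonable hint that OCR is needed.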

scottdye · March 16, 2023, 6:08pm

Thanks! That is exactly what I was looking for!

peavine · April 9, 2023, 9:12pm

The script included below is similar to that in post 2, except that it trims whitespace characters from the beginning and end of each line of the extracted string. I ran timing tests using Shane’s ASObjC book as the test PDF: extracting one page took 28 milliseconds, and extracting the entire PDF (159 pages) took 1.3 seconds.

use framework "Foundation"
use framework "Quartz"
use scripting additions

set pdfFile to POSIX path of (choose file of type {"com.adobe.pdf"})
set pageNumber to 0 -- set to 0 for all pages
set pdfText to getText(pdfFile, pageNumber)

on getText(theFile, pageNumber) -- theFile is POSIX path
	set theFile to current application's |NSURL|'s fileURLWithPath:theFile
	set theDocument to current application's PDFDocument's alloc()'s initWithURL:theFile
	if pageNumber = 0 then
		set theText to theDocument's |string|()
	else
		set theText to (theDocument's pageAtIndex:(pageNumber - 1))'s |string|()
	end if
	set theWhitespace to "(?m)^\\h+|\\h+$"
	return (theText's stringByReplacingOccurrencesOfString:theWhitespace withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}) as text
end getText
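
For anyone who wants to reproduce timings like those above, here is a minimal sketch of one way to measure the extraction. It is not necessarily how the original tests were run; it assumes that NSDate's timeIntervalSinceReferenceDate() is precise enough for this purpose, and the variable names are illustrative.

```applescript
use framework "Foundation"
use framework "Quartz"
use scripting additions

set pdfFile to POSIX path of (choose file of type {"com.adobe.pdf"})

-- start the clock only after the file has been chosen
set startTime to current application's NSDate's timeIntervalSinceReferenceDate()

set theURL to current application's |NSURL|'s fileURLWithPath:pdfFile
set theDocument to current application's PDFDocument's alloc()'s initWithURL:theURL
set pdfText to (theDocument's |string|()) as text

set endTime to current application's NSDate's timeIntervalSinceReferenceDate()

-- the difference is in seconds; convert it to whole milliseconds
set elapsedMs to round ((endTime - startTime) * 1000)
display dialog "Extraction took " & elapsedMs & " milliseconds."
```

Results will vary with the PDF and the hardware, so a single run is only a rough guide.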

VikingOSX · May 31, 2024, 5:49pm

There is a more succinct way to read all of the text in a selected PDF into an AppleScript variable:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use framework "PDFKit"
use scripting additions

property PDFDocument : a reference to current application's PDFDocument
property NSURL : a reference to current application's NSURL

set thePDF to POSIX path of (choose file of type "PDF") as text
set pdf to PDFDocument's alloc()'s initWithURL:(NSURL's fileURLWithPath:thePDF)
set pdfText to (pdf's selectionForEntireDocument)'s |string|() as text
-- pdfText's writeToFile:"/dev/stdout" atomically:false
return pdfText -- returning the text lets osascript print it to stdout

Saved as pdfStr.scpt, it can be run from the Terminal as:

osascript pdfStr.scpt

Tested in Script Debugger 8.07 on macOS Sonoma 14.5

emendelson · December 18, 2024, 2:06pm

When I optimize a PDF with Adobe Acrobat (I’m not at all sure which setting causes this), the resulting PDF produces nonsense when these methods are used to extract text.

ZadieSmith-FascinatedToPresume-Resized.pdf (332.5 KB)

All I can guess is that there is some encoding trick in the file that makes it possible to display text, but not extract it straightforwardly. Is there any way to get the text out of a PDF like this one programmatically?

peavine · December 18, 2024, 3:43pm

emendelson, I have three methods of getting text from PDFs, and none of them worked with your example. I do have a few miscellaneous suggestions, though.

The Preview info sheet states that your PDF was created with Bullzip PDF Printer. Is it possible that this and not the Adobe software is the issue?

The Adobe optimization has options for fonts. Have you tried not optimizing the fonts to see if that makes a difference?

If all else fails, OCR can be used. The following shortcut created reasonably accurate (but not perfect) text files with the contents of your example PDF. It creates one text document per page but these can be merged in the shortcut. There are also AppleScript OCR scripts in the MacScripter forum.

OCR File.shortcut (23.0 KB)

emendelson · December 18, 2024, 5:04pm

I use that file for experimenting, and there’s no way to be sure when the fonts became inaccessible. BullZip uses GhostScript under the hood, so it’s unlikely that things went wrong there. I’m not trying to fix the file - I’m trying to find a more reliable way of getting word counts from files that have the same problem. OCR seems to be the best way to proceed with this.

I’ve been working on a PDF wordcount script, and I think I may add a feature that displays the first few words of the text so that the user can decide whether or not to use OCR instead of simply extracting text. My next stop is the MacScripter forum. Thank you for this!
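
A minimal sketch of that preview idea, assuming the text has already been extracted by one of the scripts above. The handler name previewWords, the record labels, and the stand-in text are all hypothetical, and the word count relies on AppleScript's own word boundaries, which may differ from other word counters.

```applescript
use AppleScript version "2.4"
use scripting additions

-- previewWords is a hypothetical helper: it returns the word count
-- and the first maxWords words of theText joined by spaces
on previewWords(theText, maxWords)
	set wordList to words of theText
	set wordTotal to count of wordList
	if wordTotal = 0 then return {wordTotal:0, sampleText:""}
	if wordTotal < maxWords then set maxWords to wordTotal
	-- join the first words with spaces, then restore the delimiters
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to space
	set sampleText to (items 1 thru maxWords of wordList) as text
	set AppleScript's text item delimiters to savedDelimiters
	return {wordTotal:wordTotal, sampleText:sampleText}
end previewWords

-- stand-in for text extracted from a PDF
set extractedText to "The quick brown fox jumps over the lazy dog"
set thePreview to previewWords(extractedText, 5)

-- let the user judge whether the extracted text looks like real words
set userChoice to button returned of (display dialog "Extracted " & (wordTotal of thePreview) & " words. The text begins:" & return & return & (sampleText of thePreview) buttons {"Use OCR", "Use Extracted Text"} default button 2)
```

If the sample looks like nonsense, the script can branch on userChoice and hand the file to an OCR step instead.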