Some PDF manipulation question

Hello,
I’m writing a script to manipulate PDFs with AppleScriptObjC commands.
I started with a script found on the MacScripter forum:
https://www.macscripter.net/viewtopic.php?id=49176
to crop PDFs.
I plan to use this on scanned PDFs that contain only pixmap images (without any text).
The script modifies the kPDFDisplayBoxMediaBox rectangle of the document.
This seems to modify the bounds of whole pages.

repeat with i in pageNumbers
	set aPage to (sourceDoc's pageAtIndex:((i as integer) - 1))
	set {{aE, bE}, {cE, dE}} to (aPage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
	set pageSize to {{aC, (dE - bC - dC)}, {cC, dC}}
	(aPage's setBounds:pageSize forBox:(current application's kPDFDisplayBoxMediaBox))
	(targetDoc's insertPage:aPage atIndex:targetPageCount)
	set targetPageCount to targetPageCount + 1
end repeat

I’ve modified the script to change the kPDFDisplayBoxCropBox instead.
As both seem to modify only what is displayed without changing the size of the file, here are my questions:
What are the differences and implications of the two options?
Is there a way to do a crop that discards the unused portions, thus reducing the file size?
(Can this be achieved without recompressing the images of each page and reinserting them in the document? If so, how? If not, how can this recompression be done? I suppose that some extra tool such as ImageMagick will be needed.)
Any help or comment is welcome.
Regards

You might want to try sips. In my testing, in.pdf was 684 KB and out.pdf was 37 KB. The options are a bit arcane, but cropOffset apparently has to contain at least one value of 1 or more; otherwise the cropped area is taken from the center of the PDF. You would have to process each page of the PDF separately, though.

do shell script "sips --cropOffset 1 0 -c 500 500 -o /Users/Robert/Working/out.pdf /Users/Robert/Working/in.pdf"

Thank you for your reply and suggestion (and for the original script example).
I just tried your line but got an error: unknown function “--cropOffset”.
Do you need a specific version for this? I forgot to mention that I’m on Mojave.
I only see the -c option (cropToHeightWidth H W) in the -h help for sips.
Also, I have to use --out for the output, as -o is used for optimizeColorForSharing.
I also don’t see the --cropOffset option in the man page.
Am I missing something?
With just the -c option it does indeed seem to crop from the center, which is not very handy. And yes, it seems to process only the first page of a PDF, so I guess I have to extract each page to its own PDF file (unless I can pass sips a value from the pageAtIndex:(x) method of a PDFDocument variable).

This leaves the main question: is there any difference or advantage in doing the crop with kPDFDisplayBoxMediaBox rather than with kPDFDisplayBoxCropBox?

Regards.

xpab. I retested my script and it worked as expected. I also checked the sips options against the sips man page and help page, and they were correct. I’m running Monterey, and I guess that’s the reason for the differences we are seeing. I edited the script slightly, but it almost certainly will not work on your computer.

do shell script "/usr/bin/sips --cropOffset 1 1 --cropToHeightWidth 200 200 --out ~/Desktop/out.pdf ~/Desktop/in.pdf"

Unfortunately, I do not know the functional difference between kPDFDisplayBoxMediaBox and kPDFDisplayBoxCropBox (both seem to work). Hopefully another forum member who is more knowledgeable on this topic will be able to help.

Thanks.
I will experiment tomorrow to see if I can extract the sips command from Monterey and make it work in Mojave; otherwise I will investigate some other options.
Being able to do normal crops in batch is already quite useful.

When you create a PDF, the media box and the crop box are the same. When you crop a PDF, all you actually do is change the display crop box, so what is displayed is just the “cropped” area.

Thanks Shane, that explains things. It made me wonder why sips reduces file size when used to crop a 1-page PDF, and I assume that it actually crops the underlying image.

I think the sips size reduction is actually just other stuff being stripped out, but I can’t recall the exact details.


To really “crop” a PDF (that is, reduce the actual document dimensions, not just mask an area with crop/media boxes) you’ll need to use PDF-related CGContext... stuff. And let me assure you that if you want to preserve your own sanity you don’t want to deal with it (plus I don’t know if it can be controlled from AppleScriptObjC anyway).

But since your PDF is basically a raster image (as far as I understand), you may want to try this:

• Convert the PDF to an image (jpeg, png, tiff)
• Crop the image
• Convert it back to PDF (if you really need to have it as a PDF).

I can’t provide any detailed instructions on this right now (sorry!), but I believe that sips should be able to do all this.

Of course, it is not possible to use the Monterey sips command in Mojave because of its dependencies on other frameworks.

Regarding the sips cropping, I managed to find a workaround by trial and error, but I must admit I’m not sure I understand exactly why it works (and perhaps it relies on some local conditions that may differ in other contexts).

For those interested, here is the piece of code that crops a multipage PDF and reduces the file size.
It is in an early, unpolished and unoptimized state, so be indulgent.
It works under Mojave, but I have not tested it with later versions, so it may not work for you.
It is meant for bitmap PDFs. I have not tested it with other kinds of content.
The same crop is applied to all pages. It leaves behind the folder with the intermediate files.
A typical workflow would be to open the PDF in Preview, make a selection for the crop, open the inspector and note the selection values to be entered in the script.
A PDF (300 dpi color scan) went from 51 MB to 6.8 MB.
Regards

use framework "Foundation"
use framework "Quartz"
use scripting additions

property NSString : a reference to current application's NSString
property NSURL : a reference to current application's NSURL
property PDFDocument : a reference to current application's PDFDocument

on main()
	tell application "Finder" to set sourceFiles to selection as alias list
	if (count sourceFiles) ≠ 1 then return
	tell application "Finder" to set LeFi to (item 1 of sourceFiles)
	set sourceFile to POSIX path of LeFi
	if sourceFile does not end with ".pdf" then return
	set sourceFile to current application's |NSURL|'s fileURLWithPath:sourceFile
	set sourceDoc to (current application's PDFDocument's alloc()'s initWithURL:sourceFile)
	-- Create Temporary folder
	tell application "Finder"
		set FoldNam to (name of LeFi) & " Pages"
		set TargetNam to name of LeFi
		try
			make new folder at parent of LeFi with properties {name:FoldNam}
		end try
		set DestFold to POSIX path of (((parent of LeFi) as text) & FoldNam & ":") as text
		set TargetFold to POSIX path of ((parent of LeFi) as text)
	end tell
	-- Get the cropping rectangle in a quick and ugly way
	-- corner plus height and width as displayed in Preview inspector
	display dialog "x1:" default answer ""
	copy text returned of the result to Lex1
	display dialog "y1:" default answer ""
	copy text returned of the result to Ley1
	display dialog "Width:" default answer ""
	copy (text returned of the result) + Lex1 to Lex2
	display dialog "Height:" default answer ""
	copy (text returned of the result) + Ley1 to Ley2
	-- Setup a few vars
	set sourceFolder to sourceFile's URLByDeletingLastPathComponent
	set sourceFolder to sourceFolder's URLByAppendingPathComponent:FoldNam
	set sourcePageCount to sourceDoc's pageCount()
	-- Loop through the pages, change the cropbox and write each page to a pdf file in Temp folder
	set LaList to {}
	repeat with i from 1 to sourcePageCount
		set targetDoc to current application's PDFDocument's new()
		set aPage to (sourceDoc's pageAtIndex:((i as integer) - 1))
		set {{aE, bE}, {cE, dE}} to (aPage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
		set pageSize to {{Lex1, dE - Ley1 - Ley2}, {Lex2, Ley2}}
		(aPage's setBounds:pageSize forBox:(current application's kPDFDisplayBoxCropBox))
		(targetDoc's insertPage:aPage atIndex:0)
		-- Magic trick for actual cropping with sips (may not work for you)
		copy ((Lex2 - Lex1) * 2.09) as integer to LaW
		copy ((Ley2 - Ley1) * 2.09) as integer to LaH
		set targetFile to (sourceFolder's URLByAppendingPathComponent:("Page " & i & ".pdf"))
		(targetDoc's writeToURL:targetFile)
		-- Feed each single-page PDF to sips, which creates a cropped version
		-- Under Mojave there is no --cropOffset option, so we have to resort to an ugly trick to make it work the way we want
		do shell script "/usr/bin/sips --cropToHeightWidth " & LaH & " " & LaW & " --out '" & DestFold & "Page " & i & " (c).pdf' '" & DestFold & "Page " & i & ".pdf'"
		set LaList to LaList & (DestFold & "Page " & i & " (c).pdf")
	end repeat
	copy (POSIX file DestFold) as alias to LeFold
	-- Merge all the pages to rebuild the full pdf using the list of sipsed files
	set merged_pdf to (NSString's stringWithString:(TargetFold & TargetNam & " (Trimed).pdf"))'s stringByStandardizingPath()
	set outurl to NSURL's fileURLWithPath:(item 1 of LaList)
	set outpdf to PDFDocument's alloc()'s initWithURL:outurl
	set LaList to rest of LaList
	set lastPage to outpdf's pageCount()
	repeat with pdfdoc in LaList
		set thisURL to (NSURL's fileURLWithPath:(pdfdoc as text))
		set thisPDF to (PDFDocument's alloc()'s initWithURL:thisURL)
		repeat with n from 1 to thisPDF's pageCount()
			set this_page to (thisPDF's pageAtIndex:(n - 1)) -- Pdf pages are 0 based
			(outpdf's insertPage:this_page atIndex:lastPage)
			set lastPage to outpdf's pageCount()
		end repeat
	end repeat
	outpdf's writeToFile:merged_pdf
end main

main()

Edited the script, as I was using a whose clause that returned an improperly sorted list.
Now I build the list on the fly, so it’s OK.
98 MB → 13.9 MB
Not sure what I lose in the process, because the result looks identical to me.
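
A side note on the script above: it wraps the paths in single quotes by hand, which would break if a file or folder name contained an apostrophe. Below is a minimal sketch of a more robust way to build the same command with AppleScript's quoted form. The handler name sipsCropCommand and the example paths are made up for illustration; only the sips options already used in the script appear here.

```applescript
-- Build the sips crop command used in the script, letting AppleScript
-- do the shell quoting instead of adding single quotes by hand.
-- sipsCropCommand is a hypothetical helper name, not part of the original script.
on sipsCropCommand(inPath, outPath, cropHeight, cropWidth)
	return "/usr/bin/sips --cropToHeightWidth " & cropHeight & " " & cropWidth & ¬
		" --out " & quoted form of outPath & " " & quoted form of inPath
end sipsCropCommand

-- Example with made-up paths; pass the result to do shell script
set theCommand to sipsCropCommand("/tmp/Page 1.pdf", "/tmp/Page 1 (c).pdf", 696, 696)
```

In the loop, the do shell script line could then become do shell script sipsCropCommand(...) with the same four values.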

xpab. I tested your script, and it seems to work fine with Monterey, but I did encounter one issue.

My test PDF was 4.5 MB. When I ran the script, I entered crop coordinates 1-1-333-333, and the PDF was cropped as expected, but the file size of the cropped PDF was still 4.5 MB. I repeated that with crop coordinates 2-2-333-333, and the PDF was cropped as expected, but the file size was 276 KB. In the first case Preview shows the media box as 1600 x 1055 points (the same as the source PDF) but in the second case Preview shows the media box as 334 x 334. In both cases Preview shows the crop box as 334 x 334.

I’m interested in how you handled matters, and I’ll have a look at your script tomorrow.

After trying your sips script I realized that I was wrong and sips indeed truly crops PDF (not just masks cropped area with crop boxes). Thanks! Live and learn.

xpab. I spent some time looking at your script, which actually crops the source PDF twice: first with the PDFPage setBounds method and then with sips. Unfortunately it’s hard for me to completely understand how these interact.

IMO, if the script works reliably then you should simply use it. If you encounter the issue noted in my earlier post, you can change kPDFDisplayBoxCropBox to kPDFDisplayBoxMediaBox in one spot, which cured the issue in my testing.

FWIW, there appear to be two possible alternative approaches which might reliably accomplish what you want. The first is to:

  1. split the source PDF into individual pages with ASObjC;
  2. use sips in batch mode to crop the individual pages; and
  3. merge the cropped PDF pages with ASObjC.

The above is probably easily accomplished but it doesn’t work on Mojave. The alternative might be to use ASObjC to:

  1. convert each page of the source PDF to a bitmap image;
  2. crop the bitmap image;
  3. convert the cropped bitmap image back to a PDF; and
  4. add the cropped PDF to a new target PDF.

I thought I would work on the ASObjC solution, and I’ll post that (if it works) in my thread on the MacScripter’s site. Anyways, your script is an ingenious solution to a not-so-simple task.

Thanks for your remarks.
Indeed, it is strange to have those inconsistent results (more on that later).
So far I hadn’t got results where the size doesn’t change, but I have now found a reproducible case that does so.
I’m not sure I understand why you say that the first method does not work under Mojave, as it seems to me that it is exactly what I am doing (though I have to play around with sips to overcome the fact that --cropOffset is not available).
Anyway, I’ve improved the script a bit to avoid having to enter the coordinates.
I first do a crop in Preview on the first page, then I let the script retrieve the crop box and apply it to the rest.
This led me to interesting findings and a few points that I don’t understand.
First, when extracting the crop box rectangle from a page I get an unexpected result.
I would have expected to get the origin point as x, y (in points), then the width and height (also in points). Instead, for y I get a value that is not the y coordinate (I have to calculate it as shown in the script).
Either it is a bug or I don’t understand something.
Then there is this multiplying factor that I have to use to get the proper height and width to feed to the crop command of sips (2.09 seems to be the most accurate value). I don’t understand the logic here at all. Is it because of some dpi consideration, or is it related to Retina displays?
But what puzzles me is that if I use 2.085 instead, sips doesn’t free any space and gives me a PDF of the same size as the input.
I also tried another script that uses the sips crop command without actually cropping anything (using the size of the media box), and it still manages to reduce the file size a bit. And if I do it several times on the resulting files, the size decreases each time, so what is really happening behind the curtain?
Is there some recompression involved at some point? But why?
So the process seems to work, but I can’t fully understand why and how.
Regards

use framework "Foundation"
use framework "Quartz"
use scripting additions

property NSString : a reference to current application's NSString
property NSURL : a reference to current application's NSURL
property PDFDocument : a reference to current application's PDFDocument

on main()
	tell application "Finder" to set sourceFiles to selection as alias list
	if (count sourceFiles) ≠ 1 then return
	tell application "Finder" to set LeFi to (item 1 of sourceFiles)
	set sourceFile to POSIX path of LeFi
	if sourceFile does not end with ".pdf" then return
	set sourceFile to current application's |NSURL|'s fileURLWithPath:sourceFile
	set sourceDoc to (current application's PDFDocument's alloc()'s initWithURL:sourceFile)
	-- Get Media and Crop boxes and see if there is a crop defined for page 1
	set aPage to (sourceDoc's pageAtIndex:0)
	set {{Xm1, Ym1}, {Xm2, Ym2}} to (aPage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
	set {{Lex1, Ley1}, {Lex2, Ley2}} to (aPage's boundsForBox:(current application's kPDFDisplayBoxCropBox))
	if {{Xm1, Ym1}, {Xm2, Ym2}} is {{Lex1, Ley1}, {Lex2, Ley2}} then errorAlert("Page 1 has no crop box.")
	set Ley1 to Ym2 - Ley1 - Ley2
	-- Create Temporary folder
	
	tell application "Finder"
		set FoldNam to (name of LeFi) & " Pages"
		set TargetNam to name of LeFi
		try -- If folder already exists then proceed
			make new folder at parent of LeFi with properties {name:FoldNam}
		end try
		-- make POSIX paths for Temp and Destination folders
		set DestFold to POSIX path of (((parent of LeFi) as text) & FoldNam & ":") as text
		set TargetFold to POSIX path of ((parent of LeFi) as text)
	end tell
	-- Prepare URL for writeToURL method and count pages
	set sourceFolder to sourceFile's URLByDeletingLastPathComponent
	set sourceFolder to sourceFolder's URLByAppendingPathComponent:FoldNam
	set sourcePageCount to sourceDoc's pageCount()
	-- Loop through the pages, change the cropbox and write each page to a pdf file in Temp folder
	set LaList to {}
	repeat with i from 1 to sourcePageCount
		set targetDoc to current application's PDFDocument's new()
		set aPage to (sourceDoc's pageAtIndex:((i as integer) - 1))
		if i is not 1 then -- page one is already cropped
			set {{aE, bE}, {cE, dE}} to (aPage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
			set pageSize to {{Lex1, dE - Ley1 - Ley2}, {Lex2, Ley2}}
			(aPage's setBounds:pageSize forBox:(current application's kPDFDisplayBoxCropBox))
		end if
		--(aPage's setBounds:pageSize forBox:(current application's kPDFDisplayBoxMediaBox))
		(targetDoc's insertPage:aPage atIndex:0)
		-- Magic trick for actual cropping with sips (may not work for you) (Why multiply by 2.09? dpi, Retina?)
		-- If I use 2.085 I get sipsed files that remain the same size; with 2.09 they are smaller
		copy (Lex2 * 2.09) as integer to LaW
		copy (Ley2 * 2.09) as integer to LaH
		set targetFile to (sourceFolder's URLByAppendingPathComponent:("Page " & i & ".pdf"))
		(targetDoc's writeToURL:targetFile)
		-- Feed each single-page PDF to sips, which creates a cropped version
		-- Under Mojave there is no --cropOffset option, so we have to make the crop beforehand
		do shell script "/usr/bin/sips --cropToHeightWidth " & LaH & " " & LaW & " --out '" & DestFold & "Page " & i & " (c).pdf' '" & DestFold & "Page " & i & ".pdf'"
		-- Build list of sips-cropped pages
		set end of LaList to (DestFold & "Page " & i & " (c).pdf")
	end repeat
	-- Merge all the pages to rebuild the full pdf using the list of sipsed files
	set merged_pdf to (NSString's stringWithString:(TargetFold & TargetNam & " (Trimed).pdf"))'s stringByStandardizingPath()
	set outurl to NSURL's fileURLWithPath:(item 1 of LaList)
	set outpdf to PDFDocument's alloc()'s initWithURL:outurl
	set LaList to rest of LaList
	set lastPage to outpdf's pageCount()
	repeat with pdfdoc in LaList
		set thisURL to (NSURL's fileURLWithPath:(pdfdoc as text))
		set thisPDF to (PDFDocument's alloc()'s initWithURL:thisURL)
		repeat with n from 1 to thisPDF's pageCount()
			set this_page to (thisPDF's pageAtIndex:(n - 1)) -- Pdf pages are 0 based
			(outpdf's insertPage:this_page atIndex:lastPage)
			set lastPage to outpdf's pageCount()
		end repeat
	end repeat
	outpdf's writeToFile:merged_pdf
	--tell application "Finder" to delete ((POSIX file DestFold) as alias) -- Delete Temp folder
end main

main()

on errorAlert(dialogMessage)
	display alert "Can't proceed:" message dialogMessage as critical
	error number -128
end errorAlert

I agree that your script works with sips under Mojave. My first method would include the following code, which appears not to work under Mojave.

-- this code overwrites originals
do shell script "sips --cropOffset 1 1 --cropToHeightWidth 300 300 ~/pdftemp/*.pdf"

After some more experiments I’ve determined that sips --cropToHeightWidth, in this context (taking a bitmap PDF page as input), is actually destructive. It doesn’t just crop the image data but also recompresses it in the process, which produces a loss of quality.
This can be seen quite clearly in this example of a detail of the original and the cropped version.


This is not a big surprise and in most cases (in my cases, at least) not an issue at all, but it must be said.
(I suppose that, depending on how an image is stored in a PDF, it is simply not possible to crop it without recompressing.)

I’m still wondering about this y value for the crop box origin and the multiplier needed to get the proper height and width values to pass to sips.

xpab. I ran some tests to investigate the issue you raise. My source PDF was:

PDF Test.pdf
672 KB
13.89 × 14.66 inches page size
1000 x 1055 media box
1000 x 1055 crop box

I ran your script from post 10 and used a crop of 2-2-333-333. The first temporary file created by your script with the setBounds method was:

Page 1.pdf
672 KB
4.66 × 4.66 inches
1000 x 1055 media box
335 x 335 crop box

The second temporary file created by your script with the sips utility applied to Page 1.pdf was:

Page 1 (c).pdf
79 KB
4.64 × 4.64 inches page size
334.08 x 334.08 media box
334.08 x 334.08 crop box

When running your script, I logged the values for LaH and LaW (which are input to sips) and they were 696 and 696. So the question is what the sips utility does when it applies these height and width values to Page 1.pdf. I have no knowledge of the inner workings of sips, but my guess is that it crops the part of the image in Page 1.pdf that lies outside the crop box, and that it does other stuff that reduces the size of the PDF.

Anyways, I’m not remotely an expert on this topic and that’s about all I know. Hopefully other forum members will offer their advice and opinions.
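
The figures in this test may hint at where the 2.09 multiplier comes from. The sketch below only compares two ratios: the pixel value passed to sips divided by the media box it produced, and the number of pixels per point at 150 dpi. That sips treats the PDF as a 150 dpi image is an assumption drawn from these numbers, not something confirmed anywhere in this thread.

```applescript
-- Values reported in the test above (crop of 2-2-333-333)
set pixelsPassedToSips to 696 -- LaH and LaW
set pointsInResult to 334.08 -- media box of Page 1 (c).pdf

-- Ratio actually observed between the sips input and the resulting box
set observedRatio to pixelsPassedToSips / pointsInResult

-- Pixels per point IF sips rasterizes at 150 dpi (assumption); 1 point = 1/72 inch
set ratioAt150dpi to 150 / 72

-- Both values come out at about 2.0833
{observedRatio, ratioAt150dpi}
```

If that reading is right, 2.09 is just a slightly rounded-up 150/72, and the width and height given to sips are pixels rather than points.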

OK. The y value mystery is solved. Such a stupid mistake.
I forgot that for a CGRect the origin point is bottom-left (and not top-left).
(Funny that the script was nevertheless working OK.)
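
As a small worked example of that flip, here is a sketch of a handler that turns a selection measured from the top-left corner into the bottom-left-origin rectangle that setBounds:forBox: expects. The handler name bottomLeftRect is made up, and it assumes the selection values are measured from the top-left corner, as the scripts in this thread do.

```applescript
-- Convert a selection given as x, y, width, height (points, measured from
-- the top-left corner of the page) into {{x, y}, {width, height}} with the
-- origin at the bottom-left corner. pageHeight is the media box height.
-- bottomLeftRect is a hypothetical helper name used for illustration.
on bottomLeftRect(selX, selY, selW, selH, pageHeight)
	set flippedY to pageHeight - selY - selH
	return {{selX, flippedY}, {selW, selH}}
end bottomLeftRect

-- A 333 x 333 selection at (2, 2) on a page 1055 points tall
bottomLeftRect(2, 2, 333, 333, 1055)
-- result: {{2, 720}, {333, 333}}
```

Only the y value changes; x, width and height are the same in both conventions.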

For those who would like to use AppleScriptObjC, here is a non-destructive cropping script:
(it preserves image’s resolution)

use framework "Foundation"
use framework "Quartz"
use scripting additions

property mm2pts : 2.83464566929

-- initial values for file paths
set theURL to current application's NSURL's fileURLWithPath:"/some/original file.pdf"
set destURL to current application's NSURL's fileURLWithPath:"/some/destination file.pdf"

-- desired crop values in millimeters
set leftCrop to 10
set rightCrop to 20
set topCrop to 30
set bottomCrop to 40

-- get initial values from document
set theDoc to current application's PDFDocument's alloc()'s initWithURL:theURL
set pageCount to (theDoc's pageCount())

repeat with iPage from 0 to pageCount - 1
	-- get page bounds 
	set nextPage to (theDoc's pageAtIndex:iPage)
	set {{p1, p2}, {p3, p4}} to (nextPage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
	
	-- prepare the crop rectangle in points; separate variables keep the
	-- millimeter inputs unchanged from one page to the next
	set cropX to p1 + (leftCrop * mm2pts)
	set cropY to p2 + (bottomCrop * mm2pts)
	set cropWidth to p3 - ((leftCrop + rightCrop) * mm2pts)
	set cropHeight to p4 - ((topCrop + bottomCrop) * mm2pts)
	
	-- crop the page
	(nextPage's setBounds:{{cropX, cropY}, {cropWidth, cropHeight}} forBox:(current application's kPDFDisplayBoxCropBox))
		
	-- crop the page deleting unwanted areas
	--(nextPage's setBounds:{{cropX, cropY}, {cropWidth, cropHeight}} forBox:(current application's kPDFDisplayBoxMediaBox))

end repeat

(theDoc's writeToURL:destURL)

See the PDFDisplayBox documentation to decide which source/target box you want to use.
