Some PDF manipulation question

peavine · October 24, 2022, 2:28pm

I’ve done a little research and testing on this topic and thought I’d pass along what I’ve learned. The Apple documentation defines kPDFDisplayBoxMediaBox as:

A rectangle defining the boundaries of the physical medium for display or printing, expressed in default user-space units.

And, kPDFDisplayBoxCropBox is defined as:

A rectangle defining the boundaries of the visible region, expressed in default user-space units. Default value equal to kPDFDisplayBoxMediaBox.

The question originally raised by the OP is what impact the choice between these two boxes has when cropping a PDF with my script. So, I ran a test with my script, and the results with a crop rectangle of 1-1-500-500 were:

Original PDF
File Size: 672 KB
Media Box: 1000 x 1055
Crop Box: 1000 x 1055

Original PDF Cropped with kPDFDisplayBoxMediaBox
File Size: 198 KB
Media Box: 500 x 500
Crop Box: 500 x 500

Original PDF Cropped with kPDFDisplayBoxCropBox
File Size: 672 KB
Media Box: 1000 x 1055
Crop Box: 500 x 500

I also tested Jonas’ script with my test PDF and the file sizes of the cropped PDFs were 672 and 198 KB. Our scripts use the same method to crop the PDFs so that’s as expected. However, I do much prefer the approach used in Jonas’ script to calculate the crop rectangle.

xpab · October 24, 2022, 3:31pm

I ran your script, but the result file is always the same size as the original, regardless of which rect you modify (crop box or media box).
Do you get an output file smaller than the original?
If that’s the case, then this would mean that the behavior under Mojave is not the same as under Monterey.
I’m getting good results using sips.
(And yes, this is only relevant for PDFs containing bitmap images.)

leo_r · October 24, 2022, 3:46pm

One more note about sips cropping PDF:

sips, obviously, just converts the PDF to an image, crops the image, then resaves it as a PDF.

Since the original PDF discussed here is in fact just a raster image, it’s not immediately obvious.

But if you take a standard PDF file (with fonts and vector objects), sips will also just turn it into a cropped image.

(Naturally, the only way to crop the actual dimensions of a PDF file that contains fonts and vectors is to turn it into an image. Otherwise, the cropped area can only be masked with the crop box.)

ionah · October 24, 2022, 4:39pm

Sorry. And you’re right: the syntax to get/set a rectangle (bounds) is different.
If I remember correctly, this should work:

repeat with iPage from 0 to pageCount - 1
	-- get page bounds 
	set nextPage to (theDoc's pageAtIndex:iPage)
	set theRect to (nextPage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
	set {{p1, p2}, {p3, p4}} to theRect as record
	
	-- prepare the crop (separate variables, so the mm values are not rescaled on every page)
	set cropLeft to (leftCrop * mm2pts)
	set cropBottom to (bottomCrop * mm2pts)
	set cropWidth to p3 - (rightCrop * mm2pts) - cropLeft
	set cropHeight to p4 - (topCrop * mm2pts) - cropBottom
	
	-- crop the page as in Acrobat
	set theCrop to current application's NSMakeRect(cropLeft, cropBottom, cropWidth, cropHeight)
	(nextPage's setBounds:theCrop forBox:(current application's kPDFDisplayBoxCropBox))
	
	-- crop the page deleting unwanted areas
	--(nextPage's setBounds:theCrop forBox:(current application's kPDFDisplayBoxMediaBox))
	
end repeat

xpab · October 24, 2022, 4:56pm

Actually, both work the same (except for the record cast, which produces an error).
The point is that modifying the MediaBox doesn’t change the actual size of the file. Only the viewport is modified; the image data is not. (At least here.)

ionah · October 24, 2022, 5:45pm

So, this syntax is for systems prior to Mojave.

As my script does not modify the images’ resolution, the resulting file can’t be smaller if you use kPDFDisplayBoxCropBox. Refer to the post by Shane above in this topic.

xpab · October 24, 2022, 6:17pm

Yes, of course, but I was referring to kPDFDisplayBoxMediaBox.
Under Mojave, modifying it doesn’t change the file size.
This seems logical. Depending on how the underlying image data is stored (lossless or not), you may not be able to crop it without recompressing. Perhaps the fact that we work with different types of source data explains the different results.
I was a little confused by peavine’s remarks about file size changes.

peavine · October 24, 2022, 7:20pm

My test PDF was a PNG saved as a PDF. I just saved a digital JPEG photo as a PDF and cropped it with my script. The PDF was cropped but there was no change in file size. I tried Jonas’ script and it didn’t change the size either.

xpab · October 24, 2022, 8:24pm

Thanks for the clarification. This makes sense now.

peavine · October 27, 2022, 2:10pm

I don’t believe the above issue was ever addressed, and I thought I would provide a possible answer.

Xpab’s script uses 1) the boundsForBox method to get the bounds of the source PDF, 2) the setBounds method to crop the source PDF, and 3) the sips utility to crop the source PDF. However, the boundsForBox method returns points, the setBounds method uses points, and sips uses pixels. So, a conversion needs to be made for sips to work correctly.

My guess is that xpab’s PDF has a resolution of 150 dpi which would yield a conversion factor of:

150 dpi / 72 points per inch (PDF Default) = 2.083

If the resolution of a source PDF is not 150 dpi then xpab’s script may not work as expected. However, it would be easy to calculate the conversion factor in the script. FWIW, I’ve included below some data from my test PDF.

Bounds as returned by the boundsForBox method
set {{aE, bE}, {cE, dE}} to (aPage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))
→ {{0.0, 0.0}, {1000.0, 1055.0}}
Pixel width and height returned by sips
do shell script "sips -g pixelWidth -g pixelHeight " & quoted form of theFile
→ "pixelWidth: 2084
pixelHeight: 2198"
Resolution returned by sips
do shell script "sips -g dpiWidth -g dpiHeight " & quoted form of theFile
→ "dpiWidth: 150.000
dpiHeight: 150.000"
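
The conversion described above can be sketched in plain AppleScript. This is only an illustration: the handler names are hypothetical and not part of xpab’s script, and it assumes each page holds one image that fills the page. The factor is derived from the pixel width reported by sips and the page width in points, then applied to a crop rectangle given in points.

```applescript
use scripting additions

-- Hypothetical helper: ratio of sips pixels to PDF points for one page.
on conversionFactor(pixelWidth, pointWidth)
	return pixelWidth / pointWidth
end conversionFactor

-- Hypothetical helper: scale a crop rectangle {x, y, width, height}
-- from points to whole pixels (rounding to nearest is an assumption).
on rectToPixels(pointRect, factor)
	set pixelRect to {}
	repeat with aValue in pointRect
		set end of pixelRect to (round ((contents of aValue) * factor))
	end repeat
	return pixelRect
end rectToPixels

-- Figures from the test PDF in this post: 2084 pixels across 1000 points.
set theFactor to conversionFactor(2084, 1000)
set pixelCrop to rectToPixels({1, 1, 500, 500}, theFactor)
```

With these figures the factor is about 2.084, close to the 150 / 72 estimate, so the 1-1-500-500 crop in points corresponds to roughly 2-2-1042-1042 in sips pixels.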

ionah · October 27, 2022, 3:18pm

I think there’s some confusion here. The bounds returned by this method are the page format in points (letter, A4, etc.) and not the image’s size in pixels. Thus, there’s no resolution.
With this method, you will first have to convert the embedded image, say to JPEG, to change its resolution, and then re-import it.

xpab · October 27, 2022, 4:20pm

Thanks for your perseverance and attention to detail.
My bad for having missed this difference in units.
There are still some things I don’t fully understand about this.
(Sorry if it’s a bit out of scope.)
Indeed, when using sips, it tells me that the PDF is 150 dpi.
The problem is that this PDF is supposed to be at 300 dpi (as selected in the scanner prefs).
It has a dimension of 2550 x 3510 pixels. Preview reports it is 8.5 x 11.7 inches (612 x 842 points). This gives a value of 300.
But it would seem that in this case we are talking about ppi and not dpi.

Another PDF that also comes out at 150 dpi is in fact 1654 x 2338 pixels / 8.3 x 11.7 inches / 595 x 841 points.

Both are handled correctly by sips because they are both at 150 dpi, but one is at 300 ppi and the other at 200 ppi.

There’s a little something that escapes me there.
This seems to imply that the scanner should have used the term ppi instead of dpi, as regardless of the settings all the scanned documents seem to be at 150 dpi with the same dimensions and a different size (in pixels).
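
The 300 and 200 figures above follow from dividing the pixel count by the page length in inches (points / 72). A minimal sketch of that arithmetic, assuming a single image that covers the whole page; the handler name is hypothetical:

```applescript
-- Hypothetical helper: effective resolution of an image that fills a page,
-- from its pixel count along one side and the page length in points (72 per inch).
on effectivePPI(pixelCount, pointCount)
	return pixelCount / (pointCount / 72)
end effectivePPI

set firstScan to effectivePPI(2550, 612) -- first PDF in this post
set secondScan to effectivePPI(1654, 595) -- second PDF in this post
```

The first call gives 300 and the second a little over 200, matching the two scanner settings, while the 150 dpi reported by sips stays the same for both files.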

peavine · October 27, 2022, 4:28pm

Thanks Jonas for the correction, which I agree with but with a comment.

Within the context of the OP’s script, my explanation made some sense, because the script uses the returned page bounds to set the pixel width and height for use by sips. This may be technically incorrect, but my goal was to explain why the conversion factor worked. Anyways, this all made sense if each page of the PDF contained one image AND if the bounds of the page and the underlying image are the same. I did some additional testing and this was the case when a PNG screenshot was converted to a PDF with ASObjC:

PNG screenshot: 1682 x 2020 pixels
The screenshot dpi: 144
The conversion factor: 144 / 72 = 2.0
The screenshot converted to PDF with ASObjC: 841 x 1010 points
Points to pixels conversion: 841 x 2 = 1682

peavine · October 27, 2022, 11:36pm

xpab, I ran a test with my Epson scanner and got a similar result. I scanned a document to PDF at 300 dpi and letter size. Preview shows the page size as 8.5 x 11 inches. Sips returned a pixel width and height of 1271 x 1650 but should have returned 2544 x 3300. Also, sips returned a dpi of 150 rather than 300. I don’t know why that is.

ionah · October 28, 2022, 9:07am

Resolution is a very hard notion to explain, especially for me and my bad English.
What I can say is that an image, once it’s saved on disk, does not have an intrinsic resolution.
It only has a pixel size (width and height).
The resolution tag attached to the image is extra info that tells host applications how it should be imported or displayed. But the app has to be able to use it.
With cameras and scanners, setting the resolution means choosing how detailed your image must be, that is, how many pixels it will contain. (Not sure I’m very clear…)
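
To illustrate the point about the resolution tag, here is a small sketch (the handler name is hypothetical): the same pixel count implies a different physical size depending on the tag an application reads, while the image data itself stays the same.

```applescript
-- Hypothetical helper: physical length in inches implied by a pixel count
-- and a resolution tag (pixels per inch).
on lengthInInches(pixelCount, ppiTag)
	return pixelCount / ppiTag
end lengthInInches

set at300 to lengthInInches(2550, 300) -- tagged at 300 ppi
set at150 to lengthInInches(2550, 150) -- same pixels tagged at 150 ppi
```

A 2550-pixel-wide image is 8.5 inches wide when tagged at 300 ppi and 17 inches wide when tagged at 150 ppi.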

Here is a script that will resize pages to the desired resolution.
Be aware that PDFDocument can’t get an image’s resolution. (And I don’t know another way to get it.)
So, don’t ask for an output at 300 ppi if your scan is at 150 ppi. It could end up as a blurred image.

use framework "Foundation"
use framework "AppKit"
use scripting additions

property mm2pts : 2.83464566929

## PREPARE SOURCE AND TARGET FILES
set pdfURL to current application's NSURL's fileURLWithPath:"some.pdf"
set theName to pdfURL's URLByDeletingPathExtension()'s lastPathComponent()
set destURL to pdfURL's URLByDeletingLastPathComponent()'s URLByAppendingPathComponent:("" & theName & " [crop].pdf")

## PREPARE THE CROP (values are in millimeters)
set leftCrop to 10 + 7.4
set rightCrop to 20 + 7.4
set topCrop to 30 + 7.4
set bottomCrop to 40 + 7.4 + 10

## DEFINE THE DESIRED RESOLUTION
set desiredRez to 72

## LOAD THE PDF FILE
set theDoc to current application's PDFDocument's alloc()'s initWithURL:pdfURL
set pageCount to (theDoc's pageCount())
set newDoc to current application's PDFDocument's alloc()'s init()

## CROP EACH PAGE 
repeat with iPage from 1 to pageCount
	## 	GET THE PAGE BOUNDS
	set thePage to (theDoc's pageAtIndex:(iPage - 1))
	set theData to (thePage's dataRepresentation())
	set pdfImageRep to (current application's NSPDFImageRep's imageRepWithData:theData)
	set {{aLeft, aTop}, {aWidth, aHeight}} to pdfImageRep's |bounds|()
	
	## 	PREPARE THE CROP
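	## (convert mm to points on the first page only; repeating it on every page would compound the scaling)
	if iPage = 1 then
		set leftCrop to (leftCrop * mm2pts)
		set rightCrop to (rightCrop * mm2pts)
		set topCrop to (topCrop * mm2pts)
		set bottomCrop to (bottomCrop * mm2pts)
	end if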
	set cropWidth to (aWidth - (leftCrop + rightCrop))
	set cropHeight to (aHeight - (topCrop + bottomCrop))
	set cropRect to {{0, 0}, {cropWidth, cropHeight}}
	set resizing to (desiredRez / 72)
	
	## CREATE AN EMPTY IMAGE THEN PLACE THE CROPPED IMAGE IN IT
	set newBitmap to (current application's NSBitmapImageRep's alloc()'s initWithBitmapDataPlanes:(missing value) pixelsWide:cropWidth * resizing pixelsHigh:cropHeight * resizing bitsPerSample:8 samplesPerPixel:4 hasAlpha:true isPlanar:false colorSpaceName:(current application's NSCalibratedRGBColorSpace) bytesPerRow:0 bitsPerPixel:0)
	current application's NSGraphicsContext's saveGraphicsState()
	set theContext to (current application's NSGraphicsContext's graphicsContextWithBitmapImageRep:newBitmap)
	(current application's NSGraphicsContext's setCurrentContext:theContext)
	(theContext's setShouldAntialias:true)
	(theContext's setImageInterpolation:(current application's NSImageInterpolationHigh))
	(pdfImageRep's drawInRect:{{-leftCrop * resizing, -bottomCrop * resizing}, {aWidth * resizing, aHeight * resizing}} fromRect:(current application's NSZeroRect) operation:(current application's NSCompositeSourceOver) fraction:(1.0) respectFlipped:true hints:(missing value))
	current application's NSGraphicsContext's restoreGraphicsState()
	
	## MAKE THE RESULT A JPEG TO REDUCE DOCUMENT SIZE AND CONVERT IT TO A PDF PAGE 
	set theData to (newBitmap's representationUsingType:(current application's NSJPEGFileType) |properties|:{NSImageCompressionFactor:0.8, NSImageProgressive:false})
	set theImage to (current application's NSImage's alloc()'s initWithData:theData)
	set theImageView to (current application's NSImageView's alloc()'s initWithFrame:cropRect)
	(theImageView's setImage:theImage)
	set theData to (theImageView's dataWithPDFInsideRect:cropRect)
	set nextPage to ((current application's PDFDocument's alloc()'s initWithData:theData)'s pageAtIndex:0)
	
	## 	INSERT THE PDF PAGE IN THE DOC
	(newDoc's insertPage:nextPage atIndex:(iPage - 1))
	
end repeat

## WRITE THE PDF TO THE DISK
(newDoc's writeToURL:destURL)
current application's NSWorkspace's sharedWorkspace()'s openURL:destURL
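
The arithmetic the script above uses to size the new bitmap can be checked on its own. The following is a sketch only: the handler name is hypothetical, and rounding to whole pixels is an addition that the script itself does not make.

```applescript
use scripting additions

property mm2pts : 72 / 25.4 -- points per millimetre

-- Hypothetical helper mirroring the script's arithmetic: page size in points,
-- crop margins in millimetres, target resolution in pixels per inch.
on croppedPixelSize(pageWidth, pageHeight, leftMM, rightMM, topMM, bottomMM, targetRez)
	set cropWidth to pageWidth - ((leftMM + rightMM) * mm2pts)
	set cropHeight to pageHeight - ((topMM + bottomMM) * mm2pts)
	set resizing to targetRez / 72
	set pixelsWide to round (cropWidth * resizing)
	set pixelsHigh to round (cropHeight * resizing)
	return {pixelsWide, pixelsHigh}
end croppedPixelSize

-- A letter page (612 x 792 points) with one inch (25.4 mm) removed on each side, at 300 ppi.
set pixelSize to croppedPixelSize(612, 792, 25.4, 25.4, 25.4, 25.4, 300)
```

For this example the result is a 1950 x 2700 pixel bitmap, i.e. 6.5 x 9 inches at 300 ppi.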

xpab · October 28, 2022, 4:28pm

Thank you very much for your remarks and script.
This seems to be a very elegant and effective solution that doesn’t rely on any additional tools (sips) and exposes very useful image manipulations in AppleScriptObjC.
I will play with it this weekend and try to integrate your method in my scripts (which were relying on sips and were forced to produce temp files).
Regards

xpab · October 28, 2022, 7:19pm

I’ve modified your script in the following way:
Instead of entering the crop rect in millimeters, I read the crop rect from the PDF.
You therefore need to do the crops beforehand in Preview (or whatever).
I had to modify a few bits here and there to achieve that, notably at marker A
(with -leftCrop as the x coordinate, the left part of the image was truncated).
It gives me the expected results I had with my previous script using sips.
There is still some work to do to handle rotated pages.
With the 0.8 compression factor, the size of the result was basically the same, so I lowered it a bit. (In my initial test PDF the cropped area was quite small.)
The only weak point is that the target dpi is hardcoded.
As I sometimes do scans at 200 dpi, it could be an issue if I forget to make some changes.
If I understand correctly, it is because there is no simple way of getting the initial dpi tag of the source PDF.
Perhaps just looking at the size (in pixels) could tell me whether it was scanned at 300 dpi or 200 (or another value).

You apparently need to add use framework "Quartz", otherwise the script won’t run from the Script menu.

Do you see anything not done properly?
Is there a difference between doing
set targetDoc to current application's PDFDocument's new() (as I was using)
and
set newDoc to current application's PDFDocument's alloc()'s init() (as you do in this script)

use framework "Foundation"
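use framework "AppKit"
use framework "Quartz" -- PDFKit (PDFDocument) lives in Quartz; needed when run from the Script menu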
use scripting additions

-- Get Initial doc (MODIFIED)
tell application "Finder" to set sourceFiles to selection as alias list
set LeFi to item 1 of sourceFiles
set sourceFile to POSIX path of LeFi


## PREPARE SOURCE AND TARGET FILES
set pdfURL to current application's NSURL's fileURLWithPath:sourceFile
set theName to pdfURL's URLByDeletingPathExtension()'s lastPathComponent()
set destURL to pdfURL's URLByDeletingLastPathComponent()'s URLByAppendingPathComponent:("" & theName & " [crop].pdf")

## DEFINE THE DESIRED RESOLUTION
set desiredRez to 300

## LOAD THE PDF FILE
set theDoc to current application's PDFDocument's alloc()'s initWithURL:pdfURL
set pageCount to (theDoc's pageCount())
set newDoc to current application's PDFDocument's alloc()'s init()

## CROP EACH PAGE 
repeat with iPage from 1 to pageCount
	set thePage to (theDoc's pageAtIndex:(iPage - 1))
	
	-- Get Bounding box for this page and set crop vars (MODIFIED)
	set {{leftCrop, bottomCrop}, {cropWidth, cropHeight}} to (thePage's boundsForBox:(current application's kPDFDisplayBoxCropBox))
	
	set cropRect to {{0, 0}, {cropWidth, cropHeight}}
	set resizing to (desiredRez / 72)
	
	set theData to (thePage's dataRepresentation())
	set pdfImageRep to (current application's NSPDFImageRep's imageRepWithData:theData)
	-- Get Page Bounds - Not used (MODIFIED)
	-- set {{aLeft, aTop}, {aWidth, aHeight}} to pdfImageRep's |bounds|()
	
	## CREATE AN EMPTY IMAGE THEN PLACE THE CROPPED IMAGE IN IT
	set newBitmap to (current application's NSBitmapImageRep's alloc()'s initWithBitmapDataPlanes:(missing value) pixelsWide:cropWidth * resizing pixelsHigh:cropHeight * resizing bitsPerSample:8 samplesPerPixel:4 hasAlpha:true isPlanar:false colorSpaceName:(current application's NSCalibratedRGBColorSpace) bytesPerRow:0 bitsPerPixel:0)
	current application's NSGraphicsContext's saveGraphicsState()
	set theContext to (current application's NSGraphicsContext's graphicsContextWithBitmapImageRep:newBitmap)
	(current application's NSGraphicsContext's setCurrentContext:theContext)
	(theContext's setShouldAntialias:true)
	(theContext's setImageInterpolation:(current application's NSImageInterpolationHigh))
	-- Marker A
	-- Changed rect assignment (MODIFIED)
	(pdfImageRep's drawInRect:{{0, 0}, {cropWidth * resizing, cropHeight * resizing}} fromRect:(current application's NSZeroRect) operation:(current application's NSCompositeSourceOver) fraction:(1.0) respectFlipped:true hints:(missing value))
	current application's NSGraphicsContext's restoreGraphicsState()
	
	## MAKE THE RESULT A JPEG TO REDUCE DOCUMENT SIZE AND CONVERT IT TO A PDF PAGE 
	set theData to (newBitmap's representationUsingType:(current application's NSJPEGFileType) |properties|:{NSImageCompressionFactor:0.25, NSImageProgressive:false})
	set theImage to (current application's NSImage's alloc()'s initWithData:theData)
	set theImageView to (current application's NSImageView's alloc()'s initWithFrame:cropRect)
	(theImageView's setImage:theImage)
	set theData to (theImageView's dataWithPDFInsideRect:cropRect)
	set nextPage to ((current application's PDFDocument's alloc()'s initWithData:theData)'s pageAtIndex:0)
	
	## 	INSERT THE PDF PAGE IN THE DOC
	(newDoc's insertPage:nextPage atIndex:(iPage - 1))
	
end repeat

## WRITE THE PDF TO THE DISK
(newDoc's writeToURL:destURL)
current application's NSWorkspace's sharedWorkspace()'s openURL:destURL

ionah · October 29, 2022, 8:21am

Instead of entering the crop rect in millimeters I read the crop rect from the pdf.
You therefore need to do the crops beforehand in Preview (or whatever) .

I don’t get it. Why should you crop a page that’s already cropped?

The only weak point is that the target dpi is hardcoded.

Just display a dialog to enter the value.
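
A minimal sketch of such a dialog, with hypothetical handler names and 300 assumed as the default value; anything that is not a positive whole number falls back to the default:

```applescript
use scripting additions

-- Hypothetical helper: turn the typed text into a positive integer,
-- falling back to a default when the text is not a usable number.
on parseResolution(typedText, fallbackRez)
	try
		set theValue to typedText as integer
		if theValue > 0 then return theValue
	end try
	return fallbackRez
end parseResolution

-- Hypothetical helper: ask for the resolution, offering 300 as the default answer.
on askResolution()
	set theReply to display dialog "Output resolution (ppi):" default answer "300"
	return parseResolution(text returned of theReply, 300)
end askResolution
```

In the script, the hardcoded line could then become: set desiredRez to askResolution()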

If I understand correctly it is because there is no simple way of getting the initial dpi tag of the source pdf.
Perhaps just looking at the size (in pixels) could give me the answer whether it was scanned at 300 dpi or 200 (or another value)

You can’t. PDFDocument only returns the page size in points (72 per inch).

You apparently need to add use framework "Quartz", otherwise the script won’t run from the Script menu.

Right.

Is there a difference between doing
set targetDoc to current application's PDFDocument's new() (as I was using)
and
set newDoc to current application's PDFDocument's alloc()'s init() (as you do in this script)

No.

xpab · October 29, 2022, 8:54am

To actually remove the cropped parts of the image from the file and reclaim some space.
As you know, Preview only changes the crop box rectangle.

Am I correct in assuming that if the source image inside a page is an uncompressed TIFF (I don’t know whether compressed TIFFs are lossless or not), there will be no quality loss when doing the crop, as there is no recompression? (After modifying the script accordingly, of course.)

ionah · October 29, 2022, 9:36am

The script exports every page as a JPEG.
TIFF can be compressed, but in most cases it’s not. Its aim is to preserve quality.
JPEG is always lossy, more or less so depending on the compression settings. It’s made to reduce file size.