Some PDF manipulation question

xpab · October 29, 2022, 10:26pm

Sorry Peavine, I had failed to notice that you posted a similar script in your MacScripter thread.
I’m still struggling with one small detail:
I wanted to get the pixel size of each page to determine the dpi set by the scanner.
For scans made at 200 dpi or 300 dpi, the image size in points is the same.
(I first hoped I could get it by extracting the EXIF data from the image with NSImageEXIFData, but this doesn’t seem to work in Mojave: the EXIF data returned does not contain the dpi value that is in the EXIF record.)
So I was planning to use the pixel size of the images to infer it.
But I can’t manage to get the pixel values from a page variable.
Let’s say that ThePage is the page object.
How can I get the exact pixel dimensions from it?
Getting the kPDFDisplayBoxMediaBox, the bounds of the image representations, or the size all return values in points.
Sorry, it must be very trivial, but I’m stuck on this.

Anyway, a big thank you to you and ionah. All your answers were very helpful in my slow process of trying to understand and manage the power of AppleScriptObjC (and learn other things too).

leo_r · October 30, 2022, 12:28am

If I understand correctly, your confusion derives from the following:

PDF files do not have any pixel dimensions. Because they are not raster images.

You can only acquire pixel dimensions after rasterizing the PDF file, that is, turning it into an image.

The image you are dealing with in this particular case is not the PDF file itself but an image inside the PDF file.

Now, images inside a PDF file do have pixel dimensions, as well as physical dimensions (in points), which together allow you to calculate their resolution.

However, retrieving this data is an extremely complex process. You’ll have to scan the PDF file structure using all kinds of unfriendly CGPDF… tools. Before that, you’ll also need to understand this structure and its very specific elements, with their unique names and specific positions in the PDF contents tree. And then there are many different ways in which various applications create this PDF structure, which adds yet more layers of complexity.

Alternatively, there are third-party frameworks that let you extract this data with a couple of lines of code. They cost from a few hundred dollars per license to a few thousand a year, exactly because of the process mentioned above.

It’s also quite possible that there are free open-source frameworks or tools that can do this with various levels of complexity (they come and go, and I don’t know the current situation).

If, once again, I understand your goal correctly, you need to first deal with the raster image itself, NOT the PDF file. Only turn this image into a PDF at the last stage, when you already have an image of the desired size and resolution.

peavine · October 30, 2022, 3:40pm

xpab. As mentioned by Leo, PDF pages do not have a pixel size, and I assume you are referring to an image within a PDF. If that is correct, I don’t know a reliable method of doing this. I tried various approaches but none of them returned the desired information.

use framework "AppKit"
use framework "Foundation"
use scripting additions

set theFile to (choose file)
set thePosixFile to POSIX path of theFile
set theHFSFile to theFile as text

-- uses NSImage and returns points (works with JPEG and PDF)
set theImage to current application's NSImage's alloc()'s initWithContentsOfFile:thePosixFile
set theSize to (theImage's |size|())
log theSize's width
log theSize's height

-- uses NSBitmapImageRep and returns pixels (works with JPEG but not PDF)
set imageRep to current application's NSBitmapImageRep's imageRepWithContentsOfFile:thePosixFile
log imageRep's pixelsWide()
log imageRep's pixelsHigh()

-- uses sips and returns pixels but not accurately with a PDF (works with JPEG and PDF)
set text item delimiters to return
set theBounds to (do shell script "sips -g pixelWidth -g pixelHeight " & quoted form of thePosixFile)
log text item 2 of theBounds
log text item 3 of theBounds
set text item delimiters to ""

-- uses Image Events and works the same as sips
tell application "Image Events"
	launch
	set theImage to open file theHFSFile
	set theProperties to properties of theImage
	log dimensions of theProperties
end tell

leo_r · October 30, 2022, 4:31pm

If your goal is to get the image resolution, then you can do so via this formula:

wPx/(wPt/72)

where

wPx is imageRep's pixelsWide()
wPt is theSize's width

You can also get size() directly from NSBitmapImageRep
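
As a worked example of the formula above, here is a minimal sketch in plain AppleScript. The handler name dpiFromPixelsAndPoints and the sample values (a 612-point-wide US Letter page scanned 2550 or 1700 pixels wide) are illustrative assumptions, not taken from the thread.

```applescript
-- Hypothetical helper: resolution in dpi from a pixel count and a size in points.
-- There are 72 points per inch, so (pointSize / 72) is the size in inches.
on dpiFromPixelsAndPoints(pixelCount, pointSize)
	return pixelCount / (pointSize / 72)
end dpiFromPixelsAndPoints

-- Assumed sample values: a US Letter page is 612 points (8.5 inches) wide
log dpiFromPixelsAndPoints(2550, 612) -- a scan made at 300 dpi
log dpiFromPixelsAndPoints(1700, 612) -- a scan made at 200 dpi
```

With real data, pixelCount would be imageRep's pixelsWide() and pointSize would be theSize's width, as defined above.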

xpab · October 30, 2022, 8:16pm

Correct.

I still don’t know how to get pixelsWide() (in pixels) starting from a variable containing a PDF page.
If I do this (taken and modified from a peavine example):

set theData to (thePage's dataRepresentation())
set pdfImageRep to (current application's NSPDFImageRep's imageRepWithData:theData)		
set theImage to (current application's NSImage's alloc()'s initWithData:theData)
set theData to theImage's TIFFRepresentation()
set theImage to (current application's NSImage's alloc()'s initWithData:theData)
set theSize to (theImage's |size|()) as record
set theHeight to height of theSize as integer
set {x, y, w, h} to {0, 0, cropWidth, cropHeight}
set cropRectangle to {{x, (theHeight - y - h)}, {w, h}}
theImage's lockFocus()
set theRep to (current application's NSBitmapImageRep's alloc()'s initWithFocusedViewRect:cropRectangle)
theImage's unlockFocus()
			
return theRep's pixelsWide()

I get a result in points that I could have got much more easily. Still no pixels.
You said it was non-trivial to retrieve the pixel dimensions of an image inside a PDF page, but it seems we need the pixels to calculate the resolution.

In the end, I think I will revert to an ugly but effective trick.
I run the exiftool command on the PDF and directly retrieve the pixel dimensions from there, graciously provided by the scanner. This of course assumes that all pages are at the same resolution. (And this is the case for the scanned PDFs I target.)
With this and the dimensions in points I can calculate the dpi and feed it to the initial cropping script.
(It is working fine and now properly handles rotated pages, but strangely I get a resulting PDF that is a few pixels smaller than the original (that is, for an image inside a one-page PDF). It must be some rounding effect I have yet to clean up.)
Another minor caveat is that the resulting recompressed PDF I get from the inoah method is reported to be at 72 dpi even though, given its size and pixel dimensions, it is actually 300 dpi. But as it looks like the original (ignoring the quality reduction due to the JPEG recompression), I can live with this.
Indeed, all this is quite complicated.
(To check the details of those PDFs I use the excellent Xee3 image viewer, which provides quite detailed (and seemingly accurate) info about images and PDF files.)
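
The calculation described in this post (pixel dimensions obtained separately, combined with the page size in points) could be sketched roughly as follows. This is only an illustration under stated assumptions: the pixelWidth value is a placeholder to be replaced with the number read from exiftool, the scanned image is assumed to fill the whole page, and page rotation is ignored.

```applescript
use framework "Foundation"
use framework "Quartz" -- PDFKit is part of the Quartz framework
use scripting additions

-- Assumption: pixel width of the scanned image, obtained separately (for example from exiftool output)
set pixelWidth to 2550

set thePath to POSIX path of (choose file of type {"com.adobe.pdf"})
set theURL to current application's NSURL's fileURLWithPath:thePath
set theDoc to current application's PDFDocument's alloc()'s initWithURL:theURL
set thePage to theDoc's pageAtIndex:0 -- first page; page indexes are zero-based

-- Media box in points; on macOS 10.13 and later an NSRect arrives as {{x, y}, {width, height}}
set {{x, y}, {pointWidth, pointHeight}} to (thePage's boundsForBox:(current application's kPDFDisplayBoxMediaBox))

-- 72 points per inch, so dpi = pixels / (points / 72)
set theDpi to pixelWidth / (pointWidth / 72)
log (round theDpi)
```

Rounding the result to the nearest integer hides small discrepancies between the media box and the true image size.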

leo_r · October 30, 2022, 9:59pm

There’s one thing I don’t understand:

I assume you don’t have an option to just scan the originals as images (not as PDF), do whatever you need with the images, and only then turn them into PDF if needed?

xpab · October 30, 2022, 10:56pm

Apart from the fact that I already have a lot of PDF files that need some treatment, yes, I could scan as JPEG files, but then it would complicate some other parts of the process. (When scanning, you can group the pages of an article into a single PDF on the fly instead of having to inspect each image to group them afterwards. I already have other scripts to merge images (or PDF files) into PDFs.)
And it is also a way to understand and learn new things…

leo_r · October 30, 2022, 11:16pm

I see.

In that case you may find the Xpdf command-line tools useful. I thought they had long been abandoned, but it now looks like they are kept up to date. Specifically, you’ll need pdfimages, which should be able to extract images from a PDF at their original resolution:

http://www.xpdfreader.com/download.html

xpab · October 30, 2022, 11:48pm

Useful stuff. Thank you.
