Leaking memory somewhere

Hi all,

I have a folder which contains 500 PDF files. When I add one (or more) PDFs to this folder, a script is triggered to check whether the same PDF (or a very similar one) is already among the 500 present in the folder.
The way it finds duplicates or quasi-duplicates is by:

  • extracting all words (only those longer than 4 chars)
  • counting the occurrences of each word
  • comparing this list with those obtained from each of the 500 PDFs, one by one in a repeat loop.

Now when I start, SB memory usage is about 200 MB. When it has finished (after 5 new PDFs were dropped, which means 5 loops, each with a nested loop of 500), the memory usage is almost 40 GB.
Is this normal? If not, how can I narrow down the memory leak?
Thanks,

Luciano

That’s normal. AppleScript memory management is poor at the best of times, and worse with AppleScriptObjC. The good news is that the value in the memory column of Activity Monitor is almost meaningless.

I see, so there is nothing we can do.
I was hoping to reduce this memory usage/leakage, since when it becomes too large the Finder quits SB…
Thanks Shane.

There may be a way to streamline your process.

I’ve done something similar where I get the file sizes first, then compare only the files that have exactly the same size.

After that, the AppleScript read command reads 1000k chunks of text from each of the two files being compared and compares them one pair at a time until I reach the end of the files. As soon as it finds a pair of 1000k chunks that are not equal, it exits the repeat. If it doesn’t find any differences, it deletes the older file.

By doing it this way you don’t need to open the PDF, you just read its compressed data. We used a version of this weekly for a few years and (as far as we know) never missed a duplicate.

It would be more efficient if you built the lists once, and saved that info in a separate folder. You could then search the lists directly, rather than reading potentially very large PDF files each time.
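A minimal sketch of that caching idea, assuming the filtered word list is simply written out as a text file with one word per line. The handler names and the cache path are hypothetical, not part of anyone's script in this thread:

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Save a word list (AppleScript list or NSArray of strings), one word per line.
-- Returns true if the file was written.
on saveWords:theWords toPath:cachePath
	set wordArray to current application's NSArray's arrayWithArray:theWords
	set joinedText to wordArray's componentsJoinedByString:linefeed
	return (joinedText's writeToFile:cachePath atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)) as boolean
end saveWords:toPath:

-- Rebuild a counted set from a cached word list.
-- Returns missing value if the cache file cannot be read.
on countedSetFromCacheAt:cachePath
	set cachedText to current application's NSString's stringWithContentsOfFile:cachePath encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
	if cachedText is missing value then return missing value
	set theWords to cachedText's componentsSeparatedByString:linefeed
	return current application's NSCountedSet's setWithArray:theWords
end countedSetFromCacheAt:
```

With something like this, each PDF would only have to be parsed once; later comparisons would load the small text file instead of the PDF itself.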

I suspect the issue for @ldicroce is where he says “(or a very similar one)”. But in your case there’s an NSFileManager method that does something like your process: contentsEqualAtPath:andPath:.
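For exact duplicates only, a call to that method could look roughly like this. The handler name is made up for illustration, and it expects POSIX paths:

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Returns true when the two files have identical contents.
on filesAreIdentical(posixPath1, posixPath2)
	set fileManager to current application's NSFileManager's defaultManager()
	return (fileManager's contentsEqualAtPath:posixPath1 andPath:posixPath2) as boolean
end filesAreIdentical
```

This only catches byte-for-byte copies, so it would not help with the "very similar" case.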

Thanks a lot for the feedback.
Most of the effort goes into finding PDFs (scientific publications) which are similar (e.g. an earlier version of the same file).
The script I am using already filters by file size, allowing a size difference of +/- 20%. This reduces the number of PDFs (among the 500) to test.
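For readers following along, a size pre-filter of that kind can be sketched as below. This is only an illustration with a hypothetical handler name, not the code from my script; the tolerance is given as a fraction (0.2 for 20%):

```applescript
-- Returns true when candidateSize lies within referenceSize +/- tolerance.
-- tolerance is a fraction, e.g. 0.2 for +/- 20%.
on sizeIsWithinTolerance(candidateSize, referenceSize, tolerance)
	set lowerBound to referenceSize * (1 - tolerance)
	set upperBound to referenceSize * (1 + tolerance)
	return (candidateSize >= lowerBound) and (candidateSize <= upperBound)
end sizeIsWithinTolerance
```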
And the 500 PDFs are not always the same. They are the files I still have to read, and the set changes weekly…
Extracting the words from each PDF is quite fast (thanks to Shane's subroutine, which I am including below): it takes only 1-2 seconds per PDF.

set {countedSet2a, countedSet2b, TotalWordsFile2} to (my countedSetFromPDFAt:thePath2 minWordLength:4) -- change word length to suit
-- subtract words in second doc from first
(countedSet1a's minusSet:countedSet2b)
-- subtract words in first doc from second
(countedSet2a's minusSet:countedSet1b)

on countedSetFromPDFAt:thePath minWordLength:minLength
	-- get text of PDF
	set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
	set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
	set theText to theDoc's |string|()
	-- break into words
	set theText to theText's stringByReplacingOccurrencesOfString:"\\W+" withString:space options:(current application's NSRegularExpressionSearch) range:{0, theText's |length|()}
	set theWords to theText's componentsSeparatedByString:space
	-- remove words too small
	set thePred to current application's NSPredicate's predicateWithFormat:"length >= %@" argumentArray:{minLength}
	set theWords to theWords's filteredArrayUsingPredicate:thePred
	set p_TotalWords to the number of items of theWords -- total number of words kept for this PDF (after filtering by length)
	-- make two identical counted sets; these store words plus the number of times they appear
	set countedSet1 to current application's NSCountedSet's setWithArray:theWords
	set countedSet2 to current application's NSCountedSet's setWithArray:theWords
	--set countedSet3 to current application's NSCountedSet's setWithArray:theWords
	return {countedSet1, countedSet2, p_TotalWords}
end countedSetFromPDFAt:minWordLength:

Comparing the extracted words is the slow step! Script below:

-- loop through what's left in the doc 1 list, which will be words that appear more times in doc 1 than doc 2
set CountOfResidualWordsFile1 to 0 as integer -- we also count the total words in thePath1-thePath2
set theObjects to countedSet1a's allObjects()
repeat with anObject in theObjects
	set CountOfResidualWordsFile1 to (CountOfResidualWordsFile1 + (countedSet1a's countForObject:anObject)) as integer
	set end of theList to (anObject as text) & ": " & (countedSet1a's countForObject:anObject)
end repeat

-- loop through what's left in the doc 2 list, which will be words that appear more times in doc 2 than doc 1
set CountOfResidualWordsFile2 to 0 as integer -- we also count the total words in thePath2-thePath1
set theObjects to countedSet2a's allObjects()
-- we sum it to the previous built list (theList)
repeat with anObject in theObjects
	set CountOfResidualWordsFile2 to (CountOfResidualWordsFile2 + (countedSet2a's countForObject:anObject)) as integer
	set end of theList to (anObject as text) & ": -" & (countedSet2a's countForObject:anObject)
end repeat

I was considering limiting the analysis to the initial pages.
Since I don’t know how to limit the word extraction to 2-3 pages, I was planning to analyse only the first 300 words extracted…
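Roughly what I have in mind for the 300-word variant, as an untested sketch: trim the word array before building the counted sets. The handler name is hypothetical, and the input is assumed to be the theWords array from the subroutine above (a plain AppleScript list also works):

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Returns an NSArray holding at most maxWords items from the start of theWords.
on firstWords:maxWords ofArray:theWords
	set wordArray to current application's NSArray's arrayWithArray:theWords
	set wordCount to (wordArray's |count|()) as integer
	if wordCount > maxWords then set wordCount to maxWords
	return wordArray's subarrayWithRange:{0, wordCount}
end firstWords:ofArray:
```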

By the way Ed, can you send me the part of your script where it “reads 1000k chunks of text from each of the files”?

Thanks again !
L.

Something like:

set maxPages to 3 -- change to suit
set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
set theText to current application's NSMutableString's |string|()
set numOfPages to theDoc's pageCount()
if numOfPages > maxPages then set numOfPages to maxPages
repeat with i from 0 to (numOfPages - 1)
	(theText's appendString:((theDoc's pageAtIndex:i)'s |string|()))
end repeat

Thanks Shane. I will test it today.
Ciao
L.