What's the best way to find duplicates in an NSArray?

I can’t remember where in Apple’s reference documentation I saw it, but this is the most direct method I know of to achieve this:

use framework "Foundation"
use scripting additions

set arrayA to current application's NSArray's arrayWithArray:{1, 2, 3, 4, 3, 5, 3}
set arrayB to (current application's NSOrderedSet's orderedSetWithArray:arrayA)'s allObjects()
set theDiffs to (arrayB's differenceFromArray:arrayA)'s removals()
if theDiffs as list = {} then return {}
return (theDiffs's valuesForKeys:{"object", "index"}) as record

Shane taught us the following:

-- Created 2017-11-07 by Takaaki Naganoya
-- 2017 Piyomaru Software
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

property NSCountedSet : a reference to current application's NSCountedSet

set aList to {1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 8, 10, -2}
set aRes to returnDuplicatesOnly(aList) of me
-->	{1, 2, 3, 4}

on returnDuplicatesOnly(aList as list)
	set aSet to NSCountedSet's alloc()'s initWithArray:aList
	set bList to (aSet's allObjects()) as list
	
	set dupList to {}
	repeat with i in bList
		set aRes to (aSet's countForObject:i)
		if aRes > 1 then
			set the end of dupList to (contents of i)
		end if
	end repeat
	
	return dupList
end returnDuplicatesOnly

Another NSCountedSet suggestion:

use framework "Foundation"
use scripting additions

set arrayOne to current application's NSArray's arrayWithArray:{1, 2, 3, 4, 6, 3, 5, 3, 4, 6}
set setOne to current application's NSCountedSet's alloc()'s initWithArray:arrayOne
set arrayTwo to (arrayOne's valueForKeyPath:"@distinctUnionOfObjects.self")
set setTwo to current application's NSCountedSet's alloc()'s initWithArray:(arrayTwo)
setOne's minusSet:setTwo
return setOne's allObjects() as list -->{3, 6, 4}

I ran a timing test with a list that contained 792 items, and my script took 1 millisecond to run. Jonas’ script took 8 milliseconds to run, but that’s to be expected as it returns significantly more information.

Script Geek says peavine’s script is about 20% faster than mine.

Mac14,15,  macOS Version 15.3 (Build 24D60),  1000 iterations
         First Run   Total Time    Average     Median    Maximum    Minimum   Std.Dev.
First       0.0016       0.3052     0.0003     0.0003     0.0008     0.0002     0.0000
Second      0.0012       0.2571     0.0003     0.0003     0.0006     0.0002     0.0000
Ratio (excluding first run): 1.19:1   Ratio of medians: 1.18:1

I’ll use this.


Like ionah’s script, this returns both the duplicates and their indices. It’s not quite as fast as ionah’s, but it should work on systems older than macOS 10.15 and it returns 1-based indices. It can easily be adapted for 0-based indices if required.

use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

on findDuplicates(aList)
	script o
		property indexList : aList's items
	end script
	repeat with i from 1 to (count o's indexList)
		set o's indexList's item i to i
	end repeat
	set indexSet to current application's NSMutableIndexSet's indexSetWithIndexesInRange:({0, count o's indexList})
	set anArray to current application's NSArray's arrayWithArray:(aList)
	set firstInstanceIndices to (current application's NSDictionary's dictionaryWithObjects:(o's indexList) forKeys:(anArray))'s allValues()
	repeat with i in firstInstanceIndices
		set i to i as integer
		set o's indexList's item i to missing value
		(indexSet's removeIndex:(i - 1))
	end repeat
	if (indexSet's |count|() = 0) then return {}
	
	return {|index|:o's indexList's integers, object:(anArray's objectsAtIndexes:(indexSet)) as list}
end findDuplicates

set aList to {1, 2, 3, 4, 3, 5, 3}
findDuplicates(aList)

Thanks guys, but I think I’ll stick with my first try.
The advantage here is that it returns the position of the duplicates.

And if someone wants the duplicates only, here is a variation:

use framework "Foundation"
use scripting additions

set theList to {1, 2, 3, 4, 7, 2, 3, 4, 5, 6, 7, 3, 4, 1}

set arrayA to current application's NSArray's arrayWithArray:theList
set arrayB to (current application's NSOrderedSet's orderedSetWithArray:arrayA)'s allObjects()
set theDiff to (arrayB's differenceFromArray:arrayA)
if not (theDiff's hasChanges()) then return {}

-- get duplicates only 
set resultList to ((theDiff's valueForKeyPath:"removals.object")'s allObjects()) as list

-- get duplicates & positions 
set resultRecord to (theDiff's removals()'s valuesForKeys:{"object", "index"}) as record

@NigelGarvey : you posted while I was responding…
@ShaneStanley & @NigelGarvey : I think your like means I’m on the right track…


Update: modified the “get duplicates only” line where there was a bad copy-paste.

The vanilla AppleScript version is twice as fast as peavine’s.

http://piyocast.com/as/archives/17236

My earlier script returns a list of duplicate items in an array. As part of its work, the script calculates the number of duplicates (not including the original), and I’ve modified my script to return that additional information as a record.

use framework "Foundation"
use scripting additions

--get a list of duplicates
set arrayOne to current application's NSArray's arrayWithArray:{"aa", "bb", "cc", "aa", "bb", "aa", "dd", "ee", "aa"}
set setOne to current application's NSCountedSet's alloc()'s initWithArray:arrayOne
set arrayTwo to (arrayOne's valueForKeyPath:"@distinctUnionOfObjects.self")
set setTwo to current application's NSCountedSet's alloc()'s initWithArray:(arrayTwo)
setOne's minusSet:setTwo
set theDuplicates to setOne's allObjects()
--return theDuplicates as list --enable if a list of duplicates is all that's needed.

--make a record with the duplicate items as keys and the duplicate counts as values
set theDictionary to (current application's NSMutableDictionary's new())
repeat with aValue in theDuplicates
	set duplicatesCount to (setOne's countForObject:aValue) as integer
	(theDictionary's setValue:duplicatesCount forKey:aValue)
end repeat
set theRecord to theDictionary as record -->{aa:3, bb:1}

Here’s a slightly optimised version of my script above which also allows the index base to be specified as a parameter. This version too should work on any macOS system since 10.10. More thorough testing this morning shows that @ionah’s original generally has the advantage for speed, except where the proportion of duplicates in a large list or array is particularly high:

use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

on findDuplicates(aList, indexBase)
	script o
		property indexList : aList's items
	end script
	set indexBaseOffset to 1 - indexBase
	repeat with i from 1 to (count o's indexList)
		set o's indexList's item i to i - indexBaseOffset
	end repeat
	set indexSet to current application's NSMutableIndexSet's indexSetWithIndexesInRange:({indexBase, count o's indexList})
	set anArray to current application's NSArray's arrayWithArray:(aList)
	set firstInstanceIndices to (current application's NSDictionary's dictionaryWithObjects:(o's indexList) forKeys:(anArray))'s allValues()
	repeat with i in firstInstanceIndices
		(indexSet's removeIndex:(i))
	end repeat
	if (indexSet's |count|() = 0) then return {}
	
	indexSet's shiftIndexesStartingAtIndex:(indexBase) |by|:(-indexBase)
	return {|index|:((current application's NSArray's arrayWithArray:(o's indexList))'s objectsAtIndexes:(indexSet)) as list, object:(anArray's objectsAtIndexes:(indexSet)) as list}
end findDuplicates

set aList to {1, 2, 3, 4, 3, 5, 3}
set indexBase to 1
findDuplicates(aList, indexBase)

In your benchmark script, it appears that returnDuplicatesOnly3: reaches an error once it gets to set aCount to length of (aList of spd) because aList is an NSArray at that point. So it never reaches the repeat loop. Placing a log near the end of each handler confirms this.

If I add as list, e.g. adding set aList of spd to (aList of spd) as list between lines 89 and 90, then returnDuplicatesOnly3 takes over 200 seconds (219.944313049316) without any other changes to the script.

In my benchmark script, (aList of spd) is a list.

Your point needs to be supplemented with a statement that says, “If we put the benchmark I gave you into Jonah’s program, then what happens?”
That one sentence is necessary.

returnDuplicatesOnly3 takes over 200 seconds (219.944313049316) without any other changes to the script.

The NSArray version produces these results with a 100,000-item array:

-- M1 Mac mini (macOS 15.4)
--> {{"returnDuplicatesOnly1:", "0.0003349781036376952832"}, {"returnDuplicatesOnly2:", "0.0001229047775268554752"}, {"returnDuplicatesOnly3:", "0.00007200241088867188736"}}

-- M2 MacBook Air (macOS 15.4)
--> {{"returnDuplicatesOnly1:", "0.0002510547637939452928"}, {"returnDuplicatesOnly2:", "0.00011098384857177735168"}, {"returnDuplicatesOnly3:", "0.00005698204040527343616"}}

A very slow machine, or one with little memory, may be slower than these results.
200 seconds is amazing!

Since you use performSelector:withObject:, the parameter to returnDuplicatesOnly3: gets converted to an NSArray (at least for me), so when you do copy aList to (aList of spd), it stores an NSArray instead of a list.

If you run the following script, do Message1 and Message2 both get logged twice for you? And does it return both lists rather than one missing?

use framework "Foundation"
use scripting additions

on returnDuplicatesOnly3:(aList)
	script spd
		property aList : {}
		property dList : {}
	end script
	
	log "Message1: This logs twice"
	log class of aList -- First logs (*list*), then (*(Class) __NSArrayM*)

	copy aList to (aList of spd)
	set aCount to length of (aList of spd)
	
	log "Message2: This logs once"
	
	set (dList of spd) to {}
	
	repeat aCount times
		set anItem to contents of (first item of (aList of spd))
		set (aList of spd) to rest of (aList of spd)
		
		if {anItem} is in (aList of spd) then
			if {anItem} is not in (dList of spd) then -- added this check (v3)
				set the end of (dList of spd) to anItem
			end if
		end if
		
	end repeat
	
	return (dList of spd)
end returnDuplicatesOnly3:

set arrayA to {1, 2, 3, 4, 3, 5, 3}
set res1 to my returnDuplicatesOnly3:arrayA -- Works successfully, returns {3}
set res2 to my performSelector:"returnDuplicatesOnly3:" withObject:arrayA -- No error, but missing value

return {res1, res2} -- res1 is a list of 1 item, res2 is missing value

I’m not sure what you meant by this, but I think the above script is an adequate test. If you see different behavior, it will be interesting to explore why.

I am using an M1 Pro MacBook Pro (15.4). The NSArray version works quickly and accurately for the first two handlers (returnDuplicatesOnly1: and returnDuplicatesOnly2:), while the third runs quickly but inaccurately on my machine unless I add the as list conversion, which makes it take much longer.

Note that I’m only commenting on the benchmark results in cases that use performSelector:withObject:, not on the logical accuracy of the handler itself. It works accurately in all other cases, but it doesn’t run very quickly for me.

For example, without making any changes to your returnDuplicatesOnly3:, I see the following difference when using performSelector:withObject: vs. calling the handler directly:

set aList to {}
repeat with i from 1 to 10000
	set the end of aList to (random number from 1 to 10000)
end repeat

-- Using performSelector
set date1 to current application's NSDate's timeIntervalSinceReferenceDate()
set res1 to my performSelector:"returnDuplicatesOnly3:" withObject:aList -- Missing value
set date2 to current application's NSDate's timeIntervalSinceReferenceDate()
set duration1 to date2 - date1 -- ~0.007s

-- Calling handler directly
set date3 to current application's NSDate's timeIntervalSinceReferenceDate()
set res2 to my returnDuplicatesOnly3:aList -- List of many items
set date4 to current application's NSDate's timeIntervalSinceReferenceDate()
set duration2 to date4 - date3 -- ~2.40s

-- Using Cocoa implementation
set date5 to current application's NSDate's timeIntervalSinceReferenceDate()
set res3 to my returnDuplicatesOnly1:aList -- List of many items
set date6 to current application's NSDate's timeIntervalSinceReferenceDate()
set duration3 to date6 - date5  -- ~0.01s

return {{duration1, res1}, {duration2, res2}, {duration3, res3}}
-- {{0.006880998611, missing value}, {2.394160985947, list of 2672 items}, {0.013992071152, list of 2672 items}}

For returnDuplicatesOnly3:, the duration appears very short but it does not return a value, because it is silently failing at the line set aCount to length of (aList of spd).

Do you see different behavior with that script?
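
For anyone who wants the vanilla handler to behave the same however it is called, here is a minimal sketch of the guard described above. It assumes the only problem is that performSelector:withObject: hands the handler a bridged NSArray, so it coerces the parameter to a list before asking for its length. The handler name returnDuplicatesOnly3Guarded: is hypothetical and is not part of the original benchmark.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical guarded variant; the handler name is illustrative only.
on returnDuplicatesOnly3Guarded:aList
	script spd
		property aList : {}
		property dList : {}
	end script
	
	-- Coerce up front: a bridged NSArray becomes an AppleScript list,
	-- and a value that is already a list stays a list.
	set (aList of spd) to aList as list
	set aCount to length of (aList of spd)
	set (dList of spd) to {}
	
	repeat aCount times
		set anItem to first item of (aList of spd)
		set (aList of spd) to rest of (aList of spd)
		
		-- Record an item once if it occurs again later in the list.
		if {anItem} is in (aList of spd) then
			if {anItem} is not in (dList of spd) then
				set the end of (dList of spd) to anItem
			end if
		end if
	end repeat
	
	return (dList of spd)
end returnDuplicatesOnly3Guarded:

set arrayA to {1, 2, 3, 4, 3, 5, 3}
set res1 to my returnDuplicatesOnly3Guarded:arrayA -- {3}
set res2 to my performSelector:"returnDuplicatesOnly3Guarded:" withObject:arrayA
return {res1, res2}
```

With the coercion in place I would expect res2 to stop coming back as missing value, although it may arrive as a bridged object that needs its own as list. The guard only fixes the silent failure. The loop is still the vanilla one, so on large lists it should take as long as the direct call did in the timings above.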