Most efficient way of combining two lists of records, overwriting duplicates

I need to merge one list of records with a second identically structured list, overwriting records that are duplicates. Duplicate status is determined by a unique ID. The use case is that this is a cache (for an Alfred script filter) that needs to be updated periodically.

The data structure looks like this:

This merging is easily accomplished using AppleScript’s repeat loops, if one is dealing with tens or hundreds of records. However, it becomes completely inadequate when dealing with thousands of records. So, this calls for ASObC!

I’ve been working through @ShaneStanley’s wonderful ‘Everyday AppleScriptObjC’ but it’s a lot to take in and I’ve not found a solution yet.

The simplest approach would seem to be to take every uid in list 2, check if it exists in list 1, delete that record if it does, and then concatenate the two lists. Removing the duplicated records is easy, given the removeObjectsAtIndexes: method. What I’m missing, however, is how to get the index of a record based upon the value of one of its contents—in this case the uid.

Any advice would be greatly appreciated!

A predicate is probably the best approach. Get a list of the uids to skip, then use a predicate based on that to filter the other list.

1 Like

I’ve included two suggestions below, the first of which uses ASObjC. In both cases I changed the record labels just to simplify.

use framework "Foundation"
use scripting additions

set listOne to {{firstname:"John", gender:"male", age:"40"}, {firstname:"Jane", gender:"female", age:"30"}, {firstname:"Jane", gender:"female", age:"30"}}
set listTwo to {{firstname:"Joe", gender:"male", age:"20"}, {firstname:"Jane", gender:"female", age:"30"}}

set arrayOne to current application's NSArray's arrayWithArray:listOne
set arrayTwo to current application's NSArray's arrayWithArray:listTwo
set mergedArray to (arrayOne's arrayByAddingObjectsFromArray:arrayTwo)
set cleanedArray to current application's NSMutableArray's new()
set ageCheck to current application's NSMutableArray's new()

repeat with i from 1 to (count mergedArray)
	set anObject to (mergedArray's objectAtIndex:(i - 1))
	set theAge to (anObject's valueForKey:"age")
	set aDuplicate to (ageCheck's containsObject:theAge)
	if aDuplicate as boolean = false then
		(cleanedArray's addObject:anObject)
		(ageCheck's addObject:theAge)
	end if
end repeat

cleanedArray as list --> {{firstname:"John", age:"40", gender:"male"}, {firstname:"Jane", age:"30", gender:"female"}, {firstname:"Joe", age:"20", gender:"male"}}

The second script uses basic AppleScript and is more than five times faster.

set listOne to {{firstname:"John", gender:"male", age:"40"}, {firstname:"Jane", gender:"female", age:"30"}, {firstname:"Jane", gender:"female", age:"30"}}
set listTwo to {{firstname:"Joe", gender:"male", age:"20"}, {firstname:"Jane", gender:"female", age:"30"}}

set cleanedList to getCleanedList(listOne & listTwo)

on getCleanedList(theList)
	script o
		property mergedList : theList
		property cleanedList : {}
		property ageCheck : {}
	end script
	
	repeat with anItem in o's mergedList
		set anItem to contents of anItem
		set theAge to age of anItem
		if theAge is not in o's ageCheck then
			set end of o's cleanedList to anItem
			set end of o's ageCheck to theAge
		end if
	end repeat
	return o's cleanedList
end getCleanedList
1 Like

Thank you very much. I’ll have to experiment and figure out how this is working, but that is half the fun!

I spent a little time to see if I could make my ASObjC suggestion more competitive and the following revised script is about 20 percent faster.

use framework "Foundation"
use scripting additions

set listOne to {{firstname:"John", gender:"male", age:"40"}, {firstname:"Jane", gender:"female", age:"30"}}
set listTwo to {{firstname:"Joe", gender:"male", age:"20"}, {firstname:"Jane", gender:"female", age:"30"}}

set cleanedList to getCleanedList(listOne, listTwo) --> {{firstname:"John", age:"40", gender:"male"}, {firstname:"Jane", age:"30", gender:"female"}, {firstname:"Joe", age:"20", gender:"male"}}

on getCleanedList(listOne, listTwo)
	set mergedArray to current application's NSMutableArray's arrayWithArray:listOne
	mergedArray's addObjectsFromArray:listTwo
	set cleanedArray to current application's NSMutableArray's new()
	set ageCheck to current application's NSMutableSet's new()
	repeat with anItem in mergedArray
		set theAge to (anItem's valueForKey:"age")
		if (ageCheck's containsObject:theAge) as boolean = false then
			(cleanedArray's addObject:anItem)
			(ageCheck's addObject:theAge)
		end if
	end repeat
	return cleanedArray as list
end getCleanedList

I ran some timing tests and the results were:

2000 records in each list and 1 of 4 records being a duplicate:
Basic AppleScript - 350 milliseconds
ASObjC Script - 630 milliseconds

5000 records in each list and 3 of 10 records being a duplicate:
Basic AppleScript - 2.065 second
ASObjC Script - 1.551 seconds

I also tested the basic AppleScript with 2000 records but without the script object and the result was 18 seconds.

Hi.

If I may come late to the party, here’s an implementation of Shane’s suggestion:

use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

-- For convenience, the records here are pretty much the same except for uid values.
-- The "duplicate" records in list1 have "DUPLICATE" as their titles to make them
-- easy to spot, but this isn't used in the script.
set list1 to {{arg:"118305", match:"The Origin and Goal of History", subtitle:"", title:"The Origin and Goal of History", uid:"118304"}, {arg:"118305", match:"The Origin and Goal of History", subtitle:"", title:"DUPLICATE", uid:"118305"}, {arg:"118305", match:"The Origin and Goal of History", subtitle:"", title:"The Origin and Goal of History", uid:"118306"}, {arg:"118305", match:"The Origin and Goal of History", subtitle:"", title:"DUPLICATE", uid:"118307"}}
set list2 to {{arg:"118305", match:"The Origin and Goal of History", subtitle:"", title:"The Origin and Goal of History", uid:"118300"}, {arg:"118305", match:"The Origin and Goal of History", subtitle:"", title:"The Origin and Goal of History", uid:"118302"}, {arg:"118305", match:"The Origin and Goal of History", subtitle:"", title:"The Origin and Goal of History", uid:"118305"}, {arg:"118305", match:"The Origin and Goal of History", subtitle:"", title:"The Origin and Goal of History", uid:"118307"}}

-- Get array versions of the lists.
set array1 to current application's class "NSMutableArray"'s arrayWithArray:(list1)
set array2 to current application's class "NSArray"'s arrayWithArray:(list2)

-- Filter array1, keeping dictionaries whose uids don't appear in any of array2's dictionaries.
set uids2 to array2's valueForKey:("uid")
set uidFilter to current application's class "NSPredicate"'s predicateWithFormat_("NOT uid IN %@", uids2)
tell array1 to filterUsingPredicate:(uidFilter)

-- Append array2's dictionaries to what's left.
tell array1 to addObjectsFromArray:(array2)

-- Sort the result by uid, if required.
set sortOnUID to current application's class "NSSortDescriptor"'s sortDescriptorWithKey:("uid") ascending:(true)
tell array1 to sortUsingDescriptors:({sortOnUID})

-- Convert back to list.
set newList to array1 as list
1 Like

Nigel. Thanks for the script. I was interested by Shane’s suggestion but didn’t know how to implement it. I’m still working to learn predicates and have a long ways to go.

Our scripts potentially return different results, in that my script eliminates all duplicates in both lists, while your script returns duplicates that exist in list2. I reread the OP’s post and am not sure which he wants, in that he states as quoted below. Perhaps list2 contains no duplicates, in which case our scripts return the same results.

I need to merge one list of records with a second identically structured list, overwriting records that are duplicates… The simplest approach would seem to be to take every uid in list 2, check if it exists in list 1, delete that record if it does, and then concatenate the two lists.

FWIW, I repeated the timing test mentioned above with each list containing 5000 items, and your script was faster with a timing result of 1.34 seconds compared with my script’s 1.59 seconds. I did disable the sort code in your script to compare like with like, and I’ve included the test script below in case it contains an error. However, because our scripts work differently, and because listTwo in the test script contains so may duplicates, I don’t think these timing results are meaningful in comparing our scripts.

use framework "Foundation"
use scripting additions

-- untimed code
set list1 to {}
set list2 to {}
repeat 5000 times
	set someuid to random number from 1 to 10000
	set aRecord to {firstname:"peavine", gender:"male", city:"Prescott", state:"Arizona", uid:someuid}
	set end of list1 to aRecord
	set someuid to random number from 5000 to 15000
	set aRecord to {firstname:"peavine", gender:"male", city:"Prescott", state:"Arizona", uid:someuid}
	set end of list2 to aRecord
end repeat

-- start time
set startTime to current application's CACurrentMediaTime()

-- timed code
set array1 to current application's class "NSMutableArray"'s arrayWithArray:(list1)
set array2 to current application's class "NSArray"'s arrayWithArray:(list2)

-- Filter array1, keeping dictionaries whose uids don't appear in any of array2's dictionaries.
set uids2 to array2's valueForKey:("uid")
set uidFilter to current application's class "NSPredicate"'s predicateWithFormat_("NOT uid IN %@", uids2)
tell array1 to filterUsingPredicate:(uidFilter)

-- Append array2's dictionaries to what's left.
tell array1 to addObjectsFromArray:(array2)

-- Sort the result by uid, if required.
-- set sortOnUID to current application's class "NSSortDescriptor"'s sortDescriptorWithKey:("uid") ascending:(true)
-- tell array1 to sortUsingDescriptors:({sortOnUID})

-- Convert back to list.
set newList to array1 as list

-- elapsed time
set elapsedTime to (current application's CACurrentMediaTime()) - startTime
set numberFormatter to current application's NSNumberFormatter's new()
if elapsedTime > 1 then
	numberFormatter's setFormat:"0.000"
	set elapsedTime to ((numberFormatter's stringFromNumber:elapsedTime) as text) & " seconds"
else
	(numberFormatter's setFormat:"0")
	set elapsedTime to ((numberFormatter's stringFromNumber:(elapsedTime * 1000)) as text) & " milliseconds"
end if

-- result
elapsedTime --> 1.344 seconds -- 1.339 seconds -- 1.348 second (3 runs)

-- count newList --> 8978 -- 9029 -- 9023
1 Like

Hi peavine.

I had to read the original post myself a couple of times before latching onto the words “overwriting” and “updated” and deciding that if records had the same uid in both lists, those in a newer list probably had to replace those in an original.

1 Like

Hi everyone. In theory, my lists shouldn’t contain any duplicates in themselves, only when compared against one another. Sorry if my way of explaining it was confusing!

I’ll try to test the different solutions with a real-world use case and see how they compare.

Thanks so much for spending time on this. I hope it’ll be useful to others too.

Philip. Thanks for the additional information, in which case Nigel’s suggestion is the ASObjC script to use.

I rewrote my basic AppleScript suggestion to work the same as Nigel’s and have included it below. This script and Nigel’s have essentially identical timing results.

set listOne to {{firstname:"John", gender:"male", uid:"40"}, {firstname:"Jane", gender:"female", uid:"30"}}
set listTwo to {{firstname:"Joe", gender:"male", uid:"20"}, {firstname:"June", gender:"female", uid:"30"}}

set cleanedList to getCleanedList(listOne, listTwo) --> {{firstname:"John", gender:"male", uid:"40"}, {firstname:"Joe", gender:"male", uid:"20"}, {firstname:"June", gender:"female", uid:"30"}}

on getCleanedList(list1, list2)
	script o
		property listOne : list1
		property listTwo : list2
		property uidList : {}
		property cleanedList : {}
	end script
	
	repeat with anItem in o's listTwo
		set end of o's uidList to uid of anItem
	end repeat
	
	repeat with anItem in o's listOne
		set anItem to contents of anItem
		if (uid of anItem) is not in o's uidList then
			set end of o's cleanedList to anItem
		end if
	end repeat
	return o's cleanedList & o's listTwo
end getCleanedList
1 Like

Okay, the results are in!

My test involved a total of 7,078 unique records. My listOne held 7,050 of those, and listTwo held 78, with 50 being duplicates (i.e. found in both lists).

I tested 5 different solutions:

  1. A simple vanilla AppleScript repeat loop inside a repeat loop.
  2. @peavine’s second getCleanedList example (19 Jan)
  3. @NigelGarvey’s implementation of @ShaneStanley’s suggestion
  4. @peavine’s revision based on the above
  5. This is basically cheating but I also found a text-based solution using regex/JSON

The results:

  1. I cancelled the script after 2 minutes (though it works fine with very short lists)…
  2. 1.98 seconds
  3. 0.8 seconds
  4. 0.34 seconds
  5. 2.2 seconds

So @peavine is the clear winner, though it was a team effort :slight_smile:

I say that my regex solution is cheating because I was originally asking about doing this with lists. However, the use case requires me to write/read the data cache in JSON, so it made sense.

The idea is quite simple. This will remove duplicate records, where theIDs is a string containing the IDs to be removed, e.g. (234553|353538|9050385):

set theResult to regex change cachedJSON search pattern "\\{(.|\\n)*?\"uid\": \"" & theIDs & "\",(.|\\n)*?\\},?\\n\\h+" replace template ""

This approach was extremely fast (less than half the time of any others) when I tested it on a list of 500 records but it loses relative efficacy, compared with the ASObjC methods, when working with longer lists. Also, I’m not 100% confident that I could make it work 100% of the time.

In any case, I now have a fast and robust solution. Thanks all.

Philip. Thanks for reporting back your results–it’s great to learn how things turned out.

Yes. Thanks for the feedback, Philip. Glad you’ve got what you need.

Just out of interest, one of the reasons peavine’s pure AppleScript revision is faster than the ASObjC on which it’s based is the shortness of listTwo. AppleScript (as written by peavine) can search an AppleScript list of 78 items 7,050 times in less time than it takes to translate the objects and commands into Objective-C, execute that, and translate the result back to AppleScript. If listTwo was significantly longer, both scripts would take longer to complete, but the AppleScript would be more dramatically affected than the ASObjC.

1 Like

That makes sense, thanks. For my purposes, listTwo will always be much shorter. However, I will add both versions to my script library, for future reference.