Generate and merge data with Python multiprocessing
- by Bobby
I have a list of starting data. I want to apply a function to each element that creates a few pieces of new data from it. Some of the new pieces are duplicates of one another, and I want to remove them.
The sequential version is essentially:
def create_new_data_for(datum):
    """Make a list of new data from some old datum."""
    return [datum.modified_copy(k) for k in datum.k_list]

data = [...]  # some list of data to start with

# Generate a list of new data from the old data; we'll reduce it next.
newdata = []
for d in data:
    newdata.extend(create_new_data_for(d))

# Now reduce the data under ".matches(other)".
reduced = []
for d in newdata:
    for seen in reduced:
        if d.matches(seen):
            break
    else:
        # No break, so we haven't seen anything like d yet.
        reduced.append(d)

# Now reduced is finished and is what we want!
I want to speed this up with multiprocessing.
I was thinking that I could use a multiprocessing.Queue for the generation step.
Each process would put the data it creates onto the Queue, and when the processes are reducing the data, they could get it back from the Queue.
But I'm not sure how to have the different processes loop over reduced and modify it without race conditions or other issues.
What is the best way to do this safely? Or is there a better way to accomplish this goal?
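To make the goal concrete, here is a minimal sketch of one possible shape for it, not a definitive answer. It swaps the Queue idea for multiprocessing.Pool.map: only the generation step runs in worker processes, and the reduction stays in the parent process, so no process ever modifies a shared reduced list. The Datum class, its toy modified_copy and matches rules, and the helper names reduce_matches and generate_and_reduce are all assumptions made up for illustration; the real data type is not shown above.

```python
import multiprocessing


class Datum:
    """Hypothetical stand-in for the real data type (an assumption)."""

    def __init__(self, value, k_list):
        self.value = value
        self.k_list = k_list

    def modified_copy(self, k):
        # Toy modification: shift the value by k.
        return Datum(self.value + k, self.k_list)

    def matches(self, other):
        # Toy equivalence: two data match when their values are equal.
        return self.value == other.value


def create_new_data_for(datum):
    """Make a list of new data from some old datum."""
    return [datum.modified_copy(k) for k in datum.k_list]


def reduce_matches(items):
    """Keep the first item of each group of items that match each other."""
    reduced = []
    for d in items:
        if not any(d.matches(seen) for seen in reduced):
            reduced.append(d)
    return reduced


def generate_and_reduce(data, processes=2):
    # Generation is independent per datum, so the workers share no state.
    with multiprocessing.Pool(processes) as pool:
        chunks = pool.map(create_new_data_for, data)
    # Flatten the per-datum lists; pool.map preserves input order.
    newdata = [d for chunk in chunks for d in chunk]
    # The merge runs in the parent only, so there is nothing to race on.
    return reduce_matches(newdata)


if __name__ == "__main__":
    data = [Datum(0, [1, 2]), Datum(1, [1, 2])]
    reduced = generate_and_reduce(data)
    print(sorted(d.value for d in reduced))  # prints [1, 2, 3]
```

This only pays off if create_new_data_for is the expensive part, because every datum is pickled on its way to and from the workers. The pairwise reduction is still sequential and quadratic in the worst case, which is the part the question is really about.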