Generate and merge data with python multiprocessing

Posted by Bobby on Stack Overflow
Published on 2010-04-17T18:59:15Z Indexed on 2010/04/17 19:03 UTC

Filed under: python | multiprocessing | parallel

I have a list of starting data. I want to apply a function to each element of that list that creates a few new pieces of data. Some of the new pieces duplicate one another, and I want to remove the duplicates.

The sequential version is essentially:

def create_new_data_for(datum):
    """make a list of new data from some old datum"""
    return [datum.modified_copy(k) for k in datum.k_list]

data = [...]  #placeholder: some list of data to start with

#generate a list of new data from the old data, we'll reduce it next
newdata = []
for d in data:
    newdata.extend(create_new_data_for(d))

#now reduce the data under ".matches(other)"
reduced = []
for d in newdata:
    for seen in reduced:
        if d.matches(seen):
            break
    else:
        #no break, so we haven't seen anything like d yet
        reduced.append(d)

#now reduced is finished and is what we want!

I want to speed this up with multiprocessing.

I was thinking that I could use a multiprocessing.Queue for the generation step. Each process would put the data it creates on the Queue, and the processes doing the reduction could then get the data from the Queue.
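The Queue idea for the generation step could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Datum` class is a hypothetical stand-in for the real data type, and `worker`, `split`, and `generate_with_queue` are names made up here. Only `create_new_data_for`, `modified_copy`, and `k_list` come from the sequential code. Each worker puts a `None` sentinel on the queue when it finishes, and the parent drains the queue before joining the workers, because a process that has put items on a Queue may not exit until those items have been consumed.

```python
import multiprocessing as mp


class Datum:
    """Hypothetical stand-in for the real data type."""

    def __init__(self, value, k_list=(1, 2)):
        self.value = value
        self.k_list = k_list

    def modified_copy(self, k):
        # Assumption: a "modified copy" just shifts the value by k.
        return Datum(self.value + k, self.k_list)


def create_new_data_for(datum):
    """Make a list of new data from some old datum."""
    return [datum.modified_copy(k) for k in datum.k_list]


def worker(chunk, queue):
    """Generate new data for one chunk and put each piece on the queue."""
    for datum in chunk:
        for new in create_new_data_for(datum):
            queue.put(new)
    queue.put(None)  # sentinel: this worker is finished


def split(items, n):
    """Split items into n interleaved chunks."""
    return [items[i::n] for i in range(n)]


def generate_with_queue(data, n_workers=2):
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(chunk, queue))
             for chunk in split(data, n_workers)]
    for p in procs:
        p.start()

    # Drain the queue until every worker has sent its sentinel.
    newdata, finished = [], 0
    while finished < len(procs):
        item = queue.get()
        if item is None:
            finished += 1
        else:
            newdata.append(item)

    # Join only after draining, so no worker blocks on a full pipe.
    for p in procs:
        p.join()
    return newdata


if __name__ == "__main__":
    data = [Datum(v) for v in range(4)]
    newdata = generate_with_queue(data)
    print(len(newdata))  # 8 (4 data x 2 entries in k_list)
```

The order of items in `newdata` depends on how the workers are scheduled, so nothing downstream should rely on it.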

But I'm not sure how to have the different processes loop over reduced and modify it without race conditions or other issues.

What is the best way to do this safely? Or is there a better way to accomplish this goal?
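One way to sidestep the shared `reduced` list entirely, sketched here under assumptions rather than offered as the definitive answer: let each worker generate and reduce its own chunk, then merge the partial results in the parent process, so no list is ever modified by two processes. The `Datum` class is again a hypothetical stand-in, and `reduce_under_matches`, `generate_and_reduce_chunk`, and `parallel_reduce` are illustrative names. The sketch assumes `matches` behaves like an equivalence relation (symmetric and transitive). If it does not, the result may differ from the sequential version.

```python
from multiprocessing import Pool


class Datum:
    """Hypothetical stand-in for the real data type."""

    def __init__(self, value, k_list=(1, 2)):
        self.value = value
        self.k_list = k_list

    def modified_copy(self, k):
        # Assumption: a "modified copy" just shifts the value by k.
        return Datum(self.value + k, self.k_list)

    def matches(self, other):
        # Assumption: two data match when their values are equal.
        return self.value == other.value


def reduce_under_matches(items, seed=()):
    """Keep one representative per group of matching items.

    `seed` holds items that are already known to be distinct.
    """
    reduced = list(seed)
    for d in items:
        if not any(d.matches(seen) for seen in reduced):
            reduced.append(d)
    return reduced


def generate_and_reduce_chunk(chunk):
    """Runs in a worker: generate new data, then reduce it locally."""
    newdata = []
    for datum in chunk:
        newdata.extend(datum.modified_copy(k) for k in datum.k_list)
    return reduce_under_matches(newdata)


def parallel_reduce(data, n_workers=2):
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(generate_and_reduce_chunk, chunks)

    # Final merge happens in the parent only, so there is no shared state.
    reduced = []
    for partial in partials:
        reduced = reduce_under_matches(partial, seed=reduced)
    return reduced


if __name__ == "__main__":
    data = [Datum(v) for v in range(4)]
    reduced = parallel_reduce(data)
    print(sorted(d.value for d in reduced))  # [1, 2, 3, 4, 5]
```

The final merge is still sequential, so the speedup depends on how much the per-chunk reduction shrinks the data. If matching items can be mapped to a common hashable key, a dict or set keyed on that value would replace the quadratic scan with a linear one.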

© Stack Overflow or respective owner
