How to parallelize a function of multiple arguments over only one of the arguments


I have a function that processes a relatively large dataframe, and its run time takes quite a while. I was looking at ways of improving the run time and came across multiprocessing's Pool. If I understood correctly, this should run the function on equal chunks of the dataframe in parallel, which means it could potentially run quicker and save time.

My function takes 4 arguments; the last three are mainly lookups, while the first one is the dataframe of interest. It looks something like this:

def functionExample(dataOfInterest, lookup1, lookup2, lookup3):
    # do stuff with the data and lookups
    return output1, output2

Based on what I've read, I came up with the below, which I thought should work:

import numpy as np
import pandas as pd
from multiprocessing import Pool

num_partitions = 4
num_cores = 4

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)  # split into equal chunks
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))       # run func on each chunk in parallel
    pool.close()
    pool.join()
    return df

Then, to call the process (which is mainly where I couldn't figure it out), I tried the below:

output1, output2 = parallelize_dataframe(dataOfInterest, functionExample)

This returns the error:

functionExample() missing 3 required positional arguments: 'lookup1', 'lookup2', and 'lookup3'

Then I tried adding the three arguments by doing the below:

output1, output2 = parallelize_dataframe(dataOfInterest, functionExample(lookup1, lookup2, lookup3))

This returns the error below, suggesting that it took the three lookups as the first three arguments of the function and is now missing the fourth, rather than treating them as the last three arguments the previous error said were missing:

functionExample() missing 1 required positional argument: 'lookup1'

And if I try feeding it all four arguments by doing the below:

output1, output2 = parallelize_dataframe(dataOfInterest, functionExample(dataOfInterest, lookup1, lookup2, lookup3))

It returns the error below:

'tuple' object is not callable

I'm not quite sure which of the above is the right way to do it, if any. Should it be taking all of the function's arguments, including the desired dataframe? If so, why is it complaining about tuples?

Any help would be appreciated! Thanks.

You can perform a partial binding of some arguments to create a new callable via functools.partial:

from functools import partial

output1, output2 = parallelize_dataframe(dataOfInterest,
                                         partial(functionExample, lookup1=lookup1, lookup2=lookup2, lookup3=lookup3))

Note that in the multiprocessing world, partial can be slow, so you may want to find a way to avoid the need to pass the arguments if they're large/expensive to pickle, assuming that's possible in your use case.


In each case, you are trying to call the function, rather than pass the function along with the arguments for when it is called. What you need is a new callable that calls your original function with the correct arguments.

from functools import partial


output1, output2 = parallelize_dataframe(
    dataOfInterest,
    partial(functionExample, lookup1=x, lookup2=y, lookup3=z)
)


You could simply modify your function definition to take predefined default arguments, or write a wrapper function that calls your original function with those parameters.

def functionExample(dataOfInterest, lookup1=x, lookup2=y, lookup3=z):
    # do stuff with the data and lookups
    return output1, output2

or

def f(dataOfInterest):
    return functionExample(dataOfInterest, lookup1=x, lookup2=y, lookup3=z)

This way, map() will work as you expect.


Comments
  • Can you share a bit more information on this? What do the DataFrame and function look like?
  • Thanks a lot. This solves it (it's running now, will test if it goes through and proves to be quicker), as to the partial, I think I'll go for the suggestion by @alec_djinn below which should do for a way around the partial. Cheers!
  • I was about to say that won't work; they're using multiprocessing, but you replaced the lambda (unpicklable) with partial (picklable, though potentially slow) between when I began this comment and now. :-)
  • Cool, I forgot pickling would be an issue. The old version was mainly to avoid assuming the OP knew the parameter names, which would preclude the use of keyword arguments.
  • Thank you, this works and is a way around using partial.