How does newly introduced Arrays.parallelPrefix(...) in Java 8 work?

I came across Arrays.parallelPrefix introduced in Java 8.

This overloaded method performs an operation on each element of the input array in a cumulative fashion. For example, from the docs:

Cumulates, in parallel, each element of the given array in place, using the supplied function. For example if the array initially holds [2, 1, 0, 3] and the operation performs addition, then upon return the array holds [2, 3, 3, 6]. Parallel prefix computation is usually more efficient than sequential loops for large arrays.
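
For a concrete picture, here is a minimal usage example I put together (using the int[] overload with Integer::sum as the operation), reproducing the documented behavior:

    import java.util.Arrays;

    public class ParallelPrefixDemo {
        public static void main(String[] args) {
            int[] a = {2, 1, 0, 3};
            // Cumulative sum, computed in parallel for sufficiently large arrays
            Arrays.parallelPrefix(a, Integer::sum);
            System.out.println(Arrays.toString(a)); // prints [2, 3, 3, 6]
        }
    }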

So, how does Java achieve this task in parallel when the operation on each term depends on the result of the operation on the previous term, and so on?

I tried going through the code myself; they do use ForkJoinTasks, but it's not straightforward how they merge the results to get the final array.

As explained in Eran’s answer, this operation utilizes the associativity property of the function.

Then, there are two fundamental steps. The first one is an actual prefix operation (in the sense of requiring the previous element(s) for the evaluation), applied to parts of the array in parallel. The result of each partial operation (identical to its resulting last element) is the offset for the remaining array.

E.g., for the following array, using sum as the prefix operation and four processors,

  4    9    5    1    0    5    1    6    6    4    6    5    1    6    9    3  

we get

  4 → 13 → 18 → 19    0 →  5 →  6 → 12    6 → 10 → 16 → 21    1 →  7 → 16 → 19  
                 ↓                   ↓                   ↓                   ↓  
                19                  12                  21                  19  

Now, we utilize the associativity to apply the prefix operation to the offsets first:

                 ↓                   ↓                   ↓                   ↓  
                19         →        31         →        52         →        71  

Then, we get to the second phase, which is to apply these offsets to each element of the next chunk. This is a perfectly parallelizable operation, as there is no dependency on the previous element(s) anymore:

                     19   19   19   19   31   31   31   31   52   52   52   52  
                      ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓  
  4   13   18   19   19   24   25   31   37   41   47   52   53   59   68   71  
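
To make the two phases concrete, here is a compact sketch of this strategy in code. It is my own illustration, not the JDK's actual implementation (which is built on ForkJoinTasks in ArrayPrefixHelpers and is considerably more involved); for simplicity, it assumes at least two chunks and that the chunk count divides the array length evenly:

    import java.util.Arrays;
    import java.util.function.IntBinaryOperator;
    import java.util.stream.IntStream;

    public class TwoPhasePrefixSketch {

        // Assumes chunks >= 2 and a.length % chunks == 0, to keep the sketch short.
        static void prefix(int[] a, IntBinaryOperator op, int chunks) {
            int size = a.length / chunks;

            // Phase 1: an ordinary sequential prefix scan inside each chunk,
            // with the chunks processed in parallel.
            IntStream.range(0, chunks).parallel().forEach(c -> {
                for (int i = c * size + 1; i < (c + 1) * size; i++) {
                    a[i] = op.applyAsInt(a[i - 1], a[i]);
                }
            });

            // Sequential scan over the chunk totals (each chunk's last element)
            // to compute the offset that must be folded into the following chunk.
            int[] offset = new int[chunks];          // offset[0] stays unused
            offset[1] = a[size - 1];
            for (int c = 2; c < chunks; c++) {
                offset[c] = op.applyAsInt(offset[c - 1], a[c * size - 1]);
            }

            // Phase 2: fold each offset into every element of its chunk. No element
            // depends on its predecessor anymore, so this parallelizes perfectly.
            IntStream.range(1, chunks).parallel().forEach(c -> {
                for (int i = c * size; i < (c + 1) * size; i++) {
                    a[i] = op.applyAsInt(offset[c], a[i]);
                }
            });
        }

        public static void main(String[] args) {
            int[] a = {4, 9, 5, 1, 0, 5, 1, 6, 6, 4, 6, 5, 1, 6, 9, 3};
            prefix(a, Integer::sum, 4);
            System.out.println(Arrays.toString(a));
            // [4, 13, 18, 19, 19, 24, 25, 31, 37, 41, 47, 52, 53, 59, 68, 71]
        }
    }

Note that the offset is always passed as the left operand, so the sketch also works for associative but non-commutative operators.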

When we use the same example for eight threads,

  4    9    5    1    0    5    1    6    6    4    6    5    1    6    9    3  

  4 → 13    5 →  6    0 →  5    1 →  7    6 → 10    6 → 11    1 →  7    9 → 12  
       ↓         ↓         ↓         ↓         ↓         ↓         ↓         ↓  
      13         6         5         7        10        11         7        12  

       ↓         ↓         ↓         ↓         ↓         ↓         ↓         ↓  
      13    →   19    →   24    →   31    →   41    →   52    →   59    →   71  

           13   13   19   19   24   24   31   31   41   41   52   52   59   59  
            ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓  
  4   13   18   19   19   24   25   31   37   41   47   52   53   59   68   71  

we see that there will be a clear benefit, even when we use the simpler strategy of keeping the work chunks the same for both steps, in other words, accepting one idle worker thread in the second phase. We will need about ⅛n for the first phase and ⅛n for the second, so ¼n in total (where n is the cost of the sequential prefix evaluation of the entire array). Of course, this is only a rough best-case estimate.

In contrast, when we have only two processors

  4    9    5    1    0    5    1    6    6    4    6    5    1    6    9    3  


  4 → 13 → 18 → 19 → 19 → 24 → 25 → 31    6 → 10 → 16 → 21 → 22 → 28 → 37 → 40  
                                     ↓                                       ↓  
                                    31                                      40  

                                     ↓                                       ↓  
                                    31                   →                  71  

                                         31   31   31   31   31   31   31   31  
                                          ↓    ↓    ↓    ↓    ↓    ↓    ↓    ↓  
  4   13   18   19   19   24   25   31   37   41   47   52   53   59   68   71  

we can only gain a benefit when we re-assign the work of the second phase. This is possible, as said, because the second phase's work has no dependencies between the elements anymore. So we can split this operation arbitrarily, though it complicates the implementation and may introduce additional overhead.

When we split the work of the second phase between both processors, the first phase needs about ½n and the second needs ¼n, yielding ¾n total, which is still a benefit if the array is large enough.
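
To illustrate the re-splitting, here is a small sketch (my own example, not the JDK code) that starts from the state after the first phase in the two-processor scenario above and spreads the second phase over all workers via a flat parallel loop:

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class ResplitSecondPhase {
        public static void main(String[] args) {
            // State after phase 1 with two chunks (see the diagram above):
            int[] a = {4, 13, 18, 19, 19, 24, 25, 31,   // first chunk, already final
                       6, 10, 16, 21, 22, 28, 37, 40};  // second chunk, offset still missing
            int half = a.length / 2;
            int offset = a[half - 1];                   // 31, the first chunk's total

            // Phase 2: no element depends on its predecessor anymore, so the
            // updates can be distributed over the workers arbitrarily.
            IntStream.range(half, a.length).parallel().forEach(i -> a[i] += offset);

            System.out.println(Arrays.toString(a));
            // [4, 13, 18, 19, 19, 24, 25, 31, 37, 41, 47, 52, 53, 59, 68, 71]
        }
    }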

As an additional note, you might notice that the offsets calculated in preparation for the second phase are identical to the result for the last element of the chunk. So, you could save one operation per chunk by simply assigning that value. But the typical scenario is to have only a few chunks (scaling with the number of processors), each with a large number of elements, so saving one operation per chunk is not relevant.

The main point is that the operator is a

side-effect-free, associative function

This means that

(a op b) op c == a op (b op c)

Therefore, if you split the array into two halves and apply the prefix operation recursively on each half, you can later merge the partial results by applying the operation to each element of the second half, with the last element of the first half as the other operand.

Consider the [2, 1, 0, 3] example with addition. If you split the array into two halves and perform the operation on each half, you get:

[2, 3]    and    [0, 3]

Then, in order to merge them, you add 3 (the last element of the first half) to each element of the second half, and get:

[2, 3, 3, 6]
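
In code, the split-and-merge idea looks roughly like this (a hypothetical helper for illustration; the recursive calls are shown sequentially, but they are independent and could run as two parallel tasks):

    import java.util.Arrays;
    import java.util.function.IntBinaryOperator;

    public class SplitAndMergePrefix {

        static void prefix(int[] a, int from, int to, IntBinaryOperator op) {
            if (to - from < 2) {
                return;                      // a single element is its own prefix
            }
            int mid = (from + to) >>> 1;
            // Independent sub-problems; candidates for parallel execution,
            // e.g. as two ForkJoinTasks.
            prefix(a, from, mid, op);
            prefix(a, mid, to, op);
            // Merge: fold the first half's last element into the second half.
            int lastOfFirstHalf = a[mid - 1];
            for (int i = mid; i < to; i++) {
                a[i] = op.applyAsInt(lastOfFirstHalf, a[i]);
            }
        }

        public static void main(String[] args) {
            int[] a = {2, 1, 0, 3};
            prefix(a, 0, a.length, Integer::sum);
            System.out.println(Arrays.toString(a)); // [2, 3, 3, 6]
        }
    }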

EDIT: This answer suggests one way of computing the prefixes of an array in parallel. It's not necessarily the most efficient way, and not necessarily the way used by the JDK implementation. You can further read about parallel algorithms for solving that problem here.

I read both answers and still could not completely understand how this is done, so I decided to draw an example instead. Here is what I came up with; suppose this is the array that we start with (and 3 CPUs):

7, 9, 6, 1, 8, 7, 3, 4, 9

So each of the 3 threads will get its own chunk to work on:

Thread 1:  7, 9, 6
Thread 2:  1, 8, 7
Thread 3:  3, 4, 9

Since the documentation mandates an associative function, we can compute the full sums in the first thread and partial sums in the other ones; once the first one is known, all of them will be. Let's see what 7, 9, 6 would become:

7, 9, 6  -> 7, 16, 22

So the sum in the first thread is 22, but the other threads have no idea about that yet. What they do instead is work against that value as an unknown x. Thus Thread 2 will be:

1, 8, 7 -> 1 (+x), 9 (+x), 16 (+x)

The sum from the second thread would thus be x + 16, so in Thread 3 we would have:

3, 4, 9 -> 3 (+ x + 16), 7 (+ x + 16), 16 (+ x + 16)

3, 4, 9 -> x + 19, x + 23, x + 32

This way, as soon as I know x, I know all other results too.

Disclaimer: I am not sure this is how it is implemented (and I tried looking at the code - but it's too complicated).
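
For what it's worth, here is a runnable sketch of this idea (my own illustration, not the JDK code): every chunk's local scan starts immediately, and once the accumulated offset x in front of a chunk becomes concrete, that chunk's final values follow in a single pass:

    import java.util.Arrays;
    import java.util.concurrent.CompletableFuture;

    public class DeferredOffsetPrefix {
        public static void main(String[] args) {
            int[] a = {7, 9, 6, 1, 8, 7, 3, 4, 9};
            int chunks = 3, size = a.length / chunks;

            // Each thread scans its own chunk right away; the unknown x (the
            // total of everything before the chunk) is folded in later.
            CompletableFuture<?>[] local = new CompletableFuture<?>[chunks];
            for (int c = 0; c < chunks; c++) {
                int from = c * size;
                local[c] = CompletableFuture.runAsync(() -> {
                    for (int i = from + 1; i < from + size; i++) {
                        a[i] += a[i - 1];
                    }
                });
            }
            CompletableFuture.allOf(local).join();   // a = 7,16,22, 1,9,16, 3,7,16

            // As each chunk's x becomes concrete, the chunk is finished in one pass.
            for (int c = 1; c < chunks; c++) {
                int x = a[c * size - 1];             // final prefix just before chunk c
                for (int i = c * size; i < (c + 1) * size; i++) {
                    a[i] += x;
                }
            }
            System.out.println(Arrays.toString(a));
            // [7, 16, 22, 23, 31, 38, 41, 45, 54]
        }
    }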

Comments
  • Nice question. +1 .. Parallel prefix computation is usually more efficient than sequential loops for large arrays ... every day there is a lot more for me here to learn and I love it.
  • FWIW, the source code (with fairly extensive comments) is here and here.
  • If you want to read more about how tremendously powerful this sort of operation can be, I'd highly recommend the PhD thesis (or papers) by Guy Blelloch : cs.cmu.edu/~guyb/papers/Ble90.pdf . I think the importance of his work - considering modern multi- and many-core hardware - is highly underrated.
  • Also , see this Dr Dobbs article: drdobbs.com/parallel/the-mystery-and-magic-of-prefix-scan/…
  • @DavidSoroko The Dr. Dobbs article is a gem. It explains how the parallel prefix scan ends up doing many more operations than the iterative logic, but the span of the parallel algorithm is much lower (O(log N) with unlimited processors).
  • @Holger I tried to look at the implementation, but it is way out of my league. Still, I was thinking: why can't this be done in such a way that (in the second phase, I mean) all the other results depend only on the first one? Once the first one is known, all others can be done separately and independently. My answer should really have been a comment, but it would be way too long.
  • The 2-core example is crystal clear. Instead of taking O(n) time, this will be done in reduced time. We perform operations on the 2 halves in O(n/2) time in parallel. After this, we update the 2nd half with the sum from the first half. This can be done by splitting the 2nd half and submitting its two parts to the 2 cores, which takes O(n/4) in parallel. So, the total time will be O(n/2) + O(n/4) = O(3n/4), which is 3/4 of the time taken in the sequential way.
  • @AdityaGupta yes, but don’t use O(…) for notating this. The Big O notation has a specific meaning, in which the parallel execution still is O(n), because it still scales linearly.
  • Yes, that makes sense.
  • I guess span should be the right term here. other link
  • Ok, but depending on what the function does, it is then essentially applied multiple times to the elements of the second half. What is the point of parallelizing, then, if the work must be done again?