## Looking for a faster way to sum arrays in C#

In the application which I'm currently developing, I must sum pretty big arrays of vectors efficiently. Here's my code:

    public List<double[,,]> normalMaps;

    public double[,,] Mix(double[] weights, double gain)
    {
        int w = normalMaps[0].GetLength(0);
        int h = normalMaps[0].GetLength(1);
        double[,,] ret = new double[w, h, 3];
        int normcount = normalMaps.Count;

        //for (int y = 0; y < h; y++)
        Parallel.For(0, h, y =>
        {
            for (int x = 0; x < w; x++)
            {
                for (int z = 0; z < normcount; z++)
                {
                    ret[x, y, 0] += normalMaps[z][x, y, 0] * weights[z];
                    ret[x, y, 1] += normalMaps[z][x, y, 1] * weights[z];
                    ret[x, y, 2] += normalMaps[z][x, y, 2] * weights[z];
                }

                ret[x, y, 0] *= gain;
                ret[x, y, 1] *= gain;
                ret[x, y, 2] *= gain;

                ret[x, y, 0] = Math.Max(-1, Math.Min(1, ret[x, y, 0]));
                ret[x, y, 1] = Math.Max(-1, Math.Min(1, ret[x, y, 1]));
                ret[x, y, 2] = Math.Max(-1, Math.Min(1, ret[x, y, 2]));

                double retnorm = Math.Sqrt(ret[x, y, 0] * ret[x, y, 0]
                                         + ret[x, y, 1] * ret[x, y, 1]
                                         + ret[x, y, 2] * ret[x, y, 2]);
                ret[x, y, 0] /= retnorm;
                ret[x, y, 1] /= retnorm;
                ret[x, y, 2] /= retnorm;
            }
        });

        return ret;
    }

Now, when I sum seven 1024×1024 arrays of 3-component vectors, the operation takes 320 ms on my laptop. Making the code multithreaded already gave me a huge performance boost, but I need it to be even faster. How can I optimize it further? I can already see that using a plain array instead of a List<> would make the code faster, but not by much. Is there really nothing left to optimize? I've thought about moving this to the GPU, but that's just an idea. Can somebody help me out? Thanks in advance.

Try this:

    private double[,,] Mix(double[][,,] normalMaps, double[] weights, double gain)
    {
        var w = normalMaps[0].GetLength(0);
        var h = normalMaps[0].GetLength(1);
        var result = new double[w, h, 3];
        var mapCount = normalMaps.Length;

        Parallel.For(0, w, x =>
        {
            for (int y = 0; y < h; y++)
            {
                OneStack(x, y, mapCount, normalMaps, weights, gain, result);
            }
        });

        return result;
    }

    private static void OneStack(
        int x, int y, int mapCount,
        double[][,,] normalMaps, double[] weights, double gain,
        double[,,] result)
    {
        var weight = weights[0];
        var z0 = normalMaps[0][x, y, 0] * weight;
        var z1 = normalMaps[0][x, y, 1] * weight;
        var z2 = normalMaps[0][x, y, 2] * weight;

        for (var i = 1; i < mapCount; i++)
        {
            weight = weights[i];
            z0 += normalMaps[i][x, y, 0] * weight;
            z1 += normalMaps[i][x, y, 1] * weight;
            z2 += normalMaps[i][x, y, 2] * weight;
        }

        z0 = Math.Max(-1, Math.Min(1, z0 * gain));
        z1 = Math.Max(-1, Math.Min(1, z1 * gain));
        z2 = Math.Max(-1, Math.Min(1, z2 * gain));

        var norm = Math.Sqrt(z0 * z0 + z1 * z1 + z2 * z2);
        result[x, y, 0] = z0 / norm;
        result[x, y, 1] = z1 / norm;
        result[x, y, 2] = z2 / norm;
    }

I'm anticipating an improvement because the number of assignments and accesses involving the large multi-dimensional array is minimised. While this comes at the cost of a few extra locals, I expect the cost of the multi-dimensional array accesses to be larger. Multi-dimensional arrays are essentially broken in .NET.
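To see what "broken" means in practice, here is a minimal, self-contained sketch (not from the answer) comparing element access on a `double[,]` versus a jagged `double[][]` of the same size; the JIT emits extra index arithmetic and bounds checks per dimension for the former, while the jagged version uses ordinary single-dimension array loads with a hoisted row reference:

```csharp
using System;
using System.Diagnostics;

class MdVsJagged
{
    static void Main()
    {
        const int n = 1024;
        var md = new double[n, n];
        var jagged = new double[n][];
        for (int i = 0; i < n; i++) jagged[i] = new double[n];

        var sw = Stopwatch.StartNew();
        double sumMd = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sumMd += md[i, j];
        sw.Stop();
        Console.WriteLine($"double[,]  : {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        double sumJagged = 0;
        for (int i = 0; i < n; i++)
        {
            var row = jagged[i];           // hoist the row lookup out of the inner loop
            for (int j = 0; j < row.Length; j++)
                sumJagged += row[j];
        }
        sw.Stop();
        Console.WriteLine($"double[][] : {sw.ElapsedMilliseconds} ms");
    }
}
```

Exact numbers depend on the runtime and JIT version, but the jagged form is typically noticeably faster.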


You can get your code from 270 ms down to almost nothing once you see that you are iterating over the dimensions **in an inefficient order**, which causes false sharing. You are essentially parallelizing over "width" instead of "height". You may be confusing the way arrays are stored in memory.

False sharing is not the only problem; because of how CPU caches work, you are also iterating over the data in a **cache-inefficient way.**

Usually array definitions should be `myArray[HEIGHT, WIDTH]` to stay consistent with memory storage, and when iterating, `height` should be outermost.

    Parallel.For(0, w, x =>
    {
        for (int y = 0; y < h; y++)
        {
            ...
        }
    });

That took me from 800 ms to 150 ms with equal dimensions, just by swapping those few things.
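The cache argument can be made concrete with a small sketch (an illustration, not part of the answer): .NET stores multi-dimensional arrays in row-major order, so for `a[x, y]` consecutive `y` values are adjacent in memory. The inner loop should therefore walk the last index, and `Parallel.For` should split on the first one:

```csharp
const int W = 2048, H = 2048;
var a = new double[W, H];
double sum = 0;

// Cache-friendly: the inner loop walks adjacent memory.
for (int x = 0; x < W; x++)
    for (int y = 0; y < H; y++)
        sum += a[x, y];

// Cache-hostile: each inner step strides H doubles through memory,
// touching a new cache line almost every iteration.
for (int y = 0; y < H; y++)
    for (int x = 0; x < W; x++)
        sum += a[x, y];
```

Both loops compute the same sum; only the memory access pattern differs, and that difference is what the 800 ms → 150 ms change exploits.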


As you mentioned, swapping that List<> out for an array will give a noticeable performance boost.

If you switch to arrays, you could also use pointers to iterate over the values. You'll take a small performance hit for pinning the array so it doesn't get moved by the GC, but considering the size, the pros should outweigh the cons. You see this done a fair bit in the .NET Framework's source to squeeze every drop of performance out of hefty iterations.
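A hedged sketch of that pointer idea (requires compiling with `/unsafe`; the method name is mine): pin the array with `fixed` and sweep it linearly. A `double[,,]` is one contiguous block, so a single pointer can visit all `w * h * 3` values with no per-dimension index arithmetic or bounds checks:

```csharp
// Sum every element of a multi-dimensional array via a pinned pointer.
public static unsafe double SumAll(double[,,] data)
{
    double sum = 0;
    fixed (double* p = data)            // pins the array for the scope of the block
    {
        double* cur = p;
        double* end = p + data.Length;  // Length is the total element count across all dimensions
        while (cur < end)
            sum += *cur++;
    }
    return sum;
}
```

The pin is released when control leaves the `fixed` block, so the GC is only constrained for the duration of the sweep.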

You may also be able to use the new SIMD support for the actual calculations, but I don't know enough about the subject to give more details. I should mention that the SIMD features in .NET aren't fully complete yet and are still in beta.
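For what it's worth, here is a rough sketch of the SIMD idea using `System.Numerics.Vector<double>` (available via the System.Numerics.Vectors package on older frameworks; the helper name and flat-array layout are my assumptions, since `Vector<T>` only loads from single-dimension arrays). It accumulates `src * weight` into `dst` several lanes at a time:

```csharp
using System.Numerics;

static class SimdMix
{
    // dst[i] += src[i] * weight, vectorized in chunks of Vector<double>.Count lanes.
    public static void AddWeighted(double[] dst, double[] src, double weight)
    {
        int lanes = Vector<double>.Count;
        var wv = new Vector<double>(weight);   // broadcast the weight across all lanes
        int i = 0;
        for (; i <= dst.Length - lanes; i += lanes)
        {
            var d = new Vector<double>(dst, i);
            var s = new Vector<double>(src, i);
            (d + s * wv).CopyTo(dst, i);
        }
        for (; i < dst.Length; i++)            // scalar tail for the leftover elements
            dst[i] += src[i] * weight;
    }
}
```

Applied per normal map, this replaces the three scalar multiply-adds per pixel with a handful of wide operations.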


I bet you can double the speed if you swap the X and Y loops:

    public double[,,] Mix(double[] weights, double gain)
    {
        int w = normalMaps[0].GetLength(0);
        int h = normalMaps[0].GetLength(1);
        double[,,] ret = new double[w, h, 3];
        int normcount = normalMaps.Count;

        //for (int y = 0; y < h; y++)
        Parallel.For(0, w, x =>
        {
            for (int y = 0; y < h; y++)
            {
                . . .
            }
        });

        return ret;
    }

You want the innermost loop to run over the last array index and the outermost loop over the first array index. This gives the most cache-coherent access pattern. The compiler also doesn't have to do a multiply for every array lookup; it can just step an index. (I think I can explain that better if it would help...)
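That "multiply at each lookup" remark becomes visible if you flatten the indexing by hand (a sketch of my own, not code from the answer): on a `[w, h, 3]` array, `ret[x, y, z]` lives at flat offset `(x * h + y) * 3 + z`. With `y` then `z` innermost, the offset simply advances by 1 each step instead of being recomputed from scratch:

```csharp
int w = 4, h = 4;
var flat = new double[w * h * 3];

for (int x = 0; x < w; x++)
{
    int rowBase = x * h * 3;           // one multiply per row, hoisted out
    for (int y = 0; y < h; y++)
    {
        int idx = rowBase + y * 3;     // offsets advance sequentially through memory
        flat[idx + 0] += 1.0;
        flat[idx + 1] += 1.0;
        flat[idx + 2] += 1.0;
    }
}
```

This is essentially what the JIT does for a multi-dimensional indexer on every access; ordering the loops this way lets that arithmetic collapse to sequential pointer movement.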

EDIT: I have two other optimizations that gain another 15%. One is to apply the same change to the Z loop; that requires pulling the Z loop out of the main loop, which means going over the data twice, but it is still worth it. The other is to eliminate the extra lookups caused by indexing normalMaps[z] three times. Please do verify that the results are the same: I think splitting this into a separate pass is okay, but maybe I missed something.

    // Extract Z loop
    Parallel.For(0, normcount, z =>
    //for (int z = 0; z < normcount; z++)
    {
        //Parallel.For(0, w, x =>
        for (int x = 0; x < w; x++)
        {
            // I don't know why the compiler isn't smart enough to do this itself, but it actually matters
            double[,,] temp = normalMaps[z];
            //Parallel.For(0, h, y =>
            for (int y = 0; y < h; y++)
            {
                ret[x, y, 0] += temp[x, y, 0] * weights[z];
                ret[x, y, 1] += temp[x, y, 1] * weights[z];
                ret[x, y, 2] += temp[x, y, 2] * weights[z];
            }
        }
    });

    Parallel.For(0, w, x =>
    {
        for (int y = 0; y < h; y++)
        {
            ret[x, y, 0] *= gain;
            ret[x, y, 1] *= gain;
            ret[x, y, 2] *= gain;

            ret[x, y, 0] = Math.Max(-1, Math.Min(1, ret[x, y, 0]));
            ret[x, y, 1] = Math.Max(-1, Math.Min(1, ret[x, y, 1]));
            ret[x, y, 2] = Math.Max(-1, Math.Min(1, ret[x, y, 2]));

            double retnorm = Math.Sqrt(ret[x, y, 0] * ret[x, y, 0]
                                     + ret[x, y, 1] * ret[x, y, 1]
                                     + ret[x, y, 2] * ret[x, y, 2]);
            ret[x, y, 0] /= retnorm;
            ret[x, y, 1] /= retnorm;
            ret[x, y, 2] /= retnorm;
        }
    });


##### Comments

- Probably this would be a better place to ask: codereview.stackexchange.com
- Some suggestions: reverse the loops (`for (int z = 0; z < normcount; z++)` becomes `for (int z = normcount - 1; z >= 0; --z)`) since comparison to zero is faster, and cache array items within the loop, say `ret[x, y, 0]` - indexing can be time-consuming.
- Why do you need it to run faster?
- multi-dimensional arrays, humbug.
- In C# I can't say, but with C or C++ and 128-bit *xmm* registers with assembly code you would gain a great performance boost. You could write a DLL (in C or C++) and use it in your app.
- I changed my program to create arrays in the [H, W] order, not [W, H], so I swapped some variables in your code to match the current order. It made my code slower, and it also returns different values. The Math.Max and Math.Min part is supposed to clip any values above 1 and below -1, and then be multiplied by the variable gain. Multiplying FIRST gives different answers. Example: clip([0.1, 0, 2.0] * 2) = [0.2, 0, 1], but clip([0.1, 0, 2.0]) * 2 = [0.2, 0, 2].
- @PiotrJoniec I suspect I've over parallelized things in the outer function.
- @PiotrJoniec, in your example (in the question) you multiply by `gain` first.
- Wow, I had no idea. :\ But somehow I still get different results.
- @PiotrJoniec: how often do things get clipped? If it's often, I believe you can do a pretty good optimization in the mapCount for() loop: check whether each component (z0, z1, z2) has exceeded -1 or 1, and break out of the loop early.
- Just swapping x and y as you wrote made the compute time go from 280ms to 130 ms. BUT. I was using the same kind of loop in many different places of my program. So with this single answer you just optimized my ENTIRE PROGRAM. THANK. YOU.
- No worries, I learned a few things myself; namely, check this: stackoverflow.com/questions/468832/… - the benchmark seems to indicate that a multi-dimensional array is 2x slower than a jagged array, lol. Perhaps you can even get down to 75 ms.
- @PiotrJoniec, that's what I mean by broken.
- I've just tried out using arrays instead of List<> - calculation time went down from 320ms to 270ms. I wasn't expecting that much. Thanks for the answer, I'll have to try to use pointers just as you said.