Why is my OpenMP code much slower than the sequential version?

I am a newbie at programming with OpenMP. I wrote a simple C program to multiply a matrix by a vector. Unfortunately, when comparing execution times I found that the OpenMP version is much slower than the sequential one.

Here is my code (the matrix is N*N int, the vector is N int, the result is N long long):

#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size)
for (i = 0; i < m_size; i++)
{
    for (j = 0; j < m_size; j++)
    {
        result[i] += matrix[i][j] * vector[j];
    }
}

And this is the code for sequential way:

for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++)
        result[i] += matrix[i][j] * vector[j];

When I tried these two implementations with a 999x999 matrix and a 999-element vector, the execution times were:

Sequential: 5439 ms
Parallel:   11120 ms

I really cannot understand why OpenMP is so much slower than the sequential algorithm (over 2 times slower!). Can anyone explain what is going on?

Because when OpenMP distributes the work among threads, there is a lot of administration/synchronisation going on to ensure that the values in your shared matrix and vector are not corrupted somehow. Even though they are read-only: a human sees that easily, but your compiler may not.

Things to try out for pedagogic reasons:

0) What happens if matrix and vector are not shared?

1) Parallelize the inner "j-loop" first, keep the outer "i-loop" serial. See what happens.

2) Do not collect the sum in result[i]; accumulate it in a variable temp and assign temp to result[i] only after the inner loop has finished, to avoid repeated index lookups. Don't forget to initialise temp to 0 before the inner loop starts.
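Suggestion 2 can be sketched as follows. This assumes a flattened row-major matrix instead of the question's 2D `matrix[i][j]`; the function name `mat_vec` and the flat layout are illustrative, not the OP's code:

```c
#include <stddef.h>

/* Sketch of suggestion 2: accumulate each row's dot product in a local
   variable and write result[i] exactly once per row. */
void mat_vec(int m_size, const int *matrix, const int *vector, long long *result)
{
    #pragma omp parallel for
    for (int i = 0; i < m_size; i++) {
        long long temp = 0;                       /* init before the inner loop */
        for (int j = 0; j < m_size; j++)
            temp += (long long)matrix[(size_t)i * m_size + j] * vector[j];
        result[i] = temp;                         /* single write per row */
    }
}
```

Without `-fopenmp` the pragma is ignored and the function runs serially, so the same source works for both timings.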

Your code partially suffers from so-called false sharing, typical for all cache-coherent systems. In short, many elements of the result[] array fit in the same cache line. When thread i writes to result[i] as a result of the += operator, the cache line holding that part of result[] becomes dirty. The cache coherency protocol then invalidates all copies of that cache line in the other cores, and they have to refresh their copy from the upper-level cache or from main memory. As result is an array of long long, one cache line (64 bytes on x86) holds 8 elements, and besides result[i] there are 7 other array elements in the same cache line. Therefore it is possible that two "neighbouring" threads will constantly fight for ownership of the cache line (assuming that each thread runs on a separate core).
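The cache-line arithmetic above can be made concrete with a small sketch. It assumes 64-byte cache lines and 8-byte long long, which holds on typical x86 systems; the names are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

#define LINE 64   /* assumed cache-line size in bytes (typical on x86) */

/* 64-byte alignment makes the grouping exact: elements 0..7 share the
   first line, elements 8..15 the second, and so on. */
static _Alignas(LINE) long long result[16];

/* Which cache line holds result[i]? Neighbouring threads writing to
   indices with the same line number contend for that line. */
static size_t line_of(size_t i)
{
    return (size_t)((uintptr_t)&result[i] / LINE);
}
```

With sizeof(long long) == 8, `line_of(0) == line_of(7)` but `line_of(8)` is the next line, which is exactly the "7 other elements" situation described above.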

To mitigate false sharing in your case, the easiest thing to do is to ensure that each thread gets an iteration block whose size is divisible by the number of elements in a cache line. For example, you can apply schedule(static, something*8), where something should be big enough that the iteration space is not fragmented into too many pieces, but at the same time small enough that each thread gets a block. E.g. for m_size equal to 999 and 4 threads you would apply the schedule(static,256) clause to the parallel for construct.
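Applied to the loop in the question, the clause might look like this (a sketch, again assuming a flat row-major layout; `mat_vec_chunked` is an illustrative name):

```c
#include <stddef.h>

/* Sketch: give each thread a 256-iteration block. 256 is divisible by 8
   (long long elements per 64-byte cache line), so two threads never
   write into the same cache line of result[]. */
void mat_vec_chunked(int m_size, const int *matrix, const int *vector, long long *result)
{
    #pragma omp parallel for schedule(static, 256)
    for (int i = 0; i < m_size; i++) {
        long long sum = 0;
        for (int j = 0; j < m_size; j++)
            sum += (long long)matrix[(size_t)i * m_size + j] * vector[j];
        result[i] = sum;
    }
}
```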

Another partial reason the code runs slower might be that, when OpenMP is enabled, the compiler becomes reluctant to apply some optimisations when shared variables are assigned to. OpenMP provides a so-called relaxed memory model, where the local memory view of a shared variable is allowed to differ between threads, and the flush construct is provided in order to synchronise the views. But compilers usually treat shared variables as implicitly volatile if they cannot prove that other threads would not need to access desynchronised shared variables. Your case is one of those, since result[i] is only assigned to and its value is never used by other threads. In the serial case the compiler would most likely create a temporary variable to hold the result of the inner loop and would only assign to result[i] once the inner loop has finished. In the parallel case it might decide that this would create a temporarily desynchronised view of result[i] in the other threads and hence decide not to apply the optimisation. Just for the record, GCC 4.7.1 with -O3 -ftree-vectorize does the temporary-variable trick both with OpenMP enabled and without.

I did this in reference to Hristo's comment. I tried using schedule(static, 256). For me, changing the default chunk size does not help; maybe it even makes things worse. I printed out the thread number and its index, with and without setting the schedule, and it is clear that OpenMP already chooses thread indices that are far from one another, so false sharing does not seem to be an issue here. For me, this code already gives a good boost with OpenMP.

#include <stdio.h>
#include <omp.h>

void loop_parallel(const int *matrix, const int ld, const int *vector, long long *result, const int m_size) {
    #pragma omp parallel for schedule(static, 250)
    //#pragma omp parallel for
    for (int i = 0; i < m_size; i++) {
        //printf("%d %d\n", omp_get_thread_num(), i);
        long long sum = 0;
        for (int j = 0; j < m_size; j++) {
            sum += matrix[i*ld + j] * vector[j];
        }
        result[i] = sum;
    }
}

void loop(const int *matrix, const int ld, const int *vector, long long *result, const int m_size) {
    for (int i = 0; i < m_size; i++) {
        long long sum = 0;
        for (int j = 0; j < m_size; j++) {
            sum += matrix[i*ld + j] * vector[j];
        }
        result[i] = sum;
    }
}

int main() {
    const int m_size = 1000;
    int *matrix = new int[m_size*m_size]();   // value-initialise so the timed loops read defined memory
    int *vector = new int[m_size]();
    long long *result = new long long[m_size];
    double dtime;

    dtime = omp_get_wtime();
    loop(matrix, m_size, vector, result, m_size);
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);

    dtime = omp_get_wtime();
    loop_parallel(matrix, m_size, vector, result, m_size);
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);

    delete[] matrix;
    delete[] vector;
    delete[] result;
    return 0;
}

Comments
  • How many cores are you using with OpenMP?
  • How do you measure execution time? Do you use the dreaded clock() / CLOCKS_PER_SEC method?
  • Thanks for your reply! Although the first two suggestions were not so useful, the last one did reduce the execution time to 6838 ms.
  • Glad I could help. Could you maybe share the timings? That could be instructive to others. And maybe click on the upward arrow left to my reply :-)
  • This is a common misconception. Most OpenMP implementations do absolutely nothing in order to protect shared variables from possible data races. It is the programmer's job to ensure that no races occur by explicitly adding synchronisation primitives.
  • That's a very interesting comment @Hristo! I would normally loop over m_size/256 and then jump to 256*i, or something like that, instead of using schedule(static,256). I'm going to try some of your suggestions this week.
  • I posted some code as an answer trying your suggestion. It makes no difference for me to set the block size. Setting the block size is one of the areas of OpenMP I'm least familiar with. Maybe it's not safe to assume that OpenMP will distribute the threads in a way that avoids false sharing in general, so it's better to set the block size yourself?
  • As with many other cases, it depends. Some scenarios benefit from setting a specific block size; in others the default suffices. In this particular case it is impossible to say what the reason for the slowdown is, with so little information provided by the OP.
  • The default chunk size for static loop scheduling is the ratio between the number of iterations and the number of threads. How false sharing affects the execution time also depends on the work being done in relation to the size of the "collision domains" (i.e. shared cache lines). The computations here are too simple and the number of iterations is too small to judge whether a different chunk size has any effect. Also, you are measuring the overhead of creating the very first parallel region, which varies. Call loop_parallel twice with the same arguments and only measure the second invocation.
  • If you run this and change the order of the functions, the second function will almost always finish faster.