## dot product using cblas is slow

cblas_ddot

I want to calculate the product A^T*A ( A is 2000x1000 Matrix). Also i only want to solve the upper triangular Matrix. In the inner loop i have to solve the dot product of two vectors.

Now, here is the problem. Using cblas ddot() is not faster than calculating the dot product with a loop. How is this possible? (using Intel Core (TM)i7 CPU M620 @2,67GHz, 1,92GB RAM)

The problem is caused essentially by matrix size, not by ddot. Your matrices are so large that they do not fit in the cache memory. The solution is to rearrange the three nested loops such that as much as possible can be done with a line in cache, so reducing cache refreshes. A model implementation follows for both the ddot and an daxpy approach. On my computer the time consumption was about 15:1. In other words: never, never, never program a matrix multiplication along the "row times column" scheme that we learned in school.

/* Matrix product of A^T * A by two methods. 1) "Row times column" as we learned in school. 2) With rearranged loops such that need for cash refreshes is reduced (this can be improved even more). Compile: gcc -o aT_a aT_a.c -lgslcblas -lblas -lm */ #include <stdio.h> #include <stdlib.h> #include <time.h> #include <cblas.h> #define ROWS 2000 #define COLS 1000 static double a[ROWS][COLS]; static double c[COLS][COLS]; static void dot() { int i, j; double *ai, *bj; ai = a[0]; for (i=0; i<COLS; i++) { bj = a[0]; for (j=0; j<COLS; j++) { c[i][j] = cblas_ddot(ROWS,ai,COLS,bj,COLS); bj += 1; } ai += 1; } } static void axpy() { int i, j; double *ci, *bj, aij; for (i=0; i<COLS; i++) { ci = c[i]; for (j=0; j<COLS; j++) ci[j] = 0.; for (j=0; j<ROWS; j++) { aij = a[j][i]; bj = a[j]; cblas_daxpy(COLS,aij,bj,1,ci,1); } } } int main(int argc, char** argv) { clock_t t0, t1; int i, j; for (i=0; i<ROWS; ++i) for (j=0; j<COLS; ++j) a[i][j] = i+j; t0 = clock(); dot(); t0 = clock(); printf("Time for DOT : %f sec.\n",(double)t0/CLOCKS_PER_SEC); axpy(); t1 = clock(); printf("Time for AXPY: %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC); return 0; }

**dot product is slow! · Issue #131 · xtensor-stack/xtensor-blas · GitHub,** I'm trying to reach the same performance as NumPy. With this code below I evaluate the speed of dot product using xtensor-blas: Dear all, I run benchmarks on a sandy-bridge Intel processor (E5-4620) using Intel MKL 11.1. Here, I have found that cblas_dnrm2 is significantly slower (3.4 s) than the corresponding cblas_ddot call (0.5 s) using one thread.

The CBLAS dot product is effectively just a computation in slightly unrolled loop. The netlib Fortran is just this:

DO I = MP1,N,5 DTEMP = DTEMP + DX(I)*DY(I) + DX(I+1)*DY(I+1) + $ DX(I+2)*DY(I+2) + DX(I+3)*DY(I+3) + DX(I+4)*DY(I+4) END DO

ie. just a loop unrolled to a stride of 5.

If you must use a `ddot`

style dot product for your operation, you might get a performance boost by re-writing your loop to use SSE2 intrinsics:

#include <emmintrin.h> double ddotsse2(const double *x, const double *y, const int n) { double result[2]; int n2 = 2 * (n/2); __m128d dtemp; if ( (n % 2) == 0) { dtemp = _mm_setzero_pd(); } else { dtemp = _mm_set_sd(x[n] * y[n]); } for(int i=0; i<n2; i+=2) { __m128d x1 = _mm_loadr_pd(x+i); __m128d y1 = _mm_loadr_pd(y+i); __m128d xy = _mm_mul_pd(x1, y1); dtemp = _mm_add_pd(dtemp, xy); } _mm_store_pd(&result[0],dtemp); return result[0] + result[1]; }

(not tested, never been compiled, buyer beware).

This may or may be faster than the standard BLAS implementation. You may also want to investigate whether further loop unrolling could improve performance.

**BLAS dotproduct slower than normal one · Issue #1405 · xianyi ,** To optimize timting I thought about using OpenBLAS. I changed this: BLAS dotproduct slower than normal one #1405. Closed. JonasNey Computes the dot product of two vectors (double-precision). func cblas _zdotc _sub (Int32, Unsafe Raw Pointer!, Int32, Unsafe Raw Pointer!, Int32, Unsafe Mutable Raw Pointer!) Calculates the dot product of the complex conjugate of a double-precision complex vector with a second double-precision complex vector.

If you're not using SSE2 intrinsics or using a data type that may not boost performance with them, you can try to transpose the matrix for an easy improvement in performance for larger matrix multiplications with `cblas_?dot`

. Performing the matrix multiplication in blocks also helps.

void matMulDotProduct(int n, float *A, float* B, int a_size, int b_size, int a_row, int a_col, int b_row, int b_col, float *C) { int i, j, k; MKL_INT incx, incy; incx = 1; incy = b_size; //copy out multiplying matrix from larger matrix float *temp = (float*) malloc(n * n * sizeof(float)); for (i = 0; i < n; ++i) { cblas_scopy(n, &B[(b_row * b_size) + b_col + i], incy, &temp[i * n], 1); } //transpose mkl_simatcopy('R', 'T', n, n, 1.0, temp, 1, 1); for (i = 0; i < n; i+= BLOCK_SIZE) { for (j = 0; j < n; j++) { for (k = 0; k < BLOCK_SIZE; ++k) { C[((i + k) * n) + j] = cblas_sdot(n, &A[(a_row + i + k) * a_size + a_col], incx, &temp[n * j], 1); } } } free(temp); }

On my machine, this code is about 1 order of magnitude faster than the the 3 loop code (but also 1 order of magnitude slower than cblas_?gemm call) for single precision floats and 2K by 2K matrices. (I'm using Intel MKL).

**PerformanceTips - SciPy wiki dump,** Note that the built-in python min function can be much slower (up to 300-500 times) The reason is that the dot product uses underlying BLAS operations which When testing Theano to make sure that it was using the right BLAS, I noticed that vector-vector dot products seem a little slow (when compared to NumPy with the same BLAS). DebugPrint shows that Theano is using a matrix-vector BLAS function to compute the vector-vector dot product, which might be the cause of this speed difference.

**How fast can we compute the dot product?,** In these pages, we will see how fast we can compute a dot product between two vectors are stored in consecutive memory locations (the BLAS incx parameter is 1 corresponding to the best speed for "large" vectors stored in the L1 cache. Computes the double-precision dot product of a pair of single-precision vectors. cblas _zdotc _sub Calculates the dot product of the complex conjugate of a double-precision complex vector with a second double-precision complex vector.

**Applied Parallel and Scientific Computing: 10th International ,** 3.1 Dot Products and BLAS Support Since version 2.3.0, C-XSC includes the dot products) available, it has to be simulated in software and thus is very slow. Computes the double-precision dot product of a pair of single-precision vectors. func cblas _zdotc _sub (Int32, Unsafe Raw Pointer!, Int32, Unsafe Raw Pointer!, Int32, Unsafe Mutable Raw Pointer!) Calculates the dot product of the complex conjugate of a double-precision complex vector with a second double-precision complex vector.

**Numerical Linear Algebra: A Concise Introduction with MATLAB and Julia,** A Concise Introduction with MATLAB and Julia Folkmar Bornemann Furthermore, outer product calculations are slower by a factor of approximately 30 (which is the Level-2 BLAS (1988): matrix-vector operations, replacing two for-loops. You can run the script successively using different BLAS implementations, either in different virtual environments or using a different BLAS configuration in site.cfg. For a fair comparison, you should set your CPU to a constant clock frequency and not perform any other heavy lifting while the scripts are running, especially when they make use of multiple cores on your machine.

##### Comments

- Could you try explaining a little more clearly what you are trying to calculate? is it
`triu(A.T * A)`

or is it`triu(A).T * triu(A)`

, or something else again? - thank you for your answer, but unfortunately this isn`t faster. Matlab solve this problem 7x faster than c++, but i have to do this in c++, without any loss of speed. Have someone an idee to boost this up.
- @user1235183: What problem does Matlab solve faster - the dot product or the matrix multiply?