Cannot understand why more vectorization is slower than less vectorization in this case?

I ran into a practical problem in which more vectorization is slower than less vectorization. I have simplified it into the following toy model. In the code below I use two different methods to implement the same functionality: f1 uses one loop, f2 uses two loops. Naively one would expect f1 to be faster than f2, yet f2 is faster than f1. I can't understand why this happens.

import numpy as np
def time_function(f, *args):
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

def f2(X, Y):
    d = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            d[i,j] = np.sum(np.square(X[i] - Y[j])) 
    return d

def f1(X, Y):
    d = np.zeros((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        d[i] = np.sum(np.square(Y - X[i]), axis=1) 
    return d

X = np.ones((500, 3072))
Y = np.ones((5000, 3072))

time2 = time_function(f2,X,Y)
print('Two loop version took %f seconds' % time2)

time1 = time_function(f1,X,Y)
print('One loop version took %f seconds' % time1)
Two loop version took 24.691270 seconds
One loop version took 31.785896 seconds

I suspect the culprit hides in this line of f1:

d[i] = np.sum(np.square(Y - X[i]), axis=1) 

You subtract X[i] from the whole Y array on every iteration. That broadcast touches all 5000 rows of Y and allocates a large (5000, 3072) temporary each time, whereas f2 only ever subtracts two individual rows of length 3072 at a time.
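
To make the size difference concrete, here is a rough sketch (my own illustration, using the X and Y from the question, float64) of the temporaries each version allocates:

# temporary allocated by (Y - X[i]) on every outer iteration of f1
print((Y - X[0]).nbytes)      # 5000 * 3072 * 8 = 122880000 bytes, ~123 MB
# temporary allocated by (X[i] - Y[j]) on every inner iteration of f2
print((X[0] - Y[0]).nbytes)   # 3072 * 8 = 24576 bytes, ~24 KB

np.square then allocates another temporary of the same size before np.sum reduces it, so f1 churns through a couple hundred megabytes of temporaries per outer iteration.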


On my machine, f1 is slightly faster. But a fully vectorized version

D3=(np.square(Y[None,:,:]-X[:,None,:])).sum(axis=2)

gives a memory error because it has to create a (500, 5000, 3072) array (~57 GiB).
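
For reference, that figure is simply the element count times 8 bytes per float64:

print(500 * 5000 * 3072 * 8 / 2**30)   # ~57.2 GiB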

There is a trade-off between iterations and memory management. Often a few iterations over a relatively complex task are faster than one step that requires allocating a much larger array. In your case there's the Y - X difference, but then it also has to square the result (another temporary of the same large size) before the sum reduces the dimensions.

In [23]: timeit -n1 -r1 f2(X,Y)                                                                
1min 21s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [24]: timeit -n1 -r1 f1(X,Y)                                                                
1min 13s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

A variation on f1 that iterates over the 5000 rows of Y (instead of the 500 rows of X) comes in at 1min 25s. A few iterations are generally better than many, provided memory-management issues don't penalize you.
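
(Side note, not part of the original answers: the fully vectorized result can also be obtained without the huge 3-D temporary by expanding the square, since sum((x - y)**2) = sum(x**2) + sum(y**2) - 2*sum(x*y); the cross term then becomes a single matrix product and the largest temporary is only (500, 5000). A sketch:)

D_expand = (np.square(X).sum(axis=1)[:, None]    # shape (500, 1)
            + np.square(Y).sum(axis=1)[None, :]  # shape (1, 5000)
            - 2.0 * X @ Y.T)                     # shape (500, 5000)
# should satisfy np.allclose(D_expand, f2(X, Y)) up to floating-point round-off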

numba

An iterative, but compiled, version using numba is noticeably better (D below is the reference result computed with f1/f2 above):

In [32]: import numba 
    ...: @numba.jit(nopython=True) 
    ...: def f3(X, Y): 
    ...:     d = np.zeros((X.shape[0], Y.shape[0])) 
    ...:     for i in range(X.shape[0]): 
    ...:         for j in range(Y.shape[0]): 
    ...:             d[i,j] = np.sum(np.square(X[i] - Y[j]))  
    ...:     return d 
    ...:                                                                                       
In [33]: D3=f3(X,Y)                                                                            
In [34]: np.allclose(D,D3)                                                                     
Out[34]: True
In [35]: timeit -n1 -r1 f3(X,Y)                                                                
20 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

and making the sum iteration explicit shaves off some more time:

In [36]: @numba.jit(nopython=True) 
    ...: def f4(X, Y): 
    ...:     d = np.zeros((X.shape[0], Y.shape[0])) 
    ...:     for i in range(X.shape[0]): 
    ...:         for j in range(Y.shape[0]): 
    ...:             for k in range(X.shape[1]): 
    ...:                 d[i,j] += (X[i,k] - Y[j,k])**2  
    ...:     return d 
    ...:      
    ...:                                                                                       
In [37]: D4 = f4(X,Y)                                                                          
In [38]: np.allclose(D,D4)                                                                     
Out[38]: True
In [39]: timeit -n1 -r1 f4(X,Y)                                                                
10.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
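
(Another variation, not in the original answer: numba can also parallelize the outer loop with prange when parallel=True is passed to the decorator, which may reduce the wall-clock time further on a multi-core machine. A sketch:)

@numba.jit(nopython=True, parallel=True)
def f5(X, Y):
    d = np.zeros((X.shape[0], Y.shape[0]))
    for i in numba.prange(X.shape[0]):        # parallel outer loop over rows of X
        for j in range(Y.shape[0]):
            s = 0.0
            for k in range(X.shape[1]):
                s += (X[i, k] - Y[j, k]) ** 2
            d[i, j] = s
    return d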


I see iteration in both definitions. Vectorization only pays off on large amounts of data, and you're effectively vectorizing just one row's worth of values at a time. Python looping is perfectly capable of handling that; the fixed overhead of setting up each vectorized call is too large in comparison here. Your f1 isn't fully vectorized because of the iteration involved, so this comparison isn't really fair to vectorization when plain iteration would suffice.

tl;dr: have enough values that the vectorization "setup" cost is offset by the gains of vectorization itself.
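
A rough way to see that fixed per-call cost (my own illustrative timing; exact numbers vary by machine):

import numpy as np, timeit

tiny = np.ones(8)
row = np.ones(3072)   # one "row" worth of work, as in f1/f2

t_tiny = timeit.timeit(lambda: np.square(tiny).sum(), number=10000)
t_row = timeit.timeit(lambda: np.square(row).sum(), number=10000)

# the two timings are the same order of magnitude even though `row` holds
# ~400x more data: the per-call overhead dominates for operands this small
print(t_tiny, t_row)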

No, vectorized code is not always faster. If the vectorization requires the creation of large temporary arrays, loops are often faster. Allocating memory is expensive, because it can trigger garbage collection or even disk swapping.


Comments
  • Sidenote: Have a look at timeit for more convenient execution time checking
  • Does this case teach me that I need to avoid vectorization that involves large broadcasting?
  • @maplemaple no, it teaches you to avoid vectorization if the number of data points is low.
  • @ParitoshSingh Why? It seems the gap gets larger when Y.shape goes from (5000, 3072) to (50000, 3072)
  • @maplemaple you didn't vectorize for the full arrays. you're effectively only vectorizing one "row" worth of points at a time
  • @maplemaple The phenomenon is pretty simple: the f1 computation touches much more data per iteration than f2, and even worse, it involves a lot of implicit data copying hidden in the broadcasting. That data cost overwhelms any benefit from vectorization.
  • A fully vectorized version, using Y[None,:,:] - X[:,None,:], creates a (500, 5000, 3072) array, which may be too big for memory!