## Efficient way of vectorizing distance calculation

For my studies, I have to implement pairwise L1-distance calculation between two sets of vectors, each represented as a NumPy matrix (vectors are rows). It has to be done in three ways: with two loops, with one loop, and with no loops. Since NumPy is so good at vectorization, I expected the implementations to rank two-loops slower than one-loop slower than no-loops.

I've written the functions:

```python
import numpy as np

def f_cdist_2(X1, X2):
    # Two loops: compute one distance per iteration
    res = np.zeros(shape=(X1.shape[0], X2.shape[0]), dtype=np.float64)
    for ix1 in range(X1.shape[0]):
        for ix2 in range(X2.shape[0]):
            res[ix1, ix2] = np.abs(X1[ix1, :] - X2[ix2, :]).sum()
    return res

def f_cdist_1(X1, X2):
    # One loop: compute a full row of the result per iteration
    res = np.zeros(shape=(X1.shape[0], X2.shape[0]), dtype=np.float64)
    for ix1 in range(X1.shape[0]):
        res[ix1, :] = np.abs(np.tile(X1[ix1, :], (X2.shape[0], 1)) - X2).sum(axis=1)
    return res

def f_cdist_0(X1, X2):
    # No loops: tile both inputs into 3-D arrays, subtract, and reduce
    res = np.abs(
        np.tile(X1[:, :, np.newaxis], (1, 1, X2.shape[0]))
        - np.tile(X2.T[np.newaxis, :, :], (X1.shape[0], 1, 1))
    ).sum(axis=1)
    return res
```
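As a quick sanity check (my addition, not part of the original post; the helper names are mine), the looped and fully vectorized approaches can be compared on small random inputs, and they should agree to floating-point precision:

```python
import numpy as np

def l1_loops(X1, X2):
    # Reference: explicit double loop, one |x1 - x2| sum per pair of rows
    res = np.zeros((X1.shape[0], X2.shape[0]), dtype=np.float64)
    for i in range(X1.shape[0]):
        for j in range(X2.shape[0]):
            res[i, j] = np.abs(X1[i] - X2[j]).sum()
    return res

def l1_broadcast(X1, X2):
    # Vectorized: broadcast (n1, 1, d) against (1, n2, d), then reduce over d
    return np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)

rng = np.random.default_rng(0)
X1, X2 = rng.random((8, 5)), rng.random((6, 5))
assert np.allclose(l1_loops(X1, X2), l1_broadcast(X1, X2))
```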

Then I tested the performance with two random matrices of shapes 128 × 512 and 256 × 512. Based on 100 runs, I got the following results:

Two loops: 156 msec

One loop: 32 msec

No loops: 135 msec

I also tried `cdist` from `scipy.spatial.distance`, and got the best performance of all: 9 msec.

Now, is there a better way to implement the no-loops function? I hoped it would perform at least as well as the one-loop version, but for now I'm at a loss as to how to improve it.

**UPDATE**

Using *kwinkunks*'s implementation of the no-loops approach, I re-ran the tests (again 100 trials) on 1024 × 1024 matrices; the results are below:

Two loops: 5.7 sec

One loop: 6.6 sec

No loops: 3.9 sec

`scipy.spatial.distance.cdist`: 0.6 sec

So on larger matrices, the no-loops implementation indeed works better. `scipy` works wonders, but if I understand correctly, it is written in C, hence the great performance.
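For reference, the SciPy call used in these benchmarks is a one-liner; `metric='cityblock'` is SciPy's name for the L1 (Manhattan) distance:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X1 = rng.random((128, 512))
X2 = rng.random((256, 512))

# 'cityblock' selects the L1 metric; the result is the (128, 256) distance matrix
D = cdist(X1, X2, metric='cityblock')
```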

**UPDATE**

Tried with 4096 × 1024 matrices of `np.float64`, same setup:

Two loops: 88 sec

One loop: 66 sec

No loops: ran out of memory (I had ~18 GB of free RAM at the time)

`scipy.spatial.distance.cdist`: 13 sec
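One way to keep the broadcasting approach from exhausting memory (my own sketch, not part of the original benchmarks; the function name and `block` parameter are mine) is to process `X1` in blocks, so the temporary array never exceeds `block × n2 × d` elements instead of `n1 × n2 × d`:

```python
import numpy as np

def f_cdist_blocked(X1, X2, block=64):
    # Broadcast over a block of X1 rows at a time; the temporary array
    # is (block, n2, d) instead of (n1, n2, d), bounding peak memory.
    res = np.empty((X1.shape[0], X2.shape[0]), dtype=np.float64)
    for start in range(0, X1.shape[0], block):
        stop = start + block
        res[start:stop] = np.abs(X1[start:stop, None, :] - X2).sum(axis=-1)
    return res
```

This trades a short Python loop (a few dozen iterations) for bounded memory; the block size can be tuned to fit the available RAM.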

You can get an extra speedup over the vectorized version by using Pythran.

f_dist.py:

```python
import numpy as np

#pythran export f_dist(float64[:,:], float64[:,:])
def f_dist(X1, X2):
    return np.sum(np.abs(X1[:, None, :] - X2), axis=-1)
```

On my laptop, the original version runs at:

```
> python -m timeit -s 'from f_dist import f_dist; from numpy.random import random; x = random((100,100)); y = random((100,100))' 'f_dist(x, y)'
100 loops, best of 3: 7.05 msec per loop
```

Once you compile the kernel:

```
> pythran f_dist.py
```

You can benchmark it:

```
> python -m timeit -s 'from f_dist import f_dist; from numpy.random import random; x = random((100,100)); y = random((100,100))' 'f_dist(x, y)'
1000 loops, best of 3: 1.21 msec per loop
```

Using SIMD instructions speeds up the computation further:

```
> pythran f_dist.py -DUSE_XSIMD -march=native
> python -m timeit -s 'from f_dist import f_dist; from numpy.random import random; x = random((100,100)); y = random((100,100))' 'f_dist(x, y)'
1000 loops, best of 3: 774 usec per loop
```

Disclaimer: I'm the core dev of the pythran project.


You can avoid the tiling etc. with NumPy's broadcasting:

```python
def f_dist(X1, X2):
    return np.sum(np.abs(X1[:, None, :] - X2), axis=-1)
```

But, surprisingly (to me anyway), it's not faster than your loop (ca. 90 ms on my machine, compared to 24 ms for your `f_cdist_1()` function).
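A back-of-envelope calculation (my addition, using the 128 × 512 and 256 × 512 shapes from the question) shows why: `np.abs(X1[:, None, :] - X2)` materializes the full 3-D difference array before the sum.

```python
# Size of the temporary (n1, n2, d) array that broadcasting allocates
# before the reduction, for the shapes benchmarked in the question:
n1, n2, d = 128, 256, 512
bytes_per_float64 = 8
temp_bytes = n1 * n2 * d * bytes_per_float64
print(temp_bytes / 1024**2)  # 128.0 -> a 128 MiB intermediate per call
```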

That broadcasting trick is often useful. It means you can do things like this:

```python
>>> np.array([1, 2, 3]) * np.array([10, 20, 30])[:, None]
array([[10, 20, 30],
       [20, 40, 60],
       [30, 60, 90]])
```


##### Comments

- Interesting study! I'm curious whether the results will be any different if you use much bigger matrices. Have you tried with much bigger shapes?
- That is a good insight. I think the memory of the broadcast of the large arrays is creating considerable overhead. This is a surprising result!
- Tried just now, updated post with results. No-loops now is the best of three. Will try with even larger shapes, but that's going to take significant time. I'll leave it overnight.
- You can do this a lot faster than cdist, e.g. stackoverflow.com/a/42994680/4045774 or stackoverflow.com/a/53380192/4045774
- @max9111, a lot faster than `cdist` from `scipy`? Could you please say how? I browsed through the questions you suggested and implemented the distance calculation with `np.einsum` like this: `np.einsum('ijk -> ij', np.abs(X1[:, None, :] - X2[None, :, :]))`, but it performs pretty much like the no-loops implementation above. Is there a better way?
- Could you also benchmark with `#pythran export f_dist(float64[:,::1], float64[:,::1])`? At least in Numba or Cython, declaring non-contiguous arrays often hurts SIMD vectorization.
- It triggers a compilation error :-) Thanks for the extra test case :-) *In theory*, a non-contiguous array would prevent vectorization.
- Bug fixed locally, and as expected, there's no SIMD processing of the main loop when using a sliced array.
- I knew there was a smarter way than using tiling! With your code no-loops takes 50 ms vs. 25 ms one-loop. Still slower, unfortunately, but much closer now.
- I've a feeling it's slower because it has to allocate a large array of shape (128, 256, 512) before summing it.