## What is the fastest way to map group names of numpy array to indices?

I'm working with a 3D point cloud from Lidar. The points are given as a numpy array that looks like this:

```python
points = np.array([[61651921, 416326074, 39805], [61605255, 416360555, 41124],
                   [61664810, 416313743, 39900], [61664837, 416313749, 39910],
                   [61674456, 416316663, 39503], [61651933, 416326074, 39802],
                   [61679969, 416318049, 39500], [61674494, 416316677, 39508],
                   [61651908, 416326079, 39800], [61651908, 416326087, 39802],
                   [61664845, 416313738, 39913], [61674480, 416316668, 39503],
                   [61679996, 416318047, 39510], [61605290, 416360572, 41118],
                   [61605270, 416360565, 41122], [61683939, 416313004, 41052],
                   [61683936, 416313033, 41060], [61679976, 416318044, 39509],
                   [61605279, 416360555, 41109], [61664837, 416313739, 39915],
                   [61674487, 416316666, 39505], [61679961, 416318035, 39503],
                   [61683943, 416313004, 41054], [61683930, 416313042, 41059]])
```

**I'd like to keep my data grouped into cubes of size 50*50*50, so that every cube preserves some hashable index and the numpy indices of the points it contains**. In order to get the splitting, I assign

```python
cubes = points // 50
```

which outputs:

```python
cubes = np.array([[1233038, 8326521, 796], [1232105, 8327211, 822],
                  [1233296, 8326274, 798], [1233296, 8326274, 798],
                  [1233489, 8326333, 790], [1233038, 8326521, 796],
                  [1233599, 8326360, 790], [1233489, 8326333, 790],
                  [1233038, 8326521, 796], [1233038, 8326521, 796],
                  [1233296, 8326274, 798], [1233489, 8326333, 790],
                  [1233599, 8326360, 790], [1232105, 8327211, 822],
                  [1232105, 8327211, 822], [1233678, 8326260, 821],
                  [1233678, 8326260, 821], [1233599, 8326360, 790],
                  [1232105, 8327211, 822], [1233296, 8326274, 798],
                  [1233489, 8326333, 790], [1233599, 8326360, 790],
                  [1233678, 8326260, 821], [1233678, 8326260, 821]])
```

##### My desired output looks like this:

```python
{(1232105, 8327211, 822): [1, 13, 14, 18],
 (1233038, 8326521, 796): [0, 5, 8, 9],
 (1233296, 8326274, 798): [2, 3, 10, 19],
 (1233489, 8326333, 790): [4, 7, 11, 20],
 (1233599, 8326360, 790): [6, 12, 17, 21],
 (1233678, 8326260, 821): [15, 16, 22, 23]}
```

My real point cloud contains up to a few hundred million 3D points. What is the fastest way to do this kind of grouping?

I've tried various solutions. Here is a comparison of time consumption, assuming the size of `points` is around 20 million and the number of distinct cubes is around 1 million:

##### Pandas [tuple(elem) -> np.array(dtype=int64)]

```python
import pandas as pd

print(pd.DataFrame(cubes).groupby([0, 1, 2]).indices)  # takes 9 sec
```

##### Defaultdict [elem.tobytes() or tuple -> list]

```python
# thanks @abc:
from collections import defaultdict

result = defaultdict(list)
for idx, elem in enumerate(cubes):
    result[elem.tobytes()].append(idx)  # takes 20.5 sec
    # result[elem[0], elem[1], elem[2]].append(idx)  # takes 27 sec
    # result[tuple(elem)].append(idx)  # takes 50 sec
```

##### numpy_indexed [int -> np.array]

```python
# thanks @Eelco Hoogendoorn for his library
import numpy_indexed as npi

values = npi.group_by(cubes).split(np.arange(len(cubes)))
result = dict(enumerate(values))  # takes 9.8 sec
```

##### Pandas + dimensionality reduction [int -> np.array(dtype=int64)]

```python
# thanks @Divakar for showing the numexpr library:
import numexpr as ne

def dimensionality_reduction(cubes):
    # cubes = cubes - np.min(cubes, axis=0)  # in case some coords are negative
    cubes = cubes.astype(np.int64)
    s0, s1 = cubes[:, 0].max() + 1, cubes[:, 1].max() + 1
    d = {'s0': s0, 's1': s1, 'c0': cubes[:, 0], 'c1': cubes[:, 1], 'c2': cubes[:, 2]}
    c1D = ne.evaluate('c0+c1*s0+c2*s0*s1', d)
    return c1D

cubes = dimensionality_reduction(cubes)
result = pd.DataFrame(cubes).groupby([0]).indices  # takes 2.5 seconds
```

It's possible to download the `cubes.npz` file here and use the command

```python
cubes = np.load('cubes.npz')['array']
```

to check the performance.

##### Constant number of indices per group

##### Approach #1

We can perform dimensionality-reduction to reduce `cubes` to a 1D array. This is based on a mapping of the given cubes data onto an n-dim grid to compute the linear-index equivalents, discussed in detail here. Then, based on the uniqueness of those linear indices, we can segregate unique groups and their corresponding indices. Hence, following those strategies, we would have one solution, like so -

```python
N = 4  # number of indices per group
c1D = np.ravel_multi_index(cubes.T, cubes.max(0) + 1)
sidx = c1D.argsort()
indices = sidx.reshape(-1, N)
unq_groups = cubes[indices[:, 0]]

# If you need it in a zipped dictionary format
out = dict(zip(map(tuple, unq_groups), indices))
```
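For intuition: `np.ravel_multi_index` treats each coordinate triple as an index into a dense grid of shape `cubes.max(0) + 1` and returns the corresponding flat position, so identical rows collapse to identical linear ids. A minimal sketch on toy data (the values here are made up for illustration):

```python
import numpy as np

toy = np.array([[0, 1, 2],
                [0, 1, 2],
                [1, 0, 0]])
dims = toy.max(0) + 1                    # grid shape, here (2, 2, 3)
lin = np.ravel_multi_index(toy.T, dims)  # -> array([5, 5, 6])
# the two identical rows get the same linear id, the third gets its own
```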

**Alternative #1 :** If the integer values in `cubes` are too large, we might want to do the dimensionality-reduction such that the dimensions with shorter extents are chosen as the primary axes. Hence, for those cases, we can modify the reduction step to get `c1D`, like so -

```python
s1, s2 = cubes[:, :2].max(0) + 1
s = np.r_[s2, 1, s1 * s2]
c1D = cubes.dot(s)
```
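The dot product expands to `c1D = c0*s2 + c1 + c2*s1*s2`, which stays injective for non-negative coordinates bounded by `s1` and `s2` in the first two axes. A quick sanity-check sketch (assuming non-negative int64 coordinates):

```python
# Each distinct cube row must receive a distinct linear id.
assert len(np.unique(c1D)) == len(np.unique(cubes, axis=0))
```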

##### Approach #2

Next up, we can use a Cython-powered kd-tree for quick nearest-neighbour lookup to get the nearest neighbouring indices and hence solve our case, like so -

```python
from scipy.spatial import cKDTree

idx = cKDTree(cubes).query(cubes, k=N)[1]  # N = 4 as discussed earlier
I = idx[:, 0].argsort().reshape(-1, N)[:, 0]
unq_groups, indices = cubes[I], idx[I]
```
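This relies on every cube holding exactly `N` coincident rows of `cubes`, so a point's `N` nearest neighbours are precisely its own group, all at distance zero. A hedged check against the 24-point sample from the question, where `N = 4` holds:

```python
# Point 0 belongs to cube (1233038, 8326521, 796), whose members are
# rows 0, 5, 8 and 9 in the sample data.
assert np.array_equal(np.sort(idx[0]), np.array([0, 5, 8, 9]))
```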

##### Generic case : Variable number of indices per group

We will extend the argsort-based method with some splitting to get our desired output, like so -

```python
c1D = np.ravel_multi_index(cubes.T, cubes.max(0) + 1)
sidx = c1D.argsort()
c1Ds = c1D[sidx]
split_idx = np.flatnonzero(np.r_[True, c1Ds[:-1] != c1Ds[1:], True])
grps = cubes[sidx[split_idx[:-1]]]

indices = [sidx[i:j] for (i, j) in zip(split_idx[:-1], split_idx[1:])]

# If needed as dict o/p
out = dict(zip(map(tuple, grps), indices))
```
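As a verification sketch against the sample data from the question: within a group the order of indices depends on the unstable `argsort`, so it is safest to compare sorted values.

```python
# Spot-check two groups from the desired output in the question.
expected = {(1232105, 8327211, 822): [1, 13, 14, 18],
            (1233038, 8326521, 796): [0, 5, 8, 9]}
for key, val in expected.items():
    assert sorted(out[key].tolist()) == val
```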

**Using 1D versions of groups of cubes as keys**

We will extend the earlier listed method with the 1D groups of `cubes` as keys to simplify the process of dictionary creation and also make it more efficient, like so -

```python
def numpy1(cubes):
    c1D = np.ravel_multi_index(cubes.T, cubes.max(0) + 1)
    sidx = c1D.argsort()
    c1Ds = c1D[sidx]
    mask = np.r_[True, c1Ds[:-1] != c1Ds[1:], True]
    split_idx = np.flatnonzero(mask)
    indices = [sidx[i:j] for (i, j) in zip(split_idx[:-1], split_idx[1:])]
    out = dict(zip(c1Ds[mask[:-1]], indices))
    return out
```
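Since the keys of `numpy1`'s output are 1D linear indices rather than coordinate tuples, recovering the original cube coordinates requires keeping the grid shape around; a sketch using `np.unravel_index`:

```python
dims = cubes.max(0) + 1               # same grid shape used inside numpy1
out = numpy1(cubes)
key = next(iter(out))
coords = np.unravel_index(key, dims)  # back to the (x, y, z) cube triple
```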

Next up, we will make use of the `numba` package to iterate and get to the final hashable dictionary output. Going with it, there would be two solutions - one that gets the keys and values separately using `numba`, with the main calling function zipping and converting to a dict, while the other one creates a `numba`-supported dict type and hence requires no extra work by the main calling function.

Thus, we would have the first `numba` solution:

```python
from numba import njit

@njit
def _numba1(sidx, c1D):
    out = []
    n = len(sidx)
    start = 0
    grpID = []
    for i in range(1, n):
        if c1D[sidx[i]] != c1D[sidx[i-1]]:
            out.append(sidx[start:i])
            grpID.append(c1D[sidx[start]])
            start = i
    out.append(sidx[start:])
    grpID.append(c1D[sidx[start]])
    return grpID, out

def numba1(cubes):
    c1D = np.ravel_multi_index(cubes.T, cubes.max(0) + 1)
    sidx = c1D.argsort()
    out = dict(zip(*_numba1(sidx, c1D)))
    return out
```

And the second `numba` solution:

```python
from numba import types
from numba.typed import Dict

int_array = types.int64[:]

@njit
def _numba2(sidx, c1D):
    n = len(sidx)
    start = 0
    outt = Dict.empty(
        key_type=types.int64,
        value_type=int_array,
    )
    for i in range(1, n):
        if c1D[sidx[i]] != c1D[sidx[i-1]]:
            outt[c1D[sidx[start]]] = sidx[start:i]
            start = i
    outt[c1D[sidx[start]]] = sidx[start:]
    return outt

def numba2(cubes):
    c1D = np.ravel_multi_index(cubes.T, cubes.max(0) + 1)
    sidx = c1D.argsort()
    out = _numba2(sidx, c1D)
    return out
```
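From the caller's side, the `numba.typed.Dict` returned by `numba2` can be read like a regular Python dict; a brief usage sketch comparing the two variants:

```python
res1 = numba1(cubes)
res2 = numba2(cubes)
key = next(iter(res1))
# both derive from the same deterministic argsort, so the groups match
assert np.array_equal(res1[key], res2[key])
```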

Timings with `cubes.npz` data -

```python
In [4]: cubes = np.load('cubes.npz')['array']

In [5]: %timeit numpy1(cubes)
   ...: %timeit numba1(cubes)
   ...: %timeit numba2(cubes)
2.38 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.13 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 5.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

**Alternative #1 :** We can achieve further speedup with `numexpr` for large arrays to compute `c1D`, like so -

```python
import numexpr as ne

s0, s1 = cubes[:, 0].max() + 1, cubes[:, 1].max() + 1
d = {'s0': s0, 's1': s1, 'c0': cubes[:, 0], 'c1': cubes[:, 1], 'c2': cubes[:, 2]}
c1D = ne.evaluate('c0+c1*s0+c2*s0*s1', d)
```

This would be applicable at all places that require `c1D`.
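For instance, a sketch of `numpy1` with its `c1D` computation swapped for the `numexpr` version (the name `numpy1_ne` is hypothetical, not from the original answer):

```python
import numexpr as ne

def numpy1_ne(cubes):
    # numexpr-backed variant of numpy1's dimensionality reduction
    cubes = cubes.astype(np.int64)
    s0, s1 = cubes[:, 0].max() + 1, cubes[:, 1].max() + 1
    d = {'s0': s0, 's1': s1, 'c0': cubes[:, 0], 'c1': cubes[:, 1], 'c2': cubes[:, 2]}
    c1D = ne.evaluate('c0+c1*s0+c2*s0*s1', d)
    sidx = c1D.argsort()
    c1Ds = c1D[sidx]
    mask = np.r_[True, c1Ds[:-1] != c1Ds[1:], True]
    split_idx = np.flatnonzero(mask)
    indices = [sidx[i:j] for (i, j) in zip(split_idx[:-1], split_idx[1:])]
    return dict(zip(c1Ds[mask[:-1]], indices))
```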


You might just iterate and add the index of each element to the corresponding list.

```python
from collections import defaultdict

res = defaultdict(list)
for idx, elem in enumerate(cubes):
    # res[tuple(elem)].append(idx)
    res[elem.tobytes()].append(idx)
```

Runtime can be further improved by using `tobytes()` instead of converting the key to a tuple.
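One trade-off is that `tobytes()` keys are opaque; a sketch of decoding a key back into cube coordinates, assuming the dtype and row layout of `cubes` are known:

```python
key = next(iter(res))
coords = np.frombuffer(key, dtype=cubes.dtype)  # e.g. array([1233038, 8326521, 796])
```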


You could use Cython:

```python
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

import math
import cython as cy

cimport numpy as cnp


cpdef groupby_index_dict_cy(cnp.int32_t[:, :] arr):
    cdef cy.size_t size = len(arr)
    result = {}
    for i in range(size):
        key = arr[i, 0], arr[i, 1], arr[i, 2]
        if key in result:
            result[key].append(i)
        else:
            result[key] = [i]
    return result
```
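A hedged usage sketch, assuming the cell above has been compiled in a Jupyter session with the Cython extension loaded (`%load_ext cython`); the input must be `int32` to match the memoryview signature:

```python
cubes32 = cubes.astype(np.int32)
result = groupby_index_dict_cy(cubes32)  # {(x, y, z): [row indices], ...}
```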

but it will not be faster than what Pandas does, although it is the fastest after that (and perhaps the `numpy_indexed`-based solution), and it does not come with Pandas' memory penalty.

A collection of what has been proposed so far is here.

On the OP's machine that should get close to ~12 sec execution time.
