Most efficient way to find mode in numpy array


I have a 2D array containing integers (both positive and negative). Each row represents the values over time for a particular spatial site, whereas each column represents values for various spatial sites for a given time.

So if the array is like:

1 3 4 2 2 7
5 2 2 1 4 1
3 3 2 2 1 1

The result should be

1 3 2 2 2 1

Note that when several values tie for the mode, any one of them (selected arbitrarily) may be reported as the mode.

I can iterate over the columns and find the mode one at a time, but I was hoping numpy might have a built-in function for this, or that there is a trick to do it efficiently without looping.

Check scipy.stats.mode() (inspired by @tom10's comment):

import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

m = stats.mode(a)
print(m)

Output:

ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))

As you can see, it returns both the mode as well as the counts. You can select the modes directly via m[0]:

print(m[0])

Output:

[[1 3 2 2 1 1]]
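The call above uses the default axis=0, so it reports one mode per column. As a hedged sketch of the other options, the snippet below asks for per-row modes (axis=1) and for the mode of the whole array (axis=None). The shape of the returned arrays differs between SciPy versions, so np.ravel is used here to normalise it; the variable names are only illustrative.

```python
import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# One mode per row (axis=1) and a single mode for all values (axis=None).
per_row = stats.mode(a, axis=1)
overall = stats.mode(a, axis=None)

# np.ravel flattens the results so the code does not depend on whether
# this SciPy version keeps the reduced axis in the output.
row_modes = np.ravel(per_row.mode)
row_counts = np.ravel(per_row.count)
overall_mode = np.ravel(overall.mode)[0]
overall_count = np.ravel(overall.count)[0]

print(row_modes, row_counts)
print(overall_mode, overall_count)
```

Rows 2 and 3 contain ties, so which of the tied values is reported for them depends on the library's tie-breaking rule.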


Update

The scipy.stats.mode function has been significantly optimized since this post and is now the recommended method.

Old answer

This is a tricky problem, since there is not much out there for calculating the mode along an axis. The solution is straightforward for 1-D arrays, where numpy.bincount is handy, as is numpy.unique with the return_counts argument set to True. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow, especially for large arrays with many unique values. As a solution, I've developed this function, and use it heavily:
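For reference, the two 1-D approaches mentioned above could be sketched roughly as follows. The helper names mode_unique and mode_bincount are made up for illustration and are not part of numpy; the bincount variant assumes non-negative integers, and both assume a non-empty input.

```python
import numpy as np

def mode_unique(values):
    """Return (mode, count) for a 1-D array using numpy.unique."""
    uniques, counts = np.unique(np.asarray(values), return_counts=True)
    best = np.argmax(counts)  # first maximum, i.e. the smallest value on ties
    return uniques[best], counts[best]

def mode_bincount(values):
    """Return (mode, count) for a 1-D array of non-negative integers."""
    counts = np.bincount(np.asarray(values))
    best = np.argmax(counts)  # the index into counts is the value itself
    return best, counts[best]

data = np.array([3, 1, 2, 2, 3, 2])
print(mode_unique(data))
print(mode_bincount(data))
```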

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except:
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is >= 1.9, numpy.unique will suffice
    # (compare (major, minor) as a tuple so that e.g. 2.0 is handled correctly)
    if all([ndim == 1,
            tuple(int(v) for v in numpy.__version__.split('.')[:2]) >= (1, 9)]):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    # Index with a tuple of slices (a plain list is rejected by newer numpy)
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    # Build a mutable list of open-grid indices, then index with a tuple
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[tuple(index)], counts[tuple(index)]

Result:

In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
                         [5, 2, 2, 1, 4, 1],
                         [3, 3, 2, 2, 1, 1]])

In [3]: mode(a)
Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))
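The core idea of the function above (sort, then measure the lengths of runs of equal values) is easier to see in one dimension. The following is a simplified sketch of that idea only, not the author's n-dimensional implementation; the name mode_by_sorting is hypothetical and the input is assumed to be non-empty.

```python
import numpy as np

def mode_by_sorting(values):
    """Return (mode, count) of a non-empty array via sort + run lengths."""
    sorted_vals = np.sort(np.asarray(values).ravel())
    # True wherever a new run of equal values starts.
    starts = np.concatenate(([True], sorted_vals[1:] != sorted_vals[:-1]))
    start_idx = np.flatnonzero(starts)
    # Length of each run = distance to the next run start (or to the end).
    run_lengths = np.diff(np.append(start_idx, sorted_vals.size))
    best = np.argmax(run_lengths)
    return sorted_vals[start_idx[best]], run_lengths[best]

value, count = mode_by_sorting([3, 1, 2, 2, 3, 2])
print(value, count)
```

Because the values are sorted first, ties resolve to the smallest value, and the approach works for floats and negative numbers as well as non-negative integers.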

Some benchmarks:

In [4]: import scipy.stats

In [5]: a = numpy.random.randint(1,10,(1000,1000))

In [6]: %timeit scipy.stats.mode(a)
10 loops, best of 3: 41.6 ms per loop

In [7]: %timeit mode(a)
10 loops, best of 3: 46.7 ms per loop

In [8]: a = numpy.random.randint(1,500,(1000,1000))

In [9]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 1.01 s per loop

In [10]: %timeit mode(a)
10 loops, best of 3: 80 ms per loop

In [11]: a = numpy.random.random((200,200))

In [12]: %timeit scipy.stats.mode(a)
1 loops, best of 3: 3.26 s per loop

In [13]: %timeit mode(a)
1000 loops, best of 3: 1.75 ms per loop

EDIT: Provided more of a background and modified the approach to be more memory-efficient
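The timings above come from IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The snippet below times a deliberately simple per-column loop (the name column_modes_reference is hypothetical), which can also serve as a correctness baseline for faster implementations; the array size and repeat count are arbitrary choices.

```python
import timeit
import numpy as np

def column_modes_reference(arr):
    """Slow but obvious per-column mode for a 2-D array."""
    arr = np.asarray(arr)
    modes = np.empty(arr.shape[1], dtype=arr.dtype)
    counts = np.empty(arr.shape[1], dtype=int)
    for j in range(arr.shape[1]):
        uniques, freq = np.unique(arr[:, j], return_counts=True)
        best = np.argmax(freq)
        modes[j], counts[j] = uniques[best], freq[best]
    return modes, counts

rng = np.random.default_rng(0)
a = rng.integers(1, 10, size=(200, 200))

repeats = 3
elapsed = timeit.timeit(lambda: column_modes_reference(a), number=repeats)
print(f"{elapsed / repeats * 1e3:.2f} ms per call")
```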


Expanding on the numpy.unique approach: when you also need the position of the mode in the original array (for example, to see how far the value is from the center of the distribution), request the first-occurrence indices as well.

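a = np.asarray(a).ravel()  # np.unique indexes into the flattened array
(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]  # first position of the most frequent value
mode = a[index]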

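Remember to treat the mode with caution when several values tie for the highest count, i.e. when np.count_nonzero(counts == counts.max()) > 1 (np.argmax itself returns only the first maximum, so its result has no length to test). To validate whether the mode is actually representative of the center of your distribution, you can also check whether it falls within one standard deviation of the mean.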
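A tie for the highest count can be detected from the counts array itself. The sketch below is one possible way to do that, with made-up example data and illustrative variable names:

```python
import numpy as np

a = np.array([4, 1, 2, 2, 4, 7])
values, idx, counts = np.unique(a, return_index=True, return_counts=True)

# More than one value reaching the maximum count means the mode is ambiguous.
at_max = counts == counts.max()
is_tied = np.count_nonzero(at_max) > 1
tied_values = values[at_max]

print(is_tied)      # True: 2 and 4 both occur twice
print(tied_values)
```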


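A neat solution that uses only numpy (neither scipy nor the Counter class). Note that np.bincount accepts only non-negative integers, so this works as written only when the array has no negative values: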

A = np.array([[1,3,4,2,2,7], [5,2,2,1,4,1], [3,3,2,2,1,1]])

np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)

array([1, 3, 2, 2, 1, 1])
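Since the question mentions negative integers and np.bincount rejects them, one possible workaround is to shift the data by its minimum before counting and shift the result back afterwards. This is only a sketch: the function name column_modes_bincount is hypothetical, a non-empty integer array is assumed, and memory use grows with the range of values.

```python
import numpy as np

def column_modes_bincount(arr):
    """Per-column mode for an integer array that may contain negatives."""
    arr = np.asarray(arr)
    offset = arr.min()
    shifted = arr - offset  # now every value is >= 0
    modes = np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                                axis=0, arr=shifted)
    return modes + offset   # undo the shift

A = np.array([[-1,  3, 4],
              [-1, -2, 4],
              [ 5, -2, 0]])
print(column_modes_bincount(A))  # modes are -1, -2 and 4
```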


I think a very simple way would be to use the Counter class from the collections module. You can then use the most_common() method of the Counter instance.

For 1-d arrays:

import numpy as np
from collections import Counter

nparr = np.arange(10) 
nparr[2] = 6 
nparr[3] = 6 #6 is now the mode
mode = Counter(nparr).most_common(1)
# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])    

For multidimensional arrays (little difference; note that this gives the mode of the whole array, not the mode along an axis):

import numpy as np
from collections import Counter

nparr = np.arange(10) 
nparr[2] = 6 
nparr[3] = 6 
nparr = nparr.reshape((2, 5))     # same 10 values, reshaped into a 2-D ndarray
mode = Counter(nparr.flatten()).most_common(1)  # just use .flatten() method

# mode will be [(6,3)] to give the count of the most occurring value, so ->
print(mode[0][0])

This may or may not be an efficient implementation, but it is convenient.
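The examples above return a single mode for the whole array. For the per-column result asked for in the question, the same Counter idea could be applied column by column, roughly as sketched below. The function name column_modes_counter is hypothetical, and this loops in Python, so it is a convenience rather than a fast path.

```python
from collections import Counter

import numpy as np

def column_modes_counter(arr):
    """Per-column mode of a 2-D array using collections.Counter."""
    arr = np.asarray(arr)
    # arr.T iterates over columns; tolist() gives plain Python values as keys.
    return np.array([Counter(col.tolist()).most_common(1)[0][0]
                     for col in arr.T])

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])
print(column_modes_counter(a))
```

In columns where every value occurs once, any of the values is a valid mode, so the result for those columns may differ from what other methods report.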


In older SciPy releases, the scipy.stats.mode function was defined with this code, which relies only on numpy:

import numpy as np

def mode(a, axis=0):
    scores = np.unique(np.ravel(a))  # get ALL unique values
    testshape = list(a.shape)
    testshape[axis] = 1
    oldmostfreq = np.zeros(testshape)
    oldcounts = np.zeros(testshape)
    for score in scores:
        template = (a == score)
        counts = np.expand_dims(np.sum(template, axis), axis)
        mostfrequent = np.where(counts > oldcounts, score, oldmostfreq)
        oldcounts = np.maximum(counts, oldcounts)
        oldmostfreq = mostfrequent
    return mostfrequent, oldcounts



Comments
  • There is docs.scipy.org/doc/scipy/reference/generated/… and the answer here: stackoverflow.com/questions/6252280/…
  • @tom10: You mean scipy.stats.mode(), right? The other one seems to output a masked array.
  • @fgb: right, thanks for the correction (and +1 for your answer).
  • So numpy by itself does not support any such functionality?
  • Apparently not, but scipy's implementation relies only on numpy, so you could just copy that code into your own function.
  • Just a note, for people who look at this in the future: you need to import scipy.stats explicitly, it is not included when you simply do an import scipy.
  • Can you please explain how exactly it is displaying the mode values and count ? I couldn't relate the output with the input provided.
  • @Rahul: you have to consider the default second argument of axis=0. The above code is reporting the mode per column of the input. The count is telling us how many times it has seen the reported mode in each of the columns. If you wanted the overall mode, you need to specify axis=None. For further info, please refer to docs.scipy.org/doc/scipy/reference/generated/…
  • Please do contribute it to scipy's stat module so others also could benefit from it.
  • When does np.argmax ever return something with length greater than 1 if you don't specify an axis?
  • Since the question was asked 6 years ago, it is normal that he did not receive much reputation.