Determining duplicate values in an array

Suppose I have an array

a = np.array([1, 2, 1, 3, 3, 3, 0])

How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]) or possibly array([1, 3]) if efficient.

I've come up with a few methods that appear to work:

Masking

m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]

Set operations

a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]

This one is cute but probably illegal (as a isn't actually unique):

np.setxor1d(a, np.unique(a), assume_unique=True)

Histograms

u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]

Sorting

s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]

Pandas

s = pd.Series(a)
s[s.duplicated()]

Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).


Conclusions

Testing with a 10 million size data set (on a 2.8GHz Xeon):

a = np.random.randint(10**7, size=10**7)

The fastest is sorting, at 1.1s. The dubious xor1d is second at 2.6s, followed by masking and Pandas Series.duplicated at 3.1s, bincount at 5.6s, and in1d and senderle's setdiff1d both at 7.3s. Steven's Counter is only a little slower, at 10.5s; trailing behind are Burhan's Counter.most_common at 110s and DSM's Counter subtraction at 360s.

I'm going to use sorting for performance, but I'm accepting Steven's answer because the performance is acceptable and it feels clearer and more Pythonic.

Edit: discovered the Pandas solution. If Pandas is available it's clear and performs well.
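
For anyone who wants to repeat the comparison, here is a minimal timing sketch. It is not the harness used for the numbers above: the function names dup_sort and dup_counts are made up for illustration, only two of the methods are included, and the array is 10**5 elements rather than 10**7 so it finishes quickly. Timings will differ by machine and numpy version.

```python
import timeit

import numpy as np


def dup_sort(a):
    # Sort, then keep every element that equals its successor.
    s = np.sort(a, axis=None)
    return s[:-1][s[1:] == s[:-1]]


def dup_counts(a):
    # Needs numpy >= 1.9 for return_counts.
    u, c = np.unique(a, return_counts=True)
    return u[c > 1]


rng = np.random.default_rng(0)
a = rng.integers(10**5, size=10**5)  # assumption: smaller than the question's 10**7

for fn in (dup_sort, dup_counts):
    # Best of three single runs, to reduce noise from other processes.
    t = min(timeit.repeat(lambda: fn(a), number=1, repeat=3))
    print(f"{fn.__name__}: {t:.4f}s")
```

Note that dup_sort returns one entry per extra occurrence while dup_counts returns each duplicated value once, so compare np.unique(dup_sort(a)) with dup_counts(a) when checking that they agree.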

I think this is clearest when done outside of numpy. You'll have to time it against your numpy solutions if you are concerned with speed.

>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).iteritems() if count > 1]
[1, 3]

Note: this is similar to Burhan Khalid's answer, but using iteritems without subscripting in the condition should be faster.
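
The session above is Python 2. A sketch of the same idea for Python 3 follows; the .tolist() call is my own addition (not part of the answer) so that the Counter keys are plain Python ints rather than NumPy scalars.

```python
from collections import Counter

import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])

# .tolist() converts the NumPy scalars to plain Python ints before counting.
counts = Counter(a.tolist())

# dict.iteritems() no longer exists in Python 3; items() is the replacement.
duplicates = [item for item, count in counts.items() if count > 1]
print(duplicates)  # [1, 3]
```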

As of numpy version 1.9.0, np.unique has an argument return_counts which greatly simplifies your task:

u, c = np.unique(a, return_counts=True)
dup = u[c > 1]

This is similar to using Counter, except you get a pair of arrays instead of a mapping. I'd be curious to see how they perform relative to each other.
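
To make the comparison with Counter concrete, here is a small sketch that builds the same "value -> count" mapping both ways for the duplicated values only. The names dup_from_unique and dup_from_counter are illustrative, not part of either API.

```python
from collections import Counter

import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])

# Pair of arrays: sorted unique values and how often each occurs.
u, c = np.unique(a, return_counts=True)
mask = c > 1
dup_from_unique = dict(zip(u[mask].tolist(), c[mask].tolist()))

# The same mapping built with Counter, for comparison.
dup_from_counter = {k: v for k, v in Counter(a.tolist()).items() if v > 1}

print(dup_from_unique)  # {1: 2, 3: 3}
```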

People have already suggested Counter variants, but here's one which doesn't use a listcomp:

>>> from collections import Counter
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> (Counter(a) - Counter(set(a))).keys()
[1, 3]

[Posted not because it's efficient -- it's not -- but because I think it's cute that you can subtract Counter instances.]
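
A sketch of why the subtraction works, written for Python 3 (where keys() returns a view rather than a list): Counter(set(a)) counts every distinct value exactly once, and Counter subtraction discards any result that is not positive, so only the surplus occurrences survive.

```python
from collections import Counter

a = [1, 2, 1, 3, 3, 3, 0]

# Every distinct value has count 1 in Counter(set(a)); subtracting it leaves
# only the extra occurrences, and entries that drop to zero are removed.
extra = Counter(a) - Counter(set(a))
print(extra)          # Counter({3: 2, 1: 1})

# Iterating a Counter yields its keys; sorted() gives a plain list.
print(sorted(extra))  # [1, 3]
```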

For Python 2.7+

>>> import numpy
>>> from collections import Counter
>>> n = numpy.array([1,1,2,3,3,3,0])
>>> [x[0] for x in Counter(n).most_common() if x[1] > 1]
[3, 1]

Here's another approach using set operations that I think is a bit more straightforward than the ones you offer:

>>> indices = np.setdiff1d(np.arange(len(a)), np.unique(a, return_index=True)[1])
>>> a[indices]
array([1, 3, 3])

I suppose you're asking for numpy-only solutions, since if that's not the case, it's very difficult to argue with just using a Counter instead. I think you should make that requirement explicit though.
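
Spelled out as a script, with the intermediate arrays shown. This is only a sketch of the approach above; the variable names first_idx and later_idx are mine. It relies on return_index giving the index of the first occurrence of each unique value, so every other index is a later repeat.

```python
import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])

# Index of the first occurrence of each unique value (in sorted-value order).
first_idx = np.unique(a, return_index=True)[1]
print(first_idx)     # [6 0 1 3]

# Every index that is not a first occurrence points at a repeat.
later_idx = np.setdiff1d(np.arange(len(a)), first_idx)
print(later_idx)     # [2 4 5]

print(a[later_idx])  # [1 3 3]

# If each duplicated value should appear only once, collapse the repeats.
print(np.unique(a[later_idx]))  # [1 3]
```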

Comments
  • Could you explain why the sorting solution works? I tried it out but for some reason I don't really get it.
  • @Markus if you sort an array, any duplicate values are adjacent. You then use a boolean mask to take only those items that are equal to the previous item.
  • Shouldn't it be s[:-1][ s[1:] == s[:-1] ]? I get an IndexError otherwise, the boolean mask being one element shorter than the s-array....
  • @snake_charmer I think earlier versions of numpy were more forgiving in this regard. I'll fix it, thanks.
  • pandas seems to have improved the performance of some underlying methods. On my machine, pandas is only 29% slower than the sorting method. The method proposed by Mad Physicist is 17% slower than sorting.
  • Note: Counter(a).items() has to be used in python 3
  • shouldn't x[0] > 1 be x[1] > 1? the latter x represents the frequency.
  • One wart I see in this approach is that the 3 is repeated while the 1 is not. It would be nice to have it one way or the other. (This is not a criticism of your answer so much as of the original approach by the OP.)
  • @StevenRumbalski, yeah, I see what you mean. My sense is that the repeated 3 makes sense if what's really needed is a mask rather than a list of items; if what's needed is a list of items, then I agree that not having repeated items is better.
  • I'm not opposed to using Counter, but I am concerned about efficiency and compatibility.
  • Three years later still, and you can use the return_counts argument to unique for this too. See my answer.
  • a[1:][np.diff(a) == 0], no?
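
For readers following the comments about the sorting solution, here is a step-by-step sketch on the example array. It also tries the np.diff variant from the last comment; as far as I can tell that variant only gives the full answer when it is applied to the sorted array, which is an observation of mine rather than something stated in the thread.

```python
import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])

s = np.sort(a, axis=None)
print(s)               # [0 1 1 2 3 3 3]

# Compare each element with its successor; both slices have len(s) - 1 items.
mask = s[1:] == s[:-1]
print(mask)            # [False  True False False  True  True]

# Keep the elements that are followed by an equal value.
dups = s[:-1][mask]
print(dups)            # [1 3 3]

# np.diff on the sorted array is zero exactly where neighbours are equal.
dups_diff = s[1:][np.diff(s) == 0]
print(dups_diff)       # [1 3 3]

# On the unsorted array only adjacent repeats are caught, so the 1 is missed.
unsorted_diff = a[1:][np.diff(a) == 0]
print(unsorted_diff)   # [3 3]
```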