Numpy list of random subsets of specified size


I want to compute a list of random subsets of an array, where the ordering within each subset is random and the elements within each subset are unique, and I want to do so efficiently (in terms of both time and space).

For example, the array [1,2,3,4,5] with parameters k=3 and n=5 should produce a matrix such as

[[4,3,1],
 [1,2,5],
 [2,4,5],
 [3,2,5],
 [2,3,1]]

i.e., we get n=5 lists, each containing k=3 unique elements drawn randomly from the array.

I would like this to be as fast as possible, without creating a gigantic lookup table of possible combinations, since both time and space are of the essence.

I tried numpy.random.choice(array, n*k).reshape((n,k)), which almost does what I want, except for the uniqueness part. I settled on the following

subsets = numpy.zeros(n).reshape((n,1))
subsets = numpy.apply_along_axis(lambda x: numpy.random.choice(array, k, replace=False),1, subsets)

However, since this is not pure numpy, it is slow. Too slow for my application. Is there some way to improve the runtime, perhaps with pure numpy operations?

Any help would be appreciated.

Assuming that k is going to be much smaller than the length of the array, you can implement the sampling without replacement yourself:

import numpy as np

# column j is drawn from range(N - j), then shifted past the earlier picks
idx = np.random.randint([5, 4, 3], size=[5, 3])
idx[:, 2:] += idx[:, 2:] >= idx[:, 1, None]
idx[:, 1:] += idx[:, 1:] >= idx[:, 0, None]
np.array([1, 2, 3, 4, 5])[idx]

Putting this into a function:

def sample_noreplace(arr, n, k):
    assert k <= len(arr)
    # column j is drawn uniformly from range(len(arr) - j)
    idx = np.random.randint(len(arr) - np.arange(k), size=[n, k])
    # shift each draw past the columns drawn before it, right to left
    for i in range(k-1, 0, -1):
        idx[:,i:] += idx[:,i:] >= idx[:,i-1,None]
    return np.array(arr)[idx]

This takes ~1 ms to run sample_noreplace([1,2,3,4,5], 10000, 3), while the OP's code takes ~800 ms.

To show that this samples uniformly, I'd tabulate a large amount of output:

arr = sample_noreplace(range(5), 1_000_000, 3)

# summarise
dict(zip(*np.unique(arr, return_counts=True)))

which gives me {0: 600288, 1: 599656, 2: 600494, 3: 599233, 4: 600329}: pretty uniform. Next we can check within each position:

out = np.zeros([3, 5])
for i in range(3):
    n, c = np.unique(arr[:,i], return_counts=True)
    out[i, n] = c

which gives me a table looking like:

array([[199936., 199701., 200106., 199843., 200414.],
       [200227., 200044., 200345., 199897., 199487.],
       [200125., 199911., 200043., 199493., 200428.]])

which also looks uniform.
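Eyeballing counts can be replaced by a formal goodness-of-fit test. Here is a sketch using scipy.stats.chisquare (the sampler is copied from above; the seed is an arbitrary choice of mine):

```python
import numpy as np
from scipy import stats

# copy of sample_noreplace from above
def sample_noreplace(arr, n, k):
    arr = np.asarray(arr)
    assert k <= len(arr)
    idx = np.random.randint(len(arr) - np.arange(k), size=[n, k])
    for i in range(k - 1, 0, -1):
        idx[:, i:] += idx[:, i:] >= idx[:, i - 1, None]
    return arr[idx]

np.random.seed(0)
samples = sample_noreplace(np.arange(5), 1_000_000, 3)
_, counts = np.unique(samples, return_counts=True)
# chi-squared test of the observed counts against equal expected frequencies
stat, pvalue = stats.chisquare(counts)
# largest relative deviation from the mean count
max_dev = np.abs(counts - counts.mean()).max() / counts.mean()
```

A p-value that is not tiny means the observed counts are consistent with uniform sampling.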


Given your numbers, one good approach is to simply draw with replacement and discard any draws that contain duplicates.

Here are the timings for drawing 5000 samples of size 10 or roughly 10% of the array, whichever is smaller, from arrays of varying size. pp_overdraw draws more samples than needed and discards the invalid ones; pp_fillin draws exactly n samples and redraws the bad ones until none are left. pythran is a compiled solution; since the OP asked for pure numpy, it is included only for reference.

Code:

import pyflip
import numpy as np
from simple_benchmark import BenchmarkBuilder, MultiArgument

B = BenchmarkBuilder()

@B.add_function()
def pythran(A,k,n):
    assert len(A) >= k
    if A.dtype not in (float,int) or A.ndim > 1:
        return A[pyflip.without_repl(np.arange(len(A)),k,n)]
    else:
        return pyflip.without_repl(A,k,n)

from scipy import stats

@B.add_function()
def pp_overdraw(A,k,n):
    N = len(A)
    p = np.linspace(1-(k-1)/N,1-1/N,k-1).prod()
    nn = int(n/p + 4*np.sqrt(n-n*p)) + 1
    out = np.random.randint(0,N,(nn,k))
    os = np.sort(out,1)
    valid = (os[:,:-1] != os[:,1:]).all(1)
    validx, = valid.nonzero()
    while len(validx)<n: # very unlikely
        replace = np.random.randint(0,N,(nn-len(validx),k))
        rs = np.sort(replace,1)
        rvalid = (rs[:,:-1] != rs[:,1:]).all(1)
        out[~valid,:] = replace
        valid[~valid] = rvalid
        validx, = valid.nonzero()
    return A[out[validx[:n]]]

@B.add_function()
def pp_fillin(A,k,n):
    N = len(A)
    out = np.random.randint(0,N,(n,k))
    os = np.sort(out,1)
    valid = (os[:,:-1] != os[:,1:]).all(1)
    validx, = valid.nonzero()
    while len(validx)<n:
        replace = np.random.randint(0,N,(n-len(validx),k))
        rs = np.sort(replace,1)
        rvalid = (rs[:,:-1] != rs[:,1:]).all(1)
        out[~valid,:] = replace
        valid[~valid] = rvalid
        validx, = valid.nonzero()
    return A[out]
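The overdraw count nn in pp_overdraw is n/p rows plus roughly a four-sigma safety margin, where p is the probability that a row of k draws with replacement contains no duplicate. A worked example with numbers in the OP's range (N=500, k=3, n=5000; my own illustration, not from the answer):

```python
import numpy as np

N, k, n = 500, 3, 5000
# probability that one row of k draws with replacement has no duplicate:
# (1 - 1/N) * (1 - 2/N) * ... * (1 - (k-1)/N)
p = np.linspace(1 - (k - 1) / N, 1 - 1 / N, k - 1).prod()
# expected n/p rows needed, plus ~4 standard deviations of headroom
nn = int(n / p + 4 * np.sqrt(n - n * p)) + 1
```

Here p is about 0.994 and nn comes out to 5053, so the overdraw costs only ~1% extra rows while making a second pass very unlikely.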


@B.add_function()
def OP(A,k,n):
    subsets = np.zeros(n).reshape((n,1))
    return np.apply_along_axis(lambda x: np.random.choice(A, k, replace=False), 1, subsets)

@B.add_function()
def Sam_Mason(A,k,n):
    assert k <= len(A)
    idx = np.random.randint(len(A) - np.arange(k), size=[n, k])
    for i in range(k-1, 0, -1):
        idx[:,i:] += idx[:,i:] >= idx[:,i-1,None]
    return np.array(A)[idx]

@B.add_arguments('array size')
def argument_provider():
    for exp in range(4, 15):
        sz = 2**exp
        A = np.arange(sz)
        yield sz, MultiArgument([A,min(10,sz//10+1),5000])

r = B.run()
r.plot()

import pylab
pylab.savefig('norepl.png')

The pythran module. In principle it also runs under plain Python, but that would be painfully slow; better to compile it with

pythran -O3 pyflip.py

file <pyflip.py>

import numpy as np

#pythran export without_repl(int[:],int,int)
#pythran export without_repl(float[:],int,int)

def without_repl(A,k,n):
    N = A.size
    out = np.empty((n,k),A.dtype)
    A = A.copy()
    for i in range(n):
        for j in range(k):
            l = np.random.randint(j,N)
            out[i,j] = A[l]
            A[l] = A[j]
            A[j] = out[i,j]
    return out
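Because the module is plain Python before compilation, a quick uncompiled sanity check is easy. A sketch using a copy of without_repl (slow, but fine for a small test):

```python
import numpy as np

# plain-Python copy of without_repl from pyflip.py
def without_repl(A, k, n):
    N = A.size
    out = np.empty((n, k), A.dtype)
    A = A.copy()
    for i in range(n):
        for j in range(k):
            # partial Fisher-Yates: pick from the not-yet-chosen tail, swap into place
            l = np.random.randint(j, N)
            out[i, j] = A[l]
            A[l] = A[j]
            A[j] = out[i, j]
    return out

np.random.seed(0)
res = without_repl(np.arange(10), 3, 1000)
# every row should contain k distinct values
all_unique = all(len(set(row)) == 3 for row in res)
```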


Another pure-numpy option: tile the array into n rows, shuffle each row in place, and keep the first k columns.

N = [1, 2, 3, 4, 5]
k = 3
n = 5

arr = np.array([N] * n)

arr
array([[1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5],
       [1, 2, 3, 4, 5]])

for row in arr:
    np.random.shuffle(row)

arr
array([[3, 5, 1, 2, 4],
       [5, 1, 3, 2, 4],
       [3, 5, 2, 4, 1],
       [4, 5, 1, 2, 3],
       [1, 5, 3, 2, 4]])

arr[:, :3]
array([[3, 5, 1],
       [5, 1, 3],
       [3, 5, 2],
       [4, 5, 1],
       [1, 5, 3]])
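For reuse, the transcript above condenses into a small helper (a sketch; random_subsets is a name of my choosing). Note that it shuffles n full copies of the array, so memory grows with n*len(arr) even when k is small:

```python
import numpy as np

def random_subsets(arr, k, n):
    # n rows, each a full copy of arr
    tiled = np.tile(np.asarray(arr), (n, 1))
    # shuffle every row independently, then keep the first k columns
    for row in tiled:
        np.random.shuffle(row)
    return tiled[:, :k]

subsets = random_subsets([1, 2, 3, 4, 5], 3, 5)
```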


Comments
  • Not a great solution, but using map is faster: np.array(list(map(lambda x: np.random.choice(array, 3, replace=False), range(5)))).
  • What size n and k and input array do you expect?
  • @PaulPanzer currently, len(array) is around 500, n is between 1000 and 5000 and k is between 2 and 10
  • @PaulPanzer good point, should have asked that as well! Note to OP: if k is much smaller than sqrt(len(array)), it becomes useful to use a different algorithm, because the chance of picking a duplicate is relatively low; see e.g. the birthday problem for where the sqrt comes from.
  • @SamMason In case you are interested I've added the code for a solution that works well (O(len(array)+nk)) for any parameters but relies on pythran (numba would also work) for speed because it is loopy. --- Btw. which regime would you argue is your algo for? It is quadratic in k, so it would also have to be somewhere in the k<<len(array) region, right?
  • very clean solution, thank you. I appreciate the probability analysis as well!
  • the second array in your output matrix has non-unique members. 1 in [1,5,1]
  • This new version can be really costly in terms of space, if N is large but k is small.
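The birthday-problem remark in the comments can be made concrete: k draws with replacement from N values are all distinct with probability prod over i of (1 - i/N), which stays near 1 while k is well below sqrt(N). A sketch (prob_all_distinct is my own illustrative helper):

```python
import numpy as np

def prob_all_distinct(N, k):
    # probability that k independent uniform draws from N values have no duplicate
    return float(np.prod(1 - np.arange(k) / N))

p_small = prob_all_distinct(500, 3)   # k << sqrt(N): duplicates are rare
p_large = prob_all_distinct(500, 50)  # k > sqrt(N): duplicates are common
```

With the OP's numbers (len(array) around 500, k between 2 and 10), a row drawn with replacement is duplicate-free roughly 99% of the time for k=3 and still around 91% for k=10, which is why the draw-and-discard approaches work well here.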