Numpy shuffle multidimensional array by row only, keep column order unchanged


How can I shuffle a multidimensional array by row only in Python (that is, without shuffling the columns)?

I am looking for the most efficient solution, because my matrix is very large. Is it also possible to do this efficiently in place on the original array (to save memory)?

Example:

import numpy as np
X = np.random.random((6, 2))
print(X)
Y = ???  # shuffle by row only, not by column
print(Y)

Original matrix:

[[ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.45174186  0.8782033 ]
 [ 0.75623083  0.71763107]
 [ 0.26809253  0.75144034]
 [ 0.23442518  0.39031414]]

Expected output, with the rows shuffled but the columns left alone, e.g.:

[[ 0.45174186  0.8782033 ]
 [ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.75623083  0.71763107]
 [ 0.23442518  0.39031414]
 [ 0.26809253  0.75144034]]

That's what numpy.random.shuffle() is for. It shuffles in place along the first axis, so each row moves as a whole:

>>> X = np.random.random((6, 2))
>>> X
array([[ 0.9818058 ,  0.67513579],
       [ 0.82312674,  0.82768118],
       [ 0.29468324,  0.59305925],
       [ 0.25731731,  0.16676408],
       [ 0.27402974,  0.55215778],
       [ 0.44323485,  0.78779887]])

>>> np.random.shuffle(X)
>>> X
array([[ 0.9818058 ,  0.67513579],
       [ 0.44323485,  0.78779887],
       [ 0.82312674,  0.82768118],
       [ 0.29468324,  0.59305925],
       [ 0.25731731,  0.16676408],
       [ 0.27402974,  0.55215778]])
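
For reproducible results, or for a shuffled copy instead of an in-place shuffle, here is a minimal sketch using NumPy's newer Generator API. It is not part of the answer above and assumes a reasonably recent NumPy; the seed value and variable names are arbitrary. Note that np.random.shuffle() returns None, so its result should not be assigned.

```python
import numpy as np

# Seeded generator so the shuffle is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)

X = np.arange(12).reshape(6, 2)
original = X.copy()

# In place: reorders the rows of X; every row keeps its column order.
rng.shuffle(X)

# Not in place: returns a row-shuffled copy and leaves its input untouched.
Y = rng.permutation(original, axis=0)

print(X)
print(Y)
```

Both calls shuffle along axis 0 by default, which is what "by row only" needs for a 2D array.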


You can also use np.random.permutation to generate a random permutation of row indices and then index into the rows of X using np.take with axis=0. np.take can also write its result back into the input array X through the out= option, which may save memory. The implementation would look like this:

np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)

Sample run -

In [23]: X
Out[23]: 
array([[ 0.60511059,  0.75001599],
       [ 0.30968339,  0.09162172],
       [ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.0957233 ,  0.96210485],
       [ 0.56843186,  0.36654023]])

In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);

In [25]: X
Out[25]: 
array([[ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.30968339,  0.09162172],
       [ 0.56843186,  0.36654023],
       [ 0.0957233 ,  0.96210485],
       [ 0.60511059,  0.75001599]])
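
As a quick sanity check of the sample run, the sketch below confirms that the in-place np.take call leaves row i of the result equal to row perm[i] of the original. The variable names and seed are mine. One caveat: the np.take documentation notes that out is always buffered when mode='raise' (the default), so the memory saving may be smaller than it looks.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
X = rng.random((6, 2))
before = X.copy()               # kept only so the result can be checked

perm = rng.permutation(X.shape[0])

# Gather rows in permuted order and write them back into X itself.
np.take(X, perm, axis=0, out=X)

# Row i of the shuffled X is row perm[i] of the original array.
rows_match = np.array_equal(X, before[perm])
print(rows_match)
```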

Additional performance boost

Here's a trick to speed up np.random.permutation(X.shape[0]) with np.argsort() -

np.random.rand(X.shape[0]).argsort()

Speedup results -

In [32]: X = np.random.random((6000, 2000))

In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop

In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop

Thus, the shuffling solution could be modified to -

np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
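
Why the argsort trick works: the sort order of n independent uniform random values is a random ordering of 0..n-1 (ties are practically impossible with floats). A small sketch, with names of my own choosing, that checks the result really is a permutation. Since sorting is O(n log n), the speedup measured above may not hold for every array size or NumPy version.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 10

# Indices that would sort n random values: a random ordering of 0..n-1.
perm = rng.random(n).argsort()
print(perm)

# A valid permutation contains every index exactly once.
is_permutation = np.array_equal(np.sort(perm), np.arange(n))
print(is_permutation)
```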

Runtime tests -

These tests include the two approaches listed in this post and the np.random.shuffle-based one from @Kasramvd's solution.

In [40]: X = np.random.random((6000, 2000))

In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop

In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop

In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop

So the np.take-based approaches are about twice as slow here and seem worth using only if memory is a concern; otherwise the np.random.shuffle-based solution looks like the way to go.


After some experimenting, I found that the most memory- and time-efficient way to shuffle the rows of an nd-array is to shuffle an index array and then fetch the data through the shuffled index:

rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])
np.random.shuffle(perm)
rand_num2 = rand_num2[perm]
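
The same idea wrapped in a small helper, as a sketch (the function name shuffle_rows_copy is hypothetical). Keep in mind that indexing with an integer array returns a new array, so the original and the shuffled copy both exist until the old reference is dropped.

```python
import numpy as np


def shuffle_rows_copy(a, rng):
    """Return a copy of `a` with its rows in random order (hypothetical helper)."""
    perm = rng.permutation(a.shape[0])  # shuffled row indices
    return a[perm]                      # integer-array indexing makes a copy


rng = np.random.default_rng(7)  # arbitrary seed
data = np.arange(20).reshape(5, 4)
shuffled = shuffle_rows_copy(data, rng)
print(shuffled)
```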

In more detail: here I use memory_profiler to measure memory usage and Python's built-in time module to record the run time, comparing all the previous answers.

import time

import numpy as np


def main():
    # shuffle data itself
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format((time.time() - start)))

    # Shuffle index and get data from shuffled index
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format((time.time() - start)))

    # using np.take()
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format((time.time() - start)))

Timing results

Time for direct shuffle: 0.03345608711242676   # 33.4msec
Time for shuffling index: 0.019818782806396484 # 19.8msec
Time taken by np.take, 0.06726956367492676     # 67.2msec

Memory profiler results

Line #    Mem usage    Increment   Line Contents
================================================
    39  117.422 MiB    0.000 MiB   @profile
    40                             def main():
    41                                 # shuffle data itself
    42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43  208.977 MiB    0.000 MiB       start = time.time()
    44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46                             
    47                                 # Shuffle index and get data from shuffled index
    48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49  300.531 MiB    0.000 MiB       start = time.time()
    50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54                             
    55                                 # using np.take()
    56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57  392.094 MiB    0.000 MiB       start = time.time()
    58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))
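
Since the script that produced this profile is no longer available, here is a rough sketch of a similar measurement. It swaps memory_profiler for the standard-library tracemalloc module, which is my substitution and not what the answer used; as far as I know NumPy reports its buffer allocations to tracemalloc, but treat that as an assumption. The helper names and the smaller array size are also mine, and tracing adds overhead, so the timings are only indicative.

```python
import time
import tracemalloc

import numpy as np


def measure(label, func, arr):
    """Run func(arr) once; return (seconds, peak bytes traced during the call)."""
    tracemalloc.start()
    start = time.perf_counter()
    func(arr)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{label}: {elapsed * 1e3:.2f} ms, peak traced {peak / 2**20:.2f} MiB")
    return elapsed, peak


def direct_shuffle(arr):
    np.random.shuffle(arr)


def index_shuffle(arr):
    perm = np.arange(arr.shape[0])
    np.random.shuffle(perm)
    return arr[perm]


def take_shuffle(arr):
    np.take(arr, np.random.rand(arr.shape[0]).argsort(), axis=0, out=arr)


results = {}
for name, func in [("direct", direct_shuffle),
                   ("index", index_shuffle),
                   ("take", take_shuffle)]:
    # Smaller than the 6000 x 2000 array in the post, to keep the run short.
    arr = np.random.randint(5, size=(600, 200))
    results[name] = measure(name, func, arr)
```

Tracing starts after the array is created, so the peak covers only what each approach allocates on top of the data itself.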

From the numpy.random.shuffle() documentation: "Modify a sequence in-place by shuffling its contents. This function only shuffles the array along the first axis of a multi-dimensional array. The order of sub-arrays is changed but their contents remains the same."

To shuffle the elements within each row of a two-dimensional array A independently (which is different from reordering the rows themselves), you can use the np.vectorize() function:

shuffle = np.vectorize(np.random.permutation, signature='(n)->(n)')

A_shuffled = shuffle(A)
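
A small check of what this actually does, as a sketch with names of my own: with signature '(n)->(n)', np.vectorize applies np.random.permutation to each row separately, so the rows stay where they are and only the values inside each row are reordered.

```python
import numpy as np

# Apply np.random.permutation to every 1-D row of the input separately.
shuffle_within_rows = np.vectorize(np.random.permutation, signature='(n)->(n)')

A = np.arange(12).reshape(3, 4)  # each row is already sorted
A_shuffled = shuffle_within_rows(A)
print(A_shuffled)

# Sorting each shuffled row recovers A, so no value left its original row.
rows_keep_values = np.array_equal(np.sort(A_shuffled, axis=1), A)
print(rows_keep_values)
```

If the goal is to reorder whole rows, as in the question, np.random.shuffle(A) is the right tool.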


I tried many solutions, and in the end I used this simple one:

import numpy as np
from sklearn.utils import shuffle
x = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(shuffle(x, random_state=0))

Output:

[[5 6]
 [3 4]
 [1 2]]

If you have a 3D array, loop over the first axis (axis=0) and apply this function to each 2D slice, like:

# array_3d is your 3D NumPy array (a Python name cannot start with a digit)
np.array([shuffle(item) for item in array_3d])
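
If scikit-learn is not available, the same per-slice loop can be sketched with NumPy alone. This swaps sklearn.utils.shuffle for Generator.permutation, which is my substitution rather than what the answer uses; the array contents and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
array_3d = np.arange(24).reshape(2, 4, 3)

# Shuffle the rows of every 2-D slice independently; columns stay in order.
shuffled_3d = np.array([rng.permutation(item) for item in array_3d])
print(shuffled_3d)
```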


Comments
  • Option 1: shuffled view onto an array. I guess that would mean a custom implementation. (almost) no impact on memory usage, Obv. some impact at runtime. It really depends on how you intend to use this matrix.
  • Option 2: shuffle array in place. np.random.shuffle(x), docs state that "this function only shuffles the array along the first index of a multi-dimensional array", which is good enough for you, right? Obv., some time taken at startup, but from that point, it's as fast as original matrix.
  • Compared to np.random.shuffle(x), shuffling an index array and fetching the data through the shuffled index is a more efficient way to solve this problem. For a more detailed comparison, refer to my answer below.
  • I wonder if this could be sped up by numpy, maybe taking advantage of concurrency.
  • @GeorgSchölly I think this is the most optimized approach available in Python. If you want to speed it up, you need to change the algorithm.
  • I completely agree. I just realized that you are using np.random instead of the Python random module which also contains a shuffle function. I'm sorry for causing confusion.
  • This shuffle is not always working, see my new answer here below. Why is it not always working?
  • Is there a way to choose the axis on which the shuffling should be done (for a >2-d array)? Or is it always implicitly the first dimension that is taken into account? @Kasramvd
  • This sounds nice. Can you add timing information to your post, comparing your np.take with the standard shuffle? On my system np.random.shuffle is faster (27.9 ms) than your take (62.9 ms), but as I read in your post, there is a memory advantage?
  • @robert Just added, check it out!
  • Hi, can you provide the code that produce this output?
  • I lost the code that produced the memory_profiler output, but it can be reproduced very easily by following the steps in the given link.
  • What I like about this answer is that if I have two matched arrays (which coincidentally I do) then I can shuffle both of them and ensure that data in corresponding positions still match. This is useful for randomising the order of my training set
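
The matched-arrays use case from the last comment can be sketched like this: generate one permutation and index both arrays with it, so that corresponding rows stay paired. The function name shuffle_in_unison and the toy data are hypothetical.

```python
import numpy as np


def shuffle_in_unison(a, b, rng):
    """Return row-shuffled copies of a and b using one shared permutation."""
    if a.shape[0] != b.shape[0]:
        raise ValueError("both arrays need the same number of rows")
    perm = rng.permutation(a.shape[0])
    return a[perm], b[perm]


rng = np.random.default_rng(0)  # arbitrary seed
features = np.arange(10).reshape(5, 2)
labels = np.array([0, 10, 20, 30, 40])  # labels[i] == 5 * features[i, 0]

shuffled_features, shuffled_labels = shuffle_in_unison(features, labels, rng)
print(shuffled_features)
print(shuffled_labels)
```

Because both arrays are indexed with the same perm, the relationship between a feature row and its label survives the shuffle.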