Shuffle DataFrame rows

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows, so that all Type's are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?

The idiomatic way to do this with pandas is to use the .sample method of your dataframe, i.e.

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
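
As a minimal sketch of what frac=1 does, the snippet below builds a small frame shaped like the one in the question and checks that the shuffle keeps every row and only changes the order. The values and the random_state seed are illustrative assumptions, not part of the original post.

```python
import pandas as pd

# Small stand-in for the DataFrame in the question (values are illustrative).
df = pd.DataFrame(
    {
        "Col1": [1, 4, 7, 10, 13, 16],
        "Col2": [2, 5, 8, 11, 14, 17],
        "Col3": [3, 6, 9, 12, 15, 18],
        "Type": [1, 1, 2, 2, 3, 3],
    }
)

# frac=1 samples 100% of the rows without replacement, i.e. a permutation.
# random_state is optional; it is fixed here only to make the run repeatable.
shuffled = df.sample(frac=1, random_state=42)

# Same rows and same index labels, in a (most likely) different order.
print(shuffled)
print(sorted(shuffled.index) == list(df.index))
```

Leaving out random_state gives a different order on each run, which is usually what you want outside of tests.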


Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
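
To see the effect of drop=True, here is a small sketch (the frame and the seed are made up for illustration) that compares the two variants of reset_index:

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})

shuffled = df.sample(frac=1, random_state=0)  # seed chosen arbitrarily

# drop=True discards the old labels and renumbers the rows 0..n-1.
renumbered = shuffled.reset_index(drop=True)

# Without drop=True the old labels are kept in a new column named "index".
with_old_labels = shuffled.reset_index()

print(list(renumbered.columns))
print(list(with_old_labels.columns))
```

Keeping the old labels can be handy if you later need to map shuffled rows back to their original positions.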

Follow-up note: Although the above operation may not look in-place, Python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)

You can simply use sklearn for this:

from sklearn.utils import shuffle
df = shuffle(df)
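
A slightly fuller sketch of the same idea, assuming scikit-learn is installed. The sample data and the random_state value are illustrative additions, not part of the original answer.

```python
import pandas as pd
from sklearn.utils import shuffle

# Illustrative data, not taken from the question.
df = pd.DataFrame({"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]})

# random_state is optional; fixing it makes the shuffle repeatable.
shuffled = shuffle(df, random_state=0)

# The original index labels travel with the rows; renumber them if needed.
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```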

You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can use e.g. np.random.permutation (np.random.choice with replace=False is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep=r"\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want the index renumbered 0, 1, ..., n-1 as in your example, you can simply reset the index of the shuffled result: df_shuffled.reset_index(drop=True)
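
The same approach can be sketched with NumPy's Generator API, storing the permutation so the shuffle can be repeated exactly. The data, the seed, and the variable names below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative frame with non-consecutive labels, similar to the session above.
df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]},
    index=[0, 1, 20, 21, 45, 46],
)

rng = np.random.default_rng(seed=12345)  # seed value is arbitrary
perm = rng.permutation(len(df))          # positions 0..n-1 in random order

# iloc selects by position, so this works whatever the index labels are.
df_shuffled = df.iloc[perm]

# Reusing the stored permutation reproduces exactly the same order.
df_again = df.iloc[perm]

# Optional: renumber the rows 0..n-1.
df_renumbered = df_shuffled.reset_index(drop=True)
print(df_renumbered)
```

Because the permutation is a plain array of positions, it can also be saved to disk alongside results if the exact ordering needs to be reproduced later.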

(I don't have enough reputation to comment on the top post, so I hope someone else can do that for me.) A concern was raised about whether the first method:

df.sample(frac=1)

makes a deep copy or just modifies the dataframe. I ran the following code:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

and my results were:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.
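
A complementary check, sketched here under the assumption of a small made-up frame, is to confirm that the original dataframe is left untouched by the shuffle and by later writes to the result:

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]})
before = df.copy()

shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)

# A different Python object is returned ...
print(shuffled is df)

# ... and the original frame still has its rows in the original order.
print(df.equals(before))

# Writing to the shuffled frame does not touch the original either.
shuffled.loc[0, "Col1"] = -1
print(df.equals(before))
```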

Comments
  • Re. your note, the sample() method doesn't have an inplace parameter, so it seems like it is (currently) not possible to do what you suggested without creating a new object.
  • Quoting from above "Note: If you wish to shuffle your dataframe in-place [...]".
  • Yes, this is exactly what I wanted to show in my first comment, you have to assign the necessary memory twice, which is quite far from doing it in place.
  • no, it doesn't copy the DataFrame, just look at this line: github.com/pandas-dev/pandas/blob/v0.23.0/pandas/core/…
  • @PV8 Yes you can.
  • This is nice, but you may need to reset your indexes after shuffling: df.reset_index(inplace=True, drop=True)
  • Doesn't df = df.sample(frac=1) do the exact same thing as df = sklearn.utils.shuffle(df)? According to my measurements df = df.sample(frac=1) is faster and seems to perform the exact same action. They also both allocate new memory. np.random.shuffle(df.values) is the slowest, but does not allocate new memory.
  • In terms of shuffling the axis along with the data, it seems like it can do the same. And yes, it seems like df.sample(frac=1) is about 20% faster than sklearn.utils.shuffle(df), using the same code above. Or you could do sklearn.utils.shuffle(ndarray) to get a different result.
  • Please have a look at the Follow-up note of the original answer. There you'll see that even though the references have changed (different ids), the underlying object is not copied. In other words, the operation is effectively in-memory (although admittedly it's not obvious).
  • Please note that this changes the indices in the original df, as well as producing a copy, which you are saving into df_shuffled. But, more worryingly, anything that does not depend on the index, for example `df_shuffled.iterrows()`, will produce exactly the same order as df. In summary, use with caution!
  • @Jblasco This is incorrect, the original df is not changed at all. Documentation of np.random.permutation: "...If x is an array, make a copy and shuffle the elements randomly". Documentation of DataFrame.reindex: "A new object is produced unless the new index is equivalent to the current one and copy=False". So the answer is perfectly safe (albeit producing a copy).
  • @AndreasSchörgenhumer, thank you for pointing this out, you are partially right! I knew I had tried it, so I did some testing. Despite what the documentation of np.random.permutation says, and depending on versions of numpy, you get the effect I described or the one you mention. With numpy > 1.15.0, creating a dataframe and doing a plain np.random.permutation(df.index), the indices in the original df change. The same is not true for numpy == 1.14.6. So, more than ever, I repeat my warning: that way of doing things is dangerous because of unforeseen side effects and version dependencies.
  • @Jblasco You are right, thank you for the details. I was running numpy 1.14, so everything worked just fine. With numpy 1.15 there seems to be a bug somewhere. In the light of this bug, your warnings are currently indeed correct. However, as it is a bug and the documentation states other behavior, I still stick to my previous statement that the answer is safe (given that the documentation does reflect the actual behavior, which we should normally be able to rely on).
  • @AndreasSchörgenhumer, not quite sure if it's a bug or a feature, to be honest. Documentation guarantees a copy of an array, not an Index type... In any case, I base my recommendations/warnings on actual behaviour, not on the docs :p
  • I prefer this method as it means the shuffle can be repeated if I need to reproduce my algorithm output exactly, by storing the randomised index to a variable.