Sample rows of pandas dataframe in proportion to counts in a column

I have a large pandas dataframe with about 10,000,000 rows. Each row represents a feature vector. The feature vectors come in natural groups, and the group label is in a column called group_id. I would like to randomly sample, say, 10% of the rows, in proportion to the number of rows in each group_id.

For example, if the group_ids are A, B, A, C, A, B, then I would like half of my sampled rows to have group_id A, two sixths to have group_id B and one sixth to have group_id C.

I can see the pandas function sample but I am not sure how to use it to achieve this goal.


You can use groupby and sample

sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
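As a quick check that taking the same fraction from every group keeps the group proportions, here is a minimal sketch. The toy data and random_state are assumptions for illustration, and it uses DataFrameGroupBy.sample (available in pandas 1.1 and later) as an alternative spelling of the apply form above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 600 rows with groups A, B, C in a 3:2:1 ratio.
df = pd.DataFrame({
    "group_id": np.repeat(["A", "B", "C"], (300, 200, 100)),
    "vals": np.arange(600),
})

# Take 10% of the rows within each group (requires pandas >= 1.1).
sample_df = df.groupby("group_id").sample(frac=0.1, random_state=0)

# Count how many sampled rows each group contributed.
counts = sample_df["group_id"].value_counts().sort_index()
print(counts)
```

Because every group is sampled at the same rate, the sample keeps the 3:2:1 ratio of the original data.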


This is not as simple as grouping and calling .sample. You need to work out the fractions first. Since you want 10% of the total number of rows, split across the groups in fixed proportions, you need to calculate how many rows each group should contribute. Using the split from the question, group A gets 1/20 of the total number of rows (half of 10%), group B gets 1/30 (two sixths of 10%) and group C gets 1/60 (one sixth of 10%). You can put these fractions in a dictionary and then use .groupby and pd.concat to combine the rows sampled from each group into one dataframe. Note that this uses the n parameter of .sample rather than the frac parameter.

fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
N = len(df)
pd.concat(dff.sample(n=int(fracs.get(i)*N)) for i,dff in df.groupby('group_id'))
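The fractions above are hard-coded. A minimal sketch of the same idea wrapped in a helper follows; sample_by_shares, shares and total_frac are hypothetical names introduced here (not part of pandas), and the sketch rounds to the nearest integer where the snippet above truncates with int().

```python
import numpy as np
import pandas as pd

def sample_by_shares(df, col, shares, total_frac, random_state=None):
    """Sample about total_frac * len(df) rows, split across groups by `shares`.

    `shares` maps each group label to its desired share of the sample
    (the shares should sum to 1). Hypothetical helper, not a pandas API.
    """
    n_total = total_frac * len(df)
    parts = []
    for label, group in df.groupby(col):
        # Rows wanted from this group, capped at the group's size.
        n = min(int(round(shares[label] * n_total)), len(group))
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)

# Toy data (an assumption): 40 rows of A, 60 of B, 20 of C.
df = pd.DataFrame({
    "group_id": np.repeat(["A", "B", "C"], (40, 60, 20)),
    "vals": np.arange(120),
})
out = sample_by_shares(df, "group_id", {"A": 1/2, "B": 1/3, "C": 1/6},
                       total_frac=0.1, random_state=0)
out_counts = out["group_id"].value_counts().sort_index()
print(out_counts)
```

The cap at the group's size is a design choice: without it, .sample raises an error when a group has fewer rows than its target (unless replace=True is passed).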
Edit:

This is to highlight the importance of fulfilling the requirement that group_id A should have half of the sampled rows, group_id B two sixths of the sampled rows and group_id C one sixth of the sampled rows, regardless of the original group sizes.

Starting with equal portions: each group starts with 40 rows

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'group_id': ['A','B', 'C']*40,
                   'vals': np.random.randn(120)})
N = len(df1)
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df1.groupby('group_id')))

#     group_id      vals
# 12         A -0.175109
# 51         A -1.936231
# 81         A  2.057427
# 111        A  0.851301
# 114        A  0.669910
# 60         A  1.226954
# 73         B -0.166516
# 82         B  0.662789
# 94         B -0.863640
# 31         B  0.188097
# 101        C  1.802802
# 53         C  0.696984


print(df1.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        24         A  0.161328
#          21         A -1.399320
#          30         A -0.115725
#          114        A  0.669910
# B        34         B -0.348558
#          7          B -0.855432
#          106        B -1.163899
#          79         B  0.532049
# C        65         C -2.836438
#          95         C  1.701192
#          80         C -0.421549
#          74         C -1.089400

First solution: 6 rows for group A (1/2 of the sampled rows), 4 rows for group B (one third of the sampled rows) and 2 rows for group C (one sixth of the sampled rows).

Second solution: 4 rows for each group (each one third of the sampled rows)


Working with differently sized groups: 40 for A, 60 for B and 20 for C

df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
                   'vals': np.random.randn(120)})
N = len(df2)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df2.groupby('group_id')))

#     group_id      vals
# 29         A  0.306738
# 35         A  1.785479
# 21         A -0.119405
# 4          A  2.579824
# 5          A  1.138887
# 11         A  0.566093
# 80         B  1.207676
# 41         B -0.577513
# 44         B  0.286967
# 77         B  0.402427
# 103        C -1.760442
# 114        C  0.717776

print(df2.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        4          A  2.579824
#          32         A  0.451882
#          5          A  1.138887
#          17         A -0.614331
# B        47         B -0.308123
#          52         B -1.504321
#          42         B -0.547335
#          84         B -1.398953
#          61         B  1.679014
#          66         B  0.546688
# C        105        C  0.988320
#          107        C  0.698790

First solution: consistent with the required split.

Second solution: group B now takes 6 of the sampled rows when it is supposed to take only 4.


Working with another set of differently sized groups: 60 for A, 40 for B and 20 for C

df3 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (60, 40, 20)),
                   'vals': np.random.randn(120)})
N = len(df3)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df3.groupby('group_id')))

#     group_id      vals
# 48         A  1.214525
# 19         A -0.237562
# 0          A  3.385037
# 11         A  1.948405
# 8          A  0.696629
# 39         A -0.422851
# 62         B  1.669020
# 94         B  0.037814
# 67         B  0.627173
# 93         B  0.696366
# 104        C  0.616140
# 113        C  0.577033

print(df3.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        4          A  0.284448
#          11         A  1.948405
#          8          A  0.696629
#          0          A  3.385037
#          31         A  0.579405
#          24         A -0.309709
# B        70         B -0.480442
#          69         B -0.317613
#          96         B -0.930522
#          80         B -1.184937
# C        101        C  0.420421
#          106        C  0.058900

This is the only time the second solution offered some consistency (out of sheer luck, I might add).

I hope this proves useful.


I was looking for a similar solution. The code provided by @Vaishali works absolutely fine. What @Abdou is trying to do also makes sense when we want to extract samples from each group based on their proportions of the full data.

# original : 10% from each group
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))

# modified : sample size based on proportions of group size
n = df.shape[0]
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=len(x)/n))
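To see what the modified variant does in practice, here is a sketch on toy data (the data and random_state are assumptions). It is written as an explicit loop with pd.concat instead of apply, but it applies the same per-group rule: each group is sampled at a rate equal to its share of all rows, so with groups of 60, 40 and 20 rows, A is sampled at 50%, B at about 33% and C at about 17%.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption): 60 rows of A, 40 of B, 20 of C.
df = pd.DataFrame({
    "group_id": np.repeat(["A", "B", "C"], (60, 40, 20)),
    "vals": np.arange(120),
})
n = len(df)

# Same per-group rule as the modified variant: frac = group size / total rows.
sample_df = pd.concat(
    group.sample(frac=len(group) / n, random_state=0)
    for _, group in df.groupby("group_id")
)
sizes = sample_df["group_id"].value_counts().sort_index()
print(sizes)
```

Larger groups are therefore sampled more heavily than smaller ones, and the total sample size depends on the group sizes rather than being fixed at 10% of the data.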


The following samples a total of N rows, with each group appearing in its original proportion (rounded to the nearest integer), then shuffles the rows and resets the index. Example data:

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

Short and sweet:

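# Note: weights='A' uses the numeric values in column A as per-row sampling
# weights, so rows with a larger A are more likely to be drawn. It does not
# weight rows by how often their group occurs.
df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)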

Long version

df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
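The long version can be unrolled into a runnable sketch. N = 10 and random_state=1 are assumed values for illustration, and the explicit loop stands in for groupby(...).apply(...) while following the same steps.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20),
))
N = 10  # assumed total sample size

parts = []
for _, group in df.groupby("A"):
    # Group's share of N, rounded to the nearest integer
    # (np.rint rounds exact halves to the nearest even number).
    k = int(np.rint(N * len(group) / len(df)))
    parts.append(group.sample(n=k, random_state=1))

# Shuffle the combined rows and renumber the index.
result = pd.concat(parts).sample(frac=1, random_state=1).reset_index(drop=True)
group_counts = result["A"].value_counts().sort_index()
print(group_counts)
```

Because each group's count is rounded independently, the total can differ slightly from N for other group sizes or other values of N.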
