Python pandas: exclude rows below a certain frequency count
So I have a pandas DataFrame that looks like this:
    r vals  positions
       1.2          1
       1.8          2
       2.3          1
       1.8          1
       2.1          3
       2.0          3
       1.9          1
       ...        ...
I would like to filter out all rows whose position value does not appear at least 20 times. I have seen something like this
    g = df.groupby('positions')
    g.filter(lambda x: len(x) > 20)
but this does not seem to work and I do not understand how to get the original dataframe back from this. Thanks in advance for the help.
On your limited dataset the following works:
    In : df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
    Out:
    0    1.2
    2    2.3
    3    1.8
    6    1.9
    Name: r vals, dtype: float64
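You can assign the result of this filter and use its index to select the matching rows of your original df. Selecting on the index is safer than matching the values with isin, which would also keep row 1 (position 2) because its r vals of 1.8 also occurs in the kept group:

    In : filtered = df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
         df.loc[filtered.index]
    Out:
       r vals  positions
    0     1.2          1
    2     2.3          1
    3     1.8          1
    6     1.9          1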
You just need to change 3 to 20 in your case.
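For reference, here is a minimal runnable sketch of this step, assuming the seven sample rows shown in the question (the real data is longer). It compares two ways of mapping the filtered Series back onto df: by index label and by value. Matching by value can keep extra rows when the same r vals number also occurs in a discarded group, so the index-based form is the safer choice.

```python
import pandas as pd

# Sample rows copied from the question; the real DataFrame is longer.
df = pd.DataFrame({
    'r vals': [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    'positions': [1, 2, 1, 1, 3, 3, 1],
})

# Keep the 'r vals' of groups with at least 3 rows (use 20 on the real data).
filtered = df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)

# Select by index label: only rows from the surviving groups.
by_index = df.loc[filtered.index]

# Select by value: row 1 (position 2) also matches, because its value 1.8
# appears in the surviving group as well.
by_value = df[df['r vals'].isin(filtered)]

print(by_index)
print(by_value)
```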
Another approach would be to use value_counts to create an aggregate series, which we can then use to filter your df:
    In : counts = df['positions'].value_counts()
         counts
    Out:
    1    4
    3    2
    2    1
    dtype: int64

    In : counts[counts > 3]
    Out:
    1    4
    dtype: int64

    In : df[df['positions'].isin(counts[counts > 3].index)]
    Out:
       r vals  positions
    0     1.2          1
    2     2.3          1
    3     1.8          1
    6     1.9          1
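A related variation, which is not part of the answer above: groupby(...).transform('size') broadcasts each group's row count back onto every row, giving a mask that is already aligned with df. This is only a sketch on the question's sample rows; min_count is a hypothetical name for the threshold.

```python
import pandas as pd

# Sample rows copied from the question; the real DataFrame is longer.
df = pd.DataFrame({
    'r vals': [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    'positions': [1, 2, 1, 1, 3, 3, 1],
})

min_count = 4  # hypothetical threshold name; this would be 20 in the question

# For every row, the number of rows that share its 'positions' value.
group_sizes = df.groupby('positions')['positions'].transform('size')

# Boolean mask aligned with df, so ordinary indexing keeps the original rows.
result = df[group_sizes >= min_count]
print(result)
```

Because the mask has the same index as df, it can be combined with other row conditions using & and | before indexing.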
If you want to filter the groupby object on the dataframe rather than a Series then you can call filter on the groupby object directly:
    In : filtered = df.groupby('positions').filter(lambda x: len(x) >= 3)
         filtered
    Out:
       r vals  positions
    0     1.2          1
    2     2.3          1
    3     1.8          1
    6     1.9          1
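To see the same call with the threshold from the question, here is a sketch on made-up data (the group sizes 25, 20 and 5 are assumptions chosen for illustration). The result is an ordinary DataFrame holding the surviving rows of the original with their original index labels, so there is nothing further to "get back".

```python
import numpy as np
import pandas as pd

# Made-up data: position 1 appears 25 times, 2 appears 20 times, 3 appears 5 times.
rng = np.random.default_rng(0)
positions = np.repeat([1, 2, 3], [25, 20, 5])
df = pd.DataFrame({
    'r vals': rng.uniform(1.0, 3.0, size=len(positions)).round(1),
    'positions': positions,
})

# ">= 20" implements "at least 20 times"; "> 20" would also drop position 2.
filtered = df.groupby('positions').filter(lambda x: len(x) >= 20)

print(filtered['positions'].value_counts())
print(len(df), len(filtered))
```

Note that filter returns a new object and does not modify df, so the result has to be assigned to a name.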
I like the following method:
    import pandas as pd


    def filter_by_freq(df: pd.DataFrame, column: str, min_freq: int) -> pd.DataFrame:
        """Filters the DataFrame based on the value frequency in the specified column.

        :param df: DataFrame to be filtered.
        :param column: Column name that should be frequency filtered.
        :param min_freq: Minimal value frequency for the row to be accepted.
        :return: Frequency filtered DataFrame.
        """
        # Frequencies of each value in the column.
        freq = df[column].value_counts()
        # Select frequent values. Value is in the index.
        frequent_values = freq[freq >= min_freq].index
        # Return only rows whose value frequency is at or above the threshold.
        return df[df[column].isin(frequent_values)]
It is much faster than the filter lambda method in the accepted answer - python overhead is minimised.
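The speed claim is easy to check on your own data. The sketch below uses made-up data with many small groups (the sizes and the threshold are assumptions), confirms that both approaches select the same rows, and times them with timeit. Actual timings depend on the machine, the pandas version and the number of groups, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data: 20,000 rows spread over roughly 2,000 position values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'r vals': rng.uniform(1.0, 3.0, size=20_000),
    'positions': rng.integers(0, 2_000, size=20_000),
})
min_freq = 10  # assumed threshold for this experiment


def with_groupby_filter():
    # Calls the Python lambda once per group.
    return df.groupby('positions').filter(lambda x: len(x) >= min_freq)


def with_value_counts():
    # Counts and membership test run inside pandas, with no per-group Python call.
    freq = df['positions'].value_counts()
    return df[df['positions'].isin(freq[freq >= min_freq].index)]


# Both approaches should select exactly the same rows.
same = with_groupby_filter().equals(with_value_counts())
print(same)

print(timeit.timeit(with_groupby_filter, number=3))
print(timeit.timeit(with_value_counts, number=3))
```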
How about selecting all positions rows with values >= 20:

    mask = df['positions'] >= 20
    sel = df.loc[mask, :]
- I think you misunderstood the question. I want to count the rows with position, for example, being equal to 1 and then remove all of those rows if the count is < 20. It does not matter what the value of position is, just the count of the rows containing that same value. Sorry for the confusion.