How to get a subset of rows from a group in a pandas dataframe?

I have a dataframe with an ID column and a binary column, as in the example below:

     ID    BINARY_MASK
0   101        1
1   101        0
2   101        1
3   101        1
4   101        1
5   101        1
6   101        0
7   101        1
8   102        1 
9   102        1
11  102        1
12  102        1
13  102        0 
14  102        0

What I want to do is get the first four consecutive entries that are 1, per ID group. The result I would like to see is the following:

     ID    BINARY_MASK
2   101        1
3   101        1
4   101        1
5   101        1
8   102        1 
9   102        1
11  102        1
12  102        1

The index inside the group where there are four consecutive ones differs per group, like in the example. How do I do this?

I have tried the solution that was offered by Bill G in this question, but that didn't work for me.

Working with Pandas dataframes and Python 3.6

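Create a helper Series that labels each run of consecutive equal values: compare every value with the shifted value using ne (!=) and take the cumsum. Pass that Series to GroupBy.transform to get the size of each run, combine the result with the condition BINARY_MASK == 1, and finally filter by boolean indexing: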

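# Label each run of identical consecutive values with its own number
s = df['BINARY_MASK'].ne(df['BINARY_MASK'].shift()).cumsum()
# True for rows that belong to a run of at least four values
m1 = df.groupby(s)['BINARY_MASK'].transform('size') >= 4
# True for rows where the mask is 1
m2 = df['BINARY_MASK'] == 1

# Keep only the rows that satisfy both conditions
df = df[m1 & m2]
print(df)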
     ID  BINARY_MASK
2   101            1
3   101            1
4   101            1
5   101            1
7   101            1
8   102            1
9   102            1
11  102            1
12  102            1
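
Note that the output above still contains index 7: the helper Series ignores ID, so the single 1 at the end of group 101 joins the run of ones that starts group 102, and runs longer than four are kept whole. The following is a minimal sketch of a per-ID variant, not part of the original answer. It assumes that a run should also break whenever ID changes and that only the first four rows of the first qualifying run are wanted. The names changed, run_id, run_size, qualifying and result are illustrative.

```python
import pandas as pd

# Sample data from the question (note that index 10 is missing there too)
df = pd.DataFrame(
    {
        "ID": [101] * 8 + [102] * 6,
        "BINARY_MASK": [1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14],
)

# Assumption: a new run starts when the mask value OR the ID changes
changed = (
    df["BINARY_MASK"].ne(df["BINARY_MASK"].shift())
    | df["ID"].ne(df["ID"].shift())
)
run_id = changed.cumsum()

# Size of the run each row belongs to
run_size = df.groupby(run_id)["BINARY_MASK"].transform("size")

# Rows that are 1 and sit in a run of at least four values
qualifying = df[(run_size >= 4) & (df["BINARY_MASK"] == 1)]

# First four qualifying rows per ID; they always come from the first
# qualifying run, because that run has at least four rows
result = qualifying.groupby("ID").head(4)
print(result)
```

On the sample data this keeps indices 2, 3, 4, 5 for ID 101 and 8, 9, 11, 12 for ID 102, which matches the result requested in the question.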

query and groupby with head

The easiest approach is to keep only the rows where BINARY_MASK is 1 before grouping. The filtering can be done in several ways; I chose to use query.

df.query('BINARY_MASK == 1').groupby('ID').head(4)

     ID  BINARY_MASK
0   101            1
2   101            1
3   101            1
4   101            1
8   102            1
9   102            1
11  102            1
12  102            1

Use groupby + head:

df[df['BINARY_MASK']==1].groupby('ID').head(4)

     ID  BINARY_MASK
0   101            1
2   101            1
3   101            1
4   101            1
8   102            1
9   102            1
11  102            1
12  102            1
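
As a quick check, here is a minimal self-contained sketch that rebuilds the sample data from the question and compares the query form with the boolean-mask form. The variable names by_query and by_mask are illustrative. Both forms take the first four ones per ID whether or not they are consecutive, which is why index 0 appears for ID 101.

```python
import pandas as pd

# Sample data from the question (note that index 10 is missing there too)
df = pd.DataFrame(
    {
        "ID": [101] * 8 + [102] * 6,
        "BINARY_MASK": [1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14],
)

# Filter with query, then take the first four rows of each ID group
by_query = df.query("BINARY_MASK == 1").groupby("ID").head(4)

# Same filter written as a boolean mask
by_mask = df[df["BINARY_MASK"] == 1].groupby("ID").head(4)

# The two spellings select exactly the same rows
print(by_query.equals(by_mask))
print(by_mask.index.tolist())
```

The choice between the two is mostly stylistic: query reads closer to the filter condition, while the boolean mask avoids parsing an expression string.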

Comments
  • Thanks for the answer! However I have worded my question wrong (edited now) and am looking for the first 4 consecutive ones in the data. If you look at the example, for ID 101 I want to retrieve indices 2, 3, 4 and 5, and with your answer I would get 0, 2, 3 and 4.
  • That completely changes the nature of the question. I suggest rolling the edit back and asking a new question.
  • Ok, then I will do that.
  • @jezrael I'd still advise that OP ask a new question. We should discourage changing the nature of the questions after people have answered.
  • @ChubaChuubs - I absolutely agree when a question is modified radically, but here only one word was added, and it was already mentioned in the last paragraph, so a new question is not necessary (in my opinion).