Extract a sub-section of a Pandas dataframe

I have a dataframe like this:

Name   ID   Level
Name1   A     1
Name2   B     2
Name3   C     3
Name4   D     1
Name5   E     2
Name6   F     1

etc...

I am looking for a way to extract only a subsection of this dataframe based on a name. I want to extract everything from Name4 onwards, stopping before the next row that has Level 1, i.e. extract Name4 to Name5, because Name6 is a Level 1.

Or as another example, I want to extract from Name1 to Name3 as Name4 is a Level 1.

I can do this in Excel using a macro along these lines: find Name1, then look at the Level column of each following row; if it is not 1, take that row and keep going until you hit a Name that has Level 1 again, then stop and output this section.

Hope this makes sense.

Using this dataframe:

In [0]: df
Out[0]: 
    Name ID  Level
0  Name1  A      1
1  Name2  B      2
2  Name3  C      3
3  Name4  D      1
4  Name5  E      2
5  Name6  F      1

Use a helper Series that flags the rows at the target level:

target_lvl = 1
helper_series = (df['Level'] == target_lvl)

In [1]: helper_series
Out[1]: 
0     True
1    False
2    False
3     True
4    False
5     True

Now build a list of the index values at which each subset starts:

ranges = df.index.where(helper_series).dropna().astype(int).tolist()

In [2]: ranges
Out[2]:
[0, 3, 5]

Note that the values in ranges are the index labels of the rows at the target level. Because df has the default integer index, they can also be used as positions with iloc.
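The labels in ranges double as iloc positions only because df uses the default 0..n-1 index. A minimal sketch that computes positions directly with np.flatnonzero, so it would also work with a non-default index (the string index below is invented purely for illustration):

```python
import numpy as np
import pandas as pd

# Same data as above, but with a made-up string index (an assumption for this sketch).
df = pd.DataFrame(
    {
        "Name": ["Name1", "Name2", "Name3", "Name4", "Name5", "Name6"],
        "ID": ["A", "B", "C", "D", "E", "F"],
        "Level": [1, 2, 3, 1, 2, 1],
    },
    index=list("uvwxyz"),
)

target_lvl = 1
helper_series = df["Level"] == target_lvl

# Integer positions (not labels) of the rows at the target level.
positions = np.flatnonzero(helper_series.to_numpy()).tolist()
print(positions)  # [0, 3, 5]
```

These positions can be passed to iloc regardless of what the index labels are.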

Finally, you just need to extract the subsets from ranges:

subsets = []
# Pair each start with the next start; iloc excludes the end position.
for start, end in zip(ranges, ranges[1:]):
    subsets.append(df.iloc[start:end, :])

# The last subset runs from the final target-level row to the end of the dataframe.
if ranges:
    subsets.append(df.iloc[ranges[-1]:, :])

In [3]: subsets
Out[3]:
    Name ID  Level
0  Name1  A      1
1  Name2  B      2
2  Name3  C      3

    Name ID  Level
3  Name4  D      1
4  Name5  E      2

    Name ID  Level
5  Name6  F      1
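To pull out just one section by its starting name, as the question asks, the same idea can be wrapped in a function. This is only a sketch: extract_section is a hypothetical helper name, and it assumes the starting name occurs in the Name column and that sections are delimited by rows at target_lvl.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Name1", "Name2", "Name3", "Name4", "Name5", "Name6"],
    "ID": ["A", "B", "C", "D", "E", "F"],
    "Level": [1, 2, 3, 1, 2, 1],
})


def extract_section(frame, start_name, target_lvl=1):
    """Return rows from start_name up to, but not including, the next target_lvl row."""
    is_start = frame["Name"].eq(start_name).to_numpy()
    if not is_start.any():
        raise KeyError(f"{start_name!r} not found in the Name column")
    # Position of the first row whose Name matches.
    start = int(is_start.argmax())
    levels = frame["Level"].to_numpy()
    # Positions of target-level rows strictly after the start row.
    later = [i for i in range(start + 1, len(frame)) if levels[i] == target_lvl]
    end = later[0] if later else len(frame)
    return frame.iloc[start:end]


section = extract_section(df, "Name4")
print(section)
```

Calling extract_section(df, "Name1") returns the Name1 to Name3 rows in the same way.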

You could do something like this:

Create a new column 'Group' that holds the group number of each row; you can then group by this column:

g = 0
for i in df.index:
    if df.loc[i, "Level"] == 1:
        g += 1
    df.loc[i, "Group"] = g
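The same Group column can probably be built without an explicit loop. A minimal sketch, assuming (as in the loop above) that every Level 1 row starts a new group; the sections dictionary is just one illustrative way to hold the pieces:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Name1", "Name2", "Name3", "Name4", "Name5", "Name6"],
    "ID": ["A", "B", "C", "D", "E", "F"],
    "Level": [1, 2, 3, 1, 2, 1],
})

# A running count of Level 1 rows labels each row with its group number.
df["Group"] = (df["Level"] == 1).cumsum()

# Split into one dataframe per group, keyed by group number.
sections = {key: part for key, part in df.groupby("Group")}
print(sections[2])
```

Here sections[2] holds the Name4 and Name5 rows.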

This collects the names of each group in one place:

df.groupby(df.groupby(['Level']).cumcount())['Name'].apply(lambda x: ','.join(x))

0    Name1,Name2,Name3
1          Name4,Name5
2                Name6
Name: Name, dtype: object

You can then set Level for each group as needed, or change the lambda passed to apply() to produce whatever output you want.
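One caveat: the cumcount key gives the expected groups only when no Level value repeats inside a group. The sketch below uses made-up data in which Level 2 repeats, and compares it with a key built from a running count of Level 1 rows (an alternative technique, not the one used above):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Name1", "Name2", "Name3", "Name4", "Name5", "Name6"],
    "Level": [1, 2, 2, 1, 2, 1],  # illustrative data: Level 2 repeats in the first group
})

# Key from the answer above: position of each row within its Level.
by_cumcount = df.groupby(df.groupby("Level").cumcount())["Name"].apply(",".join)
print(by_cumcount.tolist())  # ['Name1,Name2', 'Name3,Name4', 'Name5,Name6']

# Alternative key: each Level 1 row starts a new group.
key = (df["Level"] == 1).cumsum()
by_cumsum = df.groupby(key)["Name"].apply(",".join)
print(by_cumsum.tolist())  # ['Name1,Name2,Name3', 'Name4,Name5', 'Name6']
```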

Setup dataframe:

import pandas as pd

df = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name5', 'Name6'],
                   'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Level': [1, 2, 3, 1, 2, 1]})

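Find where each new group starts by comparing every Level with the previous row's Level (Series.shift): a row whose Level does not increase starts a new group. Mark those rows as True, then take the cumulative sum to number the groups. This assumes Level only increases within a group.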

grp_markers = (df.Level - df.Level.shift()).fillna(-1).values <= 0
df['grp'] = grp_markers.cumsum()

Find subsets like this:

df[df.grp == 2]

    Name ID  Level    grp
3  Name4  D      1      2
4  Name5  E      2      2

You can now also use groupby on the grp column.
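For instance, a short sketch of two common groupby operations on grp, reusing the setup above (the variable names sizes and second are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name5', 'Name6'],
                   'ID': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Level': [1, 2, 3, 1, 2, 1]})

# Rows whose Level does not increase start a new group.
grp_markers = (df.Level - df.Level.shift()).fillna(-1).values <= 0
df['grp'] = grp_markers.cumsum()

# Number of rows in each group.
sizes = df.groupby('grp').size()
print(sizes)

# Fetch a single group by its number.
second = df.groupby('grp').get_group(2)
print(second)
```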

Comments
  • Extract to what? Please give an example of the output you are expecting
  • Did you check df.groupby?
  • How did you get that 'Level' column? I feel there are easier ways to group a column than looping through each row and checking for a change in the 'Level' variable.
  • Apologies if my question was not thorough enough. I am still learning how to ask correctly and must learn to include the expected output. Thank you for your feedback. I did look at the groupby function, but it did not help. The Level variable comes as part of the data that is downloaded.
  • This almost works, except that if I have another Level right after a Level 1, the last subsection does not get picked up unless I insert a dummy Level 1 row at the end of the dataframe, which I can live with. I did create a helper series myself, but, being a newbie on Pandas, I got stuck on how to use it.
  • I've just edited the answer to include the last subset, in case you want it.