Pandas: Selecting and modifying dataframe based on even more complex criteria

I was looking at two related threads, and though my question is similar, it has a few differences. I have a dataframe full of floats that I want to replace with strings. Say:

      A     B       C
 A    0     1.5     13
 B    0.5   100.2   7.3
 C    1.3   34      0.01

I want to replace the values in this table according to several criteria, but only the first replacement works:

df[df<1]='N' # Works
df[(df>1)&(df<10)]='L' # Doesn't work
df[(df>10)&(df<50)]='M'  # Doesn't work
df[df>50]='H'  # Doesn't work

If I instead restrict the selection in the 2nd line to cells that are floats, it still doesn't work:

((df.applymap(type)==float) & (df<10) & (df>1)) #Doesn't work

I was wondering how to apply pd.DataFrame().mask here, or whether there is any other way. How should I solve this?

Alternatively, I know I could go column by column and apply the substitutions to each series, but this seems a bit counterproductive.

Edit: Could anyone explain why the 4 simple assignments above do not work?
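
On the question in the edit, a likely explanation is that the first assignment leaves the frame holding both strings and floats, so the next comparison (`df > 1`) has to compare `'N'` with a number, which Python 3 rejects with a TypeError. The sketch below reproduces that failure and shows one way around it: compute all four masks while the data is still numeric, then assign. The mask names (`is_n`, `is_l`, ...) and the object-dtype copy `out` are illustrative choices, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1.5, 13], [0.5, 100.2, 7.3], [1.3, 34, 0.01]],
    index=list("ABC"), columns=list("ABC"),
)

# Compute every mask up front, while all cells are still numbers.
is_n = df < 1
is_l = (df > 1) & (df < 10)
is_m = (df > 10) & (df < 50)
is_h = df > 50

# Work on an object-dtype copy so string labels can sit next to floats.
out = df.astype(object)
out[is_n] = "N"

# The frame now holds both 'N' and floats, so a numeric comparison fails.
try:
    out > 1
    comparison_failed = False
except TypeError:
    comparison_failed = True

# The precomputed masks are unaffected by the strings already written.
out[is_l] = "L"
out[is_m] = "M"
out[is_h] = "H"
print(out)
```

Because the masks are built before any label is written, the order of the four assignments no longer matters.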

You can use searchsorted

Copy
labels = np.array(list('NLMH'))
breaks = np.array([1, 10, 50])
pd.DataFrame(
    labels[breaks.searchsorted(df.values)].reshape(df.shape),
    df.index, df.columns)

   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N

In Place
labels = np.array(list('NLMH'))
breaks = np.array([1, 10, 50])
df[:] = labels[breaks.searchsorted(df.values)].reshape(df.shape)
df

   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N
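
To see how the searchsorted lookup treats values that fall exactly on a break, here is a small sketch on a flat array. The `values` array is made up for illustration. With the default `side="left"`, a value equal to a break goes into the lower bin; `side="right"` puts it into the upper one.

```python
import numpy as np

labels = np.array(list("NLMH"))
breaks = np.array([1, 10, 50])

# Illustrative values, including the three break points themselves.
values = np.array([0.5, 1.0, 1.5, 10.0, 13.0, 50.0, 100.2])

# searchsorted returns, for each value, the index of the bin it falls into.
left = breaks.searchsorted(values)                 # side="left" is the default
right = breaks.searchsorted(values, side="right")

print(labels[left])   # breaks belong to the lower bin
print(labels[right])  # breaks belong to the upper bin
```

Which side to pick depends on whether a value such as exactly 10 should count as 'L' or 'M'; the question's strict inequalities leave that open.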

Chained pure Pandas approach with pandas.DataFrame.mask

Note: DataFrame.mask itself is not deprecated; what was deprecated in version 0.21 is its raise_on_error argument.

df.mask(df.lt(1), 'N').mask(df.gt(1) & df.lt(10), 'L') \
  .mask(df.gt(10) & df.lt(50), 'M').mask(df.gt(50), 'H')

   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N
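
The same chain can be written as a loop over (mask, label) pairs, which may be easier to extend to more rules. This is only a sketch: the helper name `label_with_masks` is hypothetical, and it assumes that values sitting exactly on 1, 10 or 50 should go to the upper bin (`ge` instead of `gt`), something the strict inequalities above leave unlabeled.

```python
import pandas as pd

def label_with_masks(frame):
    """Return a copy of a numeric frame with each cell replaced by N/L/M/H."""
    # All masks are computed on the original numeric frame.
    rules = [
        (frame.lt(1), "N"),
        (frame.ge(1) & frame.lt(10), "L"),   # assumption: lower bound inclusive
        (frame.ge(10) & frame.lt(50), "M"),
        (frame.ge(50), "H"),
    ]
    result = frame.astype(object)  # object dtype can hold the string labels
    for cond, label in rules:
        result = result.mask(cond, label)
    return result

df = pd.DataFrame(
    [[0, 1.5, 13], [0.5, 100.2, 7.3], [1.3, 34, 0.01]],
    index=list("ABC"), columns=list("ABC"),
)
labeled = label_with_masks(df)
edges = label_with_masks(pd.DataFrame({"x": [1.0, 10.0, 50.0]}))
print(labeled)
print(edges)
```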

Use numpy.select with DataFrame constructor:

m1 = df < 1
m2 = (df>1)&(df<10)
m3 = (df>10)&(df<50)
m4 = df>50

vals = list('NLMH')

df = pd.DataFrame(np.select([m1,m2,m3,m4], vals), index=df.index, columns=df.columns)
print (df)
   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N
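
One detail worth knowing about numpy.select is its `default` argument, which fills every cell that matches none of the conditions. The sketch below uses a made-up single-column frame and a placeholder label `"?"` (both assumptions for illustration) to show that values exactly equal to 1 or 10 match none of the strict inequalities.

```python
import numpy as np
import pandas as pd

# Illustrative data that includes two boundary values (1.0 and 10.0).
df = pd.DataFrame({"x": [0.5, 1.0, 7.3, 10.0, 34.0, 100.2]})

conditions = [
    df < 1,
    (df > 1) & (df < 10),
    (df > 10) & (df < 50),
    df > 50,
]
labels = list("NLMH")

# Cells matching no condition receive the default instead of a label.
picked = np.select([c.to_numpy() for c in conditions], labels, default="?")
result = pd.DataFrame(picked, index=df.index, columns=df.columns)
print(result)
```

Passing a string default also keeps the result a pure string array rather than mixing labels with the numeric default of 0.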

By using pd.cut

pd.cut(df.stack(),[-1,1,10,50,np.inf],labels=list('NLMH')).unstack()
Out[309]: 
   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N
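
A variation on the pd.cut idea, sketched under two assumptions that are not in the answer above: the lowest bin edge is `-np.inf` so that values below -1 are still labeled, and `right=False` makes each bin include its lower edge (so exactly 10 becomes 'M'). It applies pd.cut column by column instead of stacking.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1.5, 13], [0.5, 100.2, 7.3], [1.3, 34, 0.01]],
    index=list("ABC"), columns=list("ABC"),
)

bins = [-np.inf, 1, 10, 50, np.inf]
labels = list("NLMH")

# right=False gives the bins [-inf, 1), [1, 10), [10, 50), [50, inf).
result = df.apply(lambda col: pd.cut(col, bins=bins, labels=labels, right=False))
print(result)

# Boundary values land in the upper bin with right=False.
edges = pd.cut(pd.Series([1.0, 10.0, 50.0]), bins=bins, labels=labels, right=False)
print(edges.tolist())
```

The resulting columns are categorical; call `.astype(str)` on the frame if plain strings are needed.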

Comments
  • ah! I was really thinking about doing this, but wasn't sure how to use the 4 rules in a mask (I learned about mask from your previous answers on my posts a while back)