I need to compare columns within a same dataframe and rank them

dataframe rank by column
pandas rank multiple columns
pandas rank within group
pandas rank over partition by
pandas sort by column
pandas rank same value
pandas rank unique
dask dataframe rank

I have a dataframe with 6 columns I need to compare each 3 columns with other three columns another. The 6 columns are same data but values of first 3 are from one method and other three are other method. So I need to compare them for differences or variations.


  A    B    C  A-1  B-1  C-1
190  289  300  190  287  267

And my conditions are,

conditions = [(combined_min['A'] == combined_min['A-1']) & (combined_min['B'] == combined_min['B-1'] & combined_min['C'] == combined_min['C-1']),
              (combined_min['A'] > combined_min['A-1']) & (combined_min['B'] > combined_min['B-1'] & combined_min['C'] > combined_min['C-1']),
              (combined_min['A'] < combined_min['A-1']) & (combined_min['B'] < combined_min['B-1'] & combined_min['C'] < combined_min['C-1'])]

And my choices are,

choices     = [ "same", 'kj_greater', 'mi_greater' ]

Then I tried,

combined_min['que'] = np.select(conditions,choices, default=np.nan)

But it is throwing error message,

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In the end I need a dataframe like this,

  A    B    C  A-1  B-1  C-1         que
190  289  300  190  287  267  kj_greater

The if the columns A, B, and C, are higher then kj_greater otherwise mi_greater, if all 6 are same then same.


After a bit of digging/reflection, I realized that I was wrong: it turns out that & is a logical operator in Pandas. & implements pairwise logical and between pd.Series and pd.DataFrame objects. Unfortunately, & has different operator precedence than and, so you have to be careful with it (in this case, & has higher precedence than ==, >, or <). The bug in the OP's code just comes down to a lack of parentheses in the right places.

So to get the kind of labeling that the OP was originally after, the code would be:

import numpy as np
import pandas as pd

data= [
    [191, 289, 300, 190, 287, 267],
    [191, 289, 300, 200, 312, 400],
    [191, 289, 300, 191, 289, 300],
    [191, 289, 300, 200, 287, 400],
combined_min = pd.DataFrame(data=data, columns=['A', 'B','C','A-1','B-1','C-1'])

cond = lambda x: [(x['A'] == x['A-1']) & (x['B'] == x['B-1']) & (x['C'] == x['C-1']),
                  (x['A'] > x['A-1'])  & (x['B'] > x['B-1'])  & (x['C'] > x['C-1']),
                  (x['A'] < x['A-1'])  & (x['B'] < x['B-1'])  & (x['C'] < x['C-1'])]
choices = ['same', 'kj_greater', 'mi_greater']

combined_min['que'] = np.select(cond(combined_min), choices, default=np.nan)

This outputs:

     A    B    C  A-1  B-1  C-1         que
0  191  289  300  190  287  267  kj_greater
1  191  289  300  200  312  400  mi_greater
2  191  289  300  191  289  300        same
3  191  289  300  200  287  400         nan

Optionally, cond can be boiled down to a one-liner:

from functools import reduce
from operator import eq, gt, lt, and_

cond = lambda x: [reduce(and_, (op(x[c], x['{}-1'.format(c)]) for c in 'ABC')) for op in (eq, gt, lt)]

Though this reduces readability somewhat.

pandas.DataFrame.rank, How to rank the group of records that have the same value (i.e. ties):. average: average rank For DataFrame objects, rank only numeric columns if set to True. Thanks for contributing an answer to Stack Overflow! Please be sure to answer the question. Provide details and share your research! But avoid … Asking for help, clarification, or responding to other answers. Making statements based on opinion; back them up with references or personal experience. To learn more, see our tips on writing great

The problem is that you are missing parenthesis on conditions. Each conditions has to be surrounded by parenthesis.

conditions = [(combined_min['A'] == combined_min['A-1']) & (combined_min['B'] == combined_min['B-1']) & (combined_min['C'] == combined_min['C-1']),
          (combined_min['A'] > combined_min['A-1']) & (combined_min['B'] > combined_min['B-1']) & (combined_min['C'] > combined_min['C-1']),
          (combined_min['A'] < combined_min['A-1']) & (combined_min['B'] < combined_min['B-1']) & (combined_min['C'] < combined_min['C-1'])]

Python, Pandas Dataframe.rank() method returns a rank of every respective index of a series passed. 'min', 'max', 'first', 'dense') which tells pandas what to do with same values. value and the rank function works on non-numeric value only if it's False. All the values in Name column are unique and hence there is no need to  Stack Overflow for Teams is a private, secure spot for you and your coworkers to find and share information. Learn more Comparing two columns in pandas dataframe to create a third one

You're error is in your conditions. The problem is that you are not directly comparing booleans, but rather a set of pd.Series containing a boolean, which connot be directly compared as you do.


df['A'] == df['A-1']


0    True
dtype: bool

So when you do:

df['A'] == df['A-1'] & df['A'] == df['A-1']

You get the error you mentioned. Try separating each term using parenthesis, and using any() to get the boolean from the pd.Series:

((df['A'] == df['A-1']) & (df['A'] == df['A-1'])).any()

Ranking Rows of Pandas DataFrame, Example #1 : Here we will create a DataFrame of movies and rank them based on their ratings. filter_none. edit Set the index to newly created column, Rating_Rank We will rank the students based on the highest mark they have scored. filter_none This can be seen for other marks in the table as well. My Personal  If you are looking for a way to compare columns for any two or more cells with the same values within the same row, use an IF formula with an OR statement: =IF(OR(A2=B2, B2=C2, A2=C2), "Match", "") In case there are many columns to compare, your OR statement may grow too big in size.

Pandas DataFrame: rank() function, Got it! This site uses cookies to deliver our services and to show you relevant ads. method, How to rank the group of records that have the same value (i.e. ties):. average: average rank of the group; min: lowest rank in the group numeric_only, For DataFrame objects, rank only numeric columns if set to  Compare columns of two DataFrames and create Pandas Series It's also possible to use direct assign operation to the original DataFrame and create new column - named 'enh1' in this case. For this purpose the result of the conditions should be passed to pd.Series constructor.

Machine Learning for Email: Spam Filtering and Priority Inbox, We will treat these lookups slightly differently due to differences in column labels in the thread.weights data frame, so we need The process here is fairly straightforward, as we use the match function to find the elements in the weight data frame The rank.message function uses similar rules to the get.weights function for  I need to compare 1 column from each data frame to make sure they match and fix any values in that column that don't match. So far I can find the differences in the columns: df1.loc[(df1['col1] != df2['col2'])]

Social Media Data Mining and Analytics, (In a similar vein, we could have used the clusters we obtained from After ranking users' tag frequencies in a decreasing order, we also need to normalize them so relative interest strengths over all the users as a function of the topics' ranks. Create a data frame where the user.id column is the poster's ID, # and the tag  Test whether two objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type, but the elements within the columns must be the same dtype.

  • this is due to operator precedence. You should enclose each condition under a parenthesis .
  • With your above data, I think the result will be combined_min['que'] == np.nan, since combined_min['A'] == combined_min['A-1'] == 190. Maybe make combined_min['A'] == 191 in your example?
  • Thanks , it is printing NAH for all rows
  • @user1017373 I realized that I was wrong about & not being a logical operator, at least for Pandas objects. I posted a corrected answer with code that should label each row according to your original intention.
  • Thanks with this new editing it complains TypeError: object of type 'function' has no len()
  • @user1017373 I think I know what might be causing that TypeError. Is there a line in your code that looks like: np.select(cond, choices, default=np.nan)? You can't pass the cond lambda directly to np.select, you have to call it and pass the result. The line should instead look like np.select(cond(combined_min), choices, default=np.nan). If you're having trouble, first try just copy/pasting the code from my answer and see if you can get that to run as is.