I need to compare columns within a same dataframe and rank them
pandas rank multiple columns
pandas rank within group
pandas rank over partition by
pandas sort by column
pandas rank same value
pandas rank unique
dask dataframe rank
I have a dataframe with 6 columns I need to compare each 3 columns with other three columns another. The 6 columns are same data but values of first 3 are from one method and other three are other method. So I need to compare them for differences or variations.
Df.head() A B C A-1 B-1 C-1 190 289 300 190 287 267
And my conditions are,
conditions = [(combined_min['A'] == combined_min['A-1']) & (combined_min['B'] == combined_min['B-1'] & combined_min['C'] == combined_min['C-1']), (combined_min['A'] > combined_min['A-1']) & (combined_min['B'] > combined_min['B-1'] & combined_min['C'] > combined_min['C-1']), (combined_min['A'] < combined_min['A-1']) & (combined_min['B'] < combined_min['B-1'] & combined_min['C'] < combined_min['C-1'])]
And my choices are,
choices = [ "same", 'kj_greater', 'mi_greater' ]
Then I tried,
combined_min['que'] = np.select(conditions,choices, default=np.nan)
But it is throwing error message,
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
In the end I need a dataframe like this,
A B C A-1 B-1 C-1 que 190 289 300 190 287 267 kj_greater
The if the columns A, B, and C, are higher then kj_greater otherwise mi_greater, if all 6 are same then same.
Edit
After a bit of digging/reflection, I realized that I was wrong: it turns out that &
is a logical operator in Pandas. &
implements pairwise logical and between pd.Series
and pd.DataFrame
objects. Unfortunately, &
has different operator precedence than and
, so you have to be careful with it (in this case, &
has higher precedence than ==
, >
, or <
). The bug in the OP's code just comes down to a lack of parentheses in the right places.
So to get the kind of labeling that the OP was originally after, the code would be:
import numpy as np import pandas as pd data= [ [191, 289, 300, 190, 287, 267], [191, 289, 300, 200, 312, 400], [191, 289, 300, 191, 289, 300], [191, 289, 300, 200, 287, 400], ] combined_min = pd.DataFrame(data=data, columns=['A', 'B','C','A-1','B-1','C-1']) cond = lambda x: [(x['A'] == x['A-1']) & (x['B'] == x['B-1']) & (x['C'] == x['C-1']), (x['A'] > x['A-1']) & (x['B'] > x['B-1']) & (x['C'] > x['C-1']), (x['A'] < x['A-1']) & (x['B'] < x['B-1']) & (x['C'] < x['C-1'])] choices = ['same', 'kj_greater', 'mi_greater'] combined_min['que'] = np.select(cond(combined_min), choices, default=np.nan) print(combined_min)
This outputs:
A B C A-1 B-1 C-1 que 0 191 289 300 190 287 267 kj_greater 1 191 289 300 200 312 400 mi_greater 2 191 289 300 191 289 300 same 3 191 289 300 200 287 400 nan
Optionally, cond
can be boiled down to a one-liner:
from functools import reduce from operator import eq, gt, lt, and_ cond = lambda x: [reduce(and_, (op(x[c], x['{}-1'.format(c)]) for c in 'ABC')) for op in (eq, gt, lt)]
Though this reduces readability somewhat.
pandas.DataFrame.rank, How to rank the group of records that have the same value (i.e. ties):. average: average rank For DataFrame objects, rank only numeric columns if set to True. Thanks for contributing an answer to Stack Overflow! Please be sure to answer the question. Provide details and share your research! But avoid … Asking for help, clarification, or responding to other answers. Making statements based on opinion; back them up with references or personal experience. To learn more, see our tips on writing great
The problem is that you are missing parenthesis on conditions. Each conditions has to be surrounded by parenthesis.
conditions = [(combined_min['A'] == combined_min['A-1']) & (combined_min['B'] == combined_min['B-1']) & (combined_min['C'] == combined_min['C-1']), (combined_min['A'] > combined_min['A-1']) & (combined_min['B'] > combined_min['B-1']) & (combined_min['C'] > combined_min['C-1']), (combined_min['A'] < combined_min['A-1']) & (combined_min['B'] < combined_min['B-1']) & (combined_min['C'] < combined_min['C-1'])]
Python, Pandas Dataframe.rank() method returns a rank of every respective index of a series passed. 'min', 'max', 'first', 'dense') which tells pandas what to do with same values. value and the rank function works on non-numeric value only if it's False. All the values in Name column are unique and hence there is no need to Stack Overflow for Teams is a private, secure spot for you and your coworkers to find and share information. Learn more Comparing two columns in pandas dataframe to create a third one
You're error is in your conditions. The problem is that you are not directly comparing booleans, but rather a set of pd.Series containing a boolean, which connot be directly compared as you do.
So:
df['A'] == df['A-1']
Returns:
0 True dtype: bool
So when you do:
df['A'] == df['A-1'] & df['A'] == df['A-1']
You get the error you mentioned. Try separating each term using parenthesis, and using any()
to get the boolean from the pd.Series:
((df['A'] == df['A-1']) & (df['A'] == df['A-1'])).any()
Ranking Rows of Pandas DataFrame, Example #1 : Here we will create a DataFrame of movies and rank them based on their ratings. filter_none. edit Set the index to newly created column, Rating_Rank We will rank the students based on the highest mark they have scored. filter_none This can be seen for other marks in the table as well. My Personal If you are looking for a way to compare columns for any two or more cells with the same values within the same row, use an IF formula with an OR statement: =IF(OR(A2=B2, B2=C2, A2=C2), "Match", "") In case there are many columns to compare, your OR statement may grow too big in size.
Pandas DataFrame: rank() function, Got it! This site uses cookies to deliver our services and to show you relevant ads. method, How to rank the group of records that have the same value (i.e. ties):. average: average rank of the group; min: lowest rank in the group numeric_only, For DataFrame objects, rank only numeric columns if set to Compare columns of two DataFrames and create Pandas Series It's also possible to use direct assign operation to the original DataFrame and create new column - named 'enh1' in this case. For this purpose the result of the conditions should be passed to pd.Series constructor.
Machine Learning for Email: Spam Filtering and Priority Inbox, We will treat these lookups slightly differently due to differences in column labels in the thread.weights data frame, so we need The process here is fairly straightforward, as we use the match function to find the elements in the weight data frame The rank.message function uses similar rules to the get.weights function for I need to compare 1 column from each data frame to make sure they match and fix any values in that column that don't match. So far I can find the differences in the columns: df1.loc[(df1['col1] != df2['col2'])]
Social Media Data Mining and Analytics, (In a similar vein, we could have used the clusters we obtained from After ranking users' tag frequencies in a decreasing order, we also need to normalize them so relative interest strengths over all the users as a function of the topics' ranks. Create a data frame where the user.id column is the poster's ID, # and the tag Test whether two objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type, but the elements within the columns must be the same dtype.
Comments
- this is due to operator precedence. You should enclose each condition under a parenthesis .
- With your above data, I think the result will be
combined_min['que'] == np.nan
, sincecombined_min['A'] == combined_min['A-1'] == 190
. Maybe makecombined_min['A'] == 191
in your example? - Thanks , it is printing NAH for all rows
- @user1017373 I realized that I was wrong about
&
not being a logical operator, at least for Pandas objects. I posted a corrected answer with code that should label each row according to your original intention. - Thanks with this new editing it complains TypeError: object of type 'function' has no len()
- @user1017373 I think I know what might be causing that
TypeError
. Is there a line in your code that looks like:np.select(cond, choices, default=np.nan)
? You can't pass thecond
lambda directly tonp.select
, you have to call it and pass the result. The line should instead look likenp.select(cond(combined_min), choices, default=np.nan)
. If you're having trouble, first try just copy/pasting the code from my answer and see if you can get that to run as is.