Pandas: check whether at least one of values in duplicates' rows is 1

This problem may be rather specific, but I suspect many others run into it as well. I have a DataFrame of the form:

asd = pd.DataFrame({'Col1': ['a', 'b', 'b','a','a'], 'Col2': [0,0,0,1,1]})

The resulting table looks like this:

I -- Col1 -- Col2
1 -- a    -- 0
2 -- b    -- 0
3 -- b    -- 0
4 -- a    -- 1
5 -- a    -- 1

What I am trying to do: if at least one "a" in Col1 has a corresponding value of 1 in Col2, then put 1 in Col3 for every row where Col1 is "a"; otherwise (if not a single "a" has a 1), put 0 for all of them. Then repeat for every other value in Col1.

The result of the operation should look like this:

I -- Col1 -- Col2 -- Col3
1 -- a    -- 0    -- 1     because "a" has value of 1 in 4th and 5th lines
2 -- b    -- 0    -- 0     because all "b" have values of 0
3 -- b    -- 0    -- 0
4 -- a    -- 1    -- 1
5 -- a    -- 1    -- 1

Currently I am doing this:

asd['Col3'] = 0
col1_uniques = asd.drop_duplicates(subset='Col1')['Col1']
small_dataframes = []

for i in col1_uniques:
    small_df = asd.loc[asd.Col1 == i]
    if small_df.Col2.max() == 1:
        small_df['Col3'] = 1

    small_dataframes.append(small_df)

I then reassemble the DataFrame from the pieces.
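The reassembly step isn't shown in the question; presumably it is something like pd.concat followed by a sort on the index. A sketch of the full loop, with a .copy() added to avoid a SettingWithCopyWarning:

```python
import pandas as pd

asd = pd.DataFrame({'Col1': ['a', 'b', 'b', 'a', 'a'],
                    'Col2': [0, 0, 0, 1, 1]})

asd['Col3'] = 0
small_dataframes = []
for i in asd['Col1'].unique():
    # .copy() so the assignment below doesn't warn about a view
    small_df = asd.loc[asd.Col1 == i].copy()
    if small_df.Col2.max() == 1:
        small_df['Col3'] = 1
    small_dataframes.append(small_df)

# reassemble and restore the original row order
result = pd.concat(small_dataframes).sort_index()
```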

However, that takes too much time (I have about 80000 unique values in Col1). In fact, while I was writing this, it hasn't finished even a quarter of that job.

Is there a better way to do it?

Another method that avoids groupby and is faster, using np.where and isin:

import numpy as np

v = asd.loc[asd['Col2'].eq(1), 'Col1'].unique()
asd['Col3'] = np.where(asd['Col1'].isin(v), 1, 0)

print(asd)
  Col1  Col2  Col3
0    a     0     1
1    b     0     0
2    b     0     0
3    a     1     1
4    a     1     1
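As an aside, isin already returns a boolean Series, so the np.where call can be dropped entirely by casting the mask to int; this is an equivalent variant, not part of the original answer:

```python
import pandas as pd

asd = pd.DataFrame({'Col1': ['a', 'b', 'b', 'a', 'a'],
                    'Col2': [0, 0, 0, 1, 1]})

# Col1 values that have at least one Col2 == 1
v = asd.loc[asd['Col2'].eq(1), 'Col1'].unique()

# isin gives a boolean mask; casting to int yields the same 0/1 column
asd['Col3'] = asd['Col1'].isin(v).astype(int)
```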


My understanding is that, since you need to repeat the process for every unique value in Col1, you will want groupby:

asd['Col3'] = asd.groupby('Col1').Col2.transform(lambda x: x.eq(1).any().astype(int))

  Col1  Col2  Col3
0    a     0     1
1    b     0     0
2    b     0     0
3    a     1     1
4    a     1     1

Option 2: a similar solution, but using map:

d = asd.groupby('Col1').Col2.apply(lambda x: x.eq(1).any().astype(int)).to_dict()
asd['Col3'] = asd['Col1'].map(d)
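If Col2 only ever holds 0 and 1 (as in the example), the Python lambda can be replaced by the built-in 'any' reduction, which pandas can dispatch without calling back into Python per group. A sketch of this variant, which is an assumption on my part rather than part of either option above:

```python
import pandas as pd

asd = pd.DataFrame({'Col1': ['a', 'b', 'b', 'a', 'a'],
                    'Col2': [0, 0, 0, 1, 1]})

# 'any' is True for a group iff it contains a nonzero value,
# which matches eq(1).any() when Col2 is strictly 0/1
asd['Col3'] = asd.groupby('Col1')['Col2'].transform('any').astype(int)
```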


You can do this with a groupby and an if statement. First group all items by Col1:

lists = asd.groupby("Col1").agg(lambda x: tuple(x))

This gives you:

           Col2
Col1           
a     (0, 1, 1)
b        (0, 0)

You can then iterate through the unique index values in lists, masking the original DataFrame and setting Col3 to 1 if a 1 is found in lists["Col2"].

asd["Col3"] = 0
for i in lists.index:
    if 1 in lists.loc[i, "Col2"]:
        asd.loc[asd["Col1"]==i, "Col3"] = 1

This results in:

  Col1  Col2  Col3
0    a     0     1
1    b     0     0
2    b     0     0
3    a     1     1
4    a     1     1

Comments
  • By far the fastest method and the most intuitive. Thanks a lot.
  • Thank you! Both methods worked and finished in 34 and 24 seconds, respectively.
  • @AskarAkhmedov, that's great. The second solution is faster because it does the grouping only once per unique value in Col1.
  • Abhi's answer is better, and likely much faster. I also didn't realize you could use np.where within Pandas.
  • You can use nearly all NumPy functions with pandas, since pandas is built on NumPy.
  • In general, if you're writing loops for dataframes, there's a better way :)