pandas - find first occurrence

Suppose I have a structured dataframe as follows:

df = pd.DataFrame({"A":['a','a','a','b','b'],
                   "B":[1]*5})

The A column is already sorted. I want to find the index of the first row where df.A != 'a'. The end goal is to use this index to split the dataframe into groups based on A.

I realise that groupby exists. However, the dataframe is quite large, and this is a simplified toy example. Since A is already sorted, it would be faster to just find the first index where df.A != 'a'. It is therefore important that, whatever method you use, the scan stops as soon as the first matching element is found.


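idxmax returns the index label, and argmax the integer position, of the maximal value; if the maximal value occurs more than once, the first occurrence is returned. In a boolean column True is the maximum, so both locate the first True.

Use idxmax on df.A.ne('a'):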

df.A.ne('a').idxmax()

3

or the numpy equivalent

(df.A.values != 'a').argmax()

3
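
One caveat with the argmax approach: when no element matches, the mask is all False and argmax still returns 0, which looks like a valid answer. The helper below is a minimal sketch of a guard for that case; the name first_position_where is made up for illustration and is not part of pandas or numpy. Note also that idxmax returns an index label while argmax returns a position, so the two only agree when the dataframe has the default 0..n-1 index.

```python
import pandas as pd

def first_position_where(mask):
    """Return the 0-based position of the first True in a boolean Series, or -1 if there is none."""
    values = mask.to_numpy()
    if values.size == 0:
        return -1
    pos = int(values.argmax())
    # argmax returns 0 for an all-False mask, so check the element itself
    return pos if values[pos] else -1

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})
print(first_position_where(df.A.ne('a')))                  # 3
print(first_position_where(df.A.ne('a') & df.A.ne('b')))   # -1
```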

However, if A has already been sorted, then we can use searchsorted, which does a binary search instead of scanning the whole column:

df.A.searchsorted('a', side='right')

array([3])    (older pandas; recent versions return the scalar 3 for a scalar argument)

Or the numpy equivalent

df.A.values.searchsorted('a', side='right')

3
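
Since the stated goal is to break the dataframe into groups, here is a rough sketch of how the boundary could be used for that. It assumes A is sorted and the frame is sliced by position with iloc; the variable names (split, starts, bounds, pieces) are illustrative only. The second half finds every group boundary in one linear pass by comparing each value with its neighbour, which is an alternative to calling searchsorted once per value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})
values = df["A"].to_numpy()

# Binary search: position of the first element greater than 'a'
split = int(values.searchsorted('a', side='right'))
group_a, rest = df.iloc[:split], df.iloc[split:]

# All boundaries at once: positions where the value differs from the previous one
starts = np.flatnonzero(values[1:] != values[:-1]) + 1
bounds = np.concatenate(([0], starts, [len(df)]))
pieces = [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

print(split)                      # 3
print([len(p) for p in pieces])   # [3, 2]
```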


I found that pandas DataFrames have a first_valid_index function that will do the job. It can be used as follows:

df[df.A!='a'].first_valid_index()

3

However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:

df.loc[df.A!='a','A'].index[0]

Below I compare the total time (in seconds) of repeating the calculation 100 times for these two options and all the code above:

                      total_time_sec    ratio wrt fastest algo
searchsorted numpy:        0.0007        1.00
argmax numpy:              0.0009        1.29
for loop:                  0.0045        6.43
searchsorted pandas:       0.0075       10.71
idxmax pandas:             0.0267       38.14
index[0]:                  0.0295       42.14
first_valid_index pandas:  0.1181      168.71

Notice that numpy's searchsorted is the winner and first_valid_index shows the worst performance. Generally, the numpy versions are faster. The for loop does not do too badly here, but only because the dataframe has very few entries.

For a dataframe with 10,000 entries where the desired entries are near the end, the ranking changes: searchsorted still delivers the best performance, while the for loop becomes by far the slowest:

                     total_time_sec ratio wrt fastest algo
searchsorted numpy:        0.0007       1.00
searchsorted pandas:       0.0076      10.86
argmax numpy:              0.0117      16.71
index[0]:                  0.0815     116.43
idxmax pandas:             0.0904     129.14
first_valid_index pandas:  0.1691     241.57
for loop:                  9.6504   13786.29

The code to produce these results is below:

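import timeit

# needed outside the timed snippets for np.round and pd.DataFrame below
import numpy as np
import pandas as pd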

# code snippet to be executed only once 
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''

# code snippets whose execution time is to be measured   
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]

mycode_set.append( '''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")

mycode_set.append( '''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")

mycode_set.append(  '''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")

mycode_set.append( '''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")

mycode_set.append( '''df.A.values.searchsorted('a', side='right')''' )
message.append("searchsorted numpy: ")

mycode_set.append( '''for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        ans = index
        break
        ''')
message.append("for loop: ")

total_time_in_sec = []
for i in range(len(mycode_set)):
    mycode = mycode_set[i]
    total_time_in_sec.append(np.round(timeit.timeit(setup = mysetup,\
         stmt = mycode, number = 100),4))

output = pd.DataFrame(total_time_in_sec, index = message, \
                      columns = ['total_time_sec' ])
output["ratio wrt fastest algo"] = \
np.round(output.total_time_sec/output["total_time_sec"].min(),2)

output = output.sort_values(by = "total_time_sec")
print(output)  # in a Jupyter notebook, display(output) also works

For the larger dataframe:

mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''


If you just want to find the first instance without going through the entire dataframe, you can go the for-loop way.

df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
for index in range(len(df['A'])):
    if df['A'][index] != 'a':
        print(index)
        break

The printed index is the row number of the first row where df.A != 'a'.
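
The element-by-element loop stops early, but every df['A'][index] lookup goes through pandas, which is why it becomes very slow when the match is far from the start. One possible middle ground, sketched below, is to scan the underlying numpy array in blocks and stop at the first block that contains a match. The function name first_position_not_equal and the default chunk_size of 4096 are assumptions made for this example, not tuned values.

```python
import numpy as np
import pandas as pd

def first_position_not_equal(values, target, chunk_size=4096):
    """Scan `values` block by block and stop at the first block with an element != target.

    Returns the 0-based position, or -1 if every element equals target.
    """
    for start in range(0, len(values), chunk_size):
        hits = np.flatnonzero(values[start:start + chunk_size] != target)
        if hits.size:
            return start + int(hits[0])
    return -1

n = 10000
df = pd.DataFrame({"A": ['a'] * (n - 5) + ['b'] * 5, "B": [1] * n})
print(first_position_not_equal(df["A"].to_numpy(), 'a'))  # 9995
```

Unlike searchsorted, this does not require A to be sorted; it only avoids comparing elements that come after the block containing the first match.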


For multiple conditions:

Let's say we have:

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

To find the first item that is neither 'a' nor 'c', we do:

n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
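
When the list of excluded values grows, chaining != comparisons gets unwieldy. A possible variant, shown here as a sketch, builds the mask with Series.isin and negates it; the excluded list and the -1 fallback for "no match" are choices made for this example.

```python
import pandas as pd

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
excluded = ['a', 'c']

# True where the item is not one of the excluded values
mask = ~s.isin(excluded).to_numpy()

# argmax alone would return 0 for an all-False mask, so guard with any()
n = int(mask.argmax()) if mask.any() else -1
print(n)  # 4
```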

Times:

import numpy as np
import pandas as pd
from datetime import datetime

ITERS = 1000

def pandas_multi_condition(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = s[(s != 'a') & (s != 'c')].index[0]
    print(n)
    print(datetime.now() - ts)

def numpy_bitwise_and(s):
    ts = datetime.now()
    for i in range(ITERS):
        n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
    print(n)
    print(datetime.now() - ts)

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)

Output:

pandas_multi_condition():
4
0:00:01.144767

numpy_bitwise_and():
4
0:00:00.019013
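
Timing with datetime.now() includes whatever else the machine happens to be doing during that one run. As a hedged alternative, the same comparison can be sketched with timeit.repeat, taking the best of a few runs; the function names and the repeat counts below are arbitrary choices for illustration, and the absolute numbers will differ between machines.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

def pandas_version():
    return s[(s != 'a') & (s != 'c')].index[0]

def numpy_version():
    values = s.to_numpy()
    return np.logical_and(values != 'a', values != 'c').argmax()

for fn in (pandas_version, numpy_version):
    # best of 3 runs of 200 calls each
    best = min(timeit.repeat(fn, number=200, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s per 200 calls")
```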


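You can iterate over the dataframe rows (this is slow) and write your own logic to get the values you want:

def getMaxIndex(df, col):
    # Track the largest value seen so far and the index label where it first occurred
    max_value = float('-inf')
    rtn_index = 0
    for index, row in df.iterrows():
        # Strict > keeps the first occurrence when the maximum repeats
        if row[col] > max_value:
            max_value = row[col]
            rtn_index = index
    return rtn_index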
