## pandas - find first occurrence

Suppose I have a structured dataframe as follows:

    df = pd.DataFrame({"A":['a','a','a','b','b'], "B":[1]*5})

The `A` column has previously been sorted. I wish to find the first row index of `df[df.A!='a']`. The end goal is to use this index to break the data frame into groups based on `A`.

Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since `A` has been sorted already, it would be faster if I could just **find the 1st index** where `df.A!='a'`. Therefore it is important that, whatever method you use, **the scanning stops once the first element is found**.

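`idxmax` and `argmax` return the location of the maximal value, or the first such location if the maximal value occurs more than once (`idxmax` returns the index label, `argmax` the integer position). On a boolean mask `True` is the maximum, so they locate the first `True`.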

Use `idxmax` on `df.A.ne('a')`:

    df.A.ne('a').idxmax()
    3

or the `numpy` equivalent:

    (df.A.values != 'a').argmax()
    3
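One caveat with the `argmax` approach is that it returns `0` when the mask contains no `True` at all, which is indistinguishable from a match in the first row. A minimal sketch of a guard follows; the helper name `first_position_not_equal` is hypothetical and not part of pandas or numpy.

```python
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})


def first_position_not_equal(series, value):
    """Return the position of the first element != value, or None if there is none.

    Hypothetical helper for illustration only.
    """
    mask = series.to_numpy() != value
    # argmax on an all-False mask returns 0, so check for a match explicitly
    if not mask.any():
        return None
    return int(mask.argmax())


print(first_position_not_equal(df["A"], 'a'))               # 3
print(first_position_not_equal(pd.Series(['a', 'a']), 'a'))  # None
```

Note that this still builds the full boolean mask, so it does not stop scanning at the first match.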

However, if `A` has already been sorted, then we can use `searchsorted`:

    df.A.searchsorted('a', side='right')
    array([3])

(Newer pandas versions return the scalar `3` for a scalar input.)

Or the `numpy` equivalent:

    df.A.values.searchsorted('a', side='right')
    3
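Since the stated end goal is to break the frame into groups, the position returned by `searchsorted` can be fed straight into `iloc`. The following is only a sketch on the toy data; converting with `to_numpy(dtype=object)` is an assumption made here to get a plain NumPy array regardless of the pandas string dtype in use.

```python
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

# Plain object array of the (already sorted) column
values = df["A"].to_numpy(dtype=object)

# Position of the first value greater than 'a' in the sorted column
split_at = int(values.searchsorted('a', side='right'))

first_group = df.iloc[:split_at]   # rows where A == 'a'
rest = df.iloc[split_at:]          # everything after the 'a' block

print(split_at)            # 3
print(rest["A"].tolist())  # ['b', 'b']
```

This relies entirely on `A` being sorted; on unsorted data the split point would be meaningless.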

**pandas.DataFrame.idxmax:** Return index of first occurrence of maximum over requested axis. NA/null values are excluded.

I found that pandas DataFrames have a `first_valid_index` method that will do the job. It can be used as follows:

    df[df.A!='a'].first_valid_index()
    3

However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:

    df.loc[df.A!='a','A'].index[0]

Below I compare the total time (in seconds) of repeating the calculation 100 times for these two options and all the code above:

                               total_time_sec  ratio wrt fastest algo
    searchsorted numpy:                0.0007                    1.00
    argmax numpy:                      0.0009                    1.29
    for loop:                          0.0045                    6.43
    searchsorted pandas:               0.0075                   10.71
    idxmax pandas:                     0.0267                   38.14
    index[0]:                          0.0295                   42.14
    first_valid_index pandas:          0.1181                  168.71

Notice that numpy's searchsorted is the winner and first_valid_index shows the worst performance. Generally, the numpy algorithms are faster. The for loop does not do badly here, but only because the dataframe has very few entries.

For a dataframe with 10,000 entries where the desired entries are close to the end, the ranking changes: searchsorted still delivers the best performance, while the for loop becomes by far the slowest:

                               total_time_sec  ratio wrt fastest algo
    searchsorted numpy:                0.0007                    1.00
    searchsorted pandas:               0.0076                   10.86
    argmax numpy:                      0.0117                   16.71
    index[0]:                          0.0815                  116.43
    idxmax pandas:                     0.0904                  129.14
    first_valid_index pandas:          0.1691                  241.57
    for loop:                          9.6504                13786.29
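The near-constant time of `searchsorted` fits with it using binary search, which needs on the order of log2(n) comparisons instead of scanning every element. As an illustration only, the sketch below uses Python's `bisect` module as a stand-in and counts comparisons with a hypothetical wrapper class `Counted`; it does not measure numpy itself.

```python
import bisect


class Counted:
    """Hypothetical wrapper that counts how many '<' comparisons are made."""
    calls = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.calls += 1
        return self.value < other.value


n = 10_000
# Sorted data: 9,995 'a' entries followed by 5 'b' entries
data = [Counted('a')] * (n - 5) + [Counted('b')] * 5

# Binary search for the first position after the block of 'a'
pos = bisect.bisect_right(data, Counted('a'))

# A linear scan has to touch every 'a' before reaching the first 'b'
linear_steps = next(i for i, item in enumerate(data) if item.value != 'a') + 1

print(pos)            # 9995
print(Counted.calls)  # at most 14 comparisons for n = 10_000
print(linear_steps)   # 9996
```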

The code to produce these results is below:

    import timeit

    import numpy as np
    import pandas as pd

    # code snippet to be executed only once
    mysetup = '''import pandas as pd
    import numpy as np
    df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
    '''

    # code snippets whose execution time is to be measured
    mycode_set = ['''
    df[df.A!='a'].first_valid_index()
    ''']
    message = ["first_valid_index pandas:"]

    mycode_set.append('''df.loc[df.A!='a','A'].index[0]''')
    message.append("index[0]: ")

    mycode_set.append('''df.A.ne('a').idxmax()''')
    message.append("idxmax pandas: ")

    mycode_set.append('''(df.A.values != 'a').argmax()''')
    message.append("argmax numpy: ")

    mycode_set.append('''df.A.searchsorted('a', side='right')''')
    message.append("searchsorted pandas: ")

    mycode_set.append('''df.A.values.searchsorted('a', side='right')''')
    message.append("searchsorted numpy: ")

    mycode_set.append('''for index in range(len(df['A'])):
        if df['A'][index] != 'a':
            ans = index
            break
    ''')
    message.append("for loop: ")

    # time each snippet over 100 repetitions
    total_time_in_sec = []
    for i in range(len(mycode_set)):
        mycode = mycode_set[i]
        total_time_in_sec.append(
            np.round(timeit.timeit(setup=mysetup, stmt=mycode, number=100), 4))

    output = pd.DataFrame(total_time_in_sec, index=message,
                          columns=['total_time_sec'])
    output["ratio wrt fastest algo"] = \
        np.round(output.total_time_sec / output["total_time_sec"].min(), 2)
    output = output.sort_values(by="total_time_sec")
    # display() is provided by IPython/Jupyter; use print(output) in a plain script
    display(output)

For the larger dataframe:

    mysetup = '''import pandas as pd
    import numpy as np
    n = 10000
    lt = ['a' for _ in range(n)]
    b = ['b' for _ in range(5)]
    lt[-5:] = b
    df = pd.DataFrame({"A":lt,"B":[1]*n})
    '''

You can also achieve this with `.groupby().head(1)`:

    >>> df.loc[df.Value > 3].groupby('Trace').head(1)
       Date  Trace  Value
    2     3      1    3.1
    5     2      2    3.6

This finds the first occurrence (given whatever order your DataFrame is currently in) of the row with `Value > 3` for each `Trace`.

If you just want to find the first instance without going through the entire dataframe, you can go the for-loop way.

    df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
    for index in range(len(df['A'])):
        if df['A'][index] != 'a':
            print(index)
            break

The printed index is the row number of the first position where `df.A!='a'`.
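The same early-exit idea can be written more compactly with `next` over a generator, which also gives a clean way to handle the case where nothing matches. This is a sketch rather than a benchmarked alternative; the name `first_pos` is only illustrative, and converting the column with `to_numpy()` may itself touch every element depending on the dtype.

```python
import pandas as pd

df = pd.DataFrame({"A": ['a', 'a', 'a', 'b', 'b'], "B": [1] * 5})

values = df["A"].to_numpy()

# The generator is consumed lazily, so the comparison loop stops at the first match;
# the second argument to next() is returned when no element differs from 'a'.
first_pos = next((i for i, v in enumerate(values) if v != 'a'), None)

print(first_pos)  # 3
```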

**Python:** `DataFrame.idxmax()` returns the index of the first occurrence of the maximum over the requested axis; NA/null values are excluded. `Series.str.find()` searches for a substring in each string of a Series. If the substring is found, it returns the lowest index of its occurrence; if not, it returns -1. Start and end points can also be passed to restrict the search to a specific part of each string.

For multiple conditions:

Let's say we have:

    s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

And we want to find the first item different from **a** and **c**, we do:

    n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
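When the list of excluded values grows, chaining one comparison per value becomes unwieldy. A possible alternative, sketched here under the assumption that a membership test is acceptable, is `Series.isin`; the variable `excluded` is illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
excluded = ['a', 'c']

# True where the element is NOT one of the excluded values
mask = ~s.isin(excluded).to_numpy()

# Guard against argmax returning 0 when nothing matches
n = int(mask.argmax()) if mask.any() else None

print(n)  # 4
```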

**Times:**

    import numpy as np
    import pandas as pd
    from datetime import datetime

    ITERS = 1000

    def pandas_multi_condition(s):
        ts = datetime.now()
        for i in range(ITERS):
            n = s[(s != 'a') & (s != 'c')].index[0]
        print(n)
        print(datetime.now() - ts)

    def numpy_bitwise_and(s):
        ts = datetime.now()
        for i in range(ITERS):
            n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
        print(n)
        print(datetime.now() - ts)

    s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])

    print('pandas_multi_condition():')
    pandas_multi_condition(s)
    print()
    print('numpy_bitwise_and():')
    numpy_bitwise_and(s)

**Output:**

    pandas_multi_condition():
    4
    0:00:01.144767

    numpy_bitwise_and():
    4
    0:00:00.019013

Get the first row where A > 4 AND B > 3:

    >>> df[(df.A > 4) & (df.B > 3)].iloc[0]
    A    5
    B    4
    C    5
    Name: 4, dtype: int64

Get the first row where A > 3 AND (B > 3 OR C > 2) (returns row 2):

    >>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
    A    4
    B    6
    C    3
    Name: 2, dtype: int64

You can iterate by dataframe rows (it is slow) and create your own logic to get values that you wanted:

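    def getMaxIndex(df, col):
        # track the largest value seen so far and the index where it first occurred
        max_value = float('-inf')
        rtn_index = 0
        for index, row in df.iterrows():
            if row[col] > max_value:
                max_value = row[col]
                rtn_index = index
        return rtn_index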
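For this particular logic (the index label of the first maximum in a column), the built-in `Series.idxmax` should give the same answer without a Python-level loop. A small sketch on made-up data, where the frame and its labels are purely illustrative:

```python
import pandas as pd

# Illustrative data: the maximum 7 occurs twice, first at label 'x'
df = pd.DataFrame({"A": [3, 7, 7, 1]}, index=["w", "x", "y", "z"])

label = df["A"].idxmax()

print(label)  # x
```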

**pandas.Index.duplicated:** Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated.

I'd suggest using `.nth(0)` rather than `.first()` if you need to get the first row. The difference between them is how they handle NaNs: `.nth(0)` will return the first row of each group no matter what the values in that row are, while `.first()` will return the first non-NaN value in each column.
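The difference between `.nth(0)` and `.first()` is easiest to see on a group whose first row holds a NaN. The data below is made up for illustration, and the index of the `.nth(0)` result may differ between pandas versions, so the sketch only looks at values by position.

```python
import numpy as np
import pandas as pd

# Group 'x' starts with a NaN; group 'y' has a single valid row
df = pd.DataFrame({"g": ['x', 'x', 'y'], "v": [np.nan, 1.0, 2.0]})

# first(): first non-NaN value per group
first_non_null = df.groupby("g")["v"].first()

# nth(0): value from the first row of each group, NaN included
first_row = df.groupby("g")["v"].nth(0)

print(first_non_null.loc['x'])      # 1.0
print(np.isnan(first_row.iloc[0]))  # True
```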


