Updating Pandas row without iterrows

I have a local dataframe that gets appended with new entries daily. Once in a while, an old entry is updated. The giveaway is that a set of columns will match, but the timestamp is more recent.

With the goal of removing the old entry, and keeping the new (updated) entry, I append the new entry and then "clean" the dataframe by looping through the rows and finding the old entry:

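del_rows = []
df2 = df.copy()
for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        # row2 is an older entry with the same crit1: mark it for removal
        if row["crit1"] == row2["crit1"] and row["date"] > row2["date"]:
            del_rows.append(index2)

# del_rows holds index labels (possibly repeated), so drop by label
df = df.drop(index=list(set(del_rows)))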

While functional, I'd love to know the more "pandas" way of going about this process. I know that apply and NumPy vectorization are faster; however, I can't think of a function that would achieve this that I could map apply to, or a way to use the vectorization given different data types.

IIUC, you can use duplicated() to create a boolean filter, so for a sample dataframe:

    crit1        date
0   test1  01-01-2018
1   test2  01-02-2018
2   test3  01-03-2018
3   test4  01-04-2018
4   test5  01-05-2018
5   test6  01-06-2018
6   test3  01-07-2018
7   test7  01-08-2018
8   test8  01-09-2018
9   test2  01-10-2018
10  test9  01-11-2018

Simply do:

df[~df.duplicated(subset=['crit1'], keep='last')].reset_index(drop=True)

Yields:

   crit1        date
0  test1  01-01-2018
1  test4  01-04-2018
2  test5  01-05-2018
3  test6  01-06-2018
4  test3  01-07-2018
5  test7  01-08-2018
6  test8  01-09-2018
7  test2  01-10-2018
8  test9  01-11-2018
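
One caveat: keep='last' keeps the last row in frame order, not necessarily the most recent by date. A minimal sketch of a more cautious variant, assuming the dates are month-day-year strings (the sample data and the name `latest` are only illustrative); drop_duplicates is used here as shorthand for the ~duplicated filter:

```python
import pandas as pd

df = pd.DataFrame({
    "crit1": ["test1", "test2", "test3", "test2", "test3"],
    "date": ["01-01-2018", "01-10-2018", "01-03-2018", "01-02-2018", "01-07-2018"],
})

# Parse the strings so that sorting is chronological (the format is an assumption).
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")

# Sort by date, then keep the last (most recent) row for each crit1 value.
latest = (
    df.sort_values("date")
      .drop_duplicates(subset=["crit1"], keep="last")
      .reset_index(drop=True)
)
print(latest)
```

In this sample the newer test2 row sits above the older one, so filtering without the sort would keep the wrong row. If more columns identify an entry, they can be added to the subset list.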

This can be done using a groupby on the crit1 and selecting the latest row, as such:

df.sort_values('date').groupby('crit1').tail(1)
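
As a sketch of this idea on a small made-up frame (the data is illustrative, not from the question), the sort-then-tail version can be compared with an alternative that looks up the label of the latest date per group with idxmax:

```python
import pandas as pd

df = pd.DataFrame({
    "crit1": ["a", "b", "a", "c", "b"],
    "date": pd.to_datetime(
        ["2018-01-05", "2018-01-02", "2018-01-01", "2018-01-03", "2018-01-04"]
    ),
})

# Approach from the answer: sort by date, then take the last row of each group.
by_tail = df.sort_values("date").groupby("crit1").tail(1)

# Alternative: look up the index label of the latest date within each group.
by_idxmax = df.loc[df.groupby("crit1")["date"].idxmax()]

print(by_tail.sort_index())
print(by_idxmax.sort_index())
```

Both select the same rows here; the idxmax form avoids sorting the whole frame, but it assumes the index labels are unique.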

The new entry may have a date older than the one that already exists, in which case simply dropping by first or last might not be correct.

Another alternative is to drop the duplicate by finding the minimum entry.

Below is a worked-out example.

import pandas as pd

date = pd.date_range(start='1/1/2018', end='1/5/2018')
crit = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame({'crit': crit, 'date': date})

# insert a new entry into df (its date is older than the existing 'b' row)
df.loc[len(df)] = ['b', '1/6/2016']

# convert date to datetime
df['date'] = pd.to_datetime(df['date'])
print(df, '\n')

# find the earliest date among the rows whose crit is duplicated
min_dup_date = df[df.duplicated('crit', keep=False)]['date'].min()
print(min_dup_date, '\n')
print(df['date'] != min_dup_date)

# keep every row except the one carrying that earliest date
# (a single global minimum is used, so this handles one duplicated crit value)
df[df['date'] != min_dup_date]
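
When several crit values are duplicated at once, a single global minimum is not enough. A possible generalization, offered as a sketch rather than as this answer's method, compares each row's date with the latest date of its own group via groupby(...).transform('max'); the sample data and the names `group_max` and `cleaned` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "crit": ["a", "b", "c", "b", "a"],
    "date": pd.to_datetime(
        ["2018-01-01", "2018-01-02", "2018-01-03", "2016-01-06", "2018-02-01"]
    ),
})

# For every row, the most recent date seen for that row's crit value.
group_max = df.groupby("crit")["date"].transform("max")

# Keep only the rows that carry their group's most recent date.
cleaned = df[df["date"] == group_max]
print(cleaned)
```

Note that rows tied on the latest date within a group would all be kept.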

If you need to assign values inside a loop, use df.at, which accesses a single value for a row/column label pair. It is an indexer, so it takes square brackets: df.at[i, 'col'] = value. Note that iterrows returns a pandas Series for each row, with the downside of not preserving dtypes across rows.

Because iterrows returns a Series for each row, it does not preserve dtypes across rows. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect. To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and is generally faster than iterrows. You should never modify something you are iterating over; this is not guaranteed to work in all cases.
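
A small sketch of these points, with made-up column names and a placeholder condition: it shows the upcast under iterrows, reads rows with itertuples, and assigns the results only after the loop has finished.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 5, 10], "ratio": [0.5, 1.5, 2.5]})

# iterrows builds one Series per row, so the int column is upcast to float.
_, first_series = next(df.iterrows())
print(first_series["value"])

# itertuples keeps each column's own type.
first_tuple = next(df.itertuples(index=False))
print(first_tuple.value)

# Collect results while iterating, then assign once after the loop,
# so the frame is not modified while it is being iterated over.
labels = []
for row in df.itertuples(index=False):
    labels.append("high" if row.value > 4 else "low")
df["label"] = labels
print(df)
```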

pandas.DataFrame.update: DataFrame.update(other, join='left', overwrite=True, filter_func=None, errors='ignore') → None. Modify in place using non-NA values from another DataFrame. Aligns on indices. There is no return value. Parameters: other: DataFrame, or object coercible into a DataFrame.
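
A minimal sketch of what that description means in practice, using made-up frames: only the non-NA cells of `other` overwrite the matching cells of `df`, and the call itself returns None.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})

# Only index label 1 is present in `other`, and its "a" value is missing (NaN).
other = pd.DataFrame({"a": [float("nan")], "b": [99.0]}, index=[1])

# update() changes df in place and returns None.
result = df.update(other)

print(result)
print(df)
```

Here only the "b" value at label 1 changes; the NaN in `other` leaves column "a" untouched.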

Comments
  • Please try to include a simple example dataset that shows what your data looks like.
  • This is perfect: elegant and simple. Thanks so much, I didn't know duplicated() existed!
  • I get that certain items could be removed with loc, but how would the script know the old vs. new items without checking each item against every other item? Or are you suggesting conditioning the df before appending the new item?
  • I think this will work, but the actual dataset has a good number of additional criteria, and adding a couple of criteria in the subset portion of df[~df.duplicated(subset=['crit1'], keep='last')] seems like an easier way to go than repeated levels of groupby.
  • @user129818 Makes sense. Just note that keep='last' keeps the last row encountered, which is not necessarily the latest row in terms of date/time.