Copy DataFrame with NaN values in Column


I have a DataFrame that looks like the example below.

import pandas as pd
import numpy as np

# define DataFrame for reproducibility
df = pd.DataFrame({'date': ['2019-05-06', '2019-05-07', '2019-05-07', '2019-05-09', '2019-05-10', '2019-05-11'],
                   'Identifier': [1, 1, 1, 1, 1, 1],
                   'B': [2.4, 3.9, 3.9, 4.3, 2.5, 3.14],
                   'C': [0.214, 0.985, 0.985, 0.839, 0.555, 0.159],
                   'Name': [np.nan, "CD", "AD", np.nan, np.nan, np.nan]})

print(df)

    date        Identifier  B       C       Name
0   2019-05-06  1           2.40    0.214   NaN
1   2019-05-07  1           3.90    0.985   CD
2   2019-05-07  1           3.90    0.985   AD
3   2019-05-09  1           4.30    0.839   NaN
4   2019-05-10  1           2.50    0.555   NaN
5   2019-05-11  1           3.14    0.159   NaN

As can be seen, a given identifier can have more than one name; however, each name appears in the DataFrame only once, at a single date. What I need is essentially to forward and backward fill the names across every date. I currently have a solution that works, but it is extremely slow on the full DataFrame I am working with. The code is shown below.

final_df = pd.DataFrame()

for i in df.Identifier.unique():
    # select the current identifier
    identifier_df = df.loc[df.Identifier == i]
    # allow a given identifier to have different names
    for n in df.Name.unique():
        if pd.isna(n):
            continue
        else:
            intermediate = identifier_df.copy()
            intermediate.loc[:,"Name"] = np.repeat(n, len(intermediate))
            final_df = final_df.append(intermediate)

final_df = final_df.drop_duplicates()

Note that the loop over identifiers is required for my full DataFrame; in this small instance, however, it is admittedly pointless. Nevertheless, this code produces the following DataFrame, which is exactly how I would like the output to be:

print(final_df)

    date        Identifier  B       C       Name
0   2019-05-06  1           2.40    0.214   CD
1   2019-05-07  1           3.90    0.985   CD
3   2019-05-09  1           4.30    0.839   CD
4   2019-05-10  1           2.50    0.555   CD
5   2019-05-11  1           3.14    0.159   CD
0   2019-05-06  1           2.40    0.214   AD
1   2019-05-07  1           3.90    0.985   AD
3   2019-05-09  1           4.30    0.839   AD
4   2019-05-10  1           2.50    0.555   AD
5   2019-05-11  1           3.14    0.159   AD

Is there any way to perform this operation with a groupby, or is there any other way to make it faster?

Thanks!

From what I understand, if the dates are sorted and each date has the same number of rows (note: this answer works off a sample with lowercase column names and the names 'AB'/'CD', which differs slightly from the data shown above):

from itertools import islice, cycle

m = df.name.isna()                                    # mask of rows where name is NaN
l = df.loc[~m, 'name'].tolist()                       # list of the non-null names
df.loc[m, 'name'] = list(islice(cycle(l), len(df[m])))  # repeat the names over all dates and assign to the NaN slots
print(df)

         date  identifier    B      C name
0  2019-05-07           1  2.4  0.214   AB
1  2019-05-07           1  2.4  0.214   CD
2  2019-05-08           1  3.9  0.985   AB
3  2019-05-08           1  3.9  0.985   CD
4  2019-05-09           1  2.5  0.555   AB
5  2019-05-09           1  2.5  0.555   CD
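
For clarity, cycle repeats the list of known names indefinitely, and islice cuts that infinite stream down to exactly the number of NaN slots. A minimal sketch of the idea, with a hypothetical names list standing in for l above:

from itertools import islice, cycle

names = ['AB', 'CD']                   # stand-in for the non-null names list
print(list(islice(cycle(names), 6)))   # ['AB', 'CD', 'AB', 'CD', 'AB', 'CD']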


Use itertools.product for all combinations of the 3 columns:

from  itertools import product

df1 = pd.DataFrame(list(product(df['date'].unique(), 
                                df['Identifier'].unique(),
                                df['Name'].dropna().unique())), 
                   columns=['date','Identifier','Name'])
print (df1)
         date  Identifier Name
0  2019-05-06           1   CD
1  2019-05-06           1   AD
2  2019-05-07           1   CD
3  2019-05-07           1   AD
4  2019-05-09           1   CD
5  2019-05-09           1   AD
6  2019-05-10           1   CD
7  2019-05-10           1   AD
8  2019-05-11           1   CD
9  2019-05-11           1   AD

Left join with DataFrame.merge and create a MultiIndex with DataFrame.set_index:

df2 = df1.merge(df, how='left').set_index(['date','Identifier'])
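
Note what happens at this step: merge defaults to joining on the shared columns (date, Identifier, Name), so any combination from df1 with no exact match in df comes back with NaN in B and C. A quick sanity check (the count assumes the sample data above):

# only the two 2019-05-07 rows have an exact (date, Identifier, Name) match in df,
# so 8 of the 10 combinations end up with NaN in B and C
print(df2['B'].isna().sum())  # 8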

Use DataFrame.drop_duplicates so that the missing values can then be replaced with DataFrame.combine_first:

df3 = df.drop_duplicates(['date','Identifier']).set_index(['date','Identifier'])
print (df3)
                          B      C Name
date       Identifier                  
2019-05-06 1           2.40  0.214  NaN
2019-05-07 1           3.90  0.985   CD
2019-05-09 1           4.30  0.839  NaN
2019-05-10 1           2.50  0.555  NaN
2019-05-11 1           3.14  0.159  NaN

df4 = df2.combine_first(df3).reset_index()
print (df4)
         date  Identifier     B      C Name
0  2019-05-06           1  2.40  0.214   CD
1  2019-05-06           1  2.40  0.214   AD
2  2019-05-07           1  3.90  0.985   CD
3  2019-05-07           1  3.90  0.985   AD
4  2019-05-09           1  4.30  0.839   CD
5  2019-05-09           1  4.30  0.839   AD
6  2019-05-10           1  2.50  0.555   CD
7  2019-05-10           1  2.50  0.555   AD
8  2019-05-11           1  3.14  0.159   CD
9  2019-05-11           1  3.14  0.159   AD
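
As a side note, since B and C depend only on the (date, Identifier) pair in this data, the combine_first step could be skipped by merging df1 against the frame with Name dropped. A minimal sketch under that assumption (values and df4_alt are illustrative names):

# assumes B and C are constant within each (date, Identifier) pair
values = df.drop(columns='Name').drop_duplicates()
df4_alt = df1.merge(values, on=['date', 'Identifier'], how='left')

This yields the same rows as df4 above, up to column order.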


Try this one-liner with concat, replace, slicing, and ffill (again working off the earlier lowercase sample data):

print(pd.concat([df[::2],df[::2].replace('AB','CD')]).ffill())

Output:

         date  identifier    B      C name
0  2019-05-07           1  2.4  0.214   AB
2  2019-05-08           1  3.9  0.985   AB
4  2019-05-09           1  2.5  0.555   AB
0  2019-05-07           1  2.4  0.214   CD
2  2019-05-08           1  3.9  0.985   CD
4  2019-05-09           1  2.5  0.555   CD


One way to speed up this code by a significant amount is to append the intermediate DataFrames to a list first, and then concatenate the list of DataFrames in one final step with pd.concat().

This would make the code look as follows:

final_df = []

for i in df.Identifier.unique():
    # select the current identifier
    identifier_df = df.loc[df.Identifier == i]
    # allow a given identifier to have different names
    for n in df.Name.unique():
        if pd.isna(n):
            continue
        else:
            intermediate = identifier_df.copy()
            intermediate.loc[:,"Name"] = np.repeat(n, len(intermediate))
            final_df.append(intermediate)

final_df = pd.concat(final_df).drop_duplicates()

This simple change decreased execution time by a significant margin. Hopefully it helps someone else as well.
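
The reason for the speedup: DataFrame.append returns a brand-new frame on every call, so appending inside a loop re-copies all previously accumulated rows and the total work grows roughly quadratically, while a single pd.concat over the list is roughly linear. (DataFrame.append was later deprecated and removed in pandas 2.0, so the concat form is also the future-proof one.) For reference, the double loop itself can be replaced by one vectorized merge; a hedged sketch against the sample data, pairing each identifier only with its own non-null names, which appears to be the intent for the full DataFrame:

# pair every deduplicated row with every non-null name of its identifier;
# row order differs from the loop version, but the content is the same
names = df.loc[df['Name'].notna(), ['Identifier', 'Name']].drop_duplicates()
final_df = (df.drop(columns='Name')
              .drop_duplicates()
              .merge(names, on='Identifier', how='left'))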


Comments
  • If I sort my DataFrame (sort_values) on date and identifier, then this approach unfortunately yields many more names per identifier than should actually be present. The same happens if I only sort on date.
  • @MennoVanDijk I see. I assumed all the dates have equal names.
  • I see now, that is indeed not the case. Each date has several different names, sometimes a couple of names get added at a given date, and sometimes a couple of names drop out at a given date. Thank you kindly for your attempt already.
  • Very nice use of the helper series there. :) Clever..!!
  • This solution unfortunately only works correctly for one out of all unique names for a given identifier. When using df = df.sort_values(["date", "Identifier"]); s = df.groupby(['date','Identifier']).cumcount(); df['Name'] = df.groupby(['Identifier', s])['Name'].apply(lambda x: x.ffill().bfill()), it only fills one Name at all dates correctly; all other Names only get filled at dates where they are not NaN.
  • @jezrael I will provide an example in a few hours, currently working with a deadline so have to get some stuff ready. Thank you for looking into this.
  • Can you explain how cumcount() gives 0 and 1, and how ['Name'].ffill() then takes AB = 0 and CD = 1 and fills the name column, @jezrael? I need some explanation.
  • @jezrael Still giving me a memory error. Thank you a lot for your help, I'll probably just stick to my slower solution.
  • This would work for the given example, but would not generalize well to my overall DataFrame (as there are 1600+ total names).
  • Unfortunately, even with a workaround, using pd.concat([df[::2], df[::2].replace(df.name.unique())]).ffill().bfill() gives me many more names at each identifier than should be the case.