Pandas dataframe self-dependency in data to fill a column

I have a dataframe with the following data:

The value of "relation" is determined from the codeid. Leather has codeid=11, which has already appeared against bag, so in relation we put the value bag. The same happens for shoes.

To do: fill the value of "relation" by putting a check on codeid across the dataframe. Any help would be appreciated.

Edit: the same codeid, e.g. 11, can appear more than twice, but "relation" can only take the value bag, because bag is the first row to have codeid=11. I have updated the picture as well.

If you want to propagate only the first value of each duplicate group to the later duplicates, use transform with first, and then set the remaining values to NaN with loc and duplicated:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5],
                   'name':list('brslp'),
                   'codeid':[11,12,13,11,13]})

df['relation'] = df.groupby('codeid')['name'].transform('first')
print (df)
   id name  codeid relation
0   1    b      11        b
1   2    r      12        r
2   3    s      13        s
3   4    l      11        b
4   5    p      13        s

#mark every occurrence of codeid except the last as duplicated
print (df['codeid'].duplicated(keep='last'))
0     True
1    False
2     True
3    False
4    False
Name: codeid, dtype: bool

#invert the mask of all duplicated codeid values with ~ to get the unique rows
print (~df['codeid'].duplicated(keep=False))
0    False
1     True
2    False
3    False
4    False
Name: codeid, dtype: bool

#chain the boolean masks together with |
print (df['codeid'].duplicated(keep='last') | ~df['codeid'].duplicated(keep=False))
0     True
1     True
2     True
3    False
4    False
Name: codeid, dtype: bool

#set relation to NaN where the mask is True
df.loc[df['codeid'].duplicated(keep='last') | 
       ~df['codeid'].duplicated(keep=False), 'relation'] = np.nan
print (df)
   id name  codeid relation
0   1    b      11      NaN
1   2    r      12      NaN
2   3    s      13      NaN
3   4    l      11        b
4   5    p      13        s
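The two masks above can also be collapsed into one step: the rows that should stay empty are exactly the first occurrences of each codeid, which a single duplicated call (with the default keep='first') already identifies. A minimal sketch of that shortcut, using the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'name': list('brslp'),
                   'codeid': [11, 12, 13, 11, 13]})

# fill relation with the first name seen for each codeid
df['relation'] = df.groupby('codeid')['name'].transform('first')

# first occurrences are not duplicates of anything earlier,
# so ~duplicated() (default keep='first') marks exactly those rows
df.loc[~df['codeid'].duplicated(), 'relation'] = np.nan
print(df)
```

This yields the same result as chaining the keep='last' and keep=False masks: rows 0-2 get NaN, while the later duplicates keep the first name per codeid.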

I think you want to do something like this:

import pandas as pd
df = pd.DataFrame([['bag', 11, 'null'], 
                  ['shoes', 12, 'null'], 
                  ['shopper', 13, 'null'], 
                  ['leather', 11, 'bag'], 
                  ['plastic', 13, 'shoes']], columns = ['name', 'codeid', 'relation'])

def codeid_analysis(rows):
    if rows['codeid'] == 11:
        rows['relation'] = 'bag'
    elif rows['codeid'] == 12:
        rows['relation'] = 'shirt' #for example. You should put what you want here
    elif rows['codeid'] == 13:
        rows['relation'] = 'pants' #for example. You should put what you want here
    return rows

result = df.apply(codeid_analysis, axis = 1)
print(result)
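Since the real data is too large for a manual if/elif chain, one possible variant derives the codeid-to-name mapping from the data itself and applies it with map. This is my own sketch, not part of the original answer, and the `mapping` name is mine:

```python
import pandas as pd

df = pd.DataFrame([['bag', 11, None],
                   ['shoes', 12, None],
                   ['shopper', 13, None],
                   ['leather', 11, None],
                   ['plastic', 13, None]], columns=['name', 'codeid', 'relation'])

# build codeid -> first name from the data: keep the first row per codeid
mapping = df.drop_duplicates('codeid').set_index('codeid')['name'].to_dict()
# mapping == {11: 'bag', 12: 'shoes', 13: 'shopper'}

# look up every row's codeid in the mapping
df['relation'] = df['codeid'].map(mapping)
print(df)
```

This scales to any number of codes without editing the function, at the cost of filling the first occurrences too (which can then be blanked out as in the accepted answer).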

It is not an optimal solution since it is costly in memory, but here is my try. df1 is created to hold the rows with null values in the relation column, since the nulls seem to be the first occurrences. After some cleaning, the two dataframes are merged into one.

import pandas as pd
df = pd.DataFrame([['bag', 11, 'null'], 
                  ['shoes', 12, 'null'], 
                  ['shopper', 13, 'null'], 
                  ['leather', 11, 'bag'], 
                  ['plastic', 13, 'shopper'],
                  ['something',13,""]], columns = ['name', 'codeid', 'relation'])

df1 = df.loc[df['relation'] == 'null'].copy()  # create a df with only null values in relation
df1.drop_duplicates(subset=['codeid'], inplace=True)  # drop duplicates, retaining the first entry per codeid
df1 = df1.drop('relation', axis=1)  # drop the unneeded column

final_df = pd.merge(df, df1, on='codeid')  # merge the two dfs on codeid

Comments
  • Will the codes appear only twice? And should one take the name from the first appearance of the code only?
  • Could you kindly explain the code? It seems correct, but it's not working at my end.
  • @frozenshine - Can you explain why it is not working? Is the problem in the sample data or in the real data?
  • Testing your logic on real data: the last statement is making all values NaN, not just the first ones.
  • @frozenshine - hmmm, so the real data differ from the sample data. Is it possible to add more rows and create a minimal, complete, and verifiable example?
  • No, the data follows exactly the same pattern that I showed. I only need to figure out why the np.nan line is making all rows "nan".
  • Thanks, but unfortunately the question showed only sample data, and the real data is quite big. Can't use a manual if and else. :(