Merging Dataframe1 rows into Dataframe2 using existing column headers - Python Pandas

pandas merge on different column names
pandas merge on multiple columns
python pandas concatenate multiple columns
rename column pandas
pandas merge columns into one
pandas merge rows with same index
pandas merge select columns
drop column pandas

I am attempting to build a python dataframe to capture an initial dataset, and then update fields as csv files containing data become available in a specified folder location. I have built my initial table and saved it to a csv, but when I reopen the table and try to bring in the next text file data I am accidentally duplicating the headers for the new data everytime I run the code. What is the best way to read in the new text files and append the data to a row based on a common column name?

My current attempt is to build the new text files into their own database, match the column headers from the initial import, and merge on the common header with the starting dataset. However, like I said the columns just keep duplicating themselves and adding _x, _y etc. afterwards.

Sample of Initial Dataframe...

Line Name,Bearing,SOL X,SOL Y,EOL X,EOL Y,FSP 0.5m,FSP 1m,Length km,Length m,LSP 0.5m,LSP 1m,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
A901,38.67269998,568453.03,4343701.73,569156.01,4344580.05,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,
A902,38.67269998,568476.45,4343682.99,569179.43,4344561.31,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,
A903,38.67269998,568499.87,4343664.24,569202.85,4344542.56,250,125,1.125,1125,2500,1250,,,,,,,,,,,,,,,,,,,,

Sample of Data to be Merged...

,Line Name,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
0,A901,9/26/2019,269,37,0037_JD269_X331 - 0001.db,09:29:54,2.73,12792,128.67,1000.0,587985.95,4380278.68,587811.22,4380427.4,09:43:51,15594,3.28,588766.68,4379653.81,588901.53,4379545.64
1,A902,9/26/2019,269,38,0038_JD269_M104 - 0001.db,11:38:24,3.69,98260,218.67,42875.0,591391.62,4383593.91,591626.99,4383892.87,17:40:02,25764,3.02,564600.29,4350120.19,568980.53,4355589.75
2,A903,9/29/2019,273,80,0080_JD273_M305 - 0001.db,00:50:53,3.64,27721,38.67,1125.0,576455.88,4351038.29,576365.15,4350932.06,01:03:26,30615,3.78,577158.86,4351916.61,577275.9,4352057.35

I've tried merging, appending and concatenating with varying combinations of rules applied, but without any luck. Some error, others duplicate the columns...

df_proj = pd.concat([df_in,df_acc], axis=1,sort=False)

or

df_proj = pd.merge(df_in, df_acc, on='Line Name', how = 'right')

I've tried to find examples of this type of problem without luck, I be going about this the wrong way. Maybe I shouldn't be treating the new text files as their own dataframe to begin with. But any help would be greatly appreciated. Thank you!

EDIT:

The expected result would look like this...

Line Name,Bearing,SOL X,SOL Y,EOL X,EOL Y,FSP 0.5m,FSP 1m,Length km,Length m,LSP 0.5m,LSP 1m,Date,Julian Day,Seq,Storage Db,Time Start,SOG Start,Fix Start,SOL Bearing,Line Length,SOL Easting,SOL Northing,Obs SOL X,Obs SOL Y,Time End,Fix End,SOG End,Planned EOL X,Planned EOL Y,Obs EOL X,Obs EOL Y
A901,38.67269998,568453.03,4343701.73,569156.01,4344580.05,250,125,1.125,1125,2500,1250,9/26/2019,269,37,0037_JD269_X331 - 0001.db,09:29:54,2.73,12792,128.67,1000.0,587985.95,4380278.68,587811.22,4380427.4,09:43:51,15594,3.28,588766.68,4379653.81,588901.53,4379545.64
A902,38.67269998,568476.45,4343682.99,569179.43,4344561.31,250,125,1.125,1125,2500,1250,9/26/2019,269,38,0038_JD269_M104 - 0001.db,11:38:24,3.69,98260,218.67,42875.0,591391.62,4383593.91,591626.99,4383892.87,17:40:02,25764,3.02,564600.29,4350120.19,568980.53,4355589.75
A903,38.67269998,568499.87,4343664.24,569202.85,4344542.56,250,125,1.125,1125,2500,1250,9/29/2019,273,80,0080_JD273_M305 - 0001.db,00:50:53,3.64,27721,38.67,1125.0,576455.88,4351038.29,576365.15,4350932.06,01:03:26,30615,3.78,577158.86,4351916.61,577275.9,4352057.35

Solution if need replace missing values by another DataFrame:

df = (df_in.set_index('Line Name')
           .combine_first(df_acc.set_index('Line Name'))
           .reset_index()
           .reindex(columns=df_in.columns))
print (df)
  Line Name  Bearing      SOL X       SOL Y      EOL X       EOL Y  FSP 0.5m  \
0      A901  38.6727  568453.03  4343701.73  569156.01  4344580.05       250   
1      A902  38.6727  568476.45  4343682.99  569179.43  4344561.31       250   
2      A903  38.6727  568499.87  4343664.24  569202.85  4344542.56       250   

   FSP 1m  Length km  Length m  ...  SOL Northing  Obs SOL X   Obs SOL Y  \
0     125      1.125      1125  ...    4380278.68  587811.22  4380427.40   
1     125      1.125      1125  ...    4383593.91  591626.99  4383892.87   
2     125      1.125      1125  ...    4351038.29  576365.15  4350932.06   

   Time End  Fix End SOG End Planned EOL X  Planned EOL Y  Obs EOL X  \
0  09:43:51  15594.0    3.28     588766.68     4379653.81  588901.53   
1  17:40:02  25764.0    3.02     564600.29     4350120.19  568980.53   
2  01:03:26  30615.0    3.78     577158.86     4351916.61  577275.90   

    Obs EOL Y  
0  4379545.64  
1  4355589.75  
2  4352057.35  

[3 rows x 32 columns]

If need concat together by Line Name with remove only NaNs columns:

df_proj = (pd.concat([df_in.set_index('Line Name').dropna(how='all', axis=1),
                     df_acc.set_index('Line Name')], axis=1, sort=False)
             .reset_index())
print (df_proj)
  Line Name  Bearing      SOL X       SOL Y      EOL X       EOL Y  FSP 0.5m  \
0      A901  38.6727  568453.03  4343701.73  569156.01  4344580.05       250   
1      A902  38.6727  568476.45  4343682.99  569179.43  4344561.31       250   
2      A903  38.6727  568499.87  4343664.24  569202.85  4344542.56       250   

   FSP 1m  Length km  Length m  ...  SOL Northing  Obs SOL X   Obs SOL Y  \
0     125      1.125      1125  ...    4380278.68  587811.22  4380427.40   
1     125      1.125      1125  ...    4383593.91  591626.99  4383892.87   
2     125      1.125      1125  ...    4351038.29  576365.15  4350932.06   

   Time End  Fix End SOG End Planned EOL X  Planned EOL Y  Obs EOL X  \
0  09:43:51    15594    3.28     588766.68     4379653.81  588901.53   
1  17:40:02    25764    3.02     564600.29     4350120.19  568980.53   
2  01:03:26    30615    3.78     577158.86     4351916.61  577275.90   

    Obs EOL Y  
0  4379545.64  
1  4355589.75  
2  4352057.35  

[3 rows x 32 columns]

EDIT:

For check duplicates:

print (df_in[df_in['Line Name'].duplicated(keep=False)])
print (df_acc[df_acc['Line Name'].duplicated(keep=False)])

For remove duplicates by Line Name use:

df_in = df_in.drop_duplicates('Line Name')
df_acc = df_acc.drop_duplicates('Line Name')

Combine two columns of text in dataframe in pandas/python, What is the difference between merge and join in pandas? I want to just merge every two rows. *however, could you also show an implementation that does consider row value, for learning's sake? the year and month values shouldn't differ - they are same for every two rows. if they do differ, you can default it to the later year and month value – user3314418 Apr 24 '14 at 19:10

lets say your starting dataframe as df_in

columns = df_in.columns
df_proj = pd.concat([df_in,df_acc[columns]], axis=0,sort=False)

Pandas Join vs. Merge, How do I merge two data frames with the same index? extracting values from dataframe1 using conditions set in dataframe2 (pandas, python) 'One or more row labels was not found' appears when there is/are labels not

You could do something like following:

#using append 
final_df = df_in.append(df_acc)
OR

use axis=0, meaning row wise

df_proj = pd.concat([df_in,df_acc], axis=0,sort=False)

How do I merge two data frames in Python Pandas?, How do I add a column from one DataFrame to another panda? Merge two dataframes with both the left and right dataframes using the subject_id key pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id') Merge with outer join “Full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available.

Pandas : Merge Dataframes on specific columns or on index in , In our previous article our focus was on merging using 'how' argument In both the above dataframes two column names are common i.e. ID & Experience. on 'ID' column from Dataframe 1 and 'EmpID' column from dataframe 2 i.e. on column names or row index labels using Dataframe.sort_index()  merge is a function in the pandas namespace, and it is also available as a DataFrame instance method merge(), with the calling DataFrame being implicitly considered the left object in the join. The related join() method, uses merge internally for the index-on-index (by default) and column(s)-on-index join.

Pandas : How to Merge Dataframes using Dataframe.merge() in , In this article we will discuss how to merge different Dataframes into a single left_on : Specific column names in left dataframe, on which merge will be done. bonus) from dataframe 2 with the columns of dataframe 1 based on 'ID' Include all rows from Left dataframe and add NaN for values which are We can merge two data frames in pandas python by using the merge () function. The different arguments to merge () allow you to perform natural join, left join, right join, and full outer join in pandas. left − Dataframe1. right – Dataframe2. on − Columns (names) to join on.

Combining DataFrames with Pandas – Data Analysis and , Combine data from multiple files into a single DataFrame using merge and concat. We can use the concat function in pandas to append either columns or rows It will automatically detect whether the column names are the same and will  Perform column-wise combine with another DataFrame. Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two. Parameters other DataFrame. The DataFrame to merge column-wise. func function. Function that takes two series as inputs and

Comments
  • Can you add expected output after processing ?
  • @jezrael edits added! Thanks!
  • Hi @jezrael, thank you for the feedback! The result you posted looks great. However I'm having a strange ValueError when I apply to the full size dataframe... ValueError: Shape of passed values is (220, 52), indices imply (219, 52) I'll definitely use this as a basis and see if I can isolate the problem but if you have any suggestions I'd appreciate it! Cheers.
  • @DennisWilson - one idea - are Line Name unique for both DataFrames ?
  • @DennisWilson - Because I think error means there is one dupe value.
  • @DennisWilson - test it by print (df_in[df_in['Line Name'].duplicated(keep=False)]) and print (df_acc[df_acc['Line Name'].duplicated(keep=False)])
  • Yes you were right, the code worked great! Thank you for the extra edits to help me troubleshoot.
  • Hi Rajith! Thanks for the suggestion. I'm getting an error using this approach.... KeyError: "['SOL X', 'LSP 0.5m', 'EOL Y', 'EOL X', 'FSP 0.5m', 'Bearing', 'SOL Y', 'Length km', 'Distance Sailed', 'LSP 1m', 'Length m', 'FSP 1m'] not in index" Any thoughts?
  • This because second dataframe doesnt have all the columns which are in main dataframe.you want to merge two dataframes with nan values?
  • Hey @dexter thanks for the reply! This gets the dataframes together, but doesn't merge on the line name. The idea is have a single row that contains the in data (planned line data) and recorded data (df_acc) together based on the 'Line Name' column.