How to compare 2 huge CSV files, based on column names specified at run time and ignoring a few columns?


I need to write a program that compares 2 CSV files and reports the differences in an Excel file. It compares the records based on a Primary key (and sometimes a few Secondary keys), ignoring a list of other columns specified. All these parameters are read from an Excel sheet. I have written code that does this and works okay for small files, but the performance is very poor for huge files (some files that are to be compared have well over 200K rows).

The current logic uses csv.DictReader to read the files. I iterate over the rows of the first file, and for each row I find the corresponding record in the second file (matching the Primary and Secondary keys). If the record is found, I then compare all the columns except those the Excel sheet says to ignore. If any column differs, I write both records to the Excel report, highlighting the difference. Below is the code I have so far. It would be very kind if someone could provide tips to optimize this program or suggest a different approach.

primary_key = wb['Parameters'].cell(6,2).value              #Read Primary Key

secondary_keys = []                                         #Read Secondary Keys into a list
col = 4
while wb['Parameters'].cell(6,col).value:
    secondary_keys.append(wb['Parameters'].cell(6,col).value)
    col += 1
len_secondary_keys = len(secondary_keys)

ignore_col = []                                             #Read Columns to be ignored into a list
row = 8
while wb['Parameters'].cell(row,2).value:
    ignore_col.append(wb['Parameters'].cell(row,2).value)
    row += 1

with open(filename1) as csv_file_1, open(filename2) as csv_file_2:
    file1_reader = csv.DictReader(csv_file_1, delimiter='~')
    for row_file1 in file1_reader:
        record_found = False
        csv_file_2.seek(0)                                  #Rewind file 2 for every row of file 1
        file2_reader = csv.DictReader(csv_file_2, delimiter='~')
        for row_file2 in file2_reader:
            if row_file2[primary_key] == row_file1[primary_key]:
                for key in secondary_keys:
                    if row_file2[key] != row_file1[key]:
                        break
                else:                                       #All secondary keys matched
                    compare(row_file1, row_file2)
                    record_found = True
                    break
        if not record_found:
            report_not_found(sheet_name1, row_file1, row_no_file1)

def compare(row_file1, row_file2):
    global row_diff
    data_difference = False
    for key in row_file1:
        if key not in ignore_col:
            if (row_file1[key] != row_file2[key]):
                data_difference = True
                break
    if data_difference:
        c = 1
        for key in row_file1:
            wb_report['DW_Diff'].cell(row = row_diff, column = c).value = row_file1[key]
            wb_report['DW_Diff'].cell(row = row_diff+1, column = c).value = row_file2[key]
            if (row_file1[key] != row_file2[key]):
                wb_report['DW_Diff'].cell(row = row_diff+1, column = c).fill = PatternFill(fill_type='solid',
                                        fgColor=Color('FFFF0000'))
            c += 1
        row_diff += 2

You are running into speed issues because of the structure of your comparison: a nested loop that compares each entry in one collection to every entry in the other, which is O(N²) and therefore slow for large files.

One way to modify your code slightly is to change how you ingest the data: instead of using csv.DictReader to iterate each file row by row, build a single dictionary per file, using the primary and secondary keys together as the dictionary key. You can then compare entries between the two dictionaries very easily, with constant-time lookups.

This construct assumes that you have unique primary/secondary keys in each file, which it seems like you are assuming from above.

Here is a toy example, using a tuple of an integer and an animal type as the (primary key, secondary key) key:

In [7]: file1_dict = {(1, 'dog'): [45, 22, 66], (3, 'bird'): [55, 20, 1], (15, 'cat'): [6, 8, 90]}

In [8]: file2_dict = {(1, 'dog'): [45, 22, 66], (3, 'bird'): [4, 20, 1]}

In [9]: file1_dict
Out[9]: {(1, 'dog'): [45, 22, 66], (3, 'bird'): [55, 20, 1], (15, 'cat'): [6, 8, 90]}

In [10]: file2_dict
Out[10]: {(1, 'dog'): [45, 22, 66], (3, 'bird'): [4, 20, 1]}

In [11]: for k in file1_dict:
    ...:     if k in file2_dict:
    ...:         if file1_dict[k] == file2_dict[k]:
    ...:             print('matched %s' % str(k))
    ...:         else:
    ...:             print('different %s' % str(k))
    ...:     else:
    ...:         print('no corresponding key for %s' % str(k))
    ...:
matched (1, 'dog')
different (3, 'bird')
no corresponding key for (15, 'cat')
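Applied to the CSV files from the question, the ingestion step might look like the sketch below. `load_keyed` is a hypothetical helper name; `primary_key`, the `secondary_keys` list, and the `~` delimiter come from the question's code.

```python
import csv

def load_keyed(path, primary_key, secondary_keys, delimiter='~'):
    """Index a CSV file by a (primary, *secondary) key tuple so that
    lookups in the other file become O(1) dictionary accesses."""
    records = {}
    with open(path, newline='') as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            key = (row[primary_key], *(row[k] for k in secondary_keys))
            records[key] = row              # assumes keys are unique per file
    return records
```

With both files loaded this way, the O(N²) nested loop becomes a single pass over one dictionary with membership tests against the other.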


I was able to achieve this using the pandas library, as suggested by @Vaibhav Jadhav, with the steps below:

1. Import the 2 CSV files into dataframes, e.g.:

try:
    data1 = pd.read_csv(codecs.open(filename1, 'rb', 'utf-8', errors='ignore'), sep=delimiter1, dtype='str', error_bad_lines=False)
    print(data1[keys[0]])
except Exception:
    data1 = pd.read_csv(codecs.open(filename1, 'rb', 'utf-16', errors='ignore'), sep=delimiter1, dtype='str', error_bad_lines=False)

2. Delete the columns not to be compared from both dataframes:

for col in data1.columns:
    if col in ignore_col:
        del data1[col]
        del data2[col]

3. Merge the 2 dataframes with indicator=True:

merged = pd.merge(data1, data2, how='outer', indicator=True)

4. From the merged dataframe, delete the rows that were present in both dataframes:

merged = merged[merged._merge != 'both']

5. Sort the dataframe by the key(s):

merged.sort_values(by = keys, inplace = True, kind = 'quicksort')

6. Iterate over the rows of the dataframe, comparing the keys of each pair of adjacent rows. If the keys differ, the row exists in only one of the 2 CSV files. If the keys are the same, iterate over the individual columns to find which column value differs.
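Steps 2–5 above can be sketched as one small function. This is only a sketch: `csv_outer_diff` is a hypothetical name, and `keys` and `ignore_col` stand for the parameters read from the Excel sheet in the question.

```python
import pandas as pd

def csv_outer_diff(data1, data2, keys, ignore_col):
    """Drop ignored columns, outer-merge with an indicator column,
    keep only rows that are not identical in both files, sort by keys."""
    data1 = data1.drop(columns=[c for c in ignore_col if c in data1.columns])
    data2 = data2.drop(columns=[c for c in ignore_col if c in data2.columns])
    merged = pd.merge(data1, data2, how='outer', indicator=True)
    merged = merged[merged['_merge'] != 'both']
    return merged.sort_values(by=keys, kind='quicksort')
```

Each surviving row carries `_merge` as `left_only` or `right_only`, so step 6 can pair up adjacent rows with equal keys and report the differing columns.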


It is a good use case for Apache Beam.

Features like GroupByKey will make matching by keys more efficient.

Using an appropriate runner you can efficiently scale to much larger datasets.

There may be no built-in Excel IO, but you could output to a CSV, a database, etc.

https://beam.apache.org/documentation/
https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey/
https://beam.apache.org/documentation/runners/capability-matrix/
https://beam.apache.org/documentation/io/built-in/


Comments
  • Try using pandas library by creating dataframes