Compare two CSV files and search for similar items

So I've got two CSV files that I'm trying to compare and get the results of the similar items. The first file, hosts.csv is shown below:

Path    Filename    Size    Signature
C:\     a.txt       14kb    012345
D:\     b.txt       99kb    678910
C:\     c.txt       44kb    111213

The second file, masterlist.csv is shown below:

Filename    Signature
b.txt       678910
x.txt       111213
b.txt       777777
c.txt       999999

As you can see the rows do not match up and the masterlist.csv is always larger than the hosts.csv file. The only portion that I'd like to search for is the Signature portion. I know this would look something like:

hosts[3] == masterlist[1]

I am looking for a solution that will give me something like the following (basically the hosts.csv file with a new RESULTS column):

Path    Filename    Size    Signature    RESULTS
C:\     a.txt       14kb    012345       NOT FOUND in masterlist
D:\     b.txt       99kb    678910       FOUND in masterlist (row 1)
C:\     c.txt       44kb    111213       FOUND in masterlist (row 2)

I've searched the posts and found something similar to this here but I don't quite understand it as I'm still learning python.

Edit: Using Python 2.6

Edit: While my solution works correctly, check out Martijn's answer below for a more efficient solution.

You can find the documentation for the python CSV module here.

What you're looking for is something like this:

import csv

f1 = file('hosts.csv', 'r')
f2 = file('masterlist.csv', 'r')
f3 = file('results.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

masterlist = list(c2)

for hosts_row in c1:
    row = 1
    found = False
    # Start from the hosts row and append the result to it
    results_row = hosts_row
    for master_row in masterlist:
        if hosts_row[3] == master_row[1]:
            results_row.append('FOUND in master list (row ' + str(row) + ')')
            found = True
            break
        row = row + 1
    if not found:
        results_row.append('NOT FOUND in master list')
    c3.writerow(results_row)

f1.close()
f2.close()
f3.close()

The answer by srgerg is terribly inefficient, as it operates in quadratic time; here is a linear time solution instead, using Python 2.6-compatible syntax:

import csv

with open('masterlist.csv', 'rb') as master:
    master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

with open('hosts.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:    
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []) + ['RESULTS'])

        for row in reader:
            index = master_indices.get(row[3])
            if index is not None:
                message = 'FOUND in master list (row {})'.format(index)
            else:
                message = 'NOT FOUND in master list'
            writer.writerow(row + [message])

This first builds a dictionary mapping each signature in masterlist.csv to its line number. Dictionary lookups take constant time, which makes the second loop over the hosts.csv rows independent of the number of rows in masterlist.csv. The code is also a lot simpler.

For those using Python 3, only the open() calls need adjusting: open the files in text mode (remove the b from the file mode) and add newline='' so the CSV reader can take control of line separators. You may also want to state the encoding explicitly (use encoding=...) rather than rely on your system default. The master_indices mapping can be built with a dictionary comprehension ({r[1]: i for i, r in enumerate(csv.reader(master))}).
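As a rough sketch of the Python 3 adjustments described above, the whole script might look like the following. It assumes comma-separated files with a header row and UTF-8 encoding; the function name annotate_hosts and its parameters are made up for illustration and are not part of the original answer.

```python
import csv


def annotate_hosts(hosts_path, master_path, results_path, encoding='utf-8'):
    """Copy hosts_path to results_path, adding a RESULTS column."""
    # Map each signature (second column) to its row number in the master
    # file. The header is row 0, so the first data row is row 1.
    with open(master_path, newline='', encoding=encoding) as master:
        master_indices = {r[1]: i for i, r in enumerate(csv.reader(master))}

    with open(hosts_path, newline='', encoding=encoding) as hosts, \
            open(results_path, 'w', newline='', encoding=encoding) as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        # Copy the header row and append the new column name.
        writer.writerow(next(reader, []) + ['RESULTS'])

        for row in reader:
            # The signature is the fourth column of the hosts file.
            index = master_indices.get(row[3])
            if index is not None:
                message = 'FOUND in master list (row {})'.format(index)
            else:
                message = 'NOT FOUND in master list'
            writer.writerow(row + [message])
```

It would be called as annotate_hosts('hosts.csv', 'masterlist.csv', 'results.csv'). If a signature appears more than once in the master file, the dictionary keeps the row number of its last occurrence.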

Python's csv and collections modules, specifically OrderedDict, are really helpful here. You want to use OrderedDict to preserve the order of the keys, so the rows come out in the same order as in hosts.csv. You don't have to, but it's useful!

import csv
from collections import OrderedDict


signature_row_map = OrderedDict()


with open('hosts.csv') as file_object:
    for line in csv.DictReader(file_object, delimiter='\t'):
        signature_row_map[line['Signature']] = {'line': line, 'found_at': None}


with open('masterlist.csv') as file_object:
    for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):
        if line['Signature'] in signature_row_map:
            signature_row_map[line['Signature']]['found_at'] = i


with open('newhosts.csv', 'w') as file_object:
    fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
    writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')
    writer.writer.writerow(fieldnames)
    for signature_info in signature_row_map.itervalues():
        result = '{0} FOUND in masterlist {1}'
        # explicit check for sentinel
        if signature_info['found_at'] is not None:
            result = result.format('', '(row %s)' % signature_info['found_at'])
        else:
            result = result.format('NOT', '')
        payload = signature_info['line']
        payload['RESULTS'] = result

        writer.writerow(payload)

Here's the output using your test CSV files:

Path    Filename        Size    Signature       RESULTS
C:\     a.txt   14kb    012345  NOT FOUND in masterlist 
D:\     b.txt   99kb    678910   FOUND in masterlist (row 1)
C:\     c.txt   44kb    111213   FOUND in masterlist (row 2)

Please excuse the misalignment, they are tab separated :)

The csv module comes in handy for parsing CSV files, but just for fun I am simply splitting the input on whitespace to get at the data.

Just parse in the data, build a dict for the data in masterlist.csv with the signature as key and the line number as value. Now, for each row of hosts.csv, we can just query the dict and find out whether or not a corresponding entry exists in masterlist.csv and if so at which line.

#! /usr/bin/env python

def read_data(filename):
    # Skip the header line, then split each remaining line on whitespace
    with open(filename, 'r') as input_source:
        input_source.readline()
        return [line.split() for line in input_source]

if __name__ == '__main__':
    hosts = read_data('hosts.csv')
    masterlist = read_data('masterlist.csv')
    # Map each signature (last column) to its 1-based row number
    master = dict()
    for index, data in enumerate(masterlist):
        master[data[-1]] = index + 1
    for row in hosts:
        try:
            found = "FOUND in masterlist (row %s)" % master[row[-1]]
        except KeyError:
            found = "NOT FOUND in masterlist"
        line = row + [found]
        print "%s    %s    %s    %s    %s" % tuple(line)

I fixed one small thing in Martijn Pieters' code to make it work in Python 3. In this version, I match the first-column elements of file1 (row[0]) against the first-column elements of file2 (row[0]).

import csv

with open('file1.csv', 'rt', encoding='utf-8', newline='') as master:
    master_indices = dict((r[0], i) for i, r in enumerate(csv.reader(master)))

with open('file2.csv', 'rt', encoding='utf-8', newline='') as hosts:
    # newline='' lets the csv writer control line endings (no blank lines)
    with open('result.csv', 'w', encoding='utf-8', newline='') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []) + ['RESULTS'])

        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                message = 'FOUND in master list (row {})'.format(index)
            else:
                message = 'NOT FOUND in master list'
            writer.writerow(row + [message])

Comments
  • This is pretty good. Using csv.DictReader might be clearer too, since you could replace master_row[1] with master_row['signature'].
  • This produces a blank line after every result.
  • The blank line issue is system dependent. If you get a blank line after every result, replace the f3 = file('results.csv', 'w') line with f3 = file('results.csv', 'wb')
  • This works as needed. Easy to read through too! Thanks for the help!
  • Why a list comprehension when masterlist = list(c2) would do?
  • The script together with the example inputs will give the error: "IndexError: list index out of range"
  • @Chubaka: take into account that the inputs are comma separated, not tab separated. The OP only formatted them that way in the question.
  • Is it possible to implement a solution like this but instead of comparing specific indices compare the whole row contents (assuming you have csv files that are identical with only a few different rows).
  • @ssbsts: put all rows from file 1 into a set: existing = {tuple(r) for r in reader1} (converting rows to tuples is needed to make them hashable), then test your other file against the existing set with if tuple(row) in existing:.
  • I'm getting an ImportError: cannot import name OrderedDict. I'm using Python 2.6 and a portable version of python 3. Is OrderedDict specific only to 2.7?
  • Yes. You can change OrderedDict to dict() and it will work fine.
  • You can backport 2.7 OrderedDict to 2.6. The module can be found here: hg.python.org/cpython/file/291bc0097cc1/Lib/collections/…
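
The whole-row comparison suggested in the comments above (put the rows of one file into a set of tuples, then test the other file against it) could be sketched roughly like this in Python 3. The function name split_rows and its return shape are illustrative assumptions, not something from the original thread.

```python
import csv


def split_rows(path1, path2):
    """Split the rows of path2 into those present in path1 and those not."""
    with open(path1, newline='') as f1:
        # Lists are not hashable, so each row is stored as a tuple.
        existing = {tuple(r) for r in csv.reader(f1)}

    common, different = [], []
    with open(path2, newline='') as f2:
        for row in csv.reader(f2):
            if tuple(row) in existing:
                common.append(row)
            else:
                different.append(row)
    return common, different
```

Calling common, different = split_rows('file1.csv', 'file2.csv') would give the rows of file2.csv that match a row of file1.csv exactly, and the rows that do not. Set membership tests take constant time on average, so this stays linear in the total number of rows.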