Intersection of Large csv Files

I have two large CSV files with thousands of entries each; each file contains two columns of IDs of the form:

BRADI5G01462.1_1    NCRNA_34654_1853
BRADI5G01462.1_1    NCRNA_34398_1942
BRADI5G01462.1_1    NCRNA_2871_1959

I've tried this, but it's not giving the expected results:

import csv
files = ["#Left(Brachypodium_Japonica).csv", "#Right(Brachypodium_Japonica).csv"]
for path in files:
    name = path[path.find("#") + 1:path.find(".")]
    with open(path, "r", newline='') as source:
        rdr = csv.reader(source, delimiter="\t", skipinitialspace=True)
        with open("@" + name + ".csv", "w", newline='') as result:
            wtr = csv.writer(result, delimiter="\t", skipinitialspace=True)
            for r in rdr:
                # the original files have three columns; keep only the two ID columns
                wtr.writerow((r[1], r[2]))


l1 = set(open('@Left(Brachypodium_Japonica).csv'))
l2 = set(open('@Right(Brachypodium_Japonica).csv'))
open('Intersection(Brachypodium_Japonica).csv', 'w').writelines(l1 & l2)

What is the most efficient, Pythonic way to find the intersection of the two files, i.e. rows where both columns match in both files?

I've asked this question before, but got no answers.

I'm really stuck on this and would greatly appreciate any help.

Edit: File 1 (Left) input sample:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI3G00440.1_243  OS03T0274400-01_2650
BRADI3G58610.1_438  OS01T0112500-01_899
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78150.1_870  OS02T0543300-00_2055

File 2 (Right) input sample:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI4G45180.1_240  OS03T0103800-01_2473
BRADI2G12470.2_487  OS04T0470600-00_3504
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78330.1_878  OS06T0155600-01_4411

Intersection file of Left & Right:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI1G73670.1_850  OS11T0481500-01_6621

When I looked into your code, it does the following:

-> you have two files

-> you rewrite them into two new files

-> finally you add them to the intersection file

Let's remove that second step:

import csv
files = ["#Left(Brachypodium_Japonica).csv", "#Right(Brachypodium_Japonica).csv"]
for name in files:
    with open(name, "r", newline='') as source:
        rdr = csv.reader(source, delimiter="\t", skipinitialspace=True)
        # mode must be "a" (append); "aw" is not a valid file mode
        with open('Intersection(Brachypodium_Japonica).csv', "a", newline='') as result:
            wtr = csv.writer(result, delimiter="\t", skipinitialspace=True)
            for r in rdr:
                wtr.writerow((r[0], r[1]))

the steps:

->read the first file and add it to the intersection

->read the next file and continue appending

Try this, and if it does not work then let me know. I have no idea why you used that newline and skipinitialspace, but if you know why, you may keep them. I have rewritten the code in a proper way following your format.

I tried it on my system and it worked; if it is not working for you, do post the traceback or the error.

The following script reads two (or more) CSV files in and writes the intersection of row entries to a new CSV file. By that I mean if row1 in input1.csv is found anywhere in input2.csv, the row is written to the output, and so on.

import csv

files = ["input1.csv", "input2.csv"]
ldata = []

for file in files:
    with open(file, "r") as f_input:
        csv_input = csv.reader(f_input, delimiter="\t", skipinitialspace=True)
        set_rows = set()
        for row in csv_input:
            set_rows.add(tuple(row))
        ldata.append(set_rows)

with open("Intersection(Brachypodium_Japonica).csv", "wb") as f_output:
    csv_output = csv.writer(f_output, delimiter="\t", skipinitialspace=True)
    csv_output.writerows(set.intersection(*ldata))

You will need to add your file name mangling. This format made it easier to test. Tested using Python 2.7.

I just had a similar issue where I needed to find the intersection of two rather large CSV files (74M lines vs. 4.5M lines). The basic idea is to read the smaller file and build a lookup table which can be used while iterating over the larger file.

The fastest way I found was using the jq command-line tool.

smaller.csv:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI3G00440.1_243  OS03T0274400-01_2650
BRADI3G58610.1_438  OS01T0112500-01_899
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78150.1_870  OS02T0543300-00_2055

larger.csv:

BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI3G00440.1_243  OS03T0274400-01_2650
BRADI3G58610.1_438  OS01T0112500-01_899
BRADI1G73670.1_850  OS11T0481500-01_6621
BRADI1G78150.1_870  OS02T0543300-00_2055
...

Prepare a lookup table from smaller.csv in the form of a JSON dict and write it to smaller.json:

$ jq -s -R 'split("\n")|map({key:.,value:1})|from_entries' smaller.csv | tee smaller.json
{
  "BRADI5G16060.1_36   OS08T0547100-02_5715": 1,
  "BRADI3G00440.1_243  OS03T0274400-01_2650": 1,
  "BRADI3G58610.1_438  OS01T0112500-01_899": 1,
  "BRADI1G73670.1_850  OS11T0481500-01_6621": 1,
  "BRADI1G78150.1_870  OS02T0543300-00_2055": 1
}

Now iterate over larger.csv and filter it with smaller.json as a lookup table:

$ jq -R -r --slurpfile F smaller.json 'select($F[0][.])' larger.csv | tee intersection.csv
BRADI5G16060.1_36   OS08T0547100-02_5715
BRADI1G73670.1_850  OS11T0481500-01_6621
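
Outside Python and jq, coreutils can do the same job: for inputs that can be sorted, `comm -12` prints only the lines common to both files. A sketch using the thread's sample rows (the file names are illustrative):

```shell
# Recreate two small sample files inline; substitute your real data.
printf 'BRADI5G16060.1_36\tOS08T0547100-02_5715\nBRADI3G00440.1_243\tOS03T0274400-01_2650\n' > left.csv
printf 'BRADI5G16060.1_36\tOS08T0547100-02_5715\nBRADI4G45180.1_240\tOS03T0103800-01_2473\n' > right.csv

# comm requires sorted input; -12 suppresses lines unique to each file,
# leaving only the lines present in both.
sort left.csv -o left.csv
sort right.csv -o right.csv
comm -12 left.csv right.csv > intersection.csv
```

Like the jq approach, this matches whole lines textually, so both files must format their rows identically.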

Comments
  • Is the input ordered?
  • @Sirko No, it's not ordered and the two files are NOT necessarily of the same size
  • I have not tried it myself, but I have read in many places that the numpy package is often suggested for working on large sets like yours.
  • @Marco NumPy is for numerical data, and my data is strings!
  • Hi, could you give us a little more information please? For example: what (maximum) sizes are the datasets, and on which system will it run (i.e. Linux 64-bit, Windows 32-bit...)? Does it have to be done in "one operation", or are you allowed to make a first pass to "refine" the inputs?
  • You might not have read the original post I wrote 4 months ago. I did rewrite the files into new ones to eliminate the first column, which represented the similarity, in order to use the set and intersection functions without much trouble. BTW, it's NOT about appending; it's about finding the intersection between two large files of different sizes!
  • @Bara'a I shall try a fix for that soon.
  • I would really appreciate it if you could help; I've been waiting for an answer for over 4 months :(
  • @Bara'a Can you show a small snippet of the expected input of the two files, and the expected output in the single file? Edit it into your question.
  • Please check the edit in the thread. Thank you for your help.
  • He had the idea of using a dictionary and a list, but since it is about 1000 or more lines per file, the OP opted out of it.