Find lines common to multiple files

I have nearly 200 files and I want to find the lines that are common to all 200 of them. The lines look like this:

HISEQ1:105:C0A57ACXX:2:1101:10000:105587/1
HISEQ1:105:C0A57ACXX:2:1101:10000:105587/2
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/1
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/2
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/1
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/2

Is there a way to do this for all the files at once?

awk '(NR==FNR){a[$0]=1;next}
     (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }
     ($0 in a) { a[$0]=1 }
     END{for (i in a) if (a[i]) print i}' file1 file2 file3 ... file200

This method processes the files line by line. The idea is to keep track of which lines have been seen in the current file by using an associative array a[line]: 1 means the line has been seen in the current file, 0 means it has not. (A small demo follows the numbered breakdown below.)

  1. (NR==FNR){a[$0]=1;next}: store each line of the first file into an array indexed by the line, and mark it as seen. (NR==FNR) is a condition that holds only while the first file is being read (the total record number equals the per-file record number).
  2. (FNR==1){for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }: when we reach the first line of each subsequent file, check which lines were seen in the previous file. If a line in the array was not seen, delete it; if it was seen, reset it to not-seen (0). This frees memory as the candidate set shrinks and also handles duplicate lines within a single file.
  3. ($0 in a) { a[$0]=1 }: for each line, check whether the line is a member of the array; if it is, mark it as seen (1).
  4. END{for (i in a) if(a[i]) print i}: when all lines are processed, print the lines that were seen in the last file, i.e. the lines common to all files.
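
To see it in action, here is a minimal demo (the file names f1.txt to f3.txt are made up for illustration):

printf 'a\ncommon\n' > f1.txt
printf 'b\ncommon\n' > f2.txt
printf 'c\ncommon\n' > f3.txt

awk '(NR==FNR){a[$0]=1;next}
     (FNR==1){ for(i in a) if(a[i]) {a[i]=0} else {delete a[i]} }
     ($0 in a) { a[$0]=1 }
     END{for (i in a) if (a[i]) print i}' f1.txt f2.txt f3.txt
# -> common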

For reference, from the comm documentation: comm compares the sorted files FILE1 and FILE2 line by line. With no options, it produces three-column output: column one contains lines unique to FILE1, column two lines unique to FILE2, and column three lines common to both files.
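
For example (left.txt and right.txt are illustrative names; both inputs must be sorted):

printf 'apple\nbanana\n'  > left.txt
printf 'banana\ncherry\n' > right.txt

comm left.txt right.txt      # three columns: apple | cherry | banana
comm -12 left.txt right.txt  # suppress columns 1 and 2 -> prints: banana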

I don't think there is a single Unix command you could use directly for this task, but you can build a little shell script around the comm and grep commands, as shown in the following example:

#!/bin/bash    

# Prepare 200 (small) test files
rm -f data-*.txt   # -f: do not fail if the files do not exist yet
for i in {1..200} ; do
    echo "${i}" >> "data-${i}.txt"
    # common line
    echo "foo common line" >> "data-${i}.txt"
done

# Get the common lines between file1 and file2.
# file1 and file2 can be random files out of the set,
# ideally they are the smallest ones
comm -12 data-1.txt data-2.txt > common_lines

# Now grep through the remaining files for those lines
for file in data-{3..200}.txt ; do
    # For each remaining file reduce the common_lines to those
    # which are found in that file
    grep -Fxf common_lines "${file}" > tmp_common_lines \
        && mv tmp_common_lines common_lines
done

# Print the common lines
cat common_lines

The same approach works for bigger files. It will take longer, but memory consumption stays low: only the current set of candidate common lines is kept around, and that set can only shrink as more files are processed.
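
One caveat: comm expects its inputs to be sorted, and the toy files above only happen to be sorted already. For real data you could sort on the fly, e.g. with bash process substitution as sketched below; only the initial comm step needs this, since grep -Fxf does not care about line order:

comm -12 <(sort data-1.txt) <(sort data-2.txt) > common_lines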


Could you please try the following. Fair warning: this is memory-hungry, since every distinct line of every file is stored in an array. The idea is to count, for each line, how many files it occurs in; a line whose count equals the number of files is common to all of them. Duplicate lines within a single file are counted only once, so a line repeated inside one file is not mistaken for one that occurs in every file.

awk '
FNR==1{                 # starting a new file: bump the file counter
  file++
  split("", seen)       # and reset the per-file duplicate tracker
}
!seen[$0]++{            # count each distinct line at most once per file
  a[$0]++
}
END{
  for(i in a){
    if(a[i]==file){
      print "Line " i " is found in all " file " files."
    }
  }
}' file1 file2 ... file200
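
A quick sanity check (s1.txt and s2.txt are made-up names; note the duplicated line in s1.txt, which must not be reported as common):

printf 'common\ndup\ndup\n' > s1.txt
printf 'common\nx\n' > s2.txt

awk '
FNR==1{ file++; split("", seen) }
!seen[$0]++{ a[$0]++ }
END{ for(i in a) if(a[i]==file) print "Line " i " is found in all " file " files." }
' s1.txt s2.txt
# -> Line common is found in all 2 files.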


My approach would be to generate a "super-file" that has, at the start of each line, a column for the file name and line number, followed by the corresponding line of content, and then to sort this file on the content column.

grep can generate the first part of this (invoked with -n and multiple file arguments, it prefixes each line with the file name and line number), especially if you can exclude some parts of the files.
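
A rough sketch of this idea, assuming GNU grep and hypothetical file names data-*.txt; the sort groups identical content together as described, and the awk hash then counts in how many distinct files each content line appears:

nfiles=$(ls data-*.txt | wc -l)

# grep -n '' matches every line; with several file arguments it prefixes
# each match with "file:lineno:", giving "file:lineno:content" records
grep -n '' data-*.txt | sort -t: -k3 |
awk -F: -v n="$nfiles" '
{
    # content is everything after the second ":" (it may itself contain ":")
    content = substr($0, length($1) + length($2) + 3)
    if (!((content, $1) in seen)) {   # count each file at most once per line
        seen[content, $1] = 1
        cnt[content]++
    }
}
END { for (c in cnt) if (cnt[c] == n) print c }'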





Comments
  • On SO we encourage users to show the effort they have put in to solve their own problem, so please add that to your question and let us know.
  • Also, please mention the format of the files you want to traverse and check.
  • ok, restored. For the OP: I recommend using awk as shown here. But maybe the use of grep and comm is still interesting for educational purposes.
  • Thanks, I tried something similar... could you explain what everything after awk means? Also, do I have to list all my 200 files?
  • Yes, thanks a lot hek2mgl. I do not use this style much, so it is definitely very useful...
  • @user3224522 You can use shell expansion to avoid listing out all 200 files manually. E.g. if the files are named file1.dat file2.dat, you can do awk '<CODE>' file*.dat. The shell will expand the file names before awk is invoked.
  • @user3224522 sorry there was a tiny bug in the code. This is now fixed.
  • I checked and this is working; probably my files don't have any lines in common... :/
  • This could be problematic. Imagine you have 200 files of 1GB each, but not a single line in common. You will attempt to store 200GB of data in your array a.
  • @kvantour, I added a warning about that; if memory is sufficient, then this should be the simplest option IMHO.