Bash Script: count unique lines in file

Situation:

I have a large file (millions of lines) containing IP addresses and ports from a several-hour network capture, one IP/port per line. Lines are of this format:

ip.ad.dre.ss[:port]

Desired result:

There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to run this through a shell script of some sort that will reduce it to lines of the format

ip.ad.dre.ss[:port] count

where count is the number of occurrences of that specific address (and port). No special work has to be done; treat different ports as different addresses.
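
To make the goal concrete, here is a tiny made-up example (the addresses and counts below are purely illustrative):

Input:

10.33.8.1:80
10.33.8.2:33010
10.33.8.1:80

Desired output:

10.33.8.1:80 2
10.33.8.2:33010 1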

So far, I'm using this command to scrape all of the IP addresses from the log file:

grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt

From that, I can use a fairly simple regex to filter out all of the IP addresses that were sent by my own address (which I don't care about).
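
For instance, if my own address were 10.0.0.5 (just a placeholder), something along these lines would drop those entries (writing to a separate file here purely for illustration):

grep -E -v '^10\.0\.0\.5(:|$)' ips.txt > remote_ips.txt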

I can then use the following to extract the unique entries:

sort -u ips.txt > intermediate.txt

What I don't know is how to aggregate the counts for each line using sort.

You can use the uniq command to get counts of sorted repeated lines:

sort ips.txt | uniq -c

To get the most frequent results at top (thanks to Peter Jaric):

sort ips.txt | uniq -c | sort -bgr
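
With made-up data, that pipeline produces something like this (counts and addresses purely illustrative):

     97 10.33.8.1:80
     42 10.33.8.2:33010
      3 10.33.8.1:443

Here -b ignores leading blanks, -g compares general numeric values, and -r reverses the order so the biggest counts end up at the top. Note that uniq -c prints the count before the line rather than after it, as in the format asked for.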


To count the total number of distinct lines (i.e. counting each line once, ignoring duplicates) we can use uniq or Awk with wc:

sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l

Awk's arrays are associative, so it can count without sorting first and may run a little faster than sorting.
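
For anyone puzzled by the !seen[$0]++ idiom, here is the same one-liner written out with comments (just an expanded sketch of what it does):

awk '
  # seen[$0] is 0 the first time a line is encountered, so !seen[$0] is true
  # and the line is printed (the default action); the ++ then increments the
  # counter, so every later occurrence of the same line evaluates to false
  # and is skipped.
  !seen[$0]++
' ips.txt | wc -l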

Generating a test file:

$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175

real    0m1.193s
user    0m0.701s
sys     0m0.388s

$ time awk '!seen[$0]++' random.txt | wc -l
31175

real    0m0.675s
user    0m0.108s
sys     0m0.171s


This is the fastest way to get the counts of the repeated lines and have them nicely printed, sorted from the least frequent to the most frequent:

awk '{!seen[$0]++}END{for (i in seen) print seen[i], i}' ips.txt | sort -n

If you don't care about performance and you want something easier to remember, then simply run:

sort ips.txt | uniq -c | sort -n
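
Both variants print the count before the address. If you want the exact ip.ad.dre.ss[:port] count layout from the question, one possibility (a sketch, assuming the lines themselves contain no spaces) is to swap the two columns with awk afterwards:

sort ips.txt | uniq -c | sort -n | awk '{print $2, $1}'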

PS:

sort -n parses the field as a number, which is correct since we're sorting by the counts.
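
A quick illustration of why the numeric flag matters: a plain sort compares the counts as strings, so 9 would come after 10:

$ printf '10 a\n9 b\n2 c\n' | sort
10 a
2 c
9 b

$ printf '10 a\n9 b\n2 c\n' | sort -n
2 c
9 b
10 a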

Comments
  • I like how -bgr coincidentally looks like a mnemonic for bigger, which is what we want at the top.
  • As a small function for your .bashrc or .bash_aliases file: function countuniquelines () { sort "$1" | uniq -c | sort -bgr; }. Call by countuniquelines myfile.txt.
  • Not sure why not sort -nr.
  • Interesting. Might make an appreciable difference for huge datasets.
  • The ! in {!seen[$0]++} is redundant here, as we only do the printing at the END.