How to delete duplicate lines in a file without sorting it in Unix?

Is there a way to delete duplicate lines in a file in Unix?

I can do it with the sort -u and uniq commands, but I want to use sed or awk. Is that possible?

awk '!seen[$0]++' file.txt

seen is an associative array to which Awk passes every line of the file as a key. If a line isn't in the array, then seen[$0] evaluates to false. The ! is the logical NOT operator and inverts that false to true. Awk prints each line for which the expression evaluates to true. The ++ increments seen[$0], so seen[$0] == 1 after the first time a line is found, 2 the second time, and so on. Awk evaluates everything except 0 and "" (the empty string) to true, so once a duplicate line has been recorded in seen, !seen[$0] evaluates to false and the line is not written to the output.
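
Here is a quick way to see it in action; the sample input below is made up. Order is preserved and the first occurrence of each line wins:

printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# a
# b
# c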

From http://sed.sourceforge.net/sed1line.txt: (Please don't ask me how this works ;-) )

 # delete duplicate, consecutive lines from a file (emulates "uniq").
 # First line in a set of duplicate lines is kept, rest are deleted.
 sed '$!N; /^\(.*\)\n\1$/!P; D'

 # delete duplicate, nonconsecutive lines from a file. Beware not to
 # overflow the buffer size of the hold space, or else use GNU sed.
 sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
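
As a quick, made-up illustration of the first script: only consecutive duplicates are removed, so the final "a" survives:

printf 'a\na\nb\na\n' | sed '$!N; /^\(.*\)\n\1$/!P; D'
# a
# b
# a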

Perl one-liner similar to @jonas's awk solution:

perl -ne 'print if ! $x{$_}++' file

This variation removes trailing whitespace before comparing:

perl -lne 's/\s*$//; print if ! $x{$_}++' file

This variation edits the file in place:

perl -i -ne 'print if ! $x{$_}++' file

This variation edits the file in place and makes a backup, file.bak:

perl -i.bak -ne 'print if ! $x{$_}++' file
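
A made-up round trip showing the in-place edit plus backup (the name data.txt is just for illustration):

printf 'a\nb\na\n' > data.txt
perl -i.bak -ne 'print if ! $x{$_}++' data.txt
cat data.txt       # a, b
cat data.txt.bak   # the original three lines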

The one-liner that Andre Miller posted above works, except with recent versions of sed when the input file ends with a blank line containing no characters. On my Mac, the CPU just spins.

Infinite loop if the last line is blank (contains no characters):

sed '$!N; /^\(.*\)\n\1$/!P; D'

This doesn't hang, but you lose the last line:

sed '$d;N; /^\(.*\)\n\1$/!P; D'

The explanation is at the very end of the sed FAQ:

The GNU sed maintainer felt that despite the portability problems this would cause, changing the N command to print (rather than delete) the pattern space was more consistent with one's intuitions about how a command to "append the Next line" ought to behave. Another fact favoring the change was that "{N;command;}" will delete the last line if the file has an odd number of lines, but print the last line if the file has an even number of lines.

To convert scripts which used the former behavior of N (deleting the pattern space upon reaching the EOF) to scripts compatible with all versions of sed, change a lone "N;" to "$d;N;".
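
To make that rule concrete, here is a sketch with a script that joins pairs of lines (the file name and contents are made up):

printf 'a\nb\nc\n' > odd.txt
# GNU sed prints the leftover odd line; historical seds silently dropped it:
sed 'N; s/\n/ /' odd.txt       # "a b" then "c" on GNU sed
# The portable form behaves like historical sed everywhere:
sed '$d;N; s/\n/ /' odd.txt    # "a b" only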

An alternative way using Vim (Vi compatible):

Delete duplicate, consecutive lines from a file:

vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq file

Delete duplicate, nonconsecutive and nonempty lines from a file:

vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq file
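
One caveat, from my reading of these patterns (the demo file is made up): :g//d deletes a line when an identical copy appears later in the file, so unlike the awk approach Vim keeps the last occurrence of each duplicate, not the first:

printf 'b\na\nb\n' > demo.txt
vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq demo.txt
cat demo.txt   # a, b  (the first "b" was deleted)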

Comments
  • if you mean consecutive duplicates, then uniq alone is enough.
  • and otherwise, I believe it's possible with awk, but it will be quite resource-consuming on bigger files.
  • Duplicates stackoverflow.com/q/24324350 and stackoverflow.com/q/11532157 have interesting answers which should ideally be migrated here.
  • To save it in a file we can do this: awk '!seen[$0]++' merge_all.txt > output.txt
  • An important caveat here: if you need to do this for multiple files, and you tack more files onto the end of the command, or use a wildcard… the 'seen' array will fill up with duplicate lines from ALL the files. If you instead want to treat each file independently, you'll need to do something like for f in *.txt; do gawk -i inplace '!seen[$0]++' "$f"; done (see the sketch after these comments).
  • @NickK9 that de-duping cumulatively across multiple files is awesome in itself. Nice tip
  • geekery ;-) +1, but resource consumption is unavoidable.
  • '$!N; /^\(.*\)\n\1$/!P; D' means "If you're not at the last line, read in another line. Now look at what you have, and if it ISN'T stuff followed by a newline and then the same stuff again, print out the stuff. Now delete the stuff (up to the newline)."
  • 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' means, roughly, "Append the whole hold space to this line, then if you see a duplicated line throw the whole thing out; otherwise copy the whole mess back into the hold space and print the first part (which is the line you just read)."
  • Is the $! part necessary? Doesn't sed 'N; /^\(.*\)\n\1$/!P; D' do the same thing? I can't come up with an example where the two are different on my machine (fwiw I did try an empty line at the end with both versions and they were both fine).
  • The second solution doesn't work for me (on GNU sed 4.2.1), on a test file with only lowercase English letters and spaces. However, replacing [ -~] with . or [^\n] or even [ -z{|}~] (the exact same set of characters) does the job. If anyone can explain the difference, that would be nice...
  • This will disturb the order of the lines.
  • What about a 20 GB text file? Too slow.
  • As ever, the cat is useless. Anyway, uniq already does this by itself, and doesn't require the input to be exactly one word per line.
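
A sketch of the multi-file caveat mentioned above (the file names are hypothetical; GNU awk 4.1+ is assumed for -i inplace):

# One invocation: "seen" persists across files, so a line already seen
# in a.txt is also stripped from b.txt.
gawk -i inplace '!seen[$0]++' a.txt b.txt

# A per-file loop de-duplicates each file independently.
for f in a.txt b.txt; do gawk -i inplace '!seen[$0]++' "$f"; done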