remove the first duplicate line based on a matched field and keep the second matched line

  1. Input file has 3 fields. Each field is separated by a | (PIPE).
  2. First field is the key field and is sorted. Each key in the first field may occur once or twice.
  3. If the same key exists twice in the first field, remove the line of the first occurrence and keep the line of the second occurrence.
  4. If a key occurs only once, keep the line.
  5. Data in the third field is unique throughout the file.

I tried the command below, which keeps the first duplicate line and removes the rest of the duplicate lines. Is there an option in awk to remove the first matched duplicate line and keep the second matched line instead? A command other than awk is also okay. The input file can be 50 GB in size; I am currently testing on a 12 GB file.

awk -F'|' '!a[$1]++'
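
For context on why this keeps the first occurrence: a[$1]++ evaluates to the number of times the key has already been seen, so it is 0 (and !0 is true) only on the first line for each key. A commented sketch of the same one-liner (the filename Inputfile is just a placeholder):

awk -F'|' '
  # a[$1]++ is 0 the first time key $1 is seen, 1 the second time, and so on.
  # !0 is true, so only the first line per key is printed.
  !a[$1]++
' Inputfile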

Input File Content:

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "xsdsyzsgngn"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "xynfnfnnnz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
....

Output expected after processing the input file:

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
....

EDIT

Tried the solutions below, provided by @RavinderSingh13 and @RomanPerekhrest respectively.

For the 12 GB input file, the solution below took 1 minute 20 seconds on the first run and 1 minute 46 seconds on the second run:

awk '
BEGIN{
  FS="|"
}
!a[$1]++{
  b[++count]=$1
}
{
  c[$1]=$0
}
END{
  for(i=1;i<=count;i++){
    print c[b[i]]
  }
}
' Inputfile  > testawk.txt

For the 12 GB input file, the solution below took 2 minutes 31 seconds on the first run, 4 minutes 43 seconds on the second run and 2 minutes on the third run:

awk -F'|' 'prev && $1 != prev{ print row }{ prev=$1; row=$0 }END{ print row }' Inputfile > testawk2.txt

Both solutions work as expected. I will use one of them after doing a few more performance tests.

1st solution: If you are not at all worried about the order of the lines in the output, then simply do:

awk 'BEGIN{FS="|"} {a[$1]=$0} END{for(i in a){print a[i]}}' Input_file


2nd solution: Adding one more solution with awk (fewer arrays) plus sort, in case you are worried about the order:

awk 'BEGIN{FS="|"} {a[$1]=$0} END{for(i in a){print a[i]}}' Input_file | sort -t'|' -k1
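
One caveat with the sort step, assuming the keys really are numeric as in the sample: -k1 compares the field as text, so a key like 10 would sort before 2. A numeric sort key avoids that:

awk 'BEGIN{FS="|"} {a[$1]=$0} END{for(i in a){print a[i]}}' Input_file | sort -t'|' -k1,1n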


3rd solution: Could you please try the following, in case you need the order of the output to be the same as in the shown Input_file:

awk '
BEGIN{
  FS="|"
}
# First time a key is seen: record it in b[] to remember the input order.
!a[$1]++{
  b[++count]=$1
}
# For every line, store (and overwrite) the latest line seen for this key.
{
  c[$1]=$0
}
# Print the last stored line for each key, in the original key order.
END{
  for(i=1;i<=count;i++){
    print c[b[i]]
  }
}
'  Input_file

Output will be as follows.

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
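
As an aside, a hedged alternative rather than part of the answer above: if GNU awk (gawk 4.0+) is available, and the first field is numeric and numerically sorted in the input, the END-block traversal order can be forced with PROCINFO["sorted_in"], avoiding both the b[]/count bookkeeping and an external sort:

gawk '
BEGIN{ FS="|" }
{ a[$1]=$0 }                             # keep the last line seen for each key
END{
  PROCINFO["sorted_in"]="@ind_num_asc"   # iterate keys in ascending numeric order
  for(i in a) print a[i]
}
' Input_file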


Efficiently, with an awk expression:

awk -F'|' 'prev && $1 != prev{ print row }{ prev=$1; row=$0 }END{ print row }' file

The "magic" is based on buffering each current record (efficiently overwriting it rather than accumulating everything) and deferring the decision to print until the next row has been read.
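
Spelled out with comments, the same logic reads (identical behaviour, just reformatted):

awk -F'|' '
  # A new key means the buffered line was the last line of the previous
  # key group, so print it now.
  prev && $1 != prev { print row }
  # Buffer the current key and the current line for the next comparison.
  { prev = $1; row = $0 }
  # Flush the last buffered line at end of input.
  END { print row }
' file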

Sample output:

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}


This one-liner removes only the first duplicate (the 2nd occurrence) of each key from your file; any later occurrences are kept.

awk 'a[$1]++ !=1' file

Let's see an example:

kent$  cat f
1
2
3
2 <- should be removed
4
3 <- should be removed
5
6
7
8
9
2 <- should be kept
3 <- should be kept
10

kent$  awk 'a[$1]++ !=1' f
1
2
3
4
5
6
7
8
9
2
3
10
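
The counting logic spelled out, equivalent to the one-liner above:

awk '
  # a[$1]++ yields the number of prior sightings of the key: 0, 1, 2, ...
  # Print unless this is exactly the 2nd sighting (count == 1),
  # so only the 2nd occurrence of each key is dropped.
  a[$1]++ != 1
' file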


Reverse the file and do a stable, unique sort:

cat <<EOF |
1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "xsdsyzsgngn"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "xynfnfnnnz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
EOF
tac | sort -s -t'|' -k1,1 -u

would output:

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}

tac is a GNU utility. Because your file is big, pass the filename directly to tac so it can read the file from the back, and use the -T, --temporary-directory=DIR option with sort so it can handle such big files (or don't, if you have enough RAM).
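
Put together for a large file on disk, that could look like this (Inputfile, /big/tmpdir and output.txt are placeholders):

# tac reads Inputfile from the end; with -s (stable) and -u, sort then keeps
# the first line it meets per key, which is the last occurrence in the original file.
tac Inputfile | sort -s -t'|' -k1,1 -u -T /big/tmpdir > output.txt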

Comments
  • I didn't see any duplicates removed from the expected output. Please elaborate.
  • @JamesBrown it did. the key is $1
  • What about the 3rd, 4th, ... occurrences, should they be removed like the 1st? What about the case of NO duplicates, do you still need to see the first instance?
  • In your example, does the prefix '5|xxx|' just mean line 5 of input file xxx, or is it part of the input?
  • Why does the expected output include the 'xyz' from line 1 - should it be removed because of the 'xyz' in line 6?
  • Keep in mind that the OP said the file is 50 GB. The c[] array stores the whole file, if I understood the code right.
  • @Kent, yes, true. Let me add a concept that reads the Input_file twice, too, but I am not sure whether that will be faster than this or not.
  • @Kent, hi sir, how about my 2nd solution? IMHO it should be faster than my previous one, I believe. What do you say, sir?
  • It is not about faster or slower. I don't know what would happen if OP's system has <50G free memory.
  • For me, the first occurrence should be removed and the 2nd occurrence should be present.
  • @inzero so you only want to keep duplicated data? E.g. key 5 in my example: there are no dups, should it be kept or not? Please make a better input/output example. I am completely confused.
  • @inzero if you have 10 lines with the same key, do you want the 2nd-10th lines in your output?
  • There will be only 1 or 2 lines with the same key.