detecting "duplicate" entries in a tab separated file using bash & commands

I have a tab-separated text file I need to check for duplicates. The layout looks roughly like this. (The first row of the file holds the column names.) Sample input file:

+--------+-----------+--------+------------+-------------+----------+
| First  |   Last    | BookID |   Title    | PublisherID | AuthorID |
+--------+-----------+--------+------------+-------------+----------+
| James  | Joyce     |     37 | Ulysses    |         344 |     1022 |
| Ernest | Hemingway |    733 | Old Man... |         887 |      387 |
| James  | Joyce     |    872 | Dubliners  |         405 |     1022 |
| Name1  | Surname1  |      1 | Title1     |           1 |        1 |
| James  | Joyce     |     37 | Ulysses    |         345 |     1022 |
| Name1  | Surname1  |      1 | Title1     |           2 |        1 |
+--------+-----------+--------+------------+-------------+----------+

The file can hold up to 500k rows. What we're after is checking that there are no duplicate combinations of the BookID and AuthorID values. So for instance, in the table above there can be no two rows with a BookID of 37 and an AuthorID of 1022.

It's likely, but not guaranteed, that the author will be grouped on consecutive lines. If it isn't and that's too tricky to check, I can live with it. But otherwise, if the author is the same, we need to know whether a duplicate BookID is present.

One complication: we can have duplicate BookIDs in the file, but it's the combination of AuthorID + BookID that is not allowed to repeat.

Is there a good way of checking this in a bash script, perhaps some combo of sed and awk or another means of accomplishing this?

Raw tab-separated file contents for scripting:

First   Last    BookID  Title   PublisherID AuthorID
James   Joyce   37  Ulysses 344 1022
Ernest  Hemingway   733 Old Man...  887 387
James   Joyce   872 Dubliners   405 1022
Name1   Surname1    1   Title1  1   1
James   Joyce   37  Ulysses 345 1022
Name1   Surname1    1   Title1  2   1

This is pretty easy with awk:

$ awk 'BEGIN { FS = "\t" }
       ($3,$6) in seen { printf("Line %d is a duplicate of line %d\n", NR, seen[$3,$6]); next }
       { seen[$3,$6] = NR }' input.tsv

It saves the line number of each (BookID, AuthorID) pair in a hash table and warns if that pair has already been seen.
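
If this needs to run inside a bash script, one possible variation of the same idea (a sketch, not the only way to do it) skips the header row and exits non-zero when any duplicate is found, so the caller can react to it. It assumes the raw tab-separated sample is saved as input.tsv, as in the command above:

awk 'BEGIN { FS = "\t" }
     NR > 1 {                                   # skip the header row
         if (($3, $6) in seen) {                # combo already recorded -> report it
             printf("Line %d is a duplicate of line %d\n", NR, seen[$3, $6])
             dup = 1
         } else {
             seen[$3, $6] = NR                  # remember where this combo first appeared
         }
     }
     END { exit dup }' input.tsv

On the sample data this would print "Line 6 is a duplicate of line 2" and "Line 7 is a duplicate of line 5" (NR counts the header as line 1) and exit with status 1.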

If you want to find and count the duplicates, you can use

awk '{c[$3 " " $6]+=1} END { for (k in c) if (c[k] > 1) print k "->" c[k]}'

which saves the count of each BookID/AuthorID combination in an associative array and then prints any combination whose count is greater than 1.
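
If you also want to see the offending rows themselves rather than just the key and count, one possible sketch is a two-pass variant of the same idea. It assumes the raw tab-separated sample is saved as records.txt (the filename is just an example):

awk -F'\t' 'NR == FNR { c[$3 FS $6]++; next }   # first pass: count each BookID/AuthorID combination
            FNR > 1 && c[$3 FS $6] > 1          # second pass: print every data row whose combination repeats
           ' records.txt records.txt

On the sample above this prints both Ulysses rows and both Title1 rows, so you can see exactly which records collide.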

As @Cyrus already said in the comments, your question is not really clear, but it looks interesting, so I attempted to understand it and provide a solution based on a few assumptions.

Assuming we have the following records.txt file:

First   Last        BookID      Title           PublisherID     AuthorID
James   Joyce       37          Ulysses         344             1022
Ernest  Hemingway   733         Old Man...      887             387
James   Joyce       872         Dubliners       405             1022
Name1   Surname1    1           Title1          1               1
James   Joyce       37          Ulysses         345             1022
Name1   Surname1    1           Title1          2               1

we are going to remove lines which have duplicated BookID (column 3) and AuthorID (column 6) values at the same time. We assume that First, Last and Title are then also the same, so we don't have to take them into consideration, and that PublisherID may be different or the same (it doesn't matter). The location of the records in the file doesn't matter either (duplicated lines don't have to be grouped together).

With these assumptions in mind, the expected output for the input provided above is as follows:

Ernest  Hemingway   733         Old Man...      887             387
James   Joyce       872         Dubliners       405             1022
James   Joyce       37          Ulysses         344             1022
Name1   Surname1    1           Title1          1               1

Duplicate records of the same book by the same author were removed, keeping a single row (one publisher) for each.

Here's my solution to this problem in Bash:

#!/usr/bin/env bash

file_name="records.txt"
# Collect the BookID+AuthorID combinations (columns 3 and 6 concatenated) that occur more than once
repeated_books_and_authors_ids=($(cat $file_name | awk '{print $3$6}' | sort | uniq -d))

# Build two awk conditions: one excluding the repeated combinations, one matching only them
for i in "${repeated_books_and_authors_ids[@]}"
do
    awk_statment_exclude="$awk_statment_exclude\$3\$6 != $i && "
    awk_statment_include="$awk_statment_include\$3\$6 ~ $i || "
done

# Drop the trailing "&& ", then print the records that are not repeated (sed removes the header)
awk_statment_exclude=${awk_statment_exclude::-3}
awk_statment_exclude="awk '$awk_statment_exclude {print \$0}'"
not_repeated_records="cat $file_name | $awk_statment_exclude | sed '1d'"
eval $not_repeated_records

# Drop the trailing "|| ", then print the repeated records, sorted, keeping every other line (one copy of each pair)
awk_statment_include=${awk_statment_include::-3}
awk_statment_include="awk '$awk_statment_include {print \$0}'"
repeated_records_without_duplicates="cat $file_name | $awk_statment_include | sort | awk 'NR % 2 != 0'"
eval $repeated_records_without_duplicates

It's probably not the best possible solution, but it works.

Regards,

Piotr

Comments
  • Please add sample input (no descriptions, no images, no links) and your desired output for that sample input to your question (not in a comment).
  • The goal is that you add some code of your own to your question to show at least the research effort you made to solve this yourself.
  • @Cyrus the editor mangled my code table, which is why I used an image, to help with readability. Not sure how best to paste in a tab-separated table to make it look good on screen?
  • SO is more amenable to CSV than TSV data. I suggest using that in questions and making the (usually trivial) adjustments from commas to tabs for your actual code.
  • @larryq, not an answer to your actual question, but do look up ascii table generators e.g. this. I usually use them to paste tables in forums.
  • +1. I would explicitly use -F$'\t' here since default FS is [ \t\n]+, which would potentially break on titles containing spaces.
  • Also, you don't need an END section. This oneliner would work just as well. awk -F$'\t' '!c[$3 FS $6]++ && NR>1' records.txt
  • I appreciate this. What part of my question wasn't clear, so I can edit and make it better? I thought I explained it well, however if there are bits you're not sure about, I'm happy to clarify.
  • I would upvote just for recreating the input file by hand. But: 1) eval is completely evil and unnecessary here. What is wrong with just executing awk "$awk_statment_exclude" | sed '1d'? 2) What is the : doing there? 3) $file_name is unquoted everywhere, and the cat is an unnecessary use of cat.
  • I'm not an expert in bash, so this script may not follow best practices. I'll try to fix it later.
  • After publishing the answer, I noticed that this script won't work for cases with more than 2 duplicates, but I see that others gave more concise solutions, which can be considered instead of mine.