How to remove duplicates and get SUM of a particular column in unix shell?

I have a file containing data as below (the number of files always varies):

May 1 09:00 ./archive/xxx_cs_app_gmas_reject_MDM_20180501090001.csv 0.000       2   ✔
May 1 17:45 ./archive/xxx_cs_app_gmas_reject_MDM_20180501174500.csv 0.055       185 ✈

I want the output on one line like below, but I need the sum of the record count (column 6), and it would be better to print the latest file (according to the timestamp):

May 1 17:45 ./archive/xxx_cs_app_gmas_reject_MDM_20180501090001.csv 0.000       187   ✔

I tried the sed command below so that I can ignore the timestamp of the file. But can you please help me remove the duplicates along with the SUM of counts (column 6)?

sed -e 's/\(.*\)[0-9]\{6\}\.\([^.]*\)/\1.\2/'
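
For reference, that sed only normalizes the filename (it strips the HHMMSS part before .csv); something still has to do the grouping and summing. A minimal sketch of combining it with awk (file is a placeholder; note this keeps only the normalized name and the summed count, not the other columns):

sed -e 's/\(.*\)[0-9]\{6\}\.\([^.]*\)/\1.\2/' file |
awk '{ sum[$4] += $6 } END { for (k in sum) print k, sum[k] }'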

Borrowing @Allan's sample input:

$ cat tst.awk
{
    base = time = $4                    # $4 is the file path
    sub(/_[^_]+$/,"",base)              # strip the trailing _<timestamp>.csv -> grouping key
    sub(/.*_/,"",time)                  # keep only the trailing <timestamp>.csv
    if (time > max[base]) {             # fixed-width timestamps compare correctly as strings
        max[base] = time
        pre[base] = $1 OFS $2 OFS $3 OFS $4 OFS $5   # remember columns 1-5 of the latest file
        suc[base] = $7                  # remember the trailing symbol (column 7)
    }
    tot[base] += $6                     # sum column 6 per grouping key
}
END {
    for (base in tot) {
        print pre[base], tot[base], suc[base]
    }
}

$ awk -f tst.awk file
May 1 17:45 ./archive/xxx_cs_app_gmas_reject_MDM_20180501174500.csv 0.055 187 ✈
May 2 12:00 ./archive/xxx_123_20180502120001.csv 0.000 3 ✒
May 1 18:45 ./archive/xxx_uvw_ABC_20180501184500.csv 0.055 138 ✕
May 1 19:45 ./archive/xxx_456_20180501194500.csv 0.055 135 ✕
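
Note that for (base in tot) visits the array keys in an unspecified order, so the grouped lines can come out in any order. If a stable order matters, one option is to sort on the filename column afterwards (a minimal sketch):

awk -f tst.awk file | sort -k4,4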

An awk solution that works even if the csv files are not in timestamp order, or indeed in no particular order at all!

INPUT:

$ more csv_list.input 
May 1 09:00 ./archive/xxx_cs_app_gmas_reject_MDM_20180501090001.csv 0.000       2   ✔
May 1 17:45 ./archive/xxx_cs_app_gmas_reject_MDM_20180501174500.csv 0.055       185 ✈
May 1 12:00 ./archive/xxx_uvw_ABC_20180501120001.csv 0.000       3   ✒
May 2 12:00 ./archive/xxx_123_20180502120001.csv 0.000       3   ✒
May 1 18:45 ./archive/xxx_uvw_ABC_20180501184500.csv 0.055       135 ✕
May 1 19:45 ./archive/xxx_456_20180501194500.csv 0.055       135 ✕

AWK 1-LINER CMD:

awk '{tmp=$4;gsub(/_[0-9]{14}\.csv/,"",$4);a[$4]+=$6;sub(/\.csv$/,"",tmp); tmp=substr(tmp,length(tmp)-13, length(tmp));if(!timestamp[$4] || tmp>timestamp[$4]){timestamp[$4]=tmp;line1[$4]=$1 OFS $2 OFS $3; line2[$4]=$5; line3[$4]=$7};}END{for(i in a){print line1[i] OFS i"_"timestamp[i]".csv" OFS line2[i] OFS a[i] OFS line3[i]}}' csv_list.input 
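
A small portability note (an assumption about the environment, not something the question requires): the {14} interval works with POSIX awk and current GNU awk, but a few older awk builds ignore brace intervals. For these filenames, an equivalent and more portable substitution would be:

sub(/_[0-9]+\.csv$/, "", $4)   # strip the trailing _<digits>.csv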

AWK SCRIPT & EXPLANATIONS:

# gawk profile, created Wed May  2 15:00:50 2018

# Rule(s)

{
        tmp = $4 
        gsub(/_[0-9]{14}\.csv/, "", $4) # find the filename without the timestamp
        a[$4] += $6 # sum the 6th column value, key = filename without timestamp
        sub(/\.csv$/, "", tmp) # remove the .csv
        tmp = substr(tmp, length(tmp) - 13, length(tmp)) # get the 14-digit timestamp of the file
        if (! timestamp[$4] || tmp > timestamp[$4]) { # if the element is empty or if the new timestamp is bigger than the previous one
                timestamp[$4] = tmp # save the new timestamp
                line1[$4] = $1 OFS $2 OFS $3 # save the first 3 columns of the latest file
                line2[$4] = $5 # save the 5th column
                line3[$4] = $7 # save the 7th column
        }
}

# END rule(s)

END {
        for (i in a) { # recombine the information to generate the output
                print line1[i] OFS i "_" timestamp[i] ".csv" OFS line2[i] OFS a[i] OFS line3[i]
        }
}

OUTPUT:

May 1 17:45 ./archive/xxx_cs_app_gmas_reject_MDM_20180501174500.csv 0.055 187 ✈
May 2 12:00 ./archive/xxx_123_20180502120001.csv 0.000 3 ✒
May 1 18:45 ./archive/xxx_uvw_ABC_20180501184500.csv 0.055 138 ✕
May 1 19:45 ./archive/xxx_456_20180501194500.csv 0.055 135 ✕

Here is a Perl one-liner:

$ cat barcode.txt
May 1 09:00 ./archive/xxx_cs_app_gmas_reject_MDM_20180501090001.csv 0.000       2   ✔
May 1 17:45 ./archive/xxx_cs_app_gmas_reject_MDM_20180501174500.csv 0.055       185 ✈
May 1 19:45 ./archive/xxx_cs_app_gmas_reject_MDM_20180501194500.csv 0.055       3 ⤈
May 1 17:45 ./archive/aaa_cs_app_g_reject_MDM_20180502174500.txt 0.055       1 ✈
May 1 17:45 ./archive/aaa_cs_app_g_reject_MDM_20180502184500.txt 0.015       2 ✈

$ perl -lane ' $x=$F[3];$x=~s/(.*)(_.*)/$1/g; $kv{$x}+=$F[5];$kv2{$x}=$_; END {for(keys %kv) { $kv2{$_}=~s/(.*)(\d+)(\s\S+)$/$1$kv{$_}$3/g; print $kv2{$_} } } ' barcode.txt
May 1 17:45 ./archive/aaa_cs_app_g_reject_MDM_20180502184500.txt 0.015       3 ✈
May 1 19:45 ./archive/xxx_cs_app_gmas_reject_MDM_20180501194500.csv 0.055       190 ⤈
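
Briefly, how it works: -lane autosplits each line into @F; everything from the last underscore of $F[3] (the path) onward is stripped, and that base name is the hash key used to sum $F[5] in %kv, while %kv2 remembers the last line seen for each key. The END block then substitutes the accumulated sum in place of the count in each remembered line and prints it, so this relies on the input being in timestamp order, just like the first awk answer. (One caveat: because (.*) is greedy, only the last digit of the remembered count is actually replaced, which happens to be harmless for the sample data above where the last line of each group has a single-digit count.)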

Comments
  • Are you sure the latest file according to the timestamp is not the other one, giving the output: May 1 17:45 ./archive/xxx_cs_app_gmas_reject_MDM_20180501174500.csv 0.055 187 ✈ ? Since 20180501174500 = May 1 17:45 is more recent than 20180501090001 = May 1 09:00. After this clarification I will answer your case.
  • Hi Allan, yes 17:45 is the latest file and it is not the other one.
  • Another 2 questions: does your file have more lines? Are the entries always in the order older file -> most recent file?
  • 1> No, the file has only 7 columns and there is no way to get more columns. 2> Yes, the entries are always in order by timestamp.
  • Thanks for your help! You are awesome :)
  • Nice answer for ordered csv files ;-) +1!
  • Why are you explicitly stating OFS in the print? Use print x, y, not print x OFS y - that's what the value of OFS is for (see the quick illustration below).
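
For anyone unsure about the distinction made in that last comment: both forms produce the same output here; the comma is simply the idiomatic way to let OFS do its job. A quick illustration, using an arbitrary separator:

$ echo 'a b' | awk -v OFS='-' '{ print $1, $2; print $1 OFS $2 }'
a-b
a-b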