Uniq in awk; removing duplicate values in a column using awk

Posted by D W on Stack Overflow
Published on 2010-06-04T23:18:42Z Indexed on 2010/06/05 10:02 UTC

Filed under: bash | awk | unique

I have a large data file in the following format:

ENST00000371026 WDR78,WDR78,WDR78,  WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458,  atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

The columns are tab-separated. Multiple values within a column are comma-separated. I would like to remove the duplicate values in the second column, to get something like this:

ENST00000371026 WDR78   WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458   atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,

I tried the following code, but it doesn't remove the duplicate values.

awk ' 
BEGIN { FS="\t" } ;
{
  split($2, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $3

}' knownGeneFromUCSC.txt

How can I remove the duplicates in column 2 correctly?
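
A likely reason the attempt above fails: `valueArray[i] in duplicateArray` tests whether the value is an index of duplicateArray, but that array is indexed by the counter j, so the test never matches. duplicateArray is also never cleared between lines, and `for (i in valueArray)` does not guarantee any particular order. One possible fix, as a minimal sketch, is to key a lookup table by the value itself and reset it for each line. The function name dedup_col2 and the array names values and seen are illustrative, not from the original post.

```shell
#!/bin/sh
# Sketch: remove duplicate comma-separated values from column 2 of
# tab-separated input. dedup_col2 is a hypothetical helper name.
dedup_col2() {
  awk '
  BEGIN { FS = OFS = "\t" }
  NF < 2 { print; next }          # pass through lines without a second column
  {
    n = split($2, values, ",")    # values[1..n], in input order
    split("", seen)               # portable way to empty the lookup table per line
    out = ""
    for (i = 1; i <= n; i++) {
      v = values[i]
      # skip empty pieces (from trailing commas) and values already seen
      if (v == "" || (v in seen)) continue
      seen[v] = 1
      out = (out == "" ? v : out "," v)
    }
    $2 = out                      # assigning a field rebuilds the line with OFS
    print
  }' "$@"
}

# Usage with the file name from the question:
# dedup_col2 knownGeneFromUCSC.txt
```

This keeps the first occurrence of each value in its original order and drops the trailing comma in column 2, which matches the desired output shown above. Columns 1 and 3 are left unchanged.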

© Stack Overflow or respective owner
