Replacing quotes in a file

Posted by Matthijs on Super User
Published on 2012-10-09T07:10:47Z

I have a large number of large semicolon-separated data files. All string fields are surrounded by double quotes. In some of the files, there are extra quotes in the string fields, which messes up the subsequent importing of the data for analysis (I'm importing to Stata).

This code lets me see the problematic quotes using GNU awk (gawk, which is needed for FPAT):

echo '"This";"is";1;"line" of" data";""with";"extra quotes""' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")"}; {for ( i=1 ; i<=NF ; i++ ) if ($i ~ /^"(.*".*)+"$/) {print NR, $i}}'
1 "line" of" data"
1 ""with"
1 "extra quotes""

but I do not know how to replace them.

I was thinking of doing the replacement by hand, but some of the files turn out to contain several hundred matches. I know about awk's -sub-, -gsub-, and -match- functions, but I am not sure how to design a search-and-replace for this specific problem.

In the example above, the respective fields should be "This", "is", 1, "line of data", "with", "extra quotes", that is: all semicolons are separators, and all quotes except for the outermost quotes should be removed.
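
The rule described above can be sketched in plain awk. This is only a minimal sketch under stated assumptions, not a verified solution for the real files: it assumes, as stated, that no string field contains a semicolon, so splitting on ; is safe, and the function name strip_inner_quotes is made up for illustration. It does not rely on gawk's FPAT.

```shell
# strip_inner_quotes is a hypothetical helper name, not an existing tool.
# Assumption: every semicolon is a field separator (none occur inside strings).
strip_inner_quotes() {
  awk 'BEGIN { FS = OFS = ";" }
  {
    for (i = 1; i <= NF; i++) {
      # Only touch fields that start and end with a double quote.
      if ($i ~ /^".*"$/) {
        inner = substr($i, 2, length($i) - 2)  # drop the outermost quotes
        gsub(/"/, "", inner)                   # remove every remaining quote
        $i = "\"" inner "\""                   # put the outer quotes back
      }
    }
    print
  }' "$@"
}

echo '"This";"is";1;"line" of" data";""with";"extra quotes""' | strip_inner_quotes
# prints: "This";"is";1;"line of data";"with";"extra quotes"
```

To process a file rather than standard input, the file name can be passed as an argument and the output redirected to a new file. Fields that have only an opening or only a closing quote are left unchanged by this sketch, so those would still need to be checked separately.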

Should I maybe use -sed-, or is -awk- the right tool? Hope you can help me out!

Thanks,

Matthijs

© Super User or respective owner
