I have a large number of large semicolon-separated data files. All string fields are surrounded by double quotes. In some of the files, there are extra quotes in the string fields, which messes up the subsequent importing of the data for analysis (I'm importing to Stata).
This code allows me to see the problematic quotes using gnu-awk:
echo '"This";"is";1;"line" of" data";""with";"extra quotes""' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")"}; {for ( i=1 ; i<=NF ; i++ ) if ($i ~ /^"(.*".*)+"$/) {print NR, $i}}'
1 "line" of" data"
1 ""with"
1 "extra quotes""
but I do not know how to replace them.
I was thinking of doing the replace manually, but it turns out that there are several hundred matches in some of the files. I know about awk's -sub-, -gsub-, and -match- functions, but I am not sure how to design a search and replace for this specific problem.
In the example above, the respective fields should be "This", "is", 1, "line of data", "with", "extra quotes", that is: all semicolons are separators, and all quotes except for the outermost quotes should be removed.
Should I may be use -sed-, or is -awk- the right tool? Hope you can help me out!
Thanks,
Matthijs