SED - Regular Expression over multiple lines
- by herrherr
Hi there,
I've been stuck on this for several hours now and have cycled through a wealth of different tools to get the job done, without success. It would be fantastic if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400 MB+) that is not formatted correctly. Right now it looks something like this:
Alan Smithee ist ein Anagramm von „The
[...]
„Alan Smythee“, und „Adam Smithee“."
,Alan Smithee
Die Aussagenlogik ist
der Bereich der Logik, der sich mit
[...]
ihrer Teilaussagen bestimmen.
,Aussagenlogik
As you can probably see, the fields ",Alan Smithee" and ",Aussagenlogik" should actually be on the same line as the preceding sentence. Then it would look something like this:
Alan Smithee ist ein Anagramm von „The Smitheeeee
[...]
„Alan Smythee“, und „Adam Smithee“.,Alan Smithee
Die Aussagenlogik ist
der Bereich der Logik, der sich mit
[...]
ihrer Teilaussagen bestimmen.,Aussagenlogik
Please note that the sentence may or may not end with a double quote. If it does, that quote should be removed as well.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should match the expression across multiple lines, but unfortunately it doesn't :)
The expression looks for the dot at the end of the sentence, followed by an optional quote and a newline character, which I'm trying to match with .*.
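For comparison, here is a minimal sketch of how the same slurp-the-file approach could be made to work. It rests on three observations about the command above: in sed's default (basic) regex syntax an unescaped ? is a literal question mark rather than "optional"; .* is greedy and, with the whole file in the pattern space, can run to the last comma in the file; and the empty replacement deletes the match instead of keeping the dot and the comma. The function name join_trailing_fields is made up for illustration, and the sketch assumes Unix line endings and enough memory to hold the whole file in sed's pattern space.

```shell
# Hypothetical helper: joins a line that starts with "," onto the previous
# line when that line ends with a dot, optionally followed by a double quote.
join_trailing_fields() {
  # 1h;1!H     collect every line in the hold space
  # ${g;...}   on the last line, copy the hold space into the pattern space
  # s/.../.,/g replace: dot, optional quote (\{0,1\}), newline, comma
  #            with just dot + comma, which also drops the quote
  sed -n '1h;1!H;${g;s/\."\{0,1\}\n,/.,/g;p;}' "$@"
}
```

It would be called the same way as the original command, for example: join_trailing_fields out.csv > out1.csv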
Any help is much appreciated. It doesn't really matter which tool gets the job done (awk, perl, sed, tr, etc.).
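Since any tool is acceptable, a streaming variant in awk may suit a 400 MB file better, because it only keeps one line in memory at a time. The following is a sketch under assumptions, not a tested solution for the real data: the function name join_trailing_fields_stream is hypothetical, Unix line endings are assumed, and a line is only joined when it starts with a comma and the previous line ends with a dot or a dot followed by a double quote.

```shell
# Hypothetical helper: streaming join, holds only the previous line in memory.
join_trailing_fields_stream() {
  awk '
    # Remember the first line; there is nothing to join it to yet.
    NR == 1 { prev = $0; next }

    # Current line starts with "," and the previous line ends with . or ."
    /^,/ && prev ~ /\."?$/ {
      sub(/"$/, "", prev)   # drop the trailing quote, if any
      prev = prev $0        # append ",Field" to the previous line
      next
    }

    # Otherwise the previous line is complete: print it and move on.
    { print prev; prev = $0 }

    # Flush the last remembered line.
    END { if (NR > 0) print prev }
  ' "$@"
}
```

Usage would look like: join_trailing_fields_stream out.csv > out1.csv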
Thanks,
Chris