SED - Regular Expression over multiple lines
Posted
by
herrherr
on Stack Overflow
See other posts from Stack Overflow
or by herrherr
Published on 2010-12-22T15:41:09Z
Indexed on
2010/12/22
15:54 UTC
Read the original article
Hit count: 497
Hi there,
I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:
Alan Smithee ist ein Anagramm von „The [...] „Alan Smythee“, und „Adam Smithee“." ,Alan Smithee Die Aussagenlogik ist der Bereich der Logik, der sich mit [...] ihrer Teilaussagen bestimmen. ,Aussagenlogik
As you can probably see the words ",Alan Smithee" and ",Aussagenlogik" should actually be on the same line as the foregoing sentence. Then it would look something like this:
Alan Smithee ist ein Anagramm von „The Smitheeeee [...] „Alan Smythee“, und „Adam Smithee“.,Alan Smithee Die Aussagenlogik ist der Bereich der Logik, der sich mit [...] ihrer Teilaussagen bestimmen.,Aussagenlogik
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Thanks, Chris
© Stack Overflow or respective owner