Attempting to extract a pattern within a string
Posted by Brian on Stack Overflow
Published on 2010-06-05T16:52:55Z
Indexed on 2010/06/05 17:22 UTC
I'm attempting to extract a given pattern from a text file; however, the results are not quite what I want.
Here's my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseText1 {
    public static void main(String[] args) {
        String content = "<p>Yada yada yada <code> foo ddd</code>yada yada ...\n"
                + "more here <2004-08-24> bar<Bob Joe> etc etc\n"
                + "more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n"
                + "more here again <2004-08-24> bar<Bob Joe><Fred Kej> etc etc\n"
                + "and still more <2004-08-21><2004-08-21> baz <John Doe> and now <code>the end</code> </p>\n";
        Pattern p = Pattern
                .compile("<[1234567890]{4}-[1234567890]{2}-[1234567890]{2}>.*?<[^%0-9/]*>",
                        Pattern.MULTILINE);
        Matcher m = p.matcher(content);
        // print all the matches that we find
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
The output I'm getting is:
<2004-08-24> bar<Bob Joe>
<2004-09-24> bar<Bob Joe> <Fred Kej>
<2004-08-24> bar<Bob Joe><Fred Kej>
<2004-08-21><2004-08-21> baz <John Doe> and now <code>
The output I want is:
<2004-08-24> bar<Bob Joe>
<2004-09-24> bar<Bob Joe>
<2004-08-24> bar<Bob Joe>
<2004-08-21> baz <John Doe>
In short, each sequence of "date", "text (or blank)", and "name" must be extracted, and everything else should be left out. For example, the tag "Fred Kej" has no "date" tag before it, so it should be flagged as invalid. Likewise, a "date" tag that is followed directly by another "date" tag should not start a match.
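A minimal sketch of one way this rule could be written as a single regular expression follows. It is not a confirmed answer: the class name DateNameExtractor and the method extract are hypothetical, and the pattern assumes that neither the text between the two tags nor the name itself may contain '<' or '>'. Under that assumption a character class replaces the lazy ".*?", so a match can no longer run across an intervening tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateNameExtractor {
    // Assumed reading of the requirement:
    //   <yyyy-mm-dd>   a date tag
    //   [^<>]*         text (possibly empty) that contains no angle brackets
    //   <[^<>%0-9/]+>  a non-empty tag without digits, '%', '/' or nested brackets
    static final Pattern ENTRY =
            Pattern.compile("<\\d{4}-\\d{2}-\\d{2}>[^<>]*<[^<>%0-9/]+>");

    // Returns every date/text/name sequence found in content, in order.
    static List<String> extract(String content) {
        List<String> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(content);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "<p>Yada yada yada <code> foo ddd</code>yada yada ...\n"
                + "more here <2004-08-24> bar<Bob Joe> etc etc\n"
                + "more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n"
                + "more here again <2004-08-24> bar<Bob Joe><Fred Kej> etc etc\n"
                + "and still more <2004-08-21><2004-08-21> baz <John Doe> and now <code>the end</code> </p>\n";
        for (String entry : extract(content)) {
            System.out.println(entry);
        }
    }
}
```

On the sample text this prints the four date/text/name sequences and leaves out "<Fred Kej>" and the doubled date tag. One known limitation of the sketch: the name part only excludes digits, '%', '/' and brackets, so a tag such as "<code>" placed directly after a date and some text would still be accepted as a name.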
Also, as a side question: is there a way to store or track the text snippets that were skipped or rejected, as well as the valid ones?
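For the side question, one possible approach is to remember where the previous match ended and to treat everything between two matches as skipped text, using Matcher.start() and Matcher.end(). The sketch below is only an illustration: the class name SkippedTextTracker and the method scan are hypothetical, and the pattern is an assumed strict form of the date/text/name rule (no '<' or '>' inside the text or the name), not a confirmed solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkippedTextTracker {
    // Assumed pattern: date tag, bracket-free text, then a tag without digits, '%' or '/'.
    static final Pattern ENTRY =
            Pattern.compile("<\\d{4}-\\d{2}-\\d{2}>[^<>]*<[^<>%0-9/]+>");

    // Adds each match to accepted and each stretch of unmatched text to skipped.
    static void scan(String content, List<String> accepted, List<String> skipped) {
        Matcher m = ENTRY.matcher(content);
        int last = 0; // index just past the previous match
        while (m.find()) {
            if (m.start() > last) {
                skipped.add(content.substring(last, m.start()));
            }
            accepted.add(m.group());
            last = m.end();
        }
        // Whatever follows the final match was skipped as well.
        if (last < content.length()) {
            skipped.add(content.substring(last));
        }
    }

    public static void main(String[] args) {
        List<String> accepted = new ArrayList<>();
        List<String> skipped = new ArrayList<>();
        scan("more here again <2004-09-24> bar<Bob Joe> <Fred Kej> etc etc\n",
                accepted, skipped);
        System.out.println("accepted: " + accepted);
        System.out.println("skipped:  " + skipped);
    }
}
```

Keeping the two lists separate is the simplest layout. If the original order of accepted and skipped pieces matters, a single list of labelled segments would work in the same loop.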
Thanks, Brian
© Stack Overflow or respective owner