categorize a set of phrases into a set of similar phrases
Posted by Dingo on Stack Overflow
Published on 2010-12-26T09:32:53Z
Indexed on 2010/12/26 9:54 UTC
I have a few apps that write textual tracing information (logs) to log files. The tracing is typical printf() style, i.e. many log entries are similar (they share the same format argument to printf) but differ where the format string had parameters.
What algorithm (or URL, book, article, ...) would let me analyze the log entries and categorize them into several bins/containers, where each bin has one associated format?
Essentially, I would like to transform the raw log entries into (formatA, arg0 ... argN) instances, where formatA is shared among many log entries. formatA does not have to be the exact format used to generate the entry, especially if relaxing that requirement makes the algorithm simpler.
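To make the target output concrete, here is a minimal Python sketch of one simple way such a transformation could look. It is an illustration under stated assumptions, not an established algorithm from the literature mentioned in this post: it assumes entries can be split on whitespace, that entries from the same format have the same number of tokens, and that two entries belong together when at least half of their tokens agree position by position. The names `categorize`, `similarity`, `WILDCARD`, and the `threshold` parameter are hypothetical, as are the sample log lines.

```python
from collections import defaultdict

WILDCARD = "<*>"  # hypothetical placeholder marking an argument position


def similarity(template, tokens):
    """Fraction of positions where the template token equals the entry token."""
    same = sum(1 for t, tok in zip(template, tokens) if t == tok)
    return same / len(tokens)


def categorize(lines, threshold=0.5):
    """Group log lines into (format, [args, ...]) bins.

    Assumption: entries produced by one format have the same token count,
    so arguments containing whitespace are not handled by this sketch.
    """
    bins = defaultdict(list)  # token count -> list of {"template", "lines"}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        candidates = bins[len(tokens)]
        best, best_score = None, 0.0
        for entry in candidates:
            score = similarity(entry["template"], tokens)
            if score > best_score:
                best, best_score = entry, score
        if best is not None and best_score >= threshold:
            # Positions that differ become wildcards in the shared template.
            best["template"] = [
                t if t == tok else WILDCARD
                for t, tok in zip(best["template"], tokens)
            ]
            best["lines"].append(tokens)
        else:
            candidates.append({"template": list(tokens), "lines": [tokens]})

    results = []
    for group in bins.values():
        for entry in group:
            template = entry["template"]
            args = [
                tuple(tok for t, tok in zip(template, tokens) if t == WILDCARD)
                for tokens in entry["lines"]
            ]
            results.append((" ".join(template), args))
    return results


if __name__ == "__main__":
    sample = [  # made-up entries, for illustration only
        "connected to 10.0.0.1 port 80",
        "connected to 10.0.0.2 port 8080",
        "user alice logged in",
        "user bob logged in",
    ]
    for fmt, args in categorize(sample):
        print(fmt, args)
```

With the made-up sample above, the sketch yields two bins: `connected to <*> port <*>` with arguments `('10.0.0.1', '80')` and `('10.0.0.2', '8080')`, and `user <*> logged in` with arguments `('alice',)` and `('bob',)`. Because only literal matches count toward the score, a template that has accumulated many wildcards becomes harder to match, so the threshold would need tuning against real logs.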
Most of the literature and web info I found deals with exact matching, maximal substring matching, or k-difference matching (with k known/fixed ahead of time). It also focuses on matching a pair of (long) strings, or on a single-bin output (one match among all inputs). My case is somewhat different: I have to discover what constitutes a (good-enough) match (generally a sequence of discontinuous strings), and then assign each input entry to one of the discovered matches.
Lastly, I'm not looking for a perfect algorithm, just something simple and easy to maintain.
Thanks!
© Stack Overflow or respective owner