Categorize a set of phrases into sets of similar phrases

Posted by Dingo on Stack Overflow
Published on 2010-12-26T09:32:53Z Indexed on 2010/12/26 9:54 UTC

I have a few apps that write textual tracing information (logs) to log files. The tracing is typical printf() style, i.e. many log entries are similar (they come from the same format argument to printf) and differ only where the format string had parameters.

What algorithm (URLs, books, articles, ...) would let me analyze the log entries and categorize them into several bins/containers, where each bin has one associated format?
Essentially, I would like to transform the raw log entries into (formatA, arg0 ... argN) instances, where formatA is shared among many log entries. formatA does not have to be the exact format used to generate the entry (even more so if relaxing this makes the algorithm simpler).
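
To make the target representation concrete, here is a minimal Python sketch of the (format, args) output shape. It cheats by assuming the variable parts are always standalone numeric or hex tokens, which real logs will not guarantee; the names `split_entry`, `categorize` and the `<*>` placeholder are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: only numeric / hex tokens count as printf arguments.
_VARIABLE = re.compile(r"^(0x[0-9a-fA-F]+|-?\d+(\.\d+)?)$")


def split_entry(entry):
    """Return (format, args) for one whitespace-tokenized log entry."""
    fmt_tokens, args = [], []
    for token in entry.split():
        if _VARIABLE.match(token):
            fmt_tokens.append("<*>")  # placeholder for a variable token
            args.append(token)
        else:
            fmt_tokens.append(token)
    return " ".join(fmt_tokens), tuple(args)


def categorize(entries):
    """Group entries into bins keyed by their recovered format."""
    bins = defaultdict(list)
    for entry in entries:
        fmt, args = split_entry(entry)
        bins[fmt].append(args)
    return dict(bins)
```

Each key of the returned dict plays the role of formatA, and each value is the list of argument tuples that share it.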

Most of the literature and web info I found deals with exact matching, maximal substring matching, or k-difference matching (with k known/fixed ahead of time). It also focuses on matching a pair of (long) strings, or on producing a single bin as output (one match among all inputs). My case is somewhat different: I have to discover what constitutes a (good-enough) match (generally a sequence of discontinuous strings), and then assign each input entry to one of the discovered matches.
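
As an illustration of the "discover, then categorize" shape of the problem, the following is a rough sketch of a greedy one-pass token clustering. It is not taken from any of the literature mentioned above. It assumes entries with the same format have the same number of whitespace-separated tokens, and the `threshold` value, the `<*>` wildcard and all function names are arbitrary choices for the example.

```python
WILDCARD = "<*>"


def similarity(template, tokens):
    """Fraction of positions where the template and the entry agree."""
    same = sum(1 for t, tok in zip(template, tokens) if t == tok)
    return same / len(tokens)


def merge(template, tokens):
    """Keep tokens that agree; turn disagreeing positions into wildcards."""
    return [t if t == tok else WILDCARD for t, tok in zip(template, tokens)]


def discover(entries, threshold=0.5):
    """Return a list of (template, member_entries) pairs."""
    templates = []  # each template is a list of tokens
    members = []    # members[i] holds the entries assigned to templates[i]
    for entry in entries:
        tokens = entry.split()
        if not tokens:
            continue
        best, best_score = None, threshold
        for i, template in enumerate(templates):
            if len(template) != len(tokens):
                continue  # assumption: same format => same token count
            score = similarity(template, tokens)
            if score >= best_score:
                best, best_score = i, score
        if best is None:
            # No existing template is close enough: start a new bin.
            templates.append(tokens)
            members.append([entry])
        else:
            templates[best] = merge(templates[best], tokens)
            members[best].append(entry)
    return [(" ".join(t), m) for t, m in zip(templates, members)]
```

The constant tokens that survive merging form the discontinuous sequence of strings for a bin. Arguments that contain whitespace (for example a %s that expands to several words) break the equal-token-count assumption and would need a real alignment step instead of `zip`.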

Lastly, I'm not looking for a perfect algorithm, but something simple/easy to maintain.

Thanks!

