Intelligent "Subtraction" of one text logfile from another

Posted by Vi on Super User
Published on 2010-07-27T16:33:43Z

Filed under: text | textutils | logfiles | subtract | artificial-intelligence

Example: an application generates a large text log file A with many different messages. When it does not function correctly, it generates a similarly large log file B.

I want to see which messages in file B are essentially new, i.e. to filter out everything that also appears in A.

A trivial prototype is:

sort -u A > A.uniq
sort -u B > B.uniq
sort A.uniq B.uniq | uniq -c | grep -v '^ *2 '

This produces a symmetric difference and is inconvenient. How can it be done better, ideally producing a non-symmetric difference (B minus A) and preserving the order of messages in B?
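For the exact-match case, a non-symmetric, order-preserving difference can be sketched with standard grep options. This is only a minimal sketch, assuming whole-line equality is good enough (so it does not yet ignore timestamps or addresses); the function name subtract_exact is made up for illustration.

```shell
# Hypothetical helper: print the lines of file $2 that do not occur
# verbatim in file $1, in the order they appear in $2.
#   -F  treat each line of $1 as a fixed string, not a regex
#   -x  require the whole line to match
#   -v  invert: keep only lines that match none of the patterns
#   -f  read the patterns from file $1
subtract_exact() {
    grep -vxFf "$1" "$2"
}
```

Usage would be subtract_exact A B. On the sample logs in this question it would print essentially all of B, because every line differs from A in its timestamp or buffer address.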

The program should first analyse A and learn which messages are common, then analyse B and show which messages need attention.

Ideally it should automatically disregard timestamps, line numbers, and other volatile values.
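A rough way to approximate "disregard volatile things" is to compare normalized copies of the lines while still printing the original text. The sketch below is an assumption-laden illustration, not an automatic solution: the two masking patterns (an h:mm:ss.fff style timestamp and a 0x hex address) are hand-written to fit the sample logs, and the name subtract_normalized is hypothetical.

```shell
# Hypothetical helper: print the lines of file $2 whose normalized form
# never occurs in file $1, keeping the order of $2.
subtract_normalized() {
    awk '
        # Replace volatile fields with fixed placeholders. The patterns are
        # assumptions fitted to the sample logs, not a general solution.
        function norm(s) {
            gsub(/[0-9]+:[0-9]+:[0-9]+\.[0-9]+/, "<TIME>", s)  # timestamps
            gsub(/0x[0-9A-Fa-f]+/, "<HEX>", s)                 # addresses
            gsub(/[ \t]+/, " ", s)                             # squeeze blanks
            sub(/ $/, "", s)                                   # drop trailing blank
            return s
        }
        # First file: remember every normalized line.
        FILENAME == ARGV[1] { seen[norm($0)] = 1; next }
        # Second file: print original lines whose normalized form is new.
        !(norm($0) in seen)
    ' "$1" "$2"
}
```

Called as subtract_normalized A B, this should keep only lines of B whose message text does not appear in A once timestamps and addresses are masked. Any other volatile field would need its own pattern.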

Example. A:

0:00:00.234  Received buffer 0x324234
0:00:00.237     Processeed buffer 0x324234
0:00:00.238     Send buffer 0x324255
0:00:03.334  Received buffer 0x324255
0:00:03.337     Processeed buffer 0x324255
0:00:03.339     Send buffer 0x324255
0:00:05.171  Received buffer 0x32421A
0:00:05.173     Processeed buffer 0x32421A
0:00:05.178     Send buffer 0x32421A

B:

0:00:00.134  Received buffer 0x324111
0:00:00.137     Processeed buffer 0x324111
0:00:00.138     Send buffer 0x324111
0:00:03.334  Received buffer 0x324222
0:00:03.337     Processeed buffer 0x324222
0:00:03.338     Error processing buffer 0x324222 
0:00:03.339     Send buffer 0x3242222
0:00:05.271  Received buffer 0x3242FA
0:00:05.273     Processeed buffer 0x3242FA
0:00:05.278     Send buffer 0x3242FA
0:00:07.280     Send buffer 0x3242FA failed

Result:

0:00:03.338     Error processing buffer 0x324222 
0:00:07.280     Send buffer 0x3242FA failed

One way of solving it could be something like this:

Split each line into logical units. For example, 0:00:00.134 Received buffer 0x324111 could yield: 0:00:00.134, Received, buffer, 0x324111, 324111, Received buffer, \d:\d\d:\d\d\.\d\d\d, \d+:\d+:\d+\.\d+, 0x[0-9A-F]{6}, ... It should find individual words, simple patterns in numbers, and common layouts (e.g. "some date, then text, then a number, then text, then end_of_line"), and also handle combinations of the above. As this is not an easy task, user assistance should be supported (but not required): adding regexes with explicit rules such as "disregard that", "make the main factor", "don't split to parts", "consider as date/number", "take care of order/quantity of such messages".
Find recurring units and "categorize" lines; filter out overly volatile things like timestamps, addresses, or line numbers.
Analyse the second file and find lines that have new logical units (one-time or recurring), or anything else that would "surprise" a system that has got used to the first file.
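The three steps above can be sketched very crudely with awk. This is not the learning system described, only a stand-in under one loudly stated assumption: any whitespace-separated token that contains a digit is treated as volatile and masked, and what remains is the line's "template". The function names line_templates, categorize and novel_lines are invented for this illustration.

```shell
# Hypothetical helper (step 1): reduce each input line to a "template" by
# replacing every whitespace-separated token that contains a digit with <*>.
# Assumption: tokens with digits (timestamps, addresses, counters) are volatile.
line_templates() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /[0-9]/) $i = "<*>"
        $1 = $1      # rebuild the line so fields are joined by single spaces
        print
    }' "$@"
}

# Step 2: "categorize" lines by counting how often each template occurs.
categorize() {
    line_templates "$1" | sort | uniq -c | sort -rn
}

# Step 3: print the lines of $2 whose template never occurs in $1,
# keeping the order of $2.
novel_lines() {
    awk '
        function template(line,    n, parts, i, out) {
            n = split(line, parts, " ")   # " " means: split on runs of blanks
            out = ""
            for (i = 1; i <= n; i++)
                out = out (i > 1 ? " " : "") (parts[i] ~ /[0-9]/ ? "<*>" : parts[i])
            return out
        }
        FILENAME == ARGV[1] { known[template($0)] = 1; next }
        !(template($0) in known)
    ' "$1" "$2"
}
```

With the sample logs, novel_lines A B should print the two lines listed under Result, and categorize A shows which templates dominate the normal log. The heuristic is blunt: two messages that differ only in a number (say, an error code) collapse into the same template, which is where user-supplied rules would have to come in.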

Example of doing part of this manually:

$ head -n 1 A
0:00:00.234  Received buffer 0x324234

$ grep -Ev "Received buffer" A | head -n 1
0:00:00.237     Processeed buffer 0x324234

$ grep -Ev "Received buffer|Processeed buffer" A | head -n 1
0:00:00.238     Send buffer 0x324255

$ grep -Ev "Received buffer|Processeed buffer|Send buffer" A | head -n 1

$ grep -Ev "Received buffer|Processeed buffer|Send buffer" B
0:00:03.338     Error processing buffer 0x324222 
0:00:07.280     Send buffer 0x3242FA failed

This is tedious (there are a lot of message types), and I can accidentally include a pattern that is too broad. It also can't handle more complicated things like interrelations between messages.

I know that this is AI-related. Maybe there are already tools developed for this?
