Intelligent "Subtraction" of one text logfile from another
- by Vi
Example: an application generates a large text log file A with many different messages. It generates a similarly large log file B when it does not function correctly.
I want to see which messages in file B are essentially new, i.e. to filter out everything that already appears in A.
A trivial prototype is:
$ sort A | uniq > A.u
$ sort B | uniq > B.u
$ cat A.u B.u | sort | uniq -c | grep -v '^ *2'
This produces a symmetric difference and is inconvenient. How can this be done better? (Including a non-symmetric difference and preserving the order of messages in B.)
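A sketch of one improvement for the non-symmetric case: treat A as a set of known lines and filter B against it, preserving B's order. (This only helps once volatile fields are gone, since grep -F compares whole lines literally; the toy data here has none.)

```shell
# Toy data with no volatile fields, to show just the set subtraction.
printf 'alpha\nbeta\ngamma\n' > A.txt
printf 'beta\ndelta\nalpha\nepsilon\n' > B.txt

# The "known messages" set from A.
sort -u A.txt > known.txt

# Lines of B absent from A, in B's original order.
grep -vxFf known.txt B.txt
# -> delta
#    epsilon
```

-F takes the patterns as fixed strings, -x requires whole-line matches, -v inverts, and -f reads the pattern set from a file. Unlike the sort | uniq -c trick, this is non-symmetric and keeps B's original line order.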
The program should first analyse A and learn which messages are common, then analyse B, showing which messages need attention.
Ideally it should automatically disregard volatile things like timestamps, line numbers or addresses.
Example. A:
0:00:00.234 Received buffer 0x324234
0:00:00.237 Processeed buffer 0x324234
0:00:00.238 Send buffer 0x324255
0:00:03.334 Received buffer 0x324255
0:00:03.337 Processeed buffer 0x324255
0:00:03.339 Send buffer 0x324255
0:00:05.171 Received buffer 0x32421A
0:00:05.173 Processeed buffer 0x32421A
0:00:05.178 Send buffer 0x32421A
B:
0:00:00.134 Received buffer 0x324111
0:00:00.137 Processeed buffer 0x324111
0:00:00.138 Send buffer 0x324111
0:00:03.334 Received buffer 0x324222
0:00:03.337 Processeed buffer 0x324222
0:00:03.338 Error processing buffer 0x324222
0:00:03.339 Send buffer 0x3242222
0:00:05.271 Received buffer 0x3242FA
0:00:05.273 Processeed buffer 0x3242FA
0:00:05.278 Send buffer 0x3242FA
0:00:07.280 Send buffer 0x3242FA failed
Result:
0:00:03.338 Error processing buffer 0x324222
0:00:07.280 Send buffer 0x3242FA failed
One way of solving it could be something like this:
Split each line into logical units. For example, 0:00:00.134 Received buffer 0x324111 yields: 0:00:00.134, Received, buffer, 0x324111, 324111, Received buffer, \d:\d\d:\d\d\.\d\d\d, \d+:\d+:\d+\.\d+, 0x[0-9A-F]{6}, ... The tool should find individual words, simple patterns in numbers, and common layouts (e.g. "a date, then text, then a number, then text, then end of line"), and also handle combinations of the above. As this is not an easy task, user assistance should be supported (but not required): adding regexes with explicit rules like "disregard that", "make that the main factor", "don't split into parts", "consider as date/number", "take care of the order/quantity of such messages".
Find recurring units, "categorize" lines, and filter out overly volatile things like timestamps, addresses or line numbers.
Analyse the second file and find lines with new logical units (one-time or recurring), or anything else that would "amaze" a system that has got used to the first file.
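A minimal shell sketch of the masking step on the example above, assuming (for this sample only) that the sole volatile fields are the leading timestamp and hex addresses; real logs would need more normalization rules:

```shell
# Recreate the example logs A and B from the question.
cat > A.log <<'EOF'
0:00:00.234 Received buffer 0x324234
0:00:00.237 Processeed buffer 0x324234
0:00:00.238 Send buffer 0x324255
0:00:03.334 Received buffer 0x324255
0:00:03.337 Processeed buffer 0x324255
0:00:03.339 Send buffer 0x324255
0:00:05.171 Received buffer 0x32421A
0:00:05.173 Processeed buffer 0x32421A
0:00:05.178 Send buffer 0x32421A
EOF
cat > B.log <<'EOF'
0:00:00.134 Received buffer 0x324111
0:00:00.137 Processeed buffer 0x324111
0:00:00.138 Send buffer 0x324111
0:00:03.334 Received buffer 0x324222
0:00:03.337 Processeed buffer 0x324222
0:00:03.338 Error processing buffer 0x324222
0:00:03.339 Send buffer 0x3242222
0:00:05.271 Received buffer 0x3242FA
0:00:05.273 Processeed buffer 0x3242FA
0:00:05.278 Send buffer 0x3242FA
0:00:07.280 Send buffer 0x3242FA failed
EOF

# Pass 1 (file A): remember each line with its volatile fields masked out.
# Pass 2 (file B): print original lines whose masked form never occurred in A.
awk '
function mask(s) {
    sub(/^[0-9]+:[0-9][0-9]:[0-9][0-9]\.[0-9][0-9][0-9] /, "", s)  # drop timestamp
    gsub(/0x[0-9A-Fa-f]+/, "ADDR", s)                              # normalize addresses
    return s
}
NR == FNR { seen[mask($0)] = 1; next }
!(mask($0) in seen)
' A.log B.log | tee result.txt
# -> 0:00:03.338 Error processing buffer 0x324222
#    0:00:07.280 Send buffer 0x3242FA failed
```

Because whole masked lines are compared, "Send buffer ... failed" is not swallowed by the shorter "Send buffer ..." category, which a substring-based egrep would get wrong.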
Example of doing a bit of this manually:
$ cat A | head -n 1
0:00:00.234 Received buffer 0x324234
$ cat A | egrep -v "Received buffer" | head -n 1
0:00:00.237 Processeed buffer 0x324234
$ cat A | egrep -v "Received buffer|Processeed buffer" | head -n 1
0:00:00.238 Send buffer 0x324255
$ cat A | egrep -v "Received buffer|Processeed buffer|Send buffer" | head -n 1
$ cat B | egrep -v "Received buffer|Processeed buffer|Send buffer"
0:00:03.338 Error processing buffer 0x324222
0:00:07.280 Send buffer 0x3242FA failed
This is boring (there are a lot of message types), and I can accidentally include a pattern that is too broad. It also can't handle complicated things like interrelations between messages.
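The boring pattern-collection step can itself be automated, at least for the simple cases: derive the egrep alternation from A by stripping the timestamp, substituting hex addresses with a sub-regex, and anchoring each pattern at end of line (the anchor is what keeps "Send buffer ... failed" from being matched, as a substring, by the shorter "Send buffer ..." pattern). This again assumes timestamps and hex addresses are the only volatile parts:

```shell
# Same example logs as in the question.
cat > A.log <<'EOF'
0:00:00.234 Received buffer 0x324234
0:00:00.237 Processeed buffer 0x324234
0:00:00.238 Send buffer 0x324255
0:00:03.334 Received buffer 0x324255
0:00:03.337 Processeed buffer 0x324255
0:00:03.339 Send buffer 0x324255
0:00:05.171 Received buffer 0x32421A
0:00:05.173 Processeed buffer 0x32421A
0:00:05.178 Send buffer 0x32421A
EOF
cat > B.log <<'EOF'
0:00:00.134 Received buffer 0x324111
0:00:00.137 Processeed buffer 0x324111
0:00:00.138 Send buffer 0x324111
0:00:03.334 Received buffer 0x324222
0:00:03.337 Processeed buffer 0x324222
0:00:03.338 Error processing buffer 0x324222
0:00:03.339 Send buffer 0x3242222
0:00:05.271 Received buffer 0x3242FA
0:00:05.273 Processeed buffer 0x3242FA
0:00:05.278 Send buffer 0x3242FA
0:00:07.280 Send buffer 0x3242FA failed
EOF

# Turn each distinct A line into a regex: strip the timestamp, replace hex
# addresses with a sub-regex, append a "$" anchor, then join with "|".
pat=$(sed -E \
      -e 's/^[0-9]+:[0-9]{2}:[0-9]{2}\.[0-9]{3} //' \
      -e 's/0x[0-9A-Fa-f]+/0x[0-9A-Fa-f]+/g' \
      -e 's/$/$/' \
      A.log | sort -u | paste -sd'|' -)
grep -E -v "$pat" B.log | tee result3.txt
# -> 0:00:03.338 Error processing buffer 0x324222
#    0:00:07.280 Send buffer 0x3242FA failed
```

This is still substring matching within a line (only the right edge is anchored), so it remains easy to generate a pattern that is too broad; the whole-line comparison of masked forms is safer.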
I know this is AI-related. Maybe there are already tools developed for this?