Intelligent "Subtraction" of one text logfile from another
Posted by Vi on Super User
Published on 2010-07-27T16:33:43Z
Indexed on 2010/12/22 16:56 UTC
Example: An application generates a large text log file A with many different messages. It generates a similarly large log file B when it does not function correctly.
I want to see which messages in file B are essentially new, i.e. to filter out everything that also appears in A.
A trivial prototype is:
- sort | uniq each file
- Concatenate the two results
- sort | uniq -c
- grep -v "^ *2 "
This produces a symmetric difference and is inconvenient. How can it be done better (including a non-symmetric difference and preserving the order of messages in B)?
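For the non-symmetric, order-preserving part, a minimal sketch could read both files in a single awk run. It only matches lines verbatim, so on logs with timestamps or addresses nearly every line of B will still look new. The function name `subtract_exact` is hypothetical.

```shell
# Print the lines of file B ($2) that never occur verbatim in file A ($1),
# keeping the original order (and any repeats) of B.
subtract_exact() {
    awk '
    FILENAME == ARGV[1] { seen[$0] = 1; next }  # first file: remember each line
    !($0 in seen)                               # second file: print unseen lines
    ' "$1" "$2"
}
```

Usage would be `subtract_exact A B`. On most systems `grep -vxFf A B` should give the same result.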
The program should first analyse A and learn which messages are common, then analyse B, showing which messages need attention.
Ideally it should automatically disregard things like timestamps, line numbers or other volatile details.
Example. A:
0:00:00.234 Received buffer 0x324234
0:00:00.237 Processeed buffer 0x324234
0:00:00.238 Send buffer 0x324255
0:00:03.334 Received buffer 0x324255
0:00:03.337 Processeed buffer 0x324255
0:00:03.339 Send buffer 0x324255
0:00:05.171 Received buffer 0x32421A
0:00:05.173 Processeed buffer 0x32421A
0:00:05.178 Send buffer 0x32421A
B:
0:00:00.134 Received buffer 0x324111
0:00:00.137 Processeed buffer 0x324111
0:00:00.138 Send buffer 0x324111
0:00:03.334 Received buffer 0x324222
0:00:03.337 Processeed buffer 0x324222
0:00:03.338 Error processing buffer 0x324222
0:00:03.339 Send buffer 0x3242222
0:00:05.271 Received buffer 0x3242FA
0:00:05.273 Processeed buffer 0x3242FA
0:00:05.278 Send buffer 0x3242FA
0:00:07.280 Send buffer 0x3242FA failed
Result:
0:00:03.338 Error processing buffer 0x324222
0:00:07.280 Send buffer 0x3242FA failed
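For this particular sample, the result above can be reproduced by masking the volatile fields before comparing. The following is only a sketch: the two patterns are assumptions fitted by hand to the sample format (a `h:mm:ss.fff` timestamp and a `0x...` address), and the names `subtract_normalized` and `norm` are hypothetical. It does not discover the volatile parts automatically.

```shell
# Print the lines of B ($2) whose masked form never occurs in A ($1).
# Lines are compared after masking, but printed in their original form.
subtract_normalized() {
    awk '
    # Replace volatile fields with fixed placeholders (hand-written patterns).
    function norm(s) {
        gsub(/[0-9]+:[0-9][0-9]:[0-9][0-9]\.[0-9]+/, "<TIME>", s)
        gsub(/0x[0-9A-Fa-f]+/, "<ADDR>", s)
        return s
    }
    FILENAME == ARGV[1] { known[norm($0)] = 1; next }  # learn message types from A
    !(norm($0) in known)                               # report unknown types in B
    ' "$1" "$2"
}
```

Running `subtract_normalized A B` on the two sample files should print exactly the two lines listed under "Result".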
One way of solving it could be something like this:
- Split each line into logical units: "0:00:00.134 Received buffer 0x324111", "0:00:00.134", "Received", "buffer", "0x324111", "324111", "Received buffer", "\d:\d\d:\d\d\.\d\d\d", "\d+:\d+:\d+.\d+", "0x[0-9A-F]{6}", ... It should find individual words, simple patterns in numbers and common layouts (e.g. "some date, then text, then number, then text, then end_of_line"), and also handle combinations of the above. As this is not an easy task, user assistance (adding regexes with explicit "disregard that", "make the main factor", "don't split to parts", "consider as date/number", "take care of order/quantity of such messages" rules) should be supported (but not required).
- Find recurring units and "categorize" lines; filter out overly volatile things like timestamps, addresses or line numbers.
- Analyse the second file and find things that have new logical units (one-time or recurring), or anything else that would "amaze" a system that has got used to the first file.
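A very reduced version of these steps can be sketched with awk, assuming that a "unit" is simply a whitespace-separated field and that volatile fields are recognised by three hand-written patterns. Both are simplifications of the idea above, and the names `new_units` and `unit` are hypothetical. The sketch flags a line of B as soon as it contains a unit never seen in A, so it cannot notice a new combination of already-known words.

```shell
# Report the lines of B ($2) that contain at least one unit never seen in A ($1).
new_units() {
    awk '
    # Collapse purely volatile fields into placeholders; keep other words as-is.
    function unit(f) {
        if (f ~ /^[0-9]+:[0-9][0-9]:[0-9][0-9]\.[0-9]+$/) return "<TIME>"
        if (f ~ /^0x[0-9A-Fa-f]+$/) return "<ADDR>"
        if (f ~ /^[0-9]+$/) return "<NUM>"
        return f
    }
    # First file: remember every unit.
    FILENAME == ARGV[1] {
        for (i = 1; i <= NF; i++) known[unit($i)] = 1
        next
    }
    # Second file: collect the units of this line that were never seen in A.
    {
        fresh = ""
        for (i = 1; i <= NF; i++) {
            u = unit($i)
            if (!(u in known)) fresh = fresh " " u
        }
        if (fresh != "") print $0 "  <- new:" fresh
    }
    ' "$1" "$2"
}
```

With the sample files, `new_units A B` should flag the "Error processing" line (new units `Error` and `processing`) and the "failed" line (new unit `failed`).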
Example of doing a bit of this manually:
$ cat A | head -n 1
0:00:00.234 Received buffer 0x324234
$ cat A | egrep -v "Received buffer" | head -n 1
0:00:00.237 Processeed buffer 0x324234
$ cat A | egrep -v "Received buffer|Processeed buffer" | head -n 1
0:00:00.238 Send buffer 0x324255
$ cat A | egrep -v "Received buffer|Processeed buffer|Send buffer" | head -n 1
$ cat B | egrep -v "Received buffer|Processeed buffer|Send buffer"
0:00:03.338 Error processing buffer 0x324222
0:00:07.280 Send buffer 0x3242FA failed
This is tedious (there are many message types), and I can accidentally include a pattern that is too broad. It also can't handle complicated things like interrelations between messages.
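One simple form of interrelation that can be checked mechanically is "which message type follows which". The sketch below is only an illustration under the same hand-written masking assumptions as before (timestamp and `0x...` address patterns fitted to the sample); the names `new_transitions` and `norm` are hypothetical. It learns every pair of consecutive message types in A and prints the lines of B that arrive in a context never seen in A.

```shell
# Print the lines of B ($2) whose (previous type, current type) pair
# never occurs among consecutive lines of A ($1).
new_transitions() {
    awk '
    # Mask volatile fields so that a line is reduced to its message type.
    function norm(s) {
        gsub(/[0-9]+:[0-9][0-9]:[0-9][0-9]\.[0-9]+/, "<TIME>", s)
        gsub(/0x[0-9A-Fa-f]+/, "<ADDR>", s)
        return s
    }
    FNR == 1 { prev = "<START>" }      # reset the context at the top of each file
    {
        cur = norm($0)
        pair = prev SUBSEP cur
        if (FILENAME == ARGV[1]) {
            seen[pair] = 1             # learning pass over A
        } else if (!(pair in seen)) {
            print                      # unfamiliar transition in B
        }
        prev = cur
    }
    ' "$1" "$2"
}
```

Note that this also reports a perfectly ordinary message when it directly follows an unusual one, such as the "Send buffer" line right after "Error processing" in the sample B.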
I know that this is AI-related. Maybe there are already-developed tools for it?
© Super User or respective owner