How to replace pairs of strings in two files with identical IDs?
Posted by Péter Török on Stack Overflow
Published on 2010-04-20T10:49:13Z
Indexed on 2010/04/20 12:03 UTC
Sorry if the title is not very intelligible; I couldn't come up with anything better. Hopefully my explanation is clear enough:
I have a pair of rather large log files with very similar content, except that some strings are different between the two. A couple of examples:
UnifiedClassLoader3@19518cc | UnifiedClassLoader3@d0357a
JBossRMIClassLoader@13c2d7f | JBossRMIClassLoader@191777e
That is, wherever the first file contains UnifiedClassLoader3@19518cc, the second contains UnifiedClassLoader3@d0357a, and so on. [Update] There are about 40 distinct pairs of such identifiers.[/Update]
I want to replace these with identical IDs so that I can spot the really important differences between the two files. That is, I want to replace all occurrences of both UnifiedClassLoader3@19518cc in file1 and UnifiedClassLoader3@d0357a in file2 with UnifiedClassLoader3@1; all occurrences of both JBossRMIClassLoader@13c2d7f in file1 and JBossRMIClassLoader@191777e in file2 with JBossRMIClassLoader@2; etc.
Using the Cygwin shell, I have so far managed to list all the distinct identifiers occurring in one of the files with
grep -o -e 'ClassLoader[0-9]*@[0-9a-f][0-9a-f]*' file1.log | sort | uniq
However, the original order is now lost, so I don't know which ID pairs with which in the other file. With grep -n I can get the line numbers, so sorting would preserve the order of appearance, but then I can't weed out the duplicate occurrences. Unfortunately, grep cannot print only the first match of a pattern.
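One possible way around the lost ordering, sketched here under the assumption that awk is available in the Cygwin shell: skip sort | uniq and let awk drop repeated identifiers while keeping the order of first appearance. The function name first_seen_ids is hypothetical and used only for illustration.

```shell
# List each identifier once, in order of first appearance.
# The awk filter prints a line only the first time it is seen,
# so no sorting is needed and the original order survives.
# (first_seen_ids is a hypothetical helper name.)
first_seen_ids() {
    grep -o -e 'ClassLoader[0-9]*@[0-9a-f][0-9a-f]*' "$1" | awk '!seen[$0]++'
}

# Usage: first_seen_ids file1.log
```

If the identifiers first appear in the same order in both logs, the n-th line of this output for file1 would pair with the n-th line for file2.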
I figured I could save the list of identifiers produced by the above command into a file, then iterate over the patterns in the file with grep -n | head -n 1, concatenate the results and sort them again. The result would be something like
2 ClassLoader3@19518cc
137 ClassLoader@13c2d7f
563 ClassLoader3@1267649
...
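The plan described above could be sketched roughly as follows. This is only an illustration: the function name first_occurrences and the ids.txt file name are hypothetical, and identifiers are matched as fixed strings with grep -F.

```shell
# Print "<line number> <identifier>" for the first occurrence of each
# identifier listed in ids_file (one per line), sorted by line number.
# (first_occurrences is a hypothetical helper name.)
first_occurrences() {
    ids_file=$1
    log=$2
    while IFS= read -r id; do
        # -n prefixes the line number, -F treats the identifier literally;
        # head keeps only the first hit and cut extracts its line number.
        line=$(grep -n -F -e "$id" "$log" | head -n 1 | cut -d: -f1)
        printf '%s %s\n' "$line" "$id"
    done < "$ids_file" | sort -n
}

# Usage: first_occurrences ids.txt file1.log
```

Note that this rescans the log once per identifier, which may be slow for large files; with about 40 identifiers that is probably tolerable.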
Then I could (either manually or with sed itself) massage this into a sed command like
sed -e 's/ClassLoader3@19518cc/ClassLoader3@2/g' \
    -e 's/ClassLoader@13c2d7f/ClassLoader@137/g' \
    -e 's/ClassLoader3@1267649/ClassLoader3@563/g' \
    file1.log > file1_processed.log
and similarly for file2.
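One point to watch in this plan: line numbers used as replacement IDs would only match between the two files if every identifier first appears on the same line in both. A possible alternative, sketched below, numbers the identifiers sequentially in order of first appearance and generates the sed script automatically. It assumes the identifiers first appear in the same order in both logs, which the logs may or may not satisfy; the function name make_sed_script and the file1.sed file name are hypothetical.

```shell
# Emit one sed substitution per distinct identifier, mapping it to a
# sequential ID (1, 2, 3, ...) in order of first appearance in the log.
# (make_sed_script is a hypothetical helper name.)
make_sed_script() {
    grep -o -e 'ClassLoader[0-9]*@[0-9a-f][0-9a-f]*' "$1" |
        awk '!seen[$0]++ {
            name = $0
            sub(/@.*/, "", name)    # keep the part before "@"
            printf "s/%s/%s@%d/g\n", $0, name, ++n
        }'
}

# Usage (repeat for file2.log):
#   make_sed_script file1.log > file1.sed
#   sed -f file1.sed file1.log > file1_processed.log
```

The generated substitutions match plain substrings, so an identifier that happens to be a prefix of another one would be replaced inside the longer one as well; it is worth checking the generated script before applying it.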
However, before I start, I would like to verify that my plan is the simplest possible working solution to this.
Is there any flaw in this approach? Is there a simpler way?
© Stack Overflow or respective owner