Sorry if the title is not very intelligible; I couldn't come up with anything better. Hopefully my explanation is clear enough:
I have a pair of rather large log files with very similar content, except that some strings are different between the two. A couple of examples:
UnifiedClassLoader3@19518cc | UnifiedClassLoader3@d0357a
JBossRMIClassLoader@13c2d7f | JBossRMIClassLoader@191777e
That is, wherever the first file contains UnifiedClassLoader3@19518cc, the second contains UnifiedClassLoader3@d0357a, and so on. [Update] There are about 40 distinct pairs of such identifiers.[/Update]
I want to replace these with identical IDs so that I can spot the really important differences between the two files. I.e. I want to replace all occurrences of both UnifiedClassLoader3@19518cc in file1 and UnifiedClassLoader3@d0357a in file2 with UnifiedClassLoader3@1; all occurrences of both JBossRMIClassLoader@13c2d7f in file1 and JBossRMIClassLoader@191777e in file2 with JBossRMIClassLoader@2 etc.
Using the Cygwin shell, I have so far managed to list all the distinct identifiers occurring in one of the files with:
grep -o -e 'ClassLoader[0-9]*@[0-9a-f][0-9a-f]*' file1.log | sort | uniq
However, the original order is now lost, so I can't tell which ID in the other file each one pairs with. With grep -n I can get the line numbers, so sorting would preserve the order of appearance, but then I can't weed out the duplicate occurrences, because every match carries a different line number. Unfortunately, as far as I can tell, grep cannot print only the first occurrence of each distinct match.
I figured I could save the list of identifiers produced by the above command into a file, then iterate over the patterns in the file with grep -n | head -n 1, concatenate the results and sort them again. The result would be something like
2 ClassLoader3@19518cc
137 ClassLoader@13c2d7f
563 ClassLoader3@1267649
...
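To make the lookup step concrete, here is a minimal sketch of the loop I have in mind. The function name first_occurrences is just a placeholder I made up, and I am assuming that no identifier is a prefix of another one (otherwise the fixed-string search could report the wrong line).

```shell
# Hypothetical helper: print "<line> <identifier>" for the first occurrence
# of each distinct identifier in a log file, ordered by line number.
first_occurrences() {
    log=$1
    pattern='ClassLoader[0-9]*@[0-9a-f][0-9a-f]*'
    # List the distinct identifiers, then look each one up again.
    grep -o -e "$pattern" "$log" | sort | uniq |
    while IFS= read -r id; do
        # -F treats the identifier as a fixed string, -n prefixes the line
        # number; head keeps only the first hit, cut keeps only the number.
        line=$(grep -n -F -e "$id" "$log" | head -n 1 | cut -d: -f1)
        printf '%s %s\n' "$line" "$id"
    done | sort -n
}
```

Calling first_occurrences file1.log should then print a list in the format shown above. It rescans the file once per identifier, which seems tolerable for about 40 identifiers.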
Then I could (either manually or with sed itself) massage this list into a sed command like
sed -e 's/ClassLoader3@19518cc/ClassLoader3@2/g' \
    -e 's/ClassLoader@13c2d7f/ClassLoader@137/g' \
    -e 's/ClassLoader3@1267649/ClassLoader3@563/g' \
    file1.log > file1_processed.log
and similarly for file2.
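The massaging step could probably be scripted instead of done by hand. The sketch below is only an illustration: make_sed_script and the file names in the comments are names I invented. It numbers the identifiers by order of appearance (1, 2, 3, ...) rather than by line number, on the assumption that the identifiers first appear in the same relative order in both files, whereas the line numbers themselves may not match.

```shell
# Hypothetical helper: read "<line> <identifier>" pairs (already sorted by
# line number) on stdin and print one sed substitution per identifier.
make_sed_script() {
    n=0
    while read -r _line id; do
        n=$((n + 1))
        prefix=${id%@*}    # part before the '@', e.g. ClassLoader3
        printf 's/%s/%s@%s/g\n' "$id" "$prefix" "$n"
    done
}

# Possible usage, with file1_ids.txt holding the sorted list for file1.log:
#   make_sed_script < file1_ids.txt > file1.sed
#   sed -f file1.sed file1.log > file1_processed.log
```

The identifiers contain only letters, digits and @, so they should not need escaping inside the sed expressions.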
However, before I start, I would like to verify that my plan is the simplest possible working solution to this.
Is there any flaw in this approach? Is there a simpler way?