How to determine which file in the previous version a text block in the current version came from?

Posted by Muhammad Asaduzzaman on Stack Overflow
Published on 2010-03-20T20:43:47Z Indexed on 2010/03/20 21:31 UTC

The problem is as follows. Suppose I have a list of files in one version (say A, B, C, D). In the next version I have the files A, E, F, G. There are some similarities in their contents. The files in the later version are derived from the previous version by renaming, content addition, deletion, or partial modification, or are carried over unchanged (for example, A is not changed).

I take a block of text from a file (E, in the second version) and check which files in the first version contain this text block. I find that B, C, and D all contain the fragment. I want to determine which file (B, C, or D) the text block actually comes from. (I assume that E is a file whose name changed in the second version.)

Since content may be changed, added, or deleted in the later version, I use the LCS (longest common subsequence) algorithm to measure similarity. However, I still cannot map the file to its previous version. One possible approach might be to use the location of the matched text blocks within each file, but this heuristic does not always work. Is there any existing research or algorithm that addresses this? Any direction would be helpful. Thanks in advance.
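
To make the LCS-based comparison concrete, here is a minimal Python sketch. It is an illustration only, not necessarily the implementation used above. It assumes that similarity is measured over whole lines, that the score is normalised by the longer file, and that candidates are ranked by how similar each whole file is to E. The names `lcs_length`, `similarity`, and `rank_candidates` are hypothetical.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)          # DP row for the previous element of a
    for x in a:
        curr = [0]                     # column 0: empty prefix of b
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(old_text, new_text):
    """LCS length over lines, divided by the line count of the longer file.

    The normalisation is an assumption made for this sketch; other choices
    (for example dividing by the shorter file) are equally possible.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    longest = max(len(old_lines), len(new_lines))
    if longest == 0:
        return 1.0                     # two empty files are treated as identical
    return lcs_length(old_lines, new_lines) / longest


def rank_candidates(new_text, candidates):
    """Rank old files (a dict of name -> text) by similarity to the new file, best first."""
    scores = {name: similarity(text, new_text) for name, text in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With B, C, and D as the candidates, `rank_candidates(text_of_E, {"B": ..., "C": ..., "D": ...})` puts the old file that shares the most lines with E, in order, at the front of the list. This only breaks the tie when one candidate is clearly closer to E as a whole. If the scores are near each other, the ambiguity described above remains.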

© Stack Overflow or respective owner
