What algorithms can I use to detect if articles or posts are duplicates?
- by michael
I'm trying to detect whether an article or forum post is a duplicate entry within the database. I've given this some thought and come to the conclusion that someone duplicating content will do so in one of three ways (in descending difficulty to detect):
copy and paste the whole text
copy and paste parts of the text, merging them with their own
copy an article from an external site and masquerade it as their own
Prepping Text For Analysis
For more accurate results, the text is "standardized" by stripping out any anomalies; the goal is to make the text as "pure" as possible (see the sketch after this list):
Duplicate whitespace is stripped, and leading and trailing whitespace is trimmed.
Newlines are standardized to \n.
HTML tags are removed.
URLs are stripped using the Daring Fireball URL regex.
I use BBCode in my application, so that is stripped too.
Accented and other non-English characters (e.g. ä) are converted to their plain, unaccented form.
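Roughly, the standardization step looks something like this (a simplified Python sketch; the HTML/BBCode/URL regexes below are just stand-ins for illustration, and the real Daring Fireball URL pattern is much more thorough):

```python
import re
import unicodedata

def normalize(text: str) -> str:
    # Strip HTML tags (naive pattern, for illustration only)
    text = re.sub(r'<[^>]+>', ' ', text)
    # Strip BBCode tags like [b]...[/b] or [url=...]
    text = re.sub(r'\[/?[a-zA-Z]+(=[^\]]*)?\]', ' ', text)
    # Strip URLs -- placeholder pattern, not the actual Daring Fireball regex
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Fold accented characters to their ASCII base form (ä -> a)
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Standardize newlines to \n
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Collapse duplicate whitespace and trim leading/trailing whitespace
    text = re.sub(r'[ \t]+', ' ', text).strip()
    return text
```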
I store information about each article in (1) a statistics table and (2) a keywords table.
(1) Statistics Table
The following statistics are stored about the textual content (much like this post):
text length
letter count
word count
sentence count
average words per sentence
automated readability index
gunning fog score
For European languages other than English, the Coleman-Liau index and the Automated Readability Index should be used, as they do not rely on syllable counting and so should still produce a reasonably accurate score.
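For reference, these statistics could be computed along these lines (a sketch, assuming the normalized text from above; the Gunning fog index needs a syllable count, which is only crudely approximated here):

```python
import re

def text_statistics(text: str) -> dict:
    letters = len(re.findall(r'[A-Za-z]', text))
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    word_count = len(words) or 1          # avoid division by zero
    sentence_count = len(sentences) or 1
    words_per_sentence = word_count / sentence_count

    # Automated Readability Index: 4.71*(letters/words) + 0.5*(words/sentences) - 21.43
    ari = 4.71 * (letters / word_count) + 0.5 * words_per_sentence - 21.43

    # Gunning fog: 0.4 * (words/sentences + 100 * complex_words/words)
    # "Complex" words (3+ syllables) are crudely approximated by counting vowel groups.
    complex_words = sum(1 for w in words
                        if len(re.findall(r'[aeiouy]+', w.lower())) >= 3)
    fog = 0.4 * (words_per_sentence + 100 * complex_words / word_count)

    return {
        'text_length': len(text),
        'letter_count': letters,
        'word_count': word_count,
        'sentence_count': sentence_count,
        'word_per_sentence': round(words_per_sentence, 1),
        'gunning_fog': round(fog, 1),
        'auto_read_index': round(ari, 1),
    }
```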
(2) Keywords Table
The keywords are generated by excluding a huge list of stop words (common words), e.g. 'the', 'a', 'of', 'to', and so on.
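Something like this is what I mean (the stop-word list below is just a tiny illustrative sample; the real list is much larger):

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the real one contains hundreds of words.
STOP_WORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'is', 'it',
              'that', 'for', 'on', 'was', 'with', 'as', 'this'}

def top_keywords(text: str, n: int = 10) -> list[tuple[str, int]]:
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)
```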
Sample Data
text_length, 3963
letter_count, 3052
word_count, 684
sentence_count, 33
word_per_sentence, 21
gunning_fog, 11.5
auto_read_index, 9.9
keyword 1, killed
keyword 2, officers
keyword 3, police
Note that once an article is updated, all of the above statistics are regenerated and could end up as completely different values.
How could I use the above information to detect whether an article that is being published for the first time already exists within the database?
I'm aware that anything I design will not be perfect; the biggest risks are that (1) content that is not a duplicate gets flagged as a duplicate, and (2) duplicate content slips through.
So the algorithm should generate a risk assessment number, where 0 means no duplication risk, 5 means a possible duplicate, and 10 means a definite duplicate. Anything above 5 means there is a good chance the content is a duplicate; in that case the content could be flagged and linked to the articles that are possible duplicates, and a human could decide whether to delete or allow it.
As I said before, I'm storing keywords for the whole article, but I wonder if I could do the same on a per-paragraph basis. That would mean further splitting up my data in the DB, but it would also make it easier to detect case (2) from my list above.
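For comparing keyword sets, whether for whole articles or per paragraph, I picture a simple overlap measure along these lines (a sketch; the per-paragraph helper is only there to illustrate the idea):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two keyword sets: 0.0 = nothing in common, 1.0 = identical."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Per-paragraph idea: store a keyword set per paragraph, then compare every
# paragraph of the new article against every stored paragraph and keep the best match.
def best_paragraph_match(new_paragraphs: list[set[str]],
                         stored_paragraphs: list[set[str]]) -> float:
    return max((jaccard(p, q) for p in new_paragraphs for q in stored_paragraphs),
               default=0.0)
```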
I'm thinking of a weighted average between the statistics, but in what order, and what would the consequences be...
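For example, something along these lines is what I have in mind, with the weights being complete guesses that would need tuning against real data:

```python
def duplicate_risk(new_stats: dict, old_stats: dict,
                   new_keywords: set[str], old_keywords: set[str]) -> float:
    """Return a 0-10 duplicate risk score; the weights are placeholders to be tuned."""
    # How close are the numeric statistics? Ratio of smaller to larger, per statistic.
    stat_keys = ['text_length', 'word_count', 'sentence_count',
                 'word_per_sentence', 'gunning_fog', 'auto_read_index']
    ratios = [min(new_stats[k], old_stats[k]) / max(new_stats[k], old_stats[k])
              for k in stat_keys if max(new_stats[k], old_stats[k]) > 0]
    stat_similarity = sum(ratios) / len(ratios) if ratios else 0.0

    # Keyword overlap (Jaccard), as sketched above.
    kw_similarity = jaccard(new_keywords, old_keywords)

    # Keywords feel like the stronger signal, so they get the heavier weight.
    return round(10 * (0.3 * stat_similarity + 0.7 * kw_similarity), 1)
```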