"A" is related to "B" and "C". How do I show that "B" and "C" might, by this context, be related as well?
Example:
Here are a few headlines about a recent Broadway play:
1 - David Mamet's Glengarry Glen Ross, Starring Al Pacino, Opens on Broadway
2 - Al Pacino in 'Glengarry Glen Ross': What did the critics think?
3 - Al Pacino earns lackluster reviews for Broadway turn
4 - Theater Review: Glengarry Glen Ross Is Selling Its Stars Hard
5 - Glengarry Glen Ross; Hey, Who Killed the Klieg Lights?
Problem:
Running a fuzzy-string match over these records will establish some relationships, but not others, even though a human reader could pick them out from context in much larger datasets.
How do I find the relationship that suggests #3 is related to #4? Both of them can be easily connected to #1, but not to each other.
Is there a (Googlable) name for this kind of data or structure? What kind of algorithm am I looking for?
Goal:
Given 1,000 headlines, a system that automatically suggests that these 5 items are all probably about the same thing.
To be honest, it's been so long since I've programmed I'm at a loss how to properly articulate this problem. (I don't know what I don't know, if that makes sense).
This is a personal project and I'm writing it in Python. Thanks in advance for any help, advice, and pointers!