Interpreting Search Results
- by Simon
Hi all,
I am tasked with writing a program that, given a search term and the HTML source of a page representing search results of some unknown search engine (it can really be anything, a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for earch result, the "details" link, the position within the results etc. It is not known whether the results page contains any of the data at all, and whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a RegExp but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I try to to are aly greatly appreciated.
Thanks, Simon