Extracting pure content / text from HTML Pages by excluding navigation and chrome content
- by Ankur Gupta
Hi,
I am crawling news websites and want to extract the news title, the news abstract (first paragraph), etc.
I plugged into the WebKit parser code so that I can easily navigate a webpage as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run the diff…
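
For illustration only, the "text version minus the HTML tags" step can be sketched in Python with the standard library's `html.parser`. This is an assumed stand-in for the WebKit API mentioned above, not that API itself; the names `TextExtractor` and `html_to_text` are hypothetical, and skipping `<script>` and `<style>` content is a choice made for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from an HTML document, one chunk per text node.

    Hypothetical helper for illustration; it is a stand-in for the
    WebKit text API, not a reproduction of it.
    """

    # Tags whose contents are never shown to the reader (assumption of this sketch).
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text_lines(self):
        return list(self._chunks)


def html_to_text(html):
    """Return the text of `html` with tags removed, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.text_lines())


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title><style>p{}</style></head>"
        "<body><h1>Headline</h1><p>First &amp; second.</p>"
        "<script>var x=1;</script></body></html>"
    )
    print(html_to_text(page))
```

Keeping one text node per line means the output can be fed to a line-based comparison later without further splitting.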