Dear Experts,
I am trying to identify the type of a web site (in English) automatically. My approach is to download the home page of the site, parse the HTML, and extract the text content of the page. For example, the text at the end of this post comes from CNN.com. I then try to get the keywords of the page and match them against my database. If the keywords include words like "news" or "breaking news", the site is classified as a news site. If words like "healthy" or "medical" appear, it is classified as a medical site.
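A minimal sketch of the approach described above, using only the Python standard library. The category names, the keyword sets in CATEGORY_KEYWORDS, and the helper names extract_text and classify are hypothetical placeholders for illustration; a real keyword database would be much larger, and simple counting is only one possible scoring rule.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text content of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical keyword database: category -> single-word keywords (assumption).
CATEGORY_KEYWORDS = {
    "news": {"news", "breaking", "headlines"},
    "medical": {"health", "healthy", "medical", "doctor"},
}


def classify(text):
    """Count keyword hits per category; return (best_category_or_None, scores)."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = {
        category: sum(words[keyword] for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, scores
    return best, scores
```

The HTML string itself could be obtained with something like urllib.request.urlopen(url).read().decode("utf-8", errors="replace"), and the result passed as classify(extract_text(html)).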
There are tools that can do text segmentation, but it is not easy to find a tool that handles the semantics. For example, "online shopping" is a single keyword and should not be split into two words. The combination carries useful information, but "online" and "shopping" on their own are less useful, since "online" may also appear in "online travel"...
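One way to keep "online shopping" together is to match against a list of known multi-word phrases, trying the longest phrase first at each position. The sketch below assumes a small hypothetical phrase table (PHRASES) and a hypothetical helper name (match_phrases); it is plain dictionary lookup, not real semantic analysis.

```python
import re

# Hypothetical phrase table: tuple of words -> category (assumption).
# "online" on its own is deliberately absent, so it never matches alone.
PHRASES = {
    ("online", "shopping"): "shopping",
    ("online", "travel"): "travel",
    ("breaking", "news"): "news",
    ("news",): "news",
}
MAX_LEN = max(len(phrase) for phrase in PHRASES)


def match_phrases(text):
    """Scan left to right, preferring the longest known phrase at each position."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        for n in range(MAX_LEN, 0, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in PHRASES:
                found.append((" ".join(candidate), PHRASES[candidate]))
                i += n  # consume the whole phrase so its words are not reused
                break
        else:
            i += 1  # no phrase starts at this token
    return found
```

With this table, "online banking" produces no match at all, while "online shopping" and "online travel" map to different categories.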
• Newark, JFK airports reopen
• 1 runway reopens at LaGuardia Airport
• Over 4,155 flights were cancelled Monday
• FULL STORY
* LaGuardia Airport snowplows busy Video
* Are you stranded? | Airport delays
* Safety tips for winter weather
* Frosty fun Video | Small dog, deep snow
Latest news
* Easter eggs used to smuggle cocaine
* Salmonella forces cilantro, parsley recall
* Obama's surprising verdict on Vick
* Blue Note baritone Bernie Wilson dead
* Busch aide to 911: She's not waking up
* Girl, 15, last seen working at store in '90
* Teena Marie's death shocks fans
* Terror network 'dismantled' in Morocco
* Saudis: 'Militant' had al Qaeda ties
* Ticker: Gov. blasts Obama 'birthers'
* Game show goof is 800K mistake Video
* Chopper saves calf on frozen pond Video
* Pickpocketing becomes hands-free Video
* Chilean miners going to Disney World
* Who's the most intriguing of 2010?
* Natalie Portman is pregnant, engaged
* 'Convert all gifts from aunt' CNNMoney
* Who controls the thermostat at home?
* This Just In: CNN's news blog