wget crawling search results of news website
- by kiltek
I am trying to crawl the search results of a news website using wget.
The name of the website is www.voanews.com.
After typing in my search keyword and clicking search, the site shows the results. I can then specify a "from" and a "to" date and hit search again.
After this the URL becomes:
http://www.voanews.com/search/?st=article&k=mykeyword&df=10%2F01%2F2013&dt=09%2F20%2F2013&ob=dt#article
and the actual content of the results is what I want to download.
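For reference, this is how I read the query string of that URL. It is only a small sketch: %2F is the percent-encoded "/", and taking df and dt to be the "from" and "to" dates is my assumption based on the form fields. The variable names query and decoded are just for illustration.

```shell
# Query string copied from the URL above (without the '#article' fragment).
query='st=article&k=mykeyword&df=10%2F01%2F2013&dt=09%2F20%2F2013&ob=dt'

# Put each parameter on its own line and decode %2F back to '/'.
decoded=$(printf '%s\n' "$query" | tr '&' '\n' | sed 's|%2F|/|g')
printf '%s\n' "$decoded"
```

So the keyword travels in k and the two dates in df and dt.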
To achieve this, I created the following wget command:
wget --reject=js,txt,gif,jpeg,jpg \
--accept=html \
--user-agent=My-Browser \
--recursive --level=2 \
www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt#article
Unfortunately, the crawler doesn't download the search results. It only follows the top navigation bar, which contains the "Home, USA, Africa, Asia, ..." links, and saves the articles they link to.
It seems like the crawler doesn't check the search result links at all.
What am I doing wrong, and how can I modify the wget command so that it downloads only the links in the search result list (and, of course, the pages they link to)?
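One thing I am unsure about is the unquoted URL: in a POSIX shell each & ends a command and runs it in the background, so wget may only be receiving the part up to st=article. A sketch of a quoted variant follows. It only prints the command (remove the echo to run it), it drops the #article fragment because fragments are never sent to the server, and I have not verified that it makes wget follow the result links. The variable name url is just for illustration.

```shell
# Single quotes keep the shell from treating '&' as a background operator
# and '?' as a glob character.
url='http://www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt'

# Print the command for inspection; remove 'echo' to actually run wget.
echo wget --user-agent=My-Browser --recursive --level=2 "$url"
```

I left out the --accept and --reject options here only to keep the sketch short.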