wget crawling search results of news website
Posted by kiltek on Super User
Published on 2013-11-02T23:27:13Z
Indexed on 2013/11/03 3:59 UTC
I am trying to crawl the search results of a news website using wget.
The name of the website is www.voanews.com.
After I type in my search keyword and click search, the site shows the results. I can then specify a "from" date and a "to" date and hit search again.
After this the URL becomes:
http://www.voanews.com/search/?st=article&k=mykeyword&df=10%2F01%2F2013&dt=09%2F20%2F2013&ob=dt#article
and the actual content of the results is what I want to download.
To achieve this I created the following wget command:
wget --reject=js,txt,gif,jpeg,jpg \
--accept=html \
--user-agent=My-Browser \
--recursive --level=2 \
www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt#article
Unfortunately, the crawler doesn't download the search results. It only follows the top navigation bar, which contains the "Home, USA, Africa, Asia, ..." links, and saves the articles those pages link to.
It seems like the crawler doesn't check the search result links at all.
What am I doing wrong, and how can I modify the wget command so that it downloads only the links in the search results list (and, of course, the pages they link to)?
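One thing worth checking in the command above, offered as a sketch and not a confirmed diagnosis: the URL is not quoted. In a POSIX shell an unquoted & ends the command and runs it in the background, so wget would only receive www.voanews.com/search/?st=article and never see the keyword or the dates. The helper name build_search_url below is hypothetical (it is not part of wget or the site); it only assembles the URL from the post. The #article fragment is left out because fragments are never sent to the server. Whether the result links are then followed also depends on the site's markup and robots.txt, which is not verified here.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): builds the search URL
# from a keyword and two MM/DD/YYYY dates, encoding "/" as %2F.
build_search_url() {
    keyword=$1
    date_from=$(printf '%s' "$2" | sed 's|/|%2F|g')
    date_to=$(printf '%s' "$3" | sed 's|/|%2F|g')
    printf 'http://www.voanews.com/search/?st=article&k=%s&df=%s&dt=%s&ob=dt\n' \
        "$keyword" "$date_from" "$date_to"
}

url=$(build_search_url germany 08/21/2013 09/20/2013)
echo "$url"

# The double quotes around "$url" stop the shell from treating each '&'
# as a "run in background" operator, so wget gets the full query string.
# Uncomment to start the crawl (same options as in the question):
# wget --reject=js,txt,gif,jpeg,jpg --accept=html \
#      --user-agent=My-Browser --recursive --level=2 "$url"
```

Quoting the literal URL directly on the wget command line (single quotes work too) has the same effect; the helper is only there to make the parameters visible.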
© Super User or respective owner