Spider a Website and Return URLs Only
Posted by Rob Wilkerson on Stack Overflow
Published on 2010-05-10T16:37:18Z
Indexed on 2010/05/10 16:54 UTC
I'm not quite sure how best to articulate this, but I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, just a simple list of URIs. I can get reasonably close with Wget using the --spider option, but when I pipe that output through grep, I can't seem to find the right magic to make it work:
wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'
The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set?
Thanks.
UPDATE
So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it, if it's in there). Once I redirected stderr to stdout, I got closer to what I need:
wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'
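For anyone hitting the same wall, the stderr behaviour can be reproduced without wget at all. This is only a minimal sketch: noisy is a made-up stand-in function that writes one line to stderr, roughly the way wget writes its log messages.

```shell
#!/bin/sh
# Hypothetical stand-in for a program that, like wget, logs to stderr.
noisy() {
    echo "Saving to: index.html" >&2
}

# A pipe only carries stdout, so grep gets no input and counts 0 matches.
# (2>/dev/null just hides the unfiltered stderr line for this demo;
#  "|| true" absorbs grep's non-zero exit status when nothing matches.)
missed=$( { noisy | grep -c 'Saving to:'; } 2>/dev/null || true )

# With 2>&1, stderr is merged into stdout before the pipe, so grep sees the line.
matched=$(noisy 2>&1 | grep -c 'Saving to:')

echo "without 2>&1: $missed, with 2>&1: $matched"
```

Note that the redirection has to sit on the command to the left of the pipe, as it does in the wget line above.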
I'd still be interested in other/better means for doing this kind of thing, if any exist.
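One possible way to go from that log to a bare list of URLs is to filter on the request lines instead of 'Saving to:'. This is a rough sketch, not a tested recipe: it assumes the installed wget prints request lines of the form "--2010-05-10 16:37:18--  http://somesite.com/" (the format may differ between wget versions), and extract_urls is just a name made up for this example.

```shell
#!/bin/sh
# Print the unique URLs found in wget log text read from stdin.
# Assumption: request lines look like
#   --2010-05-10 16:37:18--  http://somesite.com/
# so the URL is the third whitespace-separated field.
extract_urls() {
    grep '^--' | awk '{ print $3 }' | sort -u
}

# Usage (needs network access, so it is left commented out here):
# wget --spider --force-html -r -l1 http://somesite.com 2>&1 | extract_urls
```

If the log format on a given system puts the URL in a different field, the awk field number would need adjusting.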
© Stack Overflow or respective owner