Website crawler/spider to get site map
- by ack__
I need to retrieve a map of a whole website, in a format like:
http://example.org/
http://example.org/product/
http://example.org/service/
http://example.org/about/
http://example.org/product/viewproduct/
I need it to be link-based (no file or directory brute-forcing), like:
parse the homepage, retrieve all links, explore them, retrieve their links, and so on.
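That crawl loop is simple enough to sketch in stdlib Python. This is only a minimal illustration, not a finished tool: the `fetch` callable, the `max_pages` cap, and the same-host restriction are all my assumptions, and there is no politeness delay or robots.txt handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first, link-based crawl restricted to the start host.

    `fetch` is any callable returning HTML for a URL (hypothetical hook:
    plug in urllib.request, requests, or a stub for testing).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    site_map = []
    while queue and len(site_map) < max_pages:
        url = queue.popleft()
        site_map.append(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable page: skip it, keep crawling the rest
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return site_map
```

With a real HTTP `fetch`, the returned list is exactly the site map in the format above, in breadth-first order.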
I also need the ability to detect whether a page is a "template", so that it doesn't retrieve all of the "child pages". For example, if the following links are found:
http://example.org/product/viewproduct?id=1
http://example.org/product/viewproduct?id=2
http://example.org/product/viewproduct?id=3
I need to get http://example.org/product/viewproduct only once.
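If "template" simply means "same path, different query string" (my assumption, since real template detection is harder), the collapsing can be done by normalizing each URL before it is added to the site map:

```python
from urllib.parse import urlparse, urlunparse

def template_key(url):
    """Drop the query string and fragment so that viewproduct?id=1,
    ?id=2, ?id=3 all collapse to a single map entry."""
    p = urlparse(url)
    return urlunparse((p.scheme, p.netloc, p.path, "", "", ""))
```

Deduplicating the crawl output on this key would list each parameterized page once.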
I've looked into HTTrack and wget (with the --spider option), but nothing conclusive so far.
The software/tool should be downloadable, and I'd prefer it to run on Linux. It can be written in any language.
Thanks