Download/update webpages listed in XML sitemap

Posted by unor on Super User
Published on 2012-10-13T15:04:17Z Indexed on 2012/10/13 15:42 UTC

I'm looking for a FLOSS tool that downloads all pages (and their embedded resources, e.g. images) listed in an XML sitemap (built according to http://www.sitemaps.org/).

The tool should "crawl" the sitemap regularly and look for new URLs, deleted URLs, and changes in the lastmod element. Whenever a page is added, deleted, or updated, the tool should apply that change to the local copy.
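To make the requirement concrete, here is a minimal Python sketch of the comparison step, assuming the tool keeps a snapshot of the sitemap from the previous run as a mapping of URL to lastmod value. The function name diff_sitemap and the example.com URLs are made up for illustration; this is not taken from any existing tool.

```python
def diff_sitemap(old, new):
    """Compare two {url: lastmod} snapshots of the same sitemap.

    Returns three sorted lists: URLs that were added, URLs that were
    deleted, and URLs whose lastmod value differs between the runs.
    """
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    updated = sorted(u for u in new.keys() & old.keys() if new[u] != old[u])
    return added, deleted, updated


# Hypothetical snapshots from two consecutive runs.
previous = {
    "https://example.com/a": "2012-10-01",
    "https://example.com/b": "2012-10-02",
}
current = {
    "https://example.com/a": "2012-10-01",  # unchanged: nothing to do
    "https://example.com/b": "2012-10-12",  # lastmod changed: download again
    "https://example.com/c": "2012-10-13",  # new: download
}

added, deleted, updated = diff_sitemap(previous, current)
print("download:", added + updated)
print("remove local copy:", deleted)
```

Only the lastmod values are compared, so daily changes to "secondary" content on a page would not trigger a new download.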

Some sitemaps list sub-sitemaps in sitemap elements inside a sitemapindex. The tool should understand this, load all linked sub-sitemaps, and look for URLs in them.
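The sub-sitemap handling could look roughly like the following Python sketch, which uses only the standard library's xml.etree.ElementTree and the namespace defined by the sitemaps.org protocol. The function name collect_urls and the fetch parameter (any callable that returns the XML text for a URL) are assumptions for illustration; a real tool would fetch over HTTP and would need error handling.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}


def collect_urls(xml_text, fetch):
    """Return {url: lastmod or None} for a sitemap or a sitemap index.

    `fetch` is a caller-supplied function mapping a sitemap URL to its
    XML text; it is only called for sub-sitemaps listed in an index.
    """
    root = ET.fromstring(xml_text)
    pages = {}
    if root.tag == "{%s}sitemapindex" % SITEMAP_NS:
        # Index file: follow every <sitemap><loc> and merge the results.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            sub_url = (loc.text or "").strip()
            if sub_url:
                pages.update(collect_urls(fetch(sub_url), fetch))
    else:
        # Plain sitemap: read <loc> and the optional <lastmod> per <url>.
        for url in root.findall("sm:url", NS):
            loc = (url.findtext("sm:loc", default="", namespaces=NS)).strip()
            lastmod = url.findtext("sm:lastmod", namespaces=NS)
            if loc:
                pages[loc] = lastmod.strip() if lastmod else None
    return pages


# Hypothetical documents standing in for files on a web server.
INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-a.xml</loc></sitemap>
</sitemapindex>"""

SUB = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc><lastmod>2012-10-01</lastmod></url>
  <url><loc>https://example.com/page2</loc></url>
</urlset>"""

fake_web = {"https://example.com/sitemap-a.xml": SUB}
pages = collect_urls(INDEX, fake_web.__getitem__)
print(pages)
```

Passing fetch in as a parameter keeps the parsing logic testable without network access; in practice it could wrap something like urllib.request.urlopen.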


I know there are tools that let me extract all URLs from a sitemap, so that I could feed them to wget or similar tools (see for example: Extract Links from a sitemap(xml)). But this wouldn't help with getting notified about updates to pages. Tracking the webpages themselves for updates doesn't work, because "secondary" content on the pages changes daily, whereas lastmod is only updated when the relevant content has changed.

© Super User or respective owner