Download/update webpages listed in XML sitemap
- by unor
I'm looking for a FLOSS tool that downloads all pages (and embedded resources, e.g. images) listed in an XML sitemap (built according to http://www.sitemaps.org/).
The tool should "crawl" the sitemap regularly and check for new and deleted URLs and for changes in the lastmod element. Whenever a page gets added, deleted, or updated, the tool should apply that change to the local copy.
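To make the desired behaviour concrete, here is a minimal sketch of the comparison step in Python. It assumes each crawl produces a snapshot in the form of a dict mapping a URL to its lastmod string (or None if the element is absent). The function name `diff_snapshots` and the snapshot format are my own illustration, not part of any existing tool.

```python
def diff_snapshots(old, new):
    """Compare two sitemap snapshots (dicts of URL -> lastmod string or None).

    Returns three sorted lists: URLs that were added, URLs that were
    deleted, and URLs whose lastmod value differs between the snapshots.
    """
    old_urls, new_urls = set(old), set(new)
    added = sorted(new_urls - old_urls)
    deleted = sorted(old_urls - new_urls)
    # lastmod values are compared as plain strings (an assumption); this
    # treats any textual change in lastmod as an update.
    updated = sorted(u for u in old_urls & new_urls if old[u] != new[u])
    return added, deleted, updated
```

A tool would then download everything in `added` and `updated`, and remove the local copies of everything in `deleted`.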
Some sitemaps list sub-sitemaps in sitemapindex → sitemap. The tool should understand this, load all linked sub-sitemaps, and look for URLs in them.
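What I mean by following sub-sitemaps could be sketched like this in Python, using only the standard library. The names `collect_urls` and `fetch_bytes` are hypothetical. The sketch assumes uncompressed sitemaps that use the standard sitemaps.org 0.9 namespace, and it does not handle gzipped sitemaps or network errors.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol, in ElementTree notation.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def fetch_bytes(url):
    """Download a sitemap and return its raw bytes (no gzip handling)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def collect_urls(sitemap_url, fetch=fetch_bytes, seen=None):
    """Return a dict of page URL -> lastmod (or None), following sub-sitemaps."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return {}
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    pages = {}
    if root.tag == NS + "sitemapindex":
        # An index file: each <sitemap><loc> points to another sitemap.
        for loc in root.iterfind(f"{NS}sitemap/{NS}loc"):
            if loc.text:
                pages.update(collect_urls(loc.text.strip(), fetch, seen))
    elif root.tag == NS + "urlset":
        # A regular sitemap: each <url> has a <loc> and an optional <lastmod>.
        for url in root.iterfind(NS + "url"):
            loc = (url.findtext(NS + "loc") or "").strip()
            if loc:
                lastmod = (url.findtext(NS + "lastmod") or "").strip()
                pages[loc] = lastmod or None
    return pages
```

The `fetch` parameter is only there so the download step can be swapped out (e.g. for caching or for testing without network access).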
I know there are tools that let me extract all URLs from the sitemap, so that I could feed them to wget or similar tools (see for example: Extract Links from a sitemap(xml)). But this wouldn't help with getting notified about updates to pages. Tracking the webpages themselves for updates doesn't work, because "secondary" content on the pages changes daily, whereas lastmod is only updated when relevant content has changed.