Download/update webpages listed in XML sitemap
Posted by unor on Super User
Published on 2012-10-13T15:04:17Z
Indexed on 2012/10/13 15:42 UTC
I'm looking for a FLOSS tool that downloads all pages (and embedded resources, e.g. images) linked in an XML sitemap (built according to http://www.sitemaps.org/).
The tool should "crawl" the sitemap regularly and look for new and deleted URLs and for changes in the lastmod element. So whenever a page gets added, deleted, or updated, the tool should apply that change to the downloaded copy.
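To illustrate the kind of comparison I mean, here is a minimal sketch in Python. It assumes each crawl has already been reduced to a mapping of URL to lastmod string; the function name `diff_sitemaps` and the snapshot format are my own invention, not part of any existing tool.

```python
def diff_sitemaps(old, new):
    """Compare two {url: lastmod} snapshots taken from the same sitemap.

    Returns three sorted lists: URLs that were added, URLs that were
    deleted, and URLs whose lastmod value differs between the snapshots.
    """
    old_urls = set(old)
    new_urls = set(new)
    added = sorted(new_urls - old_urls)
    deleted = sorted(old_urls - new_urls)
    updated = sorted(u for u in old_urls & new_urls if old[u] != new[u])
    return added, deleted, updated


# Hypothetical snapshots from two consecutive crawls.
previous = {
    "http://example.com/a": "2012-10-01",
    "http://example.com/b": "2012-10-02",
}
current = {
    "http://example.com/b": "2012-10-13",
    "http://example.com/c": "2012-10-13",
}
added, deleted, updated = diff_sitemaps(previous, current)
```

A tool would then download the added and updated URLs again and remove the local copies of the deleted ones. Only lastmod is compared, so daily changes to "secondary" content would not trigger a download.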
Some sitemaps list sub-sitemaps in sitemapindex → sitemap. The tool should understand this and load all linked sub-sitemaps and look for URLs in there.
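As a rough sketch of that behaviour (not an existing tool), the following Python uses only the standard library to follow a sitemap index into its sub-sitemaps and collect every URL with its lastmod value. The names `fetch_bytes` and `collect_urls` are hypothetical, and the code assumes uncompressed sitemaps that use the sitemaps.org 0.9 namespace.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SM}


def fetch_bytes(url):
    """Download a sitemap and return its raw bytes (no gzip handling)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def collect_urls(sitemap_url, fetch=fetch_bytes, seen=None):
    """Return {page_url: lastmod or None}, following sitemapindex entries."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return {}
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    entries = {}
    if root.tag == "{%s}sitemapindex" % SM:
        # An index file: recurse into every listed sub-sitemap.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                entries.update(collect_urls(loc.text.strip(), fetch, seen))
    else:
        # A regular urlset: record each page and its optional lastmod.
        for url in root.findall("sm:url", NS):
            loc = url.findtext("sm:loc", namespaces=NS)
            lastmod = url.findtext("sm:lastmod", namespaces=NS)
            if loc:
                entries[loc.strip()] = lastmod.strip() if lastmod else None
    return entries
```

Passing the download function in as `fetch` keeps the parsing logic testable without network access; `collect_urls("http://example.com/sitemap.xml")` would use the real downloader.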
I know there are tools that allow me to extract all URLs from the sitemap, so that I could feed them to wget or similar tools (see for example: Extract Links from a sitemap(xml)). But this wouldn't help in getting notified about updates to pages. Tracking the webpages themselves for updates doesn't work, because "secondary" content on the pages changes daily, but lastmod only gets updated when relevant content has changed.
© Super User or respective owner