screen scraper templates for various websites
- by intuited
I'm looking specifically for a convenient way to locally archive posts from this and other similar sites. I'd like to separate the question itself from the answers, or maybe crop the question and store it, keeping the page title. Obviously I don't need to store the menu or the various other site interface chrome.
The best way to do this would seem to be to associate an XSLT template with a match on the URL, and use that template to pull out the relevant information and format it.
My two-part question:
Is there a tool specifically built for this task? I.e., something that takes a URL, checks it against a map of path-matching expressions to templates, and outputs the result of applying the matching template to that resource?
xmlto seems to be most of the way there, and could probably just be called from a script that does the pattern-matching, but something already integrated would be more convenient.
Is such a URL_pattern-to-XSLT_template map publicly available somewhere?
Question 2.5:
Is it legal to do this with sites like this one that have public licenses on their content?
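To make the first question concrete, here is a rough sketch of the kind of wrapper script I have in mind. Everything in it is an assumption made for illustration: the hosts, the path patterns, the stylesheet file names, and the helper names pick_template and build_command are all hypothetical. It also builds an xsltproc command rather than an xmlto one, since xsltproc's --html option parses the input as HTML; the stylesheets themselves would still have to be written per site.

```python
import re
from urllib.parse import urlparse

# Hypothetical map of (host regex, path regex) -> XSLT stylesheet.
# None of these hosts or stylesheet files exist; they only show the shape.
TEMPLATE_MAP = [
    (r"(^|\.)example\.com$", r"^/questions/\d+", "templates/qa-question.xsl"),
    (r"(^|\.)example\.org$", r"^/forum/thread/", "templates/forum-thread.xsl"),
]


def pick_template(url, template_map=TEMPLATE_MAP):
    """Return the stylesheet of the first rule matching the URL, or None."""
    parts = urlparse(url)
    host = parts.hostname or ""
    for host_re, path_re, stylesheet in template_map:
        if re.search(host_re, host) and re.search(path_re, parts.path):
            return stylesheet
    return None


def build_command(url, saved_html, template_map=TEMPLATE_MAP):
    """Build (but do not run) an xsltproc command for a locally saved page.

    Returns None when no rule matches, so the caller can skip the page.
    """
    stylesheet = pick_template(url, template_map)
    if stylesheet is None:
        return None
    # --html tells xsltproc to parse the input document as HTML.
    return ["xsltproc", "--html", stylesheet, saved_html]
```

The resulting list could be handed to subprocess.run once the page has been saved locally. First match wins, so more specific rules would need to be listed before more general ones.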