input URL, output contents of "view page source", i.e. after javascript / etc, library or command-li
- by Ryan Berckmans
I need a scalable, automated method of dumping the contents of "view page source" (the DOM as it stands after JavaScript has run) to a file. Programs such as wget or curl will non-interactively retrieve a set of URLs, but they do not execute JavaScript or any of that 'fancy stuff'.
My ideal solution looks like any of the following (fantasy solutions):
cat urls.txt | google-chrome --quiet --no-gui \
--output-sources-directory=~/urls-source
(fantasy command line, no idea if flags like these exist)
or
cat urls.txt | python -c "import some-library; \
... use some-library to process urls.txt ; output sources to ~/urls-source"
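To make the second fantasy form concrete, here is a minimal sketch that shells out to a headless browser for each URL. It assumes a Chrome or Chromium build that supports the `--headless` and `--dump-dom` flags and is reachable on PATH as `google-chrome`. The helper names (`dump_dom_command`, `output_name`, `dump_all`) and the hash-based file naming are made up for illustration; they are not part of any library.

```python
import hashlib
import subprocess
import sys
from pathlib import Path

# Assumption: the browser binary is on PATH under this name.
CHROME = "google-chrome"


def dump_dom_command(url, chrome=CHROME):
    """Build the command line that prints the rendered DOM to stdout."""
    return [chrome, "--headless", "--dump-dom", url]


def output_name(url):
    """Derive a filesystem-safe file name from a URL (hypothetical scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"


def dump_all(urls, out_dir, chrome=CHROME, timeout=60):
    """Write the rendered DOM of each URL into out_dir, one file per URL."""
    out_dir = Path(out_dir).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)
    for url in urls:
        try:
            result = subprocess.run(
                dump_dom_command(url, chrome),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            print(f"timed out: {url}", file=sys.stderr)
            continue
        if result.returncode != 0:
            print(f"failed: {url}", file=sys.stderr)
            continue
        (out_dir / output_name(url)).write_text(result.stdout, encoding="utf-8")
```

Something like `dump_all(open("urls.txt").read().split(), "~/urls-source")` would then process the whole list. It starts one browser process per URL, so it is probably not the most scalable approach, and pages that load content long after the initial render may come out incomplete.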
As a secondary concern, I also need:
dump all included JavaScript source to file (a la Firebug)
dump PDF/image of page to file (print to file)
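For the print-to-file part, a rough sketch along the same lines: it assumes a Chrome or Chromium build with headless support, where the `--print-to-pdf=` and `--screenshot=` flags write a PDF or PNG of the page. The binary name `google-chrome` and the helper names below are assumptions for illustration only. Dumping the included JavaScript sources is not covered here.

```python
import subprocess
from pathlib import Path

# Assumption: the browser binary is on PATH under this name.
CHROME = "google-chrome"


def pdf_command(url, pdf_path, chrome=CHROME):
    """Build the command line that prints the page to a PDF file."""
    return [chrome, "--headless", f"--print-to-pdf={pdf_path}", url]


def screenshot_command(url, png_path, chrome=CHROME):
    """Build the command line that saves a PNG screenshot of the page."""
    return [chrome, "--headless", f"--screenshot={png_path}", url]


def render(url, out_dir, stem, chrome=CHROME, timeout=60):
    """Write <stem>.pdf and <stem>.png for one URL into out_dir."""
    out_dir = Path(out_dir).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)
    for build, suffix in ((pdf_command, ".pdf"), (screenshot_command, ".png")):
        target = out_dir / (stem + suffix)
        # check=True raises CalledProcessError if the browser exits non-zero.
        subprocess.run(build(url, target, chrome), check=True, timeout=timeout)
```

For example, `render("http://example.com", "~/urls-source", "example")` should leave `example.pdf` and `example.png` in the output directory, provided the browser is installed.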