Input URL, output contents of "view page source" (i.e. after JavaScript has run): library or command-line tool?

Posted by Ryan Berckmans on Stack Overflow
Published on 2010-05-26T13:45:56Z Indexed on 2010/05/26 13:51 UTC

Filed under: JavaScript | web

I need a scalable, automated method of dumping the contents of "view page source" (the DOM after JavaScript has run) to a file. Programs such as wget or curl will non-interactively retrieve a set of URLs, but they do not execute JavaScript or any of that "fancy stuff".

My ideal solution looks like any of the following (fantasy solutions):

cat urls.txt | google-chrome --quiet --no-gui \
--output-sources-directory=~/urls-source  
(fantasy command line, no idea if flags like these exist)
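A close real-world analogue of the flags imagined above does exist: headless Chrome can dump the post-JavaScript DOM with `--headless --dump-dom`. A minimal sketch (the filename scheme is hypothetical, and `google-chrome` must be on your PATH; the exact flags imagined above, `--quiet` and `--output-sources-directory`, do not exist):

```shell
# Dump the rendered DOM of each URL in urls.txt to ~/urls-source/.
mkdir -p ~/urls-source
if [ -f urls.txt ]; then
    while IFS= read -r url; do
        # Hypothetical filename scheme: strip the scheme, map separators to '_'.
        name=$(printf '%s' "$url" | sed 's|^https\?://||; s|[/?#&]|_|g')
        # --dump-dom prints the serialized DOM after JavaScript has executed.
        google-chrome --headless --disable-gpu --dump-dom "$url" > ~/urls-source/"$name".html
    done < urls.txt
fi
```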

or

cat urls.txt | python -c "import some-library; \
... use some-library to process urls.txt ; output sources to ~/urls-source"    
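The Python fantasy can be realized with a headless-browser library such as Playwright (an anachronism relative to the original 2010 question, but a reasonable modern stand-in). A minimal sketch, assuming `pip install playwright` and `playwright install chromium`; the function names and filename scheme are my own, not from any library:

```python
import pathlib
import re

def url_to_filename(url: str) -> str:
    """Hypothetical scheme: strip the scheme, map unsafe characters to '_'."""
    name = re.sub(r"^https?://", "", url)
    return re.sub(r"[^A-Za-z0-9._-]", "_", name) + ".html"

def dump_sources(url_file: str, out_dir: str) -> None:
    # Playwright is assumed installed; imported lazily so the helpers above
    # remain usable without it.
    from playwright.sync_api import sync_playwright
    out = pathlib.Path(out_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    urls = pathlib.Path(url_file).read_text().split()
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for url in urls:
            # "networkidle" waits for JavaScript-driven requests to settle.
            page.goto(url, wait_until="networkidle")
            # page.content() serializes the live, post-JavaScript DOM.
            (out / url_to_filename(url)).write_text(page.content())
        browser.close()

# Usage (requires network access and an installed Chromium):
#   dump_sources("urls.txt", "~/urls-source")
```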

As a secondary concern, I also need:

  • dump all included javascript source to file (a la firebug)
  • dump pdf/image of page to file (print to file)
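Both secondary concerns are also covered by a headless browser: Chromium can "print" a page to PDF and capture a full-page screenshot. A sketch of the PDF/image half, again assuming Playwright (the helper name and output scheme are hypothetical; `page.pdf()` works only in headless Chromium):

```python
import pathlib
import re

def url_to_stem(url: str) -> str:
    """Hypothetical scheme shared with the HTML dump: strip scheme, sanitize."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", re.sub(r"^https?://", "", url))

def dump_pdf_and_screenshot(url: str, out_dir: str) -> None:
    # Assumes Playwright is installed (pip install playwright).
    from playwright.sync_api import sync_playwright
    out = pathlib.Path(out_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.pdf(path=out / f"{url_to_stem(url)}.pdf")  # "print to file"
        page.screenshot(path=out / f"{url_to_stem(url)}.png", full_page=True)
        browser.close()
```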

© Stack Overflow or respective owner
