WGet or cURL: Mirror Site from http://site.com And No Internal Access

Posted by alharaka on Server Fault See other posts from Server Fault or by alharaka
Published on 2011-02-10T22:12:30Z Indexed on 2012/04/12 5:33 UTC
Read the original article Hit count: 526

Filed under:
|

I have tried wget -m wget -r and a whole bunch of variations. I am getting some of the images on http://site.com, one of the scripts, and none of the CSS, even with the fscking -p parameter. The only HTML page is index.html and there are several more referenced, so I am at a loss. curlmirror.pl on the cURL developers website does not seem to get the job done either. Is there something I am missing? I have tried different levels of recursion with only this URL, but I get the feeling I am missing something. Long story short, some school allows its students to submit web projects, but they want to know how they can collect everything for the instructor who will grade it, instead of him going to all the externally hsoted sites.

UPDATE: I think I figured out the issue. I though the links to the other pages were in the index.html page that downloaded. I was way off. Turns out the footer of the page, which has all the navigation links, is handled by a JavaScript file Include.js, which reads JLSSiteMap.js and some other JS files to do page navigation and the like. As a result, wget does not pick up an other dependencies because a lot of this crap is handled not on web pages. How can I handle such a website? This is one of several problem cases. I assume little can be done if wget cannot parse JavaScript.

© Server Fault or respective owner

Related posts about wget

Related posts about curl