Download HTML and Images with wget, Skipping the First Few Lines
- by St. John Johnson
I'm attempting to use wget with the -p option to download specific documents and the images linked in the HTML.
The problem is that the site hosting the HTML puts some non-HTML information before the HTML. As a result, wget does not interpret the document as HTML and does not search it for images.
Is there a way to have wget strip the first X lines and/or force searching for images?
Example URL:
http://www.sec.gov/Archives/edgar/data/13239/000119312510070346/ds4.htm
First Lines of Content:
<DOCUMENT>
<TYPE>S-4
<SEQUENCE>1
<FILENAME>ds4.htm
<DESCRIPTION>FORM S-4
<TEXT>
<HTML><HEAD>
<TITLE>Form S-4</TITLE>
Last Lines of Content:
</BODY></HTML>
</TEXT>
</DOCUMENT>
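
For illustration, here is a minimal Python sketch of the "strip the wrapper, then look for images" idea, applied to a file that has already been downloaded. It does not change how wget parses the page. The names strip_sgml_wrapper, ImageCollector and find_images are hypothetical, and the sketch assumes the real HTML sits between a single <HTML> ... </HTML> pair, as in the example above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


def strip_sgml_wrapper(text):
    """Return only the <HTML>...</HTML> part of the document.

    Assumption: there is one HTML block; if none is found, the text
    is returned unchanged.
    """
    match = re.search(r"<html\b.*?</html\s*>", text, re.IGNORECASE | re.DOTALL)
    return match.group(0) if match else text


class ImageCollector(HTMLParser):
    """Collect absolute URLs from the src attribute of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(urljoin(self.base_url, src))


def find_images(text, base_url):
    """Strip the non-HTML wrapper, then list the image URLs in the HTML."""
    collector = ImageCollector(base_url)
    collector.feed(strip_sgml_wrapper(text))
    collector.close()
    return collector.images
```

The list returned by find_images could be written to a text file, one URL per line, and passed to wget with the -i option to fetch the images in a second step.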