Download HTML and Images with WGet without first few lines
Posted by St. John Johnson on Stack Overflow
Published on 2010-03-31T15:30:58Z
Indexed on 2010/03/31 15:33 UTC
I'm attempting to use wget with the -p option to download specific documents and the images linked in the HTML.
The problem is that the site hosting the HTML serves some non-HTML information before the HTML itself. As a result, wget does not interpret the document as HTML and does not search for images.
Is there a way to have wget strip the first X lines and/or force it to search for images?
Example URL:
First Lines of Content:
<DOCUMENT>
<TYPE>S-4
<SEQUENCE>1
<FILENAME>ds4.htm
<DESCRIPTION>FORM S-4
<TEXT>
<HTML><HEAD>
<TITLE>Form S-4</TITLE>
Last Lines of Content:
</BODY></HTML>
</TEXT>
</DOCUMENT>
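For illustration, here is a rough sketch of the "strip the first X lines" idea done outside of wget, in Python, using only the standard library. It is a workaround under assumptions, not a wget feature: the helper names (extract_html, ImageCollector, image_urls) are made up for this example, and it assumes the wrapper always surrounds a single <HTML>...</HTML> block as in the sample above. It cuts the document down to the HTML part and collects absolute image URLs, which could then be fetched separately, for example by writing them to a file and passing it to wget -i.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


def extract_html(raw):
    """Return the text from the first <HTML> tag through the last </HTML> tag.

    If no <HTML> tag is found, the input is returned unchanged.
    """
    start = re.search(r"<html\b", raw, re.IGNORECASE)
    if start is None:
        return raw
    body = raw[start.start():]
    # Use the last closing tag so trailing wrapper lines are dropped.
    ends = list(re.finditer(r"</html\s*>", body, re.IGNORECASE))
    stop = ends[-1].end() if ends else len(body)
    return body[:stop]


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(urljoin(self.base_url, src))


def image_urls(raw, base_url):
    """Strip the non-HTML wrapper, then list the image URLs in the document."""
    parser = ImageCollector(base_url)
    parser.feed(extract_html(raw))
    parser.close()
    return parser.images
```

The raw text would come from downloading the page first (with wget -O or urllib.request), and base_url would be the URL it was downloaded from, so that relative image paths resolve correctly.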
© Stack Overflow or respective owner