Exclude list of specific files in wget
Posted by nanker on Super User, published 2012-10-13T08:24:03Z
Tags: wget
I am trying to download a lot of pages from a website on dial-up, and it can be brutally slow. I have almost got the perfect wget command, but because I'm downloading pages from the same site, wget wastes time downloading the same standard images for each page.

If I know the names of the default page images, is there any way to have wget ignore them and thus avoid downloading them for each and every page?
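The closest thing I have found in the wget man page is the -R / --reject option, which is supposed to take a comma-separated list of file name suffixes or patterns, but I have not tested whether it plays nicely with -p and -A. A minimal sketch, with a made-up URL and made-up image names:

wget -p -nd -A jpg,html -k -R "Site-Logo.jpg,Sidebar-Button.jpg" http://www.example.com/some-page/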
Here is an example of one of the command blocks that my shell script writes into another shell script to download each of the pages:
mkdir candy-canes-on-the-flannel-board-in-preschool
cd candy-canes-on-the-flannel-board-in-preschool
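# grab the page plus its jpg/html requisites into this directory, converting links for local viewing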
wget -p -nd -A jpg,html -k http://www.teachpreschool.org/2011/12/candy-canes-on-the-flannel-board-in-preschool/
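# fetch the same page again under the post's name (see the note below about index.html not opening)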
wget -c --random-wait --timeout=30 --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" http://www.teachpreschool.org/2011/12/candy-canes-on-the-flannel-board-in-preschool/ -O "candy-canes-on-the-flannel-board-in-preschool"
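# delete the standard site-wide images (plus index.html and robots.txt) that come down with every page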
rm Baby-and-Toddler.jpg Childrens-Books.jpg Creative-Art.jpg Felt-Fun.jpg Happy_Rainbow-e1338766526528.jpg index.html Language-and-Literacy.jpg Light-table-Button.jpg Math.jpg Outdoor-Play.jpg outer-jacket1-300x153.jpg preschoolspot-button-small.jpg robots.txt Science-and-Nature.jpg Signature-2.jpg Story-Telling.jpg Tags-on-Preschool.jpg Teaching-Two-and-Three-Year-olds.jpg
cd ../
Now, I realize the script is not as savvy as it could be, but it is doing what I need at the moment, except that, as you can see from the rm command, I would just like to prevent wget from downloading those files in the first place if possible.
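If --reject really does cover this case, I imagine the first wget line above could simply carry the same file names that the rm line cleans up afterwards. An untested sketch (I left index.html and robots.txt out of the list, since wget appears to fetch those on its own regardless):

wget -p -nd -A jpg,html -k -R "Baby-and-Toddler.jpg,Childrens-Books.jpg,Creative-Art.jpg,Felt-Fun.jpg,Happy_Rainbow-e1338766526528.jpg,Language-and-Literacy.jpg,Light-table-Button.jpg,Math.jpg,Outdoor-Play.jpg,outer-jacket1-300x153.jpg,preschoolspot-button-small.jpg,Science-and-Nature.jpg,Signature-2.jpg,Story-Telling.jpg,Tags-on-Preschool.jpg,Teaching-Two-and-Three-Year-olds.jpg" http://www.teachpreschool.org/2011/12/candy-canes-on-the-flannel-board-in-preschool/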
I almost forgot to mention that there are two wget commands because the first one downloads the page as index.html, which for some reason does not open in my browser. However, when I look at it in vim, all of the page's content is there, so I am not sure why it will not open. But if I just issue the second wget command as it is, then that page (really the same file under an alternate name) opens up fine. Fixing that would also help to streamline the process.