WGET to get them all from the web!

You might have come across the famous tech joke where an Internet newbie asks: “I want to download the Internet. Do I need a bigger hard disk?” LOL!
Well, there might not be any software or command currently available to download the entire Internet, but we do have a truly awesome command in Linux that can download an entire website! Surprised? Then read further.

wget is a fantastic command available in all Linux distros, and it can be customised (parameterised) to download an entire website, or part of it, as well as individual files from the Internet. Simply put, wget is to Linux what IDM is to Windows (I have a feeling I am overstating the power of IDM). Let me show you how wget can do wonders!

Let's hit the bull's eye first.

$ wget https://getch.wordpress.com

The above command downloads the homepage of getch.wordpress.com (index.html) and saves it in the current working directory.

So now you want to download not just the homepage but the entire site. In other words, you want to recursively download all the content linked from my blog's homepage. For that we supply the recursive parameter (-r) to the wget command.

$ wget -r -p https://getch.wordpress.com

This means that you also get all the pages (and images and other data) linked from the front page. The -p parameter tells wget to download all the files needed to display each page properly (images, stylesheets and so on), so the downloaded copy looks just as it would online.

Some sites block wget requests on the grounds that they don't originate from a browser. So we disguise wget's requests to make them appear to come from a browser like Firefox. This is how you do it:

$ wget -r -p -U Mozilla https://getch.wordpress.com

-U Mozilla does the trick here: -U (short for --user-agent) sets the User-Agent string that wget sends, so the request looks like it came from a Mozilla browser.

So that you don't get blacklisted for hammering a site with wget, pause for 30 seconds between retrievals and cap the download speed accordingly.

$ wget --wait=30 --limit-rate=50K -r -p -U Mozilla https://getch.wordpress.com

Here wget waits 30 seconds between retrievals and the download rate is capped at 50 KB/s.
What if you have to interrupt a download and pick it up again later? Yes, there is a solution for that too: the -c (continue) parameter.

$ wget -c http://ubunturelease.hnsdc.com/maverick/ubuntu-10.10-desktop-i386.iso

Here we are downloading the Ubuntu 10.10 desktop ISO, which weighs in at around 700 MB. So if you ever have to interrupt the download, running the above command again will resume it from where you stopped.

If you want things done behind the scenes, use -b. This parameter performs the download in the background (logging progress to wget-log in the current directory) so you can just take care of your other tasks.
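For example, reusing the Ubuntu ISO URL from above (the command is wrapped in echo here as a sketch, so the 700 MB download doesn't kick off by accident; drop the echo to run it for real):

```shell
# -b backgrounds the download and logs progress to ./wget-log;
# combined with -c, re-running the same command resumes where it left off.
echo wget -b -c http://ubunturelease.hnsdc.com/maverick/ubuntu-10.10-desktop-i386.iso

# You can then follow the background download with: tail -f wget-log
```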

I am too lazy to enter each URL every time I need to download something. So I just put all those URLs in a text file, one per line, and supply the file as input to wget so that I can sit back and have a cup of coffee. The -i parameter does exactly that.
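A quick sketch of that idea (url-list.txt is just a name I made up, and the actual wget line is commented out so nothing gets fetched until you're ready):

```shell
# Put one URL per line in a plain text file.
cat > url-list.txt <<'EOF'
https://getch.wordpress.com
http://ubunturelease.hnsdc.com/maverick/ubuntu-10.10-desktop-i386.iso
EOF

# -i makes wget read every URL from the file; add -b to push the whole
# batch into the background while you enjoy that coffee.
# wget -b -i url-list.txt

wc -l < url-list.txt   # sanity check: two URLs queued
```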

And to the final tip!
Mirror an entire website for offline reading. The format is:
$ wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL
and to quote an example:
$ wget --mirror -p --convert-links -P /home/manojkumar https://getch.wordpress.com

--mirror : turn on options suitable for mirroring.

-p : download all files that are necessary to properly display a given HTML page.

--convert-links : after the download, convert the links in the documents so they work for local viewing.

-P ./LOCAL-DIR : save all the files and directories to the specified directory.

Windows has a costly alternative that does part of what wget does: Teleport, which lets you download an entire site for offline browsing. For more details and purchasing options visit: http://www.tenmax.com/teleport/