WGET to get them all from the web!

You might have come across the famous tech joke where an Internet newbie asks: “I want to download the Internet. Do I need a bigger hard disk?” LOL!
Well, there may not be any software or command that can download the entire Internet, but we do have a truly awesome command in Linux that can download an entire website! Surprised? Then read further.

Wget is a fantastic command available in all Linux distros, and it can be customised (parameterised) to download an entire website, or a part of it, as well as individual files from the Internet. Simply put, wget is to Linux what IDM is to Windows (I have a feeling that I am overstating the power of IDM). Let me show you how wget can do wonders!

Let’s hit the bull’s eye first.

$ wget    https://getch.wordpress.com

The above command downloads the homepage of getch.wordpress.com (index.html) and saves it in the current working directory.

Now suppose you want to download not only the homepage but the entire site. In other words, you may want to recursively download all the content linked from my blog’s homepage. For that we supply the recursive parameter (-r) to the wget command.

$ wget   -r  -p  https://getch.wordpress.com

This means that you also get all pages (and images and other data) linked from the front page. The -p parameter tells wget to fetch all the files needed to display each page, including the images, so the downloaded copies look just as they would online.

Some sites try to block wget requests because they don’t originate from a browser. So we disguise wget’s accesses to make them appear as though they come from a browser like Firefox. This is how you do it:

$ wget  -r   -p   -U   Mozilla    https://getch.wordpress.com

-U Mozilla does the trick here.

So that you don’t get blacklisted for running wget over a site, pause for 30 seconds between retrievals and limit the download rate accordingly.

$ wget --wait=30 --limit-rate=50K -r -p -U Mozilla https://getch.wordpress.com

Here wget waits for 30 seconds between retrievals and the download rate is limited to 50 KB/s.
What if you have to pause a download and resume it later? Yes, there is a solution for that too. Use the -c parameter.

$ wget  -c   http://ubunturelease.hnsdc.com/maverick/ubuntu-10.10-desktop-i386.iso

Here we are trying to download an Ubuntu Linux ISO of around 700 MB. If you ever have to interrupt the download, running the above command again will resume it from where you stopped.

If you want to get things done under the hood, use -b. This parameter performs the download in the background so you can get on with other tasks.
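
A minimal sketch, reusing the Ubuntu ISO from the earlier example: wget detaches immediately and, unless told otherwise, writes its progress to a file named wget-log in the current directory.

$ wget -b -c http://ubunturelease.hnsdc.com/maverick/ubuntu-10.10-desktop-i386.iso
$ tail -f wget-log        # peek at the progress whenever you like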

I am too lazy to type each URL every time I need to download something. So I just list all those URLs once in a text file and feed it to wget as input, then sit back and have a cup of coffee. No need to run a separate download for each one; -i does it.
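
For instance, assuming the URLs are listed one per line in a text file (the name download-list.txt below is just a placeholder), a single command fetches them all:

$ wget -i download-list.txt

The -i flag combines happily with the other options above, such as -c or -b.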

And to the final tip!
Mirror an entire website for offline reading. The format is:
$ wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL
and, to quote an example:
$ wget --mirror -p --convert-links -P /home/manojkumar https://getch.wordpress.com

--mirror : turn on options suitable for mirroring.

-p : download all files that are necessary to properly display a given HTML page.

--convert-links : after the download, convert the links in the documents for local viewing.

-P ./LOCAL-DIR : save all the files and directories to the specified directory.

Windows has a costly alternative that can do part of what wget does: the Teleport software for Windows lets you download an entire site for offline browsing. For more details and purchasing options visit: http://www.tenmax.com/teleport/


16 comments on “WGET to get them all from the web!”

  1. That is awesome and a sweet way to back up blog content. I wish I’d known about this before… I’ve started and deleted too many blogs in my time with no real method of backing it up.

  2. Hi @awkisopen
    I wonder why you deleted your sites. There is an export option in the WordPress admin panel (Tools -> Export). You could have used it to back up your sites.

  3. Actually wait. I think I’m right.

    If I visit your Posterous and right-click to view an image, it shows me the image comes from WordPress, not Posterous. I think it preserves the link to the image (which is just text after all) but not the image itself. So while it works for a currently active blog, it wouldn’t save the images off a deleted one, like wget would.

  4. Nice post. You have made it a point to cover all the significant features of wget.

    Period
    You told me about your difficulty in downloading .php files from the Web. I think the ‘-F’ option should solve the problem. I browsed the wget documentation just now and found this option. I have pasted the documentation of ‘-F’ below.
    =======================================
    `-F’
    `–force-html’
    When input is read from a file, force it to be treated as an HTML
    file. This enables you to retrieve relative links from existing
    HTML files on your local disk, by adding `’ to
    HTML, or using the `–base’ command-line option.
    ========================================
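
    As a rough, untested sketch (links.html is just a placeholder name for a local file holding the links), it would be used along these lines:
    $ wget -F -i links.html --base=https://getch.wordpress.com/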

    Hope that helps.

    Period.

    I just want you to consider my perspective (and that of a million others) here: Linux is a kernel and is just a part of the operating system (~3%); the other programs are system and application programs. Most of those system and application programs belong to the GNU project (http://www.gnu.org). So it is appropriate to call the operating system “GNU+Linux” and not just “Linux”. I urge you to read this link (the article is written by Richard Stallman):
    http://www.gnu.org/gnu/linux-and-gnu.html

    I hope you start calling the system “GNU+Linux” or “GNU/Linux” instead of just “Linux”. Thank you very much.

  5. Howdy pain,

    Try whether my previous comment works. I have not tried it myself, but I came across this while I was reading through the wget manual.

  6. I too am having trouble with it only downloading the index.html and nothing else. I tried the -F option Siddharth suggested and that didn’t work either. I have tried both from my local machine (OS X) and a remote server running Linux via SSH.

  7. Hello Jozz,

    Sorry for being so late with my response. I totally forgot the context in which I described the “-F” option in my previous comment. I am also quite out of touch with wget. Therefore, I may not be of much help to you.

    I suggest that you subscribe to the wget mailing list and tell them about your difficulties in downloading the index.html file.
    You may subscribe to the wget mailing list by going here :
    MailingList

    The official documentation is also a good place to find answers, given you have enough time and patience: the GNU wget Manual.

    and here is the link to wget’s main page :
    GNU wget.

    Hope that helps.
    R.Siddharth.

  8. Hello R

    Thanks for your response. I was trying to archive an old website for a client off some weird CMS. It turns out that the CMS generated an index.htm file with a meta refresh tag that redirected to the “home” page. I have never come across that before! Wget wasn’t picking up on this (fair enough). I think I ended up using some other tool to download the site, though I’m sure it would have worked if I had pointed wget at the home page.

    Cheers,
    Jozz

  9. so what happens if I want to download all my albums from Facebook then? I’ve tried wget -m -r -A.jpg --no-check-certificate https://fbcdn-sphotos… blablabla.jpg BUT it only gets one picture of one album… any idea on how I can specify the desired album/s? 😀
