wget and curl – Technote

wget

simplest syntax:

wget url

If the URL is a site's home page, such as Google's, the result is a single index.html file. On its own this file is of limited use, because the images and stylesheets it refers to are still held on the remote server.

To download the full site and all the pages you can use the following command:

wget -r url

This downloads the pages recursively up to a maximum of 5 levels deep.

5 levels deep might not be enough to get everything from the site. You can use the -l switch to set the number of levels you wish to go to as follows:

wget -r -l10 url

If you want infinite recursion you can use the following:

wget -r -l inf url

You can also replace the inf with 0 which means the same thing.

There is still one more problem. You might get all the pages locally but all the links in the pages still point to their original place. It is therefore not possible to click locally between the links on the pages.

You can get around this problem by using the -k switch which converts all the links on the pages to point to their locally downloaded equivalent as follows:

wget -r -k url

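If you want a complete mirror of a website you can use the -m switch, which is shorthand for -r -N -l inf --no-remove-listing: infinite recursion plus timestamp checking, so that unchanged files are not downloaded again. It does not convert links, so add -k as well if you want to browse the copy locally.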

wget -m url

Therefore if you have your own website you can make a complete backup using this one simple command.

Run wget As A Background Command

You can get wget to run as a background command leaving you able to get on with your work in the terminal window whilst the files download.

Simply use the following command:

wget -b url

You can of course combine switches. To run the wget command in the background whilst mirroring the site you would use the following command:

wget -b -m url

You can simplify this further as follows:

wget -bm url

Logging

If you are running the wget command in the background you won’t see any of the normal messages that it sends to the screen. By default they go to a file named wget-log in the current directory.

You can send those messages to a log file of your choice so that you can check on progress at any time using the tail command.

To output information from the wget command to a log file use the following command:

wget -o /path/to/mylogfile url
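
As a rough sketch of that workflow, the two helper functions below start a background download with a log file and then show the latest log lines. The function names, the LOG_FILE default and the example URL are made up for illustration; only the -b and -o switches and the tail command come from the text above.

```shell
# Hypothetical log location; change it to suit your system.
LOG_FILE="${LOG_FILE:-./wget-download.log}"

# Start a download in the background, sending wget's messages to LOG_FILE.
start_download() {
    wget -b -o "$LOG_FILE" "$1"
}

# Show the last 20 lines of the log to check on progress.
check_progress() {
    tail -n 20 "$LOG_FILE"
}

# Example usage (not run here):
#   start_download https://www.example.com/file1.zip
#   check_progress
```

To follow the log continuously instead of taking a snapshot, run tail -f on the same file.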

The reverse of course is to require no logging at all and no output to the screen. To omit all output use the following command:

wget -q url

Download From Multiple Sites

You can set up an input file to download from many different sites.

Open up a file using your favourite editor, or even the cat command, and list the sites or links to download from, one per line.

Save the file and then run the following wget command:

wget -i /path/to/inputfile
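
If you prefer the cat route, a here-document is one way to create the input file in a single step. This is a minimal sketch; the file name and the URLs are placeholders chosen for illustration.

```shell
# Hypothetical file name and URLs, used only for illustration.
INPUT_FILE="./download-list.txt"

# Write one URL per line; the quoted 'EOF' stops the shell expanding anything.
cat > "$INPUT_FILE" <<'EOF'
https://www.example.com/file1.zip
https://www.example.com/file2.zip
https://www.example.org/images/photo.png
EOF

# Count the non-empty lines as a quick sanity check before downloading.
URL_COUNT=$(grep -c . "$INPUT_FILE")
echo "$URL_COUNT URLs queued in $INPUT_FILE"

# To start the downloads (not run here):
#   wget -i "$INPUT_FILE"
```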

Apart from backing up your own website or maybe finding something to download to read on the train it is unlikely that you will want to download an entire website.

You are more likely to download a single URL with images or perhaps download files such as zip files, ISO files or image files.

With that in mind you don’t want to have to type the following into the input file as it is time consuming:

http://www.myfileserver.com/file1.zip
http://www.myfileserver.com/file2.zip
http://www.myfileserver.com/file3.zip

If you know the base URL is always going to be the same you can just specify the following in the input file:

file1.zip
file2.zip
file3.zip

You can then provide the base URL as part of the wget command as follows:

wget -B http://www.myfileserver.com -i /path/to/inputfile
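
To see roughly what that combination asks wget to fetch, the sketch below joins the base URL to each relative entry and prints the result. It is only an approximation done in the shell for simple file names, not wget's own URL resolution, and the input file name is hypothetical.

```shell
BASE_URL="http://www.myfileserver.com"
INPUT_FILE="./relative-list.txt"   # hypothetical file name

# Create the input file with relative names only.
printf '%s\n' file1.zip file2.zip file3.zip > "$INPUT_FILE"

# Join the base URL and each entry to preview the addresses.
RESOLVED=$(while IFS= read -r name; do
    printf '%s/%s\n' "$BASE_URL" "$name"
done < "$INPUT_FILE")

echo "$RESOLVED"

# The corresponding download command (not run here):
#   wget -B "$BASE_URL" -i "$INPUT_FILE"
```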

Retry Options

If you have set up a queue of files in an input file and leave your computer running all night to download them, you will be fairly annoyed to find in the morning that wget got stuck on the first file and has been retrying it all night.

You can specify the number of retries using the following switch:

wget -t 10 -i /path/to/inputfile

You might wish to use the above command in conjunction with the -T switch which allows you to specify a timeout in seconds as follows:

wget -t 10 -T 10 -i /path/to/inputfile

The above command makes up to 10 attempts for each link in the file and gives up on any DNS lookup, connection attempt or stalled read that takes longer than 10 seconds.

It is also fairly annoying when you have downloaded 75% of a 4 gigabyte file on a slow broadband connection only for your connection to drop out.

You can use wget to retry from where it stopped downloading by using the following command:

wget -c www.myfileserver.com/file1.zip

If you are hammering a server the host might not like it too much and might either block or just kill your requests.

You can specify a wait period which specifies how long to wait between each retrieval as follows:

wget -w 60 -i /path/to/inputfile

The above command will wait 60 seconds between each download. This is useful if you are downloading lots of files from a single source.

Some web hosts might spot the regular interval, however, and block you anyway. You can add the --random-wait switch, which varies each wait between 0.5 and 1.5 times the -w value, so that the requests look less like they come from a program:

wget -w 60 --random-wait -i /path/to/inputfile
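
The retry, timeout, resume and wait switches in this section can be combined for an unattended overnight run. The following is a minimal sketch with illustrative values; the variable names are made up, and the command is only printed so that it can be reviewed before it is run.

```shell
INPUT_FILE="/path/to/inputfile"

# -t 10          up to 10 attempts per file
# -T 10          10-second network timeout
# -c             continue partially downloaded files
# -w 60          wait 60 seconds between retrievals
# --random-wait  vary that wait so the requests are less regular
WGET_OPTS="-t 10 -T 10 -c -w 60 --random-wait"

# $WGET_OPTS is deliberately unquoted so the shell splits it into separate switches.
# Print the full command; drop the leading "echo" to actually run it.
echo wget $WGET_OPTS -i "$INPUT_FILE"
```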

Protecting Download Limits

Many internet service providers still apply download limits for your broadband usage, especially if you live outside of a city.

You may want to add a quota so that you don’t blow that download limit. You can do that in the following way:

wget -Q 100m -i /path/to/inputfile

Note that the -Q switch (upper case; lower-case -q means quiet) won’t work with a single file. So if you download a file that is 2 gigabytes in size, using -Q 1000m will not stop the file downloading.

The quota is only applied when recursively downloading from a site or when using an input file.

Getting Through Security

Some sites require you to log in to be able to access the content you wish to download.

You can use the following switches to specify the username and password.

wget --user=yourusername --password=yourpassword <URL>

Note that on a multi-user system anybody who runs the ps command will be able to see your username and password.
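
One way to keep the credentials off the command line is to put them in a wgetrc-style file that only you can read, and point wget at it with the WGETRC environment variable. This is a sketch of one possible alternative, not something the text above prescribes; the file name is hypothetical and the username and password are placeholders.

```shell
CRED_FILE="./wget-credentials"   # hypothetical file name

# Create the file inside a subshell with a restrictive umask,
# so the umask change does not affect the rest of the session.
( umask 077
  printf 'user=%s\npassword=%s\n' "yourusername" "yourpassword" > "$CRED_FILE" )

# Tighten the permissions as well, in case the file already existed.
chmod 600 "$CRED_FILE"

ls -l "$CRED_FILE"

# To use the file (not run here):
#   WGETRC="$CRED_FILE" wget <URL>
```

When WGETRC is set, wget reads that file in place of your usual ~/.wgetrc, so any settings you rely on there need to be copied across. wget also has an --ask-password switch that prompts for the password instead of taking it as an argument.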

Other Download Options

By default the -r switch will recursively download the content and will create directories as it goes.

You can get all the files to download to a single folder using the following switch:

wget -nd -r <url>

The opposite of this is to force the creation of directories which can be achieved using the following command:

wget -x -r <url>

How To Download Certain File Types

If you want to download recursively from a site but you only want to download a specific file type such as an mp3 or an image such as a png you can use the following syntax:

wget -A "*.mp3" -r <url>

The reverse of this is to ignore certain files. Perhaps you don’t want to download executables. In this case you would use the following syntax:

wget -R "*.exe" -r <url>
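
The quotes around the pattern matter: without them the shell may expand *.mp3 against files in the current directory before wget ever sees it. The sketch below demonstrates this in a temporary directory with two dummy files. It does not call wget, and the variable names are illustrative.

```shell
# Work in a scratch directory containing two files that match the pattern.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.mp3 b.mp3

# Unquoted: the shell replaces the pattern with the matching file names.
set -- *.mp3
unquoted_count=$#

# Quoted: the pattern is passed through as a single literal argument.
set -- "*.mp3"
quoted_count=$#

echo "unquoted: $unquoted_count argument(s), quoted: $quoted_count argument(s)"
# prints: unquoted: 2 argument(s), quoted: 1 argument(s)

# Clean up.
cd - > /dev/null
rm -rf "$demo_dir"
```

The -A and -R switches also accept a comma-separated list, for example -A "mp3,png".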