wget
The simplest syntax is:
wget url
The result is a single index.html file. On its own this file is of limited use, because the images and stylesheets it refers to are still held on the remote server (Google's, in this example) and are loaded from there.
To download the full site and all the pages you can use the following command:
wget -r url
This downloads the pages recursively up to a maximum of 5 levels deep.
5 levels deep might not be enough to get everything from the site. You can use the -l switch to set the number of levels you wish to go to as follows:
wget -r -l10 url
If you want infinite recursion you can use the following:
wget -r -l inf url
You can also replace the inf with 0 which means the same thing.
There is still one more problem. You might have all the pages locally, but the links in those pages still point to their original location. It is therefore not possible to follow links between the local pages.
You can get around this problem by using the -k switch which converts all the links on the pages to point to their locally downloaded equivalent as follows:
wget -r -k url
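If you want a complete mirror of a website you can use the -m switch, which is shorthand for -r -l inf with timestamping (-N) turned on, so you no longer need the -r and -l switches. It does not convert links, so add -k as well if you want to browse the copy locally.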
wget -m url
Therefore if you have your own website you can make a complete backup using this one simple command.
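As a rough sketch, the backup can be wrapped in a small shell function. The function name mirror_site and the URL are made up for illustration, and the -p and -np switches are additions that the text above does not cover; they are assumed to be what you want for a browsable local copy.

```shell
# Hypothetical helper: mirror a site into the current directory.
#   -m   mirror (recursive, unlimited depth, timestamping)
#   -k   convert links so the local copy can be browsed offline
#   -p   also fetch the images and stylesheets each page needs
#   -np  never climb above the starting directory
mirror_site() {
    wget -m -k -p -np "$1"
}

# Example call (placeholder URL):
#   mirror_site https://www.example.com/
```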
Run wget As A Background Command
You can get wget to run as a background command leaving you able to get on with your work in the terminal window whilst the files download.
Simply use the following command:
wget -b url
You can of course combine switches. To run the wget command in the background whilst mirroring the site you would use the following command:
wget -b -m url
You can simplify this further as follows:
wget -bm url
Logging
If you are running the wget command in the background you won’t see any of the normal messages that it sends to the screen.
You can get all of those messages sent to a log file so that you can check on progress at any time using the tail command.
To output information from the wget command to a log file use the following command:
wget -o /path/to/mylogfile url
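Putting the background and logging switches together might look like the sketch below. The function name, URL and log path are placeholders rather than anything wget requires.

```shell
# Hypothetical helper: start a background mirror and log to a file.
#   $1 = URL to mirror, $2 = path of the log file
mirror_in_background() {
    wget -b -m -o "$2" "$1"
}

# Example (placeholder URL and path):
#   mirror_in_background https://www.example.com/ "$HOME/wget-mirror.log"
#   tail -f "$HOME/wget-mirror.log"   # follow progress; Ctrl+C stops tail only
```

If you leave out -o when using -b, wget writes its messages to a file called wget-log in the current directory instead.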
The reverse of course is to require no logging at all and no output to the screen. To omit all output use the following command:
wget -q url
Download From Multiple Sites
You can set up an input file to download from many different sites.
Open up a file using your favourite editor or even the cat command and simply start listing the sites or links to download from on each line of the file.
Save the file and then run the following wget command:
wget -i /path/to/inputfile
Apart from backing up your own website or maybe finding something to download to read on the train it is unlikely that you will want to download an entire website.
You are more likely to download a single URL with images or perhaps download files such as zip files, ISO files or image files.
With that in mind you don’t want to have to type the following into the input file as it is time consuming:
http://www.myfileserver.com/file1.zip
http://www.myfileserver.com/file2.zip
http://www.myfileserver.com/file3.zip
If you know the base URL is always going to be the same you can just specify the following in the input file:
file1.zip
file2.zip
file3.zip
You can then provide the base URL as part of the wget command as follows:
wget -B http://www.myfileserver.com -i /path/to/inputfile
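The input file can also be created from the command line. The sketch below uses the placeholder file names and base URL from the text, writes the list to a temporary file, and leaves the actual download commented out so that nothing is fetched until you are ready.

```shell
# Write three file names, one per line, to a temporary input file.
# The names and the base URL are the placeholders used in the text.
inputfile=$(mktemp)
printf '%s\n' file1.zip file2.zip file3.zip > "$inputfile"

# Show what wget will read.
cat "$inputfile"

# Then download them relative to the base URL (uncomment to run):
#   wget -B http://www.myfileserver.com/ -i "$inputfile"
```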
Retry Options
If you have set up a queue of files to download within an input file and you leave your computer running all night to download the files you will be fairly annoyed when you come down in the morning to find that it got stuck on the first file and has been retrying all night.
You can specify the number of retries using the following switch:
wget -t 10 -i /path/to/inputfile
You might wish to use the above command in conjunction with the -T switch which allows you to specify a timeout in seconds as follows:
wget -t 10 -T 10 -i /path/to/inputfile
The above command makes up to 10 attempts for each link in the file and gives up on any attempt that stalls for 10 seconds.
It is also fairly annoying when you have downloaded 75% of a 4 gigabyte file on a slow broadband connection only for your connection to drop out.
You can use wget to retry from where it stopped downloading by using the following command:
wget -c www.myfileserver.com/file1.zip
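The resume switch combines naturally with the retry and timeout switches. The following is only a sketch: the function name is invented, the numbers are the ones used above, and resuming relies on the server supporting partial downloads.

```shell
# Hypothetical helper: resume a partial download, retrying on failure.
#   -c     continue from where the existing partial file ends
#   -t 10  give up after 10 tries
#   -T 10  treat the connection as timed out after 10 seconds
resume_download() {
    wget -c -t 10 -T 10 "$1"
}

# Example call (placeholder URL from the text):
#   resume_download www.myfileserver.com/file1.zip
```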
If you are hammering a server the host might not like it too much and might either block or just kill your requests.
You can specify a wait period which specifies how long to wait between each retrieval as follows:
wget -w 60 -i /path/to/inputfile
The above command will wait 60 seconds between each download. This is useful if you are downloading lots of files from a single source.
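Some web hosts might spot the regular interval, however, and block you anyway. You can randomise the wait period so that it looks less like a program is doing the downloading. The --random-wait switch varies each pause around the value given with -w, so use the two together as follows:
wget -w 60 --random-wait -i /path/to/inputfile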
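To get a feel for the numbers: the wget manual describes the random pause as varying between 0.5 and 1.5 times the -w value. The small sketch below works out that range for a 60 second wait; the variable names are only for illustration.

```shell
# Range of pauses for a given base wait, following the 0.5x to 1.5x rule.
wait_seconds=60
min_wait=$((wait_seconds / 2))
max_wait=$((wait_seconds * 3 / 2))
echo "pauses between $min_wait and $max_wait seconds"

# Matching wget call (uncomment to run):
#   wget -w "$wait_seconds" --random-wait -i /path/to/inputfile
```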
Protecting Download Limits
Many internet service providers still apply download limits for your broadband usage, especially if you live outside of a city.
You may want to add a quota so that you don’t blow that download limit. You can do that in the following way:
wget -Q 100m -i /path/to/inputfile
Note that the quota switch is an upper-case -Q (lower-case -q is the quiet switch) and that it won’t work with a single file. So if you download a file that is 2 gigabytes in size, using -Q 1000m will not stop the file downloading.
The quota is only applied when recursively downloading from a site or when using an input file.
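As a small sketch, a quota-limited batch download could be wrapped like this. The function name is hypothetical and the 100m figure is just the example value from above.

```shell
# Hypothetical helper: work through a list of downloads, but stop
# starting new ones once roughly 100 MB has been fetched.
#   $1 = path to the input file
download_with_quota() {
    wget -Q 100m -i "$1"
}

# Example call (placeholder path):
#   download_with_quota /path/to/inputfile
```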
Getting Through Security
Some sites require you to log in to be able to access the content you wish to download.
You can use the following switches to specify the username and password.
wget --user=yourusername --password=yourpassword <URL>
Note that on a multi-user system anybody who runs the ps command will be able to see your username and password.
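One way to keep the password out of the process list is to let wget prompt for it with the --ask-password switch. A minimal sketch, with a made-up function name and placeholder arguments:

```shell
# Hypothetical helper: download with a username, typing the password
# at a prompt instead of putting it on the command line.
#   $1 = username, $2 = URL
fetch_with_prompt() {
    wget --user="$1" --ask-password "$2"
}

# Example call (placeholder values):
#   fetch_with_prompt yourusername https://www.example.com/private/file.zip
```

The username is still visible to ps, but the password is not.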
Other Download Options
By default the -r switch will recursively download the content and will create directories as it goes.
You can get all the files to download to a single folder using the following switch:
wget -nd -r <url>
The opposite of this is the -x switch, which forces the creation of directories. Recursive downloads already create them by default, so -x mainly matters when you are not using -r:
wget -x <url>
How To Download Certain File Types
If you want to download recursively from a site but you only want to download a specific file type such as an mp3 or an image such as a png you can use the following syntax:
wget -A "*.mp3" -r <url>
The reverse of this is to ignore certain files. Perhaps you don’t want to download executables. In this case you would use the following syntax:
wget -R "*.exe" -r <url>
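The accept switch also takes a comma-separated list, and it combines well with -nd when you just want the files in one folder. The sketch below assumes that is what you want; the function name and URL are invented for the example.

```shell
# Hypothetical helper: recursively fetch only files matching a pattern
# list and drop them all into the current directory.
#   $1 = accept list, e.g. "*.mp3" or "*.mp3,*.ogg"
#   $2 = URL to start from
fetch_by_type() {
    wget -r -nd -A "$1" "$2"
}

# Example call (placeholder URL); keep the quotes so the shell does not
# expand the pattern against local files:
#   fetch_by_type "*.mp3,*.ogg" https://www.example.com/music/
```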