GNU Wget

GNU Wget is a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols. It is a non-interactive command-line tool, so it can easily be called from scripts, cron jobs, terminals without X Window System support, and so on.

GNU Wget has many features to make retrieving large files or mirroring entire web or FTP sites easy, including:

  • Can resume aborted downloads, using REST and RANGE

  • Can use filename wild cards and recursively mirror directories

  • NLS-based message files for many different languages

  • Optionally converts absolute links in downloaded documents to relative, so that downloaded documents may link to each other locally

  • Runs on most UNIX-like operating systems as well as Microsoft Windows

  • Supports HTTP proxies

  • Supports HTTP cookies

  • Supports persistent HTTP connections

  • Unattended / background operation

  • Uses local file timestamps to determine whether documents need to be re-downloaded when mirroring

GNU Wget is distributed under the GNU General Public License.

Installation

The wget package is pre-installed on most Linux distributions today. If it is missing, install it with your system's package manager:

# macOS
$ brew install wget

# Ubuntu and Debian
$ sudo apt install wget

# CentOS and Fedora (newer Fedora releases use dnf instead of yum)
$ sudo yum install wget
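
To confirm that wget is available, print the installed version (the -V / --version option is listed in the full help output at the end of this page):

$ wget --version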

Usage

Download a File with wget

In its simplest form, when used without any options, wget downloads the resource specified by [URL] to the current working directory.

In the following example, we are downloading the Linux kernel tar archive:

$ wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz

To suppress the output, use the -q option.

If a file with the same name already exists, wget appends .N (a number) to the name of the new download instead of overwriting the existing file.
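
As a quick illustration, repeating the earlier kernel download with -q prints nothing, and since linux-4.17.2.tar.xz already exists, the new copy is saved with a numeric suffix (the exact suffix depends on how many copies are already present):

$ wget -q https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz
$ ls linux-4.17.2.tar.xz*
linux-4.17.2.tar.xz  linux-4.17.2.tar.xz.1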

Download File with Different Name

To save the downloaded file under a different name, pass the -O option followed by the chosen name:

$ wget -O latest-hugo.zip https://github.com/gohugoio/hugo/archive/master.zip

The command above saves the latest Hugo zip archive from GitHub as latest-hugo.zip instead of its original name, master.zip.

Downloading a File to a Specific Directory

By default, wget will save the downloaded file in the current working directory. To save the file to a specific location, use the -P option:

$ wget -P /mnt/iso http://mirrors.mit.edu/centos/7/isos/x86_64/CentOS-7-x86_64-Minimal-1804.iso

The command above tells wget to save the CentOS 7 ISO file to the /mnt/iso directory.

Resuming a Download

You can resume an interrupted download using the -c option. This is useful if your connection drops while downloading a large file: instead of starting from scratch, wget continues from where the previous attempt stopped.

In the following example, we are resuming the download of the Ubuntu 18.04 ISO file:

$ wget -c http://releases.ubuntu.com/18.04/ubuntu-18.04-live-server-amd64.iso

If the remote server does not support resuming downloads, wget will start the download from the beginning and overwrite the existing file.
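
On unreliable connections, -c is often combined with the retry options shown in the help output further down. A minimal sketch, assuming the same Ubuntu ISO URL, that retries without limit (0 means unlimited tries) and resumes the partial file after each interruption:

$ wget -c -t 0 --retry-connrefused http://releases.ubuntu.com/18.04/ubuntu-18.04-live-server-amd64.iso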

Downloading in the Background

To download in the background, use the -b option. In the following example, we are downloading the openSUSE ISO file in the background:

$ wget -b https://download.opensuse.org/tumbleweed/iso/openSUSE-Tumbleweed-DVD-x86_64-Current.iso

By default, the output is redirected to a file named wget-log in the current directory. To watch the status of the download, use the tail command:

$ tail -f wget-log

Changing the Wget User-Agent

Sometimes when downloading a file, the remote server may be configured to block the default Wget User-Agent. In situations like this, you can emulate a different browser by passing the -U (--user-agent) option:

$ wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0" http://wget-forbidden.com/

The command above emulates Firefox 60 requesting the page from wget-forbidden.com.

Downloading Multiple Files

If you want to download multiple files at once, use the -i option followed by the path to a local or external file containing a list of the URLs to be downloaded. Each URL needs to be on a separate line.

The following example shows how to download the Arch Linux, Debian, and Fedora ISO files using the URLs specified in the linux-distros.txt file:

$ cat linux-distros.txt
http://mirrors.edge.kernel.org/archlinux/iso/2018.06.01/archlinux-2018.06.01-x86_64.iso
https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-9.4.0-amd64-netinst.iso
https://download.fedoraproject.org/pub/fedora/linux/releases/28/Server/x86_64/iso/Fedora-Server-dvd-x86_64-28-1.1.iso

$ wget -i linux-distros.txt

If you specify - as a filename, URLs will be read from the standard input.
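
For instance, the URL list can come from another command over a pipe. A minimal sketch that reuses the kernel archive URL from earlier:

$ echo "https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz" | wget -i -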

Creating a Mirror of a Website

To create a mirror of a website with wget, use the -m option. This creates a complete local copy of the website by following and downloading all internal links as well as the website resources (JavaScript, CSS, images).

$ wget -m https://example.com

Wget saves the mirrored site in a directory named after the host, starting with example.com/index.html.

If you want to use the downloaded website for local browsing, you will need to pass a few extra arguments to the command above.

$ wget -m -k -p https://example.com

The -k option will cause wget to convert the links in the downloaded documents to make them suitable for local viewing. The -p option will tell wget to download all necessary files for displaying the HTML page.
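
The same command can also be written with long option names, which can be easier to read in scripts (--mirror, --convert-links, and --page-requisites are the long forms of -m, -k, and -p):

$ wget --mirror --convert-links --page-requisites https://example.com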

Skipping Certificate Check

If you want to download a file over HTTPS from a host that has an invalid SSL certificate, use the --no-check-certificate option:

$ wget --no-check-certificate https://domain-with-invalid-ssl.com

Downloading to the Standard Output

In the following example, wget quietly (the -q flag) downloads the latest WordPress release, writes it to standard output (the -O - flag), and pipes it to the tar utility, which extracts the archive into the /var/www directory.

1
$ wget -q -O - "http://wordpress.org/latest.tar.gz" | tar -xzf - -C /var/www

Help

$ wget --help
GNU Wget 1.21.2, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are mandatory for short options too.

Startup:
-V, --version display the version of Wget and exit
-h, --help print this help
-b, --background go to background after startup
-e, --execute=COMMAND execute a `.wgetrc'-style command

Logging and input file:
-o, --output-file=FILE log messages to FILE
-a, --append-output=FILE append messages to FILE
-d, --debug print lots of debugging information
-q, --quiet quiet (no output)
-v, --verbose be verbose (this is the default)
-nv, --no-verbose turn off verboseness, without being quiet
--report-speed=TYPE output bandwidth as TYPE. TYPE can be bits
-i, --input-file=FILE download URLs found in local or external FILE
-F, --force-html treat input file as HTML
-B, --base=URL resolves HTML input-file links (-i -F)
relative to URL
--config=FILE specify config file to use
--no-config do not read any config file
--rejected-log=FILE log reasons for URL rejection to FILE

Download:
-t, --tries=NUMBER set number of retries to NUMBER (0 unlimits)
--retry-connrefused retry even if connection is refused
--retry-on-http-error=ERRORS comma-separated list of HTTP errors to retry
-O, --output-document=FILE write documents to FILE
-nc, --no-clobber skip downloads that would download to
existing files (overwriting them)
--no-netrc don't try to obtain credentials from .netrc
-c, --continue resume getting a partially-downloaded file
--start-pos=OFFSET start downloading from zero-based position OFFSET
--progress=TYPE select progress gauge type
--show-progress display the progress bar in any verbosity mode
-N, --timestamping don't re-retrieve files unless newer than
local
--no-if-modified-since don't use conditional if-modified-since get
requests in timestamping mode
--no-use-server-timestamps don't set the local file's timestamp by
the one on the server
-S, --server-response print server response
--spider don't download anything
-T, --timeout=SECONDS set all timeout values to SECONDS
--dns-timeout=SECS set the DNS lookup timeout to SECS
--connect-timeout=SECS set the connect timeout to SECS
--read-timeout=SECS set the read timeout to SECS
-w, --wait=SECONDS wait SECONDS between retrievals
(applies if more then 1 URL is to be retrieved)
--waitretry=SECONDS wait 1..SECONDS between retries of a retrieval
(applies if more then 1 URL is to be retrieved)
--random-wait wait from 0.5*WAIT...1.5*WAIT secs between retrievals
(applies if more then 1 URL is to be retrieved)
--no-proxy explicitly turn off proxy
-Q, --quota=NUMBER set retrieval quota to NUMBER
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host
--limit-rate=RATE limit download rate to RATE
--no-dns-cache disable caching DNS lookups
--restrict-file-names=OS restrict chars in file names to ones OS allows
--ignore-case ignore case when matching files/directories
-4, --inet4-only connect only to IPv4 addresses
-6, --inet6-only connect only to IPv6 addresses
--prefer-family=FAMILY connect first to addresses of specified family,
one of IPv6, IPv4, or none
--user=USER set both ftp and http user to USER
--password=PASS set both ftp and http password to PASS
--ask-password prompt for passwords
--use-askpass=COMMAND specify credential handler for requesting
username and password. If no COMMAND is
specified the WGET_ASKPASS or the SSH_ASKPASS
environment variable is used.
--no-iri turn off IRI support
--local-encoding=ENC use ENC as the local encoding for IRIs
--remote-encoding=ENC use ENC as the default remote encoding
--unlink remove file before clobber
--xattr turn on storage of metadata in extended file attributes

Directories:
-nd, --no-directories don't create directories
-x, --force-directories force creation of directories
-nH, --no-host-directories don't create host directories
--protocol-directories use protocol name in directories
-P, --directory-prefix=PREFIX save files to PREFIX/..
--cut-dirs=NUMBER ignore NUMBER remote directory components

HTTP options:
--http-user=USER set http user to USER
--http-password=PASS set http password to PASS
--no-cache disallow server-cached data
--default-page=NAME change the default page name (normally
this is 'index.html'.)
-E, --adjust-extension save HTML/CSS documents with proper extensions
--ignore-length ignore 'Content-Length' header field
--header=STRING insert STRING among the headers
--compression=TYPE choose compression, one of auto, gzip and none. (default: none)
--max-redirect maximum redirections allowed per page
--proxy-user=USER set USER as proxy username
--proxy-password=PASS set PASS as proxy password
--referer=URL include 'Referer: URL' header in HTTP request
--save-headers save the HTTP headers to file
-U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION
--no-http-keep-alive disable HTTP keep-alive (persistent connections)
--no-cookies don't use cookies
--load-cookies=FILE load cookies from FILE before session
--save-cookies=FILE save cookies to FILE after session
--keep-session-cookies load and save session (non-permanent) cookies
--post-data=STRING use the POST method; send STRING as the data
--post-file=FILE use the POST method; send contents of FILE
--method=HTTPMethod use method "HTTPMethod" in the request
--body-data=STRING send STRING as data. --method MUST be set
--body-file=FILE send contents of FILE. --method MUST be set
--content-disposition honor the Content-Disposition header when
choosing local file names (EXPERIMENTAL)
--content-on-error output the received content on server errors
--auth-no-challenge send Basic HTTP authentication information
without first waiting for the server's
challenge

HTTPS (SSL/TLS) options:
--secure-protocol=PR choose secure protocol, one of auto, SSLv2,
SSLv3, TLSv1, TLSv1_1, TLSv1_2 and PFS
--https-only only follow secure HTTPS links
--no-check-certificate don't validate the server's certificate
--certificate=FILE client certificate file
--certificate-type=TYPE client certificate type, PEM or DER
--private-key=FILE private key file
--private-key-type=TYPE private key type, PEM or DER
--ca-certificate=FILE file with the bundle of CAs
--ca-directory=DIR directory where hash list of CAs is stored
--crl-file=FILE file with bundle of CRLs
--pinnedpubkey=FILE/HASHES Public key (PEM/DER) file, or any number
of base64 encoded sha256 hashes preceded by
'sha256//' and separated by ';', to verify
peer against
--random-file=FILE file with random data for seeding the SSL PRNG

--ciphers=STR Set the priority string (GnuTLS) or cipher list string (OpenSSL) directly.
Use with care. This option overrides --secure-protocol.
The format and syntax of this string depend on the specific SSL/TLS engine.
HSTS options:
--no-hsts disable HSTS
--hsts-file path of HSTS database (will override default)

FTP options:
--ftp-user=USER set ftp user to USER
--ftp-password=PASS set ftp password to PASS
--no-remove-listing don't remove '.listing' files
--no-glob turn off FTP file name globbing
--no-passive-ftp disable the "passive" transfer mode
--preserve-permissions preserve remote file permissions
--retr-symlinks when recursing, get linked-to files (not dir)

FTPS options:
--ftps-implicit use implicit FTPS (default port is 990)
--ftps-resume-ssl resume the SSL/TLS session started in the control connection when
opening a data connection
--ftps-clear-data-connection cipher the control channel only; all the data will be in plaintext
--ftps-fallback-to-ftp fall back to FTP if FTPS is not supported in the target server
WARC options:
--warc-file=FILENAME save request/response data to a .warc.gz file
--warc-header=STRING insert STRING into the warcinfo record
--warc-max-size=NUMBER set maximum size of WARC files to NUMBER
--warc-cdx write CDX index files
--warc-dedup=FILENAME do not store records listed in this CDX file
--no-warc-compression do not compress WARC files with GZIP
--no-warc-digests do not calculate SHA1 digests
--no-warc-keep-log do not store the log file in a WARC record
--warc-tempdir=DIRECTORY location for temporary files created by the
WARC writer

Recursive download:
-r, --recursive specify recursive download
-l, --level=NUMBER maximum recursion depth (inf or 0 for infinite)
--delete-after delete files locally after downloading them
-k, --convert-links make links in downloaded HTML or CSS point to
local files
--convert-file-only convert the file part of the URLs only (usually known as the basename)
--backups=N before writing file X, rotate up to N backup files
-K, --backup-converted before converting file X, back up as X.orig
-m, --mirror shortcut for -N -r -l inf --no-remove-listing
-p, --page-requisites get all images, etc. needed to display HTML page
--strict-comments turn on strict (SGML) handling of HTML comments

Recursive accept/reject:
-A, --accept=LIST comma-separated list of accepted extensions
-R, --reject=LIST comma-separated list of rejected extensions
--accept-regex=REGEX regex matching accepted URLs
--reject-regex=REGEX regex matching rejected URLs
--regex-type=TYPE regex type (posix)
-D, --domains=LIST comma-separated list of accepted domains
--exclude-domains=LIST comma-separated list of rejected domains
--follow-ftp follow FTP links from HTML documents
--follow-tags=LIST comma-separated list of followed HTML tags
--ignore-tags=LIST comma-separated list of ignored HTML tags
-H, --span-hosts go to foreign hosts when recursive
-L, --relative follow relative links only
-I, --include-directories=LIST list of allowed directories
--trust-server-names use the name specified by the redirection
URL's last component
-X, --exclude-directories=LIST list of excluded directories
-np, --no-parent don't ascend to the parent directory

Email bug reports, questions, discussions to <bug-wget@gnu.org>
and/or open issues at https://savannah.gnu.org/bugs/?func=additem&group=wget.
