To download a website for offline browsing using wget, you can use the following command:

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains example.com --no-parent http://www.example.com

Replace example.com with the website you want to download. This command will download the entire website, including all the elements that compose the page (images, CSS, etc.), and convert the links so that they work locally and offline[1].
Here’s a brief explanation of the options used in the command:
- --recursive: Download the entire website recursively.
- --no-clobber: Don't overwrite existing files.
- --page-requisites: Download all the elements required to display the page properly (images, CSS, etc.).
- --html-extension: Save files with the .html extension.
- --convert-links: Convert links so that they work locally and offline.
- --restrict-file-names=windows: Modify filenames so that they also work on Windows.
- --domains example.com: Don't follow links outside the specified domain.
- --no-parent: When recursing, don't ascend to the parent directory, so only the specified directory and below is downloaded.
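The command and its options can be wrapped in a small reusable script. This is only a sketch: the mirror_site function name and the DRY_RUN switch are my own additions, not wget features.

```shell
#!/bin/sh
# Sketch: wrap the wget invocation above in a reusable function.
# mirror_site and DRY_RUN are hypothetical names, not part of wget.
mirror_site() {
    domain="$1"
    cmd="wget --recursive --no-clobber --page-requisites --html-extension \
--convert-links --restrict-file-names=windows \
--domains $domain --no-parent http://www.$domain"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of hitting the network.
        echo "$cmd"
    else
        $cmd
    fi
}

# Dry run: show the exact command that would be executed.
DRY_RUN=1 mirror_site example.com
```

The dry-run switch makes it easy to confirm the exact flags before letting a long mirror run loose on a site.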
If the website uses external resources like jQuery, you can try adding the --span-hosts option so that wget also downloads resources from other hosts. Note that --domains still limits which hosts are followed when spanning, so any external host you need must be added to that list (code.jquery.com is shown here only as an example CDN host):

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains example.com,code.jquery.com --span-hosts --no-parent http://www.example.com
However, this may also download additional content from other domains that you might not need. To avoid downloading unnecessary content, you can manually download the required resources (like jQuery) and update the HTML files to use the local copies instead.
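That manual rewrite can be done with sed across the downloaded HTML files. A sketch, assuming GNU sed's -i flag; the jQuery version and CDN URL are illustrative and should be matched to what the site actually references:

```shell
# Sketch: rewrite CDN jQuery references in downloaded HTML to a local file.
# The version (3.7.1) and CDN URL are examples; check what the site uses.
# Fetch the library once (commented out here to stay offline):
#   wget https://code.jquery.com/jquery-3.7.1.min.js -O jquery.min.js
# Then point every downloaded page at the local copy (GNU sed -i):
find . -name '*.html' -exec sed -i \
    's|https://code.jquery.com/jquery-3\.7\.1\.min\.js|jquery.min.js|g' {} +
```

Running it from the root of the mirror touches every .html file in one pass, so pages in subdirectories get the same fix.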
Keep in mind that some websites may have measures in place to prevent downloading their content with tools like wget. In such cases, you may need to adjust the command options or use alternative methods to download the website for offline browsing[6].
Citations:
[1] https://www.linuxjournal.com/content/downloading-entire-web-site-wget
[2] https://winaero.com/make-offline-copy-of-a-site-with-wget-on-windows-and-linux/amp/
[3] https://stackoverflow.com/questions/10842263/wget-download-for-offline-viewing-including-absolute-references
[4] https://askubuntu.com/questions/391622/download-a-whole-website-with-wget-or-other-including-all-its-downloadable-con
[5] https://superuser.com/questions/970323/using-wget-to-copy-website-with-proper-layout-for-offline-browsing
[6] https://www.computerhope.com/unix/wget.htm
[7] https://superuser.com/questions/1672776/download-whole-website-wget
[8] https://gist.github.com/stvhwrd/985dedbe1d3329e68d70
[9] https://simpleit.rocks/linux/how-to-download-a-website-with-wget-the-right-way/
[10] https://www.guyrutenberg.com/2014/05/02/make-offline-mirror-of-a-site-using-wget/
[11] https://linuxreviews.org/Wget:_download_whole_or_parts_of_websites_with_ease
[12] https://brain-dump.space/articles/how-to-get-full-offline-website-copy-using-wget-on-mac-os/
[13] https://dev.to/jjokah/how-to-download-an-entire-website-for-offline-usage-using-wget-2lli
[14] https://alvinalexander.com/linux-unix/how-to-make-offline-mirror-copy-website-with-wget
[15] https://askubuntu.com/questions/979655/using-wget-and-having-websites-working-properly-offline
wget -mkEpnp (the short options -m -k -E -p -np combine the long options of the command below, so the two are equivalent)

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
Explanation of the various flags:
- --mirror: Makes (among other things) the download recursive.
- --convert-links: Convert all the links (also to stuff like CSS stylesheets) to relative, so it will be suitable for offline viewing.
- --adjust-extension: Adds suitable extensions to filenames (html or css) depending on their content type.
- --page-requisites: Download things like CSS style-sheets and images required to properly display the page offline.
- --no-parent: When recursing, do not ascend to the parent directory. It is useful for restricting the download to only a portion of the site.
wget -mpHkKEb -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' http://www.example.com
- -m (--mirror): turn on options suitable for mirroring (infinite recursive download and timestamping).
- -p (--page-requisites): download all files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
- -H (--span-hosts): enable spanning across hosts when doing recursive retrieving.
- -k (--convert-links): after the download, convert the links in documents for local viewing.
- -K (--backup-converted): when converting a file, back up the original version with a .orig suffix. Affects the behavior of -N.
- -E (--adjust-extension): add the proper extension to the end of the file.
- -b (--background): go to background immediately after startup. If no output file is specified via -o, output is redirected to wget-log.
- -e (--execute): execute a command (here robots=off, which tells wget to ignore robots.txt).
- -t number (--tries=number): set the number of tries to number.
- -U (--user-agent): identify as agent-string to the HTTP server. Some servers may ban you permanently for recursive downloading if you send the default User-Agent.
Cronjobs
0 23 * * * cd ~/Documents/Webs/mirror; wget -mpk -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' https://example.com
0 23 * * * cd ~/Documents/Webs/mirror; wget -mpkH -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' -D example.com https://example.com
0 23 * 1 * cd ~/Documents/Webs/mirror; wget -mpk -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' https://example.com
0 8 * * * pkill wget; cd ~/Documents/Webs/mirror/; rm wget*
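If overlapping runs are the concern the 8:00 pkill entry addresses, an alternative is flock(1), which simply refuses to start a new mirror while the previous one still holds the lock. A sketch only; the lock file path is an arbitrary assumption:

```shell
# crontab fragment (sketch): serialize mirror runs with flock instead of
# killing leftovers with pkill. /tmp/mirror.lock is an arbitrary lock path;
# -n makes flock exit immediately if the lock is already held.
0 23 * * * cd ~/Documents/Webs/mirror && flock -n /tmp/mirror.lock wget -mpk -t 1 -e robots=off https://example.com
```

This way a slow mirror is never killed mid-download; the next night's run just picks up where --mirror's timestamping left off.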
I have been only using page-requisites but this is even better, thanks!
Isn't curl a better program to download websites?
what are the best linux cli tools to download a mirror of a website for offline browsing?
There are several Linux command-line tools that can be used to download a mirror of a website for offline browsing. Here are some of the most popular ones:
- wget: wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. To download a website recursively, use the -r option, or --mirror, which implies it. For example, the following command will download the entire website located at http://example.com:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.com

This will create a local copy of the website in a directory named example.com.

- httrack: httrack is a free and open-source website copier that allows you to download a website and browse it offline. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. To download a website, use the -O option followed by the directory where you want to save it. For example, the following command will download the entire website located at http://example.com:

httrack http://example.com -O /path/to/save/directory

This will create a local copy of the website in the directory /path/to/save/directory/example.com.

- curl: curl is a command-line tool for transferring data from or to a server, using one of the supported protocols (HTTP, HTTPS, FTP, etc.). Unlike wget, curl has no recursive download mode (its -r option means --range, for byte ranges), so it cannot mirror a whole site on its own and is best suited to fetching individual files. For example, the following command downloads a single page, following redirects (-L) and saving it under its remote filename (-O):

curl -L -O http://example.com/index.html

This will save the page in the current directory. For whole-site mirroring, wget or httrack are the better fit.
These tools have different options and features, so you should choose the one that best suits your needs.
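When curl is the tool at hand, a common workaround for its lack of recursion is to feed it an explicit list of URLs. A sketch: urls.txt is a hypothetical file you prepare yourself, and the file:// URL is used only so the example runs offline:

```shell
# Sketch: fetch every URL listed in urls.txt with curl, one per line.
# A tiny offline example is built with a file:// URL; in practice the
# list would hold http(s) URLs collected from the site.
printf 'hello\n' > /tmp/page1.txt
printf 'file:///tmp/page1.txt\n' > urls.txt

# -s silent, -L follow redirects, -O save under the remote filename
xargs -n 1 curl -s -L -O < urls.txt
```

With -n 1 each URL gets its own curl process; curl's --remote-name-all option instead applies -O to every URL of a single invocation if you prefer batching.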