To download a website for offline browsing using wget, you can use the following command:

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains example.com --no-parent http://www.example.com

Replace example.com with the website you want to download. This command will download the entire website, including all the elements that compose the page (images, CSS, etc.), and convert the links so that they work locally and offline[1].
Here’s a brief explanation of the options used in the command:
- --recursive: Download the entire website recursively.
- --no-clobber: Don't overwrite existing files.
- --page-requisites: Download all the elements required to display the page properly (images, CSS, etc.).
- --html-extension: Save files with the .html extension.
- --convert-links: Convert links so that they work locally and offline.
- --restrict-file-names=windows: Modify filenames so that they also work on Windows.
- --domains example.com: Don't follow links outside the specified domain.
- --no-parent: When recursing, don't ascend to the parent directory, so only the specified directory and below is downloaded.
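The command and its options can be wrapped in a small reusable script. This is only a sketch: the mirror_site function name and the DRY_RUN switch are my own additions, not wget features.

```shell
#!/bin/sh
# Sketch: wrap the wget invocation above in a reusable function.
# mirror_site and DRY_RUN are hypothetical names, not part of wget.
mirror_site() {
    domain="$1"
    cmd="wget --recursive --no-clobber --page-requisites --html-extension \
--convert-links --restrict-file-names=windows \
--domains $domain --no-parent http://www.$domain"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of hitting the network.
        echo "$cmd"
    else
        $cmd
    fi
}

# Dry run: show the exact command that would be executed.
DRY_RUN=1 mirror_site example.com
```

The dry-run switch makes it easy to confirm the exact flags before letting a long mirror run loose on a site.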
If the website uses external resources like jQuery, you can try adding the --span-hosts option so that wget also downloads resources from other hosts. Note that --domains still limits which hosts are followed when spanning, so any external host you need must be added to that list (code.jquery.com is shown here only as an example CDN host):

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains example.com,code.jquery.com --span-hosts --no-parent http://www.example.com
However, this may also download additional content from other domains that you might not need. To avoid downloading unnecessary content, you can manually download the required resources (like jQuery) and update the HTML files to use the local copies instead.
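That manual rewrite can be done with sed across the downloaded HTML files. A sketch, assuming GNU sed's -i flag; the jQuery version and CDN URL are illustrative and should be matched to what the site actually references:

```shell
# Sketch: rewrite CDN jQuery references in downloaded HTML to a local file.
# The version (3.7.1) and CDN URL are examples; check what the site uses.
# Fetch the library once (commented out here to stay offline):
#   wget https://code.jquery.com/jquery-3.7.1.min.js -O jquery.min.js
# Then point every downloaded page at the local copy (GNU sed -i):
find . -name '*.html' -exec sed -i \
    's|https://code.jquery.com/jquery-3\.7\.1\.min\.js|jquery.min.js|g' {} +
```

Running it from the root of the mirror touches every .html file in one pass, so pages in subdirectories get the same fix.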
Keep in mind that some websites may have measures in place to prevent downloading their content with tools like wget. In such cases, you may need to adjust the command options or use alternative methods to download the website for offline browsing[6].
Citations:
[1] https://www.linuxjournal.com/content/downloading-entire-web-site-wget
[2] https://winaero.com/make-offline-copy-of-a-site-with-wget-on-windows-and-linux/amp/
[3] https://stackoverflow.com/questions/10842263/wget-download-for-offline-viewing-including-absolute-references
[4] https://askubuntu.com/questions/391622/download-a-whole-website-with-wget-or-other-including-all-its-downloadable-con
[5] https://superuser.com/questions/970323/using-wget-to-copy-website-with-proper-layout-for-offline-browsing
[6] https://www.computerhope.com/unix/wget.htm
[7] https://superuser.com/questions/1672776/download-whole-website-wget
[8] https://gist.github.com/stvhwrd/985dedbe1d3329e68d70
[9] https://simpleit.rocks/linux/how-to-download-a-website-with-wget-the-right-way/
[10] https://www.guyrutenberg.com/2014/05/02/make-offline-mirror-of-a-site-using-wget/
[11] https://linuxreviews.org/Wget:_download_whole_or_parts_of_websites_with_ease
[12] https://brain-dump.space/articles/how-to-get-full-offline-website-copy-using-wget-on-mac-os/
[13] https://dev.to/jjokah/how-to-download-an-entire-website-for-offline-usage-using-wget-2lli
[14] https://alvinalexander.com/linux-unix/how-to-make-offline-mirror-copy-website-with-wget
[15] https://askubuntu.com/questions/979655/using-wget-and-having-websites-working-properly-offline
wget -mkEpnp (the short options -m -k -E -p -np combine the long options of the command below, so the two are equivalent)

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
Explanation of the various flags:
- --mirror: Makes (among other things) the download recursive.
- --convert-links: Convert all the links (also to stuff like CSS stylesheets) to relative, so it will be suitable for offline viewing.
- --adjust-extension: Adds suitable extensions to filenames (html or css) depending on their content type.
- --page-requisites: Download things like CSS style-sheets and images required to properly display the page offline.
- --no-parent: When recursing, do not ascend to the parent directory. It is useful for restricting the download to only a portion of the site.
wget -mpHkKEb -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' http://www.example.com
- -m (--mirror): turn on options suitable for mirroring (infinite recursive download and timestamping).
- -p (--page-requisites): download all files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
- -H (--span-hosts): enable spanning across hosts when doing recursive retrieving.
- -k (--convert-links): after the download, convert the links in documents for local viewing.
- -K (--backup-converted): when converting a file, back up the original version with a .orig suffix. Affects the behavior of -N.
- -E (--adjust-extension): add the proper extension to the end of the file.
- -b (--background): go to background immediately after startup. If no output file is specified via -o, output is redirected to wget-log.
- -e (--execute): execute a command (here robots=off, which tells wget to ignore robots.txt).
- -t number (--tries=number): set the number of tries to number.
- -U (--user-agent): identify as agent-string to the HTTP server. Some servers may ban you permanently for recursive downloading if you send the default User-Agent.
Cronjobs
0 23 * * * cd ~/Documents/Webs/mirror; wget -mpk -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' https://example.com
0 23 * * * cd ~/Documents/Webs/mirror; wget -mpkH -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' -D example.com https://example.com
0 23 * 1 * cd ~/Documents/Webs/mirror; wget -mpk -t 1 -e robots=off -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' https://example.com
0 8 * * * pkill wget; cd ~/Documents/Webs/mirror/; rm wget*
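If overlapping runs are the concern the 8:00 pkill entry addresses, an alternative is flock(1), which simply refuses to start a new mirror while the previous one still holds the lock. A sketch only; the lock file path is an arbitrary assumption:

```shell
# crontab fragment (sketch): serialize mirror runs with flock instead of
# killing leftovers with pkill. /tmp/mirror.lock is an arbitrary lock path;
# -n makes flock exit immediately if the lock is already held.
0 23 * * * cd ~/Documents/Webs/mirror && flock -n /tmp/mirror.lock wget -mpk -t 1 -e robots=off https://example.com
```

This way a slow mirror is never killed mid-download; the next night's run just picks up where --mirror's timestamping left off.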
I have been only using page-requisites but this is even better, thanks!
Isn't curl a better program to download websites?
what are the best linux cli tools to download a mirror of a website for offline browsing?
There are several Linux command-line tools that can be used to download a mirror of a website for offline browsing. Here are some of the most popular ones:
- wget: wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. To download a website recursively, use the -r option, or --mirror, which implies it. For example, the following command will download the entire website located at http://example.com:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.com

This will create a local copy of the website in a directory named example.com.

- httrack: httrack is a free and open-source website copier that allows you to download a website and browse it offline. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. To download a website, use the -O option followed by the directory where you want to save it. For example, the following command will download the entire website located at http://example.com:

httrack http://example.com -O /path/to/save/directory

This will create a local copy of the website in the directory /path/to/save/directory/example.com.

- curl: curl is a command-line tool for transferring data from or to a server, using one of the supported protocols (HTTP, HTTPS, FTP, etc.). Unlike wget, curl has no recursive download mode (its -r option means --range, for byte ranges), so it cannot mirror a whole site on its own and is best suited to fetching individual files. For example, the following command downloads a single page, following redirects (-L) and saving it under its remote filename (-O):

curl -L -O http://example.com/index.html

This will save the page in the current directory. For whole-site mirroring, wget or httrack are the better fit.
These tools have different options and features, so you should choose the one that best suits your needs.
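When curl is the tool at hand, a common workaround for its lack of recursion is to feed it an explicit list of URLs. A sketch: urls.txt is a hypothetical file you prepare yourself, and the file:// URL is used only so the example runs offline:

```shell
# Sketch: fetch every URL listed in urls.txt with curl, one per line.
# A tiny offline example is built with a file:// URL; in practice the
# list would hold http(s) URLs collected from the site.
printf 'hello\n' > /tmp/page1.txt
printf 'file:///tmp/page1.txt\n' > urls.txt

# -s silent, -L follow redirects, -O save under the remote filename
xargs -n 1 curl -s -L -O < urls.txt
```

With -n 1 each URL gets its own curl process; curl's --remote-name-all option instead applies -O to every URL of a single invocation if you prefer batching.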