pywb Server Installation

Ingolf Kuss edited this page Mar 21, 2024 · 12 revisions

Setup Python Wayback Server

We set up Python Wayback Engine on a dedicated server to replay the webarchives gathered with the toscience webgatherer.

We install Python 3 virtual environment support and git. Under SLES:

zypper install python3-virtualenv 
zypper install git

We create a dedicated user “wayback” with home directory /opt/pywb. Then we install pywb, the Python Wayback Engine, into a virtual environment in this directory.
The server will later be started on port 8081.

useradd -d /opt/pywb -m wayback
su - wayback
cd
python3 -m venv venv
. venv/bin/activate
pip install pywb
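
Before moving on, it can help to confirm that the virtual environment contains the executables the later steps rely on. The following is a minimal sketch: the function name `check_pywb_venv` is hypothetical, and the list of executables is taken from the commands used on this page.

```shell
#!/bin/sh
# Hypothetical helper: verify that a venv contains the executables used
# later on this page. Prints each missing file and returns non-zero.
check_pywb_venv() {
    venv_dir=$1
    missing=0
    for exe in python pywb wb-manager; do
        if [ ! -x "$venv_dir/bin/$exe" ]; then
            echo "missing: $venv_dir/bin/$exe"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (as user wayback):
#   check_pywb_venv /opt/pywb/venv && echo "venv looks complete"
```

If the check reports a missing file, re-running `pip install pywb` inside the activated environment is the first thing to try.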

Generate a unit file

We create a service to start|stop|restart pywb on a SLES server

sudo su
cd /etc/systemd/system
cat pywb.service
[Unit]
Description=PyWB Application
After=network.target

[Service]
User=wayback
WorkingDirectory=/opt/pywb
Environment="PATH=/opt/pywb/venv/bin"
#ExecStart=/opt/pywb/Python3/bin/uwsgi --ini pywb.ini
ExecStart=/opt/pywb/venv/bin/pywb -p 8081
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

We reload systemd so that it picks up the new unit file, then start pywb and enable the service. Enabling it will start pywb on reboot.

systemctl daemon-reload
systemctl start pywb.service
systemctl enable pywb.service

Go to http://localhost:8081.
You should see: “Pywb Wayback Machine”.
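
On a server without a browser, the same check can be done from the command line. This is a sketch only: `has_pywb_banner` is a hypothetical helper name, and it assumes that curl is installed and that the start page contains the text quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the HTML read from stdin contains the
# pywb start-page text quoted above (case-insensitive match).
has_pywb_banner() {
    grep -qi "Pywb Wayback Machine"
}

# Usage on the wayback server:
#   systemctl is-active pywb.service
#   curl -fsS http://localhost:8081/ | has_pywb_banner && echo "pywb is answering"
```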

Create symbolic links

The webarchives that toscience gathers on the main machine are stored on a mounted volume /data.
This volume must also be mounted on the wayback server so that wayback can index these webarchives.
On the wayback server, create symbolic links to the base directories of /data so that wayback finds the webarchives:

mkdir /opt/toscience
cd /opt/toscience
ln -s /data/cdn-data/
ln -s /data/wpull-data/
ln -s /data/heritrix-data/
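
A symbolic link is created even when its target is missing, for example when /data is not mounted yet. The sketch below checks that each link exists and resolves; `check_links` is a hypothetical helper name, not part of the hbz scripts.

```shell
#!/bin/sh
# Hypothetical helper: check that each named entry under a base directory
# is a symbolic link whose target exists (i.e. the volume is mounted).
check_links() {
    base=$1
    shift
    status=0
    for name in "$@"; do
        if [ ! -L "$base/$name" ]; then
            echo "not a symlink: $base/$name"
            status=1
        elif [ ! -e "$base/$name" ]; then
            echo "dangling link: $base/$name"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_links /opt/toscience cdn-data wpull-data heritrix-data
```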

Create pywb scripts

Install hbz pywb scripts. Those will import your webarchives into pywb.

cd /opt/pywb
git clone https://github.com/hbz/pywb.bin bin

Initially add webarchives to python wayback’s index and archive

Initially, index the complete backlog of webarchives available on your toscience server.

Collection “wayback”

This is toscience’s standard webarchive collection of material with restricted access.

Generate the pywb collection “wayback”:

Remove existing collection “wayback” (if applicable)

cd /opt/pywb
. venv/bin/activate
cd bin
./ks.remove_collection.sh wayback

Generate a new collection “wayback”

cd /opt/pywb
. venv/bin/activate
wb-manager init wayback
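
To confirm that the collection was created, one can look for its directories. This sketch assumes the default pywb layout, in which `wb-manager init` creates `collections/<name>/archive` and `collections/<name>/indexes` below the directory it is run from; `collection_ready` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: check that a pywb collection has the directories
# that the indexing steps write to. Assumes the default pywb layout
# <root>/collections/<name>/{archive,indexes}.
collection_ready() {
    root=$1
    name=$2
    for sub in archive indexes; do
        if [ ! -d "$root/collections/$name/$sub" ]; then
            echo "missing: $root/collections/$name/$sub"
            return 1
        fi
    done
    return 0
}

# Usage:
#   collection_ready /opt/pywb wayback && echo "collection wayback exists"
# `wb-manager list` (run in /opt/pywb) prints the known collections as well.
```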

Create multiple indexes for the different webcrawlers that you use (if applicable) in collection “wayback”:

Create the first index: index.cdxj. It contains: wpull-data, cdn-data.

mkdir -p /opt/pywb/logs
cd /opt/pywb/bin
./ks.index_wpull-data.sh wayback  >> /opt/pywb/logs/ks.index_wpull-data.log

Create a second index: index_htrx.cdxj. It contains: heritrix-data.

cd /opt/pywb/bin
./ks.index_heritrix-data.sh wayback  >> /opt/pywb/logs/ks.index_heritrix-data.log

Create a third index: index_wget.cdxj. It contains: wget-data.

cd /opt/pywb/bin
./ks.index_wget-data.sh wayback  >> /opt/pywb/logs/ks.index_wget-data.log
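
After the index scripts have run, a quick look at the size of each index shows whether anything was written. The sketch below assumes that the hbz scripts place the .cdxj files in pywb's default location, `collections/<name>/indexes`; `index_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "<file name> <number of lines>" for every
# .cdxj file in an indexes directory. Each CDXJ line is one index entry.
index_summary() {
    indexes_dir=$1
    for f in "$indexes_dir"/*.cdxj; do
        [ -e "$f" ] || continue          # glob matched nothing
        lines=$(wc -l < "$f")
        echo "$(basename "$f") $((lines))"
    done
}

# Usage:
#   index_summary /opt/pywb/collections/wayback/indexes
```

An index with zero lines usually means that the corresponding data directory was empty or not reachable when the script ran.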

Collection “public”

This is toscience’s standard webarchive collection of material with public access.
toscience doesn’t copy the contents from “restricted” to “public” but just creates symbolic links in the public collection.

Generate the pywb collection “public”:

Remove existing collection “public” (if applicable)

cd /opt/pywb
. venv/bin/activate
cd bin
./ks.remove_collection.sh public

Generate a new collection “public”

cd /opt/pywb
. venv/bin/activate
wb-manager init public

Create one index in the public collection: index.cdxj. It contains: public-data, cdn-data.

cd /opt/pywb/bin
./ks.index_public-data.sh public  >> /opt/pywb/logs/ks.index_public-data.log

Create cronjob for automatic updates of wayback index

We automatically update the indexes of the collections when new webarchives have been added or existing webarchives have been modified.
Note: This procedure does not remove deleted webarchives from the index!

/opt/pywb/bin/ks.auto_add.sh >> /opt/pywb/logs/ks.auto_add_cron.log
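
Because the automatic update does not remove deleted webarchives from the index, stale entries can accumulate. The following sketch lists WARC files that an index still references but that are no longer present. It rests on two assumptions: each CDXJ line ends in a JSON object with a "filename" field, and those file names are relative to the collection's archive directory. Both function names are hypothetical.

```shell
#!/bin/sh
# Print the unique WARC file names referenced by a CDXJ index.
# Assumes each line ends in a JSON object with a "filename" field.
indexed_warcs() {
    grep -oE '"filename":[[:space:]]*"[^"]*"' "$1" |
        sed -E 's/^"filename":[[:space:]]*"//; s/"$//' |
        sort -u
}

# Print indexed WARC names that no longer exist in the archive directory
# (a dangling symbolic link counts as missing).
missing_warcs() {
    index_file=$1
    archive_dir=$2
    indexed_warcs "$index_file" | while IFS= read -r warc; do
        [ -e "$archive_dir/$warc" ] || echo "$warc"
    done
}

# Usage:
#   missing_warcs /opt/pywb/collections/wayback/indexes/index.cdxj \
#                 /opt/pywb/collections/wayback/archive
```

The output only reports candidates. Removing them from the index is a separate step, for example by regenerating the collection as described above.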

Install auto add as a cronjob on your wayback server:

crontab -e
# m h  dom mon dow   Command
# Refresh index for newly harvested webarchives (Python Wayback)
0 * * * * /opt/pywb/bin/ks.auto_add.sh >> /opt/pywb/logs/ks.auto_add_cron.log
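
A cron job that silently stops running is easy to miss. One hedged way to notice is to check how recently the log file was written. This sketch assumes that ks.auto_add.sh writes at least one line per run and that `find` supports `-mmin` (as GNU find does); `log_is_fresh` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file exists and was modified within
# the last N minutes. Relies on the -mmin test of GNU/BSD find.
log_is_fresh() {
    log_file=$1
    max_age_min=$2
    [ -n "$(find "$log_file" -mmin "-$max_age_min" 2>/dev/null)" ]
}

# Usage: the job runs hourly, so allow a little more than 60 minutes.
#   log_is_fresh /opt/pywb/logs/ks.auto_add_cron.log 90 || echo "auto add looks stalled"
```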

Apache configuration:

On the main server (the one which runs the toscience applications),
specify the name of your wayback server in the two redirect statements in /opt/toscience/conf/site.ssl.conf. Usually, to.science.install’s configure.sh on the main server has already done that for you:

    # redirect to the load-balanced pywb:
    RewriteRule ^/wayback(.*) https://<YOUR_WAYBACK_SERVER>/wayback$1 [R=301,L]
    RewriteRule ^/public(.*) https://<YOUR_WAYBACK_SERVER>/public$1 [R=301,L]

where <YOUR_WAYBACK_SERVER> is the fully qualified domain name of your wayback server.
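
Once Apache has loaded the rules, the redirect can be checked from the command line. This is a sketch: `redirect_target` is a hypothetical helper name, `<YOUR_MAIN_SERVER>` is a placeholder introduced here for the host name of the main server, and curl is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of the first Location header (matched case-insensitively),
# without the trailing carriage return.
redirect_target() {
    awk 'tolower($1) == "location:" { sub(/\r$/, "", $2); print $2; exit }'
}

# Usage (replace the placeholder with your own host name):
#   curl -sI https://<YOUR_MAIN_SERVER>/wayback/ | redirect_target
# The printed URL should start with https://<YOUR_WAYBACK_SERVER>/wayback
```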

The host declaration of your new wayback server has already been created by to.science.install on the main server. It is here:

/etc/apache2/vhosts.d/<YOUR_SERVER_NAME>.conf

Reload apache2 on the main server to activate the redirection of the wayback links that toscience generates. This redirects your wayback links to your new wayback server:

sudo service apache2 reload