pywb Server Installation
We set up Python Wayback Engine on a dedicated server to replay the webarchives gathered with the toscience webgatherer.
We install the Python 3 virtualenv package and git. On SLES:
zypper install python3-virtualenv
zypper install git
We create a dedicated user “wayback” with the home directory /opt/pywb. Then we install pywb, the Python Wayback Engine, in this directory.
The server will later be started on port 8081.
useradd -d /opt/pywb -m wayback
su - wayback
cd
python3 -m venv venv
. venv/bin/activate
pip install pywb
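Before moving on, it can help to confirm that the virtual environment really contains the two executables used later in this guide, pywb and wb-manager. The following is only a minimal sketch: the function name check_pywb_install is made up for illustration, and the default path /opt/pywb/venv simply follows the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of pywb): report whether the pywb entry
# points used in this guide exist and are executable in a given venv.
check_pywb_install() {
    venv_dir="${1:-/opt/pywb/venv}"
    missing=0
    for tool in pywb wb-manager; do
        if [ -x "$venv_dir/bin/$tool" ]; then
            echo "ok: $venv_dir/bin/$tool"
        else
            echo "missing: $venv_dir/bin/$tool"
            missing=1
        fi
    done
    # Non-zero exit status if at least one executable is missing
    return "$missing"
}
```

As user wayback, calling check_pywb_install without arguments checks /opt/pywb/venv; a different venv directory can be passed as the first argument.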
We create a service to start|stop|restart pywb on a SLES server
sudo su
cd /etc/systemd/system
cat pywb.service
[Unit]
Description=PyWB Application
After=network.target

[Service]
User=wayback
WorkingDirectory=/opt/pywb
Environment="PATH=/opt/pywb/venv/bin"
#ExecStart=/opt/pywb/Python3/bin/uwsgi --ini pywb.ini
ExecStart=/opt/pywb/venv/bin/pywb -p 8081
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
We start pywb and enable the service, so that pywb also starts automatically after a reboot.
systemctl start pywb.service
systemctl enable pywb.service
Go to http://localhost:8081.
You should see “Pywb Wayback Machine”.
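On a server without a browser, the same check can be done from the command line. This is a sketch under assumptions: the function names are invented here, and the check simply looks for the banner text quoted above in the HTML of the start page.

```shell
#!/bin/sh
# Hypothetical helpers: detect the pywb start page banner.

# Succeeds if the HTML read from stdin contains the banner text.
page_has_banner() {
    grep -qi 'Pywb Wayback Machine'
}

# Fetches the start page (default URL taken from this guide) and checks it.
# -f: fail on HTTP errors, -sS: quiet but still print errors,
# --max-time 10: give up after 10 seconds.
check_pywb_home() {
    curl -fsS --max-time 10 "${1:-http://localhost:8081/}" | page_has_banner
}
```

Running check_pywb_home && echo "pywb is up" on the wayback server prints the message only if the page could be fetched and contains the banner.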
The webarchives that toscience gathers on the main machine are stored on a mounted volume /data.
This volume must also be mounted on the wayback server so that pywb can index these webarchives.
On the wayback server, create symbolic links to the base directories of /data so that pywb finds the webarchives:
mkdir /opt/toscience
cd /opt/toscience
ln -s /data/cdn-data/
ln -s /data/wpull-data/
ln -s /data/heritrix-data/
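If /data is not mounted, the links are created anyway but point nowhere. A small check like the following can catch that early; it is a sketch, and the function name find_dangling_links is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list symbolic links in a directory whose targets
# do not exist (for example because /data is not mounted).
find_dangling_links() {
    link_dir="${1:-/opt/toscience}"
    dangling=0
    for entry in "$link_dir"/*; do
        # -L: entry is a symlink; -e follows the link and fails if the target is missing
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            echo "dangling: $entry -> $(readlink "$entry")"
            dangling=1
        fi
    done
    # Non-zero exit status if at least one dangling link was found
    return "$dangling"
}
```

Called without arguments, it inspects /opt/toscience and prints nothing when every link resolves.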
Install the hbz pywb scripts. They import your webarchives into pywb.
cd /opt/pywb
git clone https://github.com/hbz/pywb.bin bin
Initially index the complete available backlog of webarchives on your toscience server
The collection “wayback” is toscience’s standard webarchive collection for material with restricted access.
Generate the pywb collection “wayback”:
Remove existing collection “wayback” (if applicable)
cd /opt/pywb
. venv/bin/activate
cd bin
./ks.remove_collection.sh wayback
Generate a new collection “wayback”
cd /opt/pywb
. venv/bin/activate
wb-manager init wayback
Create multiple indexes for the different webcrawlers that you use (if applicable) in collection “wayback”:
Create the first index, index.cdxj. It contains wpull-data and cdn-data.
mkdir /opt/pywb/logs
cd /opt/pywb/bin
./ks.index_wpull-data.sh wayback >> /opt/pywb/logs/ks.index_wpull-data.log
Create a second index, index_htrx.cdxj. It contains heritrix-data.
cd /opt/pywb/bin
./ks.index_heritrix-data.sh wayback >> /opt/pywb/logs/ks.index_heritrix-data.log
Create a third index, index_wget.cdxj. It contains wget-data.
cd /opt/pywb/bin
./ks.index_wget-data.sh wayback >> /opt/pywb/logs/ks.index_wget-data.log
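After the indexing runs, a quick look at the size of each index shows whether anything was imported. The sketch below assumes that the index files end up in the usual pywb location collections/wayback/indexes below /opt/pywb; the hbz scripts may place them elsewhere, and the function name count_cdxj_entries is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<index file> <number of lines>" for every
# .cdxj file in an index directory.
count_cdxj_entries() {
    # Assumed default location; adjust if the indexes live elsewhere
    index_dir="${1:-/opt/pywb/collections/wayback/indexes}"
    for index_file in "$index_dir"/*.cdxj; do
        [ -f "$index_file" ] || continue
        # wc -l < file prints only the number, without the file name
        echo "$(basename "$index_file") $(wc -l < "$index_file" | tr -d ' ')"
    done
}
```

An index with zero lines usually means that the corresponding data directory was empty or not reachable. The same function can be pointed at another collection by passing its indexes directory as the first argument.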
The collection “public” is toscience’s standard webarchive collection for material with public access.
toscience does not copy the contents from the restricted collection to the public one; it only creates symbolic links in the public collection.
Generate the pywb collection “public”:
Remove existing collection “public” (if applicable)
cd /opt/pywb
. venv/bin/activate
cd bin
./ks.remove_collection.sh public
Generate a new collection “public”
cd /opt/pywb
wb-manager init public
Create one index in the public collection, index.cdxj. It contains public-data and cdn-data.
cd /opt/pywb/bin
./ks.index_public-data.sh public >> /opt/pywb/logs/ks.index_public-data.log
We automatically update the indexes of the collections if new webarchives have been added or existing webarchives have been modified.
Note: This procedure does not remove deleted webarchives from the index!
ks.auto_add.sh >> /opt/pywb/logs/ks.auto_add_cron.log
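Because deleted webarchives stay in the index, it can be useful to list index entries whose WARC file no longer exists. The following is a rough sketch, not part of the hbz scripts: it assumes that each CDXJ line carries a JSON field of the form "filename": "name.warc.gz" and that this name is relative to the archive directory passed as the second argument. The function name find_stale_warcs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list WARC file names that an index refers to but
# that are missing from the archive directory.
# Usage: find_stale_warcs <index file> <archive directory>
find_stale_warcs() {
    index_file="$1"
    archive_dir="$2"
    # Extract the "filename" field of every line, then reduce to unique names
    grep -o '"filename": *"[^"]*"' "$index_file" |
        sed 's/^"filename": *"//; s/"$//' |
        sort -u |
        while IFS= read -r warc; do
            # -e follows symbolic links, so a link to an existing file counts as present
            [ -e "$archive_dir/$warc" ] || echo "$warc"
        done
}
```

The function only reports; removing stale entries would still require rebuilding the affected index as described above.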
Install the auto-add script as a cron job on your wayback server:
crontab -e
# m h dom mon dow Command
# Refresh index for newly harvested webarchives (Python Wayback)
0 * * * * /opt/pywb/bin/ks.auto_add.sh >> /opt/pywb/logs/ks.auto_add_cron.log
On the main server (the one that runs the toscience applications), specify the name of your wayback server in two redirect statements in /opt/toscience/conf/site.ssl.conf. Usually, configure.sh from to.science.install has already done that for you on the main server:
# redirect to the load-balanced pywb:
RewriteRule ^/wayback(.*) https://<YOUR_WAYBACK_SERVER>/wayback$1 [R=301,L]
RewriteRule ^/public(.*) https://<YOUR_WAYBACK_SERVER>/public$1 [R=301,L]
where <YOUR_WAYBACK_SERVER> is the fully qualified domain name of your wayback server.
The host declaration of your new wayback server has already been created by to.science.install on the main server. It is here:
/etc/apache2/vhosts.d/<YOUR_SERVER_NAME>.conf
Reload apache2 on the main server to activate the redirection of the wayback links that toscience generates. These links will then be redirected to your new wayback server:
sudo service apache2 reload
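After the reload, the redirect can be checked without a browser by looking at the Location header that the main server returns. This is a sketch with invented function names; the host name main.example.org in the usage note is a placeholder for your main server.

```shell
#!/bin/sh
# Hypothetical helpers for checking the redirect.

# Reads HTTP response headers from stdin and prints the Location value, if any.
location_of() {
    # HTTP headers end in CR LF; strip the CR before matching
    tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Sends a HEAD request (-I) without following redirects and prints the target.
redirect_target() {
    curl -sI --max-time 10 "$1" | location_of
}
```

For example, redirect_target https://main.example.org/wayback/ should print a URL on your wayback server if the RewriteRule is active, and nothing if no redirect is returned.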