A Python script that crawls a website from a given starting URL, saves each page as a PDF, and optionally combines all PDFs into a single document. This script is especially useful for creating offline versions of documentation sites or other structured content. Additionally, it can now detect and download PDF files linked within the target website.
- Crawls a website, following internal links up to a specified depth.
- Saves each crawled page as a PDF using Headless Chrome.
- Optionally combines all saved pages into a single PDF document.
- Detects and downloads any PDF files linked on pages (optional).
- Can operate in PDF-only mode to download only PDF files linked on the target domain.

Requirements:

- Python 3.x
- Google Chrome installed
- ChromeDriver compatible with your Chrome version

Installation:

- Clone this repository.
- Install the required Python packages:
  `pip install requests beautifulsoup4 selenium PyPDF2`
- Install ChromeDriver, either with Homebrew on macOS (`brew install chromedriver`) or by downloading it manually from the official ChromeDriver site.
- Update `chrome_driver_path` in the script with your ChromeDriver's location, for example:
  `chrome_driver_path = '/opt/homebrew/bin/chromedriver'  # Adjust as necessary`
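
To check that the path is wired up correctly, you can run a quick sanity check like the one below (a minimal sketch, assuming Selenium 4; the path and URL are placeholders, and the actual script may construct its driver differently):

```python
# Sanity check: confirm Selenium can start headless Chrome with your ChromeDriver path.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_driver_path = '/opt/homebrew/bin/chromedriver'  # Adjust as necessary

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without opening a window

driver = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
driver.get("https://example.com")
print(driver.title)  # prints the page title if everything is set up correctly
driver.quit()
```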
Run the script with the following command:
```bash
python3 crawl_and_save.py <URL> [--depth DEPTH] [--single-pdf] [--download-pdfs] [--pdf-only]
```

- `<URL>`: The starting URL to crawl (required).
- `--depth DEPTH`: The maximum depth for recursive crawling (default is 2).
- `--single-pdf`: Combine all saved pages into a single PDF file.
- `--download-pdfs`: Download any PDF files linked on pages.
- `--pdf-only`: Only download PDF files and ignore HTML pages.

Examples:

- Save each page as a separate PDF:

  ```bash
  python3 crawl_and_save.py https://docs.example.com
  ```

- Combine all pages into a single PDF:

  ```bash
  python3 crawl_and_save.py https://docs.example.com --single-pdf
  ```

- Download PDFs linked on pages in addition to saving HTML pages as PDFs:

  ```bash
  python3 crawl_and_save.py https://docs.example.com --download-pdfs
  ```

- PDF-only mode: crawl the site and download only the PDFs linked on pages, ignoring HTML pages:

  ```bash
  python3 crawl_and_save.py https://docs.example.com --pdf-only
  ```

- Custom crawling depth:

  ```bash
  python3 crawl_and_save.py https://docs.example.com --depth 3
  ```

How it works:

- Crawling: The script starts from the given URL and recursively finds internal links on the target domain, crawling up to the specified depth (sketched below).
- PDF Saving: Each HTML page is loaded with Selenium and converted to a PDF using Chrome's DevTools print-to-PDF feature (sketched below).
- PDF Detection: When `--download-pdfs` or `--pdf-only` is set, the script downloads any linked PDF files it encounters (sketched below).
- Combining PDFs: If `--single-pdf` is specified, all generated PDFs are merged into a single document using `PyPDF2` (sketched below).
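
The crawling step can be pictured as a depth-limited, same-domain walk like the one below (a minimal sketch using `requests` and `BeautifulSoup`; the function and variable names are illustrative, not the script's actual internals):

```python
# Illustrative sketch of a depth-limited crawl that stays on the starting domain.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_depth=2):
    domain = urlparse(start_url).netloc
    seen, queue, pages = set(), [(start_url, 0)], []
    while queue:
        url, depth = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        response = requests.get(url, timeout=30)
        print(f"[{response.status_code}] {url}")
        pages.append(url)
        if depth >= max_depth:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"]).split("#")[0]
            if urlparse(absolute).netloc == domain:  # follow internal links only
                queue.append((absolute, depth + 1))
    return pages
```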
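
Chrome's print-to-PDF feature is reached through the DevTools protocol, which Selenium 4 exposes on Chromium drivers via `execute_cdp_cmd`. Saving one page looks roughly like this (a minimal sketch; the driver path, URL, and print options are placeholders, not the script's exact settings):

```python
# Illustrative sketch: render a page in headless Chrome and save it as a PDF
# using the DevTools "Page.printToPDF" command.
import base64

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_driver_path = '/opt/homebrew/bin/chromedriver'  # Adjust as necessary

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=Service(chrome_driver_path), options=options)

driver.get("https://docs.example.com")
result = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
with open("page.pdf", "wb") as fh:
    fh.write(base64.b64decode(result["data"]))  # the PDF comes back base64-encoded

driver.quit()
```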
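
Downloading a linked PDF amounts to spotting `.pdf` links while crawling and streaming them to disk, along these lines (a minimal sketch; the helper name and output directory are hypothetical):

```python
# Illustrative sketch: stream a linked PDF file to disk with requests.
import os
from urllib.parse import urlparse

import requests


def download_pdf(pdf_url, out_dir="downloads"):
    os.makedirs(out_dir, exist_ok=True)
    filename = os.path.basename(urlparse(pdf_url).path) or "document.pdf"
    response = requests.get(pdf_url, stream=True, timeout=60)
    response.raise_for_status()
    with open(os.path.join(out_dir, filename), "wb") as fh:
        for chunk in response.iter_content(chunk_size=8192):
            fh.write(chunk)
```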
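
Combining the per-page PDFs is a few lines with PyPDF2's `PdfMerger` (a minimal sketch; the filenames are hypothetical, and PyPDF2 releases before 2.0 call the class `PdfFileMerger`):

```python
# Illustrative sketch: merge individual page PDFs into one document with PyPDF2.
from PyPDF2 import PdfMerger

pdf_paths = ["page_001.pdf", "page_002.pdf", "page_003.pdf"]  # hypothetical filenames

merger = PdfMerger()
for path in pdf_paths:
    merger.append(path)  # appends every page of each input PDF in order
merger.write("combined.pdf")
merger.close()
```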
The script outputs debug messages that include:
- Timestamps
- HTTP status codes for each request
- Messages indicating when each page is processed and saved

Dependencies:

- Selenium: Used to control Chrome in headless mode and capture each page as a PDF.
- PyPDF2: Needed only if using the `--single-pdf` option to merge PDFs.

Troubleshooting:

- ChromeDriver Errors: Ensure `chromedriver` is compatible with your installed version of Chrome. You can check Chrome's version under `chrome://settings/help`.
- Permission Errors: If you encounter permission issues with ChromeDriver, run `chmod +x /path/to/chromedriver` to make it executable.
- Connection Issues: If you get network errors (e.g., `ProtocolUnknownError`), ensure the target site is accessible and doesn't require authentication or other permissions.
This project is licensed under the MIT License.