
A webscraper that scrapes historic web data using the Wayback Machine


ARWishere/WaybackWebScraper


WaybackWebScraper

This Python project scrapes a website and its historical versions using Wayback Machine snapshots. User input guides the scraping process, so you can choose which content to extract from a website's archived states.
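As a rough illustration of how snapshots can be discovered, the sketch below queries the Wayback Machine's public CDX endpoint and turns each capture into a replay URL. It is not taken from this repository: the function names (`build_cdx_query`, `parse_cdx_rows`, `list_snapshots`) and the choice of filters are assumptions made for the example, and the project itself may look up snapshots differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, start=None, end=None):
    """Build a CDX query URL listing successful captures of `url`.

    `start` and `end` are optional timestamp prefixes such as "2015"
    or "20150301" (hypothetical parameter names for this sketch).
    """
    params = {
        "url": url,
        "output": "json",            # header row followed by data rows
        "filter": "statuscode:200",  # keep only successful captures
        "collapse": "digest",        # skip adjacent identical captures
    }
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Convert CDX JSON output into a list of snapshot dictionaries."""
    if not rows:
        return []
    header, *data = rows
    snapshots = []
    for row in data:
        record = dict(zip(header, row))
        # Replay URLs have the form /web/<timestamp>/<original URL>.
        record["snapshot_url"] = (
            f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"
        )
        snapshots.append(record)
    return snapshots


def list_snapshots(url, start=None, end=None, timeout=30):
    """Fetch and parse the snapshot list for `url` (performs a network call)."""
    with urlopen(build_cdx_query(url, start, end), timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return parse_cdx_rows(json.loads(body) if body.strip() else [])
```

Calling `list_snapshots("example.com", start="2015", end="2016")` would return one dictionary per capture in that range, each with a `snapshot_url` that can then be downloaded and parsed.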

Features

  • Scrape a website and all available snapshots from the Wayback Machine.
  • Asynchronous requests allow entire website snapshots to be scraped quickly.
  • Interactive user input to specify scraping criteria (e.g., specific elements and time ranges).
  • Automated handling of snapshot metadata for seamless extraction.
  • Flexible output options: choose to return data for use in your own projects or generate a CSV file.
  • Error handling for unavailable pages or restricted content.
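
The asynchronous scraping, error handling, and CSV output listed above could be combined along the following lines. This is a minimal sketch using only the standard library, not the project's actual code: `fetch_page`, `scrape_snapshots`, `results_to_csv`, and the `max_concurrent` limit are hypothetical names chosen for the example.

```python
import asyncio
import csv
import io
from urllib.error import URLError
from urllib.request import urlopen


def fetch_page(url, timeout=30):
    """Download one page and return its text (blocking network call)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


async def scrape_snapshots(urls, fetch=fetch_page, max_concurrent=5):
    """Fetch many snapshot URLs concurrently, recording failures per URL.

    `fetch` is injectable so the network call can be replaced in tests.
    """
    # Cap the number of requests in flight to stay polite to the archive.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def scrape_one(url):
        async with semaphore:
            try:
                # Run the blocking fetch in a worker thread.
                html = await asyncio.to_thread(fetch, url)
                return {"url": url, "html": html, "error": ""}
            except (URLError, OSError, ValueError) as exc:
                # Unavailable or restricted pages are recorded, not fatal.
                return {"url": url, "html": "", "error": str(exc)}

    # gather() preserves the order of the input URLs.
    return await asyncio.gather(*(scrape_one(u) for u in urls))


def results_to_csv(results):
    """Serialize scrape results to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "html", "error"])
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()
```

A caller would run `results = asyncio.run(scrape_snapshots(snapshot_urls))` and then either use `results` directly or write `results_to_csv(results)` to a file. Returning an `error` field instead of raising lets one failed snapshot leave the rest of the batch intact.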

Use Cases

  • Researching website evolution over time.
  • Archiving content for analysis or preservation.
  • Investigating historical changes in web pages.
  • I personally used this to scrape product information from a few brands and track their items over time.
