In this project, I developed a system that scrapes product data from the e-commerce website Thredup, cleans and validates it, and integrates it into a mobile application for efficient searching. The system uses Python for web scraping, Pandas for data preprocessing, Algolia for indexing and search, Google Cloud Firestore for data storage, and Google Cloud Run with Docker for deployment.
Part 1: Web Scraping Architecture
Part 2: Data Processing and Validation Workflow
Part 3: Data Pipeline Workflow
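As an illustration of the kind of cleaning the processing and validation stage (Part 2) performs with Pandas, here is a minimal sketch. The column names (item_id, title, price) and the function clean_products are hypothetical and are not taken from the repository; the actual scripts in data_processing/ may use different fields and rules.

```python
import pandas as pd

# Hypothetical column names; the real scraper output may differ.
REQUIRED_COLUMNS = ["item_id", "title", "price"]


def clean_products(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a validated, de-duplicated copy of a scraped product table."""
    missing = [c for c in REQUIRED_COLUMNS if c not in raw.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Rows without an identifier or a title cannot be indexed for search.
    df = raw.dropna(subset=["item_id", "title"]).copy()

    # Normalise whitespace in titles.
    df["title"] = df["title"].astype(str).str.strip()

    # Coerce prices such as "$12.99" to floats; unparseable values become NaN.
    df["price"] = pd.to_numeric(
        df["price"].astype(str).str.replace(r"[^0-9.]", "", regex=True),
        errors="coerce",
    )

    # Keep only rows with a non-empty title and a positive price
    # (the comparison is False for NaN, so bad prices are dropped too).
    df = df[(df["title"] != "") & (df["price"] > 0)]

    # The same listing can be scraped more than once; keep the first copy.
    df = df.drop_duplicates(subset="item_id", keep="first")
    return df.reset_index(drop=True)
```

Calling clean_products on the scraped DataFrame before upload keeps malformed or duplicate listings out of the search index and the database.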
The project repository contains the following directories and files:
- data_processing/: Contains scripts related to data processing and cleaning.
- handle_database/: Includes code for handling the product database and storage.
- output/: Stores output files generated during the data processing pipeline.
- Dockerfile: Defines the instructions to build a Docker image for this project.
- Initial_Products_Scraper.ipynb: Jupyter Notebook containing the initial product scraping code.
- run_image.py: Script to run the Docker image on GCP.
- scraping_list_product_modules.py: Contains modules for scraping product listings.
To run this project locally, follow these steps:
- Clone the repository:
git clone https://github.com/faisal-fida/Ecommerce-ETL-Pipeline
- Install the required Python dependencies using Pipenv:
pipenv install
- Set up any necessary configurations and environment variables.
- Run the main script:
python main.py
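The configuration step above does not list which settings are required. One common approach is to read credentials from environment variables and fail early if any are missing. The sketch below assumes hypothetical variable names (ALGOLIA_APP_ID, ALGOLIA_API_KEY, GCP_PROJECT); check the repository's code for the names it actually expects.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineConfig:
    algolia_app_id: str
    algolia_api_key: str
    gcp_project: str


# Hypothetical mapping of config fields to environment variable names.
_ENV_VARS = {
    "algolia_app_id": "ALGOLIA_APP_ID",
    "algolia_api_key": "ALGOLIA_API_KEY",
    "gcp_project": "GCP_PROJECT",
}


def load_config(env: Optional[Mapping[str, str]] = None) -> PipelineConfig:
    """Build the config from environment variables, failing fast if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in _ENV_VARS.values() if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return PipelineConfig(**{field: env[name] for field, name in _ENV_VARS.items()})
```

Passing a plain dictionary as env makes the loader easy to exercise in tests without touching the real environment.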