This project involves scraping data from Booking.com for hotels in Ireland, specifically focusing on prices for one night on a weekend. The raw data was collected using the "Instant Data Scraper" tool and subsequently cleaned and analyzed using Python and SQL. Finally, Power BI was used to create a dashboard for visual insights.
The initial dataset consisted of 502 rows and 16 columns, with each row representing a hotel and the columns as follows:
- hotel name
- city
- certification
- score
- review rate
- reviews
- room type
- occupancy
- Free cancellation
- pay at the property
- No prepayment needed
- rooms left
- nights/adults
- price
- location rate
- bb included
The data cleaning process was executed using Pandas in Python with the following steps:
- Checking for duplicate rows
- Checking for null rows or rows with insufficient information
- Dropping unnecessary columns
- Converting 'reviews' into numeric values
- Cleaning the 'location rate' feature
- Removing the € sign from 'price' and converting to float
- Cleaning categorical features and filling null values
- Cleaning 'rooms left' and converting to numeric values
- Saving the cleaned file
import pandas as pd
# Load data
data = pd.read_csv('booking.csv', encoding = 'UTF-8')
Using SQL (SSMS), various insights were drawn from the cleaned data, leveraging Common Table Expressions (CTEs) for complex queries. Here are the key questions explored:
- Top Rated Hotels: Which hotels have the highest scores?
- Average Score by City: Identify cities with the highest average hotel scores.
- Top 5 Hotels by Score and Review Count: List the top 5 hotels based on a combination of score and review count.
- Hotels with Free Cancellation and High Scores: Find high-scoring hotels that offer free cancellation.
- Price Range Analysis: Analyze the price range for hotels in each city.
- Room Availability by City: Determine cities with the highest average number of rooms left.
- Review Rate Distribution: Show the distribution of hotels across different review rates.
- Impact of Review Count on Score: Determine if there's a correlation between the number of reviews and the average score.
A Power BI dashboard was created to visually represent the insights derived from the analysis we have done earlier using SQL.
The cleaned data was imported into Power BI, and the following steps were followed to create the visualizations:
- Import Data: Load the
booking_cleaned.csv
into Power BI. - Create Relationships: Ensure data relationships are correctly set up.
- Design Visuals: Use various Power BI visuals like bar charts, maps, histograms, and scatter plots.
- Filter and Slicers: Add filters and slicers for interactivity.
- Publish: Publish the dashboard to Power BI service for sharing and collaboration.
This project demonstrates a comprehensive workflow from data collection and cleaning to analysis and visualization. The insights drawn from the data provide valuable information on hotel ratings, prices, and availability in Ireland.
- data/: Contains the raw and cleaned datasets.
- python/: Python scripts for data cleaning.
- sql/: SQL scripts for data analysis.
- powerbi/: Power BI project file and screenshots.
- README.md: Project documentation.
To replicate this project:
- Clone the repository
- Run the Python cleaning script
- Load the cleaned data into SQL for analysis
- Import the cleaned data into Power BI for visualization
Load the raw data into your IDE (I used Jupyter Notebook) and execute the code in the
python/
Load the cleaned data into your SQL database and execute the queries in the sql/
directory.
Open the Power BI project file in Power BI Desktop, connect to the cleaned data, and explore the visualizations.
This README provides a structured overview of the project, detailing each step from data collection to final visualization. Feel free to contribute or reach out with any questions!