Programming lesson
Building a Data Warehouse for M-Stay: A Step-by-Step Tutorial
Learn how to design and implement a star schema data warehouse for the M-Stay vacation rental platform. This tutorial covers data cleaning, dimension modeling, fact table design, and SQL implementation.
Introduction to Data Warehousing for M-Stay
In today's data-driven world, businesses like M-Stay rely on Business Intelligence (BI) to make informed decisions. As the platform grows, moving from an operational database to a data warehouse becomes essential for efficient analytics. This tutorial guides you through designing and implementing a star schema for M-Stay, focusing on key metrics like number of reviews, listings, and average booking cost.
Exploring and Cleaning the Operational Data
Before building the warehouse, you must explore the source database. Use SQL queries to understand data quality. For example:
-- Check for NULLs in key columns
SELECT COUNT(*) FROM MStay.REVIEW WHERE REVIEW_ID IS NULL;
-- Find duplicate bookings
SELECT BOOKING_ID, COUNT(*) FROM MStay.BOOKING GROUP BY BOOKING_ID HAVING COUNT(*) > 1;Common cleaning strategies include removing duplicates, handling missing values (e.g., setting default values or deleting rows), and standardizing formats. For instance, if Listing_Min_Nights is negative, you might set it to 1.
Designing the Star Schema
The star schema centers on a fact table linked to dimension tables. For M-Stay, we create a FactBooking table with measures like Booking_Cost, Booking_Duration, and counts. Dimensions include:
- DimListing: Listing_Title, Listing_Price, Listing_Min_Nights, Listing_Max_Nights, Prop_ID, Type_ID, Host_ID
- DimGuest: Guest_ID, Guest_Name
- DimHost: Host_ID, Host_Name, Host_Location
- DimProperty: Prop_ID, Prop_Description, Prop_Num_Beds, etc.
- DimDate: Date, Month, Year, Season
- DimChannel: Channel_ID, Channel_Name
- DimListingType: Type_ID, Type_Description
Derived attributes like listing season (Spring: 9-11, Summer: 12-2, Autumn: 3-5, Winter: 6-8) and booking duration category (short-term: <30, medium: 30-90, long-term: >90 nights) are added to dimensions.
Answering Business Questions
With the star schema, you can answer queries like:
- How many long-term stay listings are on Facebook? (Join DimListing, DimChannel, and FactBooking)
- How many summer listings for 'Entire home/apt' in medium price range? (Use DimDate, DimListingType)
- How many bookings for 'Private rooms' with short-term stay in 2015? (Filter by DimDate, DimListingType, and duration)
Implementing the Schema in SQL
Create dimension tables first, then the fact table. Example for DimDate:
CREATE TABLE DimDate (
DateKey DATE PRIMARY KEY,
Month NUMBER,
Year NUMBER,
Season VARCHAR2(10)
);Populate it with calendar data. Then create fact table:
CREATE TABLE FactBooking (
Booking_ID NUMBER PRIMARY KEY,
Listing_ID NUMBER REFERENCES DimListing(Listing_ID),
Guest_ID NUMBER REFERENCES DimGuest(Guest_ID),
Booking_Date DATE REFERENCES DimDate(DateKey),
Booking_Cost NUMBER,
Booking_Duration NUMBER,
Num_Guests NUMBER
);After creation, insert transformed data from the operational database.
Increasing Granularity (Design Task B)
To drill down, you can add more detailed dimensions or measures. Suggestions include:
- Add a DimTime table with hour/minute to analyze booking times.
- Include DimAmenity to see which amenities affect bookings.
- Add DimReview with individual review ratings.
- Break down booking cost into base cost and extra fees.
These additions allow deeper analysis, such as peak booking hours or the impact of specific amenities on ratings.
Conclusion
Building a data warehouse for M-Stay transforms raw transactional data into a powerful BI tool. By following this tutorial, you've learned to clean data, design a star schema, implement it in SQL, and suggest granularity improvements. This skillset is highly relevant in industries like e-commerce, finance, and even gaming, where player behavior analysis uses similar star schemas. Start applying these concepts to your own projects!