DSCI553 Assignment 3: LSH & Collaborative Filtering with Spark RDD Guide

Introduction to DSCI553 Assignment 3

In this assignment, you will dive into two core data mining techniques: Locality Sensitive Hashing (LSH) for Jaccard similarity and collaborative-filtering recommendation systems. Using the Yelp dataset, you'll implement these algorithms from scratch with Spark RDDs. This guide provides a conceptual walkthrough and best practices to help you succeed, without giving away the full solution. By the end, you'll have a solid understanding of how to handle large-scale similarity search and rating prediction.

Task 1: Jaccard Based LSH

Understanding the Goal

The first task focuses on finding similar businesses based on which users have rated them (binary: 1 if rated, 0 otherwise). You need to identify all pairs of businesses with Jaccard similarity >= 0.5. The challenge is to do this efficiently using LSH, which avoids computing all pairwise similarities.

Building the Characteristic Matrix

Conceptually, start from a binary matrix where rows represent users and columns represent businesses; each entry is 1 if the user rated the business, else 0. The matrix is very sparse, so you do not need to materialize it. In Spark, build an RDD of (user, business) pairs from the training data and group it so that each business maps to the set of users who rated it.
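The sparse representation described above can be sketched in plain Python. This is a minimal illustration, not the required implementation: the function name build_business_user_sets and the toy IDs are hypothetical, and in Spark the same idea would roughly correspond to mapping each record to a (business, user) pair and grouping by key.

```python
from collections import defaultdict

def build_business_user_sets(pairs):
    """Return (business -> set of user row indices, user -> row index).

    `pairs` is an iterable of (user_id, business_id) tuples; duplicates are fine.
    """
    pairs = list(pairs)
    # Give every distinct user a row index; sorting keeps the mapping deterministic.
    user_index = {u: i for i, u in enumerate(sorted({u for u, _ in pairs}))}
    business_users = defaultdict(set)
    for user, business in pairs:
        # A set stores exactly the rows that hold a 1 in this business's column.
        business_users[business].add(user_index[user])
    return dict(business_users), user_index

# Hypothetical toy data, including one repeated rating.
toy_pairs = [("u1", "b1"), ("u2", "b1"), ("u2", "b2"), ("u3", "b2"), ("u1", "b1")]
business_users, user_index = build_business_user_sets(toy_pairs)
```

Keeping integer row indices instead of raw user ID strings makes the later hashing step straightforward.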

Designing Hash Functions

For MinHashing, you need a set of hash functions that simulate random permutations of rows. Because user IDs are strings, first map each user to an integer row index x. Common choices include: h(x) = (a*x + b) % m or h(x) = ((a*x + b) % p) % m, where p is a prime number larger than m and m is the number of rows (users). Choose the coefficients a and b randomly for each function. Typically, you'll use 100-200 hash functions. Apply the same set of hash functions, in the same order, to every business; otherwise the signatures are not comparable.
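A minimal sketch of generating such a hash family follows. The helper names make_hash_params and hash_row, the seed value, and the choice of 2147483647 (the Mersenne prime 2^31 - 1) for p are illustrative assumptions, not values prescribed by the assignment.

```python
import random

LARGE_PRIME = 2147483647  # 2**31 - 1, assumed to exceed the number of users

def make_hash_params(num_hashes, prime=LARGE_PRIME, seed=553):
    """Draw (a, b) coefficient pairs; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    # a must be non-zero, otherwise every row would hash to the same value.
    return [(rng.randint(1, prime - 1), rng.randint(0, prime - 1))
            for _ in range(num_hashes)]

def hash_row(x, a, b, prime, m):
    """h(x) = ((a*x + b) % p) % m for an integer row index x."""
    return ((a * x + b) % prime) % m

hash_params = make_hash_params(num_hashes=100)
```

Generating the coefficients once on the driver and sharing them (for example through a broadcast variable) ensures every business is hashed with the same functions.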

Generating the Signature Matrix

For each business, compute its MinHash signature: for each hash function, find the minimum hash value among users who rated that business. This can be done in Spark by grouping by business and then aggregating the minimum hash per function. The result is a signature matrix where each business has a vector of n integers, with n being the number of hash functions.
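The signature computation for a single business can be sketched as below. The function name minhash_signature and the tiny parameter values are hypothetical and chosen only so the arithmetic is easy to check by hand; real runs would use many more hash functions and a much larger prime.

```python
def minhash_signature(user_rows, hash_params, prime, m):
    """One signature entry per hash function: the minimum hash over the rated rows."""
    return [
        min(((a * x + b) % prime) % m for x in user_rows)
        for a, b in hash_params
    ]

# Hypothetical toy setup: two hash functions, prime 7, 5 rows.
toy_params = [(1, 0), (2, 1)]
toy_signature = minhash_signature({1, 3}, toy_params, prime=7, m=5)
```

Two businesses rated by exactly the same users always get identical signatures, and the fraction of positions where two signatures agree estimates their Jaccard similarity.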

Band Partitioning and Candidate Pairs

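Divide the signature matrix into b bands, each containing r rows (here b is the number of bands, not the hash coefficient from the previous section). Choose b and r such that b * r = n. For example, if n=100, you might use b=20, r=5. Two businesses become a candidate pair if their signatures are identical in at least one band. The similarity at which a pair becomes likely to be a candidate is roughly (1/b)^(1/r), which is about 0.55 for b=20, r=5; using fewer rows per band lowers that threshold. You can implement this by hashing each band's portion of the signature to a bucket; businesses in the same bucket are candidates.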
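Banding can be sketched as follows, using the band index plus the band's values as the bucket key. The names candidate_pairs and candidate_probability and the toy signatures are hypothetical. The second function evaluates the standard LSH estimate 1 - (1 - s^r)^b for the chance that a pair with similarity s shares at least one band.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands, rows):
    """Return the set of (business_a, business_b) pairs that agree on some band."""
    buckets = defaultdict(list)
    for business, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            # Including the band index keeps different bands from colliding.
            buckets[(band, chunk)].append(business)
    pairs = set()
    for members in buckets.values():
        # Sorting gives each pair a canonical (smaller, larger) order.
        pairs.update(combinations(sorted(members), 2))
    return pairs

def candidate_probability(s, rows, bands):
    """Probability that a pair with Jaccard similarity s becomes a candidate."""
    return 1 - (1 - s ** rows) ** bands

toy_signatures = {"b1": [1, 2, 3, 4], "b2": [1, 2, 9, 9], "b3": [7, 7, 3, 5]}
toy_candidates = candidate_pairs(toy_signatures, bands=2, rows=2)
p_at_half = candidate_probability(0.5, rows=5, bands=20)
```

Under this estimate, a pair with similarity exactly 0.5 is caught only about 47% of the time with b=20 and r=5, which is one reason to experiment with fewer rows per band when recall matters.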

Verifying Similarity

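For each candidate pair, compute the actual Jaccard similarity using the original binary data. Filter out those with similarity < 0.5. Finally, output the pairs in alphabetical order with their similarity. To achieve the required precision (>=0.99) and recall (>=0.97), tune b and r. More bands with fewer rows each raise recall, but they also produce more candidate pairs to verify, which costs runtime. Because every reported pair is checked against the exact Jaccard similarity, precision should stay at or near 1.0 regardless of the banding; recall and runtime are what you are really trading off. Experiment to find the sweet spot.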
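The verification step can be sketched as below. The names jaccard and verify_candidates and the toy data are hypothetical; the exact output columns and ordering rules should be taken from the assignment description rather than from this sketch.

```python
def jaccard(set_a, set_b):
    """Exact Jaccard similarity of two sets (0.0 if both are empty)."""
    union = len(set_a | set_b)
    return len(set_a & set_b) / union if union else 0.0

def verify_candidates(candidates, business_users, threshold=0.5):
    """Keep candidates whose true similarity meets the threshold, sorted for output."""
    results = []
    for b1, b2 in candidates:
        first, second = sorted((b1, b2))  # smaller ID first within each pair
        sim = jaccard(business_users[first], business_users[second])
        if sim >= threshold:
            results.append((first, second, sim))
    return sorted(results)  # orders by first ID, then second ID

toy_business_users = {"b1": {0, 1, 2}, "b2": {1, 2, 3}, "b3": {9}}
toy_verified = verify_candidates({("b2", "b1"), ("b1", "b3")}, toy_business_users)
```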

Performance Tips

Your runtime must be under 100 seconds. Use efficient Spark operations: avoid shuffling large datasets unnecessarily. Cache intermediate RDDs if reused. Use broadcast variables for hash function parameters. Consider using combineByKey for aggregations.
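To make the combineByKey tip concrete, the sketch below defines the three functions that PySpark's combineByKey takes (createCombiner, mergeValue, mergeCombiners) for an element-wise minimum over hash vectors. Rather than starting a Spark session, it simulates two partitions in plain Python. The record layout, one (business, hash values of a single user) pair per rating, and the helper combine_partition are assumptions made for illustration.

```python
def create_combiner(hashes):
    """First record for a key within a partition: start the running minimum."""
    return list(hashes)

def merge_value(acc, hashes):
    """Fold another record from the same partition into the running minimum."""
    return [min(x, y) for x, y in zip(acc, hashes)]

def merge_combiners(acc_a, acc_b):
    """Merge running minima that were built on different partitions."""
    return [min(x, y) for x, y in zip(acc_a, acc_b)]

def combine_partition(records):
    """Plain-Python stand-in for what Spark does inside one partition."""
    acc = {}
    for key, value in records:
        acc[key] = merge_value(acc[key], value) if key in acc else create_combiner(value)
    return acc

# In PySpark these would be passed as:
#   rdd.combineByKey(create_combiner, merge_value, merge_combiners)
partition_a = [("b1", [5, 9]), ("b1", [7, 2])]
partition_b = [("b1", [1, 8]), ("b2", [4, 4])]
merged = combine_partition(partition_a)
for key, acc in combine_partition(partition_b).items():
    merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
```

Compared with grouping all of a business's users first, this style only shuffles one partial minimum vector per business per partition.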

Task 2: Collaborative Filtering Recommendation System

Overview

In this task, you'll build a recommendation system to predict star ratings for user-business pairs. You can use item-based or user-based collaborative filtering, or even a hybrid. The training data is yelp_train.csv, and you can tune hyperparameters using the validation set.

Item-Based Collaborative Filtering

Compute similarity between items (businesses) based on user ratings. Common similarity measures include Pearson correlation or cosine similarity, usually computed over the users who rated both items. For each user-item pair, find items the user rated that are similar to the target item. Predict the rating as a similarity-weighted average of the user's ratings on those items, often restricted to the most similar neighbors. To handle cold start, where no usable neighbors exist, you can use the business average rating as a fallback.
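One way to sketch item-based prediction is shown below. It makes several assumptions that the assignment does not dictate: Pearson correlation is computed with means over co-rated users only, only positively correlated neighbors are used, the neighborhood size k and the final default of 3.5 are arbitrary, and all names and toy ratings are hypothetical.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation of two items over users who rated both (0.0 if undefined)."""
    common = ratings_a.keys() & ratings_b.keys()
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[u] for u in common) / len(common)
    mean_b = sum(ratings_b[u] for u in common) / len(common)
    num = sum((ratings_a[u] - mean_a) * (ratings_b[u] - mean_b) for u in common)
    den = (sqrt(sum((ratings_a[u] - mean_a) ** 2 for u in common))
           * sqrt(sum((ratings_b[u] - mean_b) ** 2 for u in common)))
    return num / den if den else 0.0

def predict_item_based(user, business, user_ratings, business_ratings, k=2, default=3.5):
    """Similarity-weighted average over the user's k most similar rated items."""
    target = business_ratings.get(business, {})
    neighbors = []
    for other, stars in user_ratings.get(user, {}).items():
        if other == business:
            continue
        weight = pearson(target, business_ratings[other])
        if weight > 0:  # assumption: ignore uncorrelated and negatively correlated items
            neighbors.append((weight, stars))
    top = sorted(neighbors, reverse=True)[:k]
    if top:
        return sum(w * s for w, s in top) / sum(w for w, _ in top)
    if target:  # fallback: the business's average rating
        return sum(target.values()) / len(target)
    return default  # assumed last resort for a business with no ratings at all

# Hypothetical toy ratings: business -> {user: stars}.
business_ratings = {
    "bx": {"u1": 5.0, "u2": 1.0, "u3": 3.0},
    "by": {"u1": 4.0, "u2": 2.0, "u3": 3.0, "u4": 4.0},
    "bz": {"u1": 1.0, "u2": 5.0, "u3": 3.0, "u4": 2.0},
}
user_ratings = {}
for biz, by_user in business_ratings.items():
    for usr, stars in by_user.items():
        user_ratings.setdefault(usr, {})[biz] = stars

prediction = predict_item_based("u4", "bx", user_ratings, business_ratings)
```

In the toy data, "by" moves with "bx" while "bz" moves against it, so the prediction for u4 is driven by u4's rating of "by".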

User-Based Collaborative Filtering

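Alternatively, find similar users and predict based on their ratings. This can be more expensive for large user bases, because the number of user pairs grows quadratically. You can optimize by precomputing user similarities or by restricting each prediction to a small neighborhood of the most similar users.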

Model-Based Approaches

You might also implement a simple baseline model: predict using the global average plus user bias and item bias. This is fast and often surprisingly effective. Then you can add a collaborative filtering component to improve accuracy.
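A minimal sketch of such a bias baseline follows. The names fit_baseline and predict_baseline, the damping term reg, and the clipping of predictions to the 1 to 5 star range are illustrative assumptions; the biases here are simple (optionally damped) averages rather than the result of an iterative fit.

```python
from collections import defaultdict

def fit_baseline(ratings, reg=0.0):
    """Fit global mean, user biases, then item biases from (user, item, stars) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for user, _, r in ratings:
        user_sum[user] += r - mu
        user_cnt[user] += 1
    # reg > 0 shrinks biases of users with few ratings toward zero.
    user_bias = {u: user_sum[u] / (reg + user_cnt[u]) for u in user_sum}
    item_sum, item_cnt = defaultdict(float), defaultdict(int)
    for user, item, r in ratings:
        item_sum[item] += r - mu - user_bias[user]
        item_cnt[item] += 1
    item_bias = {i: item_sum[i] / (reg + item_cnt[i]) for i in item_sum}
    return mu, user_bias, item_bias

def predict_baseline(user, item, model, low=1.0, high=5.0):
    """mu + b_user + b_item, with unseen users or items contributing zero bias."""
    mu, user_bias, item_bias = model
    raw = mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)
    return min(high, max(low, raw))  # keep predictions within the star range

toy_ratings = [("u1", "b1", 5.0), ("u1", "b2", 3.0), ("u2", "b1", 4.0), ("u2", "b2", 2.0)]
model = fit_baseline(toy_ratings)
```

Because unseen users and businesses simply contribute zero bias, this model also gives a natural cold-start fallback for the collaborative filtering component.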

Evaluation and Tuning

Use Root Mean Squared Error (RMSE) on the validation set to measure performance. Experiment with different similarity thresholds, neighborhood sizes, and regularization. Since you cannot use external libraries, implement standard deviation and correlation manually.
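RMSE itself needs only the standard library. The sketch below assumes predictions and true ratings arrive as two equal-length sequences in the same order; the function name rmse is hypothetical.

```python
from math import sqrt

def rmse(predictions, truths):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predictions) != len(truths) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    squared_error = sum((p - t) ** 2 for p, t in zip(predictions, truths))
    return sqrt(squared_error / len(predictions))

toy_rmse = rmse([3.0, 4.0], [3.0, 2.0])  # one exact hit, one miss by 2 stars
```

Computing this on the validation set after each change to a hyperparameter gives a direct way to compare neighborhood sizes, thresholds, and regularization settings.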

Scala Bonus

If you also submit a Scala implementation, you can earn a 10% bonus per task. The logic is the same, but you'll use Scala's syntax and Spark's Scala API. Make sure both versions are correct.

Trend Connections: LSH in Modern AI Apps

Locality Sensitive Hashing is not just an academic exercise. MinHash-style techniques are widely used for near-duplicate detection in large collections of documents and web pages, and LSH is one of the standard families of approximate nearest-neighbor methods for retrieving similar items from large sets of vectors, the kind of search that sits behind recommendation and image-search systems. In the era of big data and AI, LSH remains a fundamental tool for scalable similarity search.

Final Tips for Success

Start early: Both tasks require careful tuning and debugging.
Test on small data: Create a tiny subset of the Yelp data to verify your logic.
Output format matters: The autograder expects exact CSV formatting. Double-check alphabetical ordering.
Monitor runtime: Optimize Spark jobs to stay under the time limit.
Plagiarism warning: Write your own code; similarity detection is strict.

Good luck with your DSCI553 Assignment 3! By mastering LSH and collaborative filtering, you'll gain skills directly applicable to data mining and recommendation systems in industry.