Master LSH & Collaborative Filtering with Spark RDD

Introduction: Why LSH and Recommender Systems Matter in 2026

In today's data-driven world, from TikTok's viral video suggestions to Spotify's Discover Weekly, recommendation systems are everywhere. As a data mining student, mastering Locality Sensitive Hashing (LSH) and collaborative filtering is essential. This tutorial walks you through the core concepts of Assignment 3 for DSCI553, focusing on Python and Spark RDD implementations. We'll use timely examples—like how a streaming service might find similar movies or how a gaming platform recommends new titles—to make the theory stick.

Understanding the Dataset: Yelp Reviews

The dataset is a subset of Yelp reviews, containing user_id, business_id, and stars. For Task 1, we binarize ratings: 1 if the user rated the business, 0 otherwise. For Task 2, we use actual star ratings to predict missing ratings. The training set yelp_train.csv is used to build models, while yelp_val.csv helps tune hyperparameters. In 2026, similar datasets power AI-driven restaurant recommendations in apps like Uber Eats.

Task 1: Jaccard Based LSH (2 Points)

Goal: Find Similar Businesses

Your mission is to identify pairs of businesses with Jaccard similarity >= 0.5. Jaccard similarity measures the overlap of users who reviewed both businesses divided by the total unique users who reviewed either. For example, if business A and B share 3 users and have 5 unique users total, similarity = 3/5 = 0.6.

Step 1: Build the Characteristic Matrix

Represent each business as a set of users who rated it. In Spark RDD, you can create a row for each user-business pair with value 1, then group by business to get the set of users. This matrix is sparse—most entries are 0.

Step 2: MinHash Signatures

Instead of computing Jaccard similarity for all pairs (O(n^2)), we use MinHash to compress sets into fixed-size signatures. Choose n hash functions of the form h(x) = (a*x + b) % m, where m is a large prime (e.g., 10007) and a, b are random integers. For each business, compute the minimum hash value over its set of users for each hash function. The result is a signature vector of length n.

Example: Suppose we have 3 hash functions. For a business with users {user1, user3}, the signature might be [2, 5, 1]. The probability that two businesses have the same value in one hash function equals their Jaccard similarity.

Step 3: Banding Technique

Divide the n signature rows into b bands of r rows each (b * r = n). Two businesses are a candidate pair if their signatures match in all rows of at least one band. This amplifies similarity: if true similarity is high, they likely match in a band; if low, they likely don't. Choose b and r to balance false positives and false negatives. For similarity threshold 0.5, a common choice is b = 20, r = 5 (n=100).

Step 4: Verify with Exact Jaccard

For each candidate pair, compute the exact Jaccard similarity from the original data and keep those >= 0.5. Output a CSV with columns business_id_1, business_id_2, similarity, sorted alphabetically per pair and overall.

Performance Tips

Use Spark RDD operations like map, reduceByKey, and groupByKey. Avoid DataFrames. To meet the 100-second runtime, use efficient hash functions and broadcast the user-to-index mapping. In 2026, similar LSH techniques help streaming services like Netflix find similar shows in milliseconds.

Task 2: Recommendation System (5 Points)

Goal: Predict Star Ratings

You'll build an item-based collaborative filtering system using Pearson correlation. The idea: find businesses similar to the target business based on common users, then predict the rating as a weighted average of the target user's ratings on similar businesses.

Step 1: Compute Item Similarity

For each pair of businesses that share at least one user, compute Pearson correlation using only users who rated both. The formula: sim(i, j) = (sum((r_ui - mean_i)*(r_uj - mean_j))) / (sqrt(sum((r_ui - mean_i)^2)) * sqrt(sum((r_uj - mean_j)^2))). Use Spark RDD to compute this efficiently.

Step 2: Predict Ratings

For a user u and business i, find the k most similar businesses that u has rated. The prediction is: pred = mean_i + (sum(sim(i, j) * (r_uj - mean_j)) / sum(|sim(i, j)|)). Choose k (e.g., 10) based on validation set performance.

Step 3: Handle Cold Start

If no similar businesses exist, predict the global average or the business's average. For new users, predict the business's average. Use the validation set to tune k and possibly add regularization.

Improving Accuracy

In 2026, state-of-the-art recommender systems use neural networks, but for this assignment, you can enhance Pearson by adding a shrinkage factor: sim_shrunk = (n_common / (n_common + lambda)) * sim. This reduces noise from few co-ratings. Also, consider using the Jaccard similarity from Task 1 as a fallback.

Connecting to Current Trends

In 2026, AI-powered recommendation is hotter than ever. Think of how Spotify uses collaborative filtering to suggest new songs, or how YouTube recommends videos based on watch history. Even AI chatbots like ChatGPT use similar techniques to personalize responses. By mastering LSH and collaborative filtering, you're building skills directly applicable to real-world data mining jobs in tech, finance, and e-commerce.

Common Pitfalls and How to Avoid Them

Using DataFrames instead of RDDs: The assignment strictly requires RDDs. Stick to map, flatMap, reduceByKey, etc.
Inefficient hash functions: Use a prime modulus and random coefficients. Precompute hash values for each user index.
Ignoring sorting: Output must be sorted alphabetically. Use sortBy on the RDD before saving.
Not validating parameters: Test different b and r combinations to meet the precision/recall targets.

Example Code Snippet: MinHash Signature Generation

# Python-like pseudocode for MinHash using Spark RDD
# Assume business_user_rdd: (business_id, [user_ids])
# hash_functions: list of (a, b) pairs
n_hashes = 100
signature_rdd = business_user_rdd.mapValues(lambda users: [min((a*u + b) % prime for u in users) for a, b in hash_functions])

Conclusion

By completing this assignment, you'll gain hands-on experience with two fundamental data mining techniques: LSH for approximate similarity search and collaborative filtering for recommendation. These skills are highly valued in 2026's job market, where companies like Amazon, Google, and Meta rely on similar algorithms to drive user engagement. Start early, test often, and remember: the key to success is understanding the math behind the code.