Set Similarity Join with Jaccard on Apache Spark: Tutorial for COMP9313

Introduction to Set Similarity Join

In the era of big data, finding similar records across large datasets is a fundamental operation. Whether you're building a recommendation engine for a streaming service like Netflix, deduplicating user profiles in a social network, or detecting near-duplicate documents in a search index, the set similarity join is a powerful tool. This tutorial will guide you through implementing a self-join using Jaccard similarity on Apache Spark, deployed on Google Dataproc. By the end, you'll understand the algorithm, write PySpark code, and optimize it for real-world datasets.

Understanding Jaccard Similarity

Jaccard similarity measures the overlap between two sets. For sets A and B, it is defined as:

J(A, B) = |A ∩ B| / |A ∪ B|

For example, if A = {1,4,5,6} and B = {1,4,6}, then |A ∩ B| = 3 and |A ∪ B| = 4, so J = 0.75. In a set similarity join, we want all pairs (r, s) where Jaccard similarity ≥ a threshold τ. A practical threshold might be 0.5, which means at least half of the elements must be shared.

Why Spark and Dataproc?

Apache Spark excels at distributed data processing, making it ideal for large-scale similarity joins. Google Dataproc provides a managed Spark environment, so you can focus on code rather than cluster setup. This combination is popular in industry for tasks like AI-driven personalization and fraud detection.

The Dataset: Self-Join on a Single File

In this project, you perform a self-join: the same input file serves as both R and S. Each line contains a record ID followed by a list of integers (the set elements). For instance:

This small dataset is perfect for testing. A larger example like flickr_small.txt (from Flickr) contains real-world tags, simulating a tag-based recommendation system.

Algorithm Overview

Naively comparing all pairs is O(n^2), which is infeasible for millions of records. We need an efficient approach. A common technique is prefix filtering with inverted indices. The idea:

Sort each set's elements in a fixed global order (e.g., by frequency).
For each set, take the first k elements (prefix), where k = |set| - ceil(τ * |set|) + 1. This ensures that if two sets are similar, they must share at least one prefix element.
Build an inverted index mapping each element to the records that contain it in their prefix.
For each record, look up its prefix elements in the index to find candidate pairs, then verify Jaccard similarity.

This drastically reduces the number of candidate pairs.

Step-by-Step Implementation in PySpark

Let's implement the self-join on Dataproc. Assume you have a cluster with Spark 3.x and Python 3.8.

Step 1: Load and Parse Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, size, sort_array, lit, explode, collect_list

spark = SparkSession.builder.appName("SetSimilarityJoin").getOrCreate()

# Load input file (adjust path)
raw = spark.read.text("gs://your-bucket/tiny-data.txt")

# Parse: split line into recordId and elements
parsed = raw.withColumn("recordId", split(col("value"), "\\s+")[0].cast("int")) \
            .withColumn("elements", split(col("value"), "\\s+")[1:])
# Convert elements to array of integers
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql.functions import udf
cast_int = udf(lambda arr: [int(x) for x in arr], ArrayType(IntegerType()))
df = parsed.withColumn("elements", cast_int(col("elements")))
df.show()

Step 2: Sort Elements and Compute Prefix Length

We need a global ordering. A simple approach: sort elements within each set, then sort by frequency across the dataset. For simplicity, we'll sort each set internally.

# Sort elements within each set
df_sorted = df.withColumn("sorted_elements", sort_array(col("elements")))

# Compute set size
from pyspark.sql.functions import size as spark_size
df_sorted = df_sorted.withColumn("set_size", spark_size(col("sorted_elements")))

# Compute prefix length: k = set_size - ceil(threshold * set_size) + 1
threshold = 0.5  # τ
from pyspark.sql.functions import ceil, when
df_sorted = df_sorted.withColumn("prefix_len", 
    when(col("set_size") > 0, 
         col("set_size") - ceil(lit(threshold) * col("set_size")) + 1
    ).otherwise(0).cast("int"))
df_sorted.select("recordId", "sorted_elements", "set_size", "prefix_len").show()

Step 3: Extract Prefix and Build Inverted Index

# Extract prefix (first prefix_len elements)
from pyspark.sql.functions import slice
df_prefix = df_sorted.withColumn("prefix", slice(col("sorted_elements"), 1, col("prefix_len")))

# Explode prefix to get (element, recordId) pairs
prefix_exploded = df_prefix.select(
    col("recordId"),
    col("sorted_elements"),
    col("set_size"),
    explode(col("prefix")).alias("element")
)

# Build inverted index: for each element, collect list of recordIds
inverted = prefix_exploded.groupBy("element").agg(collect_list("recordId").alias("recordIds"))

Step 4: Generate Candidate Pairs

For each record, we look up its prefix elements in the inverted index to get candidate records. We then pair them, ensuring we don't double-count (recordId1 < recordId2).

# For each record, get candidates from inverted index
candidates = prefix_exploded.alias("a").join(
    inverted.alias("b"),
    col("a.element") == col("b.element")
).select(
    col("a.recordId").alias("r1"),
    col("a.sorted_elements").alias("set1"),
    col("a.set_size").alias("size1"),
    explode(col("b.recordIds")).alias("r2")
).filter(col("r1") < col("r2"))  # avoid self and duplicate pairs

# Remove duplicate candidate pairs (same (r1,r2) may appear multiple times)
candidates_distinct = candidates.select("r1", "r2", "set1", "size1").distinct()

Step 5: Verify Jaccard Similarity

For each candidate pair, we need the set of the second record. We'll join with the original dataframe to get set2.

# Get set2 for each r2
candidates_with_sets = candidates_distinct.alias("c").join(
    df_sorted.select(col("recordId").alias("r2b"), col("sorted_elements").alias("set2"), col("set_size").alias("size2")),
    col("c.r2") == col("r2b")
).select(
    col("c.r1"),
    col("c.r2"),
    col("c.set1"),
    col("c.size1"),
    col("set2"),
    col("size2")
)

# Compute intersection size via array_intersect
from pyspark.sql.functions import array_intersect, size as spark_size2
candidates_with_sets = candidates_with_sets.withColumn(
    "intersection", spark_size2(array_intersect(col("set1"), col("set2")))
).withColumn(
    "union", col("size1") + col("size2") - col("intersection")
).withColumn(
    "jaccard", col("intersection") / col("union")
)

# Filter by threshold
results = candidates_with_sets.filter(col("jaccard") >= threshold)

# Format output: (r1,r2) \t similarity
output = results.select(
    concat(lit("("), col("r1"), lit(","), col("r2"), lit(")"), lit("\t"), col("jaccard")).alias("output_line")
)

# Save to GCS (single file for small data, or partition for large)
output.coalesce(1).write.text("gs://your-bucket/output")

Running on Google Dataproc

To run this on Dataproc, create a cluster with Spark and submit the job. For example:

gcloud dataproc clusters create my-cluster --region us-central1 --single-node

gcloud dataproc jobs submit pyspark gs://your-bucket/script.py \
    --cluster my-cluster --region us-central1

Monitor the job in the GCP console. For large datasets, consider using broadcast variables for the inverted index if it fits in memory, or use a reduce-side join for better scalability.

Optimization Tips

Global ordering by frequency: Sort elements by their frequency across the entire dataset (rare elements first). This reduces prefix lengths and candidate pairs.
Use DataFrames with Catalyst optimizer: Avoid UDFs when possible; use built-in Spark SQL functions for better performance.
Partitioning: Repartition your data by element to avoid shuffles during the inverted index build.
Threshold tuning: Higher thresholds yield shorter prefixes but fewer candidates; lower thresholds require longer prefixes.

Real-World Application: AI Recommendation Systems

Imagine you're building a music recommendation engine like Spotify. Each user's playlist is a set of song IDs. A set similarity self-join can find users with similar taste, enabling collaborative filtering. In 2026, such systems are ubiquitous in AI-powered apps, from TikTok's For You page to Amazon's product suggestions. The Jaccard similarity is a simple yet effective metric for this task.

Conclusion

You've learned how to implement a set similarity self-join using Jaccard similarity on Spark with Dataproc. This technique is essential for many big data applications. Experiment with different thresholds and datasets to see how performance scales. Happy coding!