Programming lesson
Mastering Sensitivity and Laplace Mechanism for Differential Privacy in CSCI 4260/6260
Learn how to compute query sensitivity under different neighboring dataset definitions and apply the Laplace mechanism for differential privacy, with step-by-step examples and real-world analogies from AI and data science.
Understanding Sensitivity in Differential Privacy
Differential privacy has become a cornerstone of modern data protection, especially in AI and machine learning applications where sensitive data is analyzed. In this tutorial, we focus on the core concepts from CSCI 4260/6260 Assignment 3: computing sensitivity of query functions and using the Laplace mechanism to achieve privacy. We'll break down each problem with clear reasoning, using timely examples from data science and privacy-preserving AI to make the ideas stick.
What is Sensitivity?
Sensitivity measures the maximum change in a query's output when one individual's data is added, removed, or replaced. It's the key to calibrating noise for differential privacy. We consider two definitions of neighboring datasets: proper subset (addition/removal) and replacement (changing one element). Let's compute sensitivity for three query functions.
Problem 1: Count of Elements Greater than p
Let dataset D have n elements from U = {1,...,100}. Query q(D) = |{x : x > p, x ∈ D}|. Under the proper subset definition, neighboring datasets differ by one element (add or remove). The maximum change occurs when the added/removed element is exactly on the threshold p. For example, if p=50, adding an element 51 increases count by 1; removing it decreases by 1. So sensitivity = 1. Provide the pair: D = {51}, D' = {} (or D = {51}, D' = {51,52}? Actually, to maximize change, consider D with no elements > p, and D' with one element > p. The sensitivity is 1.
Under the replacement definition, neighboring datasets differ by replacing one element. The maximum change occurs when you replace an element ≤ p with one > p, or vice versa. The count can change by at most 1. For example, D = {50}, D' = {51}. Sensitivity = 1. So for count queries, both definitions yield sensitivity 1.
Problem 1 (2): Sum of Elements
Query q(D) = Σ xi. Under proper subset, adding an element can increase sum by up to 100 (max element), and removing can decrease by up to 100. Sensitivity = 100. Example: D = {100}, D' = {} (or D = {}, D' = {100}). Under replacement, replacing an element can change sum by at most 99 (from 1 to 100). Sensitivity = 99. Example: D = {1}, D' = {100}. This shows definition choice matters for non-count queries.
Problem 1 (3): Median with Odd n
Query q(D) = median of sorted elements. Using replacement definition and odd n. The median is the middle element. Replacing one element can shift the median by at most the range of values. For U = [1,100], sensitivity = 100? Actually, consider n=3, D = {1,50,100}, median=50. Replace 50 with 1 gives median=1, change=49. But worst-case: D = {1,1,100}, median=1. Replace 1 with 100 gives D'={1,100,100}, median=100, change=99. So sensitivity = 99. Example pair: D = {1,1,100}, D' = {1,100,100}. Sensitivity = 99.
Laplace Mechanism and Variance Analysis
The Laplace mechanism adds noise from Lap(0, λ) to achieve ϵ-differential privacy. For a query with sensitivity Δ, set λ = Δ/ϵ. The mechanism is unbiased: E[M(D)] = q(D). Variance is 2λ².
Problem 2: Splitting Privacy Budget
Suppose we split total budget ϵ into two parts. For equal split [ϵ/2, ϵ/2], each query uses λ = 2/ϵ. Then R1 = q(D) + Lap(0,2/ϵ), R2 = q(D) + Lap(0,2/ϵ). Variance of sum Var(R1+R2) = Var(R1)+Var(R2) = 2*(2*(2/ϵ)²) = 16/ϵ²? Wait: Var(Lap(0,λ)) = 2λ². So Var(R1)=2*(2/ϵ)² = 8/ϵ², same for R2, sum = 16/ϵ².
For split [ϵ/3, 2ϵ/3], R1 uses λ=3/ϵ, R2 uses λ=3/(2ϵ). Var(R1)=2*(3/ϵ)²=18/ϵ², Var(R2)=2*(3/(2ϵ))²=2*(9/(4ϵ²))=9/(2ϵ²)=4.5/ϵ². The second split gives lower total variance (22.5/ϵ² vs 16/ϵ²? Actually 18+4.5=22.5 > 16, so equal split is better for this example. However, the question asks which answer is more useful: usually lower variance is better. So equal split yields lower variance for the sum? Wait, we need to compute variance of R1 and R2 individually? Problem asks: compute variance of R1 and R2 and state which answer is more useful. Possibly they want to compare the two mechanisms for a single query? Actually reading: split budget and obtain R1 and R2, then compute variance of R1+R2. Then compare to the split [ϵ/3,2ϵ/3]. The one with lower variance is more useful. Equal split gives 16/ϵ², unequal gives 22.5/ϵ², so equal split is better. But note: with unequal split, the second query has lower variance (4.5/ϵ²) than the first (18/ϵ²), so if you only care about one of them, you might prefer the second. The problem likely expects you to compute both variances and say the one with smaller variance is more useful.
Private Synthetic Data via Histograms
In modern data science, generating synthetic datasets that preserve privacy is crucial. This part of the assignment builds a 2D histogram and releases it under differential privacy.
Building a 2D Histogram
Given observations (x1, x2) in [0,1)², split each dimension into k intervals, creating k² bins. Count observations in each bin. For example, with k=2, we get 4 bins. The function construct_histogram returns a 1D array of counts in order: left-to-right, top-to-bottom.
K-Anonymity via Histograms
To achieve k-anonymity, we can suppress bins with counts less than k, or generalize bins. For instance, merge small bins with neighbors. A dataset satisfies k-anonymity if each record is indistinguishable from at least k-1 others. Using the histogram, we can generate synthetic points uniformly within each bin that has count ≥ k. For k=5, we would only use bins with count ≥5. The released dataset would list each bin's range and a representative count (e.g., the original count or a synthetic count).
Sensitivity of Histogram Release
Under proper subset, adding/removing one observation changes one bin count by 1, so sensitivity = 1. Under replacement, changing one observation can move it from one bin to another, changing two bins by ±1 each, so sensitivity = 2. This is important for calibrating noise.
Releasing Histogram with Laplace Mechanism
To release a private histogram, add independent Laplace noise to each bin count. For proper subset definition, use λ = 1/ϵ. For replacement, λ = 2/ϵ. The noisy counts can then be used to generate synthetic data, e.g., by sampling uniformly from each bin proportional to its noisy count (after post-processing like clamping negatives to 0).
Real-World Relevance and Trends
Differential privacy is used by Apple, Google, and the US Census Bureau to protect user data. In 2026, with the rise of generative AI and privacy regulations, understanding these concepts is more valuable than ever. This tutorial helps you build a solid foundation for privacy-preserving data analysis, a key skill in data science and AI ethics.
Key Takeaways
- Sensitivity depends on the query and the neighboring dataset definition.
- The Laplace mechanism adds noise proportional to sensitivity divided by privacy budget.
- Splitting the privacy budget affects variance; careful allocation improves utility.
- Histograms are a natural tool for generating private synthetic data.
By mastering these problems, you'll be well-prepared for advanced topics in differential privacy and data anonymization.