Programming lesson
Mastering Login Uniqueness: A Tutorial on Bloom Filters, Cuckoo Filters, and Hash Tables for COSC 520
Learn how to solve the login checker problem using hash tables, Bloom filters, and Cuckoo filters. This tutorial covers complexity analysis, Python implementation, and performance comparisons for COSC 520 assignment 1.
Introduction to the Login Checker Problem
In today's digital world, ensuring unique usernames is critical for any online platform. Imagine a social media app like Threads or a gaming platform like Fortnite processing millions of new sign-ups daily. The login checker problem requires quickly verifying that a new login name hasn't been taken. With billions of existing usernames, efficiency matters. This tutorial explores five data structures: linear search, binary search, hashing, Bloom filters, and Cuckoo filters, focusing on the latter two as advanced probabilistic solutions.
Understanding the Core Data Structures
Linear Search
The simplest approach: store all logins in an unsorted list and scan sequentially. Time complexity is O(n) per lookup, where n is the number of existing logins. For n=1 billion, this is impractical—imagine checking every username one by one, like searching for a specific tweet in a decade of Twitter history. Space complexity is O(n) for storing the list.
Binary Search
If logins are stored in a sorted array, binary search reduces time to O(log n). That's efficient—like finding a word in a dictionary. However, inserting new logins requires O(n) time to maintain sorted order, making it unsuitable for high-write workloads. Space remains O(n).
Hash Tables
Hash tables offer average O(1) lookup and insertion. They map logins to array indices using a hash function. Collisions are handled via chaining or open addressing. For 1 billion entries, memory becomes a concern: storing keys and pointers adds overhead. But for moderate n, it's a solid choice. Python's set uses hash tables internally.
Probabilistic Data Structures: Bloom and Cuckoo Filters
When memory is tight and false positives are acceptable, probabilistic filters shine. They are used in databases (e.g., Cassandra), network routers, and even blockchain light clients.
Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure that tests membership. It uses a bit array of m bits and k hash functions. To add an element, hash it k times and set the corresponding bits to 1. To check membership, hash and verify all bits are 1. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element might be present (false positive possible). No false negatives. The false positive rate p is approximately (1 - e^{-kn/m})^k. For n=1 billion and p=1%, m ≈ 9.6 billion bits (~1.2 GB) and k≈7. This is far less than storing 1 billion strings.
Cuckoo Filter
A Cuckoo filter improves upon Bloom filters by supporting deletions and having lower false positive rates for the same memory. It uses a hash table with buckets containing fingerprints (short hash values). When inserting, if the bucket is full, it kicks out an existing fingerprint to another location (cuckoo hashing). This ensures high space utilization. For n=1 billion, a Cuckoo filter with fingerprint size f=8 bits and bucket size b=4 achieves ~95% load factor and false positive rate ~2^{-f} ≈ 0.4%. Memory: about 1.25 GB. Cuckoo filters are used in Google's Bigtable and other large-scale systems.
Complexity Analysis
Below is a summary table of time and space complexities for the login checker problem.
| Data Structure | Lookup Time | Insert Time | Space | False Positives |
|---|---|---|---|---|
| Linear Search | O(n) | O(1) | O(n) | No |
| Binary Search (sorted array) | O(log n) | O(n) | O(n) | No |
| Hash Table | O(1) avg | O(1) avg | O(n) | No |
| Bloom Filter | O(k) | O(k) | O(m) | Yes |
| Cuckoo Filter | O(1) avg | O(1) avg | O(n * fingerprint) | Yes |
For Bloom filter, m is the bit array size, k is number of hash functions. For Cuckoo filter, fingerprint size f and bucket size b affect space and false positive rate. In practice, Cuckoo filters often outperform Bloom filters when deletions are needed.
Python Implementation Guide
We'll implement a login checker using Python's built-in set for hashing, a custom Bloom filter, and a Cuckoo filter. Use the pycuckoo library or implement from scratch. Below is a simplified Bloom filter class.
import hashlib
import bitarray
class BloomFilter:
def __init__(self, size, hash_count):
self.size = size
self.hash_count = hash_count
self.bit_array = bitarray.bitarray(size)
self.bit_array.setall(0)
def _hashes(self, item):
result = []
for i in range(self.hash_count):
digest = hashlib.sha256((str(i) + item).encode()).hexdigest()
result.append(int(digest, 16) % self.size)
return result
def add(self, item):
for h in self._hashes(item):
self.bit_array[h] = 1
def check(self, item):
for h in self._hashes(item):
if self.bit_array[h] == 0:
return False
return True
For Cuckoo filter, we can use the cuckoo-filter package: pip install cuckoo-filter. Then:
from cuckoo import CuckooFilter
cf = CuckooFilter(capacity=1000000, fingerprint_size=8, bucket_size=4)
cf.insert('user123')
print(cf.contains('user123')) # True
print(cf.contains('unknown')) # False (mostly)
Performance Comparison
We'll benchmark with n=10 million logins (due to memory constraints on a typical laptop). Generate random strings as usernames. Measure lookups for 100,000 random checks. Plot the results.
Expected outcomes: Linear search is extremely slow (seconds per lookup). Binary search is faster but insertions are costly. Hash table (set) is fast (~0.1 µs per lookup). Bloom filter is fast (~0.2 µs) with false positives. Cuckoo filter is also fast (~0.15 µs) and supports deletions. The plot will show that probabilistic filters trade accuracy for speed and memory.
Connecting to Trends: AI and Gaming
Just as AI models like ChatGPT use Bloom filters to quickly check if a token is in a vocabulary, login checkers must handle billions of entries. In gaming, player usernames must be unique across millions of accounts; Cuckoo filters help maintain low latency. Even in finance, transaction IDs are checked for duplicates using similar techniques.
Conclusion
This tutorial covered the login checker problem from simple linear search to advanced Cuckoo filters. For your COSC 520 assignment, implement the Python code, generate plots, and include complexity analysis in your ACM-format report. Remember to upload your dataset and code to GitHub. Good luck!