Data Analysis Tutorial: Randomization Tests, Text Mining, Network Centrality

Introduction to Statistical Testing and Randomization

Testing hypotheses is a core skill in data analysis. Whether you're analyzing Facebook page reach by age and gender or evaluating the effectiveness of a new AI app feature, the principles remain the same. In this tutorial, we'll explore three key areas: randomization tests for hypothesis testing, text mining for tweet analysis, and network centrality for YouTube clip graphs. These techniques are widely used in fields like marketing analytics, social media monitoring, and recommendation systems.

Hypothesis Testing with Randomization

When comparing groups, such as reach counts by age group and gender, we often want to know if observed differences are statistically significant. The null hypothesis (H₀) states that there is no difference between groups, while the alternative hypothesis (H₁) states that there is a difference. For example, H₀: The mean reach is equal across all age-gender groups; H₁: At least one group mean differs.

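The appropriate test statistic depends on the data structure. For comparing more than two groups, the F-statistic from one-way ANOVA is commonly used. Its equation is: F = (between-group variability) / (within-group variability), where each variability is a mean square, that is, a sum of squares divided by its degrees of freedom. The randomization process involves shuffling the group labels many times (e.g., 10,000 permutations) and recalculating the F-statistic each time to build a null distribution. The p-value is the proportion of shuffled F-statistics that are at least as large as the observed one (e.g., 3.1). If that proportion is below the chosen significance level (e.g., 0.05), the observed F-statistic falls in the extreme upper tail of the null distribution and we reject H₀. This approach is commonly taught in data science bootcamps and online courses.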
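The shuffling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the reach counts and group labels are made-up example data, the helper names `f_statistic` and `permutation_f_test` are hypothetical, and the add-one correction in the p-value is one common convention among several.

```python
import random

def f_statistic(values, labels):
    """One-way ANOVA F = (between-group mean square) / (within-group mean square)."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(
        len(xs) * (sum(xs) / len(xs) - grand_mean) ** 2 for xs in groups.values()
    )
    ss_within = sum(
        (x - sum(xs) / len(xs)) ** 2 for xs in groups.values() for x in xs
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_f_test(values, labels, n_perm=10_000, seed=0):
    """Shuffle the labels n_perm times and compare each F to the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(values, labels)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if f_statistic(values, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction (an assumption here) keeps the p-value above zero.
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical reach counts for three age-gender groups (illustrative only).
reach = [120, 135, 128, 150, 162, 158, 90, 95, 101]
group = ["F18-24"] * 3 + ["M18-24"] * 3 + ["F25-34"] * 3

observed_f, p_value = permutation_f_test(reach, group, n_perm=2000)
print(f"observed F = {observed_f:.2f}, permutation p-value = {p_value:.4f}")
```

A small p-value means that few shuffled labelings produce an F-statistic as large as the observed one, which is the evidence against H₀.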

Text Mining and Tweet Analysis

With the rise of AI-generated content and viral tweets, text preprocessing is essential. For the alien language tweets: "do da da da do", "di di di do do", and "da da da da da da", we must decide whether to apply stop word removal and stemming. Stop word removal typically drops common function words (in English, words like 'the' and 'is'). It is not helpful here: we have no stop word list for this language, and with only three distinct words we treat all of them as content words. Stemming (reducing words to root forms) would also be ineffective because the words are already short and show no prefixes or suffixes to strip. Thus, we skip both steps.

Next, we construct a document-term frequency matrix. Unique terms: do, da, di. The matrix (rows=tweets, columns=terms) is:

         do  da  di
Tweet1:   2   3   0
Tweet2:   2   0   3
Tweet3:   0   6   0
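
As a rough sketch, the same matrix can be built by splitting each tweet on whitespace and counting terms. The variable names and the helper `term_counts` are illustrative choices; the column order (do, da, di) follows the matrix above.

```python
from collections import Counter

tweets = ["do da da da do", "di di di do do", "da da da da da da"]
vocab = ["do", "da", "di"]  # column order of the matrix

def term_counts(text, vocab):
    """Count how often each vocabulary term occurs in a whitespace-split tweet."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

# One row per tweet, one column per term.
dtm = [term_counts(tweet, vocab) for tweet in tweets]

for i, row in enumerate(dtm, start=1):
    print(f"Tweet{i}: {row}")
# Tweet1: [2, 3, 0]
# Tweet2: [2, 0, 3]
# Tweet3: [0, 6, 0]
```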

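To compute cosine similarity to the query "da di", we first create a query vector in the same column order as the matrix: (do=0, da=1, di=1), whose length is √(0²+1²+1²) = √2. Then compute cosine similarity for each tweet: cos(Tweet1, query) = (2*0 + 3*1 + 0*1) / (√(2²+3²+0²) * √2) = 3 / (√13 * √2) ≈ 0.588. Similarly, cos(Tweet2, query) = (2*0 + 0*1 + 3*1) / (√13 * √2) ≈ 0.588, and cos(Tweet3, query) = (0*0 + 6*1 + 0*1) / (√36 * √2) = 6 / (6 * √2) ≈ 0.707. So Tweet3 is the most similar to the query, with Tweet1 and Tweet2 tied behind it. Tweet3 ranks first even though it never contains "di", because cosine similarity measures direction rather than length and every word in Tweet3 matches a query term. This technique is used in search engines and recommendation algorithms, similar to how TikTok suggests videos based on hashtag similarity.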
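
The calculation can be checked with a short script. This is a minimal sketch: the function name `cosine` and the dictionary layout are illustrative, and vectors use the column order (do, da, di). Note that Tweet3 shares the term "da" with the query, so its dot product with the query is 6 rather than 0.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Rows of the document-term matrix, columns ordered (do, da, di).
dtm = {
    "Tweet1": [2, 3, 0],
    "Tweet2": [2, 0, 3],
    "Tweet3": [0, 6, 0],
}
query = [0, 1, 1]  # "da di"

scores = {name: cosine(vector, query) for name, vector in dtm.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
# Tweet3: 0.707
# Tweet1: 0.588
# Tweet2: 0.588
```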

Network Centrality and Graph Analysis

Networks such as the relationships between YouTube clips can be analyzed using graph theory. Given a graph, we construct an adjacency matrix where rows and columns represent vertices, and entries indicate edges (1 if connected, 0 otherwise). The graph diameter is the length of the longest shortest path between any two vertices. Betweenness centrality measures how often a vertex lies on shortest paths between other vertices; vertices with high betweenness act as bridges. The density is the ratio of actual edges to possible edges: density = 2E / (V(V-1)) for undirected graphs without self-loops, where E is the number of edges and V is the number of vertices. These metrics help identify key influencers in a network, similar to how Twitter identifies trending topics or how gaming communities find central players in multiplayer matches.
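
The sketch below computes these quantities for a small made-up graph. The five vertices and their edges are hypothetical stand-ins, since no specific clip graph is given here. Shortest paths are found with breadth-first search, and betweenness is accumulated in the style of Brandes' algorithm, left unnormalized and counting each unordered pair of vertices once. The diameter line assumes the graph is connected.

```python
from collections import deque

# Hypothetical undirected clip graph (illustrative only).
vertices = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

index = {v: i for i, v in enumerate(vertices)}
n = len(vertices)

# Adjacency matrix: 1 if two vertices share an edge, 0 otherwise.
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[index[u]][index[v]] = 1
    adj[index[v]][index[u]] = 1

def bfs(source):
    """Return distances, shortest-path counts, predecessors, and visit order."""
    dist = [-1] * n            # -1 marks vertices not yet reached
    sigma = [0] * n            # number of shortest paths from source
    preds = [[] for _ in range(n)]
    dist[source], sigma[source] = 0, 1
    order, queue = [], deque([source])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in range(n):
            if not adj[u][w]:
                continue
            if dist[w] < 0:
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    return dist, sigma, preds, order

density = 2 * len(edges) / (n * (n - 1))
diameter = max(max(bfs(s)[0]) for s in range(n))  # assumes a connected graph

def betweenness():
    """Unnormalized betweenness centrality for an undirected, unweighted graph."""
    score = [0.0] * n
    for s in range(n):
        _, sigma, preds, order = bfs(s)
        delta = [0.0] * n
        for w in reversed(order):  # farthest vertices first
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return [x / 2 for x in score]  # each unordered pair was counted twice

centrality = betweenness()
print("density:", density)    # 0.5
print("diameter:", diameter)  # 3
print("betweenness:", dict(zip(vertices, centrality)))
# C scores 4.0 and D scores 3.0; A, B, and E score 0.0
```

In this example graph, C has the highest betweenness because every shortest path from A or B to D or E passes through it, which is the bridge role described above.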

By mastering these techniques, you'll be equipped to handle real-world data challenges, from analyzing Facebook ad performance to uncovering patterns in viral content.