SOCS0055 UKHLS Data Wrangling & Regression Tutorial

Introduction: Why UKHLS Data Skills Matter in 2026

Wrangling large longitudinal datasets such as the UK Household Longitudinal Study (UKHLS, also known as Understanding Society) is a core skill for applied data science. As of May 2026, researchers use UKHLS to study trends in smoking, obesity, mental health, and employment, topics that directly inform public policy and health technology. This tutorial mirrors the core tasks of the SOCS0055 summative assessment, focusing on efficient data wrangling, publication-ready tables, and cross-sectional regression pipelines. Whether you're preparing for the assessment or building a portfolio, these techniques will save you time and make your code clearer.

Task 1: Building a Long-Format Dataset with Minimal Code

The Goal

Create a tidy dataset with one row per participant-wave combination, including only those who completed Wave 1. Clean variables: smoking status (binary), obesity (binary), SF-12 mental/physical scores, life satisfaction, age, interview date (month-year), sex, degree-level education, and employment status.

Key Strategy: Use Functions to Avoid Repetition

Instead of manually reading and cleaning each wave file, write a function that processes one wave and then apply it across waves using lapply() or map(). This reduces code length and errors. For example:

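library(haven)
library(dplyr)

# Function to read and clean one wave (wave = letter prefix, e.g. "a" for Wave 1)
clean_wave <- function(wave) {
  file <- paste0("data/", wave, "_indresp.dta")
  vars <- "smever|smnow|bmi_dv|sf12pcs_dv|sf12mcs_dv|age_dv|sex_dv|hiqual_dv|jbhas|jboff"
  df <- read_dta(file) %>%
    select(pidp, matches(paste0("^", wave, "_(", vars, ")$"))) %>%
    # Add life satisfaction and interview date variables (find correct stubs)
    # ...
    # Strip the wave prefix so columns line up when the waves are stacked
    rename_with(~ sub(paste0("^", wave, "_"), "", .x), -pidp) %>%
    mutate(wave = match(wave, letters)) # numeric wave: a = 1, b = 2, ...
  return(df)
}

# UKHLS wave prefixes are letters: a_ (Wave 1) through n_ (Wave 14)
waves <- letters[1:14]
long_data <- bind_rows(lapply(waves, clean_wave))

# Keep only participants who appear in the Wave 1 file
wave1_ids <- long_data$pidp[long_data$wave == 1]
long_data <- filter(long_data, pidp %in% wave1_ids)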

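Use labelled::lookfor() to search variable names and labels for the life satisfaction and interview date variables. For example, life satisfaction is typically stored as *_sclfsato, and the interview date as separate year and month variables such as *_intdaty_dv and *_intdatm_dv (or *_istrtdaty and *_istrtdatm). Names can differ between waves and data releases, so always check the UKHLS documentation.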
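A minimal sketch of that variable search, run on a toy data frame rather than a real wave file. The column names and labels below are illustrative assumptions, not a copy of the UKHLS codebook; with real data you would pass the object returned by read_dta().

```r
library(labelled)

# Toy stand-in for one wave file; names and labels are illustrative only
toy_wave <- data.frame(
  pidp         = c(1, 2, 3),
  a_sclfsato   = c(5, 6, 2),
  a_intdatm_dv = c(3, 7, 11),
  a_age_dv     = c(34, 51, 27)
)
var_label(toy_wave$a_sclfsato)   <- "satisfaction with life overall"
var_label(toy_wave$a_intdatm_dv) <- "interview date: month"
var_label(toy_wave$a_age_dv)     <- "age, derived"

# lookfor() searches both variable names and variable labels for the keyword
hits_satisfaction <- lookfor(toy_wave, "satisfaction")
hits_interview    <- lookfor(toy_wave, "interview")

# The matching column names are in the `variable` column of the result
hits_satisfaction$variable
hits_interview$variable
```

Searching on a word from the label is usually more reliable than guessing the variable stub, because the stub is often an abbreviation.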

Cleaning Steps

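Missing codes: UKHLS stores missing data as negative values (e.g., -9, -8, -1); recode these to NA before deriving anything.
Smoking: Combine smever and smnow: current smoker = (smever == 1 & smnow == 1). The smoking questions may not appear under the same names in every wave, so check coverage first.
Obesity: bmi_dv >= 30.
Degree: hiqual_dv == 1 (degree). Code 2 is "other higher degree" and code 3 is "A-level etc.", so a 1-3 range would count A-level holders as graduates.
Employment: jbhas == 1 | jboff == 1 (did paid work last week, or has a job but was away from it). jboff is only asked of those who did no paid work, so the two must be combined with "or", not "and".
Age and sex: Keep age as numeric; convert sex to a factor with haven::as_factor() and check the level names, since Task 3 filters on them.
SF-12 scores: Already numeric; keep as is once negative codes are set to NA.
Life satisfaction: Recode to numeric (e.g., 1-7 scale).
Date: Combine interview month and year into one month-year variable (e.g., zoo::as.yearmon(year + (month - 1) / 12)).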
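The recodes above can be sketched on a few made-up rows. This is an illustration under stated assumptions, not the assessment solution: the helper neg_to_na() and the columns int_year and int_month are hypothetical names, and the date is built with base R (first day of the interview month) rather than zoo to keep the example dependency-light.

```r
library(dplyr)

# Toy rows standing in for stacked indresp data (all values are made up)
toy <- tibble(
  pidp      = 1:4,
  smever    = c(1, 1, 2, -9),
  smnow     = c(1, 2, -8, -9),
  bmi_dv    = c(31.2, 24.0, -9, 27.5),
  hiqual_dv = c(1, 3, 9, -8),
  jbhas     = c(1, 2, 2, -1),
  jboff     = c(-8, 1, 2, -1),
  int_year  = c(2009, 2009, 2010, 2010),
  int_month = c(3, 11, 1, -9)
)

# Hypothetical helper: treat negative codes as missing
neg_to_na <- function(x) ifelse(x < 0, NA, x)

clean <- toy %>%
  mutate(
    # Never-smokers are not asked smnow, so they are coded FALSE, not NA
    smoking = case_when(
      smever == 2 ~ FALSE,
      smever == 1 & smnow == 1 ~ TRUE,
      smever == 1 & smnow == 2 ~ FALSE,
      TRUE ~ NA
    ),
    obese  = neg_to_na(bmi_dv) >= 30,
    degree = neg_to_na(hiqual_dv) == 1,
    # Employed if worked last week OR has a job they were away from
    employed = case_when(
      jbhas == 1 | jboff == 1 ~ TRUE,
      jbhas == 2 & jboff == 2 ~ FALSE,
      TRUE ~ NA
    ),
    # Month-year stored as the first day of the interview month
    interview_date = as.Date(ISOdate(int_year, neg_to_na(int_month), 1))
  )

select(clean, pidp, smoking, obese, degree, employed, interview_date)
```

Writing the missing-code rule once, as a helper, keeps the derived variables consistent and makes it easy to audit how each NA arose.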

Finally, save with saveRDS(long_data, "ukhls_clean.rds").

Task 2: Creating a Publication-Ready Table 1

Requirements

Load the cleaned dataset and produce a descriptive statistics table by wave. For categorical variables: n (%). For continuous: mean (SD). Use descriptive labels, not R column names. Remove unnecessary variables (e.g., wave identifier).

Using gtsummary for Elegance

The gtsummary package makes this straightforward:

library(gtsummary)
library(dplyr)

data <- readRDS("ukhls_clean.rds")

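tbl1 <- data %>%
  select(-pidp) %>% # drop the person identifier; wave must stay for by = wave
  tbl_summary(
    by = wave, # the by variable forms the columns, so it is not listed as a row
    type = list(lifesat ~ "continuous"), # short numeric scales default to categorical
    statistic = list(all_continuous() ~ "{mean} ({sd})",
                     all_categorical() ~ "{n} ({p}%)"),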
    label = list(age ~ "Age (years)",
                 sex ~ "Sex",
                 smoking ~ "Current Smoker",
                 obese ~ "Obese (BMI ≥ 30)",
                 sf12pcs ~ "SF-12 PCS",
                 sf12mcs ~ "SF-12 MCS",
                 lifesat ~ "Life Satisfaction",
                 degree ~ "Degree or Above",
                 employed ~ "Employed")
  ) %>%
  modify_header(label = "**Variable**") %>%
  modify_caption("Table 1: Descriptive Statistics by Wave, UKHLS") %>%
  bold_labels()

tbl1 %>% as_gt() %>% gt::gtsave("partA_task2_table.html")

tbl_summary() computes the counts and percentages for you and, by default, reports missing values for each variable in a separate "Unknown" row. For a more polished look, you can add themes or footnotes.

Task 3: Automating Cross-Sectional Regressions

The Challenge

Run OLS regressions for every combination of outcome (smoking, SF-12 MCS, SF-12 PCS, obesity, life satisfaction), exposure (employment, degree), wave (1-14), and sex (male, female, both), controlling for age, age-squared, and interview date. For the binary outcomes (smoking, obesity), OLS amounts to a linear probability model. Store coefficients, SE, and CI in one tibble.

Solution: Nested Loops or expand_grid

Use expand_grid() to create all 420 combinations (5 outcomes x 2 exposures x 14 waves x 3 sex groups), then pmap() to run one model per row:

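library(dplyr)
library(tidyr)
library(purrr)
library(broom)

combos <- expand_grid(
  outcome = c("smoking", "sf12mcs", "sf12pcs", "obese", "lifesat"),
  exposure = c("employed", "degree"),
  wave = 1:14,
  sex = c("Male", "Female", "Both")
)

run_reg <- function(outcome, exposure, wave, sex) {
  # !! makes the comparison use the function argument, not the column of the same name
  df <- data %>% filter(wave == !!wave)
  if (sex != "Both") df <- df %>% filter(sex == !!sex)

  # as.numeric() lets the date enter linearly whether it is stored as a Date or a yearmon
  form <- as.formula(paste(outcome, "~", exposure, "+ age + I(age^2) + as.numeric(interview_date)"))
  mod <- lm(form, data = df)

  # Keep only the exposure coefficient. startsWith() also matches terms such as
  # "employedTRUE", which is what lm() reports for a logical or factor exposure.
  # The identifying columns already exist in combos, so they are not added again
  # here (duplicate column names would make unnest() fail).
  tidy(mod, conf.int = TRUE) %>%
    filter(startsWith(term, exposure))
}

results <- combos %>%
  mutate(model = pmap(list(outcome, exposure, wave, sex), run_reg)) %>%
  unnest(model)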

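This returns a tidy tibble with one row per model, holding the estimate, standard error, and confidence interval for the exposure. Some outcomes are not collected in every wave, and lm() stops with an error when the outcome is entirely missing, which would halt the whole pipeline. Check each outcome's wave coverage in the UKHLS documentation, then either wrap the model call in tryCatch() (or purrr::possibly()) or filter out waves where the outcome is not available.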
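One way to keep a regression loop running past a wave with no data is sketched below, using purrr::possibly() as an alternative to a hand-written tryCatch(). The data are simulated and fit_one() is a simplified, hypothetical stand-in for the regression function (it omits the sex split and the interview date control), so treat it as an illustration of the error-handling pattern only.

```r
library(dplyr)
library(purrr)
library(broom)

set.seed(1)
# Simulated stand-in for the cleaned data: lifesat is entirely missing in wave 1
toy <- tibble(
  wave     = rep(1:2, each = 50),
  employed = rep(c(0, 1), times = 50),
  age      = round(runif(100, 20, 60)),
  lifesat  = c(rep(NA_real_, 50), rnorm(50, mean = 5))
)

# Hypothetical, simplified model function
fit_one <- function(outcome, exposure, wave) {
  df   <- toy %>% filter(wave == !!wave)
  form <- reformulate(c(exposure, "age", "I(age^2)"), response = outcome)
  tidy(lm(form, data = df), conf.int = TRUE) %>%
    filter(term == exposure) %>%
    mutate(wave = wave)
}

# possibly() returns NULL instead of stopping when a model cannot be fitted
safe_fit <- possibly(fit_one, otherwise = NULL)

fits <- map(1:2, ~ safe_fit("lifesat", "employed", .x))

# bind_rows() silently drops the NULL left by the failed wave
results_toy <- bind_rows(fits)
results_toy
```

Returning NULL keeps the failed combinations out of the results; if you need a record of which models failed, return a one-row tibble of NA values instead.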

Part B: Advanced Analysis (Task 4 Overview)

While not detailed here, Part B typically involves more complex modeling (e.g., multilevel models, panel data methods) or causal inference. The same principles apply: write clean, modular code, use functions to avoid repetition, and document your steps. Consider using plm or lme4 for panel regressions.

Best Practices for SOCS0055

Comment your code: Use # to explain each block.
Use relative paths: Assume data is in a data/ folder.
Keep Quarto self-contained: Load all packages at the top.
Save intermediate files: Use saveRDS() for the cleaned dataset.
Validate outputs: Check that your Table 1 matches expected values (e.g., smoking prevalence around 15-20%).
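A few of these checks can be automated. The sketch below runs on a made-up data frame; the object names (toy_clean, dupes, no_wave1, prev) are hypothetical, and with real data you would run the same checks on the object loaded from ukhls_clean.rds.

```r
library(dplyr)

# Toy stand-in for the cleaned long dataset (all values are made up)
toy_clean <- tibble(
  pidp    = c(1, 1, 2, 2, 3),
  wave    = c(1, 2, 1, 2, 1),
  smoking = c(TRUE, FALSE, FALSE, FALSE, NA),
  age     = c(34, 35, 51, 52, 27)
)

# 1. One row per participant-wave combination
dupes <- toy_clean %>% count(pidp, wave) %>% filter(n > 1)

# 2. Every participant in the data has a Wave 1 row
no_wave1 <- setdiff(toy_clean$pidp, toy_clean$pidp[toy_clean$wave == 1])

# 3. No negative missing-data codes left in a cleaned variable
neg_age <- sum(toy_clean$age < 0, na.rm = TRUE)

# 4. Prevalence by wave, for comparison against expected values
prev <- toy_clean %>%
  group_by(wave) %>%
  summarise(pct_smoking = 100 * mean(smoking, na.rm = TRUE))

# Stop early if any structural check fails
stopifnot(nrow(dupes) == 0, length(no_wave1) == 0, neg_age == 0)
prev
```

Putting the structural checks in stopifnot() means a Quarto render fails loudly if the cleaned dataset ever breaks these assumptions.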

Conclusion

By mastering these techniques—functional programming for data wrangling, automated table generation, and regression pipelines—you'll not only ace SOCS0055 but also be ready for real-world data science challenges. In 2026, as longitudinal studies power everything from health policy to AI training sets, these skills are more relevant than ever. Good luck!