SOCS0055 Tutorial: Data Wrangling & Regression with UKHLS in R

Introduction to Advanced Computational Techniques with UKHLS

In the SOCS0055 Advanced Computational Techniques for Data Science summative assessment, you are tasked with wrangling the UK Household Longitudinal Study (UKHLS) dataset and performing regression analyses. This tutorial guides you through the key steps: creating a long dataset, cleaning variables, building a descriptive table, and running cross-sectional regressions efficiently. By leveraging R's functional programming and tidyverse tools, you can reduce code duplication and improve reproducibility. These techniques apply whether you are analyzing smoking status, obesity, or life satisfaction. Just as AI models like ChatGPT rely on clean, structured data, your analyses depend on well-wrangled datasets.

Task 1: Building a Long Dataset from UKHLS

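The first task requires creating a 'long' dataset with one row per participant-wave combination. Start by listing all adult interview files (*_indresp.dta) from waves 1 to 14 (file prefixes a to n). Use purrr::map() to read each file (for example with haven::read_dta()). Because every variable carries its wave prefix (a_, b_, and so on), strip the prefix and add a wave column before combining the files with dplyr::bind_rows(); otherwise the same variable ends up in a different column for each wave. Filter to keep only participants who appeared in wave 1 (a_indresp.dta). Then, clean variables: smoking status (binary from *_smever and *_smnow), obesity (BMI ≥ 30 from *_bmi_dv), SF-12 scores, life satisfaction, age, interview date, sex, degree qualification, and employment status. For life satisfaction, use *_lfsato; for interview date, use *_intdoy and *_inty to create month-year. Check each name against the UKHLS data dictionary, since not every variable is collected in every wave. UKHLS stores missing values as negative codes, so recode these to NA before deriving anything. Use ifelse() or case_when() for binary derivations. Save the cleaned dataset with saveRDS(). This approach mirrors how data scientists at companies like Spotify or Netflix prepare user data for personalized recommendations.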
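The reading, stacking, and cleaning steps can be sketched as follows. This is a minimal illustration rather than a model answer: it writes two tiny toy files to a temporary folder so that it runs without the real data, the helper read_wave() is hypothetical, and the coding assumed for the smoking variables (1 = yes, 2 = no, negative = missing) should be checked against the UKHLS documentation.

```r
library(dplyr)
library(purrr)
library(haven)

# Toy stand-ins for two waves of UKHLS files (values are made up).
data_dir <- file.path(tempdir(), "ukhls_toy")
dir.create(data_dir, showWarnings = FALSE)
write_dta(
  tibble(pidp = c(1, 2, 3), a_bmi_dv = c(31.2, 24.0, -9),
         a_smever = c(1, 2, 1), a_smnow = c(1, -8, 2)),
  file.path(data_dir, "a_indresp.dta")
)
write_dta(
  tibble(pidp = c(1, 2, 4), b_bmi_dv = c(29.5, 30.0, 22.1),
         b_smever = c(1, 2, 1), b_smnow = c(2, -8, 1)),
  file.path(data_dir, "b_indresp.dta")
)

# Hypothetical helper: read one file, drop the wave prefix from the
# variable names, and record the wave number (a = 1, b = 2, ...).
read_wave <- function(path) {
  prefix <- sub("_indresp\\.dta$", "", basename(path))
  read_dta(path) %>%
    rename_with(~ sub(paste0("^", prefix, "_"), "", .x)) %>%
    mutate(wave = match(prefix, letters))
}

files <- list.files(data_dir, pattern = "^[a-n]_indresp\\.dta$",
                    full.names = TRUE)
long <- map(files, read_wave) %>% bind_rows()

# Keep only participants observed in wave 1.
wave1_ids <- long %>% filter(wave == 1) %>% pull(pidp)

long <- long %>%
  filter(pidp %in% wave1_ids) %>%
  mutate(
    # Negative values are missing-data codes, so set them to NA first.
    bmi_dv = ifelse(bmi_dv < 0, NA, bmi_dv),
    obese = as.integer(bmi_dv >= 30),
    # Assumed coding: never smoked (smever == 2) counts as non-smoker.
    smoker = case_when(
      smever == 2 ~ 0L,
      smnow == 1 ~ 1L,
      smnow == 2 ~ 0L,
      TRUE ~ NA_integer_
    )
  )

saveRDS(long, file.path(data_dir, "ukhls_long.rds"))
```

With the real data, point data_dir at the folder holding the UKHLS files and extend the mutate() call to the remaining variables.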

Task 2: Creating a Publication-Ready Table 1

Load your cleaned dataset and create a descriptive statistics table by wave. For categorical variables (smoking, obesity, sex, degree, employment), report sample sizes and percentages. For continuous variables (age, SF-12 scores, life satisfaction), report mean and standard deviation. Use gtsummary::tbl_summary() with by = wave to generate the table, and request the mean and standard deviation through its statistic argument, because the default for continuous variables is the median and interquartile range. Rename variables with the label argument of tbl_summary(); modify_header() and modify_footnote() change the column headers and footnotes, not the variable labels. Save as HTML with as_gt() %>% gt::gtsave(), giving a file name that ends in .html. A well-formatted table is crucial for academic publications and business reports alike; think of how financial analysts present quarterly earnings.
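A minimal sketch of such a table is shown below, using simulated data in place of the cleaned UKHLS file. The variable names (smoker, lifesat), the labels, and the output file name are assumptions chosen for illustration.

```r
library(dplyr)
library(gtsummary)

# Simulated stand-in for the cleaned long dataset (values are random).
set.seed(1)
toy <- tibble(
  wave = rep(1:2, each = 50),
  age = round(runif(100, 16, 90)),
  smoker = rbinom(100, 1, 0.2),
  lifesat = sample(1:7, 100, replace = TRUE)
)

table1 <- toy %>%
  tbl_summary(
    by = wave,
    # A 1-7 scale has few distinct values, so force it to be continuous.
    type = list(lifesat ~ "continuous"),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    # Variable labels are set here, not with modify_header().
    label = list(
      age ~ "Age (years)",
      smoker ~ "Current smoker",
      lifesat ~ "Life satisfaction (1-7)"
    ),
    missing = "no"
  ) %>%
  # modify_header() changes the column header text only.
  modify_header(label ~ "**Characteristic**")

out_file <- file.path(tempdir(), "table1.html")
table1 %>% as_gt() %>% gt::gtsave(out_file)
```

Setting missing = "no" hides the rows that count missing values; whether to show them is a reporting choice, not something the task specifies.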

Task 3: Efficient Cross-Sectional Regressions

Run OLS regressions for each combination of outcome (smoking, SF-12 MCS, SF-12 PCS, obesity, life satisfaction), exposure (employment, degree), and sex (male, female, combined), controlling for age, age-squared, and interview date, separately for waves 1-14. For the binary outcomes (smoking, obesity), OLS gives a linear probability model. Use tidyr::expand_grid() to create all combinations, then purrr::pmap() to fit the models; map2() iterates over only two inputs, whereas each model here depends on four (outcome, exposure, sex, and wave). Extract coefficients, standard errors, and confidence intervals with broom::tidy(), setting conf.int = TRUE because the intervals are not returned by default. Store results in a single tibble. This reduces manual coding and errors, similar to how automated trading systems execute thousands of strategies simultaneously.
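One way to organise this is sketched below on simulated data. It uses purrr::pmap() rather than map2(), since every model depends on four inputs (outcome, exposure, sex group, wave). The column names, the helper fit_one(), and the numeric interview date are assumptions for illustration, and only two outcomes and two waves are included to keep the example short.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Simulated stand-in for the cleaned long dataset (values are random).
set.seed(1)
n <- 400
toy <- tibble(
  wave = rep(1:2, each = n / 2),
  sex = sample(c("male", "female"), n, replace = TRUE),
  age = round(runif(n, 16, 90)),
  int_date = sample(1:24, n, replace = TRUE),  # assumed numeric month index
  employed = rbinom(n, 1, 0.6),
  degree = rbinom(n, 1, 0.3),
  lifesat = sample(1:7, n, replace = TRUE),
  smoker = rbinom(n, 1, 0.2)
)

# One row per model specification.
specs <- expand_grid(
  outcome = c("lifesat", "smoker"),
  exposure = c("employed", "degree"),
  sex_group = c("male", "female", "combined"),
  wave_id = 1:2
)

# Hypothetical helper: fit one OLS model and return the exposure row.
fit_one <- function(outcome, exposure, sex_group, wave_id) {
  dat <- filter(toy, wave == wave_id)
  if (sex_group != "combined") dat <- filter(dat, sex == sex_group)
  rhs <- c(exposure, "age", "I(age^2)", "int_date")
  model <- lm(reformulate(rhs, response = outcome), data = dat)
  tidy(model, conf.int = TRUE) %>% filter(term == exposure)
}

# Fit every specification and collect the estimates in a single tibble.
results <- specs %>%
  mutate(fit = pmap(list(outcome, exposure, sex_group, wave_id), fit_one)) %>%
  unnest(fit)
```

Here the exposures are numeric 0/1 indicators, so the coefficient's term name equals the variable name. If an exposure is stored as a factor, the term name includes the level (for example the variable name followed by the level label), and the filter in fit_one() would need adjusting.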

Conclusion

By mastering these techniques, you'll handle large-scale data wrangling and regression with ease. These skills are valued in data science roles at tech giants, research institutions, and beyond. Remember to comment your code and use functions to minimize repetition. Good luck with your SOCS0055 assessment!