Data Import
library(tidyverse)
library(readxl)
library(janitor)
wave_1 <- read_csv("data/hwange_wave1_2019.csv") %>% clean_names()
wave_2 <- read_csv("data/hwange_wave2_2021.csv") %>% clean_names()
wave_3 <- read_csv("data/hwange_wave3_2023.csv") %>% clean_names()
wave_4 <- read_csv("data/hwange_wave4_2024.csv") %>% clean_names()
cat("Wave 1:", nrow(wave_1), "records\n")
cat("Wave 4:", nrow(wave_4), "records\n")
What's happening: Four survey waves (2019–2024) are imported from CSV files. The janitor::clean_names() function standardises column names to snake_case, ensuring consistency across waves collected by different enumerator teams.
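A quick illustration of what clean_names() does to the kind of headers enumerator spreadsheets tend to produce (the column names here are invented for the example):

```r
library(janitor)

messy <- data.frame("Student ID" = 1:2, "Exam.Score" = c(61, 74),
                    check.names = FALSE)
names(clean_names(messy))  # "student_id" "exam_score"
```

Because every wave passes through the same transformation, columns line up for joins regardless of how each team capitalised or punctuated its headers.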
Panel Merging
panel <- wave_1 %>%
  select(student_id, ward, school, gender, age_bl = age,
         attendance_bl = attendance_rate, score_bl = exam_score) %>%
  left_join(
    wave_4 %>% select(student_id, age_el = age,
                      attendance_el = attendance_rate, score_el = exam_score),
    by = "student_id"
  ) %>%
  mutate(
    attrition = is.na(attendance_el),
    attendance_change = attendance_el - attendance_bl,
    score_change = score_el - score_bl
  )
cat("Panel matched:", sum(!panel$attrition), "of", nrow(panel), "students\n")
cat("Attrition rate:", scales::percent(mean(panel$attrition)), "\n")
What's happening: Baseline (Wave 1) and endline (Wave 4) records are linked by student_id to create a longitudinal panel, and change scores are computed for attendance and exam performance. Attrition is flagged where endline observations are missing; the 12% rate is within acceptable limits for a five-year panel.
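The mechanics of the attrition flag can be seen on a toy panel: left_join() keeps every baseline student, so anyone missing at endline surfaces as NA in the endline columns (the data below are made up for illustration):

```r
library(dplyr)

bl <- tibble(student_id = 1:4, attendance_bl = c(80, 85, 90, 75))
el <- tibble(student_id = c(1, 2, 4), attendance_el = c(88, 84, 91))

panel_toy <- left_join(bl, el, by = "student_id") %>%
  mutate(attrition = is.na(attendance_el))
sum(panel_toy$attrition)  # 1 student lost to follow-up
```

An inner join would silently drop attriters instead, which is why the left join plus an explicit flag is the safer pattern for reporting an attrition rate.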
EMIS Data Linkage
emis <- read_csv("data/emis_school_registry.csv") %>% clean_names()
# registry of bicycle recipients (file path assumed; the load step was not
# shown earlier, but bicycle_registry must exist before the mutate below)
bicycle_registry <- read_csv("data/bicycle_registry.csv") %>% clean_names()
panel_enriched <- panel %>%
  left_join(emis %>% select(school, school_type, district,
                            total_enrolled, pupil_teacher_ratio),
            by = "school") %>%
  mutate(
    bicycle_recipient = student_id %in% bicycle_registry$student_id,
    treatment = if_else(bicycle_recipient, 1L, 0L)
  )
What's happening: School-level variables from the EMIS administrative registry (school type, enrolment, pupil-teacher ratio) are merged onto the student panel. Treatment assignment is derived from the bicycle distribution registry, creating the binary indicator used in the DiD model.
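Administrative linkages like this are only as good as the key they join on. A sketch of a coverage check with anti_join() (toy data, invented names) shows how to list students whose school is absent from the registry before they silently pick up NA covariates:

```r
library(dplyr)

students <- tibble(student_id = 1:3, school = c("A", "B", "Z"))
emis_toy <- tibble(school = c("A", "B"), pupil_teacher_ratio = c(38, 42))

# schools with no EMIS match would get NA covariates after a left join,
# so it is worth listing them before modelling
anti_join(students, emis_toy, by = "school")  # 1 row: school "Z"
```

Running the same check on the real panel and emis tables flags name mismatches (spelling variants, merged schools) that are common in school registries.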
Anonymisation
library(digest)
panel_anon <- panel_enriched %>%
  mutate(
    student_id = map_chr(student_id, ~ digest(.x, algo = "sha256")),
    school = paste0("School_", dense_rank(school))
  ) %>%
  select(-contains("name"), -contains("phone"))
k_anon <- panel_anon %>%
  count(ward, gender, school_type) %>%
  filter(n < 5)
cat("Groups below k=5 threshold:", nrow(k_anon), "\n")
What's happening: Student IDs are SHA-256 hashed and school names replaced with pseudonyms. Personal identifiers (name, phone) are dropped. A k-anonymity check verifies that every combination of quasi-identifiers (ward, gender, school type) has at least 5 records, reducing re-identification risk.
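One caveat worth noting: an unsalted hash of a guessable ID can be reversed by hashing every candidate ID and comparing. A common mitigation, sketched below with an assumed environment variable and a placeholder ID format, is to prepend a project-level secret salt that never ships with the released data:

```r
library(digest)

# assumption: the salt lives in an environment variable outside the released
# data; "example-salt" is only a placeholder default for this sketch
salt <- Sys.getenv("PANEL_SALT", "example-salt")
hash_id <- function(id) digest(paste0(salt, id), algo = "sha256")

hash_id("HW-00123") == hash_id("HW-00123")  # TRUE: deterministic, so joins still work
```

Because the salted hash is still deterministic, the pseudonymised IDs remain usable as join keys across data releases, while brute-force re-identification requires knowing the salt.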
Statistical Analysis
Pre-Post Paired t-Test
Comparing baseline and endline attendance within the same students to establish the overall direction of change before estimating causal effects.
# analysis sample: students observed at both baseline and endline
# (assumption: panel_clean drops attriters from the enriched panel)
panel_clean <- panel_enriched %>% filter(!attrition)
t_attendance <- t.test(panel_clean$attendance_el,
                       panel_clean$attendance_bl,
                       paired = TRUE)
cat("Attendance change:", round(t_attendance$estimate, 1), "pp\n")
cat("95% CI:", round(t_attendance$conf.int, 1), "\n")
cat("p-value:", format.pval(t_attendance$p.value), "\n")
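A paired t-test is algebraically a one-sample t-test on the within-student differences, which is why it controls for stable student-level characteristics. A quick simulated check (base R, invented parameters):

```r
set.seed(42)
bl <- rnorm(50, mean = 70, sd = 10)       # simulated baseline attendance
el <- bl + rnorm(50, mean = 5, sd = 4)    # simulated endline attendance

paired  <- t.test(el, bl, paired = TRUE)
one_smp <- t.test(el - bl)                # one-sample test on the differences

all.equal(unname(paired$statistic), unname(one_smp$statistic))  # TRUE
```

This equivalence also clarifies what the test does not do: it establishes the direction of change over time, not a causal effect, which is why the DiD model follows.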
Difference-in-Differences Estimation
The core causal identification strategy: estimating the treatment effect of bicycle distribution while controlling for confounders including gender, ward fixed effects, pupil-teacher ratio, and baseline age.
did_model <- lm(attendance_change ~ treatment + gender + ward +
                  pupil_teacher_ratio + age_bl,
                data = panel_clean)
did_estimate <- broom::tidy(did_model, conf.int = TRUE) %>%
  filter(term == "treatment")
cat("DiD estimate:", round(did_estimate$estimate, 2), "pp\n")
cat("SE:", round(did_estimate$std.error, 2), "\n")
cat("p-value:", format.pval(did_estimate$p.value), "\n")
Result: The DiD model estimates that bicycle recipients experienced an 8.2 percentage-point improvement in attendance relative to the control group (p < 0.01), after adjusting for ward, gender, pupil-teacher ratio, and baseline age.
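Since bicycle distribution varies at the school level, errors are plausibly correlated within schools, and conventional OLS standard errors can overstate precision. A common extension, sketched here on simulated data with the sandwich and lmtest packages (not part of the pipeline above), is to cluster standard errors by school:

```r
library(sandwich)
library(lmtest)

set.seed(1)
schools <- rep(letters[1:10], each = 20)
d <- data.frame(
  school    = schools,
  treatment = as.integer(schools %in% letters[1:5])
)
# outcome with a school-level shock, so errors are correlated within school
d$y <- 2 * d$treatment +
  rnorm(10, sd = 2)[match(d$school, letters[1:10])] + rnorm(200)

m <- lm(y ~ treatment, data = d)
coeftest(m, vcov = vcovCL(m, cluster = ~ school))  # SEs clustered by school
```

Clustered standard errors are typically larger than the default ones here, giving a more conservative p-value for the treatment effect.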
Visualisation
# assumption: the plot needs the data in long format (one row per student per
# wave), which panel_clean does not provide; stack the four waves and attach
# treatment status from the enriched panel
panel_long <- bind_rows(
  wave_1 %>% mutate(wave = 1), wave_2 %>% mutate(wave = 2),
  wave_3 %>% mutate(wave = 3), wave_4 %>% mutate(wave = 4)
) %>%
  left_join(panel_enriched %>% select(student_id, treatment),
            by = "student_id")
ggplot(panel_long, aes(x = factor(wave), y = attendance_rate,
                       colour = factor(treatment))) +
  stat_summary(fun = mean, geom = "line",
               aes(group = treatment), linewidth = 1.2) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  scale_colour_manual(values = c("#5C6B7A", "#E8A020"),
                      labels = c("Control", "Bicycle recipients")) +
  labs(title = "Parallel Trends: Attendance by Treatment Group",
       x = "Survey Wave", y = "Mean Attendance Rate (%)",
       colour = "Group") +
  theme_minimal(base_family = "Inter", base_size = 13)
What's happening: The parallel trends plot is the key diagnostic for DiD validity. Pre-treatment trends should be roughly parallel between groups; post-treatment divergence then reflects the causal effect. The palette matches the portfolio design system: control in muted grey, treatment in amber.
Technical Summary
Pipeline at a Glance