Survey-weighted estimation with MORIE

Overview

Many MORIE analyses use Statistics Canada Public Use Microdata Files (CCS, CSADS, CSUS, CADS, CPADS), whose complex sample designs require survey weights for valid inference. This vignette covers the rmorie sampling and weighting helpers: drawing stratified and cluster samples, computing design weights, summarising weight variability, and bootstrapping a weighted statistic.

Stratified, cluster, and PPS sampling

library(rmorie)
set.seed(11)

n <- 1000
df <- data.frame(
  id     = seq_len(n),
  region = sample(c("ON", "QC", "BC", "AB"), n, replace = TRUE),
  age    = round(rnorm(n, 40, 12)),
  weight = runif(n, 0.5, 2.0)
)

# Stratified random sample within region.
strat <- morie_stratified_sample(df, strata_col = "region", n_per_stratum = 50)

# Single-stage cluster sample on region (all units in sampled
# clusters are kept).
clust <- morie_cluster_sample(df, cluster_col = "region", n_clusters = 2)

dim(strat)
#> [1] 200   5
dim(clust)
#> [1] 505   5
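
To make the stratified draw concrete, here is a minimal base-R sketch of the same idea: split the row indices by stratum, then sample a fixed number of rows without replacement from each. The helper name stratified_sketch() is hypothetical and is not part of rmorie; the package's own implementation may differ (for example, in the extra column it returns).

```r
set.seed(11)
n <- 1000
df <- data.frame(
  id     = seq_len(n),
  region = sample(c("ON", "QC", "BC", "AB"), n, replace = TRUE),
  age    = round(rnorm(n, 40, 12))
)

# Hypothetical helper, not part of rmorie: draw up to `n_per_stratum`
# rows without replacement from each level of `strata_col`.
stratified_sketch <- function(data, strata_col, n_per_stratum) {
  groups <- split(seq_len(nrow(data)), data[[strata_col]])
  picked <- lapply(groups, function(idx) {
    # Sample positions within idx so a one-row stratum is handled correctly.
    idx[sample.int(length(idx), min(n_per_stratum, length(idx)))]
  })
  data[sort(unlist(picked, use.names = FALSE)), , drop = FALSE]
}

strat_sketch <- stratified_sketch(df, "region", 50)
table(strat_sketch$region)
```

Each region contributes the same number of rows regardless of its share of the frame, which is why stratified samples generally need design weights before estimation.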

Calibration weights and design weights

morie_compute_design_weights() produces weights consistent with a known sampling scheme, and morie_calibration_weights() adjusts weights to agree with known population marginals. The example in this section covers design weights only.

# Design weights from a stratified scheme.
# Treat `region` as the stratum; supply known population sizes per stratum.
pop_sizes <- c(ON = 14000000, QC = 8500000, BC = 5200000, AB = 4400000)
design <- morie_compute_design_weights(
  df,
  strata_col       = "region",
  population_sizes = pop_sizes
)
head(design)
#> [1] 34000.00 34000.00 18107.00 55555.56 18107.00 55555.56
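
The printed weights are consistent with the usual stratified design weight, N_h / n_h: the stratum's population size divided by the number of sampled units in that stratum (for example, 8,500,000 / 250 = 34,000). The base-R sketch below computes weights that way. It assumes that this is the formula in use; it is not taken from the rmorie source.

```r
set.seed(11)
region <- sample(c("ON", "QC", "BC", "AB"), 1000, replace = TRUE)
pop_sizes <- c(ON = 14000000, QC = 8500000, BC = 5200000, AB = 4400000)

# Sample count per stratum.
n_h <- table(region)

# Assumed formula: each unit represents N_h / n_h population members.
w <- unname(pop_sizes[region] / as.numeric(n_h[region]))

# Weights within a stratum sum back to that stratum's population size.
tapply(w, region, sum)
```

Summing the weights by stratum is a quick sanity check for any design-weight vector: the totals should reproduce the population sizes that were supplied.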

Effective sample size and design effect

# Effective sample size given a weight vector.
ess <- morie_effective_sample_size(df$weight)
ess
#> [1] 897.9353

# Design effect: the factor by which unequal weights inflate the variance.
deff <- morie_design_effect(df$weight)
deff
#> [1] 1.113666

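The effective sample size answers the practical question “how many equally weighted, independent observations would give the same precision as this weighted sample?” The design effect is the matching variance inflation factor: multiplying a variance computed as if the sample were unweighted by deff approximates the variance under the weighting. In the output above the two are linked by deff = n / ess (1000 / 897.9353 ≈ 1.1137).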
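
The relationship deff = n / ess is what Kish's approximation gives, where ess = (Σw)² / Σw². The base-R sketch below computes both quantities directly. Whether rmorie uses exactly this formula is an assumption, and the weights here are illustrative draws rather than the vignette's own vector.

```r
set.seed(11)
w <- runif(1000, 0.5, 2.0)  # illustrative weights only

# Kish's approximate effective sample size.
kish_ess <- sum(w)^2 / sum(w^2)

# Design effect due to unequal weighting; algebraically equal to
# 1 + CV(w)^2 when the CV uses the population (divide-by-n) variance.
kish_deff <- length(w) / kish_ess

c(ess = kish_ess, deff = kish_deff)
```

Equal weights give ess = n and deff = 1; the more the weights vary, the smaller the effective sample size becomes.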

Bootstrap variance with weights

# Bootstrap a weighted statistic: here, the weighted mean of `age`.
boot <- morie_bootstrap_sample(
  df,
  statistic   = function(x) stats::weighted.mean(x$age, x$weight),
  n_bootstrap = 100
)
str(boot, max.level = 1)
#> List of 5
#>  $ estimate    : num 39.9
#>  $ se          : num 0.425
#>  $ ci_lower    : Named num 39.1
#>   ..- attr(*, "names")= chr "2.5%"
#>  $ ci_upper    : Named num 40.6
#>   ..- attr(*, "names")= chr "97.5%"
#>  $ distribution: num [1:100] 40.3 40 39.1 39.9 39.6 ...
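
For intuition about what the returned list contains, here is a minimal base-R sketch of a row-level bootstrap of the same statistic. It assumes simple resampling of rows with replacement and percentile intervals; that choice is an assumption for illustration, not a description of how morie_bootstrap_sample() is implemented.

```r
set.seed(11)
n <- 1000
df <- data.frame(
  age    = round(rnorm(n, 40, 12)),
  weight = runif(n, 0.5, 2.0)
)

stat <- function(x) stats::weighted.mean(x$age, x$weight)

# Resample rows with replacement and recompute the statistic each time.
reps <- replicate(100, stat(df[sample.int(n, n, replace = TRUE), ]))

boot_sketch <- list(
  estimate     = stat(df),
  se           = stats::sd(reps),
  ci_lower     = stats::quantile(reps, 0.025),
  ci_upper     = stats::quantile(reps, 0.975),
  distribution = reps
)
str(boot_sketch, max.level = 1)
```

A plain row bootstrap like this ignores strata and clusters, so for a real complex-sample design it would typically understate or misstate the variance; resampling within strata or using replicate weights is the usual remedy.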

Where to go next

  • For complex-survey causal inference (survey-weighted ATE/AIPW), see the causal-inference vignette; its estimate_* functions accept a weights argument.
  • For Statistics Canada PUMF acknowledgments and citation requirements, see the package README.