| Title: | Bundled Datasets for the rmorie Package |
|---|---|
| Description: | Bundled open data fixtures used by the 'rmorie' package for examples, vignettes, and tests. Split out so 'rmorie' itself stays within CRAN's package-size soft cap. Contains snapshots of publicly-available datasets from CKAN/Socrata/Opendatasoft portals (Chicago, NYC, Toronto, Vancouver, etc.), Statistics Canada CCJS tables, and synthetic fixtures for unit tests. Also ships a small set of analyst-facing helpers for releasing aggregate statistics without re-identification risk: Laplace and Gaussian differential privacy mechanisms and k-anonymity, l-diversity, and cell suppression verifiers. |
| Authors: | Vansh Singh Ruhela [aut, cre] (ORCID: <https://orcid.org/0009-0004-1750-3592>) |
| Maintainer: | Vansh Singh Ruhela <[email protected]> |
| License: | AGPL (>= 3) |
| Version: | 0.1.1 |
| Built: | 2026-06-23 11:06:02 UTC |
| Source: | https://github.com/rootcoder007/rmoriedata |
Standard StatCan / open-data complementary-suppression: identifies
counts below threshold, suppresses them by setting to NA,
and (if return_complementary = TRUE) also suppresses the smallest
other count in each affected row and column so the suppressed value
can't be reconstructed from marginal sums.
morie_cell_suppress(tbl, threshold = 5, return_complementary = TRUE)
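| tbl | A numeric matrix or 2-D table of counts. Will be coerced to matrix; row/column names are preserved. |
|---|---|
| threshold | Minimum count to remain unsuppressed. Default 5. |
| return_complementary | Logical; if TRUE (the default), complementary suppressions are applied in addition to the primary ones. |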
Only finite numeric cells are eligible for suppression. NA cells
in the input pass through unchanged.
A list with class "morie_cell_suppress":
suppressed: numeric matrix, suppressed cells set to NA.
primary_mask: logical matrix, TRUE for primary suppressions.
complementary_mask: logical matrix, TRUE for complementary suppressions (all FALSE when return_complementary = FALSE).
n_primary: integer.
n_complementary: integer.
threshold: the threshold used.
tbl <- matrix(c(120, 3, 47, 88, 2, 99, 14, 51, 60), nrow = 3,
              dimnames = list(c("A", "B", "C"), c("X", "Y", "Z")))
morie_cell_suppress(tbl, threshold = 5)
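To illustrate why complementary suppression matters, the following base-R sketch applies primary suppression only and then recovers a hidden cell from a published column total. It does not call morie_cell_suppress; the variable names are illustrative, and treating every finite count strictly below the threshold as a primary suppression is an assumption about the rule.

```r
# Same table as the example above (filled column-wise).
tbl <- matrix(c(120, 3, 47, 88, 2, 99, 14, 51, 60), nrow = 3,
              dimnames = list(c("A", "B", "C"), c("X", "Y", "Z")))
threshold <- 5

# Primary suppression only: finite counts below the threshold become NA.
# (Assumption: "below" means strictly less than the threshold.)
primary_mask <- is.finite(tbl) & tbl < threshold
suppressed <- tbl
suppressed[primary_mask] <- NA

# If column totals are published alongside the table, a column with a
# single suppressed cell leaks that cell by subtraction.
col_totals <- colSums(tbl)
recovered_BX <- col_totals[["X"]] - sum(suppressed[, "X"], na.rm = TRUE)
recovered_BX  # 3, the value that was supposed to be hidden
```

Row B contains two suppressed cells, so its row total alone does not reveal either one; column X contains only one, which is why a second cell in that column would also need to be withheld.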
Releases an approximately (ε, δ)-DP mean of a
bounded numeric vector. Sensitivity is derived from the user-asserted
bounds: changing one record can shift the sum by at most
upper - lower, so the mean's sensitivity is
(upper - lower) / length(x).
morie_dp_gaussian_mean(x, lower, upper, epsilon, delta = 1e-6)
| x | Numeric vector (no NAs). |
|---|---|
| lower, upper | Hard bounds on x. |
| epsilon, delta | Privacy parameters. Standard recommendation: |
The noise standard deviation follows the classical analytic-Gaussian calibration:
A noised mean (single numeric).
set.seed(1)
x <- runif(1000, 0, 1)
morie_dp_gaussian_mean(x, lower = 0, upper = 1, epsilon = 1.0)
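As a rough illustration of how the sensitivity and the privacy parameters combine, here is a minimal base-R sketch. It assumes the textbook Gaussian-mechanism calibration sigma = sensitivity * sqrt(2 * log(1.25 / delta)) / epsilon, which may differ from what the package actually computes. The function names are hypothetical, and clamping x to the bounds is likewise an assumption, not documented behaviour.

```r
# Hypothetical helper: noise scale under the textbook Gaussian-mechanism
# calibration (an assumption, not necessarily the package's formula).
gaussian_sigma_sketch <- function(n, lower, upper, epsilon, delta = 1e-6) {
  stopifnot(n > 0, upper > lower, epsilon > 0, delta > 0, delta < 1)
  sensitivity <- (upper - lower) / n   # one record moves the mean by at most this
  sensitivity * sqrt(2 * log(1.25 / delta)) / epsilon
}

# Hypothetical helper: noised mean of a bounded vector.
dp_gaussian_mean_sketch <- function(x, lower, upper, epsilon, delta = 1e-6) {
  stopifnot(is.numeric(x), !anyNA(x))
  x <- pmin(pmax(x, lower), upper)     # assumption: enforce the asserted bounds
  sigma <- gaussian_sigma_sketch(length(x), lower, upper, epsilon, delta)
  mean(x) + rnorm(1, mean = 0, sd = sigma)
}

set.seed(1)
x <- runif(1000, 0, 1)
sigma <- gaussian_sigma_sketch(length(x), 0, 1, epsilon = 1.0)
noised_mean <- dp_gaussian_mean_sketch(x, lower = 0, upper = 1, epsilon = 1.0)
```

Under this calibration the noise shrinks in proportion to 1 / length(x), so with 1000 records in [0, 1] the standard deviation is on the order of 0.005.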
Adds Laplace noise calibrated to sensitivity / epsilon. Use when releasing counts of records matching some predicate (e.g. number of UoF incidents in a division-year). Sensitivity is hardcoded to 1: one record entering or leaving the dataset changes the count by at most 1.
morie_dp_laplace_count(true_count, epsilon)
| true_count | Non-negative integer; the true count. |
|---|---|
| epsilon | Privacy budget (smaller = more noise = stronger privacy). Typical range: 0.1 to 5.0. |
Pure (ε, 0)-differentially-private under the standard
add-or-remove-one neighbouring-databases definition.
A noised count (numeric, may be fractional or negative). Caller
should usually clip to a non-negative integer for display:
round(pmax(0, x)).
set.seed(1)
morie_dp_laplace_count(true_count = 42, epsilon = 1.0)
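The mechanism described above can be sketched in base R, which has no built-in Laplace sampler. The sketch below draws Laplace noise as the difference of two exponential variates; rlaplace_sketch and dp_count_sketch are hypothetical names, and this is one possible way to sample the noise, not necessarily how the package does it.

```r
# Hypothetical helper: Laplace(0, scale) draws as the difference of two
# independent exponentials with rate 1 / scale.
rlaplace_sketch <- function(n, scale) {
  stopifnot(scale > 0)
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

# Hypothetical helper: noised count with sensitivity fixed at 1.
dp_count_sketch <- function(true_count, epsilon) {
  stopifnot(true_count >= 0, epsilon > 0)
  sensitivity <- 1
  true_count + rlaplace_sketch(1, scale = sensitivity / epsilon)
}

set.seed(1)
noised <- dp_count_sketch(true_count = 42, epsilon = 1.0)

# Post-processing for display; this does not consume extra privacy budget.
display <- round(pmax(0, noised))
```

Because clipping and rounding operate only on the already-noised value, they can be applied freely before publication.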
Adds independent Laplace(1/epsilon) noise to each bin count.
Under the add-or-remove-one neighbouring-databases definition a single
record participates in exactly one bin, so the per-bin sensitivity is
1 and the overall mechanism is (ε, 0)-DP.
morie_dp_laplace_histogram(counts, epsilon)
| counts | Integer vector of non-negative bin counts. |
|---|---|
| epsilon | Privacy budget (positive scalar). |
A numeric vector of the same length as counts. May contain
fractional or negative values. Caller is responsible for any post-hoc
non-negativity / rounding before display.
set.seed(1)
morie_dp_laplace_histogram(c(120, 45, 8, 230, 17), epsilon = 0.5)
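A minimal base-R sketch of the same idea follows: every bin receives its own Laplace draw at scale 1 / epsilon, and the whole histogram is charged a single epsilon because each record falls in exactly one bin. The function name dp_histogram_sketch is hypothetical, and the clip-and-round step is just one display convention a caller might choose.

```r
# Hypothetical helper: independent Laplace(1 / epsilon) noise per bin,
# sampled as a difference of two exponentials with rate epsilon.
dp_histogram_sketch <- function(counts, epsilon) {
  stopifnot(is.numeric(counts), all(counts >= 0), epsilon > 0)
  n_bins <- length(counts)
  noise <- rexp(n_bins, rate = epsilon) - rexp(n_bins, rate = epsilon)
  counts + noise
}

set.seed(1)
counts <- c(120, 45, 8, 230, 17)
noised <- dp_histogram_sketch(counts, epsilon = 0.5)

# Caller-side post-processing before display.
display <- as.integer(round(pmax(0, noised)))
```

At epsilon = 0.5 the noise scale is 2, so the expected absolute error per bin is 2 counts; that is small next to a bin of 230 but substantial next to a bin of 8.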
Checks whether a data.frame satisfies k-anonymity over the supplied
quasi-identifier columns. A dataset is k-anonymous if every combination
of quasi-identifier values appears in at least k rows.
morie_k_anonymity_verify(data, quasi_identifiers, k = 5)
| data | data.frame. |
|---|---|
| quasi_identifiers | Character vector of column names. |
| k | Minimum equivalence-class size. Default 5 (a common public-health / open-data threshold). |
A list with class "morie_k_anon" containing:
satisfies: logical, whether the dataset is k-anonymous.
k: the threshold used.
min_class_size: integer, size of the smallest class.
n_classes: integer, total number of equivalence classes.
n_violations: integer, number of classes below the threshold.
violating_classes: data.frame of class keys plus their .n sizes (empty data.frame when none).
summary: human-readable one-line summary.
df <- data.frame(
  age = c(25, 25, 25, 32, 32, 40),
  sex = c("F", "F", "F", "M", "M", "M")
)
morie_k_anonymity_verify(df, c("age", "sex"), k = 2)
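The check itself amounts to counting rows per quasi-identifier combination and comparing the smallest count with k. The base-R sketch below shows that logic with aggregate(); k_anon_sketch is a hypothetical helper that mirrors only some of the documented return fields and is not the package's implementation. Note that aggregate() drops rows whose keys contain NA, which a real verifier may handle differently.

```r
# Hypothetical helper: count rows per quasi-identifier combination.
k_anon_sketch <- function(data, quasi_identifiers, k = 5) {
  stopifnot(is.data.frame(data), all(quasi_identifiers %in% names(data)))
  keys <- data[, quasi_identifiers, drop = FALSE]
  # One output row per equivalence class, with its size in column .n.
  classes <- aggregate(data.frame(.n = rep(1L, nrow(keys))),
                       by = keys, FUN = sum)
  violating <- classes[classes$.n < k, , drop = FALSE]
  list(
    satisfies = nrow(violating) == 0L,
    k = k,
    min_class_size = min(classes$.n),
    n_classes = nrow(classes),
    n_violations = nrow(violating),
    violating_classes = violating
  )
}

df <- data.frame(
  age = c(25, 25, 25, 32, 32, 40),
  sex = c("F", "F", "F", "M", "M", "M")
)
res <- k_anon_sketch(df, c("age", "sex"), k = 2)
res$satisfies  # FALSE: the (40, "M") class has a single row
```

In this data the classes (25, F), (32, M), and (40, M) have sizes 3, 2, and 1, so only the last one falls below k = 2.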
Checks whether a data.frame satisfies l-diversity: within each
equivalence class defined by the quasi-identifiers, the sensitive
attribute must take at least l distinct values.
morie_l_diversity_verify(data, quasi_identifiers, sensitive, l = 3)
| data | data.frame. |
|---|---|
| quasi_identifiers | Character vector of QI column names. |
| sensitive | Name of the sensitive-attribute column. |
| l | Minimum number of distinct sensitive values per class. Default 3. |
A list with class "morie_l_div" containing:
satisfies: logical.
l: the threshold used.
min_diversity: integer, lowest per-class distinct count.
n_classes: integer.
n_violations: integer, classes below the threshold.
violating_classes: data.frame of class keys plus their .diversity count.
summary: human-readable.
df <- data.frame(
  age = c(25, 25, 25, 25, 32, 32, 32),
  sex = c("F", "F", "F", "F", "M", "M", "M"),
  dx  = c("A", "B", "C", "A", "X", "Y", "Z")
)
morie_l_diversity_verify(df, c("age", "sex"), "dx", l = 3)
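The definition above (at least l distinct sensitive values per class, sometimes called distinct l-diversity) can be sketched in base R by counting unique sensitive values within each quasi-identifier group. l_div_sketch is a hypothetical helper written for illustration; it returns only a subset of the documented fields and is not the package's implementation.

```r
# Hypothetical helper: distinct sensitive values per equivalence class.
l_div_sketch <- function(data, quasi_identifiers, sensitive, l = 3) {
  stopifnot(is.data.frame(data),
            all(c(quasi_identifiers, sensitive) %in% names(data)))
  keys <- data[, quasi_identifiers, drop = FALSE]
  # One output row per class, with its distinct count in .diversity.
  classes <- aggregate(data.frame(.diversity = data[[sensitive]]),
                       by = keys,
                       FUN = function(v) length(unique(v)))
  violating <- classes[classes$.diversity < l, , drop = FALSE]
  list(
    satisfies = nrow(violating) == 0L,
    l = l,
    min_diversity = min(classes$.diversity),
    n_classes = nrow(classes),
    n_violations = nrow(violating),
    violating_classes = violating
  )
}

df <- data.frame(
  age = c(25, 25, 25, 25, 32, 32, 32),
  sex = c("F", "F", "F", "F", "M", "M", "M"),
  dx  = c("A", "B", "C", "A", "X", "Y", "Z")
)
res <- l_div_sketch(df, c("age", "sex"), "dx", l = 3)
res$satisfies  # TRUE: both classes carry three distinct dx values
```

The (25, F) class has four rows but only three distinct diagnoses, which shows that class size and diversity are separate properties: a dataset can pass a k-anonymity check and still fail this one.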