| Title: | Reproducible Data Bundles with Provenance and Fallback |
|---|---|
| Description: | Tools for building brick-proof, reproducible data bundles. Resolves open-data sources through CKAN package_show and package_search endpoints, records and verifies provenance with SHA256 digests and Wayback Machine snapshots, validates downloaded data against a pinned schema, and falls back to schema-driven synthetic data when the real source is unreachable. Run records are captured in a manifest plus a plain-language summary so any result can be traced back to its inputs. |
| Authors: | Vansh Singh Ruhela [aut, cre] (ORCID: <https://orcid.org/0009-0004-1750-3592>) |
| Maintainer: | Vansh Singh Ruhela <[email protected]> |
| License: | AGPL-3 |
| Version: | 0.1.0 |
| Built: | 2026-06-24 18:34:14 UTC |
| Source: | https://github.com/rootcoder007/morie-bricklayer |
Runs validate_schema() and acts on the result: fatal issues raise an
error via stop(), warning-severity issues emit a warning().
apply_schema_validation(df_raw, provenance)
| Argument | Description |
|---|---|
| df_raw | The data frame to validate. |
| provenance | A provenance list as returned by |
Invisibly, TRUE if no issues were found and FALSE
otherwise. Errors if any fatal issue is present.
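The behaviour described above can be sketched in base R. This is a minimal illustration, not the package's source: the helper name `act_on_issues` is hypothetical, and it assumes each issue is a list with `severity` and `message` fields, as the `validate_schema()` return value is described.

```r
# Hypothetical helper illustrating the stop()/warning() pattern; not package code.
act_on_issues <- function(issues) {
  severities <- vapply(issues, function(i) i$severity, character(1))
  messages   <- vapply(issues, function(i) i$message, character(1))

  # Any fatal issue aborts the run.
  fatal <- messages[severities == "fatal"]
  if (length(fatal) > 0) {
    stop("Schema validation failed: ", paste(fatal, collapse = "; "), call. = FALSE)
  }

  # Warning-severity issues are surfaced but do not abort.
  for (msg in messages[severities == "warning"]) {
    warning(msg, call. = FALSE)
  }

  # TRUE only when the issue list is empty.
  invisible(length(issues) == 0)
}
```

Calling `act_on_issues(list())` returns `TRUE` invisibly, while a list containing only warning-severity issues emits the warnings and returns `FALSE`.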
Thin wrapper around utils::download.file() that returns the target
path invisibly so it composes in pipelines.
download_data(url, target_path, mode = "wb", quiet = FALSE)
| Argument | Description |
|---|---|
| url | URL to download. |
| target_path | Destination path on disk. |
| mode | Write mode passed to |
| quiet | Logical; suppress progress output. Defaults to |
The target_path, returned invisibly.
Wraps utils::download.file() and, on failure, prints plain-language
guidance for the most common academic and corporate network problems
(rate limiting, TLS-inspection VPNs, DNS failures, timeouts, HTTP 403).
Optionally retries from a Wayback Machine snapshot URL.
friendly_download(url, target_path, attempt_wayback = NULL)
| Argument | Description |
|---|---|
| url | URL to download. |
| target_path | Destination path on disk. |
| attempt_wayback | Optional Wayback Machine snapshot URL tried as a fallback if the primary download fails. |
TRUE if either the primary download or the Wayback fallback
succeeds, otherwise FALSE.
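The primary-then-fallback flow can be sketched as follows. This is an illustrative outline only: `try_download` and `download_with_fallback` are hypothetical names, and the sketch omits the plain-language network guidance that the real function prints.

```r
# Hypothetical sketch of a download with a snapshot fallback; not package code.
try_download <- function(url, target_path) {
  tryCatch(
    {
      # download.file() returns 0 on success.
      status <- utils::download.file(url, target_path, mode = "wb", quiet = TRUE)
      status == 0
    },
    warning = function(w) FALSE,
    error = function(e) FALSE
  )
}

download_with_fallback <- function(url, target_path, fallback_url = NULL) {
  if (try_download(url, target_path)) return(TRUE)
  message("Primary download failed: ", url)

  if (!is.null(fallback_url)) {
    message("Trying fallback snapshot: ", fallback_url)
    return(try_download(fallback_url, target_path))
  }
  FALSE
}
```

Treating both warnings and errors as failure is a deliberately conservative choice for this sketch, since `download.file()` can signal either depending on the method and the failure mode.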
Reads a data_provenance.json file describing a project's pinned data
source: the CKAN endpoint, resource name pattern, expected SHA256,
Wayback snapshot, schema, and synthetic-data recipe.
load_provenance(path)
| Argument | Description |
|---|---|
| path | Path to the provenance JSON file. |
The parsed provenance as a nested list (via
jsonlite::fromJSON() with simplifyVector = FALSE), or NULL if
the file does not exist.
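A minimal sketch of this read-or-NULL behaviour, assuming the jsonlite package is installed. The function name `read_provenance_sketch` and the JSON field `expected_sha256` are hypothetical; only `resource$search_query` is named elsewhere in this manual.

```r
library(jsonlite)

# Hypothetical sketch: parse a provenance file into a nested list, or NULL if absent.
read_provenance_sketch <- function(path) {
  if (!file.exists(path)) return(NULL)
  jsonlite::fromJSON(path, simplifyVector = FALSE)
}

# Illustrative file with made-up contents.
prov_path <- tempfile(fileext = ".json")
writeLines(
  '{"resource": {"search_query": "example query"}, "expected_sha256": "abc123"}',
  prov_path
)

prov <- read_provenance_sketch(prov_path)
prov$resource$search_query
```

With `simplifyVector = FALSE`, JSON arrays stay as R lists rather than being collapsed into vectors or data frames, which keeps the structure predictable when fields are optional.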
Creates an empty manifest object that accumulates cross-check entries
via record() and is later serialized with write_manifest_json().
make_manifest(meta)
| Argument | Description |
|---|---|
| meta | A named list of run metadata (e.g. |
A manifest list with elements meta and an empty results
list.
Builds a single synthetic data column according to a column spec drawn
from a provenance synthetic recipe. Supported types are "sample"
(categorical, optionally weighted), "bernoulli" (two-label draw with
optional per-row base rate), "poisson" (counts with a floor),
"id_pattern" (templated IDs, optionally per-year sequenced), and
"sequence" (a running integer sequence).
make_synthetic_column(spec, n, ctx = list(), base_p = NULL)
| Argument | Description |
|---|---|
| spec | A list describing the column; recognised fields depend on |
| n | Number of values to generate. |
| ctx | Named list of already-generated columns, letting later columns (such as |
| base_p | Optional numeric vector of per-row latent propensities used by the |
A vector of length n for the requested column type. Errors on
an unknown type.
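To make the type dispatch concrete, here is a rough sketch covering three of the listed types. The spec field names used here (`type`, `values`, `weights`, `labels`, `p`, `lambda`, `min`) are assumptions chosen for illustration and may not match the package's actual recipe format; `synth_column_sketch` is a hypothetical name.

```r
# Hypothetical sketch of type-dispatched column generation; field names are assumed.
synth_column_sketch <- function(spec, n, base_p = NULL) {
  switch(
    spec$type,
    sample = {
      # Categorical draw; weights are optional (NULL means uniform).
      sample(unlist(spec$values), n, replace = TRUE, prob = unlist(spec$weights))
    },
    bernoulli = {
      # Two-label draw; a per-row propensity overrides the fixed rate spec$p.
      p <- if (is.null(base_p)) rep(spec$p, n) else base_p
      ifelse(stats::runif(n) < p, spec$labels[[1]], spec$labels[[2]])
    },
    poisson = {
      # Counts with a floor (0 if none is given).
      floor_value <- if (is.null(spec$min)) 0 else spec$min
      pmax(stats::rpois(n, spec$lambda), floor_value)
    },
    stop("Unknown column type: ", spec$type)
  )
}

set.seed(42)
synth_column_sketch(list(type = "sample", values = c("urban", "rural"),
                         weights = c(0.8, 0.2)), 5)
```

The `"id_pattern"` and `"sequence"` types are left out because they depend on context columns whose format is not specified in this manual.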
Generates a reproducible synthetic data set from the
schema$synthetic_recipe block of a provenance object and writes it to
a CSV. Columns are produced in declaration order so later columns can
reference earlier ones, a shared per-row latent propensity drives any
Bernoulli columns, and an optional row-replication block expands
per-person rows.
make_synthetic_csv(schema, out_path, n_rows = NULL, seed = NULL)
| Argument | Description |
|---|---|
| schema | The synthetic recipe (a list with |
| out_path | Path where the CSV is written. |
| n_rows | Number of rows (persons, if replicating) to generate. Defaults to |
| seed | Random seed for reproducibility. Defaults to |
Invisibly, a list with path, rows (rows written), and
seed used.
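The two properties described above, seeded reproducibility and declaration-order generation, can be illustrated with a small sketch. It is not the package's implementation: `write_synthetic_sketch` is a hypothetical name, and columns are given here as plain generator functions rather than recipe specs to keep the example short.

```r
# Hypothetical sketch: seeded, declaration-order column generation written to CSV.
write_synthetic_sketch <- function(columns, out_path, n_rows, seed) {
  set.seed(seed)
  generated <- list()
  for (name in names(columns)) {
    # Each generator sees the columns produced before it.
    generated[[name]] <- columns[[name]](n_rows, generated)
  }
  df <- as.data.frame(generated, stringsAsFactors = FALSE)
  utils::write.csv(df, out_path, row.names = FALSE)
  invisible(list(path = out_path, rows = nrow(df), seed = seed))
}

cols <- list(
  year = function(n, ctx) sample(2019:2021, n, replace = TRUE),
  # The ID column references the earlier 'year' column through ctx.
  id   = function(n, ctx) sprintf("%d-%04d", ctx$year, seq_len(n))
)

out_a <- tempfile(fileext = ".csv")
out_b <- tempfile(fileext = ".csv")
info  <- write_synthetic_sketch(cols, out_a, n_rows = 10, seed = 123)
write_synthetic_sketch(cols, out_b, n_rows = 10, seed = 123)
```

Because the seed is set once at the start, two runs with the same seed and the same column order write identical files.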
Appends one named cross-check entry to a manifest, classifying it as
PASS, DIFFER, or INFO, printing a formatted line to the console,
and returning the updated manifest.
record(manifest, name, observed, expected, tol = 1e-04, group = "general", synthetic = FALSE)
| Argument | Description |
|---|---|
| manifest | A manifest as returned by |
| name | Unique name for this cross-check; used as the result key. |
| observed | The observed value (numeric or otherwise). |
| expected | The expected value to compare against. |
| tol | Numeric tolerance; a numeric pair within |
| group | Optional grouping label for the entry. Defaults to |
| synthetic | Logical; if |
The updated manifest, returned so calls can be chained.
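The accumulate-and-return pattern can be sketched as below. The classification rules here are assumptions, since this manual does not spell them out: numeric pairs within `tol` are treated as PASS, non-numeric values are compared with `identical()`, and a missing expected value is treated as INFO. The names `new_manifest` and `add_check` are hypothetical.

```r
# Hypothetical sketch of a manifest that accumulates cross-check entries.
new_manifest <- function(meta) list(meta = meta, results = list())

add_check <- function(manifest, name, observed, expected = NULL, tol = 1e-4) {
  status <- if (is.null(expected)) {
    "INFO"    # assumption: nothing to compare against
  } else if (is.numeric(observed) && is.numeric(expected)) {
    if (isTRUE(all(abs(observed - expected) <= tol))) "PASS" else "DIFFER"
  } else {
    if (identical(observed, expected)) "PASS" else "DIFFER"
  }

  manifest$results[[name]] <- list(
    observed = observed, expected = expected, status = status
  )
  cat(sprintf("[%-6s] %s\n", status, name))
  manifest  # returned so calls can be chained by reassignment
}

m <- new_manifest(list(project = "demo"))
m <- add_check(m, "mean_age", 41.20004, 41.2)
m <- add_check(m, "n_rows", 100, 120)
m <- add_check(m, "source_note", "downloaded from mirror")
```

Returning the updated list, rather than modifying state in place, fits R's copy-on-modify semantics and keeps each step explicit.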
Queries the CKAN package_show endpoint recorded in a provenance
object and returns the URL of the first resource whose name matches the
provenance's name-match pattern. CKAN powers data.ontario.ca,
data.gov.uk, data.gov, and most government open-data portals, so this
recovers the current download URL even if the underlying resource UUID
has been replaced.
resolve_via_ckan(provenance)
| Argument | Description |
|---|---|
| provenance | A provenance list as returned by |
The matched resource URL as a character string, or NULL if
the endpoint is missing, the request fails, CKAN reports failure, or
no resource name matches.
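A CKAN `package_show` response is a JSON object with a `success` flag and a `result` whose `resources` array holds entries with `name` and `url` fields. The matching step can be sketched on an already-parsed response, so no network access is needed. The function name `first_matching_resource` and the example URLs are hypothetical.

```r
# Hypothetical sketch: find the first resource whose name matches a pattern.
first_matching_resource <- function(response, pattern) {
  if (!isTRUE(response$success)) return(NULL)
  for (res in response$result$resources) {
    if (!is.null(res$name) && grepl(pattern, res$name)) {
      return(res$url)
    }
  }
  NULL
}

# Hand-built stand-in for a parsed package_show response.
response <- list(
  success = TRUE,
  result = list(resources = list(
    list(name = "Data dictionary", url = "https://data.example.org/dict.pdf"),
    list(name = "Collisions 2021", url = "https://data.example.org/new-uuid/collisions.csv")
  ))
)

first_matching_resource(response, "^Collisions")
```

Matching on the resource name rather than its UUID is what lets the lookup survive a portal replacing the underlying file.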
Fallback for resolve_via_ckan() when the dataset slug has changed.
Derives the CKAN portal base URL from the provenance's package_show
endpoint, runs a package_search query (from resource$search_query,
or derived from the name-match pattern), and returns the URL of the
first matching resource, preferring CSV format when specified.
resolve_via_ckan_search(provenance)
| Argument | Description |
|---|---|
| provenance | A provenance list as returned by |
The matched resource URL as a character string, or NULL if no
query or base URL can be derived, the request fails, or nothing
matches.
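The two steps described above, deriving the portal base URL and preferring a CSV resource, might look roughly like this. The sketch assumes the endpoint follows the common `/api/3/action/package_show` path; `ckan_search_url` and `pick_resource` are hypothetical names, and the real function may derive its query differently.

```r
# Hypothetical sketch: build a package_search URL from a package_show endpoint.
ckan_search_url <- function(package_show_endpoint, query, rows = 5) {
  base <- sub("/api/3/action/package_show.*$", "", package_show_endpoint)
  # If nothing was stripped, the endpoint is not in the assumed form.
  if (identical(base, package_show_endpoint)) return(NULL)
  paste0(base, "/api/3/action/package_search?q=",
         utils::URLencode(query, reserved = TRUE), "&rows=", rows)
}

# Hypothetical sketch: first name match, preferring a given format when present.
pick_resource <- function(resources, pattern, prefer_format = "CSV") {
  hits <- Filter(function(r) isTRUE(grepl(pattern, r$name)), resources)
  if (length(hits) == 0) return(NULL)
  preferred <- Filter(function(r) identical(toupper(r$format), prefer_format), hits)
  if (length(preferred) > 0) preferred[[1]]$url else hits[[1]]$url
}

search_url <- ckan_search_url(
  "https://data.example.org/api/3/action/package_show?id=some-dataset",
  "road collisions"
)
```

Here `search_url` is `"https://data.example.org/api/3/action/package_search?q=road%20collisions&rows=5"`; the `q` and `rows` parameters are part of the CKAN Action API.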
Returns the SHA256 digest of a file as a lowercase hex string, using the digest package. Used to record and verify data provenance.
sha256_file(path)
| Argument | Description |
|---|---|
| path | Path to the file to hash. |
The SHA256 digest as a character string.
Checks a raw data frame against the schema block of a provenance
object: required columns, row-count bounds, and allowed categorical
value sets. Returns the issues found rather than raising, so the caller
decides how to react.
validate_schema(df_raw, provenance)
| Argument | Description |
|---|---|
| df_raw | The data frame to validate. |
| provenance | A provenance list as returned by |
A named list of issues; each issue is a list with severity
("fatal" or "warning") and a human-readable message. A
zero-length list means the data frame is clean.
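A rough sketch of the three checks follows, returning issues instead of raising. The schema field names (`required_columns`, `min_rows`, `allowed_values`) and the severity assigned to each check are assumptions for illustration; the package's schema block may be organised differently. `check_schema_sketch` is a hypothetical name.

```r
# Hypothetical sketch of schema checks that collect issues rather than raise.
check_schema_sketch <- function(df, schema) {
  issues <- list()

  # Required columns (assumed fatal when missing).
  missing_cols <- setdiff(schema$required_columns, names(df))
  if (length(missing_cols) > 0) {
    issues$missing_columns <- list(
      severity = "fatal",
      message = paste("Missing required columns:", paste(missing_cols, collapse = ", "))
    )
  }

  # Row-count lower bound (assumed warning).
  if (!is.null(schema$min_rows) && nrow(df) < schema$min_rows) {
    issues$too_few_rows <- list(
      severity = "warning",
      message = sprintf("Only %d rows; expected at least %d.", nrow(df), schema$min_rows)
    )
  }

  # Allowed categorical values (assumed warning), ignoring missing values.
  for (col in intersect(names(schema$allowed_values), names(df))) {
    bad <- setdiff(unique(stats::na.omit(df[[col]])), schema$allowed_values[[col]])
    if (length(bad) > 0) {
      issues[[paste0("values_", col)]] <- list(
        severity = "warning",
        message = sprintf("Unexpected values in '%s': %s", col, paste(bad, collapse = ", "))
      )
    }
  }
  issues
}

schema <- list(
  required_columns = c("id", "region"),
  min_rows = 2L,
  allowed_values = list(region = c("north", "south"))
)
df <- data.frame(id = 1:3, region = c("north", "south", "east"))
issues <- check_schema_sketch(df, schema)
```

Separating detection from reaction lets a caller log warnings in one context and abort in another without changing the checks themselves.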
Computes the SHA256 digest of a file (via the digest package) and compares it to the expected value pinned in provenance.
verify_sha256(path, expected_sha)
| Argument | Description |
|---|---|
| path | Path to the file to hash. |
| expected_sha | The expected SHA256 digest, as a lowercase hex string. |
A list with actual (computed digest), expected (the value
passed in), and match (logical; TRUE if they are identical).
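The hash-and-compare step can be sketched with the digest package, assuming it is installed. The function name `check_sha256` is hypothetical; the call `digest::digest(path, algo = "sha256", file = TRUE)` hashes the file's contents rather than the path string.

```r
library(digest)

# Hypothetical sketch: compute a file's SHA256 and compare it to a pinned value.
check_sha256 <- function(path, expected_sha) {
  actual <- digest::digest(path, algo = "sha256", file = TRUE)
  list(actual = actual, expected = expected_sha, match = identical(actual, expected_sha))
}

# A three-byte file containing "abc" (no trailing newline).
demo_path <- tempfile()
writeBin(charToRaw("abc"), demo_path)

result <- check_sha256(
  demo_path,
  "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
)
result$match
```

The expected value is the standard SHA-256 test vector for the string `"abc"`, so `result$match` is `TRUE` here. Returning both digests alongside the logical makes a mismatch easy to report in a manifest.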
Serializes a manifest to a pretty-printed JSON file via the jsonlite package.
write_manifest_json(manifest, path)
| Argument | Description |
|---|---|
| manifest | A manifest as returned by |
| path | Destination path for the JSON file. |
The path, returned invisibly.
Writes a human-readable SUMMARY.txt into the output directory,
covering run metadata, the exact absolute paths used, result counts,
the files produced, and optional notes, contact, and licence lines.
write_summary_txt(manifest, output_dir, paths, what_was_done = NULL, contact = NULL, licence = NULL)
| Argument | Description |
|---|---|
| manifest | A manifest as returned by |
| output_dir | Directory to write |
| paths | A named list of absolute paths to report (e.g. |
| what_was_done | Optional character vector of bullet points describing what the run did. |
| contact | Optional contact string appended to the summary. |
| licence | Optional licence string appended to the summary. |
The path to the written SUMMARY.txt, returned invisibly.