The Ontario Special Investigations Unit (SIU) publishes director’s reports on every police-civilian interaction that meets the SIU’s mandate (serious injury, death, sexual assault, or discharge of a firearm at a person). MORIE ships a full pipeline for that corpus:
morie_fetch_siu()
morie_siu_audit_case()
morie_siu_anomaly_check()
morie_siu_llm_extract()
morie_siu_compare()
morie_siu_translate()

Most code chunks below are flagged eval = FALSE because
they make network calls or talk to a local LLM. They are tested in
tests/ under a guarded skipif.
The DRID manifest ships inside the package, so reading it requires no network access.
morie_fetch_siu() walks the manifest, fetches every page
(with a token-bucket throttle, default 4 req/s), and parses the result
into a 2,218 x 64 data frame.
df <- morie_fetch_siu(
  lang = "en",        # skip French DRIDs: half the network traffic
  cache_html = TRUE,  # persist pages locally; re-runs are ~1-2 min
  rate_limit = 4      # requests per second
)
dim(df)
# 2218 x 64

The fetcher respects Retry-After on 429 responses,
retries up to 3 times with exponential backoff on 5xx, and interleaves
report and news-release fetches in a single rolling-window pool so a
cold run finishes in ~30 minutes rather than the ~58 it took before the
interleaving change in v0.9.5.
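The retry policy described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not MORIE's internal code: `fetch_with_retry()`, its arguments, and the `status` / `retry_after` fields of the response list are all hypothetical names, and the sketch only handles a Retry-After value given in seconds (the header may also carry an HTTP date).

```r
# Hypothetical helper (not part of MORIE): retry a request function,
# honouring Retry-After on 429 and backing off exponentially on 5xx.
fetch_with_retry <- function(do_request, max_retries = 3, base_delay = 1,
                             sleep = Sys.sleep) {
  attempt <- 0
  repeat {
    resp <- do_request()
    status <- resp$status
    retryable <- status == 429 || (status >= 500 && status < 600)
    if (!retryable || attempt >= max_retries) {
      return(resp)
    }
    if (status == 429 && !is.null(resp$retry_after)) {
      delay <- as.numeric(resp$retry_after)  # server-specified wait, in seconds
    } else {
      delay <- base_delay * 2^attempt        # 1, 2, 4, ... seconds
    }
    sleep(delay)
    attempt <- attempt + 1
  }
}

# Fake request that fails twice and then succeeds; no network involved.
responses <- list(
  list(status = 503),
  list(status = 429, retry_after = "5"),
  list(status = 200)
)
i <- 0
fake_request <- function() {
  i <<- i + 1
  responses[[i]]
}

# Record the waits instead of actually sleeping.
waits <- numeric(0)
record_sleep <- function(seconds) waits <<- c(waits, seconds)

result <- fetch_with_retry(fake_request, sleep = record_sleep)
result$status
waits
```

Passing the sleep function as an argument keeps the sketch testable without real delays; a real fetcher would leave it at `Sys.sleep`.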
morie_siu_audit_case() returns the parser row, the raw
report HTML, the raw news HTML, and the cleaned text side-by-side, so
you can spot exactly what the parser saw.
audit <- morie_siu_audit_case("16-OFI-019")
names(audit)
# "row", "report_html", "news_html", "report_text", "news_text"
# Per-field "does the HTML actually support this value?" verdicts.
morie_siu_anomaly_check("16-OFI-019")
# How accurate is each column across a sample of cases?
morie_siu_audit_columns(case_numbers = sample(df$case_number, 50))

morie_siu_compare() produces a side-by-side diff between
the parser row and any external table that names the same case.
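To make the idea of a side-by-side diff concrete, here is a minimal base-R sketch. It is not the implementation of morie_siu_compare(); the helper `compare_rows()`, the column names other than `case_number`, and the placeholder values are all invented for illustration.

```r
# Hypothetical helper (not MORIE's implementation): field-by-field diff of
# one parser row against one external row describing the same case.
compare_rows <- function(parser_row, external_row) {
  shared <- intersect(names(parser_row), names(external_row))
  parser_vals <- vapply(shared, function(f) as.character(parser_row[[f]]),
                        character(1))
  external_vals <- vapply(shared, function(f) as.character(external_row[[f]]),
                          character(1))
  data.frame(
    field = shared,
    parser = unname(parser_vals),
    external = unname(external_vals),
    match = unname(parser_vals == external_vals),  # NA if either side is NA
    stringsAsFactors = FALSE
  )
}

# Placeholder rows; the case number and values are made up.
parser_row <- data.frame(
  case_number = "00-AAA-001",
  service = "Example Police Service",
  charges_laid = "No",
  stringsAsFactors = FALSE
)
external_row <- data.frame(
  case_number = "00-AAA-001",
  charges_laid = "Yes",
  source = "external table",
  stringsAsFactors = FALSE
)

row_diff <- compare_rows(parser_row, external_row)
row_diff
```

Only columns present in both inputs are compared, so extra columns on either side are ignored rather than reported as mismatches.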
morie_siu_llm_extract() runs an LLM over the cleaned
report text and returns structured field extractions. The default
provider is local Ollama with gemma3:4b, so no API key is
required.
# Default: local Ollama with gemma3:4b. No API key needed.
morie_siu_llm_extract("16-OFI-019")
# Failover chain: try local first, fall back to Gemini only on error.
morie_siu_llm_extract("16-OFI-019", model = c("ollama", "gemini"))

Supported providers: ollama (default, local, free),
gemini, claude, vertex.
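The failover behaviour (try the first provider, move to the next only on error) might look roughly like the following. This is a sketch, not MORIE's code: `extract_with_failover()` and the two stand-in provider functions are hypothetical and make no network or LLM calls.

```r
# Hypothetical helper (not part of MORIE): try each provider in order and
# return the first result that does not raise an error.
extract_with_failover <- function(text, providers) {
  errors <- character(0)
  for (name in names(providers)) {
    result <- tryCatch(
      providers[[name]](text),
      error = function(e) e
    )
    if (!inherits(result, "error")) {
      return(list(provider = name, result = result))
    }
    errors <- c(errors, sprintf("%s: %s", name, conditionMessage(result)))
  }
  stop("all providers failed: ", paste(errors, collapse = "; "))
}

# Stand-in providers: the first always fails, the second succeeds.
providers <- list(
  ollama = function(text) stop("connection refused"),
  gemini = function(text) list(n_chars = nchar(text))
)

out <- extract_with_failover("example report text", providers)
out$provider
```

Collecting the error messages along the way means that, when every provider fails, the final error explains why each one was skipped.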
Environment knobs:
OLLAMA_HOST (default http://localhost:11434)
OLLAMA_MODEL (default gemma3:4b)
OLLAMA_KEEP_ALIVE (default 30m)
OLLAMA_API_KEY (optional, for hosted gateways)

translategemma:latest is the recommended Ollama model
for SIU French-language report translation; install with
ollama pull translategemma:latest.
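The environment knobs listed above can be set for the current R session with `Sys.setenv()` and inspected with `Sys.getenv()`. The helper `ollama_config()` below is hypothetical and only mirrors the documented defaults; how MORIE reads these variables internally is an assumption.

```r
# Hypothetical helper (not part of MORIE): resolve the Ollama settings from
# the environment, falling back to the defaults listed above.
ollama_config <- function() {
  list(
    host = Sys.getenv("OLLAMA_HOST", unset = "http://localhost:11434"),
    model = Sys.getenv("OLLAMA_MODEL", unset = "gemma3:4b"),
    keep_alive = Sys.getenv("OLLAMA_KEEP_ALIVE", unset = "30m"),
    api_key = Sys.getenv("OLLAMA_API_KEY", unset = NA)  # NA when not set
  )
}

# Switch to the translation model for this session only.
Sys.setenv(OLLAMA_MODEL = "translategemma:latest")
cfg <- ollama_config()
cfg$model
```

To make a setting persist across sessions, the same `NAME=value` pairs can go in a `.Renviron` file instead.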
Ship-time corrections live in
inst/extdata/siu_canonical_overrides.csv.gz (47
hand-verified corrections covering 10 spot-checked cases). Users can add
their own, which take effect on the next morie_fetch_siu()
call:
Overrides are applied per cell, by case number, immediately after the parser writes its initial output.
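As a rough picture of per-cell application, the sketch below assumes a long-format override table with columns `case_number`, `field`, and `value`. That layout, the helper names, and the sample data are assumptions made for illustration; the actual structure of siu_canonical_overrides.csv.gz is not described here.

```r
# Hypothetical helper: coerce an override value to the type of the target column.
coerce_like <- function(value, template) {
  if (is.integer(template)) {
    as.integer(value)
  } else if (is.numeric(template)) {
    as.numeric(value)
  } else {
    as.character(value)
  }
}

# Hypothetical helper (not MORIE's implementation): overwrite single cells,
# matching rows by case number and columns by field name.
apply_overrides <- function(df, overrides) {
  for (k in seq_len(nrow(overrides))) {
    row_idx <- which(df$case_number == overrides$case_number[k])
    field <- overrides$field[k]
    if (length(row_idx) == 0 || !(field %in% names(df))) next  # nothing to patch
    df[row_idx, field] <- coerce_like(overrides$value[k], df[[field]])
  }
  df
}

# Placeholder parser output and overrides; all values are made up.
parsed <- data.frame(
  case_number = c("00-AAA-001", "00-AAA-002"),
  n_officers = c(1L, 2L),
  charges_laid = c("No", "No"),
  stringsAsFactors = FALSE
)
overrides <- data.frame(
  case_number = c("00-AAA-002", "00-AAA-002"),
  field = c("n_officers", "charges_laid"),
  value = c("3", "Yes"),
  stringsAsFactors = FALSE
)

fixed <- apply_overrides(parsed, overrides)
fixed
```

Overrides that name an unknown case or column are skipped silently in this sketch; a stricter version could warn instead.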
morie_siu_sanity_check() validates per-row format
constraints: case-number regex, ISO dates, integer counts, Yes / No
values, and no-page-chrome-leak in narrative columns.
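The kinds of per-row checks listed above can be illustrated with base R. This sketch is not morie_siu_sanity_check() itself: the helper `check_row_formats()`, the column names other than `case_number`, and the case-number pattern (inferred from identifiers such as 16-OFI-019) are assumptions, and the page-chrome check is left out.

```r
# Hypothetical helper (not MORIE's implementation): one logical flag per
# format constraint, one row per case.
check_row_formats <- function(df) {
  iso_shape <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", df$incident_date)
  real_date <- !is.na(as.Date(df$incident_date, format = "%Y-%m-%d"))
  data.frame(
    case_number = df$case_number,
    # Assumed pattern: two digits, three capital letters, three digits.
    case_number_ok = grepl("^[0-9]{2}-[A-Z]{3}-[0-9]{3}$", df$case_number),
    date_ok = iso_shape & real_date,
    count_ok = grepl("^[0-9]+$", df$n_officers),
    yes_no_ok = df$charges_laid %in% c("Yes", "No"),
    stringsAsFactors = FALSE
  )
}

# Placeholder rows: the first is well-formed, the second breaks every rule.
cases <- data.frame(
  case_number = c("00-AAA-001", "2016-OFI-19"),
  incident_date = c("2016-01-15", "15/01/2016"),
  n_officers = c("2", "two"),
  charges_laid = c("No", "N/A"),
  stringsAsFactors = FALSE
)

checks <- check_row_formats(cases)
checks
```

Returning one flag per constraint, rather than a single pass/fail value, makes it easy to see which column caused a row to fail.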
morie_fetch_siu() is wired through MORIE’s universal
multi-format fetcher; see the mrm-dataset-fetchers vignette
for the full set of supported open-data sources.
See also the mrm-otis-walkthrough vignette; OTIS + SIU together form the
federal + provincial side of the MRM framework.
To cite the package, use citation("morie").