The SIU pipeline: fetch, parse, audit, translate

Overview

The Ontario Special Investigations Unit (SIU) publishes director’s reports on every police-civilian interaction that meets the SIU’s mandate (serious injury, death, sexual assault, or discharge of a firearm at a person). MORIE ships a full pipeline for that corpus:

  • a polite-by-default HTTP fetcher (morie_fetch_siu()),
  • a hand-rolled C++ parser that handles both English and French templates from 2005 to the present,
  • a language-aware DRID manifest (4,743 drids, English + French) that ships inside the package,
  • a canonical override system that lets the parser learn from hand-verified corrections,
  • audit + AI tooling (morie_siu_audit_case(), morie_siu_anomaly_check(), morie_siu_llm_extract(), morie_siu_compare()),
  • and FR -> EN translation via local Ollama (morie_siu_translate()).

Most code chunks below are flagged eval = FALSE because they make network calls or talk to a local LLM. They are tested in tests/ behind skip guards.

Inspecting the shipped manifest

The DRID manifest ships inside the package, so reading it requires no network access.

library(morie)

manifest <- morie_siu_index()
str(manifest)

# Language counts: en=2531, fr=2212, unknown=0.
table(manifest$`_language`)

# canonical_drid points each French drid at its English counterpart.
head(manifest[manifest$`_language` == "fr",
              c("drid", "case_number", "canonical_drid")])
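
As a rough sketch of how canonical_drid can be used, the following pairs each French row with its English counterpart using base R merge(). The toy manifest is invented for illustration: the drid values are not real, and the value canonical_drid takes on English rows is an assumption (it is not used here). Only the column names come from the manifest shown above.

```r
# Toy stand-in for morie_siu_index(); every value below is made up.
toy <- data.frame(
  drid           = c(101L, 102L, 201L, 202L),
  case_number    = c("16-OFI-019", "20-OFD-082", "16-OFI-019", "20-OFD-082"),
  canonical_drid = c(NA, NA, 101L, 102L),   # English rows: assumed NA
  `_language`    = c("en", "en", "fr", "fr"),
  check.names      = FALSE,                 # keep the leading underscore
  stringsAsFactors = FALSE
)

en <- toy[toy$`_language` == "en", c("drid", "case_number")]
fr <- toy[toy$`_language` == "fr", c("drid", "canonical_drid")]

# Join on fr$canonical_drid == en$drid.
# merge() puts the key first, then the other fr columns, then the en columns.
pairs <- merge(fr, en, by.x = "canonical_drid", by.y = "drid")
names(pairs) <- c("en_drid", "fr_drid", "case_number")
pairs
```

Each row of pairs then names one case together with the drid of its English and French page.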

Fetching the full corpus

morie_fetch_siu() walks the manifest, fetches every page (with a token-bucket throttle, default 4 req/s), and parses the result into a 2,218 x 64 data frame.

df <- morie_fetch_siu(
  lang       = "en",     # skip French drids: half the network traffic
  cache_html = TRUE,     # persist pages locally; re-runs are ~1-2 min
  rate_limit = 4         # requests per second
)

dim(df)
# 2218 x 64
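
For readers unfamiliar with token-bucket throttling, here is a minimal sketch of the idea on a simulated clock. It is not MORIE's implementation, which this vignette does not show; token_bucket_schedule() and its capacity argument are hypothetical names, and the assumption that the bucket starts full is mine.

```r
# Token bucket on a simulated clock (no Sys.sleep(), so it runs instantly).
# `rate` tokens are added per second, up to `capacity`; each request costs one.
# Returns the simulated send time, in seconds, of each request.
token_bucket_schedule <- function(n_requests, rate = 4, capacity = rate) {
  tokens <- capacity          # assumption: the bucket starts full
  now    <- 0                 # simulated time in seconds
  sent   <- numeric(n_requests)
  for (i in seq_len(n_requests)) {
    if (tokens < 1) {
      wait   <- (1 - tokens) / rate   # time until one whole token exists
      now    <- now + wait
      tokens <- 1
    }
    tokens  <- tokens - 1
    sent[i] <- now
  }
  sent
}

token_bucket_schedule(8, rate = 4)
# 0.00 0.00 0.00 0.00 0.25 0.50 0.75 1.00
```

The first four requests drain the initial burst; after that, requests are spaced 1 / rate seconds apart.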

The fetcher respects Retry-After on 429 responses and retries up to 3 times with exponential backoff on 5xx responses. It also interleaves report and news-release fetches in a single rolling-window pool, so a cold run finishes in ~30 minutes rather than the ~58 minutes it took before the interleaving change in v0.9.5.
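
The retry policy described above can be sketched as a small pure function. This is an illustration only: retry_delay() is a hypothetical helper, the 1-second base delay and doubling factor are assumptions, and so is the choice not to retry a 429 that arrives without a Retry-After header.

```r
# Delay in seconds before retry number `attempt` (1, 2, 3, ...),
# or NA when the request should not be retried.
# `retry_after` is the parsed Retry-After header in seconds, NA if absent.
retry_delay <- function(status, attempt, retry_after = NA_real_,
                        base = 1, max_retries = 3) {
  if (attempt > max_retries) return(NA_real_)            # give up
  if (status == 429 && !is.na(retry_after)) return(retry_after)
  if (status >= 500 && status <= 599) {
    return(base * 2^(attempt - 1))                       # 1, 2, 4, ...
  }
  NA_real_                                               # not retryable here
}

retry_delay(503, attempt = 1)                   # 1
retry_delay(503, attempt = 3)                   # 4
retry_delay(429, attempt = 1, retry_after = 7)  # 7
retry_delay(503, attempt = 4)                   # NA
```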

Auditing a single case

morie_siu_audit_case() returns the parser row, the raw report HTML, the raw news HTML, and the cleaned text of each, side by side, so you can see exactly what the parser saw.

audit <- morie_siu_audit_case("16-OFI-019")

names(audit)
# "row", "report_html", "news_html", "report_text", "news_text"

# Per-field "does the HTML actually support this value?" verdicts.
morie_siu_anomaly_check("16-OFI-019")

# How accurate is each column across a sample of cases?
morie_siu_audit_columns(case_numbers = sample(df$case_number, 50))

Comparing parser output against an external table

morie_siu_compare() produces a side-by-side diff between the parser row and any external table that names the same case.

my_other_table <- data.frame(
  case_number = "16-OFI-019",
  n_officers  = 2L,
  date        = as.Date("2016-01-15")
)

morie_siu_compare(
  case_number = "16-OFI-019",
  external    = my_other_table,
  field_map   = c(officer_count = "n_officers",
                  incident_date = "date")
)
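
To make the role of field_map concrete, here is a base R sketch of a mapped, field-by-field comparison. It assumes, as the call above suggests, that the names of field_map are parser columns and the values are external columns. compare_row() is a hypothetical helper, not the package's implementation, and the parser values in the toy row are invented.

```r
# Compare one parser row with one external row, field by field.
# Values are compared as strings so that dates and integers line up.
compare_row <- function(parser_row, external_row, field_map) {
  out <- data.frame(
    field    = names(field_map),
    parser   = vapply(names(field_map),
                      function(f) as.character(parser_row[[f]]),
                      character(1), USE.NAMES = FALSE),
    external = vapply(unname(field_map),
                      function(f) as.character(external_row[[f]]),
                      character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
  out$match <- out$parser == out$external
  out
}

# Invented parser output for illustration.
parser_row <- data.frame(
  case_number   = "16-OFI-019",
  officer_count = 3L,
  incident_date = as.Date("2016-01-15")
)
external_row <- data.frame(
  case_number = "16-OFI-019",
  n_officers  = 2L,
  date        = as.Date("2016-01-15")
)

cmp <- compare_row(
  parser_row, external_row,
  field_map = c(officer_count = "n_officers", incident_date = "date")
)
cmp
```

In this toy case the officer counts disagree and the dates agree, so cmp$match is FALSE for officer_count and TRUE for incident_date.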

AI extraction (free local model by default)

morie_siu_llm_extract() runs an LLM over the cleaned report text and returns structured field extractions. The default provider is local Ollama with gemma3:4b, so no API key is required.

# Default: local Ollama with gemma3:4b. No API key needed.
morie_siu_llm_extract("16-OFI-019")

# Failover chain: try local first, fall back to Gemini only on error.
morie_siu_llm_extract("16-OFI-019", model = c("ollama", "gemini"))
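
The failover behaviour (try the first provider, move on only when it errors) can be sketched with tryCatch(). Both with_failover() and fake_call() are hypothetical helpers written for this illustration; how MORIE implements the chain internally is not shown in this vignette.

```r
# Try each provider in order; return the first successful result.
with_failover <- function(providers, call_provider) {
  errors <- character(0)
  for (p in providers) {
    result <- tryCatch(call_provider(p), error = function(e) e)
    if (!inherits(result, "error")) {
      return(list(provider = p, value = result))
    }
    errors <- c(errors, sprintf("%s: %s", p, conditionMessage(result)))
  }
  stop("all providers failed: ", paste(errors, collapse = "; "))
}

# Stub: pretend the local provider is down and the hosted one answers.
fake_call <- function(provider) {
  if (provider == "ollama") stop("connection refused")
  paste("extracted by", provider)
}

res <- with_failover(c("ollama", "gemini"), fake_call)
res$provider
# "gemini"
```

Collecting the per-provider error messages means that, when every provider fails, the final error still says why each one was skipped.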

Supported providers: ollama (default, local, free), gemini, claude, vertex. Environment variables that configure the Ollama provider:

  • OLLAMA_HOST (default http://localhost:11434)
  • OLLAMA_MODEL (default gemma3:4b)
  • OLLAMA_KEEP_ALIVE (default 30m)
  • OLLAMA_API_KEY (optional, for hosted gateways)
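
One plausible way to resolve these variables with their documented defaults is Sys.getenv() with the unset argument. ollama_config() is a hypothetical helper for illustration; the package may read the variables differently.

```r
# Resolve the Ollama settings, falling back to the documented defaults.
ollama_config <- function() {
  list(
    host       = Sys.getenv("OLLAMA_HOST",       unset = "http://localhost:11434"),
    model      = Sys.getenv("OLLAMA_MODEL",      unset = "gemma3:4b"),
    keep_alive = Sys.getenv("OLLAMA_KEEP_ALIVE", unset = "30m"),
    api_key    = Sys.getenv("OLLAMA_API_KEY",    unset = NA_character_)  # optional
  )
}

# Override one setting for the current R session only.
Sys.setenv(OLLAMA_KEEP_ALIVE = "5m")
ollama_config()$keep_alive
# "5m"
```

Note that Sys.getenv() returns "" rather than the unset value when a variable is set but empty.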

French -> English translation

morie_siu_translate(
  text        = "L'enquete a ete close...",
  target_lang = "en"
)

translategemma:latest is the recommended Ollama model for translating French-language SIU reports; install it with ollama pull translategemma:latest.

The canonical-override system: the parser learns

Ship-time corrections live in inst/extdata/siu_canonical_overrides.csv.gz (47 hand-verified corrections covering 10 spot-checked cases). Users can add their own, which take effect on the next morie_fetch_siu() call:

morie_siu_record_correction(
  case_number = "20-OFD-082",
  field       = "officer_count",
  value       = 3L
)

Overrides are applied per cell, by case number, immediately after the parser writes its initial output.
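
A per-cell override pass of this kind can be sketched in base R as follows. The layout of the overrides table (columns case_number, field, value, with value stored as text) is an assumption that mirrors the arguments of morie_siu_record_correction(). apply_overrides() is a hypothetical helper, and the parsed values are invented.

```r
# Patch individual cells of `df`, one override row at a time.
apply_overrides <- function(df, overrides) {
  for (i in seq_len(nrow(overrides))) {
    rows  <- which(df$case_number == overrides$case_number[i])
    field <- overrides$field[i]
    if (length(rows) == 0 || !(field %in% names(df))) next  # nothing to patch
    new <- overrides$value[i]
    # Coerce the stored text to the column's basic type (integer, double, ...).
    storage.mode(new) <- storage.mode(df[[field]])
    df[rows, field] <- new
  }
  df
}

# Invented parser output for illustration.
parsed <- data.frame(
  case_number   = c("16-OFI-019", "20-OFD-082"),
  officer_count = c(2L, 1L),
  stringsAsFactors = FALSE
)
overrides <- data.frame(
  case_number = "20-OFD-082",
  field       = "officer_count",
  value       = "3",
  stringsAsFactors = FALSE
)

patched <- apply_overrides(parsed, overrides)
patched$officer_count
# 2 3
```

Only the named cell changes; every other value in the row keeps whatever the parser produced.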

Format validity

sane <- morie_siu_sanity_check(df)
sum(!sane$ok)
# 0 on the shipped corpus (100.000% format-clean)

morie_siu_sanity_check() validates per-row format constraints: the case-number regex, ISO 8601 dates, integer counts, Yes / No values, and the absence of leaked page chrome in narrative columns.
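
The kind of per-row check described above might look like the sketch below. The case-number pattern is inferred from the two case numbers used in this vignette, not taken from the package. The charges_laid column is hypothetical, all columns are treated as text, and the page-chrome check is omitted because its rules are not described here.

```r
# Return one ok / not-ok verdict per row.
check_formats <- function(df) {
  case_ok  <- grepl("^[0-9]{2}-[A-Z]{3}-[0-9]{3}$", df$case_number)
  # ISO shape AND a real calendar date (as.Date() gives NA for e.g. Feb 30).
  date_ok  <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", df$incident_date) &
    !is.na(as.Date(df$incident_date, format = "%Y-%m-%d"))
  count_ok <- grepl("^[0-9]+$", df$officer_count)
  yesno_ok <- df$charges_laid %in% c("Yes", "No")
  data.frame(
    case_number = df$case_number,
    ok          = case_ok & date_ok & count_ok & yesno_ok,
    stringsAsFactors = FALSE
  )
}

# Invented rows: one clean, one with an impossible date, one malformed.
toy <- data.frame(
  case_number   = c("16-OFI-019", "20-OFD-082", "bad id"),
  incident_date = c("2016-01-15", "2020-02-30", "2019-05-01"),
  officer_count = c("2", "3", "1"),
  charges_laid  = c("No", "Yes", "Maybe"),
  stringsAsFactors = FALSE
)

res <- check_formats(toy)
sum(!res$ok)
# 2
```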

Where to go next

  • morie_fetch_siu() is wired through MORIE’s universal multi-format fetcher; see the mrm-dataset-fetchers vignette for the full set of supported open-data sources.
  • The OTIS provincial corpus is covered in the mrm-otis-walkthrough vignette; OTIS + SIU together form the federal + provincial side of the MRM framework.
  • Citation: see citation("morie").