---
title: "The SIU pipeline: fetch, parse, audit, translate"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{The SIU pipeline: fetch, parse, audit, translate}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = requireNamespace("morie", quietly = TRUE)
)
```

# Overview

The Ontario Special Investigations Unit (SIU) publishes director's reports on every police-civilian interaction that meets the SIU's mandate (serious injury, death, sexual assault, or discharge of a firearm at a person). MORIE ships a full pipeline for that corpus:

- a polite-by-default HTTP fetcher (`morie_fetch_siu()`),
- a hand-rolled C++ parser that handles both English and French templates from 2005 to the present,
- a language-aware DRID manifest (4,743 drids, English + French) that ships inside the package,
- a canonical override system that lets the parser *learn* from hand-verified corrections,
- audit + AI tooling (`morie_siu_audit_case()`, `morie_siu_anomaly_check()`, `morie_siu_llm_extract()`, `morie_siu_compare()`),
- and FR -> EN translation via local Ollama (`morie_siu_translate()`).

The code chunks below are flagged `eval = FALSE` because most of them make network calls or talk to a local LLM. They are tested in `tests/` behind a skip guard.

# Inspecting the shipped manifest

The DRID manifest ships inside the package, so reading it requires no network access.

```{r manifest, eval = FALSE}
library(morie)

manifest <- morie_siu_index()
str(manifest)

# Language counts: en=2531, fr=2212, unknown=0.
table(manifest$`_language`)

# canonical_drid points each French drid at its English counterpart.
head(manifest[manifest$`_language` == "fr", c("drid", "case_number", "canonical_drid")])
```

# Fetching the full corpus

`morie_fetch_siu()` walks the manifest, fetches every page (with a token-bucket throttle, default 4 req/s), and parses the result into a 2,218 x 64 data frame.

```{r fetch, eval = FALSE}
df <- morie_fetch_siu(
  lang       = "en",   # skip French drids: half the network traffic
  cache_html = TRUE,   # persist pages locally; re-runs are ~1-2 min
  rate_limit = 4       # requests per second
)
dim(df)   # 2218 x 64
```

The fetcher respects `Retry-After` on 429 responses, retries up to 3 times with exponential backoff on 5xx, and interleaves report and news-release fetches in a single rolling-window pool so a cold run finishes in ~30 minutes rather than the ~58 it took before the interleaving change in v0.9.5.

# Auditing a single case

`morie_siu_audit_case()` returns the parser row, the raw report HTML, the raw news HTML, and the cleaned text side-by-side, so you can spot exactly what the parser saw.

```{r audit, eval = FALSE}
audit <- morie_siu_audit_case("16-OFI-019")
names(audit)
# "row", "report_html", "news_html", "report_text", "news_text"

# Per-field "does the HTML actually support this value?" verdicts.
morie_siu_anomaly_check("16-OFI-019")

# How accurate is each column across a sample of cases?
morie_siu_audit_columns(case_numbers = sample(df$case_number, 50))
```

# Comparing parser output against an external table

`morie_siu_compare()` produces a side-by-side diff between the parser row and any external table that names the same case.

```{r compare, eval = FALSE}
my_other_table <- data.frame(
  case_number = "16-OFI-019",
  n_officers  = 2L,
  date        = as.Date("2016-01-15")
)

morie_siu_compare(
  case_number = "16-OFI-019",
  external    = my_other_table,
  field_map   = c(officer_count = "n_officers", incident_date = "date")
)
```

# AI extraction (free local model by default)

`morie_siu_llm_extract()` runs an LLM over the cleaned report text and returns structured field extractions.
The default provider is local Ollama with `gemma3:4b`, so no API key is required.

```{r llm, eval = FALSE}
# Default: local Ollama with gemma3:4b. No API key needed.
morie_siu_llm_extract("16-OFI-019")

# Failover chain: try local first, fall back to Gemini only on error.
morie_siu_llm_extract("16-OFI-019", model = c("ollama", "gemini"))
```

Supported providers: `ollama` (default, local, free), `gemini`, `claude`, `vertex`.

Environment knobs:

- `OLLAMA_HOST` (default `http://localhost:11434`)
- `OLLAMA_MODEL` (default `gemma3:4b`)
- `OLLAMA_KEEP_ALIVE` (default `30m`)
- `OLLAMA_API_KEY` (optional, for hosted gateways)

# French -> English translation

```{r translate, eval = FALSE}
morie_siu_translate(
  text        = "L'enquete a ete close...",
  target_lang = "en"
)
```

`translategemma:latest` is the recommended Ollama model for translating French-language SIU reports; install it with `ollama pull translategemma:latest`.

# The canonical-override system: the parser learns

Ship-time corrections live in `inst/extdata/siu_canonical_overrides.csv.gz` (47 hand-verified corrections covering 10 spot-checked cases). Users can add their own corrections, which take effect on the next `morie_fetch_siu()` call:

```{r override, eval = FALSE}
morie_siu_record_correction(
  case_number = "20-OFD-082",
  field       = "officer_count",
  value       = 3L
)
```

Overrides are applied per cell, by case number, immediately after the parser writes its initial output.

# Format validity

```{r sanity, eval = FALSE}
sane <- morie_siu_sanity_check(df)
sum(!sane$ok)   # 0 on the shipped corpus (100.000% format-clean)
```

`morie_siu_sanity_check()` validates per-row format constraints: case-number regex, ISO dates, integer counts, Yes / No values, and the absence of leaked page chrome in narrative columns.

# Where to go next

- `morie_fetch_siu()` is wired through MORIE's universal multi-format fetcher; see the `mrm-dataset-fetchers` vignette for the full set of supported open-data sources.
- The OTIS provincial corpus is covered in the `mrm-otis-walkthrough` vignette; OTIS + SIU together form the federal + provincial side of the MRM framework.
- Citation: see `citation("morie")`.
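
# Appendix: sketching the retry policy

The retry rules described under "Fetching the full corpus" (respect `Retry-After` on 429, retry up to 3 times with exponential backoff on 5xx) can be illustrated in base R. This is a minimal sketch, not MORIE's implementation: `retry_delay()`, its arguments, and the 1-second base delay are all hypothetical, chosen only to make the schedule concrete.

```{r retry-sketch, eval = FALSE}
# Hypothetical helper (not part of morie): seconds to wait before retry
# number `attempt`, or NA when the request should not be retried.
retry_delay <- function(attempt, status, retry_after = NA_real_,
                        base = 1, max_retries = 3L) {
  if (attempt > max_retries) {
    return(NA_real_)                  # retry budget exhausted
  }
  if (status == 429L && !is.na(retry_after)) {
    return(retry_after)               # wait as long as the server asked
  }
  if (status >= 500L && status <= 599L) {
    return(base * 2^(attempt - 1L))   # exponential backoff (assumed base)
  }
  NA_real_                            # any other status: do not retry
}

# Delay schedule for three consecutive 503 responses.
vapply(1:3, retry_delay, numeric(1), status = 503L)
```

Under these assumed defaults, three consecutive 5xx responses would wait 1, 2, and 4 seconds, after which the request is abandoned.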