MRM dataset fetchers and bundled samples

Overview

The MRM empirical callables (mrm_otis_*, mrm_tps_*, mrm_siu_*) operate on three external data sources. This vignette documents how morie makes each of them accessible to end users:

  • OTIS (Offender Tracking Information System; published by the Ontario Ministry of the Solicitor General): public CKAN release on data.ontario.ca, reached via the existing morie_load_dataset() infrastructure.
  • TPS (Toronto Police Service): public ArcGIS Open Data, reached via morie_fetch_tps().
  • SIU (Ontario Special Investigations Unit): public Director’s Reports, scraped on demand by morie_fetch_siu() (the corpus is not shipped because redistribution licensing is unsettled).

Four small reference samples are bundled with the package (inst/extdata/) so every example runs offline.

library(rmorie)

Bundled reference samples

The samples live in inst/extdata/ and total ~420 KB. Each is a 1000-row random draw (seed 42) except otis_b09 and otis_c11, which are shipped whole (already small).

# Bundled samples
b01 <- morie_sample("otis_b01")
b09 <- morie_sample("otis_b09")
c11 <- morie_sample("otis_c11")
tps <- morie_sample("tps_assault")

# Schema sanity
str(b01, max.level = 1)
#> 'data.frame':    1000 obs. of  18 variables:
#>  $ EndFiscalYear                                         : int  2025 2023 2023 2023 2024 2024 2025 2025 2024 2023 ...
#>  $ UniqueIndividual_ID                                   : chr  "2025-08960-SG" "2023-11152-SG" "2023-08955-SG" "2023-11360-SG" ...
#>  $ Gender                                                : chr  "Male" "Male" "Female" "Male" ...
#>  $ Region_AtTimeOfPlacement                              : chr  "Eastern" "Northern" "Central" "Central" ...
#>  $ Region_MostRecentPlacement                            : chr  "Eastern" "Northern" "Central" "Central" ...
#>  $ Age_Category                                          : chr  "25 to 49" "25 to 49" "25 to 49" "25 to 49" ...
#>  $ NumberConsecutiveDays_Segregation                     : int  4 1 6 1 2 1 1 1 2 3 ...
#>  $ SegReason_SecurityOfInstitution_SafetyOfOthers        : chr  "No" "No" "No" "No" ...
#>  $ SegReason_InmateNeedsProtection                       : chr  "No" "Yes" "No" "No" ...
#>  $ SegReason_InmateNeedsProtection_Medical               : chr  "Yes" "No" "No" "No" ...
#>  $ SegReason_SecurityOfInstitution_SafetyOfOthers_Medical: chr  "No" "No" "Yes" "Yes" ...
#>  $ SegReason_Disciplinary_Segregation                    : chr  "No" "No" "No" "No" ...
#>  $ SegReason_InmateRefuseSearch_Scan                     : chr  "No" "No" "No" "No" ...
#>  $ MentalHealth_Alert                                    : chr  "Yes" "No" "Yes" "No" ...
#>  $ SuicideRisk_Alert                                     : chr  "Yes" "No" "Yes" "No" ...
#>  $ SuicideWatch_Alert                                    : chr  "Yes" "No" "Yes" "No" ...
#>  $ SegReason_Other                                       : chr  "No" "" "" "" ...
#>  $ Number_Of_Placements                                  : int  1 8 1 1 1 2 2 1 2 1 ...
nrow(b09)
#> [1] 78
nrow(c11)
#> [1] 33
ncol(tps)
#> [1] 31
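
A fixed-seed draw like the one described above can be sketched in base R. This is a minimal illustration, not the package's actual build script: draw_sample() is a hypothetical helper, and the toy data frame only stands in for a full OTIS table.

```r
# Hypothetical helper (not part of the package): draw a reproducible
# random sample of at most `n` rows from a data frame.
draw_sample <- function(df, n = 1000, seed = 42) {
  set.seed(seed)                    # fix the RNG so the draw is repeatable
  if (nrow(df) <= n) return(df)     # small tables are kept whole
  df[sample(nrow(df), n), , drop = FALSE]
}

# Toy stand-in for a full table
full <- data.frame(id = seq_len(5000), days = rpois(5000, 3))
s1 <- draw_sample(full)
s2 <- draw_sample(full)
nrow(s1)            # 1000
identical(s1, s2)   # TRUE: same seed, same rows
```

Tables that already have fewer rows than the target size pass through unchanged, which mirrors how otis_b09 and otis_c11 are shipped whole.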

OTIS via CKAN (data.ontario.ca)

The OTIS public release is hosted at data.ontario.ca/dataset/data-on-inmates-in-ontario as 28 CSV resources. The MORIE catalog has four of them registered with their canonical CKAN resource IDs:

cat <- morie_dataset_catalog()
cat[cat$source == "otis", c("key", "name", "table_name", "ckan_resource_id")]
#>        key                                                        name
#> 37 otisa01        OTIS a01: Restrictive Confinement - Detailed Dataset
#> 38 otisb01                    OTIS b01: Segregation - Detailed Dataset
#> 39 otisb09 OTIS b09: Individuals in Segregation - Number of Placements
#> 40 otisc11 OTIS c11: Individuals in Segregation/RC by Aggregate Length
#>    table_name                     ckan_resource_id
#> 37    otisa01 5a0c5804-a055-4031-9743-73f556e43bb4
#> 38    otisb01 406e6d90-d568-4553-8ca7-bc9f90e133b9
#> 39    otisb09 df24e943-d52b-43a8-a10e-a3cc906e26bb
#> 40    otisc11 9c7b74a5-53ad-4ef0-a7a6-97772cd01c55

To pull the full b01 (82,001 rows) from CKAN:

b01_full <- morie_load_dataset("otisb01")
nrow(b01_full)
# 82001

On the first call the loader downloads the CSV and caches it in the package SQLite database. Every call returns a data.frame; subsequent calls read from the cache instead of downloading again.
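
The download-once pattern can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: load_cached() is a hypothetical helper, and it swaps the SQLite store for a plain RDS file to stay dependency-free.

```r
# Hypothetical helper: fetch once, then serve from an on-disk cache.
# `fetch` is any zero-argument function returning a data.frame
# (for example, one that downloads and parses a CSV).
load_cached <- function(key, fetch, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (!file.exists(path)) {
    saveRDS(fetch(), path)   # first call: do the expensive work and store it
  }
  readRDS(path)              # every call: return the cached data.frame
}

# Usage with a stand-in fetcher that counts how often it runs
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  data.frame(x = 1:3)
}
cache_dir <- tempfile("morie-cache-")
dir.create(cache_dir)
first  <- load_cached("demo", fake_fetch, cache_dir)
second <- load_cached("demo", fake_fetch, cache_dir)
calls   # 1: the second call never ran fake_fetch()
```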

TPS via ArcGIS Open Data

Toronto Police Service publishes per-category crime events through ArcGIS Online. MORIE knows the layer URLs for nine categories.

morie_tps_layer_urls()
#>                                                                                                                       Assault 
#>                         "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Assault_Open_Data/FeatureServer/0" 
#>                                                                                                                     AutoTheft 
#>                      "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Auto_Theft_Open_Data/FeatureServer/0" 
#>                                                                                                                  BicycleTheft 
#>                  "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Bicycle_Thefts_Open_Data/FeatureServer/0" 
#>                                                                                                                 BreakAndEnter 
#>                 "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Break_and_Enter_Open_Data/FeatureServer/0" 
#>                                                                                                                     Homicides 
#>        "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Homicides_Open_Data_ASR_RC_TBL_002/FeatureServer/0" 
#>                                                                                                                       Robbery 
#>                         "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Robbery_Open_Data/FeatureServer/0" 
#>                                                                                                   ShootingAndFirearmDiscarges 
#> "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Shooting_and_Firearm_Discharges_Open_Data/FeatureServer/0" 
#>                                                                                                                   TheftFromMV 
#>        "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Theft_From_Motor_Vehicle_Open_Data/FeatureServer/0" 
#>                                                                                                                     TheftOver 
#>                      "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Theft_Over_Open_Data/FeatureServer/0"

To fetch a single category:

csv_path <- morie_fetch_tps("Assault")
assault <- utils::read.csv(csv_path)
nrow(assault)
# 254378 (as of mid-2026)

The fetcher pages through /query with a 2000-record-per-page cap (the ArcGIS-imposed maximum) and writes a tidy CSV to ~/.cache/morie/tps/tps_Assault.csv. Subsequent calls return the cached path unless overwrite = TRUE.
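
The paging scheme can be pictured as a sequence of /query URLs with increasing offsets. The sketch below is an assumption about how such requests could be built, not the fetcher's actual code: build_page_urls() is a hypothetical helper, while where, outFields, f, resultOffset, and resultRecordCount are standard ArcGIS REST query parameters.

```r
# Hypothetical helper: build the /query URLs needed to page through a
# FeatureServer layer, one URL per page of `page_size` records.
build_page_urls <- function(layer_url, total, page_size = 2000, where = "1=1") {
  offsets <- seq(0, max(total - 1, 0), by = page_size)
  sprintf(
    "%s/query?where=%s&outFields=*&f=json&resultOffset=%d&resultRecordCount=%d",
    layer_url,
    utils::URLencode(where, reserved = TRUE),   # percent-encode the SQL filter
    as.integer(offsets),
    as.integer(page_size)
  )
}

layer <- "https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Assault_Open_Data/FeatureServer/0"
urls <- build_page_urls(layer, total = 254378)
length(urls)   # 128 pages of up to 2000 records each
```

A where clause such as "OCC_YEAR >= 2024" is percent-encoded before it is placed in the URL, so the filter is applied by the server rather than after download.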

Filtering at the server side:

recent <- morie_fetch_tps("Homicides",
                          where = "OCC_YEAR >= 2024",
                          overwrite = TRUE)

SIU via on-demand scraper

The Ontario SIU publishes Director’s Reports at siu.on.ca/en/case_directors_reports.php. MORIE includes an on-demand scraper that:

  • Iterates over the index page(s) (optionally filtered by year)
  • Extracts case links
  • Pulls each case detail page
  • Parses incident date, notifying police service, Director’s decision, and outcome text into a single CSV

csv_path <- morie_fetch_siu()       # full unfiltered index
siu <- utils::read.csv(csv_path)
nrow(siu)

# Or restricted to specific years:
csv_path <- morie_fetch_siu(years = 2020:2025, overwrite = TRUE)

Why a scraper rather than a shipped dataset?

The legal status of redistributing a single tabular copy of public oversight reports is not clearly established. Running the scraper per user is unambiguously fair use of public information, whereas bundling the scraped corpus is on less certain ground. The scraper itself respects a 2-second rate limit, sets a clear User-Agent, and follows the SIU site’s published structure.
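
A rate-limited fetch loop of the kind described above might look like the sketch below. It is not the scraper's real code: polite_fetch() and fake_page() are hypothetical names, and the page fetcher is injected so the loop can be exercised without touching the network.

```r
# Hypothetical helper: visit each URL in turn, pausing between requests.
# A real fetcher would issue an HTTP GET with a descriptive User-Agent;
# here `fetch_page` is supplied by the caller.
polite_fetch <- function(urls, fetch_page, delay_sec = 2) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay_sec)   # wait before every request after the first
    pages[[i]] <- fetch_page(urls[[i]])
  }
  stats::setNames(pages, urls)
}

# Usage with a stand-in fetcher and a shortened delay
fake_page <- function(url) paste("<html>", basename(url), "</html>")
urls <- c("https://example.org/case/1", "https://example.org/case/2")
started <- Sys.time()
pages <- polite_fetch(urls, fake_page, delay_sec = 0.2)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```

Keeping the delay as a parameter makes the 2-second default easy to lengthen if the site asks for slower access.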

R-side wrapper, Python-side implementation

morie_fetch_siu() in R is a thin reticulate wrapper around morie.siu_fetch.fetch_siu_cases() in Python. This keeps the regex parsing logic in one canonical place. If reticulate isn’t installed, fall back to calling the Python directly:

python3 -c "from morie.siu_fetch import fetch_siu_cases; print(fetch_siu_cases())"

Workflow: from bundled sample to verified result

Putting it together, here is a complete worked example that makes no network calls:

b01 <- morie_sample("otis_b01")

# Mandela classification with the "row" denominator (the default is "individual_any")
mrm_classify_mandela(b01, denominator = "row")
#>     year denominator n_mandela        rate  pct n_broader_rc rate_broader
#> 1   2023         362         0 0.000000000 0.00            0  0.000000000
#> 2   2024         337         3 0.008902077 0.89            3  0.008902077
#> 3   2025         301         4 0.013289037 1.33            4  0.013289037
#> 4 pooled        1000         7 0.007000000 0.70            7  0.007000000

# Segregation duration KM
mrm_otis_seg_duration_km(b01,
                         group_cols = "MentalHealth_Alert")
#>   stratum   n mean_days median_days q25_days pct_above_mandela
#> 1      No 499      2.76           2        3               0.6
#> 2     Yes 501      3.29           2        4               0.8
#>   median_among_above_mandela
#> 1                       54.0
#> 2                       41.5

# Mortification co-occurrence: Cramer's V across alert pairs
mrm_otis_mortification_cooccurrence(b01)
#>              alert_a            alert_b    n   chi2 df   p_value
#> 1 MentalHealth_Alert  SuicideRisk_Alert 1000  33.25  1  8.12e-09
#> 2 MentalHealth_Alert SuicideWatch_Alert 1000  12.09  1  5.08e-04
#> 3  SuicideRisk_Alert SuicideWatch_Alert 1000 470.37  1 2.66e-104
#>   morie_cramers_v
#> 1          0.1823
#> 2          0.1099
#> 3          0.6858
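
The morie_cramers_v column can be cross-checked by hand. The sketch below assumes the column follows the standard definition of Cramér's V and that chi2 is the uncorrected Pearson statistic; cramers_v() is a hypothetical helper written for this check, not a package function.

```r
# Cramer's V for an r x c contingency table:
#   V = sqrt(chi2 / (n * min(r - 1, c - 1)))
# For the 2 x 2 alert tables, min(r - 1, c - 1) is 1.
cramers_v <- function(chi2, n, r = 2, c = 2) {
  sqrt(chi2 / (n * min(r - 1, c - 1)))
}

chi2 <- c(33.25, 12.09, 470.37)   # chi2 column from the output above
v <- cramers_v(chi2, n = 1000)
round(v, 3)                       # 0.182 0.110 0.686
```

Because the printed chi2 values are themselves rounded, the recomputed values agree with the reported column only to about three decimal places.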

For a full-data version, swap morie_sample("otis_b01") for morie_load_dataset("otisb01") and re-run the same callables.

Caching layout

Cache path                        Populated by            Size (full)
~/.cache/morie/morie.db (SQLite)  morie_load_dataset(*)   a few MB to ~1 GB depending on selection
~/.cache/morie/tps/tps_*.csv      morie_fetch_tps()       ~5–50 MB per category
~/.cache/morie/siu/SIU.csv        morie_fetch_siu()       ~5 MB
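
To see what is actually on disk, the cache directory can be listed with base R. The helper below is a hypothetical convenience, not a package function; it assumes only the directory layout shown in the table.

```r
# Hypothetical helper: list cached files under a directory with sizes in KB.
cache_report <- function(cache_dir = "~/.cache/morie") {
  cache_dir <- path.expand(cache_dir)
  rel <- list.files(cache_dir, recursive = TRUE)   # paths relative to cache_dir
  data.frame(
    file    = rel,
    size_kb = round(file.size(file.path(cache_dir, rel)) / 1024, 1),
    stringsAsFactors = FALSE
  )
}

# Usage against a throwaway directory that mimics the layout
demo <- tempfile("morie-cache-")
dir.create(file.path(demo, "tps"), recursive = TRUE)
writeLines(rep("x", 1000), file.path(demo, "tps", "tps_Assault.csv"))
report <- cache_report(demo)
report
```

Called with no arguments, the helper inspects ~/.cache/morie and returns an empty data frame if nothing has been fetched yet.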

References

  • OTIS data dictionary: data.ontario.ca/dataset/data-on-inmates-in-ontario
  • Toronto Police Open Data: data.torontopolice.on.ca/
  • SIU Director’s Reports: siu.on.ca/en/case_directors_reports.php
  • MRM theoretical paper (Ruhela 2026, Multilevel Reconciliation Methodology; see citation("morie") for the full bibentry)