I started by creating a clean repository skeleton with a minimal dependency set and a CLI entry point, focusing on testability and predictable execution flow.
I defined core domain models for events, baseline keys, baseline statistics, and anomaly results, using explicit fields to avoid hidden assumptions about time, identity, or metrics.
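The models above can be sketched with frozen dataclasses; the exact field names (entity, metric, hour-of-day bucket) are assumptions for illustration, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp: float   # Unix epoch seconds; explicit, no implicit timezone
    entity: str        # who or what produced the event (assumed identity field)
    metric: str        # which measurement the value belongs to
    value: float

@dataclass(frozen=True)
class BaselineKey:
    entity: str
    metric: str
    hour_of_day: int   # contextual bucket, 0-23 (assumed bucketing scheme)

@dataclass(frozen=True)
class BaselineStats:
    median: float
    mad: float         # median absolute deviation
    sample_count: int

@dataclass(frozen=True)
class AnomalyResult:
    score: float       # deviation in MAD units
    is_anomaly: bool
    explanation: str   # human-readable reason for the verdict
```

Freezing the dataclasses makes `BaselineKey` hashable, so it can serve directly as a dictionary key when grouping events into buckets.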
I implemented baseline computation by grouping events into contextual buckets and calculating the median and MAD (median absolute deviation) only after a minimum sample threshold was met, deliberately skipping underspecified keys.
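A minimal sketch of that training step, assuming events are `(timestamp, entity, metric, value)` tuples, an hour-of-day bucketing scheme, and a minimum of 5 samples per key (all illustrative choices, not the project's actual parameters):

```python
import statistics
from collections import defaultdict

def train_baselines(events, min_samples=5):
    """Group events into (entity, metric, hour) buckets, then compute
    median and MAD per bucket, skipping keys with too few samples."""
    buckets = defaultdict(list)
    for ts, entity, metric, value in events:
        hour = int(ts // 3600) % 24          # contextual bucket: hour of day
        buckets[(entity, metric, hour)].append(value)

    baselines = {}
    for key, values in buckets.items():
        if len(values) < min_samples:
            continue                          # deliberately skip underspecified keys
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        baselines[key] = (med, mad, len(values))
    return baselines
```

Median and MAD are used instead of mean and standard deviation because both are robust to the very outliers the system is trying to detect: a handful of anomalous training samples barely shifts the baseline.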
To make baselines durable, I added a SQLite-backed storage layer that persists baseline artifacts with metadata such as training window, sample count, and version, allowing multiple historical baselines to coexist.
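A sketch of what such a storage layer could look like; the table layout, JSON-encoded key, and versioning-by-column approach are assumptions, not the actual schema:

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS baselines (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    key           TEXT    NOT NULL,  -- JSON-encoded baseline key
    median        REAL    NOT NULL,
    mad           REAL    NOT NULL,
    sample_count  INTEGER NOT NULL,
    window_start  REAL    NOT NULL,  -- training window metadata
    window_end    REAL    NOT NULL,
    version       INTEGER NOT NULL,  -- multiple historical baselines coexist
    created_at    REAL    NOT NULL
)
"""

def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def save_baseline(conn, key, median, mad, n, window, version=1):
    conn.execute(
        "INSERT INTO baselines (key, median, mad, sample_count, "
        "window_start, window_end, version, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (json.dumps(key), median, mad, n, window[0], window[1],
         version, time.time()),
    )
    conn.commit()

def latest_baseline(conn, key):
    """Return (median, mad, sample_count, version) for the newest version."""
    return conn.execute(
        "SELECT median, mad, sample_count, version FROM baselines "
        "WHERE key = ? ORDER BY version DESC, created_at DESC LIMIT 1",
        (json.dumps(key),),
    ).fetchone()
```

Because old rows are never overwritten, retraining simply inserts a new version, and scoring can pin or roll back to an earlier baseline if a retrain goes wrong.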
I built a scoring function that measures deviation in MAD units and produces a structured result with both a numeric score and a human-readable explanation.
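The scoring step can be sketched as follows; the threshold of 3.5 MAD units and the handling of a zero MAD are illustrative assumptions:

```python
def score_event(value, median, mad, threshold=3.5):
    """Score a value as its absolute deviation from the baseline median,
    measured in MAD units, and return (score, is_anomaly, explanation)."""
    if mad == 0:
        # Degenerate baseline: any deviation at all is maximally surprising.
        score = 0.0 if value == median else float("inf")
    else:
        score = abs(value - median) / mad
    is_anomaly = score > threshold
    explanation = (
        f"value={value} is {score:.2f} MAD-units from median={median} "
        f"({'anomalous' if is_anomaly else 'normal'} at threshold {threshold})"
    )
    return score, is_anomaly, explanation
```

Returning the explanation alongside the number keeps every verdict self-describing, which is what later makes the report and explain commands cheap to build.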
I wired the system end-to-end through a CLI, adding commands to train baselines, score new events, list and inspect stored baselines, and drill into individual scoring decisions.
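An `argparse` sub-command layout in that spirit could look like this; the command and flag names are hypothetical stand-ins for the actual CLI:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="anomaly")
    sub = parser.add_subparsers(dest="command", required=True)

    train = sub.add_parser("train", help="compute baselines from an event file")
    train.add_argument("events_file")
    train.add_argument("--min-samples", type=int, default=5)

    score = sub.add_parser("score", help="score new events against stored baselines")
    score.add_argument("events_file")

    sub.add_parser("list", help="list stored baselines with their metadata")

    explain = sub.add_parser("explain", help="show how a single event was scored")
    explain.add_argument("event_id")

    return parser
```

Keeping all commands under one parser means `--help` documents the whole workflow, and each handler can share the same storage connection.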
To validate behavior, I created a synthetic dataset generator that produces realistic time-series data with daily seasonality, noise, and an injected incident, allowing repeatable demonstrations without relying on real logs.
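A minimal sketch of such a generator, assuming a sinusoidal daily cycle, Gaussian noise, and a single one-hour spike (the specific amplitudes, incident hour, and seed are illustrative):

```python
import math
import random

def generate_events(days=3, per_hour=4, incident_hour=30, seed=7):
    """Produce (timestamp, entity, metric, value) tuples with daily
    seasonality, Gaussian noise, and one injected incident spike."""
    rng = random.Random(seed)            # fixed seed -> repeatable demos
    events = []
    for hour in range(days * 24):
        # Daily seasonality: a sinusoid with a 24-hour period.
        base = 100 + 20 * math.sin(2 * math.pi * (hour % 24) / 24)
        for i in range(per_hour):
            value = base + rng.gauss(0, 3)   # measurement noise
            if hour == incident_hour:
                value += 80                   # injected incident
            events.append((hour * 3600 + i * 900, "web", "latency_ms", value))
    return events
```

Because the incident is injected at a known hour, a demo run can verify that the scorer flags exactly that window and nothing else, without touching real logs.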
I added a reporting command that summarizes scoring results into a Markdown artifact, making the system suitable for case studies and post-incident analysis rather than just terminal output.
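The report rendering might be as simple as the following; the exact columns and summary lines are assumptions about what the artifact contains:

```python
def render_report(results):
    """Render scoring results as a Markdown report.
    `results` is a list of (event_id, score, is_anomaly) tuples."""
    anomalies = [r for r in results if r[2]]
    lines = [
        "# Scoring Report",
        "",
        f"- Events scored: {len(results)}",
        f"- Anomalies detected: {len(anomalies)}",
        "",
        "| Event | Score (MAD units) | Verdict |",
        "|-------|-------------------|---------|",
    ]
    for event_id, score, is_anom in results:
        verdict = "ANOMALY" if is_anom else "ok"
        lines.append(f"| {event_id} | {score:.2f} | {verdict} |")
    return "\n".join(lines)
```

Emitting Markdown rather than ANSI-colored terminal output means the same artifact can be dropped into a case study, a wiki page, or a post-incident review unchanged.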
Finally, I implemented an explain command that takes a single event and shows exactly which baseline was used, how the score was computed, and why the engine classified the event as normal or anomalous.
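The explain step can be sketched like this, reusing the hour-of-day key and MAD-unit score from the earlier steps; the output wording and 3.5 threshold are illustrative assumptions:

```python
def explain_event(event, baselines, threshold=3.5):
    """Show which baseline applied to one event and how its score arose.
    `event` is (timestamp, entity, metric, value); `baselines` maps
    (entity, metric, hour) -> (median, mad, sample_count)."""
    ts, entity, metric, value = event
    key = (entity, metric, int(ts // 3600) % 24)
    stats = baselines.get(key)
    if stats is None:
        return f"No baseline for key {key}; event cannot be scored."
    median, mad, n = stats
    score = abs(value - median) / mad if mad else 0.0
    verdict = "ANOMALOUS" if score > threshold else "normal"
    return (
        f"Baseline {key} (n={n}): median={median}, mad={mad}\n"
        f"score = |{value} - {median}| / {mad} = {score:.2f}\n"
        f"Verdict: {verdict} (threshold {threshold})"
    )
```

Walking through key selection, the raw arithmetic, and the threshold comparison in one message is what makes individual verdicts auditable rather than a black-box score.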