I started by creating a clean repository skeleton with a minimal dependency set and a CLI entry point, focusing on testability and predictable execution flow.
I defined core domain models for events, baseline keys, baseline statistics, and anomaly results, using explicit fields to avoid hidden assumptions about time, identity, or metrics.
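The models above can be sketched with frozen dataclasses; the exact field names (entity, metric, hour-of-day bucket) are assumptions for illustration, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp: float   # Unix epoch seconds; explicit, no implicit timezone
    entity: str        # who or what produced the event (assumed identity field)
    metric: str        # which measurement the value belongs to
    value: float

@dataclass(frozen=True)
class BaselineKey:
    entity: str
    metric: str
    hour_of_day: int   # contextual bucket, 0-23 (assumed bucketing scheme)

@dataclass(frozen=True)
class BaselineStats:
    median: float
    mad: float         # median absolute deviation
    sample_count: int

@dataclass(frozen=True)
class AnomalyResult:
    score: float       # deviation in MAD units
    is_anomaly: bool
    explanation: str   # human-readable reason for the verdict
```

Freezing the dataclasses makes `BaselineKey` hashable, so it can serve directly as a dictionary key when grouping events into buckets.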
I implemented baseline computation by grouping events into contextual buckets and calculating the median and MAD (median absolute deviation) only after a minimum sample threshold was met, deliberately skipping underspecified keys.
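A minimal sketch of that training step, assuming events are `(timestamp, entity, metric, value)` tuples, an hour-of-day bucketing scheme, and a minimum of 5 samples per key (all illustrative choices, not the project's actual parameters):

```python
import statistics
from collections import defaultdict

def train_baselines(events, min_samples=5):
    """Group events into (entity, metric, hour) buckets, then compute
    median and MAD per bucket, skipping keys with too few samples."""
    buckets = defaultdict(list)
    for ts, entity, metric, value in events:
        hour = int(ts // 3600) % 24          # contextual bucket: hour of day
        buckets[(entity, metric, hour)].append(value)

    baselines = {}
    for key, values in buckets.items():
        if len(values) < min_samples:
            continue                          # deliberately skip underspecified keys
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        baselines[key] = (med, mad, len(values))
    return baselines
```

Median and MAD are used instead of mean and standard deviation because both are robust to the very outliers the system is trying to detect: a handful of anomalous training samples barely shifts the baseline.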
To make baselines durable, I added a SQLite-backed storage layer that persists baseline artifacts with metadata such as training window, sample count, and version, allowing multiple historical baselines to coexist.
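A sketch of what such a storage layer could look like; the table layout, JSON-encoded key, and versioning-by-column approach are assumptions, not the actual schema:

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS baselines (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    key           TEXT    NOT NULL,  -- JSON-encoded baseline key
    median        REAL    NOT NULL,
    mad           REAL    NOT NULL,
    sample_count  INTEGER NOT NULL,
    window_start  REAL    NOT NULL,  -- training window metadata
    window_end    REAL    NOT NULL,
    version       INTEGER NOT NULL,  -- multiple historical baselines coexist
    created_at    REAL    NOT NULL
)
"""

def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def save_baseline(conn, key, median, mad, n, window, version=1):
    conn.execute(
        "INSERT INTO baselines (key, median, mad, sample_count, "
        "window_start, window_end, version, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (json.dumps(key), median, mad, n, window[0], window[1],
         version, time.time()),
    )
    conn.commit()

def latest_baseline(conn, key):
    """Return (median, mad, sample_count, version) for the newest version."""
    return conn.execute(
        "SELECT median, mad, sample_count, version FROM baselines "
        "WHERE key = ? ORDER BY version DESC, created_at DESC LIMIT 1",
        (json.dumps(key),),
    ).fetchone()
```

Because old rows are never overwritten, retraining simply inserts a new version, and scoring can pin or roll back to an earlier baseline if a retrain goes wrong.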
I built a scoring function that measures deviation in MAD units and produces a structured result with both a numeric score and a human-readable explanation.
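The scoring step can be sketched as follows; the threshold of 3.5 MAD units and the handling of a zero MAD are illustrative assumptions:

```python
def score_event(value, median, mad, threshold=3.5):
    """Score a value as its absolute deviation from the baseline median,
    measured in MAD units, and return (score, is_anomaly, explanation)."""
    if mad == 0:
        # Degenerate baseline: any deviation at all is maximally surprising.
        score = 0.0 if value == median else float("inf")
    else:
        score = abs(value - median) / mad
    is_anomaly = score > threshold
    explanation = (
        f"value={value} is {score:.2f} MAD-units from median={median} "
        f"({'anomalous' if is_anomaly else 'normal'} at threshold {threshold})"
    )
    return score, is_anomaly, explanation
```

Returning the explanation alongside the number keeps every verdict self-describing, which is what later makes the report and explain commands cheap to build.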
I wired the system end-to-end through a CLI, adding commands to train baselines, score new events, list and inspect stored baselines, and drill into individual scoring decisions.
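An `argparse` sub-command layout in that spirit could look like this; the command and flag names are hypothetical stand-ins for the actual CLI:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="anomaly")
    sub = parser.add_subparsers(dest="command", required=True)

    train = sub.add_parser("train", help="compute baselines from an event file")
    train.add_argument("events_file")
    train.add_argument("--min-samples", type=int, default=5)

    score = sub.add_parser("score", help="score new events against stored baselines")
    score.add_argument("events_file")

    sub.add_parser("list", help="list stored baselines with their metadata")

    explain = sub.add_parser("explain", help="show how a single event was scored")
    explain.add_argument("event_id")

    return parser
```

Keeping all commands under one parser means `--help` documents the whole workflow, and each handler can share the same storage connection.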
To validate behavior, I created a synthetic dataset generator that produces realistic time-series data with daily seasonality, noise, and an injected incident, allowing repeatable demonstrations without relying on real logs.
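A minimal sketch of such a generator, assuming a sinusoidal daily cycle, Gaussian noise, and a single one-hour spike (the specific amplitudes, incident hour, and seed are illustrative):

```python
import math
import random

def generate_events(days=3, per_hour=4, incident_hour=30, seed=7):
    """Produce (timestamp, entity, metric, value) tuples with daily
    seasonality, Gaussian noise, and one injected incident spike."""
    rng = random.Random(seed)            # fixed seed -> repeatable demos
    events = []
    for hour in range(days * 24):
        # Daily seasonality: a sinusoid with a 24-hour period.
        base = 100 + 20 * math.sin(2 * math.pi * (hour % 24) / 24)
        for i in range(per_hour):
            value = base + rng.gauss(0, 3)   # measurement noise
            if hour == incident_hour:
                value += 80                   # injected incident
            events.append((hour * 3600 + i * 900, "web", "latency_ms", value))
    return events
```

Because the incident is injected at a known hour, a demo run can verify that the scorer flags exactly that window and nothing else, without touching real logs.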
I added a reporting command that summarizes scoring results into a Markdown artifact, making the system suitable for case studies and post-incident analysis rather than just terminal output.
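The report rendering might be as simple as the following; the exact columns and summary lines are assumptions about what the artifact contains:

```python
def render_report(results):
    """Render scoring results as a Markdown report.
    `results` is a list of (event_id, score, is_anomaly) tuples."""
    anomalies = [r for r in results if r[2]]
    lines = [
        "# Scoring Report",
        "",
        f"- Events scored: {len(results)}",
        f"- Anomalies detected: {len(anomalies)}",
        "",
        "| Event | Score (MAD units) | Verdict |",
        "|-------|-------------------|---------|",
    ]
    for event_id, score, is_anom in results:
        verdict = "ANOMALY" if is_anom else "ok"
        lines.append(f"| {event_id} | {score:.2f} | {verdict} |")
    return "\n".join(lines)
```

Emitting Markdown rather than ANSI-colored terminal output means the same artifact can be dropped into a case study, a wiki page, or a post-incident review unchanged.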
Finally, I implemented an explain command that takes a single event and shows exactly which baseline was used, how the score was computed, and why the engine classified the event as normal or anomalous.
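The explain step can be sketched like this, reusing the hour-of-day key and MAD-unit score from the earlier steps; the output wording and 3.5 threshold are illustrative assumptions:

```python
def explain_event(event, baselines, threshold=3.5):
    """Show which baseline applied to one event and how its score arose.
    `event` is (timestamp, entity, metric, value); `baselines` maps
    (entity, metric, hour) -> (median, mad, sample_count)."""
    ts, entity, metric, value = event
    key = (entity, metric, int(ts // 3600) % 24)
    stats = baselines.get(key)
    if stats is None:
        return f"No baseline for key {key}; event cannot be scored."
    median, mad, n = stats
    score = abs(value - median) / mad if mad else 0.0
    verdict = "ANOMALOUS" if score > threshold else "normal"
    return (
        f"Baseline {key} (n={n}): median={median}, mad={mad}\n"
        f"score = |{value} - {median}| / {mad} = {score:.2f}\n"
        f"Verdict: {verdict} (threshold {threshold})"
    )
```

Walking through key selection, the raw arithmetic, and the threshold comparison in one message is what makes individual verdicts auditable rather than a black-box score.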