Failure alerts
When a run transitions toFAIL or ABORT, oleander opens a CRITICAL failure alert on the corresponding pipeline. Each alert carries context about how the failure rate has changed relative to recent history:
- Error rate — the percentage of all runs for that pipeline that have failed
- Failure increase factor — the ratio of failures in the most recent 20 runs compared to the 20 runs before that
Anomaly alerts
oleander tracks the following metrics per pipeline run and detects anomalies using a windowed statistical model:| Metric | Description |
|---|---|
bytes_read | Total bytes read across all inputs |
rows_read | Total rows read across all inputs |
bytes_written | Total bytes written across all outputs |
rows_written | Total rows written across all outputs |
duration_avg | Average job duration in seconds |
- The observed value and the expected value
- Upper and lower bounds from the detection window
- An anomaly score
- Severity:
INFO,WARN, orCRITICAL
Alert lifecycle
| Status | Meaning |
|---|---|
NEW | Alert just opened, not yet acknowledged |
ACTIVE | Alert is ongoing |
RESOLVED | Manually resolved by a team member |
last_detected_at and re-evaluate severity rather than creating duplicate alerts.
Notifications
When an alert opens, oleander sends a Slack notification (if connected) and creates a notification entry visible in the platform. Webhooks configured in Settings receive all alert events as OpenLineage-formatted payloads. To receive alerts for a pipeline, add it to your watch list in Settings.AI investigations
For every failure alert, oleander can run an automated investigation. The AI correlates:- Logs, OTel spans, and raw telemetry from the failed run
- Lineage events and dataset I/O from the run
- Iceberg commit and scan metrics
- Spark logical plan diffs against the previous successful run