Health and Alarms

This section covers the SystemHealth record and its write cadence, the alarm lifecycle, alarm trigger conditions, and the minimum operator control API.

System health

HeartbeatWatcher writes a SystemHealth record to system_health every 30 seconds.

SystemHealth
  status:               HealthStatus  — healthy | degraded | down
  uptime_seconds:       int
  version:              str
  restart_reason:       str | None
  last_heartbeat_at:    datetime
  next_expected_at:     datetime      — if now > this, heartbeat is missed
  degraded_subsystems:  list[str]     — names of unhealthy subsystems

GET /health returns the most recent record. next_expected_at is used by the alarm system to detect missed heartbeats without requiring a log scan.
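As a rough illustration of the record and the missed-heartbeat check, the sketch below models SystemHealth as a dataclass and compares the current time against next_expected_at. The function name heartbeat_missed and the 10-second grace value are assumptions made for this example; the source only states the 30-second write cadence and that a grace period exists.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


@dataclass
class SystemHealth:
    status: HealthStatus
    uptime_seconds: int
    version: str
    restart_reason: str | None
    last_heartbeat_at: datetime
    next_expected_at: datetime
    degraded_subsystems: list[str] = field(default_factory=list)


HEARTBEAT_INTERVAL = timedelta(seconds=30)  # write cadence stated in the text
GRACE_PERIOD = timedelta(seconds=10)        # assumed value, not from the source


def heartbeat_missed(
    record: SystemHealth,
    now: datetime,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Return True when now is past next_expected_at plus the grace period."""
    return now > record.next_expected_at + grace


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    record = SystemHealth(
        status=HealthStatus.HEALTHY,
        uptime_seconds=120,
        version="1.0.0",
        restart_reason=None,
        last_heartbeat_at=t0,
        next_expected_at=t0 + HEARTBEAT_INTERVAL,
    )
    print(heartbeat_missed(record, t0 + timedelta(seconds=45)))
```

Because the check reads only the latest record, it needs a single row lookup rather than a scan over heartbeat logs.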

Alarm lifecycle

Alarms are persisted entities with three states:

open ──► acked ──► resolved
  • open: the condition that triggered the alarm is still active.
  • acked: an operator has acknowledged the alarm, but it is not yet resolved.
  • resolved: the condition is cleared and the operator has confirmed resolution.
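
The lifecycle above can be sketched as a small transition table. This is a minimal illustration, not the actual implementation: the names AlarmState, ALLOWED_TRANSITIONS, and transition are hypothetical, and the sketch permits only the two arrows shown in the diagram. Whether an open alarm may be resolved directly is not stated in the source, so it is rejected here as an assumption.

```python
from enum import Enum


class AlarmState(str, Enum):
    OPEN = "open"
    ACKED = "acked"
    RESOLVED = "resolved"


# Only the arrows drawn in the lifecycle diagram: open -> acked -> resolved.
# Assumption: open -> resolved is not allowed because the diagram omits it.
ALLOWED_TRANSITIONS = {
    AlarmState.OPEN: {AlarmState.ACKED},
    AlarmState.ACKED: {AlarmState.RESOLVED},
    AlarmState.RESOLVED: set(),
}


def transition(current: AlarmState, target: AlarmState) -> AlarmState:
    """Return the target state, or raise ValueError for a disallowed move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move alarm from {current.value} to {target.value}")
    return target


if __name__ == "__main__":
    state = AlarmState.OPEN
    state = transition(state, AlarmState.ACKED)
    state = transition(state, AlarmState.RESOLVED)
    print(state.value)
```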

Alarms use a dedupe_key to prevent storms: if an open alarm already exists for the same dedupe_key, a new trigger does not create a duplicate alarm — it updates the existing one.
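
One way to picture the dedupe behavior is an in-memory store that looks for an open alarm with the same dedupe_key before creating a new one. The Alarm fields trigger_count and last_triggered_at, and the AlarmStore class itself, are hypothetical names introduced for this sketch; the source does not say which fields are updated on a repeat trigger.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Alarm:
    dedupe_key: str
    severity: str
    state: str = "open"
    trigger_count: int = 1                    # hypothetical field
    last_triggered_at: datetime | None = None  # hypothetical field


class AlarmStore:
    """In-memory stand-in for the persisted alarm table."""

    def __init__(self) -> None:
        self.alarms: list[Alarm] = []

    def trigger(self, dedupe_key: str, severity: str, now: datetime) -> Alarm:
        # Reuse an existing open alarm with the same key instead of duplicating it.
        for alarm in self.alarms:
            if alarm.dedupe_key == dedupe_key and alarm.state == "open":
                alarm.trigger_count += 1
                alarm.last_triggered_at = now
                return alarm
        alarm = Alarm(dedupe_key=dedupe_key, severity=severity, last_triggered_at=now)
        self.alarms.append(alarm)
        return alarm


if __name__ == "__main__":
    store = AlarmStore()
    now = datetime.now(timezone.utc)
    first = store.trigger("heartbeat:missed", "critical", now)
    second = store.trigger("heartbeat:missed", "critical", now)
    print(first is second, len(store.alarms), first.trigger_count)
```

A persisted version would typically enforce the same rule with a uniqueness constraint on the key for open alarms, so that concurrent triggers cannot both insert.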

Alarm trigger conditions

Trigger type               Condition                                                 Severity
Missed heartbeat           now > system_health.next_expected_at + grace_period       critical
Integration auth failed    Consecutive auth errors ≥ threshold                       error
Integration auth expiring  Credential expiry within warning window                   warning
Repeated tool errors       tool.consecutive_errors ≥ threshold                       error
Stuck task                 status = running AND updated_at < now - threshold         warning
Rule storm                 triggered_count_per_minute > threshold                    warning
Schedule backlog           next_run_at < now - threshold AND enabled = 1             warning
Notification storm         Outbound notifications/hour > MAX_NOTIFICATIONS_PER_HOUR  warning
Secret in audit payload    Redaction miss detected                                   critical
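
Two of the time-based conditions above, stuck task and schedule backlog, can be written as plain predicates. This is a sketch under assumptions: the function names and the threshold values in the usage lines are illustrative, and the source does not specify how thresholds are configured.

```python
from datetime import datetime, timedelta, timezone


def is_stuck_task(
    status: str, updated_at: datetime, now: datetime, threshold: timedelta
) -> bool:
    """status = running AND updated_at < now - threshold"""
    return status == "running" and updated_at < now - threshold


def is_schedule_backlog(
    next_run_at: datetime, enabled: bool, now: datetime, threshold: timedelta
) -> bool:
    """next_run_at < now - threshold AND enabled = 1"""
    return bool(enabled) and next_run_at < now - threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Threshold values below are assumptions chosen for the example.
    print(is_stuck_task("running", now - timedelta(minutes=20), now, timedelta(minutes=15)))
    print(is_schedule_backlog(now - timedelta(minutes=2), True, now, timedelta(minutes=5)))
```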

Operator control endpoints

All control endpoints emit operator.action.* audit events. See observability/api-reference for the full endpoint list.

Endpoint                      Action
POST /controls/pause          Pause all pipeline processing
POST /controls/resume         Resume pipeline processing
POST /controls/autonomy       Change autonomy level
POST /tasks/{id}/cancel       Cancel a running or paused task
POST /tasks/{id}/pause        Pause a running task
POST /tasks/{id}/resume       Resume a paused task
POST /tasks/{id}/retry        Retry a failed task
PATCH /watchers/{id}          Enable or disable a watcher
PATCH /rules/{id}             Enable or disable a rule
PATCH /integrations/{id}      Enable or disable an integration
POST /approvals/{id}/approve  Approve a pending gate
POST /approvals/{id}/deny     Deny a pending gate
POST /alarms/{id}/ack         Acknowledge an open alarm
POST /alarms/{id}/resolve     Resolve an alarm
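
For illustration, a client might build the alarm ack and resolve calls as follows. The base URL, the empty request body, and the helper name alarm_action_request are assumptions made for this sketch; the source does not describe authentication or request payloads, so none are shown.

```python
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical host and port


def alarm_action_request(
    alarm_id: str, action: str, base_url: str = BASE_URL
) -> Request:
    """Build a POST request for /alarms/{id}/ack or /alarms/{id}/resolve."""
    if action not in ("ack", "resolve"):
        raise ValueError(f"unsupported alarm action: {action}")
    # Assumption: the endpoints accept an empty body.
    return Request(f"{base_url}/alarms/{alarm_id}/{action}", data=b"", method="POST")


if __name__ == "__main__":
    req = alarm_action_request("a1b2", "ack")
    print(req.get_method(), req.full_url)
    # Sending is left out here; urllib.request.urlopen(req) would perform the call.
```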