Failure Modes

Catalogue of known failure modes with detection, system response, operator signal, and recovery path.

Every failure mode has a detection method, a system response, an operator signal, and a recovery path. "Check the logs" is not an acceptable detection method for any of these.

Pipeline failures

F-01: Normalizer receives malformed adapter payload

Detection: Pydantic validation error raised in normalizer.py
System response: Discard raw payload; emit AuditEvent("event.ingested", outcome="failure")
Operator signal: Alarm raised after repeated failures from same connector_id
Recovery: Fix adapter payload format; re-submit event if required
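
A minimal sketch of this handler, assuming the normalizer validates with a Pydantic model. NormalizedEvent, normalize(), and the emit_audit() stub are illustrative names for this document, not the actual normalizer.py API:

```python
from pydantic import BaseModel, ValidationError

def emit_audit(event_type: str, outcome: str, detail: dict) -> None:
    """Stand-in for the real AuditWriter; prints instead of persisting."""
    print(event_type, outcome, detail)

class NormalizedEvent(BaseModel):
    connector_id: str
    external_id: str
    body: str

def normalize(raw: dict) -> NormalizedEvent | None:
    try:
        return NormalizedEvent(**raw)
    except ValidationError as exc:
        # Discard the raw payload; the audit event is the only trace left.
        emit_audit("event.ingested", outcome="failure",
                   detail={"connector_id": raw.get("connector_id"),
                           "errors": exc.errors()})
        return None
```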

F-02: Spurious dedup suppression (hash collision)

Detection: User reports missing event; GET /audit?type=event.deduped shows unexpected entry
System response: Event discarded silently; event.deduped audit event emitted
Operator signal: Visible in audit log; no alarm by default
Recovery: Clear dedupe_key from events table; re-submit the event

F-03: LLM fallback timeout

Detection: Timeout exception in routing/llm_fallback.py
System response: Emit AuditEvent("routing.decided", outcome="failure"); route to "clarification needed" task; notify operator
Operator signal: Task visible at GET /tasks; alarm on repeated failures
Recovery: Automatic retry is not applied; operator reviews and re-submits if appropriate
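
A sketch of the timeout path, assuming the fallback call is async. call_llm(), create_task(), both stubs, and the 30-second timeout are assumptions for illustration only:

```python
import asyncio

LLM_TIMEOUT_S = 30.0  # illustrative; the real timeout is configuration

async def call_llm(event_id: str) -> str:
    """Stand-in for the fallback client in routing/llm_fallback.py."""
    await asyncio.sleep(0)
    return "route:default"

def emit_audit(event_type, outcome, detail): print(event_type, outcome, detail)
def create_task(kind, **ctx): print("task created:", kind, ctx)  # stand-in

async def route_with_fallback(event_id: str) -> str:
    try:
        return await asyncio.wait_for(call_llm(event_id), timeout=LLM_TIMEOUT_S)
    except asyncio.TimeoutError:
        emit_audit("routing.decided", outcome="failure",
                   detail={"event_id": event_id})
        # No automatic retry: park the event on a clarification task instead.
        create_task("clarification_needed", event_id=event_id)
        raise
```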

Task engine failures

F-04: Step function raises unexpected exception

Detection: Unhandled exception caught by step_runner.run()
System response: Emit task.step_unexpected_error; mark step failed; mark task failed
Operator signal: Task visible at GET /tasks/{id} with error detail; alarm if stuck
Recovery: Fix step implementation; POST /tasks/{id}/retry
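
One way the catch-all in step_runner.run() could look. The Step and Task dataclasses and the emit_audit() stub are stand-ins, not the engine's real types:

```python
from dataclasses import dataclass
from typing import Callable

def emit_audit(event_type, outcome, detail): print(event_type, outcome, detail)

@dataclass
class Step:
    name: str
    fn: Callable[["Task"], None]
    status: str = "pending"

@dataclass
class Task:
    id: str
    status: str = "running"

def run(step: Step, task: Task) -> None:
    try:
        step.fn(task)
        step.status = "succeeded"
    except Exception as exc:  # anything the step did not handle itself
        emit_audit("task.step_unexpected_error", outcome="failure",
                   detail={"task_id": task.id, "step": step.name,
                           "error": repr(exc)})
        step.status = "failed"
        task.status = "failed"  # surfaced at GET /tasks/{id} with error detail
```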

F-05: Task stuck (no step progress)

Detection: Alarm trigger: status = running AND updated_at < now - threshold
System response: Alarm raised; task remains in running state
Operator signal: Alarm at GET /alarms; visible in proj_active_tasks
Recovery: POST /tasks/{id}/cancel or POST /tasks/{id}/retry after investigation
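
The alarm trigger translates directly into a query. A sketch against SQLite, with the tasks schema and the 15-minute threshold assumed:

```python
import sqlite3

STUCK_THRESHOLD = "-15 minutes"  # illustrative; real threshold is configuration

def find_stuck_tasks(conn: sqlite3.Connection) -> list[str]:
    """Return ids of tasks that claim to be running but have made no progress."""
    rows = conn.execute(
        "SELECT id FROM tasks "
        "WHERE status = 'running' AND updated_at < datetime('now', ?)",
        (STUCK_THRESHOLD,),
    ).fetchall()
    return [row[0] for row in rows]
```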

F-06: Crash mid-step; unknown outcome

Detection: On startup reconciliation: step with status = running and no entry in idempotency_outcomes
System response: Store unknown outcome; reset step to pending; increment attempt counter; emit tool_call.unknown
Operator signal: Visible in audit log at GET /audit?trace_id={trace_id}
Recovery: Automatic: task engine re-attempts step on next loop iteration
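
A reconciliation sketch following the four actions above. The steps and idempotency_outcomes schemas are assumed from context:

```python
import sqlite3

def emit_audit(event_type, outcome, detail): print(event_type, outcome, detail)

def reconcile_on_startup(conn: sqlite3.Connection) -> None:
    # A step still 'running' with no recorded outcome means we crashed mid-step.
    rows = conn.execute(
        "SELECT s.id, s.attempt FROM steps s "
        "LEFT JOIN idempotency_outcomes o ON o.step_id = s.id "
        "WHERE s.status = 'running' AND o.step_id IS NULL"
    ).fetchall()
    for step_id, attempt in rows:
        conn.execute(
            "INSERT INTO idempotency_outcomes (step_id, outcome) "
            "VALUES (?, 'unknown')", (step_id,))
        conn.execute(
            "UPDATE steps SET status = 'pending', attempt = ? WHERE id = ?",
            (attempt + 1, step_id))
        emit_audit("tool_call.unknown", outcome="unknown",
                   detail={"step_id": step_id})
    conn.commit()
```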

F-07: Task deadlocked on expired approval

Detection: Approval-wait loop finds approval.status = expired for a waiting task
System response: Per configuration: fail the step, or create a new Approval and notify operator
Operator signal: Approval visible at GET /approvals; task blocked at GET /tasks/{id}
Recovery: POST /approvals/{id}/approve on new approval, or POST /tasks/{id}/cancel

Tool runtime failures

F-08: Tool unavailable (integration down)

Detection: tool.health.status = unavailable in ToolRegistry
System response: Fail fast on tool executor lookup; emit tool_call.failed
Operator signal: Integration health at GET /integrations; alarm if consecutive errors ≥ threshold
Recovery: Fix the underlying integration, then PATCH /integrations/{id} to re-enable it

F-09: Rate limited by provider (429)

Detection: 429 response from provider
System response: Set rate_limit_resets_at on ToolHealth; retry after reset; emit tool_call.failed with retryable flag
Operator signal: Visible via GET /integrations; tool calls fail with retryable error until window clears
Recovery: Automatic: task engine backs off to rate_limit_resets_at
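
A sketch of the backoff bookkeeping, assuming the provider sends a Retry-After header. The ToolHealth dataclass here is a stand-in for the real one:

```python
import time
from dataclasses import dataclass

def emit_audit(event_type, outcome, detail): print(event_type, outcome, detail)

@dataclass
class ToolHealth:
    rate_limit_resets_at: float | None = None  # epoch seconds

def handle_response(status: int, headers: dict, health: ToolHealth) -> None:
    if status == 429:
        retry_after = float(headers.get("Retry-After", "60"))
        health.rate_limit_resets_at = time.time() + retry_after
        emit_audit("tool_call.failed", outcome="failure",
                   detail={"retryable": True,
                           "resets_at": health.rate_limit_resets_at})

def may_call(health: ToolHealth) -> bool:
    """The task engine backs off until the rate-limit window clears."""
    return (health.rate_limit_resets_at is None
            or time.time() >= health.rate_limit_resets_at)
```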

F-10: Idempotency key collision (different calls, same key)

Detection: idempotency_outcomes returns a prior result for a genuinely different call
System response: Return prior result without re-executing; emit tool_call.deduped
Operator signal: Unexpected tool_call.deduped visible in audit
Recovery: Investigate key generation strategy; see architecture/task-engine for key construction rules
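
For illustration only, one plausible key construction: hash the tool name plus the canonicalized arguments. The authoritative rules live in architecture/task-engine. Note that with a construction like this, collisions in practice come from a key that omits a distinguishing field (so two different calls canonicalize identically), not from the hash function itself:

```python
import hashlib
import json

def idempotency_key(tool: str, args: dict) -> str:
    """Hypothetical key scheme: sha256 over tool name + sorted-key JSON args."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tool}:{canonical}".encode()).hexdigest()
```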

F-11: MCP server disconnects mid-call

Detection: Transport exception in MCPConnectionManager during an active call
System response: Store unknown outcome in idempotency_outcomes; emit tool_call.unknown; begin reconnect with backoff
Operator signal: mcp.disconnected in audit; integration health degrades within 30 seconds
Recovery: Automatic reconnect; on reconnect, startup reconciliation handles the unknown outcome
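
A reconnect-with-backoff sketch. MCPConnectionManager's real interface is not shown in this document, so connect() and the delay schedule are assumptions:

```python
import time
from typing import Callable

def reconnect_with_backoff(connect: Callable[[], None],
                           max_delay: float = 30.0) -> None:
    """Retry connect() with exponential backoff, capped at max_delay seconds."""
    delay = 1.0
    while True:
        try:
            connect()
            return  # reconciliation (as in F-06) then resolves the unknown outcome
        except OSError:
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```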

Scheduler and watcher failures

F-12: Schedule fires during process downtime

Detection: On startup: next_run_at < now for enabled schedules
System response: Apply catch_up_policy; emit schedule.missed per skipped slot
Operator signal: schedule.missed events in audit; visible in proj_schedules
Recovery: Automatic; configurable via catch_up_policy per schedule
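
A sketch of startup catch-up. The policy values "skip", "run_once", and "run_all" are illustrative, not necessarily the documented set; fire() and the Schedule shape are stand-ins:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

def emit_audit(event_type, outcome, detail): print(event_type, outcome, detail)
def fire(schedule, slot): print("firing", schedule.name, "for", slot)  # stand-in

@dataclass
class Schedule:
    name: str
    next_run_at: datetime
    interval: timedelta
    catch_up_policy: str  # "skip" | "run_once" | "run_all" (illustrative)

def reconcile_schedule(schedule: Schedule, now: datetime) -> None:
    missed = []
    while schedule.next_run_at < now:
        missed.append(schedule.next_run_at)
        schedule.next_run_at += schedule.interval
    if schedule.catch_up_policy == "run_all":
        for slot in missed:
            fire(schedule, slot)
        return
    if schedule.catch_up_policy == "run_once" and missed:
        fire(schedule, missed.pop())  # run the most recent slot only
    for slot in missed:  # whatever remains was skipped
        emit_audit("schedule.missed", outcome="skipped",
                   detail={"slot": slot.isoformat()})
```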

F-13: Watcher tick raises exception

Detection: Exception from tick() caught by watcher loop
System response: Increment consecutive_errors; emit watcher.error; raise alarm if threshold exceeded
Operator signal: Alarm at GET /alarms; watcher state at GET /watchers
Recovery: Fix watcher implementation; PATCH /watchers/{id} to re-enable
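
The error-accounting loop, sketched. The threshold value, the alarm stub, and the watcher shape are assumed:

```python
def emit_audit(event_type, outcome, detail): print(event_type, outcome, detail)
def raise_alarm(kind, **ctx): print("ALARM:", kind, ctx)  # stand-in

ERROR_THRESHOLD = 5  # illustrative; the real threshold is configuration

def run_tick(watcher) -> None:
    try:
        watcher.tick()
        watcher.consecutive_errors = 0  # a good tick clears the streak
    except Exception as exc:
        watcher.consecutive_errors += 1
        emit_audit("watcher.error", outcome="failure",
                   detail={"watcher_id": watcher.id, "error": repr(exc)})
        if watcher.consecutive_errors >= ERROR_THRESHOLD:
            raise_alarm("watcher_failing", watcher_id=watcher.id)
```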

F-14: Rule storm (rule firing too frequently)

Detection: Alarm trigger: triggered_count_per_minute > threshold
System response: Suppress further notifications; emit gate.storm_block; raise alarm
Operator signal: Alarm at GET /alarms; rule hit count at GET /rules
Recovery: PATCH /rules/{id} to disable or increase debounce_ms; resolve alarm
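
A sliding-window storm gate, sketched. The one-minute window matches the trigger above; everything else (class name, threshold) is illustrative:

```python
import time
from collections import deque

def emit_audit(event_type, outcome, detail): print(event_type, outcome, detail)

class StormGate:
    def __init__(self, per_minute_threshold: int = 60):
        self.threshold = per_minute_threshold
        self.fired: deque[float] = deque()  # timestamps of recent firings

    def allow(self, rule_id: str) -> bool:
        now = time.time()
        self.fired.append(now)
        # Drop firings older than the one-minute window.
        while self.fired and self.fired[0] < now - 60:
            self.fired.popleft()
        if len(self.fired) > self.threshold:
            emit_audit("gate.storm_block", outcome="blocked",
                       detail={"rule_id": rule_id})
            return False  # suppress the notification, keep counting
        return True
```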

Observability failures

F-15: Audit write fails (DB error)

Detection: DB exception on AuditWriter.emit()
System response: Pipeline operation fails; transaction rolls back; error logged to stderr
Operator signal: Process error log; GET /health returns degraded if DB is consistently unavailable
Recovery: Resolve DB issue; the event that failed to audit may need to be re-submitted

F-16: Secret exposed in audit payload (redaction miss)

Detection: Redaction policy evaluation raises a detection alarm on known secret patterns
System response: Alarm raised immediately with critical severity
Operator signal: Alarm at GET /alarms; type = secret_leak_detected
Recovery: Rotate exposed credential immediately; patch redaction policy; audit affected records
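
A last-line-of-defense scan, sketched. The two patterns shown (AWS access key IDs, PEM private-key headers) are examples only, not the real policy set:

```python
import re

def raise_alarm(kind, **ctx): print("ALARM:", kind, ctx)  # stand-in

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key id
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key
]

def scan_for_secrets(payload: str) -> bool:
    """Return True and raise a critical alarm if a known pattern leaks through."""
    for pattern in SECRET_PATTERNS:
        if pattern.search(payload):
            raise_alarm("secret_leak_detected", severity="critical")
            return True
    return False
```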