Service Monitor Service

The Service Monitor is a Go-based microservice responsible for assessing the health and stability of deployed services running in a Kubernetes cluster. It operates by watching all pods in a target namespace over a configurable time window, tracking container state transitions in real time, and publishing a health report back to SONATA when the watch window expires. Intermediate reports can also be emitted periodically during the watch window so that SONATA can observe the service converging to a healthy state.

API

On the northbound side, the Service Monitor listens on a Kafka topic to which SONATA publishes diagnosis requests. Each request carries the kubeconfig needed to connect to the target cluster, the namespace to watch, and the duration of the observation window. The Service Monitor never holds cluster credentials itself: the kubeconfig is supplied with each request.

Table 1: Service Monitor Kafka API.

| Northbound integration with | Using Kafka topic | Leads to southbound call | Purpose |
| --- | --- | --- | --- |
| SONATA | monitor | Kubernetes Watch API | Watch all pods in a namespace and assess their health and stability |

Table 2: Service Monitor response topics.

| Southbound integration with | Using Kafka topic | Purpose |
| --- | --- | --- |
| SONATA | monitor-result | Publish the final health report when the watch window closes, and optionally intermediate reports during the window |

Diagnosis Flow

SONATA publishes a diagnosis request to the monitor topic. The Service Monitor builds a Kubernetes client from the supplied kubeconfig, opens a streaming pod watch against the target namespace, and tracks container state changes until the watch window expires.

The incoming message is a Kogito-wrapped DiagnosisRequest:

{
  "specversion": "1.0",
  "id": "21000e26-31eb-43e7-8343-88a696fd96b1",
  "source": "/process/serviceMonitorProcess",
  "type": "monitor",
  "kogitoprocinstanceid": "7aa45064-f4a7-4a25-a7a2-667afb07c3d1",
  "data": {
    "serviceOrderId": "2c4fecbc-c436-4f8c-a5db-4a1159ebad2a",
    "serviceOrderItemId": "4e0abab6-25d4-4204-b6e1-aea09f642d49",
    "serviceId": "790daa4d-7f56-4575-8b64-c558cf518bed",
    "kubeConfig": "<path-or-inline-kubeconfig>",
    "namespace": "790daa4d-7f56-4575-8b64-c558cf518bed",
    "watchForSec": 120,
    "reportsFrequency": 30
  }
}

Field constraints enforced at validation:

| Field | Constraint |
| --- | --- |
| kubeConfig | Must be at least 3 characters |
| namespace | Must be at least 3 characters |
| watchForSec | Minimum 60 seconds |
| reportsFrequency | When non-zero, an intermediate health report is emitted via monitor-result each time this many seconds have elapsed since the last report |

When reportsFrequency is zero, only the final report is published.
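The constraints above could be checked along the following lines. This is a minimal sketch, not the service's actual code: the Go type and function names (DiagnosisRequest, envelope, Validate, ParseRequest) and the error messages are assumptions for illustration; only the JSON field names and the limits come from the request and table above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// DiagnosisRequest mirrors the "data" object of the incoming message.
// The Go identifiers are hypothetical; the JSON keys follow the sample request.
type DiagnosisRequest struct {
	ServiceOrderID     string `json:"serviceOrderId"`
	ServiceOrderItemID string `json:"serviceOrderItemId"`
	ServiceID          string `json:"serviceId"`
	KubeConfig         string `json:"kubeConfig"`
	Namespace          string `json:"namespace"`
	WatchForSec        int    `json:"watchForSec"`
	ReportsFrequency   int    `json:"reportsFrequency"`
}

// envelope models only the part of the wrapper that this sketch needs.
type envelope struct {
	Data DiagnosisRequest `json:"data"`
}

// Validate applies the documented field constraints and reports every violation.
func (r DiagnosisRequest) Validate() error {
	var errs []error
	if len(r.KubeConfig) < 3 {
		errs = append(errs, errors.New("kubeConfig must be at least 3 characters"))
	}
	if len(r.Namespace) < 3 {
		errs = append(errs, errors.New("namespace must be at least 3 characters"))
	}
	if r.WatchForSec < 60 {
		errs = append(errs, fmt.Errorf("watchForSec must be at least 60, got %d", r.WatchForSec))
	}
	// errors.Join returns nil when no violation was collected.
	return errors.Join(errs...)
}

// ParseRequest decodes the wrapped message and validates its payload.
func ParseRequest(raw []byte) (DiagnosisRequest, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return DiagnosisRequest{}, fmt.Errorf("decode request: %w", err)
	}
	if err := env.Data.Validate(); err != nil {
		return DiagnosisRequest{}, err
	}
	return env.Data, nil
}
```

Under these assumptions, a request with watchForSec set to 30 is rejected before any cluster connection is attempted, while a reportsFrequency of zero passes validation and simply disables intermediate reports.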

The Service Monitor opens a Kubernetes pod watch scoped to the given namespace with a context timeout equal to watchForSec. For every pod event received from the Kubernetes API, it updates the in-memory health state of each container in that pod. When the context deadline is reached, the watch channel closes, which triggers the final report.

Upon successful completion, the handler publishes a final result to monitor-result:

{
  "specversion": "1.0",
  "id": "21000e26-31eb-43e7-8343-88a696fd96b1",
  "source": "/process/serviceMonitorProcess",
  "type": "monitor-result",
  "kogitoprocrefid": "7aa45064-f4a7-4a25-a7a2-667afb07c3d1",
  "data": {
    "status": "SUCCESS",
    "message": "Diagnosis for namespace '790daa4d-7f56-4575-8b64-c558cf518bed' finished!",
    "monitorResults": {
      "running": true,
      "isStable": true,
      "description": "[ns: '790daa4d-...', pod: 'my-app-xyz', container: 'my-app']: Running."
    }
  }
}

On failure (e.g. invalid kubeconfig, unreachable cluster), the handler publishes an error response and the monitorResults field is null:

{
  "specversion": "1.0",
  "id": "21000e26-31eb-43e7-8343-88a696fd96b1",
  "source": "/process/serviceMonitorProcess",
  "type": "monitor-result",
  "kogitoprocrefid": "7aa45064-f4a7-4a25-a7a2-667afb07c3d1",
  "data": {
    "status": "FAILURE",
    "message": "Could not build config from flags: ...",
    "monitorResults": null
  }
}

Health Assessment Logic

Container health is evaluated per event received from the Kubernetes Watch API. The aggregated result across all containers in the namespace is reported as a single Diagnosis object.

Container-level health (ContainerState):

A container is considered healthy (IsHealthy: true) when its Kubernetes state is Running (i.e. state.Running != nil). If the container is Terminated or Waiting, it is considered unhealthy and the reason code from the Kubernetes state (e.g. CrashLoopBackOff, OOMKilled, Completed) is captured as the container's description.

A container is considered stable when its IsHealthy value has not changed for at least 60 seconds. Any transition between healthy and unhealthy resets the stability timer.

Namespace-level aggregation (Diagnosis):

| Field | Value |
| --- | --- |
| running | true only if every container in the namespace is healthy |
| isStable | true only if every container in the namespace is stable |
| description | A human-readable summary line per container, listing its current state, namespace, pod name, and container name |

The description field includes a line for every tracked container. Healthy and stable containers are reported as Running. Healthy but unstable containers report as Running but Unstable with the timestamp of the last health change. Unhealthy containers report the Kubernetes reason code.

The in-memory watch state is scoped to the namespace of the request and is cleaned up automatically once the watch window closes, so concurrent diagnosis requests against different namespaces are isolated from one another.
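One way to obtain this kind of isolation is a registry keyed by namespace that hands out a fresh state per watch together with a release function. The sketch below is purely illustrative: Registry, WatchState, Open, and Active are hypothetical names, and the text does not describe how the service actually stores its state or how it handles two overlapping requests for the same namespace.

```go
package main

import "sync"

// WatchState is a hypothetical per-namespace store: container key -> healthy.
type WatchState struct {
	Containers map[string]bool
}

// Registry keeps one WatchState per namespace being watched.
type Registry struct {
	mu          sync.Mutex
	byNamespace map[string]*WatchState
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{byNamespace: make(map[string]*WatchState)}
}

// Open creates a fresh state for the namespace and returns it together with a
// release function to call (typically via defer) when the watch window closes.
func (r *Registry) Open(namespace string) (*WatchState, func()) {
	r.mu.Lock()
	defer r.mu.Unlock()

	st := &WatchState{Containers: make(map[string]bool)}
	r.byNamespace[namespace] = st

	release := func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		// Delete only if the entry still belongs to this watch.
		if r.byNamespace[namespace] == st {
			delete(r.byNamespace, namespace)
		}
	}
	return st, release
}

// Active reports how many namespaces currently have watch state.
func (r *Registry) Active() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.byNamespace)
}
```

Because each watch only touches the state it was handed, updates for one namespace cannot leak into the report of another, and calling the release function at the end of the window keeps the registry from growing over time.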