Observability & Monitoring: The Unified Blueprint

To ensure consistent operational visibility across `wealthview`, `hud`, `operatorvoice`, and `Wikantik`, all services must adhere to this unified monitoring standard. We prioritize **OpenTelemetry (OTel)** for instrumentation and **Prometheus/Grafana** for collection and visualization.

1. Metric Standards: The RED and USE Methods

Every service in the ecosystem must emit metrics that follow one or both of these industry-standard methods, depending on its role.

1.1 The RED Method (Request-Driven Services)

For synchronous services (APIs, Gateways), track:

- **(R)ate**: Number of requests per second.

- **(E)rrors**: Number of failed requests (non-2xx status for HTTP; any status other than `OK`, code 0, for gRPC).

- **(D)uration**: Time taken to process requests (P95 and P99 latencies).
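
The arithmetic behind the three RED signals can be sketched in plain Python. This is only an illustration: in practice the metrics backend derives these values from counters and histograms. The `Request` record, the nearest-rank percentile method, and the sample data are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code (hypothetical record shape)
    duration_ms: float   # time spent handling the request


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize one observation window into Rate, Errors, and Duration."""
    # Errors follow the rule above: anything outside 2xx counts as failed.
    errors = sum(1 for r in requests if not 200 <= r.status < 300)
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# 100 made-up requests in a 10-second window: 97 fast successes, 3 slow failures.
window = [Request(200, 20.0)] * 97 + [Request(503, 900.0)] * 3
summary = red_summary(window, window_seconds=10)
print(summary)
```

With this sample the P95 is still 20 ms while the P99 jumps to 900 ms, which is why both percentiles are tracked: a small slow tail is invisible at P95.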

1.2 The USE Method (Resource-Driven Components)

For infrastructure or background workers, track:

- **(U)tilization**: Percentage of time the resource was busy servicing work (e.g., CPU, Thread Pool).

- **(S)aturation**: The degree to which extra work is queued (e.g., Queue Depth, Disk I/O Wait).

- **(E)rrors**: Count of error events at the resource level.
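
As a rough sketch of how the USE signals could be derived for a background worker pool, the snippet below reduces periodic samples to the three values. The `PoolSample` fields, the pool size, and the sample numbers are hypothetical; a real service would export the raw gauges and counters and let Prometheus aggregate them.

```python
from dataclasses import dataclass


@dataclass
class PoolSample:
    busy_workers: int   # workers executing a task at sample time
    queued_tasks: int   # tasks waiting for a free worker
    failed_tasks: int   # tasks that failed since the previous sample


def use_summary(samples, pool_size):
    """Reduce periodic samples of a worker pool to Utilization, Saturation, Errors."""
    n = len(samples)
    return {
        # Utilization: average fraction of workers that were busy.
        "utilization": sum(s.busy_workers for s in samples) / (n * pool_size),
        # Saturation: backlog the pool could not service yet.
        "avg_queue_depth": sum(s.queued_tasks for s in samples) / n,
        "max_queue_depth": max(s.queued_tasks for s in samples),
        # Errors: total error events over the window.
        "errors": sum(s.failed_tasks for s in samples),
    }


# Four made-up samples from a pool of 8 workers.
samples = [
    PoolSample(busy_workers=4, queued_tasks=0, failed_tasks=0),
    PoolSample(busy_workers=8, queued_tasks=6, failed_tasks=1),
    PoolSample(busy_workers=8, queued_tasks=12, failed_tasks=0),
    PoolSample(busy_workers=4, queued_tasks=2, failed_tasks=1),
]
summary = use_summary(samples, pool_size=8)
print(summary)
```

Note that the pool averages 75% utilization yet still queues up to 12 tasks at peak, so utilization and saturation need to be read together.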

2. Implementation: OpenTelemetry (OTel)

Services must use the OTel SDK to ensure vendor-neutrality.

2.1 Standard Resource Attributes

Every exported metric must carry these standard resource attributes to enable unified filtering in Grafana:

```yaml
resource_attributes:
  service.name: "wealthview-api"
  service.namespace: "prod"
  deployment.environment: "production"
  host.name: "${HOSTNAME}"
```
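
One way to supply these attributes without hard-coding them is the standard `OTEL_RESOURCE_ATTRIBUTES` environment variable, which OTel SDKs read as comma-separated `key=value` pairs. The sketch below builds and validates that string using only the standard library. The helper names `resource_attributes` and `to_otel_env` are hypothetical, and the sketch does not percent-encode values, so it assumes values contain no commas or equals signs.

```python
import os

REQUIRED_KEYS = (
    "service.name",
    "service.namespace",
    "deployment.environment",
    "host.name",
)


def resource_attributes(service_name, namespace, environment, env=os.environ):
    """Build the four standard attributes; host.name is read from HOSTNAME."""
    attrs = {
        "service.name": service_name,
        "service.namespace": namespace,
        "deployment.environment": environment,
        "host.name": env.get("HOSTNAME", ""),
    }
    missing = [key for key in REQUIRED_KEYS if not attrs[key]]
    if missing:
        raise ValueError(f"missing resource attributes: {', '.join(missing)}")
    return attrs


def to_otel_env(attrs):
    """Serialize as key=value pairs for OTEL_RESOURCE_ATTRIBUTES (no escaping)."""
    return ",".join(f"{key}={value}" for key, value in attrs.items())


# The env mapping is passed explicitly here so the example is reproducible.
attrs = resource_attributes(
    "wealthview-api", "prod", "production", env={"HOSTNAME": "node-1"}
)
encoded = to_otel_env(attrs)
print(encoded)
```

Failing fast on a missing attribute keeps a misconfigured replica from exporting metrics that cannot be filtered later.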

2.2 Prometheus Metric Naming

Use the following naming convention to prevent metric collisions:

- `<service_name>_<subsystem>_<unit>_<type>`

- *Example*: `wealthview_ingestion_transactions_total` (Counter)

- *Example*: `hud_render_latency_ms_bucket` (Histogram; the client library appends the `_bucket` suffix to the base name `hud_render_latency_ms`)
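
The convention can be enforced with a small helper, sketched below under a few assumptions: the function name `metric_name` is hypothetical, parts are lowercased with hyphens turned into underscores, and the result is checked against the classic Prometheus metric-name pattern `[a-zA-Z_:][a-zA-Z0-9_:]*`.

```python
import re

# Classic Prometheus metric-name pattern.
PROMETHEUS_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def metric_name(service, subsystem, unit, metric_type):
    """Join the four parts as <service_name>_<subsystem>_<unit>_<type>."""
    parts = [service, subsystem, unit, metric_type]
    cleaned = [p.strip().lower().replace("-", "_") for p in parts]
    if not all(cleaned):
        raise ValueError("every part of the metric name must be non-empty")
    name = "_".join(cleaned)
    if not PROMETHEUS_NAME.fullmatch(name):
        raise ValueError(f"invalid Prometheus metric name: {name!r}")
    return name


counter = metric_name("wealthview", "ingestion", "transactions", "total")
histogram = metric_name("hud", "render-latency", "ms", "bucket")
print(counter)
print(histogram)
```

The two calls reproduce the examples above: `wealthview_ingestion_transactions_total` and `hud_render_latency_ms_bucket`.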

3. Health Check Integration

Metrics alone are insufficient. Services must implement the **Health Check Triad** as defined in [HealthCheckPatterns](HealthCheckPatterns):

| Probe | Path | Logic |
| :--- | :--- | :--- |
| **Startup** | `/health/startup` | Returns 200 after internal caches/DB migrations are complete. |
| **Readiness** | `/health/ready` | Checks downstream connectivity (e.g., Plaid API, Redis). |
| **Liveness** | `/health/live` | Minimal check (e.g., thread-pool heartbeat). |
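
A minimal sketch of the three probes, using only the Python standard library, might look like the following. It is not the implementation from HealthCheckPatterns: the `ServiceState` fields, the 503 failure code, the JSON body shape, and the default port are assumptions chosen for illustration, and the dependency checks are placeholder callables.

```python
import json
from dataclasses import dataclass, field
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


@dataclass
class ServiceState:
    started: bool = False        # caches warm, DB migrations applied
    heartbeat_ok: bool = True    # maintained by the worker pool
    # name -> zero-argument callable returning True when the dependency is reachable
    dependencies: dict = field(default_factory=dict)


def evaluate_probe(path, state):
    """Return (status_code, body) for one of the three probe paths."""
    detail = {}
    if path == "/health/startup":
        ok = state.started
    elif path == "/health/ready":
        detail = {name: bool(check()) for name, check in state.dependencies.items()}
        ok = state.started and all(detail.values())
    elif path == "/health/live":
        ok = state.heartbeat_ok
    else:
        return 404, {"status": "unknown probe"}
    return (200 if ok else 503), {"status": "ok" if ok else "fail", "checks": detail}


def make_handler(state):
    """Bind a ServiceState to a request handler class."""
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            code, body = evaluate_probe(self.path, state)
            payload = json.dumps(body).encode()
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler


def serve(state, port=8080):
    """Blocking helper; call from a dedicated thread or process."""
    ThreadingHTTPServer(("", port), make_handler(state)).serve_forever()
```

Keeping the probe logic in `evaluate_probe`, separate from the HTTP handler, lets it be unit-tested without opening a socket. The liveness branch deliberately ignores downstream dependencies so that an outage in, say, Redis does not trigger restarts of otherwise healthy replicas.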

4. Grafana Visualization Standards

A "Golden Signal" dashboard must exist for every project, containing:

1. **Traffic Overview**: RED metrics for the ingress layer.

2. **Saturation Heatmaps**: USE metrics for database and message brokers.

3. **Error Breakdown**: Rate of 4xx vs 5xx errors, or "Drift" alerts for AI models.

4. **Health Ribbon**: Status of the 3 probes (Startup, Readiness, Liveness) across all replicas.
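
For the Traffic Overview panel, the PromQL behind the RED metrics could be generated along these lines. The metric names, the `status` label, and the 5-minute window are hypothetical; only the PromQL functions themselves (`rate`, `sum`, `histogram_quantile`) are standard.

```python
def red_queries(requests_metric, duration_metric, window="5m"):
    """Build PromQL strings for a RED panel from a counter and a histogram."""

    def quantile(q):
        return (
            f"histogram_quantile({q}, "
            f"sum by (le) (rate({duration_metric}_bucket[{window}])))"
        )

    total = f"sum(rate({requests_metric}[{window}]))"
    # Errors are non-2xx responses; the `status` label name is an assumption.
    failed = f'sum(rate({requests_metric}{{status!~"2.."}}[{window}]))'
    return {
        "rate": total,
        "error_ratio": f"{failed} / {total}",
        "p95": quantile(0.95),
        "p99": quantile(0.99),
    }


queries = red_queries(
    "wealthview_api_requests_total",
    "wealthview_api_request_duration_ms",
)
for panel, expr in queries.items():
    print(f"{panel}: {expr}")
```

Generating the expressions from two metric names keeps the panels consistent across projects, since each project only has to supply its own counter and histogram.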

5. RAG Implementation Hook

For an agent instrumenting a new service (e.g., **operatorvoice**), the prompt should be:

> "Following the `ObservabilityAndMonitoringBlueprint`, instrument this Python service with OpenTelemetry to track RED metrics for the voice-interaction loops and expose a `/health/ready` endpoint that verifies the STT and TTS service connectivity."

See Also

- [HealthCheckPatterns](HealthCheckPatterns) — Deep-dive on probe implementation.

- [CloudMonitoring](CloudMonitoring) — Survey of available cloud tools.

- [AiObservabilityInProduction](AiObservabilityInProduction) — Specific metrics for LLM and Agentic systems.