Default Alert Rules

These alert rules are predefined monitoring templates that cover the infrastructure, Kubernetes cluster, and application layers. Scoutflo ships them "out of the box" and activates them automatically when a cluster is created, so customers have critical monitoring in place from day one without any manual configuration.


Alert Rules List

1. Infrastructure-Level Alerts

These monitor the performance and health of the underlying physical or virtual machines hosting the cluster.

  • Node CPU Usage

    • Metric: node_cpu_seconds_total

    • Formula: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

    • Trigger: Alerts when CPU usage exceeds 80% for 5 minutes.

    • Use Case: High CPU usage can degrade system performance, indicating resource exhaustion or runaway processes.
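
    • Example rule: a minimal sketch of this alert as a Prometheus alerting rule. The group name and annotation text are illustrative; the threshold and duration mirror the trigger above.

      groups:
        - name: infrastructure-alerts
          rules:
            - alert: NodeHighCPUUsage
              # Fires once CPU usage has stayed above 80% for 5 minutes.
              expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
              for: 5m
              annotations:
                summary: "CPU usage above 80% on {{ $labels.instance }}"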

  • Node Memory Usage

    • Metric: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

    • Formula: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

    • Trigger: Alerts when memory usage exceeds 80% for 5 minutes.

    • Use Case: Helps identify memory leaks or over-provisioned workloads that can lead to application failures.
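
    • Example rule (same illustrative pattern; group name and annotation are assumptions):

      groups:
        - name: infrastructure-alerts
          rules:
            - alert: NodeHighMemoryUsage
              # (1 - available/total) * 100 = percentage of memory in use.
              expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
              for: 5m
              annotations:
                summary: "Memory usage above 80% on {{ $labels.instance }}"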

  • Disk Space Usage

    • Metric: node_filesystem_avail_bytes / node_filesystem_size_bytes

    • Formula: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

    • Trigger: Alerts when available disk space drops below 20% for 10 minutes.

    • Use Case: Prevents outages caused by insufficient disk space for logs, files, or system processes.
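
    • Example rule (illustrative; in practice you may also want a matcher that excludes pseudo-filesystems such as tmpfs):

      groups:
        - name: infrastructure-alerts
          rules:
            - alert: NodeLowDiskSpace
              # Percentage of the filesystem still available, per mount point.
              expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
              for: 10m
              annotations:
                summary: "Less than 20% disk space left on {{ $labels.instance }} ({{ $labels.mountpoint }})"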

  • Node Load

    • Metric: node_load1

    • Formula: node_load1 > count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) * 1.5

    • Trigger: Alerts when load exceeds 1.5 times the CPU count for 5 minutes.

    • Use Case: High load may indicate CPU contention, unoptimized workloads, or insufficient resources.
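
    • Example rule (illustrative group name and annotation):

      groups:
        - name: infrastructure-alerts
          rules:
            - alert: NodeHighLoad
              # count without (cpu, mode) yields the number of CPUs per node.
              expr: node_load1 > count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) * 1.5
              for: 5m
              annotations:
                summary: "Load on {{ $labels.instance }} exceeds 1.5x its CPU count"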


2. Kubernetes Cluster Alerts

These ensure the Kubernetes cluster operates as expected, monitoring pods, nodes, and deployments.

  • Pod Restarts

    • Metric: kube_pod_container_status_restarts_total

    • Formula: increase(kube_pod_container_status_restarts_total[1h]) > 3

    • Trigger: Alerts if a pod restarts more than 3 times in an hour.

    • Use Case: Frequent restarts often indicate issues like configuration errors, crashes, or resource constraints.
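
    • Example rule: an illustrative sketch; no for: clause is used because the expression already counts restarts over the last hour.

      groups:
        - name: kubernetes-alerts
          rules:
            - alert: PodFrequentRestarts
              expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
              annotations:
                summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in the last hour"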

  • Pod Not Ready

    • Metric: kube_pod_status_ready

    • Formula: kube_pod_status_ready{condition="true"} == 0

    • Trigger: Alerts when a pod remains in a "Not Ready" state for 5 minutes.

    • Use Case: Ensures critical applications are available and ready to handle requests.
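
    • Example rule (illustrative):

      groups:
        - name: kubernetes-alerts
          rules:
            - alert: PodNotReady
              expr: kube_pod_status_ready{condition="true"} == 0
              for: 5m
              annotations:
                summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Not Ready for 5 minutes"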

  • Deployment Replicas Mismatch

    • Metric: kube_deployment_spec_replicas vs. kube_deployment_status_replicas_available

    • Formula: kube_deployment_spec_replicas != kube_deployment_status_replicas_available

    • Trigger: Alerts if desired replicas do not match available replicas for 5 minutes.

    • Use Case: Identifies issues in scaling or deployment strategies.
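
    • Example rule (illustrative):

      groups:
        - name: kubernetes-alerts
          rules:
            - alert: DeploymentReplicasMismatch
              expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
              for: 5m
              annotations:
                summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has fewer available replicas than desired"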

  • Node Not Ready

    • Metric: kube_node_status_condition

    • Formula: kube_node_status_condition{condition="Ready",status="true"} == 0

    • Trigger: Alerts when a node is in a "Not Ready" state for 5 minutes.

    • Use Case: Detects node-level failures that could impact workload scheduling.
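
    • Example rule (illustrative):

      groups:
        - name: kubernetes-alerts
          rules:
            - alert: NodeNotReady
              expr: kube_node_status_condition{condition="Ready",status="true"} == 0
              for: 5m
              annotations:
                summary: "Node {{ $labels.node }} has been Not Ready for 5 minutes"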


3. Application-Level Alerts

These monitor the health and performance of user applications deployed on the cluster.

  • High Latency

    • Metric: http_request_duration_seconds

    • Formula: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

    • Trigger: Alerts when 95th percentile latency exceeds 0.5 seconds for 5 minutes.

    • Use Case: Ensures responsive user experiences by detecting slow requests.
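
    • Example rule: an illustrative sketch; it assumes the application exports the http_request_duration_seconds histogram.

      groups:
        - name: application-alerts
          rules:
            - alert: HighLatency
              # p95 over the last 5 minutes, computed from histogram buckets.
              expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
              for: 5m
              annotations:
                summary: "95th percentile request latency above 0.5s"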

  • High Error Rate

    • Metric: http_requests_total

    • Formula: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

    • Trigger: Alerts when error rates exceed 5% for 5 minutes.

    • Use Case: Helps catch application errors and prevent customer impact.
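
    • Example rule (illustrative; assumes responses carry a code label with the HTTP status):

      groups:
        - name: application-alerts
          rules:
            - alert: HighErrorRate
              # Fraction of requests returning 5xx; 0.05 = 5%.
              expr: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
              for: 5m
              annotations:
                summary: "More than 5% of requests are returning 5xx errors"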

  • Low Throughput

    • Metric: http_requests_total

    • Formula: sum(rate(http_requests_total[5m]))

    • Trigger: Alerts when request rates drop below 10 requests/second for 5 minutes.

    • Use Case: Detects underperforming or idle applications.
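
    • Example rule (illustrative):

      groups:
        - name: application-alerts
          rules:
            - alert: LowThroughput
              expr: sum(rate(http_requests_total[5m])) < 10
              for: 5m
              annotations:
                summary: "Total request rate below 10 req/s for 5 minutes"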


Product Flow and Use Case

  1. Cluster Creation

    • When users create a Kubernetes cluster through the platform, these alerts are enabled automatically.

    • Default alert rules require no setup, ensuring immediate monitoring coverage.

  2. Alerting and Visibility

    • Alerts are evaluated by Prometheus and routed through Alertmanager.

    • Notifications can be sent to configured communication channels (e.g., Slack, email, PagerDuty).
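
    • Example: an illustrative Alertmanager routing snippet that sends every alert to a Slack channel; the receiver name, webhook URL, and channel are placeholders.

      route:
        receiver: slack-default
        group_by: [alertname, cluster]
      receivers:
        - name: slack-default
          slack_configs:
            - api_url: https://hooks.slack.com/services/...   # placeholder webhook URL
              channel: "#alerts"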

  3. Actionable Monitoring

    • Engineers and operators are notified of anomalies in real time.

    • Alerts contain detailed context (e.g., node, pod, deployment names) to assist in debugging.

  4. Prevention and Optimization

    • Early detection of issues like high resource usage or application errors prevents downtime.

    • Continuous monitoring ensures applications remain performant and resources are optimally utilized.

  5. Scalability

    • As applications grow, these alerts can be customized for specific workloads or environments.

    • Supports scaling operations without compromising reliability.
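
    • Example: an illustrative customization that narrows the default pod-restart rule to a single namespace with a stricter threshold; the alert name and the "payments" namespace are hypothetical.

      groups:
        - name: custom-alerts
          rules:
            - alert: PaymentsPodFrequentRestarts
              # Stricter than the default threshold of 3 restarts per hour.
              expr: increase(kube_pod_container_status_restarts_total{namespace="payments"}[1h]) > 1
              annotations:
                summary: "Pod {{ $labels.pod }} in payments restarted more than once in the last hour"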


Full List of Predefined Alerts

  1. NodeHighCPUUsage

  2. NodeHighMemoryUsage

  3. NodeLowDiskSpace

  4. NodeHighLoad

  5. PodFrequentRestarts

  6. PodNotReady

  7. DeploymentReplicasMismatch

  8. NodeNotReady

  9. HighLatency

  10. HighErrorRate

  11. LowThroughput

  12. HighResponseTime

  13. DiskSpaceUsage (variant for specific mounts)

  14. ServiceUnavailable (future)

  15. DatabaseConnectionErrors (future potential)
