Default Alert Rules
The alert rules provided are predefined monitoring templates covering infrastructure, Kubernetes clusters, and applications. These rules are shipped "out of the box" by Scoutflo and automatically activated when a cluster is created, giving customers critical monitoring coverage from day one without any manual configuration.
Alert Rules list
1. Infrastructure-Level Alerts
These monitor the performance and health of the underlying physical or virtual machines hosting the cluster.
Node CPU Usage
Metric:
node_cpu_seconds_total
Formula:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Trigger: Alerts when CPU usage exceeds 80% for 5 minutes.
Use Case: High CPU usage can degrade system performance, indicating resource exhaustion or runaway processes.
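These defaults are managed by Scoutflo, but as a sketch, an equivalent standalone Prometheus rule for this alert might look like the following (the alert name NodeHighCPUUsage matches the list at the end of this page; the severity label and annotation text are illustrative assumptions, not confirmed Scoutflo defaults):

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: NodeHighCPUUsage
        # Fires when average CPU usage on a node exceeds 80% for 5 minutes.
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning   # illustrative; not a confirmed Scoutflo default
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
```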
Node Memory Usage
Metric:
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
Formula:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Trigger: Alerts when memory usage exceeds 80% for 5 minutes.
Use Case: Helps identify memory leaks or over-provisioned workloads that can lead to application failures.
Disk Space Usage
Metric:
node_filesystem_avail_bytes / node_filesystem_size_bytes
Formula:
(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
Trigger: Alerts when available disk space drops below 20% for 10 minutes.
Use Case: Prevents outages caused by insufficient disk space for logs, files, or system processes.
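As a sketch, the corresponding NodeLowDiskSpace rule; the fstype filter excluding tmpfs and overlay mounts is a common refinement and an assumption here, not necessarily part of the shipped default:

```yaml
- alert: NodeLowDiskSpace
  # Fires when a filesystem has less than 20% space available for 10 minutes.
  # The fstype filter (an assumption) skips ephemeral mounts like tmpfs.
  expr: >
    (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
     / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 20
  for: 10m
```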
Node Load
Metric:
node_load1
Formula:
node_load1 > count by(instance) (node_cpu_seconds_total{mode="idle"}) * 1.5
Trigger: Alerts when load exceeds 1.5 times the CPU count for 5 minutes.
Use Case: High load may indicate CPU contention, unoptimized workloads, or insufficient resources.
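Sketched as a rule, with the CPU count grouped by instance so that each node's load is compared against its own CPU count rather than a cluster-wide total:

```yaml
- alert: NodeHighLoad
  # 1-minute load average above 1.5x the node's own CPU count for 5 minutes.
  expr: node_load1 > count by(instance) (node_cpu_seconds_total{mode="idle"}) * 1.5
  for: 5m
```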
2. Kubernetes Cluster Alerts
These ensure the Kubernetes cluster operates as expected, monitoring pods, nodes, and deployments.
Pod Restarts
Metric:
kube_pod_container_status_restarts_total
Formula:
increase(kube_pod_container_status_restarts_total[1h]) > 3
Trigger: Alerts if a pod restarts more than 3 times in an hour.
Use Case: Frequent restarts often indicate issues like configuration errors, crashes, or resource constraints.
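A sketch of the PodFrequentRestarts rule; the annotation text is illustrative, and the namespace/pod labels assume the standard kube-state-metrics labeling:

```yaml
- alert: PodFrequentRestarts
  # More than 3 container restarts within the last hour.
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```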
Pod Not Ready
Metric:
kube_pod_status_ready
Formula:
kube_pod_status_ready{condition="true"} == 0
Trigger: Alerts when a pod remains in a "Not Ready" state for 5 minutes.
Use Case: Ensures critical applications are available and ready to handle requests.
Deployment Replicas Mismatch
Metric:
kube_deployment_spec_replicas vs. kube_deployment_status_replicas_available
Formula:
kube_deployment_spec_replicas != kube_deployment_status_replicas_available
Trigger: Alerts if desired replicas do not match available replicas for 5 minutes.
Use Case: Identifies issues in scaling or deployment strategies.
Node Not Ready
Metric:
kube_node_status_condition
Formula:
kube_node_status_condition{condition="Ready",status="true"} == 0
Trigger: Alerts when a node is in a "Not Ready" state for 5 minutes.
Use Case: Detects node-level failures that could impact workload scheduling.
3. Application-Level Alerts
These monitor the health and performance of user applications deployed on the cluster.
High Latency
Metric:
http_request_duration_seconds
Formula:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Trigger: Alerts when 95th percentile latency exceeds 0.5 seconds for 5 minutes.
Use Case: Ensures responsive user experiences by detecting slow requests.
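As a sketch, the HighLatency rule wraps the same histogram_quantile expression in a threshold; the metric name assumes the application exposes a standard Prometheus histogram:

```yaml
- alert: HighLatency
  # 95th percentile request latency above 0.5 seconds for 5 minutes.
  expr: >
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 5m
```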
High Error Rate
Metric:
http_requests_total
Formula:
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Trigger: Alerts when error rates exceed 5% for 5 minutes.
Use Case: Helps catch application errors and prevent customer impact.
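The HighErrorRate rule sketched the same way, with the 5% threshold expressed as a ratio:

```yaml
- alert: HighErrorRate
  # More than 5% of requests returning HTTP 5xx over the last 5 minutes.
  expr: >
    sum(rate(http_requests_total{code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
```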
Low Throughput
Metric:
http_requests_total
Formula:
rate(http_requests_total[5m])
Trigger: Alerts when request rates drop below 10 requests/second for 5 minutes.
Use Case: Detects underperforming or idle applications.
Product Flow and Use Case
Cluster Creation
When users create a Kubernetes cluster through the platform, these alerts are auto-enabled.
Default alert rules require no setup, ensuring immediate monitoring coverage.
Alerting and Visibility
Alerts are evaluated by Prometheus and routed through Alertmanager.
Notifications can be sent to configured communication channels (e.g., Slack, email, PagerDuty).
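As a sketch of the routing side, a minimal Alertmanager configuration that groups these alerts and forwards them to Slack; the receiver name, channel, and webhook file path are illustrative placeholders, not Scoutflo defaults:

```yaml
route:
  receiver: team-alerts
  group_by: [alertname, instance]
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: team-alerts
    slack_configs:
      - channel: "#alerts"                              # illustrative channel
        api_url_file: /etc/alertmanager/slack-webhook   # placeholder path
```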
Actionable Monitoring
Engineers and operators are notified of anomalies in real time.
Alerts contain detailed context (e.g., node, pod, deployment names) to assist in debugging.
Prevention and Optimization
Early detection of issues like high resource usage or application errors prevents downtime.
Continuous monitoring ensures applications remain performant and resources are optimally utilized.
Scalability
As applications grow, these alerts can be customized for specific workloads or environments.
Supports scaling operations without compromising reliability.
Full List of Predefined Alerts
NodeHighCPUUsage
NodeHighMemoryUsage
NodeLowDiskSpace
NodeHighLoad
PodFrequentRestarts
PodNotReady
DeploymentReplicasMismatch
NodeNotReady
HighLatency
HighErrorRate
LowThroughput
HighResponseTime
DiskSpaceUsage (variant for specific mounts)
ServiceUnavailable (future)
DatabaseConnectionErrors (future)