Default Alert Rules
The alert rules provided are predefined monitoring templates covering the infrastructure, Kubernetes cluster, and application levels. These rules are shipped "out of the box" by Scoutflo and automatically activated when a cluster is created. They provide instant value by ensuring customers have critical monitoring in place from the beginning, without manual configuration.
Alert Rules list
1. Infrastructure-Level Alerts
These monitor the performance and health of the underlying physical or virtual machines hosting the cluster.
Node CPU Usage
Metric: node_cpu_seconds_total
Formula: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Trigger: Alerts when CPU usage exceeds 80% for 5 minutes.
Use Case: High CPU usage can degrade system performance, indicating resource exhaustion or runaway processes.
Node Memory Usage
Metrics: node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes
Formula: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Trigger: Alerts when memory usage exceeds 80% for 5 minutes.
Use Case: Helps identify memory leaks or over-provisioned workloads that can lead to application failures.
Disk Space Usage
Metrics: node_filesystem_avail_bytes, node_filesystem_size_bytes
Formula: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
Trigger: Alerts when available disk space drops below 20% for 10 minutes.
Use Case: Prevents outages caused by insufficient disk space for logs, files, or system processes.
Node Load
Metric: node_load1
Formula: node_load1 > count by(instance) (node_cpu_seconds_total{mode="idle"}) * 1.5
Trigger: Alerts when the 1-minute load average exceeds 1.5 times the node's CPU count for 5 minutes.
Use Case: High load may indicate CPU contention, unoptimized workloads, or insufficient resources.
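As an illustration, the Node CPU Usage rule above could be written in Prometheus's alerting-rule file format roughly as follows. The group name, severity label, and annotations here are illustrative, not the exact definitions Scoutflo ships:

```yaml
groups:
  - name: infrastructure-alerts        # illustrative group name
    rules:
      - alert: NodeHighCPUUsage
        # Fires when average CPU usage on a node stays above 80% for 5 minutes
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning            # illustrative severity
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for 5 minutes."
```

The `for: 5m` clause is what implements the "for 5 minutes" condition in each trigger: the expression must stay true for the whole window before the alert fires.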
2. Kubernetes Cluster Alerts
These ensure the Kubernetes cluster operates as expected, monitoring pods, nodes, and deployments.
Pod Restarts
Metric: kube_pod_container_status_restarts_total
Formula: increase(kube_pod_container_status_restarts_total[1h]) > 3
Trigger: Alerts if a pod restarts more than 3 times in an hour.
Use Case: Frequent restarts often indicate issues like configuration errors, crashes, or resource constraints.
Pod Not Ready
Metric: kube_pod_status_ready
Formula: kube_pod_status_ready{condition="true"} == 0
Trigger: Alerts when a pod remains in a "Not Ready" state for 5 minutes.
Use Case: Ensures critical applications are available and ready to handle requests.
Deployment Replicas Mismatch
Metrics: kube_deployment_spec_replicas vs. kube_deployment_status_replicas_available
Formula: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
Trigger: Alerts if the desired replica count does not match the available replica count for 5 minutes.
Use Case: Identifies issues in scaling or deployment strategies.
Node Not Ready
Metric: kube_node_status_condition
Formula: kube_node_status_condition{condition="Ready",status="true"} == 0
Trigger: Alerts when a node is in a "Not Ready" state for 5 minutes.
Use Case: Detects node-level failures that could impact workload scheduling.
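For reference, a Kubernetes-level rule such as Pod Restarts could be expressed in the same Prometheus rule format. Again, the group name, severity, and annotation text are illustrative:

```yaml
groups:
  - name: kubernetes-alerts            # illustrative group name
    rules:
      - alert: PodFrequentRestarts
        # Fires when a container restarts more than 3 times within one hour
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: warning            # illustrative severity
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

Because the metric carries `namespace` and `pod` labels (via kube-state-metrics), the alert fires per pod and the labels identify exactly which workload is affected.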
3. Application-Level Alerts
These monitor the health and performance of user applications deployed on the cluster.
High Latency
Metric: http_request_duration_seconds
Formula: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Trigger: Alerts when 95th-percentile latency exceeds 0.5 seconds for 5 minutes.
Use Case: Ensures responsive user experiences by detecting slow requests.
High Error Rate
Metric: http_requests_total
Formula: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Trigger: Alerts when the error rate exceeds 5% for 5 minutes.
Use Case: Helps catch application errors and prevent customer impact.
Low Throughput
Metric: http_requests_total
Formula: rate(http_requests_total[5m])
Trigger: Alerts when the request rate drops below 10 requests/second for 5 minutes.
Use Case: Detects underperforming or idle applications.
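An application-level rule such as High Error Rate could look as follows in rule-file form. Note that the 5% threshold is expressed as the ratio 0.05; the group name and severity are illustrative:

```yaml
groups:
  - name: application-alerts           # illustrative group name
    rules:
      - alert: HighErrorRate
        # Fires when more than 5% of requests return HTTP 5xx over 5 minutes
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical           # illustrative severity
        annotations:
          summary: "HTTP 5xx error rate is above 5%"
```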
Product Flow and Use Case
Cluster Creation
When users create a Kubernetes cluster through Scoutflo, these alerts are auto-enabled.
Default alert rules require no setup, ensuring immediate monitoring coverage.
Alerting and Visibility
Alerts are integrated with Prometheus and routed through Alert Manager.
Notifications can be sent to configured communication channels (e.g., Slack, email, PagerDuty).
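As a minimal sketch, an Alertmanager configuration routing these alerts to a Slack channel might look like the following. The receiver name and channel are illustrative, and the webhook URL is a placeholder you would replace with your own:

```yaml
route:
  receiver: slack-notifications        # default receiver for all alerts
  group_by: [alertname, namespace]     # batch related alerts into one notification
receivers:
  - name: slack-notifications
    slack_configs:
      - channel: "#alerts"             # illustrative channel name
        api_url: "https://hooks.slack.com/services/..."  # your Slack webhook URL
```

Additional receivers (email, PagerDuty) can be added alongside and selected per team or severity via routing rules.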
Actionable Monitoring
Engineers and operators are notified of anomalies in real time.
Alerts contain detailed context (e.g., node, pod, deployment names) to assist in debugging.
Prevention and Optimization
Early detection of issues like high resource usage or application errors prevents downtime.
Continuous monitoring ensures applications remain performant and resources are optimally utilized.
Scalability
As applications grow, these alerts can be customized for specific workloads or environments.
Supports scaling operations without compromising reliability.
Full List of Predefined Alerts
NodeHighCPUUsage
NodeHighMemoryUsage
NodeLowDiskSpace
NodeHighLoad
PodFrequentRestarts
PodNotReady
DeploymentReplicasMismatch
NodeNotReady
HighLatency
HighErrorRate
LowThroughput
HighResponseTime
DiskSpaceUsage (variant for specific mounts)
ServiceUnavailable (future)
DatabaseConnectionErrors (future)