Default Alert Rules

Default alert rules are predefined monitoring templates that cover the infrastructure, Kubernetes cluster, and application levels. Scoutflo ships them out of the box and activates them automatically when a cluster is created, so critical monitoring is in place from the start without manual configuration.


Alert Rules List

1. Infrastructure-Level Alerts

These monitor the performance and health of the underlying physical or virtual machines hosting the cluster.

  • Node CPU Usage

    • Metric: node_cpu_seconds_total

    • Formula: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

    • Trigger: Alerts when CPU usage exceeds 80% for 5 minutes.

    • Use Case: High CPU usage can degrade system performance, indicating resource exhaustion or runaway processes.

  • Node Memory Usage

    • Metric: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

    • Formula: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

    • Trigger: Alerts when memory usage exceeds 80% for 5 minutes.

    • Use Case: Helps identify memory leaks or memory-hungry workloads that can lead to application failures.

  • Disk Space Usage

    • Metric: node_filesystem_avail_bytes / node_filesystem_size_bytes

    • Formula: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

    • Trigger: Alerts when available disk space drops below 20% for 10 minutes.

    • Use Case: Prevents outages caused by insufficient disk space for logs, files, or system processes.

  • Node Load

    • Metric: node_load1

    • Formula: node_load1 > on (instance) (count by (instance) (node_cpu_seconds_total{mode="idle"}) * 1.5)

    • Trigger: Alerts when load exceeds 1.5 times the CPU count for 5 minutes.

    • Use Case: High load may indicate CPU contention, unoptimized workloads, or insufficient resources.
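
The arithmetic behind the four node-level rules can be checked by hand. The sketch below is illustrative only: it applies the formulas and thresholds listed above to one hypothetical set of readings, and it ignores the "for N minutes" hold period. The function names, the sample values, and the pairing of each rule with an alert name such as NodeHighCPUUsage are assumptions made for this example, not Scoutflo's implementation.

```python
def cpu_usage_percent(idle_fraction: float) -> float:
    # 100 - (average idle rate * 100); idle_fraction is CPU time spent idle, in 0..1.
    return 100 - idle_fraction * 100


def memory_usage_percent(available_bytes: float, total_bytes: float) -> float:
    # (1 - MemAvailable / MemTotal) * 100
    return (1 - available_bytes / total_bytes) * 100


def disk_available_percent(avail_bytes: float, size_bytes: float) -> float:
    # (avail / size) * 100; unlike the other rules, this one alerts on a LOW value.
    return avail_bytes / size_bytes * 100


def load_is_high(load1: float, cpu_count: int, factor: float = 1.5) -> bool:
    # 1-minute load average compared with 1.5 times the node's CPU count.
    return load1 > cpu_count * factor


def evaluate_node(sample: dict) -> dict:
    """Return, per rule, whether the threshold is breached for one reading."""
    return {
        "NodeHighCPUUsage": cpu_usage_percent(sample["idle_fraction"]) > 80,
        "NodeHighMemoryUsage": memory_usage_percent(
            sample["mem_available"], sample["mem_total"]) > 80,
        "NodeLowDiskSpace": disk_available_percent(
            sample["disk_avail"], sample["disk_size"]) < 20,
        "NodeHighLoad": load_is_high(sample["load1"], sample["cpu_count"]),
    }


GIB = 1024 ** 3

# Hypothetical readings for a 4-CPU node.
SAMPLE = {
    "idle_fraction": 0.125,                       # 87.5% CPU in use
    "mem_available": 4 * GIB, "mem_total": 16 * GIB,   # 75% memory in use
    "disk_avail": 10 * GIB, "disk_size": 100 * GIB,    # 10% disk available
    "load1": 5.0, "cpu_count": 4,                 # threshold is 4 * 1.5 = 6.0
}

if __name__ == "__main__":
    for name, breached in evaluate_node(SAMPLE).items():
        print(f"{name}: {'breached' if breached else 'ok'}")
```

With these sample readings the CPU and disk thresholds are breached, while memory (75% used) and load (5.0 against a limit of 6.0) stay within bounds.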


2. Kubernetes Cluster Alerts

These ensure the Kubernetes cluster operates as expected, monitoring pods, nodes, and deployments.

  • Pod Restarts

    • Metric: kube_pod_container_status_restarts_total

    • Formula: increase(kube_pod_container_status_restarts_total[1h]) > 3

    • Trigger: Alerts if a pod restarts more than 3 times in an hour.

    • Use Case: Frequent restarts often indicate issues like configuration errors, crashes, or resource constraints.

  • Pod Not Ready

    • Metric: kube_pod_status_ready

    • Formula: kube_pod_status_ready{condition="true"} == 0

    • Trigger: Alerts when a pod remains in a "Not Ready" state for 5 minutes.

    • Use Case: Ensures critical applications are available and ready to handle requests.

  • Deployment Replicas Mismatch

    • Metric: kube_deployment_spec_replicas vs. kube_deployment_status_replicas_available

    • Formula: kube_deployment_spec_replicas != kube_deployment_status_replicas_available

    • Trigger: Alerts if desired replicas do not match available replicas for 5 minutes.

    • Use Case: Identifies issues in scaling or deployment strategies.

  • Node Not Ready

    • Metric: kube_node_status_condition

    • Formula: kube_node_status_condition{condition="Ready",status="true"} == 0

    • Trigger: Alerts when a node is in a "Not Ready" state for 5 minutes.

    • Use Case: Detects node-level failures that could impact workload scheduling.
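
Several triggers above only alert once a condition has held "for 5 minutes". The sketch below models that hold period in the style of the `for` clause of a Prometheus alerting rule: the condition must be true at every evaluation across the whole window, and a single false evaluation resets the timer. The function names, the 60-second evaluation spacing, and the sample timelines are assumptions for illustration.

```python
from typing import Iterable, Optional, Tuple


def replicas_mismatch(spec_replicas: int, available_replicas: int) -> bool:
    # Mirrors: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
    return spec_replicas != available_replicas


def first_firing_time(
    evaluations: Iterable[Tuple[int, bool]], hold_seconds: int = 300
) -> Optional[int]:
    """Return the timestamp at which the alert would fire, or None.

    `evaluations` is a time-ordered series of (unix_timestamp, condition_is_true).
    """
    pending_since = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            pending_since = None          # condition cleared: restart the clock
            continue
        if pending_since is None:
            pending_since = timestamp     # condition just became true: pending
        if timestamp - pending_since >= hold_seconds:
            return timestamp              # held long enough: firing
    return None


# Hypothetical timeline: a deployment wants 3 replicas, evaluated every 60 s.
available_over_time = [2, 2, 2, 3, 2, 2, 2, 2, 2, 2]
timeline = [
    (i * 60, replicas_mismatch(3, available))
    for i, available in enumerate(available_over_time)
]

if __name__ == "__main__":
    print(first_firing_time(timeline))
```

In this timeline the mismatch clears briefly at the fourth evaluation, so the five-minute window restarts and the alert fires later than it would have for an uninterrupted mismatch.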


3. Application-Level Alerts

These monitor the health and performance of user applications deployed on the cluster.

  • High Latency

    • Metric: http_request_duration_seconds

    • Formula: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

    • Trigger: Alerts when 95th percentile latency exceeds 0.5 seconds for 5 minutes.

    • Use Case: Ensures responsive user experiences by detecting slow requests.

  • High Error Rate

    • Metric: http_requests_total

    • Formula: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

    • Trigger: Alerts when error rates exceed 5% for 5 minutes.

    • Use Case: Helps catch application errors and prevent customer impact.

  • Low Throughput

    • Metric: http_requests_total

    • Formula: rate(http_requests_total[5m])

    • Trigger: Alerts when request rates drop below 10 requests/second for 5 minutes.

    • Use Case: Detects underperforming or idle applications.
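
To make the latency rule less opaque, here is a minimal sketch of how a 95th percentile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach Prometheus documents for `histogram_quantile` on classic histograms. It also shows that the error-rate formula yields a ratio, so the 5% trigger corresponds to 0.05. The bucket boundaries and request rates are hypothetical, and returning 0.0 for zero traffic is a choice made for this sketch rather than PromQL behavior.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]            # position of the quantile among all samples
    prev_le, prev_count = 0.0, 0.0       # the first bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile lies beyond the last finite bound: report that bound.
                return buckets[-2][0]
            if count == prev_count:
                return le
            # Assume samples are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return math.nan


def error_ratio(error_rate: float, total_rate: float) -> float:
    # 5xx requests per second divided by all requests per second.
    return error_rate / total_rate if total_rate else 0.0


# Hypothetical per-second bucket rates for http_request_duration_seconds_bucket.
BUCKETS = [(0.1, 50.0), (0.25, 80.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]

if __name__ == "__main__":
    p95 = histogram_quantile(0.95, BUCKETS)
    print(f"p95 latency: {p95:.4f}s, breached: {p95 > 0.5}")
    print(f"error ratio breached: {error_ratio(6.0, 100.0) > 0.05}")
    print(f"low throughput breached: {8.0 < 10}")
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the buckets are spaced around the 0.5-second threshold.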


Product Flow and Use Case

  1. Cluster Creation

    • When users create a Kubernetes cluster through Scoutflo, these alerts are enabled automatically.

    • Default alert rules require no setup, ensuring immediate monitoring coverage.

  2. Alerting and Visibility

    • Alerts are evaluated by Prometheus and routed through Alertmanager.

    • Notifications can be sent to configured communication channels (e.g., Slack, email, PagerDuty).

  3. Actionable Monitoring

    • Engineers and operators are notified of anomalies in real time.

    • Alerts contain detailed context (e.g., node, pod, deployment names) to assist in debugging.

  4. Prevention and Optimization

    • Early detection of issues like high resource usage or application errors prevents downtime.

    • Continuous monitoring ensures applications remain performant and resources are optimally utilized.

  5. Scalability

    • As applications grow, these alerts can be customized for specific workloads or environments.

    • Supports scaling operations without compromising reliability.
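
As a rough illustration of the alerting flow above, the sketch below turns an Alertmanager-style webhook payload into one-line messages that carry the alert name and its context labels (node, pod, and so on). The field names `alerts`, `status`, `labels`, and `annotations` follow Alertmanager's webhook format; the function name, the message layout, and the sample payload are hypothetical and do not describe how Scoutflo formats its notifications.

```python
from typing import Dict, List


def summarize_alerts(payload: Dict) -> List[str]:
    """Build one human-readable line per alert in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        name = labels.get("alertname", "UnknownAlert")
        # Every label except the alert name is treated as debugging context.
        context = ", ".join(
            f"{key}={value}"
            for key, value in sorted(labels.items())
            if key != "alertname"
        )
        summary = alert.get("annotations", {}).get("summary", "no summary")
        status = alert.get("status", "unknown").upper()
        lines.append(f"[{status}] {name} ({context}): {summary}")
    return lines


# Hypothetical payload with one firing and one resolved alert.
SAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "PodNotReady",
                "namespace": "shop",
                "pod": "checkout-7d9f",
            },
            "annotations": {"summary": "Pod has not been ready for 5 minutes"},
        },
        {
            "status": "resolved",
            "labels": {"alertname": "NodeHighCPUUsage", "instance": "node-1"},
            "annotations": {"summary": "CPU usage above 80%"},
        },
    ],
}

if __name__ == "__main__":
    for line in summarize_alerts(SAMPLE_PAYLOAD):
        print(line)
```

A receiver like this would sit behind a webhook endpoint; forwarding the resulting lines to Slack, email, or PagerDuty is left out of the sketch.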


Full List of Predefined Alerts

  1. NodeHighCPUUsage

  2. NodeHighMemoryUsage

  3. NodeLowDiskSpace

  4. NodeHighLoad

  5. PodFrequentRestarts

  6. PodNotReady

  7. DeploymentReplicasMismatch

  8. NodeNotReady

  9. HighLatency

  10. HighErrorRate

  11. LowThroughput

  12. HighResponseTime

  13. DiskSpaceUsage (variant for specific mounts)

  14. ServiceUnavailable (future)

  15. DatabaseConnectionErrors (future potential)


Last updated 4 months ago