Scoutflo Voyager
Introduction
Scoutflo Voyager is an Agentic Workflow engine for debugging and managing Kubernetes-based systems through AI-driven workflows. By leveraging Agentic Debugging, Scoutflo empowers teams to efficiently diagnose and resolve issues in complex, distributed environments. Our platform integrates with Kubernetes clusters and observability tools like Prometheus, Sentry, Grafana, ArgoCD, Argo Rollouts, and Istio to fetch real-time and historical data, analyze system behavior, and provide actionable remediation suggestions. Scoutflo’s AI Engine employs specialized agents that combine contextual data, integrated tools, and reasoning to identify root causes and accelerate incident resolution, reducing downtime and operational overhead.
Whether you’re managing a single Kubernetes cluster or a multi-cluster environment with rich observability integrations, Scoutflo adapts to your setup, offering tailored debugging guidance—from general recommendations to in-depth, context-aware root cause analysis. This document outlines Scoutflo’s core capabilities, starting with our innovative Agentic Debugging approach, and provides a foundation for understanding how Scoutflo transforms incident management and system reliability.
Agentic Debugging
Agentic Debugging is Scoutflo’s core methodology, enabling teams to delegate complex debugging tasks to our AI Engine. Powered by a specialized ensemble of AI agents, Scoutflo analyzes system symptoms, queries relevant data sources, identifies likely root causes, and recommends precise next steps for remediation. By integrating with Kubernetes and observability tools, the AI Engine delivers context-aware insights, making it an indispensable ally for DevOps teams, SREs, and developers managing modern cloud-native applications.
🛠️ How It Works
Scoutflo’s AI Engine orchestrates debugging through a structured workflow, leveraging the following components:
Data Sources:
Connect Scoutflo to your telemetry data sources (e.g., Prometheus for metrics, Sentry for error tracking, Grafana for visualizations, Kubernetes for cluster state) to enable the AI Engine to query logs, metrics, and infrastructure details.
Example: The prometheus_metrics_query tool retrieves real-time CPU usage, while pods_log fetches pod logs to analyze errors like stack traces.
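For orientation, the sketch below shows roughly what these lookups correspond to when run by hand; the Prometheus address, namespace, and pod name are placeholders for your own environment, not part of Scoutflo itself.
```
# Real-time CPU usage per pod via the Prometheus HTTP API (address is a placeholder).
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)'

# Recent pod logs to look for errors or stack traces (pod and namespace are placeholders).
kubectl logs my-app -n default --tail=100
```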
Alerts:
Integrate with alerting systems (e.g., Prometheus, Sentry, Grafana) to allow the AI Engine to detect and investigate issues as they arise.
Example: A firing Prometheus alert for high memory usage triggers the AI to use nodes_metrics to check node resource constraints.
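A hedged manual equivalent of this flow, assuming a Prometheus endpoint at localhost:9090 and metrics-server installed in the cluster:
```
# List currently firing alerts via the Prometheus HTTP API.
curl -s 'http://localhost:9090/api/v1/alerts' | jq '.data.alerts[] | select(.state == "firing")'

# Check node-level resource pressure, as nodes_metrics would.
kubectl top nodes
kubectl describe node <node-name> | grep -A 5 'Allocated resources'
```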
Runbooks (Coming Soon):
Provide custom prompts or wiki-style runbooks to guide the AI Engine in specific scenarios where you have predefined opinions or workflows.
Runbooks are optional, enhancing debugging for non-standard or organization-specific cases, while Scoutflo handles common scenarios out of the box.
🔄 Integration-Based Behavior
Scoutflo Voyager's AI Engine adapts its debugging approach based on the level of integration with your environment, ensuring flexibility across setups:
Scenario: No Kubernetes Cluster or Third-Party Integrations Connected
Behavior: The AI Engine provides general debugging guidance, including questions to ask, commands to run (e.g., kubectl get pods), and logs to inspect manually. This mode supports users with minimal setup, offering foundational troubleshooting steps.
Example: For a suspected pod issue, the AI suggests running kubectl describe pod and checking for common errors like ImagePullBackOff.
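The commands it suggests in this mode are ordinary kubectl triage, along the lines of the following (pod and namespace names are placeholders):
```
kubectl get pods -n default
kubectl describe pod my-app -n default     # look for ImagePullBackOff, CrashLoopBackOff, failed probes
kubectl logs my-app -n default --previous  # logs from the last crashed container, if any
```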
Scenario: Kubernetes Cluster Connected, No Third-Party Integrations
Behavior: The AI Engine leverages Kubernetes-specific MCP tools (e.g., pods_list, nodes_get, events_list) to provide cluster-level debugging guidance, suggesting targeted commands and log inspections based on user queries.
Example: For a pod failure, the AI uses pods_get to check status and recommends inspecting events_list for scheduling errors.
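Run by hand, the equivalent cluster-level checks look roughly like this (names are placeholders):
```
kubectl get pod my-app -n default -o wide
kubectl get events -n default --sort-by=.lastTimestamp   # look for FailedScheduling or Insufficient cpu/memory
```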
Scenario: Kubernetes Cluster and Third-Party Integrations Connected (e.g., Sentry, Prometheus, Grafana, ArgoCD, Istio)
Behavior: The AI Engine actively queries integrated tools (e.g., prometheus_metrics_query_range for historical trends, sentry-get-issue-details for error stack traces, istio_get_virtual_services for routing issues), correlates data across sources, and delivers in-depth root cause analysis with remediation suggestions.
Example: For a CrashLoopBackOff pod, the AI fetches logs (pods_log), checks memory metrics (pods_metrics), and reviews Sentry events (sentry-list-events-for-issue) to diagnose an OOMKill, suggesting “Increase memory limit” via resources_patch.
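A minimal sketch of the same investigation done by hand, assuming a pod named my-app in the default namespace and Prometheus at localhost:9090:
```
kubectl logs my-app -n default --previous   # logs from the crashed container
kubectl get pod my-app -n default \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'   # expect "OOMKilled"

# Confirm memory pressure with the working-set metric the AI correlates against.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=container_memory_working_set_bytes{pod="my-app"}'
```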
📋 Example
Consider a Kubernetes pod in a CrashLoopBackOff state with observability integrations configured (Prometheus, Sentry, Grafana):
AI Engine Actions:
Queries pods_log to retrieve application logs, identifying error codes or stack traces.
Uses pods_get to inspect pod configuration and events, detecting an OOMKilled condition.
Pulls metrics via prometheus_metrics_query to confirm high memory usage in real-time.
Checks sentry-get-issue-details for related error reports, correlating with a recent deployment via argocd_get_application_events.
Diagnosis: Identifies an out-of-memory issue due to insufficient resource limits post-deployment.
Remediation: Suggests increasing the pod’s memory limit using resources_patch or rolling back to a stable version with argocd_sync_application.
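As a hedged illustration, these two remediation paths map to commands like the following; the deployment name, namespace, memory value, and Argo CD application and revision are placeholders:
```
# Raise the container memory limit (manual equivalent of the resources_patch suggestion).
kubectl patch deployment my-app -n default --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "512Mi"}
]'

# Or roll back to a known-good revision with the Argo CD CLI.
argocd app history my-app        # list previous sync revisions and their IDs
argocd app rollback my-app <id>  # roll back to the chosen revision
```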