Kepler Doc
Introduction
When your infrastructure encounters an issue (a pod crashing, an AWS alarm firing, or an application error in Sentry), you need answers fast. General-purpose AI can only offer generic suggestions and does not know your specific alerts, infrastructure context, or the proven steps your team has learned over time.
Kepler changes this.
Kepler is the AI Memory server that powers Voyager's intelligent troubleshooting capabilities. It stores expert-curated Playbooks, which are step-by-step troubleshooting guides written specifically for the alerts and errors you encounter.
When an incident occurs, Voyager does not guess. It queries Kepler, retrieves the matching Playbook, and follows a deterministic, proven path to help you understand and investigate the issue accurately.
At a Glance
476 expert-curated Playbooks: covering Kubernetes, AWS, and Sentry alerts
4-section structure: Meaning, Impact, Playbook, Diagnosis
Deterministic guidance: same alert = same proven steps, every time
Prioritized investigation: most likely causes checked first
Coming soon: upload your own RCAs for personalized knowledge
Why Kepler?
The Challenge with AI Alone
AI assistants are powerful, but incident troubleshooting requires precision that general AI cannot guarantee:
You report a Prometheus alert, and AI gives you a generic explanation that could apply to many situations.
You ask about an application error, and AI suggests steps it thinks might work but misses your specific context.
You ask the same question twice, and AI gives different answers each time.
This inconsistency and guesswork can slow you down when every minute counts.
How Kepler Solves This
Kepler provides deterministic guidance, so the same alert always triggers the same proven troubleshooting path:
Without Kepler: AI generates responses on-the-fly. With Kepler: AI follows curated, expert-written steps.
Without Kepler: Answers vary each time you ask. With Kepler: Same alert = same proven guidance.
Without Kepler: Generic suggestions that may not apply. With Kepler: Specific steps for your exact alert.
Without Kepler: Random order of investigation. With Kepler: Prioritized by most likely cause first.
Without Kepler: Risk of incorrect or irrelevant advice. With Kepler: Validated, safe, triage-focused steps.
The result is faster, more accurate troubleshooting with confidence that you are following a proven path.
What Are Playbooks?
Playbooks are the heart of Kepler. They are structured troubleshooting guides that capture expert knowledge and make it instantly available when you need it.
The Idea Behind Playbooks
Imagine your most experienced engineer is always available, 24/7, ready to guide you through any incident. That is what a Playbook represents: expert knowledge, written down and ready to use.
Each Playbook is designed for a specific type of issue:
A specific Prometheus alert (such as high memory usage or pod failures)
A specific AWS alarm (such as EC2 connectivity issues or RDS problems)
A specific application error in Sentry (such as database timeouts or API failures)
When that issue occurs, Kepler finds the matching Playbook and Voyager guides you through it step by step.
How Playbooks Are Structured
Every Playbook follows a consistent four-section format so you get answers to the questions that matter most during an incident.
Section 1: Meaning
Key question: What does this alert or error actually mean?
Before you can fix something, you need to understand it. The Meaning section gives you:
A clear explanation of what triggered the alert
What is actually happening in your system
Why this condition is being flagged as a problem
For example, if you see a memory-related alert, this section explains whether it is about a memory leak, a spike in usage, a container hitting its limits, or something else entirely.
Why it matters: You immediately understand the situation without needing to research the alert yourself.
Section 2: Impact
Key question: Why should I care? What is at risk?
Not every alert is equally urgent. The Impact section helps you understand:
What services, systems, or users could be affected
How severe the situation is
What might happen if this is not addressed
For example, a database connection error might impact all API requests and lead to user-facing failures, while a non-critical warning might only affect internal logging.
Why it matters: You can prioritize effectively and communicate the urgency to your team or stakeholders.
Section 3: Playbook
Key question: What should I check first?
This is the core troubleshooting section. It provides:
Step-by-step investigation instructions
Specific commands, queries, or checks to run
What to look for in the results
A prioritized order so the most likely causes are checked first
For example, for a pod crash, the steps might guide you to check logs first, then resource limits, then recent deployments, and finally external dependencies, in that specific order.
Why it matters: You follow a proven diagnostic path instead of randomly trying things.
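As a rough illustration of what a prioritized, read-only investigation could look like, the sketch below encodes the pod crash order described above as an ordered list of kubectl commands. The step list, the build_commands helper, and the pod, namespace, and deployment names are all hypothetical and are not taken from an actual Kepler Playbook. The commands are only built, never executed.

```python
import shlex

# Illustrative, read-only checks for a crashing pod, most likely cause first.
# This list is an assumption for demonstration, not a real Kepler Playbook.
POD_CRASH_STEPS = [
    ("Check logs from the previous container run",
     "kubectl logs {pod} -n {namespace} --previous"),
    ("Check resource limits and the last termination state",
     "kubectl describe pod {pod} -n {namespace}"),
    ("Check recent deployments",
     "kubectl rollout history deployment/{deployment} -n {namespace}"),
    ("Look for signs of failing external dependencies in recent events",
     "kubectl get events -n {namespace} --sort-by=.lastTimestamp"),
]


def build_commands(pod, namespace, deployment):
    """Return (title, argv) pairs in priority order without running anything."""
    values = {"pod": pod, "namespace": namespace, "deployment": deployment}
    commands = []
    for title, template in POD_CRASH_STEPS:
        # Split the template first, then fill in each token separately.
        argv = [part.format(**values) for part in shlex.split(template)]
        commands.append((title, argv))
    return commands


for number, (title, argv) in enumerate(build_commands("web-0", "prod", "web"), start=1):
    print(f"{number}. {title}: {' '.join(argv)}")
```

Keeping the steps as an ordered list is what makes the order of investigation repeatable: the first entry is always checked first.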
Section 4: Diagnosis
Key question: How do I confirm the root cause?
Once you have narrowed down the issue, this section helps you:
Dig deeper with advanced diagnostic steps
Correlate findings across different systems
Confirm the root cause before taking action
Understand the full scope of the problem
For example, after identifying a memory issue, this section might guide you to check historical trends, compare with other pods, or review recent code changes.
Why it matters: You move beyond symptoms to truly understand what happened and why.
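Taken together, the four sections can be pictured as a simple record with fixed fields. The sketch below is only an illustration of that idea: the Playbook class, its field names, the render helper, and the example text are assumptions made for this page, not Kepler's actual schema or content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    """Hypothetical shape of a Playbook record; not Kepler's actual schema."""
    alert: str                    # alert or error this Playbook is written for
    meaning: str                  # Section 1: what the alert means
    impact: str                   # Section 2: what is at risk
    playbook: tuple[str, ...]     # Section 3: checks, most likely cause first
    diagnosis: tuple[str, ...]    # Section 4: steps to confirm the root cause


def render(pb: Playbook) -> str:
    """Render the four sections as plain text, always in the same order."""
    lines = [f"Alert: {pb.alert}", f"Meaning: {pb.meaning}", f"Impact: {pb.impact}"]
    lines.append("Playbook:")
    lines += [f"  {n}. {step}" for n, step in enumerate(pb.playbook, start=1)]
    lines.append("Diagnosis:")
    lines += [f"  {n}. {step}" for n, step in enumerate(pb.diagnosis, start=1)]
    return "\n".join(lines)


# Illustrative content only, loosely based on the examples in the sections above.
example = Playbook(
    alert="pod crash",
    meaning="A container in the pod keeps exiting and being restarted.",
    impact="Requests served by this workload may fail until the pod is stable.",
    playbook=("Check logs", "Check resource limits",
              "Check recent deployments", "Check external dependencies"),
    diagnosis=("Check historical trends", "Compare with other pods",
               "Review recent code changes"),
)

print(render(example))
```

Because the record is immutable and the rendering order is fixed, the same Playbook always produces the same text, which mirrors the consistency the four-section format is meant to provide.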
What Makes Playbooks Effective
Written by experts: steps come from experienced engineers who've handled these issues before
Specific to your alert: not generic advice, but guidance tailored to the exact problem you're facing
Prioritized investigation: most likely causes checked first, saving you time
Safe and read-only: focus on investigation, not risky automated fixes
Consistent every time: same quality guidance whether it's 3 AM or 3 PM
Continuously improved: refined based on real-world feedback and incidents
Supported Platforms
Kepler provides Playbooks for the major platforms where your incidents occur.
Prometheus and Kubernetes
For teams running containerized workloads, Kepler covers:
Pod issues such as crashes, restarts, failures to start, and resource problems
Node problems such as health issues, capacity problems, and connectivity
Workload concerns such as Deployment, StatefulSet, and DaemonSet issues
Resource alerts for CPU, memory, and storage
Cluster health issues in the control plane and networking
Amazon Web Services (AWS)
For teams using AWS infrastructure, Kepler covers:
Compute issues in EC2, Lambda, and Auto Scaling
Container issues in EKS, ECS, and Fargate
Database alerts in RDS, Aurora, and DynamoDB
Storage concerns in S3, EBS, and EFS
Networking issues in VPC, load balancers, and DNS
Sentry (Application Errors)
For application-level issues captured by Sentry, Kepler covers:
Database errors such as connection failures, timeouts, and query issues
API problems including HTTP errors, request failures, and timeouts
Application exceptions including common runtime errors and their causes
Infrastructure dependencies such as cache, queue, and external services
How It Works
The key difference is that Voyager is not making things up. It follows a curated, validated path that has been specifically designed for the alert you are facing.
Frequently Asked Questions
How does Kepler find the right Playbook for my alert?
When you describe an alert or error to Voyager, Kepler matches it against its library of Playbooks. Each Playbook is designed for specific alert types, error patterns, and issue categories. Kepler identifies the best match based on what you've described and returns the most relevant guidance.
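How Kepler performs this matching internally is not described here. Purely as a mental model, the sketch below shows one simple way a description could be matched deterministically: score each Playbook by keyword overlap and break ties by name, so the same description always yields the same result. The Playbook identifiers, keyword sets, and the match_playbook function are hypothetical.

```python
import re

# Hypothetical Playbook identifiers and keywords, for illustration only.
PLAYBOOK_KEYWORDS = {
    "pod-crash-looping": {"pod", "crash", "crashloopbackoff", "restart"},
    "rds-connection-timeout": {"rds", "database", "connection", "timeout"},
    "sentry-api-timeout": {"api", "http", "request", "timeout"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_playbook(description):
    """Return the best-matching Playbook id, or None if nothing overlaps."""
    tokens = tokenize(description)
    scored = [(len(tokens & keywords), name)
              for name, keywords in PLAYBOOK_KEYWORDS.items()]
    best_score = max(score for score, _ in scored)
    if best_score == 0:
        return None
    # Ties are broken alphabetically so the result never depends on dict order.
    return min(name for score, name in scored if score == best_score)


print(match_playbook("My pod keeps restarting with CrashLoopBackOff"))
print(match_playbook("Database connection timeout on RDS"))
print(match_playbook("Disk is full"))
```

A real matcher would likely be more sophisticated, but the property that matters is the same: there is no randomness in the lookup, so repeating the question returns the same Playbook.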
What happens if there's no Playbook for my specific alert?
Kepler's Playbook library is continuously expanding. If an exact match isn't available, Voyager may still provide helpful context based on related Playbooks or general knowledge. As new Playbooks are added, coverage grows over time. With the upcoming custom document feature, you'll also be able to add your own guidance for issues specific to your environment.
Are Playbooks updated over time?
Yes. The Scoutflo team continuously refines Playbooks based on real-world usage, feedback, and evolving best practices. When better diagnostic approaches are identified or new patterns emerge, Playbooks are updated to reflect that knowledge.
Do Playbooks tell me how to fix the issue?
Playbooks focus on triage and diagnosis: they help you understand what's happening and investigate the root cause. They guide you through safe, read-only investigation steps. Once you've identified the issue, you can take appropriate remediation action based on your organization's practices and the specific situation.
Can I use Kepler for alerts from platforms not listed?
Currently, Kepler provides Playbooks for Prometheus/Kubernetes, AWS, and Sentry. Support for additional platforms may be added in the future. The upcoming custom document feature will also allow you to add guidance for any platform or tool your team uses.
How do saved RCAs help with future incidents?
When you save an RCA in Kepler, it becomes part of your organization's knowledge base. If a similar incident occurs in the future, Kepler can retrieve your past RCA and present it alongside the standard Playbook. The AI then knows what your team already learned, what the root cause was, and how it was resolved, which makes the new incident much faster to troubleshoot.