Kepler Doc

Introduction

When your infrastructure encounters an issue (a pod crashing, an AWS alarm firing, or an application error in Sentry), you need answers fast. General-purpose AI can only offer generic suggestions and does not know your specific alerts, infrastructure context, or the proven steps your team has learned over time.

Kepler changes this.

Kepler is the AI Memory server that powers Voyager's intelligent troubleshooting capabilities. It stores expert-curated Playbooks, which are step-by-step troubleshooting guides written specifically for the alerts and errors you encounter.

When an incident occurs, Voyager does not guess. It queries Kepler, retrieves the matching Playbook, and follows a deterministic, proven path to help you understand and investigate the issue accurately.


At a Glance

  • 476 expert-curated Playbooks: covering Kubernetes, AWS, and Sentry alerts

  • 4-section structure: Meaning, Impact, Playbook, Diagnosis

  • Deterministic guidance: same alert = same proven steps, every time

  • Prioritized investigation: most likely causes checked first

  • Coming soon: upload your own RCAs for personalized knowledge


Why Kepler?

The Challenge with AI Alone

AI assistants are powerful, but incident troubleshooting requires precision that general AI cannot guarantee:

  • You report a Prometheus alert, and AI gives you a generic explanation that could apply to many situations.

  • You ask about an application error, and AI suggests steps it thinks might work but misses your specific context.

  • You ask the same question twice, and AI gives different answers each time.

This inconsistency and guesswork can slow you down when every minute counts.

How Kepler Solves This

Kepler provides deterministic guidance, so the same alert always triggers the same proven troubleshooting path:

| Without Kepler | With Kepler |
| --- | --- |
| AI generates responses on-the-fly | AI follows curated, expert-written steps |
| Answers vary each time you ask | Same alert = same proven guidance |
| Generic suggestions that may not apply | Specific steps for your exact alert |
| Random order of investigation | Prioritized by most likely cause first |
| Risk of incorrect or irrelevant advice | Validated, safe, triage-focused steps |

The result is faster, more accurate troubleshooting with confidence that you are following a proven path.


What Are Playbooks?

Playbooks are the heart of Kepler. They are structured troubleshooting guides that capture expert knowledge and make it instantly available when you need it.

The Idea Behind Playbooks

Imagine your most experienced engineer is always available, 24/7, ready to guide you through any incident. That is what a Playbook represents: expert knowledge, written down and ready to use.

Each Playbook is designed for a specific type of issue:

  • A specific Prometheus alert (such as high memory usage or pod failures)

  • A specific AWS alarm (such as EC2 connectivity issues or RDS problems)

  • A specific application error in Sentry (such as database timeouts or API failures)

When that issue occurs, Kepler finds the matching Playbook and Voyager guides you through it step by step.

How Playbooks Are Structured

Every Playbook follows a consistent four-section format so you get answers to the questions that matter most during an incident.

Section 1: Meaning

Key question: What does this alert or error actually mean?

Before you can fix something, you need to understand it. The Meaning section gives you:

  • A clear explanation of what triggered the alert

  • What is actually happening in your system

  • Why this condition is being flagged as a problem

For example, if you see a memory-related alert, this section explains whether it is about a memory leak, a spike in usage, a container hitting its limits, or something else entirely.

Why it matters: You immediately understand the situation without needing to research the alert yourself.


Section 2: Impact

Key question: Why should I care? What is at risk?

Not every alert is equally urgent. The Impact section helps you understand:

  • What services, systems, or users could be affected

  • How severe the situation is

  • What might happen if this is not addressed

For example, a database connection error might impact all API requests and lead to user-facing failures, while a non-critical warning might only affect internal logging.

Why it matters: You can prioritize effectively and communicate the urgency to your team or stakeholders.


Section 3: Playbook

Key question: What should I check first?

This is the core troubleshooting section. It provides:

  • Step-by-step investigation instructions

  • Specific commands, queries, or checks to run

  • What to look for in the results

  • A prioritized order so the most likely causes are checked first

For example, for a pod crash, the steps might guide you to check logs first, then resource limits, then recent deployments, and finally external dependencies, in that specific order.

Why it matters: You follow a proven diagnostic path instead of randomly trying things.
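
The prioritized order described above can be pictured with a minimal sketch. This is an illustration only, not Kepler's actual storage format: the `Step` class, its fields, and the `POD_CRASH_STEPS` list are hypothetical names, and the steps simply mirror the pod crash example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical investigation step; not Kepler's real schema."""
    priority: int   # 1 = most likely cause, checked first
    check: str      # what to inspect
    look_for: str   # what to look for in the results


# Illustrative steps for a pod crash, deliberately stored out of order.
POD_CRASH_STEPS = [
    Step(3, "recent deployments", "a rollout shortly before the first crash"),
    Step(1, "pod logs", "fatal errors or stack traces before the exit"),
    Step(4, "external dependencies", "timeouts or refused connections"),
    Step(2, "resource limits", "memory or CPU usage at the configured limit"),
]


def investigation_order(steps):
    """Return the steps sorted so the most likely cause comes first."""
    return sorted(steps, key=lambda s: s.priority)


for step in investigation_order(POD_CRASH_STEPS):
    print(f"{step.priority}. Check {step.check}: look for {step.look_for}")
```

Because the order is data rather than something generated per request, the same alert yields the same sequence of checks every time.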

Section 4: Diagnosis

Key question: How do I confirm the root cause?

Once you have narrowed down the issue, this section helps you:

  • Dig deeper with advanced diagnostic steps

  • Correlate findings across different systems

  • Confirm the root cause before taking action

  • Understand the full scope of the problem

For example, after identifying a memory issue, this section might guide you to check historical trends, compare with other pods, or review recent code changes.

Why it matters: You move beyond symptoms to truly understand what happened and why.
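
As a rough mental model, the four sections can be thought of as fields on a single record. The sketch below is an assumption made for illustration: the `Playbook` class, its field names, the `render` helper, and the `HighMemoryUsage` alert name are hypothetical and do not describe how Kepler stores Playbooks internally.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Playbook:
    """Hypothetical model of one Playbook; field names are illustrative."""
    alert: str                   # the alert or error this Playbook targets
    meaning: str                 # Section 1: what the alert means
    impact: str                  # Section 2: what is at risk
    playbook: Tuple[str, ...]    # Section 3: ordered investigation steps
    diagnosis: Tuple[str, ...]   # Section 4: steps to confirm the root cause


def render(pb: Playbook) -> str:
    """Lay out the four sections in their fixed order."""
    lines = [f"Alert: {pb.alert}", "", "Meaning", pb.meaning,
             "", "Impact", pb.impact, "", "Playbook"]
    lines += [f"  {i}. {step}" for i, step in enumerate(pb.playbook, start=1)]
    lines += ["", "Diagnosis"]
    lines += [f"  {i}. {step}" for i, step in enumerate(pb.diagnosis, start=1)]
    return "\n".join(lines)


example = Playbook(
    alert="HighMemoryUsage",  # hypothetical alert name
    meaning="A container is using memory close to its configured limit.",
    impact="The pod may be OOM-killed, interrupting the requests it serves.",
    playbook=("Check pod logs", "Check resource limits", "Check recent deployments"),
    diagnosis=("Review historical memory trends", "Compare with other pods"),
)

print(render(example))
```

Keeping the section order fixed in one place is what gives every Playbook the same shape, whichever alert it covers.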

What Makes Playbooks Effective

| Characteristic | What It Means for You |
| --- | --- |
| Written by experts | Steps come from experienced engineers who've handled these issues before |
| Specific to your alert | Not generic advice, but guidance tailored to the exact problem you're facing |
| Prioritized investigation | Most likely causes checked first, saving you time |
| Safe and read-only | Focus on investigation, not risky automated fixes |
| Consistent every time | Same quality guidance whether it's 3 AM or 3 PM |
| Continuously improved | Refined based on real-world feedback and incidents |


Supported Platforms

Kepler provides Playbooks for the major platforms where your incidents occur.

Prometheus and Kubernetes

For teams running containerized workloads, Kepler covers:

  • Pod issues such as crashes, restarts, failures to start, and resource problems

  • Node problems such as health issues, capacity problems, and connectivity

  • Workload concerns such as Deployment, StatefulSet, and DaemonSet issues

  • Resource alerts for CPU, memory, and storage

  • Cluster health issues in the control plane and networking

Amazon Web Services (AWS)

For teams using AWS infrastructure, Kepler covers:

  • Compute issues in EC2, Lambda, and Auto Scaling

  • Container issues in EKS, ECS, and Fargate

  • Database alerts in RDS, Aurora, and DynamoDB

  • Storage concerns in S3, EBS, and EFS

  • Networking issues in VPC, load balancers, and DNS

Sentry (Application Errors)

For application-level issues captured by Sentry, Kepler covers:

  • Database errors such as connection failures, timeouts, and query issues

  • API problems including HTTP errors, request failures, and timeouts

  • Application exceptions including common runtime errors and their causes

  • Infrastructure dependencies such as cache, queue, and external services


How It Works

Step 1: An alert fires

Your monitoring system detects an issue: a Prometheus alert, an AWS CloudWatch alarm, or a Sentry error.

Step 2: You ask Voyager

You describe the alert or paste the error message. For example: "I'm seeing a high memory alert on my production pods" or "Help me understand this database connection error."

Step 3: Kepler finds the match

Behind the scenes, Voyager queries Kepler to find the Playbook that matches your specific issue.

Step 4: You get deterministic guidance

Voyager presents the Playbook content, explaining what the alert means and what's at risk, then walking you through prioritized investigation steps.

Step 5: You investigate with confidence

You follow the proven steps, knowing they're based on expert knowledge, not AI guesswork.

The key difference is that Voyager is not making things up. It follows a curated, validated path that has been specifically designed for the alert you are facing.
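
To make "deterministic" concrete, here is a minimal sketch of a lookup in which the same alert always resolves to the same Playbook. It is an assumption for illustration only: this page does not describe how Kepler actually matches alerts, so the exact-key dictionary, the `LIBRARY` contents, and the `normalize` and `find_playbook` functions are all hypothetical stand-ins.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key: (platform, normalized alert name).
PlaybookKey = Tuple[str, str]

# Illustrative library; the real Kepler library and its entries are not shown here.
LIBRARY: Dict[PlaybookKey, str] = {
    ("prometheus", "highmemoryusage"): "Playbook: high memory usage on pods",
    ("sentry", "databaseconnectionerror"): "Playbook: database connection errors",
}


def normalize(platform: str, alert: str) -> PlaybookKey:
    """Reduce spelling variations of the same alert to a single key."""
    return (platform.strip().lower(), alert.strip().lower().replace(" ", ""))


def find_playbook(platform: str, alert: str) -> Optional[str]:
    """Return the matching Playbook, or None when there is no exact match."""
    return LIBRARY.get(normalize(platform, alert))


# The same alert, phrased two ways, resolves to the same Playbook.
print(find_playbook("Prometheus", "High Memory Usage"))
print(find_playbook("prometheus", "highmemoryusage"))

# No exact match: the caller would fall back to related or general guidance.
print(find_playbook("aws", "SomeUnlistedAlarm"))
```

The point of the sketch is the property, not the mechanism: a lookup has no randomness, so repeating the question cannot change the answer.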


Frequently Asked Questions

How does Kepler find the right Playbook for my alert?

When you describe an alert or error to Voyager, Kepler matches it against its library of Playbooks. Each Playbook is designed for specific alert types, error patterns, and issue categories. Kepler identifies the best match based on what you've described and returns the most relevant guidance.

What happens if there's no Playbook for my specific alert?

Kepler's Playbook library is continuously expanding. If an exact match isn't available, Voyager may still provide helpful context based on related Playbooks or general knowledge. As new Playbooks are added, coverage grows over time. With the upcoming custom document feature, you'll also be able to add your own guidance for issues specific to your environment.

Are Playbooks updated over time?

Yes. The Scoutflo team continuously refines Playbooks based on real-world usage, feedback, and evolving best practices. When better diagnostic approaches are identified or new patterns emerge, Playbooks are updated to reflect that knowledge.

Do Playbooks tell me how to fix the issue?

Playbooks focus on **triage and diagnosis**—helping you understand what's happening and investigate the root cause. They guide you through safe, read-only investigation steps. Once you've identified the issue, you can take appropriate remediation action based on your organization's practices and the specific situation.

Can I use Kepler for alerts from platforms not listed?

Currently, Kepler provides Playbooks for Prometheus/Kubernetes, AWS, and Sentry. Support for additional platforms may be added in the future. The upcoming custom document feature will also allow you to add guidance for any platform or tool your team uses.

How do saved RCAs help with future incidents?

When you save an RCA in Kepler, it becomes part of your organization's knowledge base. If a similar incident occurs in the future, Kepler can retrieve your past RCA and present it alongside the standard Playbook. This means the AI knows what your team already learned, what the root cause was, and how it was resolved—making the new incident much faster to troubleshoot.

