Ops Mode

What you'll learn: How Scoutflo's Ops Mode replaces complex CLI commands with natural-language conversation, so you can chat with your cluster, query GitHub, investigate Sentry issues, and explore ELK logs in plain English.

What Ops Mode is:

Scoutflo's Ops Mode provides a conversational interface for working with your entire technology stack in natural language, so you no longer need to remember complex commands and syntax.

Chat naturally with Kubernetes clusters, databases, monitoring systems, and development tools
Query multiple systems simultaneously with intelligent agent routing
Get contextualized responses that understand your infrastructure relationships
Execute operations safely with built-in guardrails and confirmation prompts
Learn and improve through conversational feedback and usage patterns

Scoutflo Ops Mode acts as an operations assistant that speaks your language while understanding your infrastructure.

How Ops Mode Works

Scoutflo's Ops Mode understands your intent, selects the appropriate systems, and translates natural language into precise technical operations:

Stage 1: Intent Recognition (2 seconds)

Natural language processing to understand operational intent
Context awareness based on previous conversations and current environment
Multi-system query detection for complex operations
Safety classification for destructive vs read-only operations
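
The safety classification step above can be pictured with a small sketch. This is not Scoutflo's implementation: the verb lists, the function name `classify_safety`, and the three labels are assumptions chosen for illustration, and a production classifier would need far more than keyword matching.

```python
import re

# Hypothetical verb lists (assumptions for illustration only).
DESTRUCTIVE_VERBS = {"delete", "scale", "rollback", "restart", "drain", "remove"}
READ_ONLY_VERBS = {"show", "list", "find", "compare", "check", "investigate", "analyze"}


def classify_safety(query: str) -> str:
    """Label a plain-English query as 'destructive', 'read-only', or 'unknown'."""
    # Treat "roll back" and "rollback" as the same verb before tokenizing.
    text = query.lower().replace("roll back", "rollback")
    words = set(re.findall(r"[a-z]+", text))
    # A query that mixes both kinds of verbs is treated as destructive.
    if words & DESTRUCTIVE_VERBS:
        return "destructive"
    if words & READ_ONLY_VERBS:
        return "read-only"
    return "unknown"


print(classify_safety("Scale the user service to 15 replicas"))
print(classify_safety("Show me pods with high memory usage in production"))
```

Letting the destructive label win when a query mixes verbs is the conservative choice: a misread query then triggers a confirmation prompt rather than an unreviewed change.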

Stage 2: Agent Selection & Routing (3 seconds)

Intelligent selection of appropriate system agents based on query content
Multi-agent coordination for queries spanning multiple systems
Authentication and authorization validation for target systems
Query optimization and parameter extraction
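
As a rough illustration of agent selection, the sketch below routes a query to every agent whose keywords appear in it. The keyword table and the name `select_agents` are hypothetical; the actual routing logic is not described in this document and is presumably more sophisticated than substring matching.

```python
# Hypothetical keyword table (an assumption, not Scoutflo's routing rules).
AGENT_KEYWORDS = {
    "kubernetes": ("pod", "replica", "node", "namespace", "cluster"),
    "sentry": ("error", "exception", "slow"),
    "github": ("commit", "pull request", "changed", "release"),
}


def select_agents(query: str) -> list[str]:
    """Return every agent whose keywords occur in the query, in table order."""
    text = query.lower()
    return [
        agent
        for agent, keywords in AGENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


# A query that spans several systems selects several agents.
print(select_agents("Did the last release cause errors on the checkout pods?"))
```

A query that matches more than one agent is what the multi-agent coordination step would then have to handle.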

Stage 3: Execution & Response (5-15 seconds)

Parallel execution across selected systems when applicable
Real-time streaming of results for long-running operations
Intelligent response formatting and context preservation
Follow-up question suggestions and related operation recommendations
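
Parallel execution across agents can be sketched with Python's `asyncio.gather`. The agent call here is a stand-in that only sleeps, and the names `query_agent`, `run_parallel`, and `fan_out` are assumptions for illustration; the point is that the wait is bounded by the slowest agent rather than the sum of all of them.

```python
import asyncio


async def query_agent(name: str, delay: float) -> tuple[str, str]:
    """Stand-in for a real agent call; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return name, f"{name} finished"


async def run_parallel(agents: dict[str, float]) -> dict[str, str]:
    """Start every agent query at once and collect the results by agent name."""
    tasks = [query_agent(name, delay) for name, delay in agents.items()]
    results = await asyncio.gather(*tasks)
    return dict(results)


def fan_out(agents: dict[str, float]) -> dict[str, str]:
    """Synchronous entry point for callers that are not already in an event loop."""
    return asyncio.run(run_parallel(agents))


print(fan_out({"kubernetes": 0.02, "sentry": 0.01, "github": 0.0}))
```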

How it works: Instead of memorizing complex CLI commands and switching between multiple tools, you simply ask questions or state what you want to accomplish in plain English, and Ops Mode handles the translation to appropriate system commands.

Intent understanding that recognizes operational goals from conversational language
Multi-system awareness that routes queries to the right tools automatically
Context preservation that remembers your session and builds on previous interactions
Safety guardrails that prevent accidental destructive operations

Example: "Show me pods that are using too much memory in production" automatically queries Kubernetes metrics, identifies memory-intensive pods, and presents results with context about resource limits and recommendations.
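
To make the example concrete, here is a minimal sketch of the kind of filtering that such a query implies: compare each pod's memory usage with its limit and report the pods above a threshold. The pod names, the numbers, the 80% threshold, and the `PodMemory` type are all invented for illustration; a real answer would come from live cluster metrics.

```python
from dataclasses import dataclass


@dataclass
class PodMemory:
    name: str
    usage_mib: int
    limit_mib: int


def memory_hogs(pods: list[PodMemory], threshold: float = 0.8) -> list[tuple[str, float]]:
    """Return (name, usage/limit) for pods above the threshold, highest ratio first."""
    flagged = [
        (pod.name, pod.usage_mib / pod.limit_mib)
        for pod in pods
        if pod.limit_mib > 0 and pod.usage_mib / pod.limit_mib > threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Invented sample data, not real cluster output.
sample = [
    PodMemory("checkout-api-1", 900, 1000),
    PodMemory("search-1", 300, 1000),
    PodMemory("worker-1", 1900, 2000),
]
for name, ratio in memory_hogs(sample):
    print(f"{name}: {ratio:.0%} of memory limit")
```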

Detailed Agent Capabilities

Kubernetes Agent:

Pod Operations: Status checks, log retrieval, resource utilization analysis
Deployment Management: Rollout status, replica scaling, health monitoring
Resource Investigation: CPU/memory usage, storage analysis, network diagnostics
Cluster Health: Node status, namespace resource allocation, event investigation

Sentry Agent:

Error Analysis: Exception tracking, error rate trends, user impact assessment
Performance Monitoring: Transaction analysis, slow query identification, bottleneck detection
Release Correlation: Error patterns across deployments, regression identification
User Experience: Performance impact on specific user segments or geographic regions

GitHub Agent:

Code Investigation: Recent changes analysis, blame tracking, dependency exploration
Pull Request Analysis: Change impact assessment, reviewer history, deployment correlation
Repository Intelligence: Code quality trends, contributor patterns, technical debt analysis
Deployment History: Release correlation with incidents, rollback analysis

Conversation Types & Use Cases

Operational Conversation Categories

Troubleshooting conversations that combine multiple data sources:

Common Patterns:

"Why is service X responding slowly?"
"Show me errors from the last hour related to authentication"
"What changed in the payment service since yesterday?"
"Are there any pods experiencing memory issues?"

Example Conversation:

User: "The checkout API seems slow today, what's going on?"

Ops Mode: "I'll investigate the checkout API performance across multiple systems..."

[Checking Kubernetes pod health...]
✅ All 12 checkout-api pods are running and healthy
📊 Average CPU: 45%, Memory: 67% (within normal ranges)

[Analyzing Sentry performance data...]
🚨 Response time increased 340% in the last 2 hours
📈 P95 latency: 2.8s (baseline: 0.8s)
🔍 Most affected endpoint: POST /api/checkout/process (67% of slow requests)

[Querying ELK for related errors...]
⚠️ Found 2,847 database timeout errors in last hour
🕒 Pattern shows timeouts correlating with payment gateway calls

[Checking recent GitHub changes...]
📝 Payment service v2.3.1 deployed 3 hours ago
🔍 Changes include new payment provider integration

Recommended next steps:
1. Check payment gateway API response times
2. Review database connection pool configuration
3. Consider rolling back payment service to v2.3.0

Would you like me to investigate any of these areas further?
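
When reading figures like the ones in this transcript, it helps to know how a percentage increase over baseline is computed. The sketch below applies the standard formula to the P95 numbers shown (2.8s against a 0.8s baseline), which works out to a 250% increase. The separate 340% figure in the transcript is reported for response time, so it presumably describes a different measurement. The function name is an assumption for illustration.

```python
def percent_increase(current: float, baseline: float) -> float:
    """Percentage increase of `current` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


p95_change = percent_increase(2.8, 0.8)
print(f"P95 latency is up {p95_change:.0f}% over baseline")  # prints 250%
```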

Infrastructure operations and resource optimization:

Common Patterns:

"Scale up the user service to 10 replicas"
"Show me which services are using the most memory"
"List all pods that have been restarted recently"
"What's the current resource utilization across clusters?"

Example Conversation:

User: "We're expecting high traffic for the sale event. Can you help me prepare the infrastructure?"

Ops Mode: "I'll help you prepare for the traffic spike. Let me analyze current capacity and scaling options..."

[Current Resource Analysis...]
📊 Current production cluster utilization:
   • CPU: 67% average (45 nodes)
   • Memory: 72% average
   • Network: 23% utilization

🎯 Services likely to need scaling:
   • product-api: 8 replicas, 89% CPU usage
   • search-service: 5 replicas, 78% memory usage
   • user-service: 12 replicas, 45% CPU usage (good)

[Historical Traffic Analysis...]
📈 Last sale event (3 months ago):
   • Traffic increased 340% over 4-hour period
   • product-api scaled to 25 replicas
   • search-service scaled to 18 replicas
   • Database connections peaked at 387 (current limit: 400)

Recommended preparation:
1. Pre-scale product-api to 15 replicas
2. Pre-scale search-service to 12 replicas
3. Increase database connection pool to 600
4. Enable aggressive autoscaling (target 60% CPU)

Would you like me to execute these scaling operations now?
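
For a sense of where replica counts like these come from, the sketch below uses the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current utilization / target utilization). It is only a baseline for today's load. It ignores the autoscaler's tolerance band and min/max bounds, and it is an assumption that Ops Mode reasons this way at all. The recommendations in the transcript are higher because they also allow for the expected traffic spike.

```python
import math


def desired_replicas(current_replicas: int, current_util: float, target_util: float) -> int:
    """HPA-style rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    if target_util <= 0:
        raise ValueError("target utilization must be positive")
    return math.ceil(current_replicas * current_util / target_util)


# Figures from the transcript, evaluated against the 60% CPU target.
print(desired_replicas(8, 89, 60))   # product-api  -> 12
print(desired_replicas(12, 45, 60))  # user-service -> 9
```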

Performance monitoring and trend analysis:

Common Patterns:

"How has error rate changed over the past week?"
"Show me SLA compliance for critical services"
"What are the slowest API endpoints right now?"
"Compare performance between staging and production"

Example Conversation:

User: "Give me a health check of our critical services"

Ops Mode: "Running comprehensive health analysis across all critical services..."

[SLA Compliance Analysis...]
✅ Payment API: 99.97% uptime (SLA: 99.95%)
✅ Authentication: 99.99% uptime (SLA: 99.9%)
⚠️  Search Service: 99.89% uptime (SLA: 99.95%) - BELOW TARGET
❌ Notification Service: 99.23% uptime (SLA: 99.9%) - SLA BREACH

[Performance Metrics - Last 24h...]
🚀 Fastest services:
   • User API: 89ms avg response time
   • Product Catalog: 124ms avg response time

🐌 Slowest services:
   • Analytics API: 2.3s avg response time (+340% from baseline)
   • Report Generation: 8.7s avg response time (within expected range)

[Error Rate Analysis...]
📊 Error rates trending:
   • Overall: 0.34% (target <0.5%) ✅
   • Search Service: 1.2% (trending up 45% this week) ⚠️
   • Analytics API: 2.1% (database timeout errors) 🚨

Critical issues requiring attention:
1. Search Service below SLA target - investigate indexing performance
2. Analytics API database timeouts - review query optimization
3. Notification Service reliability - check message queue health

Would you like me to deep-dive into any of these issues?
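
The status labels in the SLA section of this transcript can be reproduced with a simple comparison of measured uptime against the SLA. The 0.5 percentage point margin that separates "below target" from "breach" is an assumption picked to match the example output; the document does not say how Ops Mode draws that line.

```python
def sla_status(uptime: float, sla: float, breach_margin: float = 0.5) -> str:
    """Classify uptime against an SLA; both values are percentages."""
    shortfall = sla - uptime
    if shortfall <= 0:
        return "ok"
    # Assumed rule: a large shortfall is a breach, a small one is a warning.
    return "breach" if shortfall > breach_margin else "below target"


# (uptime, SLA) pairs copied from the transcript.
services = {
    "Payment API": (99.97, 99.95),
    "Authentication": (99.99, 99.9),
    "Search Service": (99.89, 99.95),
    "Notification Service": (99.23, 99.9),
}
for name, (uptime, sla) in services.items():
    print(f"{name}: {sla_status(uptime, sla)}")
```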

Best Practices for Ops Conversations

Effective Query Construction

How to ask clear, actionable questions:

Good Query Patterns:

✅ Specific and actionable:
"Show me pods with high memory usage in production"
"Scale the user service to 15 replicas"
"Find errors in payment service from the last hour"

✅ Context-aware:
"Why is checkout slow compared to yesterday?"
"What changed in authentication since the incident?"
"Compare performance between staging and production"

✅ Investigation-focused:
"Investigate the spike in database errors"
"Analyze the impact of yesterday's deployment"
"Find the root cause of login failures"

Query Patterns to Avoid:

❌ Too vague:
"Something is wrong with the system"
"Check if everything is okay"
"Why is it slow?"

❌ Missing context:
"Scale the service" (which service?)
"Show me errors" (from where? when?)
"Fix the problem" (what problem?)

❌ Overly complex:
"Scale user service to 10 replicas and check memory usage while monitoring error rates and comparing to baseline from last week unless staging shows issues"

Building effective conversation context:

Context Preservation Example:

User: "Show me pods with high CPU usage"

Ops Mode: "Found 5 pods with >80% CPU usage:
• payment-api-7b8c9: 89% CPU
• search-service-3f2a: 87% CPU
• analytics-worker-1d4e: 92% CPU
..."

User: "Check the memory usage for these pods"
# Ops Mode automatically knows "these pods" refers to the 5 high-CPU pods

Ops Mode: "Memory usage for the 5 high-CPU pods:
• payment-api-7b8c9: 67% memory (normal)
• search-service-3f2a: 94% memory (high)
• analytics-worker-1d4e: 78% memory (normal)
..."

User: "Scale up the one with high memory"
# Context preserved: refers to search-service-3f2a

Ops Mode: "Scaling the search-service deployment (which owns pod search-service-3f2a) from 5 to 8 replicas to address high memory usage..."
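
One way to picture context preservation is a session object that keeps the previous result set and resolves follow-up phrases against it. The `Session` class, its method names, and the 90% cutoff for "high" are hypothetical. Only the three pods listed in the transcript are used, since the remaining entries are elided.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Keeps the latest result set so follow-up questions can refer back to it."""
    last_results: list[dict] = field(default_factory=list)

    def remember(self, results: list[dict]) -> None:
        self.last_results = list(results)

    def these(self) -> list[dict]:
        """Resolve a phrase like 'these pods' to the previous result set."""
        return self.last_results

    def the_one_with(self, metric: str, threshold: float) -> dict:
        """Resolve 'the one with high <metric>'; refuse to guess when ambiguous."""
        matches = [r for r in self.last_results if r.get(metric, 0) > threshold]
        if len(matches) != 1:
            raise LookupError(f"expected exactly one match, found {len(matches)}")
        return matches[0]


session = Session()
session.remember([
    {"pod": "payment-api-7b8c9", "cpu": 89, "memory": 67},
    {"pod": "search-service-3f2a", "cpu": 87, "memory": 94},
    {"pod": "analytics-worker-1d4e", "cpu": 92, "memory": 78},
])
print(session.the_one_with("memory", 90)["pod"])  # search-service-3f2a
```

Raising an error when zero or several items match mirrors the guardrail idea: an ambiguous reference should lead to a clarifying question, not a guess.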

Rollback Safety:

User: "Roll back the payment service"

Ops Mode: "🚨 High Risk Operation - Detailed Confirmation Required

Operation: Rollback payment-service deployment
Current version: v2.4.1 (deployed 2 hours ago)
Target version: v2.4.0 (last stable)

⚠️ Impact Analysis:
• Service: Critical payment processing
• Active transactions: 47 in progress
• Daily transaction volume: $2.3M
• Rollback duration: ~3-5 minutes

Prerequisites check:
✅ Target version v2.4.0 is available
✅ Database schema is backward compatible
✅ No breaking configuration changes
⚠️ 47 active transactions will be interrupted

Safety measures:
• Graceful shutdown: 30 second drain period
• Health check validation before traffic restoration
• Automatic rollback if health checks fail

Please confirm by typing: 'CONFIRM ROLLBACK payment-service v2.4.0'
Or type 'cancel' to abort."
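
The typed-confirmation step at the end of this prompt can be sketched as an exact-match check. The function names and the three outcomes are assumptions for illustration. The design point is that the operator must repeat the service and target version, so a habitual "yes" cannot trigger the rollback.

```python
def expected_phrase(service: str, version: str) -> str:
    """Build the exact phrase the operator must type."""
    return f"CONFIRM ROLLBACK {service} {version}"


def check_confirmation(typed: str, service: str, version: str) -> str:
    """Return 'proceed', 'cancel', or 'retry' for the operator's reply."""
    reply = typed.strip()
    if reply.lower() == "cancel":
        return "cancel"
    # The match is exact and case-sensitive, including service and version.
    if reply == expected_phrase(service, version):
        return "proceed"
    return "retry"


print(check_confirmation("CONFIRM ROLLBACK payment-service v2.4.0", "payment-service", "v2.4.0"))
print(check_confirmation("yes", "payment-service", "v2.4.0"))
```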


