AWS EKS Best Practices Guide
Principle of Least Privilege: Ensure the AWS credentials you use have only the permissions necessary to create and manage EKS clusters and associated resources (VPC, subnets, security groups, IAM roles, etc.). AWS recommends using IAM Access Analyzer to identify resources shared with external entities and refine permissions.
Billing Awareness: Double-check you are using the intended AWS account so that resources (and subsequent costs) are billed to the correct account (especially in multi-account setups). AWS recommends using AWS Organizations for centralized billing and cost management.
AWS Organization / Multi-Account Strategy: If your organization has a multi-account strategy (e.g., separate Dev, Staging, Prod accounts), make sure you are deploying to the right environment to keep resources isolated and manageable. AWS recommends using AWS Control Tower for setting up multi-account environments with guardrails.
Temporary Credentials: AWS recommends using temporary credentials with appropriate session duration rather than long-term access keys.
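As a concrete illustration of the temporary-credentials recommendation, here is a minimal boto3 sketch that assumes an IAM role and uses the resulting short-lived session for EKS calls. The role ARN, session name, duration, and region are placeholders; named CLI profiles, IAM Identity Center (SSO), or instance roles achieve the same goal.

```python
import boto3

# Placeholder role ARN -- use a role scoped to EKS/VPC/IAM actions only.
ROLE_ARN = "arn:aws:iam::123456789012:role/eks-cluster-admin"

sts = boto3.client("sts")

# Request short-lived credentials (1 hour here) instead of long-term access keys.
resp = sts.assume_role(
    RoleArn=ROLE_ARN,
    RoleSessionName="eks-provisioning",
    DurationSeconds=3600,
)
creds = resp["Credentials"]

# Build a session from the temporary credentials and use it for all EKS work.
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
    region_name="us-east-1",  # assumption: adjust to your target region
)
eks = session.client("eks")
print(eks.list_clusters()["clusters"])
```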
Uniqueness: EKS cluster names must be unique per region per account. Once created, the cluster name cannot be changed.
Naming Conventions:
Use a clear naming pattern, for example company-environment-appname-eks or staging-frontend-eks.
Avoid disallowed characters (AWS typically restricts certain special characters); stick to alphanumerics, hyphens (-), and underscores (_).
Include environment labels if relevant (e.g., prod, dev) to easily identify your cluster's purpose.
AWS recommends using a consistent tagging strategy across all resources for better organization and cost tracking.
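As a small sketch of the naming convention above, the helper below builds a name following the company-environment-appname-eks pattern and checks it against this guide's allowed character set. The pattern and the 100-character cap are this guide's conventions and an assumption, not an authoritative AWS validation rule, so a name that passes here could still be rejected by the EKS API.

```python
import re

# Allowed characters per this guide's convention: alphanumerics, "-" and "_",
# starting with an alphanumeric character.
NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]*$")

def build_cluster_name(company: str, environment: str, app: str) -> str:
    """Build a name like company-environment-appname-eks and sanity-check it."""
    name = f"{company}-{environment}-{app}-eks"
    if not NAME_RE.match(name):
        raise ValueError(f"Cluster name contains disallowed characters: {name}")
    if len(name) > 100:  # assumption: keep names well under service limits
        raise ValueError(f"Cluster name too long: {name}")
    return name

print(build_cluster_name("acme", "staging", "frontend"))  # acme-staging-frontend-eks
```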
This section typically configures the networking and node groups that will run your workloads.
Existing vs. New VPC:
Private vs. Public Subnets:
Node groups define the EC2 instances (workers) that run your Kubernetes workloads.
Min Node, Desired Node, Max Node
Right-Sizing:
Desired Node is the node count you expect to run under normal conditions. AWS recommends setting this based on your typical workload requirements plus a buffer for unexpected spikes.
Instance Family / Instance Type / Arch Type
Workload Requirements:
Capacity Type
IAM Roles and Permissions (typically configured as part of node group creation):
Storage Configuration:
Validation: Before clicking "Deploy Cluster", verify all fields and confirm:
You have selected the correct AWS account and region.
The cluster name adheres to naming best practices.
The VPC and subnets are correctly configured for high availability and enough IP space.
The node group autoscaling configuration matches your expected workload.
The instance type, family, and capacity type are cost-effective and suitable for your application needs.
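Parts of this checklist can be scripted. Below is a minimal boto3 sketch that confirms the active credentials belong to the intended account and that the session targets the intended region before anything is deployed; the expected account ID and region are placeholders.

```python
import boto3

EXPECTED_ACCOUNT = "123456789012"  # placeholder: your intended account ID
EXPECTED_REGION = "us-east-1"      # placeholder: the region you intend to deploy to

session = boto3.Session()
identity = session.client("sts").get_caller_identity()

# Fail fast if the credentials or region do not match the intended target.
assert identity["Account"] == EXPECTED_ACCOUNT, (
    f"Wrong AWS account: {identity['Account']} (expected {EXPECTED_ACCOUNT})"
)
assert session.region_name == EXPECTED_REGION, (
    f"Wrong region: {session.region_name} (expected {EXPECTED_REGION})"
)
print(f"Deploying as {identity['Arn']} in {session.region_name}")
```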
Logging & Monitoring: After creation, enable or confirm you have control-plane logging, container logs shipping to CloudWatch, and relevant metrics set up for better observability. AWS recommends:
Enabling all EKS control plane logs (API server, audit, authenticator, controller manager, scheduler)
Implementing Container Insights for comprehensive monitoring
Setting up Prometheus and Grafana for advanced metrics
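For the control-plane logging item, a minimal boto3 sketch is shown below; it enables all five log types on an existing cluster (the cluster name is a placeholder). Container Insights and Prometheus/Grafana are typically installed in-cluster (for example via the CloudWatch agent or Helm) and are not covered here.

```python
import boto3

eks = boto3.client("eks")

# Enable all five control plane log types on an existing cluster.
eks.update_cluster_config(
    name="staging-frontend-eks",  # placeholder cluster name
    logging={
        "clusterLogging": [
            {
                "types": [
                    "api",
                    "audit",
                    "authenticator",
                    "controllerManager",
                    "scheduler",
                ],
                "enabled": True,
            }
        ]
    },
)
```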
Security Hardening: Restrict public access to the EKS control plane unless required. Use security groups and network policies to limit traffic. AWS recommends:
Enabling private API endpoint access for production clusters
Implementing Kubernetes Network Policies for pod-to-pod communication controls
Using the AWS Security Groups for Pods feature to apply fine-grained security group rules
Implementing encryption for secrets and EBS volumes
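A hedged boto3 sketch for the endpoint-access and secrets-encryption items follows; the cluster name and KMS key ARN are placeholders. Network Policies and Security Groups for Pods are configured in-cluster (via a network policy engine and the VPC CNI) and are not shown here.

```python
import boto3

eks = boto3.client("eks")

# Restrict the Kubernetes API server endpoint: private access only.
eks.update_cluster_config(
    name="prod-frontend-eks",  # placeholder cluster name
    resourcesVpcConfig={
        "endpointPublicAccess": False,
        "endpointPrivateAccess": True,
    },
)

# Envelope-encrypt Kubernetes secrets with a KMS key (ARN is a placeholder).
# Note: EKS processes cluster updates one at a time, so in practice wait for
# the endpoint update above to finish before issuing this call.
eks.associate_encryption_config(
    clusterName="prod-frontend-eks",
    encryptionConfig=[
        {
            "resources": ["secrets"],
            "provider": {"keyArn": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE"},
        }
    ],
)
```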
Resource Availability: Confirm that your chosen region supports EKS and all the complementary services/features you plan to use (e.g., managed node groups, specific instance types). Not all AWS regions support all EKS features.
Proximity & Latency: Choose a region close to your users or to other AWS services you rely on, reducing latency. AWS recommends using Amazon CloudFront for content delivery to further reduce latency for global users.
Cost & Compliance: Different regions can have different costs and data residency regulations. Make sure to pick the region that meets compliance or data residency needs if applicable. AWS provides the AWS Pricing Calculator to estimate costs across regions.
High Availability: Some regions have more Availability Zones (AZs) than others. AWS recommends deploying across at least three AZs for production workloads to improve redundancy for your EKS clusters.
Kubernetes Version: AWS recommends using the latest supported Kubernetes version for new clusters and staying within one or two minor versions of the latest for existing clusters.
Existing VPC: Make sure it is properly set up with private subnets, NAT gateways, and the necessary routing for EKS. Validate that it has enough IP address space (CIDR blocks) to handle the expected number of pods and nodes. AWS recommends a minimum CIDR block size of /24 for each subnet.
New VPC: If auto-generating a new VPC, ensure you specify a sufficiently large CIDR range and consider private/public subnet segmentation. AWS recommends using the EKS VPC quick start template for proper configuration.
IP Address Management: EKS assigns IP addresses to each pod (if using the AWS VPC CNI). If your IP space is limited, you may encounter IP exhaustion. Plan the CIDR blocks accordingly. AWS recommends using custom networking and secondary CIDR blocks for large deployments.
VPC Limits: Keep in mind AWS default VPC and subnet limits. If you plan on multiple clusters, check that you're not nearing any resource quotas (VPC count, NAT gateways, route tables, etc.). AWS recommends using AWS Service Quotas to monitor and request increases when needed.
VPC Endpoints: AWS recommends using VPC Endpoints for private connectivity to AWS services, reducing data transfer costs and improving security.
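As an illustration of the VPC-endpoint recommendation, here is a boto3 sketch that creates a gateway endpoint for S3 and an interface endpoint for the ECR API in an existing VPC. All IDs and the region embedded in the service names are placeholders, and private clusters typically also need endpoints for ecr.dkr, ec2, sts, and logs.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VPC_ID = "vpc-0123456789abcdef0"              # placeholder
ROUTE_TABLE_IDS = ["rtb-0123456789abcdef0"]   # placeholder
SUBNET_IDS = ["subnet-aaa", "subnet-bbb"]     # placeholder private subnets
SG_IDS = ["sg-0123456789abcdef0"]             # placeholder security group

# Gateway endpoint for S3 (no hourly charge, attached to route tables).
ec2.create_vpc_endpoint(
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=ROUTE_TABLE_IDS,
)

# Interface endpoint for the ECR API so nodes can pull images privately.
ec2.create_vpc_endpoint(
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.ecr.api",
    VpcEndpointType="Interface",
    SubnetIds=SUBNET_IDS,
    SecurityGroupIds=SG_IDS,
    PrivateDnsEnabled=True,
)
```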
Minimum Two Subnets: EKS requires at least two subnets in different Availability Zones for high availability of the control plane and the worker nodes. For production environments, AWS recommends using three or more AZs.
Private Subnets for Worker Nodes: It's generally recommended to place worker nodes in private subnets for security. AWS recommends this approach for all production workloads.
Public Subnets for Load Balancers: If you plan to expose services publicly, typically you attach load balancers to public subnets. AWS recommends using AWS Load Balancer Controller for optimal load balancer configuration.
Sufficient AZ Spread: Select subnets that span at least two (preferably three) AZs for better fault tolerance. AWS recommends evenly distributing workloads across all available AZs.
Tagging: EKS automatically looks for subnets tagged with the appropriate Kubernetes tags, so ensure your subnets are properly tagged if you're reusing them. AWS provides specific tagging requirements for subnet discovery (see the sketch below).
CIDR Planning: AWS recommends allocating sufficiently large subnet CIDR blocks to accommodate node and pod growth (/24 or larger for each subnet is recommended).
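The subnet tagging requirement mentioned above can be applied with a short boto3 sketch like the one below, assuming a cluster named prod-frontend-eks and placeholder subnet IDs: public subnets get the kubernetes.io/role/elb tag for internet-facing load balancers, private subnets get kubernetes.io/role/internal-elb, and both carry the shared cluster tag.

```python
import boto3

ec2 = boto3.client("ec2")

CLUSTER = "prod-frontend-eks"                        # placeholder cluster name
PUBLIC_SUBNETS = ["subnet-pub-a", "subnet-pub-b"]    # placeholder subnet IDs
PRIVATE_SUBNETS = ["subnet-priv-a", "subnet-priv-b"] # placeholder subnet IDs

# Public subnets: discoverable for internet-facing load balancers.
ec2.create_tags(
    Resources=PUBLIC_SUBNETS,
    Tags=[
        {"Key": "kubernetes.io/role/elb", "Value": "1"},
        {"Key": f"kubernetes.io/cluster/{CLUSTER}", "Value": "shared"},
    ],
)

# Private subnets: discoverable for internal load balancers.
ec2.create_tags(
    Resources=PRIVATE_SUBNETS,
    Tags=[
        {"Key": "kubernetes.io/role/internal-elb", "Value": "1"},
        {"Key": f"kubernetes.io/cluster/{CLUSTER}", "Value": "shared"},
    ],
)
```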
Min Node should be at least 1–2 to ensure the cluster can handle baseline workloads. AWS recommends having at least one node per AZ for high availability.
Max Node should be large enough to handle peak traffic without exhausting capacity. AWS recommends setting appropriate service quotas to ensure you can scale to this limit.
Autoscaling: Make sure the range is realistic so that the cluster can scale up/down efficiently based on workload. AWS recommends using Cluster Autoscaler with appropriate scan interval and scale-down utilization threshold settings.
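As a sketch of how these min/desired/max values map onto a managed node group, the boto3 call below creates a node group with a 2-3-10 scaling range; the cluster name, subnets, node role ARN, and instance type are placeholders. The Cluster Autoscaler then adjusts the actual node count within this range.

```python
import boto3

eks = boto3.client("eks")

eks.create_nodegroup(
    clusterName="staging-frontend-eks",          # placeholder
    nodegroupName="general-purpose",
    subnets=["subnet-priv-a", "subnet-priv-b"],  # placeholder private subnets
    nodeRole="arn:aws:iam::123456789012:role/eks-node-role",  # placeholder
    instanceTypes=["m5.large"],
    capacityType="ON_DEMAND",
    scalingConfig={
        "minSize": 2,      # baseline capacity (at least one node per AZ)
        "desiredSize": 3,  # expected steady-state node count
        "maxSize": 10,     # ceiling for peak traffic
    },
)
```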
Choose an instance type (e.g., t3.large, m5.large) that balances CPU, memory, and network performance for your applications. AWS recommends using the Amazon EC2 Instance Selector tool to identify optimal instance types.
For specialized workloads (e.g., GPU or high-memory), pick the corresponding instance families (p2, p3, g4, r5, etc.). AWS provides optimized AMIs for GPU workloads on EKS.
ARM vs. x86: If your applications can run on ARM (Graviton) architectures, using arm64 can sometimes be more cost-effective. AWS Graviton processors can provide up to 40% better price/performance compared to equivalent x86-based instances.
Reserved vs. Spot: If using Spot, consider using multiple instance types to improve availability. AWS recommends using Spot Instances for stateless, fault-tolerant workloads and Reserved Instances for predictable, long-running workloads.
Bottlerocket: AWS recommends considering Bottlerocket, a purpose-built Linux-based operating system for running containers, for improved security and reduced operational overhead.
On-Demand: More reliable but higher cost. AWS recommends using On-Demand for critical production workloads that require guaranteed availability.
Spot: Cheaper but can be interrupted. Best practice is to use a mixed strategy (Spot + On-Demand) for cost optimization while maintaining reliability. AWS provides Spot interruption handler for graceful termination.
If using Spot, ensure you have a fallback On-Demand node group or at least capacity-optimized Spot allocation strategies to reduce the risk of interruption. AWS recommends implementing the Node Termination Handler for graceful pod evacuation upon Spot instance termination notice.
Karpenter: AWS recommends considering Karpenter as a flexible, high-performance Kubernetes cluster autoscaler that helps improve application availability and cluster efficiency.
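One common pattern from the capacity-type guidance above is an On-Demand node group for baseline capacity plus a Spot node group spanning several interchangeable instance types; a hedged boto3 sketch of the Spot side follows (cluster name, subnets, and role ARN are placeholders). Managed node groups use a capacity-optimized Spot allocation strategy and drain interrupted nodes, while self-managed groups would pair this with the Node Termination Handler mentioned above.

```python
import boto3

eks = boto3.client("eks")

# Spot node group: several interchangeable instance types improve the chance
# that capacity is available in each AZ when one Spot pool is interrupted.
eks.create_nodegroup(
    clusterName="staging-frontend-eks",          # placeholder
    nodegroupName="spot-workers",
    capacityType="SPOT",
    instanceTypes=["m5.large", "m5a.large", "m6i.large", "m4.large"],
    subnets=["subnet-priv-a", "subnet-priv-b"],  # placeholder private subnets
    nodeRole="arn:aws:iam::123456789012:role/eks-node-role",  # placeholder
    scalingConfig={"minSize": 0, "desiredSize": 2, "maxSize": 20},
    labels={"capacity": "spot"},  # lets you steer fault-tolerant pods here
)
```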
Ensure the node group has an appropriate IAM role (AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, etc.) to allow workers to communicate with the control plane. AWS recommends creating dedicated IAM roles for each node group.
Use IRSA (IAM Roles for Service Accounts) to grant pods the least privilege they need. AWS recommends this approach over providing broad permissions to the entire node.
AWS recommends implementing IAM Access Analyzer to identify unintended access to your resources and data.
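To make the IRSA recommendation concrete, the sketch below builds the web-identity trust policy that ties an IAM role to a single Kubernetes service account through the cluster's OIDC provider; the account ID, OIDC issuer ID, namespace, service account, and role name are all placeholders. In practice, eksctl or the console can generate this for you.

```python
import json
import boto3

ACCOUNT_ID = "123456789012"                       # placeholder
OIDC_ID = "EXAMPLED539D4633E53DE1B71EXAMPLE"      # placeholder OIDC issuer ID
REGION = "us-east-1"                              # placeholder
NAMESPACE, SERVICE_ACCOUNT = "default", "my-app"  # placeholder service account

oidc_provider = f"oidc.eks.{REGION}.amazonaws.com/id/{OIDC_ID}"

# Trust policy: only pods running as this service account can assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{ACCOUNT_ID}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{oidc_provider}:sub": f"system:serviceaccount:{NAMESPACE}:{SERVICE_ACCOUNT}",
                    f"{oidc_provider}:aud": "sts.amazonaws.com",
                }
            },
        }
    ],
}

iam = boto3.client("iam")
iam.create_role(
    RoleName="my-app-irsa-role",  # placeholder role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
# Afterwards, attach only the narrowly scoped policies the pod actually needs.
```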
AWS recommends using EBS CSI Driver for persistent storage needs in EKS clusters.
For shared file storage, AWS recommends EFS CSI Driver which supports ReadWriteMany access mode.
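The EBS CSI driver can be installed as an EKS managed add-on; a minimal boto3 sketch follows, where the cluster name and the IRSA role ARN for the driver are placeholders. The EFS CSI driver is installed similarly (as an add-on where available, or via Helm) when ReadWriteMany volumes are needed.

```python
import boto3

eks = boto3.client("eks")

# Install the EBS CSI driver as a managed add-on, bound to an IRSA role that
# carries the driver's EBS permissions (role creation not shown here).
eks.create_addon(
    clusterName="staging-frontend-eks",  # placeholder
    addonName="aws-ebs-csi-driver",
    serviceAccountRoleArn="arn:aws:iam::123456789012:role/ebs-csi-irsa",  # placeholder
    resolveConflicts="OVERWRITE",
)
```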
AWS recommends using eksctl or AWS CloudFormation templates for repeatable, version-controlled cluster deployments.
Tagging & Resource Management: Apply consistent tags (e.g., Environment=Production, Application=MyApp, Owner=TeamName) to the cluster, node groups, and VPC resources for better visibility and cost tracking. AWS recommends implementing tag-based access control and cost allocation.
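Tags can be applied at creation time or retroactively; the boto3 sketch below tags an existing cluster by ARN with the example tags above (the ARN is a placeholder). Node groups, VPCs, and subnets would be tagged with the same keys through their own tagging APIs.

```python
import boto3

eks = boto3.client("eks")

# Apply the guide's example tags to an existing cluster (ARN is a placeholder).
eks.tag_resource(
    resourceArn="arn:aws:eks:us-east-1:123456789012:cluster/prod-frontend-eks",
    tags={
        "Environment": "Production",
        "Application": "MyApp",
        "Owner": "TeamName",
    },
)
```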
Right-sizing Resources: AWS recommends regularly analyzing resource utilization and adjusting instance types and quantities to match actual needs.
Spot Instances: Use Spot Instances for non-critical workloads to save up to 90% compared to On-Demand pricing.
Savings Plans: Consider Savings Plans for predictable workloads; Compute Savings Plans can save up to 66% and EC2 Instance Savings Plans up to 72% compared to On-Demand pricing.
Cluster Scaling: Implement efficient cluster scaling to adjust capacity based on demand, avoiding over-provisioning.
Cost Monitoring: Use AWS Cost Explorer and Kubernetes cost allocation tags to track and analyze EKS-related spending.
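As an example of tag-based cost tracking, the boto3 sketch below queries Cost Explorer for monthly unblended cost grouped by the Environment tag; the date range is a placeholder, and the tag must first be activated as a cost allocation tag in the Billing console before it appears in results.

```python
import boto3

ce = boto3.client("ce")

# Monthly cost grouped by the Environment cost allocation tag.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-04-01"},  # placeholder range
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "Environment"}],
)

for period in resp["ResultsByTime"]:
    for group in period["Groups"]:
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(period["TimePeriod"]["Start"], group["Keys"][0], amount)
```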