Amazon EKS and Grafana stack
Build secure Amazon EKS cluster with Grafana stack
I will outline the steps for setting up an Amazon EKS environment that prioritizes security, including the configuration of standard applications.
The Amazon EKS setup should align with the following criteria:
- Utilize two Availability Zones (AZs), or a single AZ where possible, to reduce cross-AZ traffic charges
- Use Spot Instances
- Use a less expensive region: us-east-1
- Use a price-efficient EC2 instance type: t4g.medium (2 vCPUs, 4 GB RAM), based on ARM AWS Graviton
- Use Bottlerocket OS for a minimal operating system, CPU, and memory footprint
- Leverage Network Load Balancer (NLB) for highly cost-effective and optimized load balancing, seamlessly integrated with kgateway.
- Karpenter to enable automatic node scaling that matches the specific resource requirements of pods
- The Amazon EKS control plane must be encrypted using KMS
- Worker node EBS volumes must be encrypted
- Cluster logging to CloudWatch must be configured
- Network Policies should be enabled where supported
- EKS Pod Identities should be used to allow applications and pods to communicate with AWS APIs
Build Amazon EKS
Requirements
You will need to configure the AWS CLI and set up other necessary secrets and variables.
# AWS Credentials
export AWS_ACCESS_KEY_ID="xxxxxxxxxxxxxxxxxx"
export AWS_SECRET_ACCESS_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export AWS_SESSION_TOKEN="xxxxxxxx"
export AWS_ROLE_TO_ASSUME="arn:aws:iam::7xxxxxxxxxx7:role/Gixxxxxxxxxxxxxxxxxxxxle"
export GOOGLE_CLIENT_ID="10xxxxxxxxxxxxxxxud.apps.googleusercontent.com"
export GOOGLE_CLIENT_SECRET="GOxxxxxxxxxxxxxxxtw"
If you plan to follow this document and its tasks, you will need to set up a few environment variables, such as:
# AWS Region
export AWS_REGION="${AWS_REGION:-us-east-1}"
# Hostname / FQDN definitions
export CLUSTER_FQDN="${CLUSTER_FQDN:-k01.k8s.mylabs.dev}"
# Base Domain: k8s.mylabs.dev
export BASE_DOMAIN="${CLUSTER_FQDN#*.}"
# Cluster Name: k01
export CLUSTER_NAME="${CLUSTER_FQDN%%.*}"
export MY_EMAIL="petr.ruzicka@gmail.com"
export TMP_DIR="${TMP_DIR:-${PWD}}"
export KUBECONFIG="${KUBECONFIG:-${TMP_DIR}/${CLUSTER_FQDN}/kubeconfig-${CLUSTER_NAME}.conf}"
# Tags used to tag the AWS resources
export TAGS="${TAGS:-Owner=${MY_EMAIL},Environment=dev,Cluster=${CLUSTER_FQDN}}"
export AWS_PARTITION="aws"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text) && export AWS_ACCOUNT_ID
mkdir -pv "${TMP_DIR}/${CLUSTER_FQDN}"
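The `${CLUSTER_FQDN#*.}` and `${CLUSTER_FQDN%%.*}` expansions above derive the base domain and cluster name from the FQDN. A quick illustration of the two expansions, using the example value from this document:

```shell
CLUSTER_FQDN="k01.k8s.mylabs.dev"
# "#*." strips the shortest prefix up to and including the first dot -> base domain
echo "${CLUSTER_FQDN#*.}"   # k8s.mylabs.dev
# "%%.*" strips the longest suffix starting at the first dot -> cluster name
echo "${CLUSTER_FQDN%%.*}"  # k01
```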
Confirm that all essential variables have been properly configured:
: "${AWS_ACCESS_KEY_ID?}"
: "${AWS_REGION?}"
: "${AWS_SECRET_ACCESS_KEY?}"
: "${AWS_ROLE_TO_ASSUME?}"
: "${GOOGLE_CLIENT_ID?}"
: "${GOOGLE_CLIENT_SECRET?}"
echo -e "${MY_EMAIL} | ${CLUSTER_NAME} | ${BASE_DOMAIN} | ${CLUSTER_FQDN}\n${TAGS}"
Install the required tools:
You can skip these steps if you already have all the essential tools installed.
Configure AWS Route 53 Domain delegation
The DNS delegation tasks should be executed as a one-time operation.
Create a DNS zone for the EKS clusters:
export CLOUDFLARE_EMAIL="petr.ruzicka@gmail.com"
export CLOUDFLARE_API_KEY="1xxxxxxxxx0"
aws route53 create-hosted-zone --output json \
--name "${BASE_DOMAIN}" \
--caller-reference "$(date)" \
--hosted-zone-config="{\"Comment\": \"Created by petr.ruzicka@gmail.com\", \"PrivateZone\": false}" | jq
Utilize your domain registrar to update the nameservers for your zone (e.g., mylabs.dev) to point to Amazon Route 53 nameservers. Here’s how to discover the required Route 53 nameservers:
NEW_ZONE_ID=$(aws route53 list-hosted-zones --query "HostedZones[?Name==\`${BASE_DOMAIN}.\`].Id" --output text)
NEW_ZONE_NS=$(aws route53 get-hosted-zone --output json --id "${NEW_ZONE_ID}" --query "DelegationSet.NameServers")
NEW_ZONE_NS1=$(echo "${NEW_ZONE_NS}" | jq -r ".[0]")
NEW_ZONE_NS2=$(echo "${NEW_ZONE_NS}" | jq -r ".[1]")
Establish the NS record in k8s.mylabs.dev (your BASE_DOMAIN) for proper zone delegation. This operation’s specifics may vary based on your domain registrar; I use Cloudflare and employ Ansible for automation:
ansible -m cloudflare_dns -c local -i "localhost," localhost -a "zone=mylabs.dev record=${BASE_DOMAIN} type=NS value=${NEW_ZONE_NS1} solo=true proxied=no account_email=${CLOUDFLARE_EMAIL} account_api_token=${CLOUDFLARE_API_KEY}"
ansible -m cloudflare_dns -c local -i "localhost," localhost -a "zone=mylabs.dev record=${BASE_DOMAIN} type=NS value=${NEW_ZONE_NS2} solo=false proxied=no account_email=${CLOUDFLARE_EMAIL} account_api_token=${CLOUDFLARE_API_KEY}"
localhost | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"result": {
"record": {
"content": "ns-885.awsdns-46.net",
"created_on": "2020-11-13T06:25:32.18642Z",
"id": "dxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxb",
"locked": false,
"meta": {
"auto_added": false,
"managed_by_apps": false,
"managed_by_argo_tunnel": false,
"source": "primary"
},
"modified_on": "2020-11-13T06:25:32.18642Z",
"name": "k8s.mylabs.dev",
"proxiable": false,
"proxied": false,
"ttl": 1,
"type": "NS",
"zone_id": "2xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxe",
"zone_name": "mylabs.dev"
}
}
}
localhost | CHANGED => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": true,
"result": {
"record": {
"content": "ns-1692.awsdns-19.co.uk",
"created_on": "2020-11-13T06:25:37.605605Z",
"id": "9xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxb",
"locked": false,
"meta": {
"auto_added": false,
"managed_by_apps": false,
"managed_by_argo_tunnel": false,
"source": "primary"
},
"modified_on": "2020-11-13T06:25:37.605605Z",
"name": "k8s.mylabs.dev",
"proxiable": false,
"proxied": false,
"ttl": 1,
"type": "NS",
"zone_id": "2xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxe",
"zone_name": "mylabs.dev"
}
}
}
Create the service-linked role
Creating the service-linked role for Spot Instances is a one-time operation.
Create the AWSServiceRoleForEC2Spot role to use Spot Instances in the Amazon EKS cluster:
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
Details: Work with Spot Instances
Create Route53 zone and KMS key infrastructure
Generate a CloudFormation template that defines an Amazon Route 53 zone and an AWS Key Management Service (KMS) key.
Add the new domain CLUSTER_FQDN to Route 53, and set up DNS delegation from the BASE_DOMAIN.
tee "${TMP_DIR}/${CLUSTER_FQDN}/aws-cf-route53-kms.yml" << \EOF
AWSTemplateFormatVersion: 2010-09-09
Description: Route53 entries and KMS key
Parameters:
BaseDomain:
Description: "Base domain where cluster domains + their subdomains will live - Ex: k8s.mylabs.dev"
Type: String
ClusterFQDN:
Description: "Cluster FQDN (domain for all applications) - Ex: k01.k8s.mylabs.dev"
Type: String
ClusterName:
Description: "Cluster Name - Ex: k01"
Type: String
Resources:
HostedZone:
Type: AWS::Route53::HostedZone
Properties:
Name: !Ref ClusterFQDN
RecordSet:
Type: AWS::Route53::RecordSet
Properties:
HostedZoneName: !Sub "${BaseDomain}."
Name: !Ref ClusterFQDN
Type: NS
TTL: 60
ResourceRecords: !GetAtt HostedZone.NameServers
KMSAlias:
Type: AWS::KMS::Alias
Properties:
AliasName: !Sub "alias/eks-${ClusterName}"
TargetKeyId: !Ref KMSKey
KMSKey:
Type: AWS::KMS::Key
Properties:
Description: !Sub "KMS key for ${ClusterName} Amazon EKS"
EnableKeyRotation: true
PendingWindowInDays: 7
KeyPolicy:
Version: "2012-10-17"
Id: !Sub "eks-key-policy-${ClusterName}"
Statement:
- Sid: Allow direct access to key metadata to the account
Effect: Allow
Principal:
AWS:
- !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:root"
Action:
- kms:*
Resource: "*"
- Sid: Allow access through EBS for all principals in the account that are authorized to use EBS
Effect: Allow
Principal:
AWS: "*"
Action:
- kms:Encrypt
- kms:Decrypt
- kms:ReEncrypt*
- kms:GenerateDataKey*
- kms:CreateGrant
- kms:DescribeKey
Resource: "*"
Condition:
StringEquals:
kms:ViaService: !Sub "ec2.${AWS::Region}.amazonaws.com"
kms:CallerAccount: !Sub "${AWS::AccountId}"
S3AccessPolicy:
Type: AWS::IAM::ManagedPolicy
Properties:
ManagedPolicyName: !Sub "eksctl-${ClusterName}-s3-access-policy"
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- s3:GetObject
- s3:DeleteObject
- s3:PutObject
- s3:PutObjectTagging
- s3:AbortMultipartUpload
- s3:ListMultipartUploadParts
Resource: !Sub "arn:aws:s3:::${ClusterFQDN}/*"
- Effect: Allow
Action:
- s3:ListBucket
Resource: !Sub "arn:aws:s3:::${ClusterFQDN}"
Outputs:
KMSKeyArn:
Description: The ARN of the created KMS Key to encrypt EKS related services
Value: !GetAtt KMSKey.Arn
Export:
Name:
Fn::Sub: "${AWS::StackName}-KMSKeyArn"
KMSKeyId:
Description: The ID of the created KMS Key to encrypt EKS related services
Value: !Ref KMSKey
Export:
Name:
Fn::Sub: "${AWS::StackName}-KMSKeyId"
S3AccessPolicyArn:
Description: IAM policy ARN for S3 access by EKS workloads
Value: !Ref S3AccessPolicy
Export:
Name:
Fn::Sub: "${AWS::StackName}-S3AccessPolicy"
EOF
# shellcheck disable=SC2001
eval aws cloudformation deploy --capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides "BaseDomain=${BASE_DOMAIN} ClusterFQDN=${CLUSTER_FQDN} ClusterName=${CLUSTER_NAME}" \
--stack-name "${CLUSTER_NAME}-route53-kms" --template-file "${TMP_DIR}/${CLUSTER_FQDN}/aws-cf-route53-kms.yml" --tags "${TAGS//,/ }"
AWS_CLOUDFORMATION_DETAILS=$(aws cloudformation describe-stacks --stack-name "${CLUSTER_NAME}-route53-kms" --query "Stacks[0].Outputs[?OutputKey==\`KMSKeyArn\` || OutputKey==\`KMSKeyId\` || OutputKey==\`S3AccessPolicyArn\`].{OutputKey:OutputKey,OutputValue:OutputValue}")
AWS_KMS_KEY_ARN=$(echo "${AWS_CLOUDFORMATION_DETAILS}" | jq -r ".[] | select(.OutputKey==\"KMSKeyArn\") .OutputValue")
AWS_KMS_KEY_ID=$(echo "${AWS_CLOUDFORMATION_DETAILS}" | jq -r ".[] | select(.OutputKey==\"KMSKeyId\") .OutputValue")
AWS_S3_ACCESS_POLICY_ARN=$(echo "${AWS_CLOUDFORMATION_DETAILS}" | jq -r ".[] | select(.OutputKey==\"S3AccessPolicyArn\") .OutputValue")
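The `jq` filters above pull individual output values out of the `describe-stacks` JSON. The same `select` pattern can be exercised locally on a sample payload (the ARN below is a placeholder, not a real key):

```shell
# Sample of the Outputs array returned by "aws cloudformation describe-stacks"
AWS_CLOUDFORMATION_DETAILS='[
  {"OutputKey":"KMSKeyArn","OutputValue":"arn:aws:kms:us-east-1:111111111111:key/placeholder"},
  {"OutputKey":"KMSKeyId","OutputValue":"placeholder"}
]'
# select() keeps only the object whose OutputKey matches; .OutputValue extracts the value
echo "${AWS_CLOUDFORMATION_DETAILS}" | jq -r '.[] | select(.OutputKey=="KMSKeyId") .OutputValue'
# -> placeholder
```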
After running the CloudFormation stack, you should see the following Route53 zones:
Route53 k01.k8s.mylabs.dev zone
You should also see the following KMS key:
Create Karpenter infrastructure
Use CloudFormation to set up the infrastructure needed by the EKS cluster. See the CloudFormation section of Karpenter's getting-started guide for a complete description of what cloudformation.yaml provisions for Karpenter.
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/website/content/en/v1.8/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TMP_DIR}/${CLUSTER_FQDN}/cloudformation-karpenter.yml"
eval aws cloudformation deploy \
--stack-name "${CLUSTER_NAME}-karpenter" \
--template-file "${TMP_DIR}/${CLUSTER_FQDN}/cloudformation-karpenter.yml" \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides "ClusterName=${CLUSTER_NAME}" --tags "${TAGS//,/ }"
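Both CloudFormation deployments pass `--tags "${TAGS//,/ }"`, which relies on a bash pattern substitution: `${TAGS//,/ }` replaces every comma with a space, turning the comma-separated tag list into the space-separated `Key=Value` list that `aws cloudformation deploy --tags` expects. In isolation, with a sample value:

```shell
TAGS="Owner=me@example.com,Environment=dev,Cluster=k01.k8s.mylabs.dev"
# ${VAR//pattern/replacement} substitutes every occurrence (bash-only syntax)
echo "${TAGS//,/ }"
# -> Owner=me@example.com Environment=dev Cluster=k01.k8s.mylabs.dev
```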
Create Amazon EKS
I will use eksctl to create the Amazon EKS cluster.
tee "${TMP_DIR}/${CLUSTER_FQDN}/eksctl-${CLUSTER_NAME}.yml" << EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: ${CLUSTER_NAME}
region: ${AWS_REGION}
tags:
karpenter.sh/discovery: ${CLUSTER_NAME}
$(echo "${TAGS}" | sed "s/,/\\n /g; s/=/: /g")
availabilityZones:
- ${AWS_REGION}a
- ${AWS_REGION}b
accessConfig:
accessEntries:
- principalARN: arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/admin
accessPolicies:
- policyARN: arn:${AWS_PARTITION}:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy
accessScope:
type: cluster
iam:
withOIDC: true
podIdentityAssociations:
- namespace: aws-load-balancer-controller
serviceAccountName: aws-load-balancer-controller
roleName: eksctl-${CLUSTER_NAME}-aws-load-balancer-controller
wellKnownPolicies:
awsLoadBalancerController: true
- namespace: cert-manager
serviceAccountName: cert-manager
roleName: eksctl-${CLUSTER_NAME}-cert-manager
wellKnownPolicies:
certManager: true
- namespace: external-dns
serviceAccountName: external-dns
roleName: eksctl-${CLUSTER_NAME}-external-dns
wellKnownPolicies:
externalDNS: true
- namespace: karpenter
serviceAccountName: karpenter
roleName: eksctl-${CLUSTER_NAME}-karpenter
permissionPolicyARNs:
- arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME}
- namespace: loki
serviceAccountName: loki
roleName: eksctl-${CLUSTER_NAME}-loki
permissionPolicyARNs:
- ${AWS_S3_ACCESS_POLICY_ARN}
- namespace: mimir
serviceAccountName: mimir
roleName: eksctl-${CLUSTER_NAME}-mimir
permissionPolicyARNs:
- ${AWS_S3_ACCESS_POLICY_ARN}
- namespace: tempo
serviceAccountName: tempo
roleName: eksctl-${CLUSTER_NAME}-tempo
permissionPolicyARNs:
- ${AWS_S3_ACCESS_POLICY_ARN}
- namespace: velero
serviceAccountName: velero
roleName: eksctl-${CLUSTER_NAME}-velero
permissionPolicyARNs:
- ${AWS_S3_ACCESS_POLICY_ARN}
permissionPolicy:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action: [
"ec2:DescribeVolumes",
"ec2:DescribeSnapshots",
"ec2:CreateTags",
"ec2:CreateSnapshot",
"ec2:DeleteSnapshots"
]
Resource:
- "*"
iamIdentityMappings:
- arn: "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
username: system:node:{{EC2PrivateDNSName}}
groups:
- system:bootstrappers
- system:nodes
addons:
- name: coredns
- name: eks-pod-identity-agent
- name: kube-proxy
- name: snapshot-controller
- name: aws-ebs-csi-driver
configurationValues: |-
defaultStorageClass:
enabled: true
controller:
extraVolumeTags:
$(echo "${TAGS}" | sed "s/,/\\n /g; s/=/: /g")
loggingFormat: json
- name: vpc-cni
configurationValues: |-
enableNetworkPolicy: "true"
env:
ENABLE_PREFIX_DELEGATION: "true"
managedNodeGroups:
- name: mng01-ng
amiFamily: Bottlerocket
instanceType: t4g.medium
desiredCapacity: 2
availabilityZones:
- ${AWS_REGION}a
minSize: 2
maxSize: 3
volumeSize: 20
volumeEncrypted: true
volumeKmsKeyID: ${AWS_KMS_KEY_ID}
privateNetworking: true
bottlerocket:
settings:
kubernetes:
seccomp-default: true
secretsEncryption:
keyARN: ${AWS_KMS_KEY_ARN}
cloudWatch:
clusterLogging:
logRetentionInDays: 1
enableTypes:
- all
EOF
eksctl create cluster --config-file "${TMP_DIR}/${CLUSTER_FQDN}/eksctl-${CLUSTER_NAME}.yml" --kubeconfig "${KUBECONFIG}" || eksctl utils write-kubeconfig --cluster="${CLUSTER_NAME}" --kubeconfig "${KUBECONFIG}"
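The `$(echo "${TAGS}" | sed ...)` substitutions inside the heredoc above convert the comma-separated `TAGS` string into YAML `key: value` lines. The transformation in isolation (GNU sed interprets `\n` in the replacement as a newline; the heredoc version additionally inserts spaces after the newline to preserve YAML indentation):

```shell
TAGS="Owner=me@example.com,Environment=dev"
# s/,/\n/g splits the pairs onto separate lines; s/=/: /g turns "k=v" into "k: v"
echo "${TAGS}" | sed 's/,/\n/g; s/=/: /g'
# -> Owner: me@example.com
#    Environment: dev
```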
Enhance the security posture of the EKS cluster by addressing the following concerns:
AWS_VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:alpha.eksctl.io/cluster-name,Values=${CLUSTER_NAME}" --query 'Vpcs[*].VpcId' --output text)
AWS_SECURITY_GROUP_ID=$(aws ec2 describe-security-groups --filters "Name=vpc-id,Values=${AWS_VPC_ID}" "Name=group-name,Values=default" --query 'SecurityGroups[*].GroupId' --output text)
AWS_NACL_ID=$(aws ec2 describe-network-acls --filters "Name=vpc-id,Values=${AWS_VPC_ID}" --query 'NetworkAcls[*].NetworkAclId' --output text)
The default security group should have no rules configured:
aws ec2 revoke-security-group-egress --group-id "${AWS_SECURITY_GROUP_ID}" --protocol all --port all --cidr 0.0.0.0/0 | jq || true
aws ec2 revoke-security-group-ingress --group-id "${AWS_SECURITY_GROUP_ID}" --protocol all --port all --source-group "${AWS_SECURITY_GROUP_ID}" | jq || true
By default, the VPC NACL allows unrestricted SSH and RDP access. Add deny rules for both:
aws ec2 create-network-acl-entry --network-acl-id "${AWS_NACL_ID}" --ingress --rule-number 1 --protocol tcp --port-range "From=22,To=22" --cidr-block 0.0.0.0/0 --rule-action Deny
aws ec2 create-network-acl-entry --network-acl-id "${AWS_NACL_ID}" --ingress --rule-number 2 --protocol tcp --port-range "From=3389,To=3389" --cidr-block 0.0.0.0/0 --rule-action Deny
The VPC should have a Route 53 DNS resolver with query logging enabled:
AWS_CLUSTER_LOG_GROUP_ARN=$(aws logs describe-log-groups --query "logGroups[?logGroupName=='/aws/eks/${CLUSTER_NAME}/cluster'].arn" --output text)
AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID=$(aws route53resolver create-resolver-query-log-config \
  --name "${CLUSTER_NAME}-vpc-dns-logs" \
  --destination-arn "${AWS_CLUSTER_LOG_GROUP_ARN}" \
  --creator-request-id "$(uuidgen)" --query 'ResolverQueryLogConfig.Id' --output text)
aws route53resolver associate-resolver-query-log-config \
  --resolver-query-log-config-id "${AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID}" \
  --resource-id "${AWS_VPC_ID}"
Prometheus Operator CRDs
The Prometheus Operator CRDs chart provides the Custom Resource Definitions (CRDs) that define Prometheus Operator resources. These CRDs must be installed before any ServiceMonitor resources can be created.
Install the prometheus-operator-crds Helm chart to set up the necessary CRDs:
helm install prometheus-operator-crds oci://ghcr.io/prometheus-community/charts/prometheus-operator-crds
AWS Load Balancer Controller
The AWS Load Balancer Controller is a controller that manages Elastic Load Balancers for a Kubernetes cluster.
Install the aws-load-balancer-controller Helm chart and modify its default values:
# renovate: datasource=helm depName=aws-load-balancer-controller registryUrl=https://aws.github.io/eks-charts
AWS_LOAD_BALANCER_CONTROLLER_HELM_CHART_VERSION="1.14.1"
helm repo add --force-update eks https://aws.github.io/eks-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-aws-load-balancer-controller.yml" << EOF
serviceAccount:
name: aws-load-balancer-controller
clusterName: ${CLUSTER_NAME}
serviceMonitor:
enabled: true
EOF
helm upgrade --install --version "${AWS_LOAD_BALANCER_CONTROLLER_HELM_CHART_VERSION}" --namespace aws-load-balancer-controller --create-namespace --wait --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-aws-load-balancer-controller.yml" aws-load-balancer-controller eks/aws-load-balancer-controller
Pod Scheduling PriorityClasses
Configure PriorityClasses to control the scheduling priority of pods in your cluster. PriorityClasses allow you to influence which pods are scheduled or evicted first when resources are constrained. These classes help ensure that critical workloads receive scheduling priority over less important workloads.
Create custom PriorityClass resources to define priority levels for different workload types:
tee "${TMP_DIR}/${CLUSTER_FQDN}/k8s-scheduling-priorityclass.yml" << EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical-priority
value: 100001000
globalDefault: false
description: "This priority class should be used for critical workloads only"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 100000000
globalDefault: false
description: "This priority class should be used for high priority workloads"
EOF
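Workloads opt into these classes through `spec.priorityClassName`. A hypothetical Deployment snippet (the name and image are placeholders, not part of this setup):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-critical-app   # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-critical-app
  template:
    metadata:
      labels:
        app: example-critical-app
    spec:
      # Pods with higher priority are scheduled first and evicted last
      priorityClassName: critical-priority
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:latest   # placeholder image
          command: ["sleep", "infinity"]
```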
Add Storage Classes and Volume Snapshots
Configure persistent storage for your EKS cluster by setting up GP3 storage classes and volume snapshot capabilities. This ensures encrypted, expandable storage with proper backup functionality.
tee "${TMP_DIR}/${CLUSTER_FQDN}/k8s-storage-snapshot-storageclass-volumesnapshotclass.yml" << EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gp3
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
type: gp3
encrypted: "true"
kmsKeyId: ${AWS_KMS_KEY_ARN}
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: ebs-vsc
annotations:
snapshot.storage.kubernetes.io/is-default-class: "true"
driver: ebs.csi.aws.com
deletionPolicy: Delete
EOF
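A hypothetical PVC using the new class; because of `WaitForFirstConsumer`, the encrypted GP3 volume is only created once a pod referencing the claim is scheduled:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data   # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3   # may also be omitted, since gp3 is now the default class
  resources:
    requests:
      storage: 10Gi
```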
Delete the gp2 StorageClass, as gp3 will be used instead:
1
kubectl delete storageclass gp2 || true
Karpenter
Karpenter is a Kubernetes node autoscaler built for flexibility, performance, and simplicity.
Install the karpenter Helm chart and customize its default values to fit your environment:
# renovate: datasource=github-tags depName=aws/karpenter-provider-aws
KARPENTER_HELM_CHART_VERSION="1.8.2"
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-karpenter.yml" << EOF
serviceMonitor:
enabled: true
settings:
clusterName: ${CLUSTER_NAME}
eksControlPlane: true
interruptionQueue: ${CLUSTER_NAME}
featureGates:
spotToSpotConsolidation: true
EOF
helm upgrade --install --version "${KARPENTER_HELM_CHART_VERSION}" --namespace karpenter --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-karpenter.yml" karpenter oci://public.ecr.aws/karpenter/karpenter
Configure Karpenter by applying the following provisioner definition:
tee "${TMP_DIR}/${CLUSTER_FQDN}/k8s-karpenter-nodepool.yml" << EOF | kubectl apply -f -
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: Bottlerocket
amiSelectorTerms:
- alias: bottlerocket@latest
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "${CLUSTER_NAME}"
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "${CLUSTER_NAME}"
role: "KarpenterNodeRole-${CLUSTER_NAME}"
tags:
Name: "${CLUSTER_NAME}-karpenter"
$(echo "${TAGS}" | sed "s/,/\\n /g; s/=/: /g")
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 2Gi
volumeType: gp3
encrypted: true
kmsKeyID: ${AWS_KMS_KEY_ARN}
- deviceName: /dev/xvdb
ebs:
volumeSize: 20Gi
volumeType: gp3
encrypted: true
kmsKeyID: ${AWS_KMS_KEY_ARN}
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
requirements:
# keep-sorted start
- key: "topology.kubernetes.io/zone"
operator: In
values: ["${AWS_REGION}a"]
- key: karpenter.k8s.aws/instance-family
operator: In
values: ["t4g", "t3a"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: kubernetes.io/arch
operator: In
values: ["arm64", "amd64"]
# keep-sorted end
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
EOF
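Karpenter provisioning can be exercised with a throwaway deployment, similar to the "inflate" example in Karpenter's getting-started guide. Scaling it beyond the capacity of the managed node group should make Karpenter launch a new node matching the NodePool requirements (the name and sizing below are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate   # illustrative name, borrowed from Karpenter's getting-started guide
spec:
  replicas: 0     # scale up (kubectl scale deployment inflate --replicas 5) to trigger provisioning
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1   # large enough that several replicas will not fit on the existing nodes
```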
cert-manager
cert-manager adds certificates and certificate issuers as resource types in Kubernetes clusters and simplifies the process of obtaining, renewing, and using those certificates.
The cert-manager ServiceAccount was created by eksctl. Install the cert-manager Helm chart and modify its default values:
# renovate: datasource=helm depName=cert-manager registryUrl=https://charts.jetstack.io
CERT_MANAGER_HELM_CHART_VERSION="1.19.1"
helm repo add --force-update jetstack https://charts.jetstack.io
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-cert-manager.yml" << EOF
global:
priorityClassName: high-priority
crds:
enabled: true
extraArgs:
- --enable-certificate-owner-ref=true
serviceAccount:
name: cert-manager
enableCertificateOwnerRef: true
prometheus:
servicemonitor:
enabled: true
EOF
helm upgrade --install --version "${CERT_MANAGER_HELM_CHART_VERSION}" --namespace cert-manager --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-cert-manager.yml" cert-manager jetstack/cert-manager
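This document later restores the `letsencrypt-production-dns` ClusterIssuer from a Velero backup, but on a fresh setup such an issuer has to be created once. A sketch of what an ACME DNS-01 issuer with the Route 53 solver looks like (the exact spec of the original issuer is an assumption here):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: petr.ruzicka@gmail.com
    privateKeySecretRef:
      name: letsencrypt-production-dns   # secret storing the ACME account key
    solvers:
      - dns01:
          route53: {}   # credentials come from the Pod Identity role created by eksctl
```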
Install Velero
Velero is an open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes. It enables disaster recovery, data migration, and scheduled backups by integrating with cloud storage providers such as AWS S3.
Install the velero Helm chart and modify its default values:
# renovate: datasource=helm depName=velero registryUrl=https://vmware-tanzu.github.io/helm-charts
VELERO_HELM_CHART_VERSION="11.1.1"
helm repo add --force-update vmware-tanzu https://vmware-tanzu.github.io/helm-charts
cat > "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-velero.yml" << EOF
initContainers:
- name: velero-plugin-for-aws
# renovate: datasource=docker depName=velero/velero-plugin-for-aws extractVersion=^(?<version>.+)$
image: velero/velero-plugin-for-aws:v1.13.0
volumeMounts:
- mountPath: /target
name: plugins
priorityClassName: high-priority
metrics:
serviceMonitor:
enabled: true
# prometheusRule:
# enabled: true
# spec:
# - alert: VeleroBackupPartialFailures
# annotations:
# message: Velero backup {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} partially failed backups.
# expr: velero_backup_partial_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
# for: 15m
# labels:
# severity: warning
# - alert: VeleroBackupFailures
# annotations:
# message: Velero backup {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} failed backups.
# expr: velero_backup_failure_total{schedule!=""} / velero_backup_attempt_total{schedule!=""} > 0.25
# for: 15m
# labels:
# severity: warning
# - alert: VeleroBackupSnapshotFailures
# annotations:
# message: Velero backup {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} failed snapshot backups.
# expr: increase(velero_volume_snapshot_failure_total{schedule!=""}[1h]) > 0
# for: 15m
# labels:
# severity: warning
# - alert: VeleroRestorePartialFailures
# annotations:
# message: Velero restore {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} partially failed restores.
# expr: increase(velero_restore_partial_failure_total{schedule!=""}[1h]) > 0
# for: 15m
# labels:
# severity: warning
# - alert: VeleroRestoreFailures
# annotations:
# message: Velero restore {{ \$labels.schedule }} has {{ \$value | humanizePercentage }} failed restores.
# expr: increase(velero_restore_failure_total{schedule!=""}[1h]) > 0
# for: 15m
# labels:
# severity: warning
configuration:
backupStorageLocation:
- name:
provider: aws
bucket: ${CLUSTER_FQDN}
prefix: velero
config:
region: ${AWS_REGION}
volumeSnapshotLocation:
- name:
provider: aws
config:
region: ${AWS_REGION}
serviceAccount:
server:
name: velero
credentials:
useSecret: false
# Create a scheduled backup to periodically back up the Let's Encrypt production resources in the "cert-manager" namespace:
schedules:
monthly-backup-cert-manager-production:
labels:
letsencrypt: production
schedule: "@monthly"
template:
ttl: 2160h
includedNamespaces:
- cert-manager
includedResources:
- certificates.cert-manager.io
- secrets
labelSelector:
matchLabels:
letsencrypt: production
EOF
helm upgrade --install --version "${VELERO_HELM_CHART_VERSION}" --namespace velero --create-namespace --wait --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-velero.yml" velero vmware-tanzu/velero
Restore cert-manager objects
The following steps will guide you through restoring a Let’s Encrypt production certificate, previously backed up by Velero to S3, onto a new cluster.
Initiate the restore process for the cert-manager objects.
while [ -z "$(kubectl -n velero get backupstoragelocations default -o jsonpath='{.status.lastSyncedTime}')" ]; do sleep 5; done
velero restore create --from-schedule velero-monthly-backup-cert-manager-production --labels letsencrypt=production --wait --existing-resource-policy=update
View details about the restore process:
velero restore describe --selector letsencrypt=production --details
Name: velero-monthly-backup-cert-manager-production-20251030075321
Namespace: velero
Labels: letsencrypt=production
Annotations: <none>
Phase: Completed
Total items to be restored: 3
Items restored: 3
Started: 2025-10-30 07:53:22 +0100 CET
Completed: 2025-10-30 07:53:24 +0100 CET
Backup: velero-monthly-backup-cert-manager-production-20250921155028
Namespaces:
Included: all namespaces found in the backup
Excluded: <none>
Resources:
Included: *
Excluded: nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
Cluster-scoped: auto
Namespace mappings: <none>
Label selector: <none>
Or label selector: <none>
Restore PVs: auto
CSI Snapshot Restores: <none included>
Existing Resource Policy: update
ItemOperationTimeout: 4h0m0s
Preserve Service NodePorts: auto
Uploader config:
HooksAttempted: 0
HooksFailed: 0
Resource List:
cert-manager.io/v1/Certificate:
- cert-manager/ingress-cert-production(created)
v1/Secret:
- cert-manager/ingress-cert-production(created)
- cert-manager/letsencrypt-production-dns(created)
Verify that the certificate was restored properly:
kubectl describe certificates -n cert-manager ingress-cert-production
Name: ingress-cert-production
Namespace: cert-manager
Labels: letsencrypt=production
velero.io/backup-name=velero-monthly-backup-cert-manager-production-20250921155028
velero.io/restore-name=velero-monthly-backup-cert-manager-production-20251030075321
Annotations: <none>
API Version: cert-manager.io/v1
Kind: Certificate
Metadata:
Creation Timestamp: 2025-10-30T06:53:23Z
Generation: 1
Resource Version: 5521
UID: 33422558-3105-4936-87d8-468befb5dc2b
Spec:
Common Name: *.k01.k8s.mylabs.dev
Dns Names:
*.k01.k8s.mylabs.dev
k01.k8s.mylabs.dev
Issuer Ref:
Group: cert-manager.io
Kind: ClusterIssuer
Name: letsencrypt-production-dns
Secret Name: ingress-cert-production
Secret Template:
Labels:
Letsencrypt: production
Status:
Conditions:
Last Transition Time: 2025-10-30T06:53:23Z
Message: Certificate is up to date and has not expired
Observed Generation: 1
Reason: Ready
Status: True
Type: Ready
Not After: 2025-12-20T10:53:07Z
Not Before: 2025-09-21T10:53:08Z
Renewal Time: 2025-11-20T10:53:07Z
Events: <none>
ExternalDNS
ExternalDNS synchronizes exposed Kubernetes Services and Ingresses with DNS providers.
ExternalDNS will manage the DNS records. The external-dns ServiceAccount was created by eksctl. Install the external-dns Helm chart and modify its default values:
# renovate: datasource=helm depName=external-dns registryUrl=https://kubernetes-sigs.github.io/external-dns/
EXTERNAL_DNS_HELM_CHART_VERSION="1.19.0"
helm repo add --force-update external-dns https://kubernetes-sigs.github.io/external-dns/
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-external-dns.yml" << EOF
serviceAccount:
name: external-dns
priorityClassName: high-priority
serviceMonitor:
enabled: true
interval: 20s
policy: sync
domainFilters:
- ${CLUSTER_FQDN}
EOF
helm upgrade --install --version "${EXTERNAL_DNS_HELM_CHART_VERSION}" --namespace external-dns --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-external-dns.yml" external-dns external-dns/external-dns
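ExternalDNS then watches Services and Ingresses and creates the matching Route 53 records. For example, a hypothetical Ingress with a host under the cluster domain would get its DNS record created automatically (the host and service names below are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example   # placeholder name
spec:
  rules:
    - host: example.k01.k8s.mylabs.dev   # must fall inside the domainFilters list
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example   # placeholder Service
                port:
                  number: 80
```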
Ingress NGINX Controller
ingress-nginx is an Ingress controller for Kubernetes that uses nginx as a reverse proxy and load balancer.
Install the ingress-nginx Helm chart and modify its default values:
# renovate: datasource=helm depName=ingress-nginx registryUrl=https://kubernetes.github.io/ingress-nginx
INGRESS_NGINX_HELM_CHART_VERSION="4.13.3"
helm repo add --force-update ingress-nginx https://kubernetes.github.io/ingress-nginx
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-ingress-nginx.yml" << EOF
controller:
config:
annotations-risk-level: Critical
use-proxy-protocol: true
allowSnippetAnnotations: true
ingressClassResource:
default: true
extraArgs:
default-ssl-certificate: cert-manager/ingress-cert-production
service:
annotations:
# https://www.qovery.com/blog/our-migration-from-kubernetes-built-in-nlb-to-alb-controller/
# https://www.youtube.com/watch?v=xwiRjimKW9c
service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: ${TAGS//\'/}
service.beta.kubernetes.io/aws-load-balancer-name: eks-${CLUSTER_NAME}
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: proxy_protocol_v2.enabled=true
service.beta.kubernetes.io/aws-load-balancer-type: external
# loadBalancerClass: eks.amazonaws.com/nlb
metrics:
enabled: true
serviceMonitor:
enabled: true
# prometheusRule:
# enabled: true
# rules:
# - alert: NGINXConfigFailed
# expr: count(nginx_ingress_controller_config_last_reload_successful == 0) > 0
# for: 1s
# labels:
# severity: critical
# annotations:
# description: bad ingress config - nginx config test failed
# summary: uninstall the latest ingress changes to allow config reloads to resume
# - alert: NGINXCertificateExpiry
# expr: (avg(nginx_ingress_controller_ssl_expire_time_seconds{host!="_"}) by (host) - time()) < 604800
# for: 1s
# labels:
# severity: critical
# annotations:
# description: ssl certificate(s) will expire in less than a week
# summary: renew expiring certificates to avoid downtime
# - alert: NGINXTooMany500s
# expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"5.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
# for: 1m
# labels:
# severity: warning
# annotations:
# description: Too many 5XXs
# summary: More than 5% of all requests returned 5XX, this requires your attention
# - alert: NGINXTooMany400s
# expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
# for: 1m
# labels:
# severity: warning
# annotations:
# description: Too many 4XXs
# summary: More than 5% of all requests returned 4XX, this requires your attention
priorityClassName: critical-priority
EOF
helm upgrade --install --version "${INGRESS_NGINX_HELM_CHART_VERSION}" --namespace ingress-nginx --create-namespace --wait --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-ingress-nginx.yml" ingress-nginx ingress-nginx/ingress-nginx
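The aws-load-balancer-additional-resource-tags annotation above uses the bash parameter expansion `${TAGS//\'/}`, which deletes every single quote from the TAGS variable (the quoting is needed where TAGS is defined earlier, but would corrupt the annotation value). A small sketch with a hypothetical tag string:

```shell
# Hypothetical tag string - the real TAGS variable is defined earlier in the guide.
TAGS_DEMO="Owner='petr' Environment='dev'"
# "${TAGS_DEMO//\'/}" replaces every single quote with nothing.
echo "${TAGS_DEMO//\'/}"
# → Owner=petr Environment=dev
```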
Loki
Grafana Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost-effective and easy to operate, as it does not index the contents of the logs, but rather a set of labels for each log stream.
Install the loki Helm chart and customize its default values to fit your environment and storage requirements:
# renovate: datasource=helm depName=loki registryUrl=https://grafana.github.io/helm-charts
LOKI_HELM_CHART_VERSION="6.45.2"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-loki.yml" << EOF
global:
priorityClassName: high-priority
deploymentMode: SingleBinary
loki:
auth_enabled: false
commonConfig:
replication_factor: 2
storage:
bucketNames:
chunks: ${CLUSTER_FQDN}
ruler: ${CLUSTER_FQDN}
admin: ${CLUSTER_FQDN}
s3:
region: ${AWS_REGION}
endpoint: s3.${AWS_REGION}.amazonaws.com
object_store:
storage_prefix: ruzickap
s3:
endpoint: s3.${AWS_REGION}.amazonaws.com
region: ${AWS_REGION}
schemaConfig:
configs:
- from: 2024-04-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
storage_config:
aws:
region: ${AWS_REGION}
# bucketnames: loki-chunk
# s3forcepathstyle: false
# s3: s3://s3.${AWS_REGION}.amazonaws.com/loki-storage
# endpoint: s3.${AWS_REGION}.amazonaws.com
limits_config:
retention_period: 1w
# Log retention in Loki is achieved through the Compactor (https://grafana.com/docs/loki/v3.5.x/get-started/components/#compactor)
# compactor:
# delete_request_store: s3
# retention_enabled: true
ingress:
enabled: true
ingressClassName: nginx
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/description: A horizontally-scalable, highly-available log aggregation system
gethomepage.dev/group: Apps
gethomepage.dev/icon: https://raw.githubusercontent.com/grafana/loki/5a8bc848dbe453ce27576d2058755a90f79d07b6/docs/sources/logo.png
gethomepage.dev/name: Loki
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
hosts:
- loki.${CLUSTER_FQDN}
tls:
- hosts:
- loki.${CLUSTER_FQDN}
singleBinary:
replicas: 2
backend:
replicas: 0
read:
replicas: 0
write:
replicas: 0
# https://blog.devgenius.io/install-loki-in-distributed-mode-on-azure-aks-with-terraform-0918803f2ed0
ruler:
enabled: false
EOF
helm upgrade --install --version "${LOKI_HELM_CHART_VERSION}" --namespace loki --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-loki.yml" loki grafana/loki
Mimir
Grafana Mimir is an open source, horizontally scalable, multi-tenant time series database for Prometheus metrics, designed for high availability and cost efficiency. It enables you to centralize metrics from multiple clusters or environments, and integrates seamlessly with Grafana dashboards for visualization and alerting.
Install the mimir-distributed Helm chart and customize its default values to fit your environment and storage backend:
# renovate: datasource=helm depName=mimir-distributed registryUrl=https://grafana.github.io/helm-charts
MIMIR_DISTRIBUTED_HELM_CHART_VERSION="6.0.3"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-mimir-distributed.yml" << EOF
serviceAccount:
name: mimir
mimir:
structuredConfig:
limits:
compactor_blocks_retention_period: 30d
# {"ts":"2025-11-04T19:30:40.472926117Z","level":"error","msg":"non-recoverable error","component_path":"/","component_id":"prometheus.remote_write.mimir","subcomponent":"rw","remote_name":"5b0906","url":"http://mimir-gateway.mimir.svc.cluster.local/api/v1/push","failedSampleCount":2000,"failedHistogramCount":0,"failedExemplarCount":0,"err":"server returned HTTP status 400 Bad Request: received a series whose number of labels exceeds the limit (actual: 31, limit: 30) series: 'karpenter_nodes_allocatable{arch=\"amd64\", capacity_type=\"spot\", container=\"controller\", endpoint=\"http-metrics\", instance=\"192.168.92.152:8080\", instance_capability_flex=\"false\", instance_category=\"t\"…' (err-mimir-max-label-names-per-series). To adjust the related per-tenant limit, configure -validation.max-label-names-per-series, or contact your service administrator.\n"}
max_label_names_per_series: 50
common:
# https://grafana.com/docs/mimir/v2.17.x/configure/configuration-parameters/
storage:
backend: s3
s3:
endpoint: s3.${AWS_REGION}.amazonaws.com
region: ${AWS_REGION}
storage_class: ONEZONE_IA
alertmanager_storage:
s3:
bucket_name: ${CLUSTER_FQDN}
storage_prefix: mimiralertmanager
blocks_storage:
s3:
bucket_name: ${CLUSTER_FQDN}
storage_prefix: mimirblocks
ruler_storage:
s3:
bucket_name: ${CLUSTER_FQDN}
storage_prefix: mimirruler
ingester:
replicas: 2
# https://github.com/grafana/helm-charts/blob/main/charts/rollout-operator/values.yaml
rollout_operator:
serviceMonitor:
enabled: true
minio:
enabled: false
EOF
helm upgrade --install --version "${MIMIR_DISTRIBUTED_HELM_CHART_VERSION}" --namespace mimir --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-mimir-distributed.yml" mimir grafana/mimir-distributed
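The commented-out error above shows why `max_label_names_per_series` is raised: the karpenter_nodes_allocatable series carries 31 labels, one over Mimir's default limit of 30 (`-validation.max-label-names-per-series`). Counting the labels of an offending series is easy to do locally; a toy, truncated label set for illustration:

```shell
# Hypothetical, truncated label set - the real karpenter series carries 31 labels.
labels='arch="amd64", capacity_type="spot", container="controller", endpoint="http-metrics", instance="192.168.92.152:8080"'
# Count the comma-separated label pairs.
echo "${labels}" | awk -F', ' '{print NF}'
# → 5
```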
Tempo
Grafana Tempo is an open source, easy-to-use, and high-scale distributed tracing backend. It is designed to be cost-effective and simple to operate, as it only requires object storage to operate its backend and does not index the trace data.
Install the tempo-distributed Helm chart and customize its default values to fit your environment and storage requirements:
# renovate: datasource=helm depName=tempo-distributed registryUrl=https://grafana.github.io/helm-charts
TEMPO_HELM_CHART_VERSION="1.52.7"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-tempo.yml" << EOF
global:
priorityClassName: high-priority
# https://youtu.be/PmE9mgYaoQA?t=817
metricsGenerator:
enabled: true
storage:
trace:
backend: s3
s3:
bucket: ${CLUSTER_FQDN}
endpoint: s3.${AWS_REGION}.amazonaws.com
admin:
backend: s3
s3:
bucket_name: ${CLUSTER_FQDN}
endpoint: s3.${AWS_REGION}.amazonaws.com
traces:
otlp:
http:
enabled: true
grpc:
enabled: true
metricsGenerator:
enabled: true
config:
# processor:
# # https://grafana.com/docs/tempo/latest/operations/traceql-metrics/
# local_blocks:
# filter_server_spans: false
storage:
remote_write:
- url: http://mimir-gateway.mimir.svc.cluster.local/api/v1/push
EOF
helm upgrade --install --version "${TEMPO_HELM_CHART_VERSION}" --namespace tempo --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-tempo.yml" tempo grafana/tempo-distributed
Alloy
Grafana Alloy is an open source, vendor-neutral distribution of the OpenTelemetry Collector that provides a unified way to collect, process, and export telemetry data (traces, metrics, and logs) from your infrastructure and applications.
Install the alloy Helm chart and customize its default values to fit your environment and monitoring needs:
# renovate: datasource=helm depName=alloy registryUrl=https://grafana.github.io/helm-charts
ALLOY_HELM_CHART_VERSION="1.4.0"
# https://github.com/ai-cfia/howard-on-prem/blob/main/monitoring/grafana-alloy/helm/values.yaml
# https://github.com/hongbo-miao/hongbomiao.com/blob/main/kubernetes/argo-cd/projects/production-hm/alloy/manifests/hm-alloy-application.yaml
# https://github.com/RS-PYTHON/rs-infra-monitoring/blob/0cc043e9398edd80b91b3ac8768f5a8ab7fce26e/apps/alloy/values.yaml#L47
# https://stackoverflow.com/questions/79695474/grafana-alloy-no-prefect-pod-logs-on-bottlerocket
# https://developer-friendly.blog/blog/2025/03/17/migration-from-promtail-to-alloy-the-what-the-why-and-the-how/#collect-prometheus-metrics
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-alloy.yml" << EOF
alloy:
configMap:
content: |-
logging {
level = "info"
format = "json"
}
// ##########################################
// # Beyla
// ##########################################
beyla.ebpf "default" {
attributes {
kubernetes {
enable = "true"
cluster_name = "${CLUSTER_NAME}"
}
}
discovery {
instrument {
open_ports = "80,443"
}
instrument {
kubernetes {
namespace = "ingress-nginx"
}
}
}
metrics {
features = [
"application",
"application_process",
"application_service_graph",
"application_span",
"network",
]
}
output {
traces = [otelcol.exporter.otlp.tempo.input]
}
}
prometheus.scrape "beyla" {
targets = beyla.ebpf.default.targets
honor_labels = true
forward_to = [prometheus.remote_write.mimir.receiver]
}
// ##########################################
// # Tempo
// ##########################################
otelcol.processor.batch "default" {
output {
metrics = [otelcol.exporter.prometheus.default.input]
logs = [otelcol.exporter.loki.default.input]
traces = [otelcol.exporter.otlp.tempo.input]
}
}
otelcol.connector.spanmetrics "default" {
dimension {
name = "http.status_code"
}
dimension {
name = "http.method"
default = "GET"
}
aggregation_temporality = "DELTA"
histogram {
unit = "s"
explicit {
buckets = ["333ms", "777s", "999h"]
}
}
metrics_flush_interval = "33s"
namespace = "default"
output {
metrics = [otelcol.processor.batch.default.input]
}
}
otelcol.connector.spanlogs "default" {
roots = true
output {
logs = [otelcol.processor.batch.default.input]
}
}
otelcol.connector.servicegraph "default" {
dimensions = ["http.method", "http.target"]
output {
metrics = [otelcol.processor.batch.default.input]
}
}
otelcol.receiver.otlp "default" {
// configures the default grpc endpoint "0.0.0.0:4317"
grpc { endpoint = "0.0.0.0:4317" }
// configures the default http/protobuf endpoint "0.0.0.0:4318"
http { endpoint = "0.0.0.0:4318" }
output {
metrics = [otelcol.processor.batch.default.input]
logs = [otelcol.processor.batch.default.input]
traces = [
otelcol.connector.servicegraph.default.input,
otelcol.connector.spanlogs.default.input,
otelcol.connector.spanmetrics.default.input,
]
}
}
otelcol.auth.headers "tempo" {
header {
key = "X-Scope-OrgID"
value = "1"
}
}
otelcol.exporter.otlp "tempo" {
client {
endpoint = "tempo-distributor.tempo.svc.cluster.local:4317"
auth = otelcol.auth.headers.tempo.handler
tls {
insecure = true
}
}
}
otelcol.exporter.loki "default" {
forward_to = [loki.write.default.receiver]
}
otelcol.exporter.prometheus "default" {
forward_to = [prometheus.remote_write.mimir.receiver]
}
// ##########################################
// # Loki
// ##########################################
// ========= Pod logs (via K8s API) =========
// discovery.kubernetes allows you to find scrape targets from Kubernetes resources.
// It watches cluster state and ensures targets are continually synced with what is currently running in your cluster.
// https://grafana.com/docs/alloy/v1.11/reference/components/discovery/discovery.kubernetes/
discovery.kubernetes "pod" {
role = "pod"
// Restrict to pods on the node to reduce cpu & memory usage
// https://grafana.com/docs/alloy/v1.11/reference/components/discovery/discovery.kubernetes/#limit-to-only-pods-on-the-same-node
selectors {
role = "pod"
field = "spec.nodeName=" + coalesce(sys.env("HOSTNAME"), constants.hostname)
}
}
// discovery.relabel rewrites the label set of the input targets by applying one or more relabeling rules.
// If no rules are defined, then the input targets are exported as-is.
// https://grafana.com/docs/alloy/v1.11/reference/components/loki/loki.relabel/
discovery.relabel "pod_logs" {
targets = discovery.kubernetes.pod.targets
//* Label creation - "namespace" field from "__meta_kubernetes_namespace"
rule {
source_labels = ["__meta_kubernetes_namespace"]
target_label = "namespace"
}
//* Label creation - "pod" field from "__meta_kubernetes_pod_name"
rule {
source_labels = ["__meta_kubernetes_pod_name"]
target_label = "pod"
}
//* Label creation - "container" field from "__meta_kubernetes_pod_container_name"
rule {
source_labels = ["__meta_kubernetes_pod_container_name"]
target_label = "container"
}
//* Label creation - "app" field from "__meta_kubernetes_pod_label_app_kubernetes_io_name"
rule {
source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
target_label = "app"
}
//* Label creation - "job" field from "__meta_kubernetes_namespace" and "__meta_kubernetes_pod_container_name"
// Concatenate values __meta_kubernetes_namespace/__meta_kubernetes_pod_container_name
rule {
source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_container_name"]
target_label = "job"
separator = "/"
}
//* Label creation - "container" field from "__meta_kubernetes_pod_uid" and "__meta_kubernetes_pod_container_name"
// Concatenate values __meta_kubernetes_pod_uid/__meta_kubernetes_pod_container_name.log
rule {
source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
target_label = "__path__"
separator = "/"
replacement = "/var/log/pods/*\$1/*.log"
}
//* Label creation - "container_runtime" field from "__meta_kubernetes_pod_container_id"
rule {
source_labels = ["__meta_kubernetes_pod_container_id"]
target_label = "container_runtime"
regex = "^(\\\S+):\\\/\\\/.+$"
}
// Label creation - "node_name" field from "__meta_kubernetes_pod_node_name"
rule {
source_labels = ["__meta_kubernetes_pod_node_name"]
target_label = "node_name"
}
// Label creation - "component" field from "__meta_kubernetes_pod_label_app_kubernetes_io_component" and "__meta_kubernetes_pod_label_component"
rule {
source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_component", "__meta_kubernetes_pod_label_component"]
target_label = "component"
regex = "^;*([^;]+)(;.*)?$"
}
}
// loki.process receives log entries from other Loki components, applies one or more processing stages,
// and forwards the results to the list of receivers in the component's arguments.
loki.process "pod_logs" {
stage.cri {}
stage.decolorize {}
forward_to = [loki.write.default.receiver]
}
// loki.source.kubernetes tails logs from Kubernetes containers using the Kubernetes API.
// https://grafana.com/docs/alloy/v1.11/reference/components/loki/loki.source.kubernetes/
loki.source.kubernetes "pod_logs" {
targets = discovery.relabel.pod_logs.output
forward_to = [loki.process.pod_logs.receiver]
}
// ========= Kubernetes Events =========
// loki.source.kubernetes_events tails events from the Kubernetes API and converts them
// into log lines to forward to other Loki components.
// https://grafana.com/docs/alloy/v1.11/reference/components/loki/loki.source.kubernetes_events/
loki.source.kubernetes_events "cluster_events" {
job_name = "integrations/kubernetes/eventhandler"
// log_format = "json"
forward_to = [
loki.process.cluster_events.receiver,
]
}
// loki.process receives log entries from other loki components, applies one or more processing stages,
// and forwards the results to the list of receivers in the component's arguments.
loki.process "cluster_events" {
forward_to = [loki.write.default.receiver]
stage.static_labels {
values = {
cluster = "${CLUSTER_NAME}",
}
}
stage.labels {
values = {
kubernetes_cluster_events = "job",
}
}
}
// https://grafana.com/docs/alloy/v1.11/reference/components/loki/loki.write/
loki.write "default" {
endpoint {
url = "http://loki-gateway.loki.svc.cluster.local/loki/api/v1/push"
tenant_id = "1"
}
}
// #####################
// # Mimir / Prometheus
// #####################
// prometheus.exporter.cadvisor "cadvisor" {
// allowlisted_container_labels = ["io.kubernetes.container.name", "io.kubernetes.pod.namespace", "io.kubernetes.pod.name"]
// enabled_metrics = ["cpu", "memory"]
// }
prometheus.exporter.unix "default" {
// https://github.com/aws/karpenter-provider-aws/issues/5406
// https://github.com/prometheus/node_exporter/issues/2692
// udev_data_path = "/rootfs/run/udev/data"
}
prometheus.scrape "scrape_metrics" {
targets = prometheus.exporter.unix.default.targets
forward_to = [prometheus.remote_write.mimir.receiver]
scrape_interval = "10s"
}
// Scrape service monitors (clustered to avoid duplicates)
prometheus.operator.servicemonitors "default" {
clustering {
enabled = true
}
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Scrape pod monitors (clustered to avoid duplicates)
prometheus.operator.podmonitors "pods" {
clustering {
enabled = true
}
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Scrape every probe (clustered to avoid duplicates)
prometheus.operator.probes "probes" {
clustering {
enabled = true
}
forward_to = [prometheus.remote_write.mimir.receiver]
}
// Expose a blackbox exporter locally so that probes can use the local exporter as a target
prometheus.exporter.blackbox "blackbox" {
config = "{ modules: { http_2xx: { prober: http, timeout: 5s } } }"
targets = [
{
name = "oauth2-proxy",
address = "https://oauth2-proxy.${CLUSTER_FQDN}",
module = "http_2xx",
},
]
}
// ##########################################
// # Common configuration
// ##########################################
prometheus.remote_write "mimir" {
endpoint {
url = "http://mimir-gateway.mimir.svc.cluster.local/api/v1/push"
headers = {
"X-Scope-OrgID" = "1",
}
}
}
extraPorts:
- name: otlp-grpc
port: 4317
targetPort: 4317
protocol: TCP
- name: otlp-http
port: 4318
targetPort: 4318
protocol: TCP
mounts:
varlog: true
# https://stackoverflow.com/questions/79400979/cannot-see-any-traces-from-alloy-in-grafana/79446696#79446696
securityContext:
appArmorProfile:
type: Unconfined
runAsUser: 0
capabilities:
drop:
- ALL
add:
- BPF
- CHECKPOINT_RESTORE
- DAC_READ_SEARCH
- NET_RAW
- PERFMON
- SYS_ADMIN
- SYS_PTRACE
controller:
priorityClassName: system-node-critical
serviceMonitor:
enabled: true
ingress:
enabled: true
ingressClassName: nginx
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/description: OpenTelemetry Collector distribution with programmable pipelines
gethomepage.dev/group: Apps
gethomepage.dev/icon: https://raw.githubusercontent.com/grafana/alloy/513175e2add3957310a445a7b683100b703a9b49/docs/sources/assets/alloy_icon_orange.svg
gethomepage.dev/name: Alloy
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
faroPort: 12345
hosts:
- alloy.${CLUSTER_FQDN}
tls:
- hosts:
- alloy.${CLUSTER_FQDN}
EOF
helm upgrade --install --version "${ALLOY_HELM_CHART_VERSION}" --namespace alloy --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-alloy.yml" alloy grafana/alloy
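The container_runtime relabel rule in the Alloy configuration extracts the runtime name from `__meta_kubernetes_pod_container_id`, which has the form `<runtime>://<id>`; the triple backslashes exist only to survive the shell heredoc and config-string parsing. The same extraction in plain shell, on a made-up container ID:

```shell
# Hypothetical container ID - real values come from the Kubernetes pod metadata.
container_id="containerd://8f1d2c3b4a5e6f7890"
# Keep everything before "://", mirroring the relabel rule's capture group.
echo "${container_id}" | sed -E 's@^([^:/]+)://.*@\1@'
# → containerd
```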
Beyla
Grafana Beyla is an open source, eBPF-based auto-instrumentation tool that captures application metrics and traces without requiring code changes.
Install the beyla Helm chart and modify its default values:
# renovate: datasource=helm depName=beyla registryUrl=https://grafana.github.io/helm-charts
BEYLA_HELM_CHART_VERSION="1.4.0"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-beyla.yml" << EOF
priorityClassName: system-node-critical
config:
data:
discovery:
instrument:
- open_ports: 443
otel_metrics_export:
endpoint: http://alloy.alloy.svc.cluster.local:4317
protocol: grpc
otel_traces_export:
endpoint: http://alloy.alloy.svc.cluster.local:4317
protocol: grpc
attributes:
select:
beyla_network_flow_bytes:
include:
- k8s.src.owner.name
- k8s.src.namespace
- k8s.dst.owner.name
- k8s.dst.namespace
- k8s.cluster.name
- src.zone
- dst.zone
network:
enable: true
env:
BEYLA_KUBE_CLUSTER_NAME: ${CLUSTER_NAME}
serviceMonitor:
enabled: true
EOF
helm upgrade --install --version "${BEYLA_HELM_CHART_VERSION}" --namespace beyla --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-beyla.yml" beyla grafana/beyla
Mailpit
Mailpit will be used to receive email alerts from Prometheus.
Install the mailpit Helm chart and modify its default values:
# renovate: datasource=helm depName=mailpit registryUrl=https://jouve.github.io/charts/
MAILPIT_HELM_CHART_VERSION="0.29.2"
helm repo add --force-update jouve https://jouve.github.io/charts/
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-mailpit.yml" << EOF
replicaCount: 2
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- mailpit
topologyKey: kubernetes.io/hostname
ingress:
enabled: true
ingressClassName: nginx
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/description: An email and SMTP testing tool with API for developers
gethomepage.dev/group: Apps
gethomepage.dev/icon: https://raw.githubusercontent.com/axllent/mailpit/61241f11ac94eb33bd84e399129992250eff56ce/server/ui/favicon.svg
gethomepage.dev/name: Mailpit
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
hostname: mailpit.${CLUSTER_FQDN}
EOF
helm upgrade --install --version "${MAILPIT_HELM_CHART_VERSION}" --namespace mailpit --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-mailpit.yml" mailpit jouve/mailpit
kubectl label namespace mailpit pod-security.kubernetes.io/enforce=baseline
Screenshot:
Grafana
Grafana is an open-source analytics and monitoring platform that allows you to query, visualize, alert on, and understand your metrics, logs, and traces. It provides a powerful and flexible way to create dashboards and visualizations for monitoring your Kubernetes cluster and applications.
Install the grafana Helm chart and modify its default values:
# renovate: datasource=helm depName=grafana registryUrl=https://grafana.github.io/helm-charts
GRAFANA_HELM_CHART_VERSION="10.1.4"
helm repo add --force-update grafana https://grafana.github.io/helm-charts
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-grafana.yml" << EOF
serviceMonitor:
enabled: true
ingress:
enabled: true
ingressClassName: nginx
annotations:
gethomepage.dev/description: Visualization Platform
gethomepage.dev/enabled: "true"
gethomepage.dev/group: Observability
gethomepage.dev/icon: grafana.svg
gethomepage.dev/name: Grafana
gethomepage.dev/app: grafana
gethomepage.dev/pod-selector: "app.kubernetes.io/name=grafana"
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
nginx.ingress.kubernetes.io/configuration-snippet: |
auth_request_set \$email \$upstream_http_x_auth_request_email;
proxy_set_header X-Email \$email;
path: /
pathType: Prefix
hosts:
- grafana.${CLUSTER_FQDN}
tls:
- hosts:
- grafana.${CLUSTER_FQDN}
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Mimir
type: prometheus
url: http://mimir-gateway.mimir.svc.cluster.local/prometheus
access: proxy
editable: true
isDefault: true
jsonData:
prometheusType: Mimir
prometheusVersion: 2.9.1
httpHeaderName1: X-Scope-OrgID
secureJsonData:
httpHeaderValue1: 1
- name: Loki
type: loki
url: http://loki-gateway.loki.svc.cluster.local/
access: proxy
editable: true
jsonData:
httpHeaderName1: X-Scope-OrgID
secureJsonData:
httpHeaderValue1: "1"
- name: Tempo
type: tempo
url: http://tempo-query-frontend.tempo.svc.cluster.local:3200
access: proxy
editable: true
notifiers:
notifiers.yaml:
notifiers:
- name: email-notifier
type: email
uid: email1
org_id: 1
is_default: true
settings:
addresses: ${MY_EMAIL}
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: "default"
orgId: 1
folder: ""
type: file
disableDeletion: false
editable: false
options:
path: /var/lib/grafana/dashboards/default
dashboards:
default:
# keep-sorted start numeric=yes
1860-node-exporter-full:
# renovate: depName="Node Exporter Full"
gnetId: 1860
revision: 37
datasource: Prometheus
# 19105-prometheus:
# # renovate: depName="Prometheus"
# gnetId: 19105
# revision: 6
# datasource: Prometheus
# 19268-prometheus:
# # renovate: depName="Prometheus All Metrics"
# gnetId: 19268
# revision: 1
# datasource: Prometheus
# 20340-cert-manager:
# # renovate: depName="cert-manager"
# gnetId: 20340
# revision: 1
# datasource: Prometheus
# 20842-cert-manager-kubernetes:
# # renovate: depName="Cert-manager-Kubernetes"
# gnetId: 20842
# revision: 1
# datasource: Prometheus
9923-beyla-red-metrics:
# renovate: depName="Beyla RED Metrics"
gnetId: 9923
revision: 3
datasource: Prometheus
# 3662-prometheus-2-0-overview:
# # renovate: depName="Prometheus 2.0 Overview"
# gnetId: 3662
# revision: 2
# datasource: Prometheus
# 9614-nginx-ingress-controller:
# # renovate: depName="NGINX Ingress controller"
# gnetId: 9614
# revision: 1
# datasource: Prometheus
# 12006-kubernetes-apiserver:
# # renovate: depName="Kubernetes apiserver"
# gnetId: 12006
# revision: 1
# datasource: Prometheus
# # https://github.com/DevOps-Nirvana/Grafana-Dashboards
# 14314-kubernetes-nginx-ingress-controller-nextgen-devops-nirvana:
# # renovate: depName="Kubernetes Nginx Ingress Prometheus NextGen"
# gnetId: 14314
# revision: 2
# datasource: Prometheus
# 15038-external-dns:
# # renovate: depName="External-dns"
# gnetId: 15038
# revision: 3
# datasource: Prometheus
15757-kubernetes-views-global:
# renovate: depName="Kubernetes / Views / Global"
gnetId: 15757
revision: 42
datasource: Prometheus
15758-kubernetes-views-namespaces:
# renovate: depName="Kubernetes / Views / Namespaces"
gnetId: 15758
revision: 41
datasource: Prometheus
15759-kubernetes-views-nodes:
# renovate: depName="Kubernetes / Views / Nodes"
gnetId: 15759
revision: 40
datasource: Prometheus
# https://grafana.com/orgs/imrtfm/dashboards - https://github.com/dotdc/grafana-dashboards-kubernetes
15760-kubernetes-views-pods:
# renovate: depName="Kubernetes / Views / Pods"
gnetId: 15760
revision: 37
datasource: Prometheus
15761-kubernetes-system-api-server:
# renovate: depName="Kubernetes / System / API Server"
gnetId: 15761
revision: 18
datasource: Prometheus
16006-mimir-alertmanager-resources:
# renovate: depName="Mimir / Alertmanager resources"
gnetId: 16006
revision: 17
datasource: Prometheus
16007-mimir-alertmanager:
# renovate: depName="Mimir / Alertmanager"
gnetId: 16007
revision: 17
datasource: Prometheus
16008-mimir-compactor-resources:
# renovate: depName="Mimir / Compactor resources"
gnetId: 16008
revision: 17
datasource: Prometheus
16009-mimir-compactor:
# renovate: depName="Mimir / Compactor"
gnetId: 16009
revision: 17
datasource: Prometheus
16010-mimir-config:
# renovate: depName="Mimir / Config"
gnetId: 16010
revision: 17
datasource: Prometheus
16011-mimir-object-store:
# renovate: depName="Mimir / Object Store"
gnetId: 16011
revision: 17
datasource: Prometheus
16012-mimir-overrides:
# renovate: depName="Mimir / Overrides"
gnetId: 16012
revision: 17
datasource: Prometheus
16013-mimir-queries:
# renovate: depName="Mimir / Queries"
gnetId: 16013
revision: 17
datasource: Prometheus
16014-mimir-reads-networking:
# renovate: depName="Mimir / Reads networking"
gnetId: 16014
revision: 17
datasource: Prometheus
16015-mimir-reads-resources:
# renovate: depName="Mimir / Reads resources"
gnetId: 16015
revision: 17
datasource: Prometheus
16016-mimir-reads:
# renovate: depName="Mimir / Reads"
gnetId: 16016
revision: 17
datasource: Prometheus
16017-mimir-rollout-progress:
# renovate: depName="Mimir / Rollout progress"
gnetId: 16017
revision: 17
datasource: Prometheus
16018-mimir-ruler:
# renovate: depName="Mimir / Ruler"
gnetId: 16018
revision: 17
datasource: Prometheus
16019-mimir-scaling:
# renovate: depName="Mimir / Scaling"
gnetId: 16019
revision: 17
datasource: Prometheus
16020-mimir-slow-queries:
# renovate: depName="Mimir / Slow queries"
gnetId: 16020
revision: 17
datasource: Prometheus
16021-mimir-tenants:
# renovate: depName="Mimir / Tenants"
gnetId: 16021
revision: 17
datasource: Prometheus
16022-mimir-top-tenants:
# renovate: depName="Mimir / Top tenants"
gnetId: 16022
revision: 16
datasource: Prometheus
16023-mimir-writes-networking:
# renovate: depName="Mimir / Writes networking"
gnetId: 16023
revision: 16
datasource: Prometheus
16024-mimir-writes-resources:
# renovate: depName="Mimir / Writes resources"
gnetId: 16024
revision: 17
datasource: Prometheus
16026-mimir-writes:
# renovate: depName="Mimir / Writes"
gnetId: 16026
revision: 17
datasource: Prometheus
17605-mimir-overview-networking:
# renovate: depName="Mimir / Overview networking"
gnetId: 17605
revision: 13
datasource: Prometheus
17606-mimir-overview-resources:
# renovate: depName="Mimir / Overview resources"
gnetId: 17606
revision: 13
datasource: Prometheus
17607-mimir-overview:
# renovate: depName="Mimir / Overview"
gnetId: 17607
revision: 13
datasource: Prometheus
17608-mimir-remote-ruler-reads:
# renovate: depName="Mimir / Remote ruler reads"
gnetId: 17608
revision: 13
datasource: Prometheus
17609-mimir-remote-ruler-reads-resources:
# renovate: depName="Mimir / Remote ruler reads resources"
gnetId: 17609
revision: 13
datasource: Prometheus
# keep-sorted end
grafana.ini:
analytics:
check_for_updates: false
auth.basic:
enabled: false
auth.proxy:
enabled: true
header_name: X-Email
header_property: email
users:
auto_assign_org_role: Admin
smtp:
enabled: true
host: mailpit-smtp.mailpit.svc.cluster.local:25
from_address: grafana@${CLUSTER_FQDN}
networkPolicy:
enabled: true
EOF
helm upgrade --install --version "${GRAFANA_HELM_CHART_VERSION}" --namespace grafana --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-grafana.yml" grafana grafana/grafana
OAuth2 Proxy
Use OAuth2 Proxy to protect application endpoints with Google Authentication.
Install the oauth2-proxy Helm chart and modify its default values:
# renovate: datasource=helm depName=oauth2-proxy registryUrl=https://oauth2-proxy.github.io/manifests
OAUTH2_PROXY_HELM_CHART_VERSION="8.3.2"
helm repo add --force-update oauth2-proxy https://oauth2-proxy.github.io/manifests
tee "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-oauth2-proxy.yml" << EOF
config:
clientID: ${GOOGLE_CLIENT_ID}
clientSecret: ${GOOGLE_CLIENT_SECRET}
cookieSecret: "$(openssl rand -base64 32 | head -c 32 | base64)"
configFile: |-
cookie_domains = ".${CLUSTER_FQDN}"
set_authorization_header = "true"
set_xauthrequest = "true"
upstreams = [ "file:///dev/null" ]
whitelist_domains = ".${CLUSTER_FQDN}"
authenticatedEmailsFile:
enabled: true
restricted_access: |-
${MY_EMAIL}
ingress:
enabled: true
ingressClassName: nginx
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/description: A reverse proxy that provides authentication with Google, Azure, OpenID Connect and many more identity providers
gethomepage.dev/group: Cluster Management
gethomepage.dev/icon: https://raw.githubusercontent.com/oauth2-proxy/oauth2-proxy/899c743afc71e695964165deb11f50b9a0703c97/docs/static/img/logos/OAuth2_Proxy_icon.svg
gethomepage.dev/name: OAuth2-Proxy
hosts:
- oauth2-proxy.${CLUSTER_FQDN}
tls:
- hosts:
- oauth2-proxy.${CLUSTER_FQDN}
priorityClassName: critical-priority
metrics:
servicemonitor:
enabled: true
EOF
helm upgrade --install --version "${OAUTH2_PROXY_HELM_CHART_VERSION}" --namespace oauth2-proxy --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-oauth2-proxy.yml" oauth2-proxy oauth2-proxy/oauth2-proxy
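The `cookieSecret` in the values file is generated inline with `openssl rand -base64 32 | head -c 32 | base64`. oauth2-proxy requires a cookie secret of exactly 16, 24, or 32 bytes (optionally base64-encoded), and this pipeline produces a base64 string that decodes to 32 bytes. You can sanity-check a generated secret locally before deploying:

```shell
# Generate a candidate cookie secret the same way the values file does
SECRET="$(openssl rand -base64 32 | head -c 32 | base64)"
# oauth2-proxy accepts cookie secrets of 16, 24, or 32 bytes
# (optionally base64-encoded); verify the decoded length is 32
DECODED_LEN=$(echo "${SECRET}" | base64 -d | wc -c | tr -d ' ')
echo "decoded secret length: ${DECODED_LEN}"
```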
Homepage
Install Homepage to provide a nice dashboard.
Install the homepage Helm chart and modify its default values:
# renovate: datasource=helm depName=homepage registryUrl=http://jameswynn.github.io/helm-charts
HOMEPAGE_HELM_CHART_VERSION="2.1.0"
helm repo add --force-update jameswynn http://jameswynn.github.io/helm-charts
cat > "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-homepage.yml" << EOF
enableRbac: true
serviceAccount:
create: true
ingress:
main:
enabled: true
annotations:
gethomepage.dev/enabled: "true"
gethomepage.dev/name: Homepage
gethomepage.dev/description: A modern, secure, highly customizable application dashboard
gethomepage.dev/group: Apps
gethomepage.dev/icon: homepage.png
nginx.ingress.kubernetes.io/auth-url: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/auth
nginx.ingress.kubernetes.io/auth-signin: https://oauth2-proxy.${CLUSTER_FQDN}/oauth2/start?rd=\$scheme://\$host\$request_uri
ingressClassName: nginx
hosts:
- host: ${CLUSTER_FQDN}
paths:
- path: /
pathType: Prefix
tls:
- hosts:
- ${CLUSTER_FQDN}
config:
bookmarks:
services:
widgets:
- logo:
icon: kubernetes.svg
- kubernetes:
cluster:
show: true
cpu: true
memory: true
showLabel: true
label: "${CLUSTER_NAME}"
nodes:
show: true
cpu: true
memory: true
showLabel: true
kubernetes:
mode: cluster
settings:
hideVersion: true
title: ${CLUSTER_FQDN}
favicon: https://raw.githubusercontent.com/homarr-labs/dashboard-icons/38631ad11695467d7a9e432d5fdec7a39a31e75f/svg/kubernetes.svg
layout:
Apps:
icon: mdi-apps
Observability:
icon: mdi-chart-bell-curve-cumulative
Cluster Management:
icon: mdi-tools
env:
- name: HOMEPAGE_ALLOWED_HOSTS
value: ${CLUSTER_FQDN}
- name: LOG_TARGETS
value: stdout
EOF
helm upgrade --install --version "${HOMEPAGE_HELM_CHART_VERSION}" --namespace homepage --create-namespace --values "${TMP_DIR}/${CLUSTER_FQDN}/helm_values-homepage.yml" homepage jameswynn/homepage
Clean-up
Back up the certificate before deleting the cluster (in case it was renewed):
if [[ "$(kubectl get --raw /api/v1/namespaces/cert-manager/services/cert-manager:9402/proxy/metrics | awk '/certmanager_http_acme_client_request_count.*acme-v02\.api.*finalize/ { print $2 }')" -gt 0 ]]; then
velero backup create --labels letsencrypt=production --ttl 2160h --from-schedule velero-monthly-backup-cert-manager-production
fi
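The condition above scrapes cert-manager's metrics endpoint through the API server proxy and checks whether any ACME finalize requests were made against Let's Encrypt production (i.e. whether a certificate was actually issued or renewed). The awk extraction itself can be illustrated on a sample Prometheus text-format line (the metric value and labels below are made up):

```shell
# Sample cert-manager metric line in Prometheus text format (value made up)
METRICS='certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/finalize/123/456",scheme="https",status="200"} 2'
# Same pattern as the backup condition: match the finalize metric and
# print its value (the second whitespace-separated field)
COUNT=$(echo "${METRICS}" | awk '/certmanager_http_acme_client_request_count.*acme-v02\.api.*finalize/ { print $2 }')
echo "finalize request count: ${COUNT}"
```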
Stop Karpenter from launching additional nodes:
helm uninstall -n karpenter karpenter || true
helm uninstall -n ingress-nginx ingress-nginx || true
Remove any remaining EC2 instances provisioned by Karpenter (if they still exist):
for EC2 in $(aws ec2 describe-instances --filters "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" "Name=tag:karpenter.sh/nodepool,Values=*" Name=instance-state-name,Values=running --query "Reservations[].Instances[].InstanceId" --output text); do
echo "🗑️ Removing Karpenter EC2: ${EC2}"
aws ec2 terminate-instances --instance-ids "${EC2}"
done
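Note that `terminate-instances` only requests termination; the instances linger in the `shutting-down` state for a while. If subsequent steps (such as deleting the CloudFormation stacks) race against that, you can block until termination completes with `aws ec2 wait`. A sketch, wrapped in a function so it is easy to drop in (the function name is illustrative):

```shell
# Sketch: block until all Karpenter-provisioned instances for this
# cluster are fully terminated (function name is illustrative)
wait_for_karpenter_instances() {
  local INSTANCE_IDS
  INSTANCE_IDS=$(aws ec2 describe-instances \
    --filters "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" \
              "Name=tag:karpenter.sh/nodepool,Values=*" \
              "Name=instance-state-name,Values=running,shutting-down" \
    --query "Reservations[].Instances[].InstanceId" --output text)
  if [[ -n "${INSTANCE_IDS}" ]]; then
    # Intentionally unquoted: wait takes the IDs as separate arguments
    # shellcheck disable=SC2086
    aws ec2 wait instance-terminated --instance-ids ${INSTANCE_IDS}
  fi
}
```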
Disassociate a Route 53 Resolver query log configuration from an Amazon VPC:
RESOLVER_QUERY_LOG_CONFIGS_ID=$(aws route53resolver list-resolver-query-log-configs --query "ResolverQueryLogConfigs[?contains(DestinationArn, '/aws/eks/${CLUSTER_NAME}/cluster')].Id" --output text)
if [[ -n "${RESOLVER_QUERY_LOG_CONFIGS_ID}" ]]; then
RESOLVER_QUERY_LOG_CONFIG_ASSOCIATIONS_RESOURCEID=$(aws route53resolver list-resolver-query-log-config-associations --filters "Name=ResolverQueryLogConfigId,Values=${RESOLVER_QUERY_LOG_CONFIGS_ID}" --query 'ResolverQueryLogConfigAssociations[].ResourceId' --output text)
if [[ -n "${RESOLVER_QUERY_LOG_CONFIG_ASSOCIATIONS_RESOURCEID}" ]]; then
aws route53resolver disassociate-resolver-query-log-config --resolver-query-log-config-id "${RESOLVER_QUERY_LOG_CONFIGS_ID}" --resource-id "${RESOLVER_QUERY_LOG_CONFIG_ASSOCIATIONS_RESOURCEID}"
sleep 5
fi
fi
Clean up AWS Route 53 Resolver query log configurations:
AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID=$(aws route53resolver list-resolver-query-log-configs --query "ResolverQueryLogConfigs[?Name=='${CLUSTER_NAME}-vpc-dns-logs'].Id" --output text)
if [[ -n "${AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID}" ]]; then
aws route53resolver delete-resolver-query-log-config --resolver-query-log-config-id "${AWS_CLUSTER_ROUTE53_RESOLVER_QUERY_LOG_CONFIG_ID}"
fi
Remove the EKS cluster and its created components:
if eksctl get cluster --name="${CLUSTER_NAME}"; then
eksctl delete cluster --name="${CLUSTER_NAME}" --force
fi
Remove the Route 53 DNS records from the DNS Zone:
CLUSTER_FQDN_ZONE_ID=$(aws route53 list-hosted-zones --query "HostedZones[?Name==\`${CLUSTER_FQDN}.\`].Id" --output text)
if [[ -n "${CLUSTER_FQDN_ZONE_ID}" ]]; then
aws route53 list-resource-record-sets --hosted-zone-id "${CLUSTER_FQDN_ZONE_ID}" | jq -c '.ResourceRecordSets[] | select (.Type != "SOA" and .Type != "NS")' |
while read -r RESOURCERECORDSET; do
aws route53 change-resource-record-sets \
--hosted-zone-id "${CLUSTER_FQDN_ZONE_ID}" \
--change-batch '{"Changes":[{"Action":"DELETE","ResourceRecordSet": '"${RESOURCERECORDSET}"' }]}' \
--output text --query 'ChangeInfo.Id'
done
fi
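The loop splices each record set (as emitted by the `jq` filter) directly into a `DELETE` change batch. The JSON construction can be sanity-checked locally; the record set below is illustrative:

```shell
# Illustrative record set, shaped like the jq filter's output
RESOURCERECORDSET='{"Name":"test.k01.k8s.mylabs.dev.","Type":"A","TTL":300,"ResourceRecords":[{"Value":"192.0.2.1"}]}'
# Splice it into a DELETE change batch exactly as the loop does
CHANGE_BATCH='{"Changes":[{"Action":"DELETE","ResourceRecordSet": '"${RESOURCERECORDSET}"' }]}'
# The result must be valid JSON for change-resource-record-sets to accept it
echo "${CHANGE_BATCH}"
```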
Delete the instance profile that belongs to the Karpenter node role:
if AWS_INSTANCE_PROFILES_FOR_ROLE=$(aws iam list-instance-profiles-for-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" --query 'InstanceProfiles[].{Name:InstanceProfileName}' --output text); then
if [[ -n "${AWS_INSTANCE_PROFILES_FOR_ROLE}" ]]; then
aws iam remove-role-from-instance-profile --instance-profile-name "${AWS_INSTANCE_PROFILES_FOR_ROLE}" --role-name "KarpenterNodeRole-${CLUSTER_NAME}"
aws iam delete-instance-profile --instance-profile-name "${AWS_INSTANCE_PROFILES_FOR_ROLE}"
fi
fi
Remove the CloudFormation stack:
aws cloudformation delete-stack --stack-name "${CLUSTER_NAME}-route53-kms"
aws cloudformation delete-stack --stack-name "${CLUSTER_NAME}-karpenter"
aws cloudformation wait stack-delete-complete --stack-name "${CLUSTER_NAME}-route53-kms"
aws cloudformation wait stack-delete-complete --stack-name "${CLUSTER_NAME}-karpenter"
aws cloudformation wait stack-delete-complete --stack-name "eksctl-${CLUSTER_NAME}-cluster"
Remove volumes and snapshots related to the cluster (as a precaution):
for VOLUME in $(aws ec2 describe-volumes --filter "Name=tag:KubernetesCluster,Values=${CLUSTER_NAME}" "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" --query 'Volumes[].VolumeId' --output text); do
echo "💾 Removing Volume: ${VOLUME}"
aws ec2 delete-volume --volume-id "${VOLUME}"
done
# Remove EBS snapshots associated with the cluster
for SNAPSHOT in $(aws ec2 describe-snapshots --owner-ids self --filter "Name=tag:Name,Values=${CLUSTER_NAME}-dynamic-snapshot*" "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" --query 'Snapshots[].SnapshotId' --output text); do
echo "📸 Removing Snapshot: ${SNAPSHOT}"
aws ec2 delete-snapshot --snapshot-id "${SNAPSHOT}"
done
Remove the CloudWatch log group:
if [[ "$(aws logs describe-log-groups --query "logGroups[?logGroupName==\`/aws/eks/${CLUSTER_NAME}/cluster\`] | [0].logGroupName" --output text)" = "/aws/eks/${CLUSTER_NAME}/cluster" ]]; then
aws logs delete-log-group --log-group-name "/aws/eks/${CLUSTER_NAME}/cluster"
fi
Remove the ${TMP_DIR}/${CLUSTER_FQDN} directory:
if [[ -d "${TMP_DIR}/${CLUSTER_FQDN}" ]]; then
for FILE in "${TMP_DIR}/${CLUSTER_FQDN}"/{kubeconfig-${CLUSTER_NAME}.conf,{aws-cf-route53-kms,cloudformation-karpenter,eksctl-${CLUSTER_NAME},helm_values-{alloy,aws-load-balancer-controller,beyla,cert-manager,external-dns,grafana,homepage,ingress-nginx,karpenter,loki,mailpit,mimir-distributed,oauth2-proxy,tempo,velero},k8s-{karpenter-nodepool,scheduling-priorityclass,storage-snapshot-storageclass-volumesnapshotclass}}.yml}; do
if [[ -f "${FILE}" ]]; then
rm -v "${FILE}"
else
echo "❌ File not found: ${FILE}"
fi
done
rmdir "${TMP_DIR}/${CLUSTER_FQDN}"
fi
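The cleanup loop above relies on bash's nested brace expansion to enumerate every generated file from a single pattern. A minimal illustration of how one pattern expands to several names (file names below are illustrative):

```shell
# Nested brace expansion: a single pattern expands to several file names
EXPANDED=$(echo helm_values-{alloy,grafana,loki}.yml)
echo "${EXPANDED}"
```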
Enjoy … 😉