Runbook Library

Operational procedures, automation playbooks, and incident response guides for IT operations

Critical Incidents Server Operations Database Procedures Network Ops Cache Management Security Response

All Runbooks

Automated

Manual

Hybrid

Kubernetes Pod Recovery

RB-K8S-001

Automated recovery procedure for pods in CrashLoopBackOff or pending state. Includes diagnostics, remediation, and escalation paths.

High Automated ~5 min 234 runs

PostgreSQL Primary Failover

RB-DB-001

Emergency failover procedure for PostgreSQL primary database. Includes health checks, replica promotion, and application cutover.

Critical Hybrid ~15 min 12 runs

Redis Cache Flush & Warm

RB-CACHE-001

Selective cache invalidation and pre-warming procedure. Supports pattern-based flush with minimal impact.

Medium Automated ~3 min 567 runs

SSL Certificate Renewal

RB-SEC-001

Automated SSL certificate renewal and deployment across load balancers and application servers.

High Automated ~10 min 89 runs

Network Connectivity Diagnostics

RB-NET-001

Comprehensive network connectivity check including DNS resolution, port scanning, and latency measurements.

Low Automated ~2 min 1,234 runs

Application Pool Recycle

RB-APP-001

Graceful application pool recycle with connection draining, health validation, and rollback capabilities.

Medium Automated ~5 min 456 runs

Security Incident Response

RB-SEC-002

Comprehensive security incident response procedure including containment, investigation, and remediation steps.

Critical Manual ~60 min 8 runs

Disk Space Cleanup

RB-SRV-002

Automated disk cleanup for log files, temp directories, and old deployments. Includes safety checks and reporting.

Low Automated ~8 min 789 runs

Back to Runbooks

RB-K8S-001

Kubernetes Pod Recovery

High Automated ~5 min Last run by Sarah Chen 2 hours ago

Prerequisites

kubectl access to cluster

Namespace admin permissions

Access to monitoring dashboards

Procedure Steps

1

Identify Affected Pods

Check the current state of pods in the namespace to identify which ones are in CrashLoopBackOff or pending state.

kubectl get pods -n production | grep -E 'CrashLoopBackOff|Pending|Error'
2

Check Pod Logs

Retrieve logs from the affected pod to identify the root cause of the crash.

kubectl logs pod-name -n production --previous

Use --previous flag to get logs from the crashed container instance
3

Check Pod Events

Review Kubernetes events for additional context about the failure.

kubectl describe pod pod-name -n production | tail -20
4

Apply Remediation

Based on the identified cause, apply the appropriate fix. Common remediation includes restarting the deployment or scaling replicas.

kubectl rollout restart deployment/deployment-name -n production

This will cause brief service disruption. Ensure load balancer can handle the traffic redistribution.
5

Verify Recovery

Confirm that pods have recovered and are in Running state with ready status.

kubectl get pods -n production -w

Escalation Path

L1 Support

Platform Team

SRE Lead

Vendor Support