Runbook Library
Operational procedures, automation playbooks, and incident response guides for IT operations
Critical Incidents
Server Operations
Database Procedures
Network Ops
Cache Management
Security Response
All Runbooks
Automated
Manual
Hybrid
Kubernetes Pod Recovery
RB-K8S-001
Automated recovery procedure for pods in CrashLoopBackOff or pending state. Includes diagnostics, remediation, and escalation paths.
PostgreSQL Primary Failover
RB-DB-001
Emergency failover procedure for PostgreSQL primary database. Includes health checks, replica promotion, and application cutover.
Redis Cache Flush & Warm
RB-CACHE-001
Selective cache invalidation and pre-warming procedure. Supports pattern-based flush with minimal impact.
SSL Certificate Renewal
RB-SEC-001
Automated SSL certificate renewal and deployment across load balancers and application servers.
Network Connectivity Diagnostics
RB-NET-001
Comprehensive network connectivity check including DNS resolution, port scanning, and latency measurements.
Application Pool Recycle
RB-APP-001
Graceful application pool recycle with connection draining, health validation, and rollback capabilities.
Security Incident Response
RB-SEC-002
Comprehensive security incident response procedure including containment, investigation, and remediation steps.
Disk Space Cleanup
RB-SRV-002
Automated disk cleanup for log files, temp directories, and old deployments. Includes safety checks and reporting.
Back to Runbooks
RB-K8S-001
Kubernetes Pod Recovery
Prerequisites
kubectl access to cluster
Namespace admin permissions
Access to monitoring dashboards
Procedure Steps
-
1Identify Affected PodsCheck the current state of pods in the namespace to identify which ones are in CrashLoopBackOff or pending state.kubectl get pods -n production | grep -E 'CrashLoopBackOff|Pending|Error'
-
2Check Pod LogsRetrieve logs from the affected pod to identify the root cause of the crash.kubectl logs pod-name -n production --previousUse --previous flag to get logs from the crashed container instance
-
3Check Pod EventsReview Kubernetes events for additional context about the failure.kubectl describe pod pod-name -n production | tail -20
-
4Apply RemediationBased on the identified cause, apply the appropriate fix. Common remediation includes restarting the deployment or scaling replicas.kubectl rollout restart deployment/deployment-name -n productionThis will cause brief service disruption. Ensure load balancer can handle the traffic redistribution.
-
5Verify RecoveryConfirm that pods have recovered and are in Running state with ready status.kubectl get pods -n production -w
Escalation Path
L1 Support
Platform Team
SRE Lead
Vendor Support