Runbook Library

Operational procedures, automation playbooks, and incident response guides for IT operations

Critical Incidents Server Operations Database Procedures Network Ops Cache Management Security Response
All Runbooks
Automated
Manual
Hybrid
Kubernetes Pod Recovery
RB-K8S-001
Automated recovery procedure for pods in CrashLoopBackOff or pending state. Includes diagnostics, remediation, and escalation paths.
High Automated ~5 min 234 runs
PostgreSQL Primary Failover
RB-DB-001
Emergency failover procedure for PostgreSQL primary database. Includes health checks, replica promotion, and application cutover.
Critical Hybrid ~15 min 12 runs
Redis Cache Flush & Warm
RB-CACHE-001
Selective cache invalidation and pre-warming procedure. Supports pattern-based flush with minimal impact.
Medium Automated ~3 min 567 runs
SSL Certificate Renewal
RB-SEC-001
Automated SSL certificate renewal and deployment across load balancers and application servers.
High Automated ~10 min 89 runs
Network Connectivity Diagnostics
RB-NET-001
Comprehensive network connectivity check including DNS resolution, port scanning, and latency measurements.
Low Automated ~2 min 1,234 runs
Application Pool Recycle
RB-APP-001
Graceful application pool recycle with connection draining, health validation, and rollback capabilities.
Medium Automated ~5 min 456 runs
Security Incident Response
RB-SEC-002
Comprehensive security incident response procedure including containment, investigation, and remediation steps.
Critical Manual ~60 min 8 runs
Disk Space Cleanup
RB-SRV-002
Automated disk cleanup for log files, temp directories, and old deployments. Includes safety checks and reporting.
Low Automated ~8 min 789 runs
Back to Runbooks
RB-K8S-001

Kubernetes Pod Recovery

High Automated ~5 min Last run by Sarah Chen 2 hours ago
Prerequisites
kubectl access to cluster
Namespace admin permissions
Access to monitoring dashboards
Procedure Steps
  • 1
    Identify Affected Pods
    Check the current state of pods in the namespace to identify which ones are in CrashLoopBackOff or pending state.
    kubectl get pods -n production | grep -E 'CrashLoopBackOff|Pending|Error'
  • 2
    Check Pod Logs
    Retrieve logs from the affected pod to identify the root cause of the crash.
    kubectl logs pod-name -n production --previous
    Use --previous flag to get logs from the crashed container instance
  • 3
    Check Pod Events
    Review Kubernetes events for additional context about the failure.
    kubectl describe pod pod-name -n production | tail -20
  • 4
    Apply Remediation
    Based on the identified cause, apply the appropriate fix. Common remediation includes restarting the deployment or scaling replicas.
    kubectl rollout restart deployment/deployment-name -n production
    This will cause brief service disruption. Ensure load balancer can handle the traffic redistribution.
  • 5
    Verify Recovery
    Confirm that pods have recovered and are in Running state with ready status.
    kubectl get pods -n production -w
Escalation Path
L1 Support
Platform Team
SRE Lead
Vendor Support
DAMAC IT Assistant
Online - Ready to help

Hello! I'm your DAMAC IT Assistant.

I can help you with:

  • Create new user accounts
  • Request system access
  • Reset passwords
  • Document access
  • IT FAQs & support

What would you like help with today?

Just now