Purpose

This runbook provides a standardised response procedure for on-call engineers responding to a production server outage (P1 or P2 incident). Following it ensures consistent, thorough incident handling regardless of which engineer responds.


Scope

Applicable to any Linux or Windows production server managed by Infrastructure & Operations, including web, database, application, and file servers.


Prerequisites

  • On-call access to the out-of-band management console (oob-admin.example.edu)
  • Access to the monitoring dashboard (monitoring.example.edu)
  • Access to the change record system (this portal) to check for recent changes
  • Valid SSH key or RDP credential for production servers

Response Steps

Step 1 — Acknowledge the Alert (0–5 min)

  1. Acknowledge the PagerDuty alert to stop escalation.
  2. Open the monitoring dashboard and identify the affected server(s).
  3. Note the alert type: CPU, memory, disk, network unreachable, or service down.

Step 2 — Initial Triage (5–15 min)

  1. Attempt to ping the server from the jump host:
    ping -c 5 <hostname>.example.edu
    
  2. If the server is reachable, SSH in and check:
    systemctl --failed        # list failed services
    top -bn1 | head -20       # check for resource exhaustion
    df -h                     # check disk space
    dmesg | tail -50          # check kernel messages
    
  3. If the server is not reachable, connect via the out-of-band console at oob-admin.example.edu.
  4. Review the Changes section on this portal for any changes deployed in the past 24 hours that could be related.
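The triage commands above can be captured in one pass so the evidence survives even if the server is later rebooted. The helper below is an illustrative sketch, not a mandated tool; the function name and default log path are assumptions, and each command is guarded so a missing or permission-restricted utility does not abort the snapshot.

```shell
#!/bin/sh
# Hypothetical helper: capture a Step 2 triage snapshot to a timestamped
# file. Each section is guarded so one failing command (e.g. dmesg without
# root) does not stop the rest of the snapshot.
collect_triage() {
  out="${1:-/tmp/triage-$(date +%Y%m%d-%H%M%S).log}"
  {
    echo "=== uptime ==="; uptime 2>/dev/null || true
    echo "=== df -h ===";  df -h
    echo "=== top ===";    top -bn1 2>/dev/null | head -20
    echo "=== dmesg ===";  dmesg 2>/dev/null | tail -50
  } > "$out"
  echo "$out"   # print the snapshot path for the incident ticket
}
```

Attach the resulting file to the incident ticket before attempting any recovery action.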

Step 3 — Classify Severity

  Condition                                                 Severity
  ------------------------------------------------------    --------
  Production service completely unavailable to all users    P1
  Production service degraded for 50% or more of users      P1
  Production service degraded for less than 50% of users    P2
  Non-production service down                               P3

For P1 incidents, immediately page the on-call manager at ext. 5555.
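The classification table can be sketched as a small shell function, which keeps the decision consistent across responders. This is an illustrative sketch only; the function name, arguments, and "yes"/"no" production flag are assumptions, not part of any existing tooling.

```shell
#!/bin/sh
# Hypothetical severity classifier mirroring the table above.
# Args: state ("down"|"degraded"|"ok"), percent of users affected,
#       production flag ("yes"|"no", default "yes").
classify_severity() {
  state="$1"; pct="${2:-0}"; prod="${3:-yes}"
  if [ "$prod" != "yes" ]; then echo "P3"; return; fi   # non-production -> P3
  case "$state" in
    down)     echo "P1" ;;                               # fully unavailable
    degraded) if [ "$pct" -ge 50 ]; then echo "P1"; else echo "P2"; fi ;;
    *)        echo "OK" ;;
  esac
}
```

For example, a production service degraded for 60% of users classifies as P1 and requires paging the on-call manager.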

Step 4 — Attempt Recovery (15–45 min)

Use the following common recovery actions in order:

  1. Restart the affected service:
    systemctl restart <service-name>
    journalctl -u <service-name> -n 100
    
  2. Free disk space if disk full:
    journalctl --vacuum-size=500M
    find /tmp -type f -mtime +1 -delete    # files only; avoids errors on directories
    
  3. Reboot the server (only if service restart fails and the situation is deteriorating):
    • For VMs: coordinate with a second engineer before rebooting to avoid data corruption.
    • Issue a graceful shutdown if possible: shutdown -r now
    • If the VM is unresponsive, use vSphere Console to perform a hard reset — document the decision in the incident ticket.
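For recovery action 2, it helps to record disk usage before and after cleanup so the incident ticket shows how much space was reclaimed. The sketch below is an assumption about how one might wrap the runbook's cleanup commands; the function names and 90% threshold are illustrative defaults.

```shell
#!/bin/sh
# Hypothetical guard around the disk-cleanup commands in action 2.
# Reports usage before/after and only acts above a threshold.
disk_usage_pct() {
  # Print the Use% column for a mount point as a bare number, e.g. 42.
  df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

cleanup_if_full() {
  mnt="${1:-/}"; threshold="${2:-90}"
  before=$(disk_usage_pct "$mnt")
  if [ "$before" -ge "$threshold" ]; then
    journalctl --vacuum-size=500M            # trim journal logs (as in the runbook)
    find /tmp -type f -mtime +1 -delete      # remove stale temp files
    echo "usage: ${before}% -> $(disk_usage_pct "$mnt")%"
  else
    echo "usage: ${before}% (below ${threshold}%, no action)"
  fi
}
```

Paste the before/after line into the incident ticket as evidence of the recovery action.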

Step 5 — Verify Recovery

  1. Confirm the service is responding:
    curl -sf https://<hostname>.example.edu/health || echo "FAILED"
    
  2. Confirm monitoring alerts have cleared in PagerDuty and Grafana.
  3. Notify the on-call manager that the issue is resolved.
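Services often flap briefly after a restart, so a single health-check probe can report a false failure. One way to harden the verification step is to retry the check a few times before declaring failure; the wrapper below is a sketch under that assumption, with the check command passed in as a string and the attempt count and delay as illustrative defaults.

```shell
#!/bin/sh
# Hypothetical retry wrapper for the Step 5 health check.
# Args: check command (string), attempts (default 5), delay seconds (default 3).
verify_with_retries() {
  attempts="${2:-5}"; delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if sh -c "$1"; then
      echo "healthy after $i attempt(s)"; return 0
    fi
    sleep "$delay"; i=$((i + 1))
  done
  echo "FAILED after $attempts attempts"; return 1
}

# Example: verify_with_retries 'curl -sf https://<hostname>.example.edu/health'
```

Only notify the on-call manager of resolution once the check passes and the PagerDuty/Grafana alerts have cleared.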

Step 6 — Document and Escalate (if Not Resolved)

If the issue is not resolved within 45 minutes:

  • Escalate to the on-call manager immediately.
  • Open an incident ticket if not already open.
  • Begin gathering logs for post-incident analysis (see Step 7).

Step 7 — Post-Incident Review

Within 2 business days of a P1 incident, complete a Post-Incident Review (PIR):

  1. Write a timeline of events in the incident ticket.
  2. Identify the root cause (hardware failure, software bug, configuration error, human error).
  3. Document corrective actions taken and any follow-up work items.
  4. Submit the PIR to the IT Manager for review.

Contacts

  Role                                     Contact
  ------------------------------------     -----------------------------
  On-call engineer                         Via PagerDuty
  On-call manager                          ext. 5555
  Vendor support (VMware)                  1-800-XXX-XXXX
  Vendor support (Dell server hardware)    1-800-XXX-XXXX, Contract #XXX