Purpose

This runbook provides a standardised response procedure for on-call engineers responding to a production server outage (P1 or P2 incident). Following it ensures consistent, thorough incident handling regardless of which engineer responds.


Scope

Applicable to any Linux or Windows production server managed by Infrastructure & Operations, including web, database, application, and file servers.


Prerequisites

  • On-call access to the out-of-band management console (oob-admin.example.edu)
  • Access to the monitoring dashboard (monitoring.example.edu)
  • Access to the change record system (this portal) to check for recent changes
  • Valid SSH key or RDP credential for production servers

Response Steps

Step 1 — Acknowledge the Alert (0–5 min)

  1. Acknowledge the PagerDuty alert to stop escalation.
  2. Open the monitoring dashboard and identify the affected server(s).
  3. Note the alert type: CPU, memory, disk, network unreachable, or service down.

Step 2 — Initial Triage (5–15 min)

  1. Attempt to ping the server from the jump host:
    ping -c 5 <hostname>.example.edu
    
  2. If the server is reachable, SSH in and check:
    systemctl --failed        # list failed services
    top -bn1 | head -20       # check for resource exhaustion
    df -h                     # check disk space
    dmesg | tail -50          # check kernel messages
    
  3. If the server is not reachable, connect via the out-of-band console at oob-admin.example.edu.
  4. Review the Changes section on this portal for any changes deployed in the past 24 hours that could be related.
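The triage commands above can be captured in one pass so the evidence survives even if the server is later rebooted. The helper below is an illustrative sketch, not a mandated tool; the function name and default log path are assumptions, and each command is guarded so a missing or permission-restricted utility does not abort the snapshot.

```shell
#!/bin/sh
# Hypothetical helper: capture a Step 2 triage snapshot to a timestamped
# file. Each section is guarded so one failing command (e.g. dmesg without
# root) does not stop the rest of the snapshot.
collect_triage() {
  out="${1:-/tmp/triage-$(date +%Y%m%d-%H%M%S).log}"
  {
    echo "=== uptime ==="; uptime 2>/dev/null || true
    echo "=== df -h ===";  df -h
    echo "=== top ===";    top -bn1 2>/dev/null | head -20
    echo "=== dmesg ===";  dmesg 2>/dev/null | tail -50
  } > "$out"
  echo "$out"   # print the snapshot path for the incident ticket
}
```

Attach the resulting file to the incident ticket before attempting any recovery action.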

Step 3 — Classify Severity

  Condition                                                 Severity
  ------------------------------------------------------    --------
  Production service completely unavailable to all users    P1
  Production service degraded for 50% or more of users      P1
  Production service degraded for less than 50% of users    P2
  Non-production service down                               P3

For P1 incidents, immediately page the on-call manager at ext. 5555.
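The classification table can be sketched as a small shell function, which keeps the decision consistent across responders. This is an illustrative sketch only; the function name, arguments, and "yes"/"no" production flag are assumptions, not part of any existing tooling.

```shell
#!/bin/sh
# Hypothetical severity classifier mirroring the table above.
# Args: state ("down"|"degraded"|"ok"), percent of users affected,
#       production flag ("yes"|"no", default "yes").
classify_severity() {
  state="$1"; pct="${2:-0}"; prod="${3:-yes}"
  if [ "$prod" != "yes" ]; then echo "P3"; return; fi   # non-production -> P3
  case "$state" in
    down)     echo "P1" ;;                               # fully unavailable
    degraded) if [ "$pct" -ge 50 ]; then echo "P1"; else echo "P2"; fi ;;
    *)        echo "OK" ;;
  esac
}
```

For example, a production service degraded for 60% of users classifies as P1 and requires paging the on-call manager.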

Step 4 — Attempt Recovery (15–45 min)

Use the following common recovery actions in order:

  1. Restart the affected service:
    systemctl restart <service-name>
    journalctl -u <service-name> -n 100
    
  2. Free disk space if disk full:
    journalctl --vacuum-size=500M
    find /tmp -type f -mtime +1 -delete    # files only; avoids errors on directories
    
  3. Reboot the server (only if service restart fails and the situation is deteriorating):
    • For VMs: coordinate with a second engineer before rebooting to avoid data corruption.
    • Issue a graceful shutdown if possible: shutdown -r now
    • If the VM is unresponsive, use vSphere Console to perform a hard reset — document the decision in the incident ticket.
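For recovery action 2, it helps to record disk usage before and after cleanup so the incident ticket shows how much space was reclaimed. The sketch below is an assumption about how one might wrap the runbook's cleanup commands; the function names and 90% threshold are illustrative defaults.

```shell
#!/bin/sh
# Hypothetical guard around the disk-cleanup commands in action 2.
# Reports usage before/after and only acts above a threshold.
disk_usage_pct() {
  # Print the Use% column for a mount point as a bare number, e.g. 42.
  df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

cleanup_if_full() {
  mnt="${1:-/}"; threshold="${2:-90}"
  before=$(disk_usage_pct "$mnt")
  if [ "$before" -ge "$threshold" ]; then
    journalctl --vacuum-size=500M            # trim journal logs (as in the runbook)
    find /tmp -type f -mtime +1 -delete      # remove stale temp files
    echo "usage: ${before}% -> $(disk_usage_pct "$mnt")%"
  else
    echo "usage: ${before}% (below ${threshold}%, no action)"
  fi
}
```

Paste the before/after line into the incident ticket as evidence of the recovery action.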

Step 5 — Verify Recovery

  1. Confirm the service is responding:
    curl -sf https://<hostname>.example.edu/health || echo "FAILED"
    
  2. Confirm monitoring alerts have cleared in PagerDuty and Grafana.
  3. Notify the on-call manager that the issue is resolved.
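Services often flap briefly after a restart, so a single health-check probe can report a false failure. One way to harden the verification step is to retry the check a few times before declaring failure; the wrapper below is a sketch under that assumption, with the check command passed in as a string and the attempt count and delay as illustrative defaults.

```shell
#!/bin/sh
# Hypothetical retry wrapper for the Step 5 health check.
# Args: check command (string), attempts (default 5), delay seconds (default 3).
verify_with_retries() {
  attempts="${2:-5}"; delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if sh -c "$1"; then
      echo "healthy after $i attempt(s)"; return 0
    fi
    sleep "$delay"; i=$((i + 1))
  done
  echo "FAILED after $attempts attempts"; return 1
}

# Example: verify_with_retries 'curl -sf https://<hostname>.example.edu/health'
```

Only notify the on-call manager of resolution once the check passes and the PagerDuty/Grafana alerts have cleared.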

Step 6 — Document and Escalate (if Not Resolved)

If the issue is not resolved within 45 minutes:

  • Escalate to the on-call manager immediately.
  • Open an incident ticket if not already open.
  • Begin gathering logs for post-incident analysis (see Step 7).

Step 7 — Post-Incident Review

Within 2 business days of a P1 incident, complete a Post-Incident Review (PIR):

  1. Write a timeline of events in the incident ticket.
  2. Identify the root cause (hardware failure, software bug, configuration error, human error).
  3. Document corrective actions taken and any follow-up work items.
  4. Submit the PIR to the IT Manager for review.

Contacts

  Role                                     Contact
  ------------------------------------     -----------------------------
  On-call engineer                         Via PagerDuty
  On-call manager                          ext. 5555
  Vendor support (VMware)                  1-800-XXX-XXXX
  Vendor support (Dell server hardware)    1-800-XXX-XXXX, Contract #XXX