Purpose
This playbook provides a standardised response procedure for on-call engineers handling a production server outage (P1 or P2 incident). Following this runbook ensures consistent, thorough incident handling regardless of which engineer responds.
Scope
Applicable to any Linux or Windows production server managed by Infrastructure & Operations, including web, database, application, and file servers.
Prerequisites
- On-call access to the out-of-band management console (oob-admin.example.edu)
- Access to the monitoring dashboard (monitoring.example.edu)
- Access to the change record system (this portal) to check for recent changes
- Valid SSH key or RDP credential for production servers
Response Steps
Step 1 — Acknowledge the Alert (0–5 min)
- Acknowledge the PagerDuty alert to stop escalation.
- Open the monitoring dashboard and identify the affected server(s).
- Note the alert type: CPU, memory, disk, network unreachable, or service down.
Step 2 — Initial Triage (5–15 min)
- Attempt to ping the server from the jump host:

  ```
  ping -c 5 <hostname>.example.edu
  ```

- If the server is reachable, SSH in and check:

  ```
  systemctl --failed    # check for failed services
  top -bn1 | head -20   # check for resource exhaustion
  df -h                 # check disk space
  dmesg | tail -50      # check kernel messages
  ```

- If the server is not reachable, connect via the out-of-band console at oob-admin.example.edu.
- Review the Changes section on this portal for any changes deployed in the past 24 hours that could be related.
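When triage points at disk space, the `df` check above can be narrowed with a small helper that prints only the filesystems near capacity (a sketch; the function name and the 90% default threshold are illustrative, not an existing tool):

```shell
# flag_full_filesystems: read `df -P` output on stdin and print any mount
# point whose usage meets or exceeds the given threshold (default 90%).
flag_full_filesystems() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # strip the % sign from the Capacity column
        if (use + 0 >= t) print $6 " at " use "%"
    }'
}

# usage (on the affected server):
#   df -P | flag_full_filesystems 90
```

`df -P` is used rather than `df -h` so each filesystem occupies exactly one line, which keeps the column positions stable for `awk`.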
Step 3 — Classify Severity
| Condition | Severity |
|---|---|
| Production service completely unavailable to all users | P1 |
| Production service degraded for >50% of users | P1 |
| Production service degraded for <50% of users | P2 |
| Non-production service down | P3 |
For P1 incidents, immediately page the on-call manager at ext. 5555.
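The severity table can also be expressed as a small helper for scripts or chat-ops tooling (a sketch; the function name and its prod/nonprod and percentage argument convention are assumptions, not an existing tool):

```shell
# classify_severity: map outage scope to a severity per the table above.
# Arguments (illustrative): environment ("prod" or "nonprod") and the
# percentage of users affected. Complete unavailability counts as 100%.
classify_severity() {
    env="$1"
    pct="$2"
    if [ "$env" != "prod" ]; then
        echo "P3"            # non-production service down
    elif [ "$pct" -gt 50 ]; then
        echo "P1"            # completely unavailable, or degraded for >50% of users
    else
        echo "P2"            # degraded for <50% of users
    fi
}

# usage:
#   classify_severity prod 100    # complete prod outage
#   classify_severity nonprod 100
```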
Step 4 — Attempt Recovery (15–45 min)
Use the following common recovery actions in order:
- Restart the affected service and review its recent logs:

  ```
  systemctl restart <service-name>
  journalctl -u <service-name> -n 100
  ```

- Free disk space if the disk is full:

  ```
  journalctl --vacuum-size=500M
  find /tmp -type f -mtime +1 -delete
  ```

- Reboot the server (only if the service restart fails and the situation is deteriorating):
  - For VMs: coordinate with a second engineer before rebooting to avoid data corruption.
  - Issue a graceful reboot if possible:

    ```
    shutdown -r now
    ```

  - If the VM is unresponsive, use the vSphere Console to perform a hard reset and document the decision in the incident ticket.
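Because `find ... -delete` is irreversible, the disk-space step above is safer with a preview pass before the destructive pass. A minimal sketch (the function name is illustrative; the age cutoff matches the `-mtime +1` used above):

```shell
# cleanup_old_files: list, then delete, regular files older than one day
# under the given directory. -type f avoids touching directories, and the
# -print pass lets the engineer review what will be removed first.
cleanup_old_files() {
    dir="${1:?usage: cleanup_old_files <dir>}"
    find "$dir" -type f -mtime +1 -print    # dry run: show candidates
    find "$dir" -type f -mtime +1 -delete   # actual cleanup
}

# usage:
#   cleanup_old_files /tmp
```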
Step 5 — Verify Recovery
- Confirm the service is responding:

  ```
  curl -sf https://<hostname>.example.edu/health || echo "FAILED"
  ```

- Confirm monitoring alerts have cleared in PagerDuty and Grafana.
- Notify the on-call manager that the issue is resolved.
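Services often take a short while to come back after a restart, so the one-shot `curl` check above can be wrapped in a retry loop (a sketch; `wait_healthy` and its attempt/delay parameters are illustrative, and the health-endpoint URL is the one from the step above):

```shell
# wait_healthy: poll a health endpoint until it responds with success
# or the attempts run out. Returns 0 on recovery, 1 on timeout.
wait_healthy() {
    url="$1"
    attempts="${2:-5}"
    delay="${3:-10}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        if curl -sf --max-time 5 "$url" > /dev/null; then
            echo "healthy: $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "FAILED: $url did not recover after $attempts checks" >&2
    return 1
}

# usage:
#   wait_healthy "https://<hostname>.example.edu/health" 5 10
```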
Step 6 — Document and Escalate (if Not Resolved)
If the issue is not resolved within 45 minutes:
- Escalate to the on-call manager immediately.
- Open an incident ticket if not already open.
- Begin gathering logs for post-incident analysis (see Step 7).
Step 7 — Post-Incident Review
Within 2 business days of a P1 incident, complete a Post-Incident Review (PIR):
- Write a timeline of events in the incident ticket.
- Identify the root cause (hardware failure, software bug, configuration error, human error).
- Document corrective actions taken and any follow-up work items.
- Submit the PIR to the IT Manager for review.
Contacts
| Role | Contact |
|---|---|
| On-call engineer | Via PagerDuty |
| On-call manager | ext. 5555 |
| Vendor support (VMware) | 1-800-XXX-XXXX |
| Vendor support (Dell server hardware) | 1-800-XXX-XXXX, Contract #XXX |