Overview
IT Operations uses an open-source observability stack to monitor server health, network performance, application response times, and storage utilization.
Components
Prometheus
Scrapes time-series metrics from exporters for every monitored host and device. Metrics are retained locally for 30 days; Thanos provides long-term storage. Exporters in use:
- Node Exporter - CPU, memory, disk, and network stats for Linux servers
- Windows Exporter - equivalent metrics for Windows hosts
- SNMP Exporter - network device metrics (switches, firewalls, UPS)
- Blackbox Exporter - HTTP/HTTPS endpoint uptime probes
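As an illustration of how the scraped data can be checked, the sketch below uses the Prometheus instant-query HTTP endpoint (/api/v1/query) and the built-in `up` metric to list targets whose last scrape failed. The base URL, host names, and sample response are hypothetical placeholders, not values from this environment, and this is not a script IT Operations is known to run.

```python
import json
from urllib.parse import urlencode

# Hypothetical address; the real Prometheus URL is not stated on this page.
PROMETHEUS_URL = "http://prometheus.example.invalid:9090"


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response_body):
    """Return the sorted instance labels of targets whose `up` value is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for sample in payload["data"]["result"]:
        # Each instant-vector sample carries [timestamp, "value-as-string"].
        _timestamp, value = sample["value"]
        if float(value) == 0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(down)


# Made-up response in the shape returned by the instant-query API.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "web01:9100"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "instance": "db01:9100"},
             "value": [1700000000, "0"]},
        ],
    },
})

if __name__ == "__main__":
    print(build_query_url(PROMETHEUS_URL, "up"))
    print(down_targets(SAMPLE_RESPONSE))
```

Parsing is kept separate from the HTTP call so the logic can be tried against a saved response; fetching the URL for real requires Prometheus access.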
Grafana
Provides dashboards and visualizations for IT Operations, the Network Team, Development, and IT Leadership.
PagerDuty
Critical alerts from Alertmanager route to PagerDuty for on-call escalation. Lower-severity alerts go to the #it-alerts Teams channel.
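The routing rule above can be pictured with a small sketch. It assumes alerts carry a `severity` label whose value is `critical` for pages; the label name and the receiver names `pagerduty` and `teams` are illustrative assumptions. The actual routing is defined in the Alertmanager configuration, not in Python.

```python
def route_alert(labels):
    """Pick a receiver for an alert based on its (assumed) severity label."""
    severity = labels.get("severity", "").lower()
    if severity == "critical":
        # Critical alerts page the on-call engineer through PagerDuty.
        return "pagerduty"
    # Everything else, including unlabeled alerts, goes to #it-alerts in Teams.
    return "teams"


if __name__ == "__main__":
    print(route_alert({"alertname": "HostDown", "severity": "critical"}))
    print(route_alert({"alertname": "DiskFilling", "severity": "warning"}))
```

Sending unlabeled alerts to the chat channel is a choice made for this sketch; a real configuration might treat a missing severity differently.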
Access
- Grafana: grafana.internal.example.edu - all IT staff have read access
- Prometheus/Alertmanager: Restricted to IT Infrastructure team members
Adding a New Monitor
Open a ticket at it.example.edu/support and select the Monitoring Request category.