About the IT Monitoring & Alerting Stack

Overview

IT Operations uses an open-source observability stack to monitor server health, network performance, application response times, and storage utilization.

Components

Prometheus

Prometheus scrapes time-series metrics from exporters on every monitored host. Metrics are retained locally for 30 days, and Thanos provides long-term storage. The following exporters are in use:

Node Exporter - CPU, memory, disk, and network stats for Linux servers
Windows Exporter - equivalent metrics for Windows hosts
SNMP Exporter - network device metrics (switches, firewalls, UPS)
Blackbox Exporter - HTTP/HTTPS endpoint uptime probes
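
As a rough illustration of what Prometheus collects from these exporters, the sketch below parses a few lines in the Prometheus text exposition format. The sample payload is made up for illustration: the metric names follow Node Exporter's naming, but the values and the parse_samples helper are assumptions, not part of this stack's configuration. The parser is deliberately simplified and assumes each line ends with a single value and no timestamp.

```python
# Hypothetical sample of what an exporter's /metrics endpoint might return.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 1.5e+10
"""


def parse_samples(text):
    """Return {series: value}, skipping comment and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field (no timestamp assumed).
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples
```

Calling parse_samples(SAMPLE) yields one entry per series, keyed by the metric name together with its labels. In practice Prometheus does this parsing itself during each scrape, so a helper like this is mainly useful for spot-checking an exporter by hand.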

Grafana

Grafana provides dashboards and visualizations for IT Operations, the Network Team, Development, and IT Leadership.

PagerDuty

Critical alerts from Alertmanager route to PagerDuty for on-call escalation. Lower-severity alerts go to the #it-alerts Teams channel.
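
The routing rule above can be sketched as a small function. This only illustrates the decision and is not the actual Alertmanager configuration: the severity label name, the critical value, and the receiver names pagerduty and teams-it-alerts are all assumptions.

```python
def pick_receiver(labels):
    """Choose a notification receiver from an alert's labels (hypothetical names)."""
    # Assumption: alerts carry a "severity" label; anything that is not
    # "critical" (including a missing label) is treated as lower severity.
    if labels.get("severity") == "critical":
        return "pagerduty"       # on-call escalation
    return "teams-it-alerts"     # posts to the #it-alerts Teams channel
```

For example, pick_receiver({"alertname": "DiskFull", "severity": "critical"}) selects the PagerDuty receiver, while an alert labeled severity "warning" falls through to the Teams channel.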

Access

Grafana: grafana.internal.example.edu - all IT staff have read access
Prometheus/Alertmanager: Restricted to IT Infrastructure team members

Adding a New Monitor

Open a ticket at it.example.edu/support and select the Monitoring Request category.
