Overview

IT Operations uses an open-source observability stack to monitor server health, network performance, application response times, and storage utilization.


Components

Prometheus

Scrapes time-series metrics from exporters on every monitored host. Metrics are retained locally for 30 days, with long-term storage in Thanos. The exporters in use are:

  • Node Exporter - CPU, memory, disk, and network stats for Linux servers
  • Windows Exporter - equivalent metrics for Windows hosts
  • SNMP Exporter - network device metrics (switches, firewalls, UPS)
  • Blackbox Exporter - HTTP/HTTPS endpoint uptime probes
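
As a rough illustration of how the exporters above map to scrape targets, the sketch below builds the host:port string Prometheus would scrape for each exporter type. The port numbers are the upstream defaults for each exporter project; whether this deployment keeps those defaults is an assumption, and the function name, the short exporter keys, and the example hostname are hypothetical.

```python
# Upstream default listen ports for each exporter. Assumption: this
# deployment has not overridden them.
DEFAULT_PORTS = {
    "node": 9100,      # Node Exporter (Linux servers)
    "windows": 9182,   # Windows Exporter
    "snmp": 9116,      # SNMP Exporter
    "blackbox": 9115,  # Blackbox Exporter
}


def scrape_target(host: str, exporter: str) -> str:
    """Return the host:port address Prometheus would scrape.

    For the SNMP and Blackbox exporters, `host` is the machine running
    the exporter; the device or URL being probed is passed to the
    exporter separately as a request parameter.
    """
    try:
        port = DEFAULT_PORTS[exporter]
    except KeyError:
        raise ValueError(f"unknown exporter: {exporter!r}") from None
    return f"{host}:{port}"


if __name__ == "__main__":
    # Hypothetical hostname, used only for illustration.
    print(scrape_target("web01.example.edu", "node"))
```

A lookup table keeps the port assignments in one place, so a non-default port only needs to be changed once.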

Grafana

Provides dashboards and visualizations for IT Operations, the Network Team, Development, and IT Leadership.

PagerDuty

Alertmanager routes critical alerts to PagerDuty for on-call escalation. Lower-severity alerts go to the #it-alerts Teams channel.
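
The routing rule described above can be modeled in a few lines. This is only a sketch of the decision logic, not the actual Alertmanager configuration: it assumes alerts carry a `severity` label with the value `critical` for pages, which is a common convention but is not confirmed here. The function name and the returned receiver strings are hypothetical.

```python
from typing import Mapping

# Hypothetical receiver names, used only for illustration.
PAGERDUTY = "pagerduty"
TEAMS_CHANNEL = "teams:#it-alerts"


def route_alert(labels: Mapping[str, str]) -> str:
    """Pick a receiver from an alert's labels.

    Assumption: severity is carried in a label named "severity".
    Anything not marked critical, including alerts with no severity
    label, falls through to the Teams channel.
    """
    severity = labels.get("severity", "").strip().lower()
    if severity == "critical":
        return PAGERDUTY
    return TEAMS_CHANNEL


if __name__ == "__main__":
    print(route_alert({"alertname": "HostDown", "severity": "critical"}))
    print(route_alert({"alertname": "DiskFilling", "severity": "warning"}))
```

Defaulting unlabeled alerts to the lower-severity channel avoids paging on-call staff for alerts that were never classified; a stricter setup might treat a missing label as an error instead.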


Access


Adding a New Monitor

Open a ticket at it.example.edu/support and select the Monitoring Request category.