Prometheus Monitoring: Complete Setup & Best Practices

Prometheus has become the de facto standard for monitoring cloud-native applications and infrastructure, offering metrics collection, querying, and integration with visualization tools.

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit originally developed at SoundCloud in 2012 and now a Cloud Native Computing Foundation (CNCF) graduated project. It’s specifically designed for reliability and scalability in dynamic cloud environments, making it the go-to solution for monitoring microservices, containers, and Kubernetes clusters.

Key Features

Time-Series Database: Prometheus stores all data as time-series, identified by metric names and key-value pairs (labels), enabling flexible and powerful querying capabilities.

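Pull-Based Model: Rather than waiting for applications to push data, Prometheus scrapes metrics from configured targets at a fixed interval. A failed scrape is itself a signal (the up metric for that target drops to 0), so an unreachable target is detected without extra configuration on the target side.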

PromQL Query Language: A powerful functional query language allows you to slice and dice your metrics data in real-time, performing aggregations, transformations, and complex calculations.

Service Discovery: Automatic discovery of monitoring targets through various mechanisms including Kubernetes, Consul, EC2, and static configurations.

No External Dependencies: Prometheus operates as a single binary with no required external dependencies, simplifying deployment and reducing operational complexity.

Alerting: Prometheus evaluates alert rules and sends firing alerts to AlertManager, a separate component that handles deduplication, grouping, and routing to notification channels like email, PagerDuty, or Slack.

Architecture Overview

Understanding Prometheus architecture is crucial for effective deployment. The main components include:

  • Prometheus Server: Scrapes and stores metrics, evaluates rules, and serves queries
  • Client Libraries: Instrument application code to expose metrics
  • Exporters: Bridge third-party systems to Prometheus format
  • AlertManager: Handles alerts and notifications
  • Pushgateway: Accepts metrics from short-lived jobs that can’t be scraped

The typical data flow: Applications expose metrics endpoints → Prometheus scrapes these endpoints → Data is stored in time-series database → PromQL queries retrieve and analyze data → Alerts are generated based on rules → AlertManager processes and routes notifications.
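To make the first step of that flow concrete, here is a minimal sketch of an application exposing a metrics endpoint using only the Python standard library. Real services would normally use an official client library instead. The metric name app_requests_total, port 8000, and the helpers render_metrics and serve are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter, keyed by HTTP method.
REQUEST_COUNTS = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total requests handled (illustrative metric).",
        "# TYPE app_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Prometheus scrapes this path and stores each line as a sample.
            body = render_metrics(REQUEST_COUNTS).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            # Any other path counts as application traffic.
            REQUEST_COUNTS["GET"] += 1
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"hello\n")


def serve(port=8000):
    """Start the server; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding localhost:8000 as a scrape target would let Prometheus collect the counter.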

When deploying infrastructure on Ubuntu 24.04, Prometheus provides an excellent foundation for comprehensive monitoring.

Installing Prometheus on Ubuntu

Let’s walk through installing Prometheus on a Linux system. We’ll use Ubuntu as the example, but the process is similar for other distributions.

Download and Install

First, create a dedicated user for Prometheus:

sudo useradd --no-create-home --shell /bin/false prometheus

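Download a Prometheus release. This guide uses v2.48.0; check the project's releases page for the current version and adjust the file names accordingly: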

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
cd prometheus-2.48.0.linux-amd64

Copy binaries and create directories:

sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool

sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

sudo cp -r consoles /etc/prometheus
sudo cp -r console_libraries /etc/prometheus
sudo cp prometheus.yml /etc/prometheus/prometheus.yml
sudo chown -R prometheus:prometheus /etc/prometheus

For package management on Ubuntu, refer to our comprehensive Ubuntu Package Management guide.

Configure Prometheus

Edit /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

Create Systemd Service

Create /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --storage.tsdb.retention.time=30d

[Install]
WantedBy=multi-user.target

Start and enable Prometheus:

sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
sudo systemctl status prometheus

Access the Prometheus web interface at http://localhost:9090.
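Besides the web interface, the same port serves the HTTP API. As a rough sketch, the snippet below runs an instant query through the /api/v1/query endpoint with the Python standard library. The helper names build_query_url, extract_values, and query are hypothetical, and the base URL assumes the local install described above.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_values(response_body):
    """Map each series' labels to its latest value from an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    results = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        results[labels] = float(value)
    return results


def query(base_url, promql, timeout=5):
    """Run an instant query against a live server (requires network access)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_values(resp.read())
```

For example, query("http://localhost:9090", "up") should return one entry per scrape target once the server is running.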

Setting Up Node Exporter

Node Exporter exposes hardware and OS metrics for Linux systems. Install it to monitor your servers:

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter

Create systemd service /etc/systemd/system/node_exporter.service:

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Start Node Exporter:

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

Node Exporter now exposes metrics on port 9100.
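The metrics served on that port use the Prometheus text exposition format: one sample per line, with optional labels in braces. The following is a simplified parser sketch to show the structure. It ignores escape sequences and timestamps, and the name parse_metrics is made up for this example.

```python
import re

# metric_name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Lines starting with '#' carry HELP/TYPE metadata, not samples.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the body of http://localhost:9100/metrics would yield tuples such as a node_cpu_seconds_total sample with cpu and mode labels.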

Understanding PromQL

PromQL (Prometheus Query Language) is the heart of querying Prometheus data. Here are essential query patterns:

Basic Queries

Select all time-series for a metric:

node_cpu_seconds_total

Filter by label:

node_cpu_seconds_total{mode="idle"}

Multiple label filters:

node_cpu_seconds_total{mode="idle",cpu="0"}

Range Vectors and Aggregations

Calculate rate over time:

rate(node_cpu_seconds_total{mode="idle"}[5m])
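To build intuition for what rate() returns, here is a simplified sketch of the idea: the per-second increase of a counter over the samples in the window, with counter resets handled. The real function also extrapolates to the window boundaries, which this version omits, and simple_rate is a hypothetical name.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value) in time order.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A drop means the counter was reset (for example, a process restart),
            # so everything counted since the reset is new increase.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Five scrapes, 15 seconds apart, with a reset before the fourth sample.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
per_second = simple_rate(window)  # total increase of 100 over 60 seconds
```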

Sum across all CPUs:

sum(rate(node_cpu_seconds_total{mode="idle"}[5m]))

Group by label:

sum by (mode) (rate(node_cpu_seconds_total[5m]))

Practical Examples

CPU usage percentage:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
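The arithmetic behind this expression can be checked by hand. In the sketch below, each value is the idle rate of one CPU (the fraction of each second spent idle); the four sample values and the function name cpu_usage_percent are invented for illustration.

```python
def cpu_usage_percent(idle_rates_by_cpu):
    """Mirror of: 100 - (avg(rate(...{mode="idle"}[5m])) * 100) for one instance."""
    avg_idle = sum(idle_rates_by_cpu.values()) / len(idle_rates_by_cpu)
    return 100 - avg_idle * 100


# Average idle fraction is 0.75, so usage comes out at roughly 25 percent.
usage = cpu_usage_percent({"0": 0.90, "1": 0.70, "2": 0.80, "3": 0.60})
```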

Memory usage:

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Disk usage:

(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

Network traffic rate:

rate(node_network_receive_bytes_total[5m])

Docker Deployment

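Running Prometheus in Docker containers offers flexibility and easier management. Inside the Compose network, containers reach each other by service name, so the targets in prometheus.yml should read node_exporter:9100 and alertmanager:9093 instead of localhost:9100 and localhost:9093.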

Create docker-compose.yml:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    ports:
      - "9100:9100"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    ports:
      - "9093:9093"
    restart: unless-stopped

volumes:
  prometheus_data:
  alertmanager_data:

Start the stack:

docker-compose up -d

Kubernetes Monitoring

Prometheus excels at monitoring Kubernetes clusters. The kube-prometheus-stack Helm chart provides a complete monitoring solution.

Install using Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack

This installs:

  • Prometheus Operator
  • Prometheus instance
  • AlertManager
  • Grafana
  • Node Exporter
  • kube-state-metrics
  • Pre-configured dashboards and alerts

Access Grafana:

kubectl port-forward svc/prometheus-grafana 3000:80

Default credentials: admin/prom-operator

For various Kubernetes distributions, the deployment process is similar with minor adjustments for platform-specific features.

Setting Up Alerting

AlertManager handles alerts sent by Prometheus. Configure alert rules and notification channels.

Alert Rules

Create /etc/prometheus/alert_rules.yml:

groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Available disk space is below 15% on {{ $labels.mountpoint }}"

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} instance {{ $labels.instance }} has been down for more than 2 minutes"
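The for clause in these rules means an alert first enters a pending state and only fires once the condition has held for the full duration. A small simulation of that behavior, under the assumption of a fixed evaluation interval (alert_states is a hypothetical helper, not part of Prometheus):

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, True when the rule expression
    returned a result at that evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Evaluate every 60s with "for: 2m": fires on the third consecutive true result.
timeline = alert_states([True, True, True, False, True], eval_interval_s=60, for_s=120)
```

A brief dip below the threshold resets the timer, which is why short for values make alerts noisier and long ones delay notification.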

AlertManager Configuration

Create /etc/prometheus/alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-email'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'team-pagerduty'
    - matchers:
        - severity="warning"
      receiver: 'team-slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        headers:
          Subject: '{{ .GroupLabels.alertname }}: {{ .Status | toUpper }}'

  - name: 'team-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'team-pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
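The route block in this configuration behaves like an ordered lookup: child routes are checked top to bottom, the first one whose labels match wins, and anything unmatched falls back to the top-level receiver. A simplified sketch of that selection logic (it ignores nested routes, the continue option, and grouping; pick_receiver is a hypothetical name):

```python
DEFAULT_RECEIVER = "team-email"

# Child routes in the order they appear in the configuration.
ROUTES = [
    ({"severity": "critical"}, "team-pagerduty"),
    ({"severity": "warning"}, "team-slack"),
]


def pick_receiver(alert_labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver of the first route whose labels all match."""
    for required_labels, receiver in routes:
        if all(alert_labels.get(k) == v for k, v in required_labels.items()):
            return receiver
    return default
```

With these routes, an InstanceDown alert labeled severity: critical goes to team-pagerduty, while an alert with no severity label lands in team-email.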

Integration with Grafana

While Prometheus has a basic web interface, Grafana provides superior visualization capabilities for creating comprehensive dashboards.

Add Prometheus as Data Source

  1. Open Grafana and navigate to Configuration → Data Sources (Connections → Data sources in recent Grafana versions)
  2. Click “Add data source”
  3. Select “Prometheus”
  4. Set URL to http://localhost:9090 (or your Prometheus server)
  5. Click “Save & Test”

Popular Dashboard IDs

Import pre-built dashboards from grafana.com:

  • Node Exporter Full (ID: 1860): Comprehensive Linux metrics
  • Kubernetes Cluster Monitoring (ID: 7249): K8s overview
  • Docker Container Monitoring (ID: 193): Container metrics
  • Prometheus Stats (ID: 2): Prometheus internal metrics

Creating Custom Dashboards

Create panels using PromQL queries:

{
  "title": "CPU Usage",
  "targets": [{
    "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
  }]
}

Popular Exporters

Extend Prometheus monitoring with specialized exporters:

Blackbox Exporter

Probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP:

scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
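The relabel_configs section is the least obvious part of this job. The sketch below walks through the three steps for a single target to show what Prometheus ends up requesting. The function names are invented for this example, and the module and exporter address mirror the values in the configuration above.

```python
import urllib.parse


def apply_blackbox_relabeling(target, exporter_address="localhost:9115"):
    """Apply the three relabel steps to one static target."""
    labels = {"__address__": target}
    # Step 1: copy the probed URL into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the `instance` label on stored series.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the Blackbox Exporter.
    labels["__address__"] = exporter_address
    return labels


def scrape_url(labels, metrics_path="/probe", module="http_2xx"):
    """Build the URL Prometheus would request for the relabeled target."""
    query = urllib.parse.urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"


relabeled = apply_blackbox_relabeling("https://example.com")
url = scrape_url(relabeled)
```

The scrape goes to the exporter, the exporter probes the site named in the target parameter, and the resulting series still carry the site as their instance label.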

Database Exporters

  • mysqld_exporter: MySQL/MariaDB metrics
  • postgres_exporter: PostgreSQL metrics
  • mongodb_exporter: MongoDB metrics
  • redis_exporter: Redis metrics

Application Exporters

  • nginx_exporter: NGINX web server metrics
  • apache_exporter: Apache HTTP server metrics
  • haproxy_exporter: HAProxy load balancer metrics

Cloud Exporters

  • cloudwatch_exporter: AWS CloudWatch metrics
  • stackdriver_exporter: Google Cloud metrics
  • azure_exporter: Azure Monitor metrics

Best Practices

Data Retention

Configure appropriate retention based on your needs:

--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
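To choose these values, it helps to estimate disk usage first. The Prometheus documentation gives the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies it; the 100,000 active series figure is an assumed workload, not a recommendation.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 100,000 series scraped every 15s, kept for 30 days.
estimate = estimate_disk_bytes(active_series=100_000, scrape_interval_s=15, retention_days=30)
estimate_gb = estimate / 1e9  # about 34.6 GB at 2 bytes per sample
```

Under these assumptions the 50GB size limit leaves some headroom over 30 days of data; a larger series count would hit the size limit before the time limit.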

Recording Rules

Pre-calculate frequently queried expressions:

groups:
  - name: example_rules
    interval: 30s
    rules:
      - record: job:node_cpu_utilization:avg
        expr: 100 - (avg by (job) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Label Management

  • Keep label cardinality low
  • Use consistent naming conventions
  • Avoid high-cardinality labels (user IDs, timestamps)
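The reason high-cardinality labels hurt is multiplicative: the worst-case number of series for a metric is the product of the distinct values of each label. A quick illustration with made-up label counts (max_series is a hypothetical helper):

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Bounded labels: 5 * 10 * 20 = 1,000 series at most.
bounded = max_series({"method": 5, "status": 10, "endpoint": 20})

# Adding a user_id label with 50,000 distinct values: up to 50,000,000 series.
unbounded = max_series({"method": 5, "status": 10, "endpoint": 20, "user_id": 50_000})
```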

Security

  • Enable authentication and HTTPS
  • Restrict access to Prometheus API
  • Use network policies in Kubernetes
  • Implement RBAC for sensitive metrics

High Availability

  • Run multiple Prometheus instances
  • Use Thanos or Cortex for long-term storage
  • Implement federation for hierarchical setups

Troubleshooting Common Issues

High Memory Usage

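  • Reduce scrape frequency
  • Drop unused metrics and high-cardinality labels, since memory grows with the number of active series
  • Decrease retention period (this mainly frees disk space)
  • Optimize PromQL queries
  • Implement recording rules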

Missing Metrics

  • Check target status in /targets
  • Verify network connectivity
  • Validate scrape configuration
  • Check exporter logs

Slow Queries

  • Use recording rules for complex aggregations
  • Optimize label filters
  • Reduce time range
  • Add indices if using remote storage

Performance Optimization

Query Optimization

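# Coarse: collapses every label into a single series, so you cannot tell which requests changed
sum(rate(http_requests_total[5m]))

# Better: aggregate away high-cardinality labels but keep the ones you query on
sum by (status, method) (rate(http_requests_total[5m]))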

Resource Limits

For Kubernetes deployments:

resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"

Conclusion

Prometheus provides a robust, scalable monitoring solution for modern infrastructure. Its pull-based architecture, powerful query language, and extensive ecosystem of exporters make it ideal for monitoring everything from bare-metal servers to complex Kubernetes clusters.

By combining Prometheus with Grafana for visualization and AlertManager for notifications, you create a comprehensive observability platform capable of handling enterprise-scale monitoring requirements. The active community and CNCF backing ensure continued development and support.

Start with basic metrics collection, gradually add exporters for your specific services, and refine your alerting rules based on real-world experience. Prometheus scales with your infrastructure, from single-server deployments to multi-datacenter monitoring architectures.
