# Monitoring

This guide covers monitoring strategies for your Palpo server to ensure reliability and performance.

## Key Metrics to Monitor

### Server Health

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| CPU Usage | Server CPU utilization | >80% sustained |
| Memory Usage | RAM utilization | >85% |
| Disk Space | Storage utilization | >80% |
| Disk I/O | Read/write operations | High latency |
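These host-level thresholds can be spot-checked without a full monitoring stack. A minimal sketch for a Linux host (the filesystem path and thresholds are illustrative; adjust them to your environment):

```bash
#!/usr/bin/env bash
# Spot-check the thresholds from the table above on a Linux host.

# Disk: alert above 80% used on the root filesystem
disk_used=$(df --output=pcent / | tail -1 | tr -dc '0-9')
[ "$disk_used" -gt 80 ] && echo "ALERT: disk usage at ${disk_used}%"

# Memory: alert above 85% RAM used
mem_used=$(free | awk '/^Mem:/ {printf "%d", $3 / $2 * 100}')
[ "$mem_used" -gt 85 ] && echo "ALERT: memory usage at ${mem_used}%"

# CPU: 1-minute load average normalized by core count, alert above 0.8
cores=$(nproc)
load=$(awk '{print $1}' /proc/loadavg)
awk -v l="$load" -v c="$cores" 'BEGIN { if (l / c > 0.8) print "ALERT: load", l, "on", c, "cores" }'
```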

### Application Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| HTTP Response Time | API latency | >500ms p95 |
| Error Rate | 5xx responses | >1% |
| Active Connections | Concurrent users | Near max limit |
| Federation Queue | Pending federation events | Growing continuously |

### Database Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| Query Time | Database query latency | >100ms average |
| Connection Pool | Active connections | >80% of pool size |
| Replication Lag | Delay behind primary (when using replicas) | >10 seconds |
| Table Bloat | Dead tuples | >20% of table size |
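If Palpo runs against PostgreSQL, most of these metrics can be read from the statistics views. A hedged sketch (the `palpo` database name and connection flags are assumptions; adjust to your setup):

```bash
# Dead-tuple ratio per table; flag anything over the 20% threshold above
psql -d palpo -c "
  SELECT relname, n_dead_tup, n_live_tup,
         round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
  FROM pg_stat_user_tables
  ORDER BY dead_pct DESC NULLS LAST
  LIMIT 10;"

# Current connections vs. the configured maximum
psql -d palpo -c "
  SELECT count(*) AS connections,
         current_setting('max_connections') AS max_connections
  FROM pg_stat_activity;"
```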

## Logging Configuration

### Basic Logging

Configure logging in your `palpo.toml`:

```toml
[logger]
# Log level: trace, debug, info, warn, error
level = "info"
# Output format: json, pretty
format = "json"
# Enable ANSI colors (for terminal output)
color = false
```

### Structured Logging

For production, use JSON format for easier parsing:

```toml
[logger]
format = "json"
level = "info"
```

Example log output:

{"timestamp":"2024-01-15T10:30:00Z","level":"info","target":"palpo","message":"Request processed","method":"GET","path":"/_matrix/client/v3/sync","status":200,"duration_ms":45}

### Log Rotation

Use `logrotate` for managing log files:

```
# /etc/logrotate.d/palpo
/var/log/palpo/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    create 0640 palpo palpo
    postrotate
        systemctl reload palpo > /dev/null 2>&1 || true
    endscript
}
```
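You can dry-run the configuration to confirm logrotate parses it and would rotate the expected files:

```bash
# -d (debug) prints what would happen without rotating anything
logrotate -d /etc/logrotate.d/palpo
```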

## Health Checks

### HTTP Health Endpoint

Check server health:

```bash
curl -f http://localhost:8008/_matrix/client/versions
```
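For cron jobs or deployment scripts, the same check can be wrapped with a timeout and an explicit exit code. A minimal sketch:

```bash
#!/usr/bin/env bash
# Exit 0 if the server answers within 5 seconds, 1 otherwise.
if curl -sf --max-time 5 http://localhost:8008/_matrix/client/versions > /dev/null; then
  echo "palpo: healthy"
else
  echo "palpo: UNHEALTHY" >&2
  exit 1
fi
```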

### Systemd Health Check

```ini
# /etc/systemd/system/palpo.service
[Service]
ExecStart=/usr/local/bin/palpo --config /etc/palpo/palpo.toml
ExecReload=/bin/kill -HUP $MAINPID
Type=simple
Restart=on-failure
RestartSec=5

# Watchdog: only effective if Palpo sends sd_notify keep-alive pings
WatchdogSec=30s
```
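Note that `WatchdogSec` only takes effect if the service sends periodic keep-alives over the sd_notify protocol; if Palpo does not support the systemd watchdog, rely on `Restart=on-failure` alone. After editing the unit, reload and verify:

```bash
systemctl daemon-reload
systemctl restart palpo
systemctl show palpo -p Restart -p WatchdogUSec
```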

### Docker Health Check

```yaml
# docker-compose.yml
services:
  palpo:
    image: palpo/palpo:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8008/_matrix/client/versions"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
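Once the container is up, the probe results are visible via `docker inspect` (assuming the compose service is named `palpo` as above):

```bash
# Current health status: "starting", "healthy", or "unhealthy"
docker inspect --format '{{.State.Health.Status}}' "$(docker compose ps -q palpo)"

# Full probe history, including output from failed checks
docker inspect --format '{{json .State.Health}}' "$(docker compose ps -q palpo)" | jq .
```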

## Prometheus Metrics

### Exposing Metrics

Palpo can expose Prometheus-compatible metrics. Enable them in the configuration:

```toml
[metrics]
enable = true
port = 9090
```
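After restarting Palpo, confirm the endpoint responds. The `/metrics` path follows the usual Prometheus convention and is an assumption here; check your Palpo version's documentation if it differs:

```bash
curl -s http://localhost:9090/metrics | grep '^palpo_' | head
```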

### Common Metrics

```
# HTTP request duration
palpo_http_request_duration_seconds{method="GET",path="/sync"}

# Active connections
palpo_active_connections

# Federation queue size
palpo_federation_queue_size

# Database query duration
palpo_db_query_duration_seconds{query_type="select"}
```

### Prometheus Configuration

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'palpo'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
```
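Validate the file before reloading Prometheus:

```bash
promtool check config prometheus.yml
```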

## Grafana Dashboard

### Key Panels

1. **Request Rate** - Requests per second
2. **Response Time** - p50, p95, p99 latencies
3. **Error Rate** - 4xx and 5xx responses
4. **Active Users** - Concurrent connected users
5. **Federation Health** - Queue size and delivery rate
6. **Resource Usage** - CPU, memory, disk

### Sample Dashboard JSON

```json
{
  "title": "Palpo Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(palpo_http_requests_total[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Response Time (p95)",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, rate(palpo_http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p95"
        }
      ]
    }
  ]
}
```
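A dashboard like this can also be provisioned through Grafana's HTTP API instead of the UI. A sketch, assuming Grafana at `localhost:3000` and a service-account token in `$GRAFANA_TOKEN`:

```bash
# Wrap the dashboard JSON in the envelope the import API expects
jq '{dashboard: ., overwrite: true}' dashboard.json |
  curl -s -X POST http://localhost:3000/api/dashboards/db \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    -d @-
```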

## Alerting

### Alert Examples

**High Error Rate:**

```yaml
- alert: HighErrorRate
  expr: rate(palpo_http_requests_total{status=~"5.."}[5m]) / rate(palpo_http_requests_total[5m]) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate on Palpo"
    description: "Error rate is {{ $value | humanizePercentage }}"
```

**Slow Response Time:**

```yaml
- alert: SlowResponseTime
  expr: histogram_quantile(0.95, rate(palpo_http_request_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Slow response time on Palpo"
    description: "p95 latency is {{ $value }}s"
```

**Federation Queue Growing:**

```yaml
- alert: FederationQueueGrowing
  expr: increase(palpo_federation_queue_size[1h]) > 1000
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Federation queue is growing"
```

**Low Disk Space:**

```yaml
- alert: LowDiskSpace
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Low disk space"
    description: "Only {{ $value | humanizePercentage }} disk space remaining"
```

## External Monitoring

### Uptime Monitoring Services

Monitor your server's availability from outside your own network:

- UptimeRobot
- Pingdom
- StatusCake
- Better Uptime

Endpoint to monitor:

```
https://your-server.com/_matrix/client/versions
```

### Federation Tester

Test federation connectivity:

```
https://federationtester.matrix.org/api/report?server_name=your-server.com
```
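The report can also be polled from scripts; the top-level `FederationOK` field summarizes the result (replace the server name with your own):

```bash
curl -s 'https://federationtester.matrix.org/api/report?server_name=your-server.com' \
  | jq '.FederationOK'
```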

## Troubleshooting with Monitoring

### High CPU Usage

1. Check for slow database queries (see the sketch below)
2. Review active requests
3. Look for federation issues
4. Check for runaway processes
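For the first step, assuming a PostgreSQL backend, the longest-running statements are visible in `pg_stat_activity` (database name and connection flags are assumptions):

```bash
psql -d palpo -c "
  SELECT pid,
         now() - query_start AS runtime,
         state,
         left(query, 60) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY runtime DESC
  LIMIT 10;"
```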

### Memory Leaks

1. Monitor memory usage over time
2. Check for growing connection pools
3. Review long-running operations
4. Consider a scheduled restart if needed

### Slow Responses

1. Check database query times
2. Review disk I/O
3. Check network latency
4. Look for lock contention

### Federation Issues

1. Monitor federation queue size
2. Check destination server health
3. Review error logs for specific failures
4. Reset problematic connections via Admin API

## Best Practices

1. **Set up alerts before problems occur** - Don't wait for users to report issues
2. **Monitor trends, not just thresholds** - A gradual increase may indicate a developing problem
3. **Keep historical data** - Useful for capacity planning and debugging
4. **Document your monitoring setup** - So others can understand and maintain it
5. **Test your alerts** - Ensure they fire when expected
6. **Have runbooks** - Document response procedures for common alerts