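Monitoring

This page describes how to monitor USSO in production: the health endpoint, Prometheus metrics, logging, alerting, and dashboards.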

Health Endpoint

curl http://localhost:8000/health

Response:

{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "version": "0.1.0"
}
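As a minimal sketch of how a probe might consume this response, the Python below fetches the endpoint and lists any field that differs from the healthy sample shown above. The function names are hypothetical, and the values USSO reports for an unhealthy component are not documented here, so the sketch simply treats anything other than the sample values as a problem.

```python
import json
import urllib.request

# Field names and healthy values copied from the sample response above.
EXPECTED = {"status": "healthy", "database": "connected", "redis": "connected"}


def unhealthy_components(payload: dict) -> list[str]:
    """Return the fields whose value differs from the healthy sample."""
    return [key for key, want in EXPECTED.items() if payload.get(key) != want]


def fetch_health(url: str = "http://localhost:8000/health", timeout: float = 3.0) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A probe could call `unhealthy_components(fetch_health())` and alert when the returned list is not empty. If the endpoint answers with a non-2xx status when unhealthy, `urlopen` raises `urllib.error.HTTPError`, which the caller would also need to treat as a failure.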

Metrics

USSO exposes Prometheus metrics at /metrics:

# Authentication metrics
usso_login_attempts_total
usso_login_success_total
usso_login_failure_total

# Token metrics
usso_token_issued_total
usso_token_verified_total
usso_token_expired_total

# API metrics
usso_http_requests_total
usso_http_request_duration_seconds
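For a quick check without a Prometheus server, the text returned by /metrics can be parsed directly. The sketch below is an illustration only: it assumes the standard Prometheus text exposition format, sums samples per metric name regardless of labels (whether USSO attaches labels to these counters is not stated here), and uses hypothetical helper names.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$")


def sum_by_metric(exposition_text: str) -> dict[str, float]:
    """Sum sample values per metric name, skipping blank and comment lines."""
    totals: dict[str, float] = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(3))
    return dict(totals)


def login_failure_ratio(totals: dict[str, float]) -> float:
    """Failed logins divided by login attempts; 0.0 when there were no attempts."""
    attempts = totals.get("usso_login_attempts_total", 0.0)
    failures = totals.get("usso_login_failure_total", 0.0)
    return failures / attempts if attempts else 0.0
```

Because these are cumulative counters, the ratio covers the whole lifetime of the process. For a recent window, a PromQL `rate()` over the same counters is the better tool.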

Logging

Log Levels

ERROR - Errors requiring attention
WARNING - Important events
INFO - General information
DEBUG - Detailed debugging

Log Format

{
  "timestamp": "2025-10-04T10:00:00Z",
  "level": "INFO",
  "message": "User login successful",
  "user_id": "user:abc123",
  "ip": "192.168.1.1",
  "user_agent": "Mozilla/5.0..."
}
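To illustrate how entries of this shape can be produced, here is a sketch of a formatter built on Python's standard `logging` module. It is not USSO's own implementation: the class name, the logger name, and the choice to pass `user_id`, `ip`, and `user_agent` through the `extra=` argument are all assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone

# Context fields from the sample entry; callers supply them via `extra=`.
CONTEXT_FIELDS = ("user_id", "ip", "user_agent")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy only the context fields that were actually attached to the record.
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(name: str = "usso.example") -> logging.Logger:
    """Return a logger (hypothetical name) that writes JSON lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger().info("User login successful", extra={"user_id": "user:abc123", "ip": "192.168.1.1"})` would then emit one JSON line containing those fields.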

Centralized Logging

ELK Stack:

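# Docker Compose: write container logs as JSON files that a log shipper
# (for example Filebeat) can forward to the ELK stack.
services:
  app:
    labels:
      service: "usso"
    logging:
      driver: "json-file"
      options:
        # Label keys to include with each log entry
        labels: "service"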

CloudWatch:

import logging
import watchtower

# Forward records from the root logger to CloudWatch Logs
logging.getLogger().addHandler(watchtower.CloudWatchLogHandler())

Alerting

Key Alerts

High Error Rate
  Threshold: > 5% of requests return errors
  Action: Investigate logs
Slow Responses
  Threshold: p95 response time > 1s
  Action: Check database
Failed Logins
  Threshold: > 100 failed logins/min
  Action: Investigate a possible attack
Database Issues
  Threshold: Health check fails
  Action: Check connectivity
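The thresholds above can be expressed as a small evaluation function, for example inside a custom watchdog script. This is only a sketch: the `Snapshot` container and its field names are hypothetical, and how the measurements are collected is left out.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for one evaluation window of measurements."""
    error_ratio: float            # fraction of requests that failed, 0.0 to 1.0
    p95_seconds: float            # 95th percentile response time in seconds
    failed_logins_per_min: float  # failed login attempts per minute
    health_check_ok: bool         # result of the latest health check


def triggered_alerts(s: Snapshot) -> list[str]:
    """Return the name of every alert whose threshold is exceeded."""
    alerts = []
    if s.error_ratio > 0.05:
        alerts.append("High Error Rate")
    if s.p95_seconds > 1.0:
        alerts.append("Slow Responses")
    if s.failed_logins_per_min > 100:
        alerts.append("Failed Logins")
    if not s.health_check_ok:
        alerts.append("Database Issues")
    return alerts
```

All comparisons are strict, matching the ">" in the thresholds, so a value sitting exactly on a threshold does not fire.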

Alert Configuration

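# Prometheus alerting rule (evaluated by Prometheus; Alertmanager routes the resulting alerts)
groups:
- name: usso
  rules:
  - alert: HighErrorRate
    # Fraction of requests that returned a 5xx status over the last 5 minutes
    expr: |
      sum(rate(usso_http_requests_total{status=~"5.."}[5m]))
        / sum(rate(usso_http_requests_total[5m])) > 0.05
    # Assumed hold time before firing; tune as needed
    for: 5m
    annotations:
      summary: "High error rate detected"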

Dashboards

Grafana Dashboard

Key panels:

- Request rate
- Error rate
- Response time (p50, p95, p99)
- Active sessions
- Database connections

Example Queries

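# Request rate (requests per second)
sum(rate(usso_http_requests_total[5m]))

# Error rate (fraction of requests that returned 5xx)
sum(rate(usso_http_requests_total{status=~"5.."}[5m]))
  / sum(rate(usso_http_requests_total[5m]))

# Response time p95 (seconds), aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(usso_http_request_duration_seconds_bucket[5m])))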
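Outside Grafana, the same queries can be run against the Prometheus HTTP API. The sketch below uses the instant-query endpoint `/api/v1/query`. The server address `http://localhost:9090` and the helper names are assumptions for illustration, not part of USSO.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_query_url(promql: str, base_url: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def first_value(response: dict) -> Optional[float]:
    """Extract the first sample value from an instant-query response, if any."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error', 'unknown error')}")
    result = response["data"]["result"]
    if not result:
        return None
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return float(result[0]["value"][1])


def run_query(promql: str, base_url: str = "http://localhost:9090") -> Optional[float]:
    """Run one instant query and return its first value, or None if empty."""
    with urllib.request.urlopen(build_query_url(promql, base_url), timeout=5) as resp:
        return first_value(json.loads(resp.read().decode("utf-8")))
```

For example, `run_query("sum(rate(usso_http_requests_total[5m]))")` would return the current request rate, or `None` if Prometheus has no matching series.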