Metric Management System (Grafana/Prometheus/Loki Stack)
The Metric Management System is the subsystem that collects, stores, processes, and monitors technical Key Performance Indicators (KPIs) and logs in real time to ensure the health of the FOSPS platform.
Purpose
The system provides:
- System Monitoring: Real-time platform health
- Performance Tracking: Response times, throughput
- Resource Usage: CPU, memory, storage
- Log Aggregation: Centralized log collection
- Alerting: Proactive issue notification
- Visualization: Customizable dashboards
Technology Stack
Prometheus
Metrics collection and storage:
- Time-series database
- Pull-based metrics scraping
- PromQL query language
- Alert rule engine
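Pull-based scraping means each service exposes its current metric values over HTTP and Prometheus fetches them on a schedule. The sketch below renders a counter in the Prometheus text exposition format using only the Python standard library. The metric name `focusing_requests_total` and the `status` label are hypothetical and are not taken from the FOSPS code base; a real service would more likely use an official client library than hand-rendered text.

```python
def escape_label_value(value: str) -> str:
    # The text format requires backslash, double quote, and newline
    # to be escaped inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render one counter metric family as exposition-format text.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric name and label, for illustration only.
    print(render_counter(
        "focusing_requests_total",
        "Total focusing requests.",
        [({"status": "ok"}, 42), ({"status": "error"}, 3)],
    ))
```

Prometheus would scrape text like this from an HTTP endpoint (conventionally `/metrics`) and store each sample in its time-series database.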
Grafana
Visualization and dashboards:
- Customizable dashboards
- Real-time graphs
- Multi-datasource support
- Alert visualization
Loki
Log aggregation:
- Distributed log collection
- LogQL query language
- Integration with Grafana
- Cost-effective storage
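Loki indexes only the labels attached to a log stream, not the log text itself, which is a large part of why its storage is inexpensive. The sketch below builds the JSON body accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`). The label names and values are hypothetical, and the source does not say whether FOSPS pushes logs directly or relies on a collection agent, so this is an illustration of the data shape rather than of the actual pipeline.

```python
import json
import time


def build_push_payload(labels: dict, lines: list, now_ns=None) -> dict:
    """Build a Loki push payload containing a single stream.

    Each log entry is a [timestamp, line] pair, where the timestamp is a
    string holding nanoseconds since the Unix epoch.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Offset each line by one nanosecond so entries keep their order.
    values = [[str(now_ns + offset), line] for offset, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    payload = build_push_payload(
        {"namespace": "fosps", "app": "focusing-manager"},
        ["focusing started", "focusing finished"],
    )
    print(json.dumps(payload, indent=2))
```

A stream pushed with these labels could then be queried in Grafana with a LogQL selector such as `{namespace="fosps", app="focusing-manager"} |= "finished"`.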
Monitored Metrics
Platform Metrics
- Service availability/uptime
- Request rates and latency
- Error rates and types
- Resource utilization
Component-Specific Metrics
Focusing Manager
- Focusing requests per minute
- Average focusing duration
- Preprocessor execution times
- Lens selection patterns
FHIR Server
- Query performance
- Resource creation/update rates
- Storage usage
- Connection pool status
Connectors
- External source response times
- Retrieval success/failure rates
- Cache hit ratios
LEE
- Lens execution duration
- Memory usage per execution
- Concurrent execution count
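Several of the component metrics above are derived values (a per-minute rate, an average duration, a hit ratio) that would normally be computed with PromQL from raw counters and histograms. The helpers below assemble such queries as strings. All metric names passed to them are hypothetical; the source does not list the actual metric names exported by the FOSPS components.

```python
def per_minute_rate(counter: str, window: str = "5m") -> str:
    # rate() yields a per-second rate; multiplying by 60 gives per minute.
    return f"sum(rate({counter}[{window}])) * 60"


def average_duration(histogram_base: str, window: str = "5m") -> str:
    # A Prometheus histogram exposes <base>_sum and <base>_count series;
    # dividing their rates gives the mean observation over the window.
    return (
        f"sum(rate({histogram_base}_sum[{window}]))"
        f" / sum(rate({histogram_base}_count[{window}]))"
    )


def hit_ratio(hits: str, misses: str, window: str = "5m") -> str:
    # Ratio of hits to all lookups over the window.
    hit_rate = f"sum(rate({hits}[{window}]))"
    miss_rate = f"sum(rate({misses}[{window}]))"
    return f"{hit_rate} / ({hit_rate} + {miss_rate})"


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    print(per_minute_rate("focusing_requests_total"))
    print(average_duration("lens_execution_duration_seconds"))
    print(hit_ratio("connector_cache_hits_total", "connector_cache_misses_total"))
```

The resulting strings can be pasted into a Grafana panel or sent to the Prometheus HTTP query API.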
Log Aggregation
Collects logs from:
- All FOSPS architectural layers
- Kubernetes pods
- Istio service mesh
- Audit Log (read for monitoring purposes only; the Audit Log is not stored here)
Dashboards
Preconfigured dashboards are available for:
- Platform Overview: System-wide health
- Service Health: Per-component metrics
- User Experience: Response times, error rates
- Security: Failed authentication, suspicious activity
- Capacity Planning: Resource trends
Alerting
Alerts are configured for:
- Service downtime
- High error rates
- Performance degradation
- Resource exhaustion
- Security events
Notifications are sent via:
- Slack/Teams
- PagerDuty
- Webhooks
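As an illustration of the webhook channel, the sketch below turns an alert notification into one-line summaries that a receiving service could log or forward. It assumes notifications are routed through Prometheus Alertmanager, which the source does not state, and it reads only a few fields of Alertmanager's webhook JSON (`alerts`, `status`, `labels`, `annotations`). The alert name and annotation text in the sample are hypothetical.

```python
def summarize_alerts(payload: dict) -> list:
    """Return one human-readable line per alert in a webhook payload."""
    summaries = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        summaries.append(
            "[{status}] {name} severity={severity}: {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                severity=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    return summaries


if __name__ == "__main__":
    # Hypothetical payload, trimmed to the fields used above.
    sample = {
        "version": "4",
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "HighErrorRate", "severity": "critical"},
                "annotations": {"summary": "Error rate above threshold"},
            }
        ],
    }
    for line in summarize_alerts(sample):
        print(line)
```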
Access Control
Dashboard access is restricted through Keycloak:
- Administrators: Full access
- Developers: Read-only
- Service accounts: API access
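The role split above can be pictured as a mapping from Keycloak roles to Grafana's built-in organization roles (Viewer, Editor, Admin). The sketch below shows that mapping as plain Python. The Keycloak role names `fosps-admin` and `fosps-developer` are hypothetical, and in a real deployment the mapping would more likely live in Grafana's OAuth configuration than in application code.

```python
from typing import Optional

# Hypothetical Keycloak role names mapped to Grafana organization roles.
ROLE_MAP = {
    "fosps-admin": "Admin",       # Administrators: full access
    "fosps-developer": "Viewer",  # Developers: read-only
}

# Grafana roles ordered from most to least privileged.
PRIORITY = ["Admin", "Editor", "Viewer"]


def grafana_role(keycloak_roles: list) -> Optional[str]:
    """Return the most privileged Grafana role granted, or None."""
    granted = {ROLE_MAP[role] for role in keycloak_roles if role in ROLE_MAP}
    for role in PRIORITY:
        if role in granted:
            return role
    # No mapped role: the caller should deny dashboard access.
    return None


if __name__ == "__main__":
    print(grafana_role(["fosps-developer"]))
    print(grafana_role(["fosps-developer", "fosps-admin"]))
    print(grafana_role(["unrelated-role"]))
```

Returning `None` for unmapped users keeps the default closed, so a new Keycloak role grants no dashboard access until it is explicitly added to the map.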
Integration
- Kubernetes: Pod metrics
- Istio: Service mesh telemetry
- Audit Log: Security event correlation
- OpenAPI: API performance tracking
Related Concepts
- FOSPS - Monitored platform
- Architectural Layers - Monitored components
- Audit Log - Security logging
- Keycloak - Access control
- Kubernetes - Deployment platform