Metrics Reference
Headwind exposes Prometheus metrics on port 9090 at /metrics. This page lists each metric with example PromQL queries, alert rules, dashboard queries, and a scrape configuration.
Accessing Metrics
# Port forward metrics endpoint
kubectl port-forward -n headwind-system svc/headwind-metrics 9090:9090
# View metrics
curl http://localhost:9090/metrics
# Or open in a browser (macOS; use xdg-open on Linux)
open http://localhost:9090/metrics
Webhook Metrics
Track webhook event processing:
headwind_webhook_events_total
Type: Counter
Description: Total webhook events received from container registries
Labels:
registry - Registry type (dockerhub, harbor, gitlab, etc.)
Example:
# Rate of webhook events per minute
rate(headwind_webhook_events_total[5m]) * 60
# Total events by registry
sum by (registry) (headwind_webhook_events_total)
headwind_webhook_events_processed
Type: Counter
Description: Successfully processed webhook events
Example:
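# Processing success ratio (both sides aggregated so they match even though only the total carries the registry label)
sum(rate(headwind_webhook_events_processed[5m])) / sum(rate(headwind_webhook_events_total[5m]))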
Polling Metrics
Monitor registry polling operations:
headwind_polling_cycles_total
Type: Counter
Description: Total polling cycles completed
Example:
# Polling frequency
rate(headwind_polling_cycles_total[5m])
headwind_polling_errors_total
Type: Counter
Description: Polling errors encountered
Example:
# Error rate
rate(headwind_polling_errors_total[5m])
headwind_polling_images_checked_total
Type: Counter
Description: Container images checked during polling
Example:
# Images checked per polling cycle
rate(headwind_polling_images_checked_total[5m]) / rate(headwind_polling_cycles_total[5m])
headwind_polling_new_tags_found_total
Type: Counter
Description: New image tags discovered via polling
Example:
# Tag discovery rate
rate(headwind_polling_new_tags_found_total[1h])
headwind_polling_helm_charts_checked_total
Type: Counter
Description: Helm charts checked during polling
Example:
# Helm charts checked per polling cycle
rate(headwind_polling_helm_charts_checked_total[5m]) / rate(headwind_polling_cycles_total[5m])
headwind_polling_helm_new_versions_found_total
Type: Counter
Description: New Helm chart versions discovered via polling
Example:
# Helm version discovery rate
rate(headwind_polling_helm_new_versions_found_total[1h])
headwind_polling_resources_filtered_total
Type: Counter
Description: Resources filtered out from polling due to event-source annotation
Details: Incremented when resources have event-source: webhook or event-source: none set. These resources are skipped during polling cycles to reduce unnecessary registry API calls.
Example:
# Resources filtered from polling
headwind_polling_resources_filtered_total
# Filter rate per polling cycle
rate(headwind_polling_resources_filtered_total[5m]) / rate(headwind_polling_cycles_total[5m])
# Fraction of polling candidates skipped because of an event-source annotation (webhook or none)
headwind_polling_resources_filtered_total /
(headwind_polling_resources_filtered_total + headwind_polling_images_checked_total)
Use Cases:
- Monitor adoption of webhook vs polling event sources
- Track resource distribution across event source types
- Optimize polling efficiency
Update Metrics
Track update requests and their lifecycle:
headwind_updates_pending
Type: Gauge
Description: Number of UpdateRequests currently awaiting approval
Example:
# Current pending updates
headwind_updates_pending
# Alert on too many pending updates
headwind_updates_pending > 20
headwind_updates_approved_total
Type: Counter
Description: Total approved updates
Example:
# Approval rate
rate(headwind_updates_approved_total[1h])
headwind_updates_rejected_total
Type: Counter
Description: Total rejected updates
Example:
# Rejection rate
rate(headwind_updates_rejected_total[1h])
# Approval vs rejection ratio
headwind_updates_approved_total / (headwind_updates_approved_total + headwind_updates_rejected_total)
headwind_updates_applied_total
Type: Counter
Description: Successfully applied updates
Labels:
kind - Workload kind (Deployment, StatefulSet, DaemonSet, HelmRelease)
Example:
# Rate of applied updates
rate(headwind_updates_applied_total[1h])
# Updates by workload type
sum by (kind) (headwind_updates_applied_total)
headwind_updates_failed_total
Type: Counter
Description: Failed update attempts
Example:
# Failure rate
rate(headwind_updates_failed_total[5m])
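# Update success ratio (aggregated so the kind label on applied_total does not prevent matching with failed_total)
sum(rate(headwind_updates_applied_total[5m])) / (sum(rate(headwind_updates_applied_total[5m])) + sum(rate(headwind_updates_failed_total[5m])))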
headwind_updates_skipped_interval_total
Type: Counter
Description: Updates skipped due to minimum update interval not elapsed
Example:
# Rate of skipped updates
rate(headwind_updates_skipped_interval_total[1h])
Controller Metrics
Monitor Kubernetes controllers:
headwind_reconcile_duration_seconds
Type: Histogram
Description: Time spent in reconciliation loops
Buckets: 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0
Example:
# 95th percentile reconciliation time
histogram_quantile(0.95, rate(headwind_reconcile_duration_seconds_bucket[5m]))
# Average reconciliation duration
rate(headwind_reconcile_duration_seconds_sum[5m]) / rate(headwind_reconcile_duration_seconds_count[5m])
headwind_reconcile_errors_total
Type: Counter
Description: Reconciliation errors
Example:
# Error rate
rate(headwind_reconcile_errors_total[5m])
Workload Watching Metrics
Track resources being monitored:
headwind_deployments_watched
Type: Gauge
Description: Number of Deployments being monitored by Headwind
Example:
headwind_deployments_watched
headwind_statefulsets_watched
Type: Gauge
Description: Number of StatefulSets being monitored
Example:
headwind_statefulsets_watched
headwind_daemonsets_watched
Type: Gauge
Description: Number of DaemonSets being monitored
Example:
headwind_daemonsets_watched
headwind_helm_releases_watched
Type: Gauge
Description: Number of HelmReleases being monitored
Example:
headwind_helm_releases_watched
# Total workloads watched
headwind_deployments_watched + headwind_statefulsets_watched + headwind_daemonsets_watched + headwind_helm_releases_watched
Helm Metrics
Track Helm chart version discovery and updates:
headwind_helm_chart_versions_checked_total
Type: Counter
Description: Helm chart version checks performed
Example:
rate(headwind_helm_chart_versions_checked_total[5m])
headwind_helm_updates_found_total
Type: Counter
Description: Helm chart updates discovered
Example:
rate(headwind_helm_updates_found_total[1h])
headwind_helm_updates_approved_total
Type: Counter
Description: Helm chart updates approved by policy
Example:
# Approval rate
headwind_helm_updates_approved_total / headwind_helm_updates_found_total
headwind_helm_updates_rejected_total
Type: Counter
Description: Helm chart updates rejected by policy
Example:
# Rejection rate
headwind_helm_updates_rejected_total / headwind_helm_updates_found_total
headwind_helm_updates_applied_total
Type: Counter
Description: Helm chart updates successfully applied
Example:
rate(headwind_helm_updates_applied_total[1h])
headwind_helm_repository_queries_total
Type: Counter
Description: Helm repository queries performed
Example:
rate(headwind_helm_repository_queries_total[5m])
headwind_helm_repository_errors_total
Type: Counter
Description: Helm repository query errors
Example:
# Error rate
rate(headwind_helm_repository_errors_total[5m]) / rate(headwind_helm_repository_queries_total[5m])
headwind_helm_repository_query_duration_seconds
Type: Histogram
Description: Helm repository query duration
Example:
# 95th percentile query time
histogram_quantile(0.95, rate(headwind_helm_repository_query_duration_seconds_bucket[5m]))
Rollback Metrics
Monitor rollback operations:
headwind_rollbacks_total
Type: Counter
Description: Total rollback operations (manual + automatic)
Example:
rate(headwind_rollbacks_total[1h])
headwind_rollbacks_manual_total
Type: Counter
Description: Manual rollback operations
Example:
rate(headwind_rollbacks_manual_total[1h])
headwind_rollbacks_automatic_total
Type: Counter
Description: Automatic rollback operations triggered by health failures
Example:
rate(headwind_rollbacks_automatic_total[1h])
# Automatic rollback ratio
headwind_rollbacks_automatic_total / headwind_rollbacks_total
headwind_rollbacks_failed_total
Type: Counter
Description: Failed rollback operations
Example:
# Rollback success rate
(headwind_rollbacks_total - headwind_rollbacks_failed_total) / headwind_rollbacks_total
headwind_deployment_health_checks_total
Type: Counter
Description: Deployment health checks performed after updates
Example:
rate(headwind_deployment_health_checks_total[5m])
headwind_deployment_health_failures_total
Type: Counter
Description: Deployment health check failures detected
Example:
# Health failure rate
rate(headwind_deployment_health_failures_total[5m]) / rate(headwind_deployment_health_checks_total[5m])
Notification Metrics
Track notification delivery:
headwind_notifications_sent_total
Type: Counter
Description: Total notifications sent successfully
Example:
rate(headwind_notifications_sent_total[5m])
headwind_notifications_failed_total
Type: Counter
Description: Total notification failures
Example:
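# Failure ratio (sent_total counts only successful deliveries, so failures are added to the denominator)
rate(headwind_notifications_failed_total[5m]) / (rate(headwind_notifications_sent_total[5m]) + rate(headwind_notifications_failed_total[5m]))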
headwind_notifications_slack_sent_total
Type: Counter
Description: Notifications sent to Slack
Example:
rate(headwind_notifications_slack_sent_total[5m])
headwind_notifications_teams_sent_total
Type: Counter
Description: Notifications sent to Microsoft Teams
Example:
rate(headwind_notifications_teams_sent_total[5m])
headwind_notifications_webhook_sent_total
Type: Counter
Description: Notifications sent via generic webhooks
Example:
rate(headwind_notifications_webhook_sent_total[5m])
Prometheus Alerts
Example alert rules for Headwind:
groups:
  - name: headwind
    rules:
      # Update alerts
      - alert: HeadwindStaleUpdateRequests
        expr: headwind_updates_pending > 10
        for: 1h
        annotations:
          summary: "Many pending UpdateRequests"
          description: "{{ $value }} UpdateRequests pending for over 1 hour"
      - alert: HeadwindHighUpdateFailureRate
        expr: rate(headwind_updates_failed_total[5m]) > 0.1
        for: 5m
        annotations:
          summary: "High update failure rate"
          description: "Update failures detected"
      # Rollback alerts
      - alert: HeadwindAutomaticRollback
        expr: increase(headwind_rollbacks_automatic_total[5m]) > 0
        annotations:
          summary: "Automatic rollback triggered"
          description: "Headwind triggered an automatic rollback"
      # increase() counts rollbacks over the window; rate() would be per second
      - alert: HeadwindFrequentRollbacks
        expr: increase(headwind_rollbacks_total[1h]) > 3
        for: 5m
        annotations:
          summary: "Frequent rollbacks detected"
          description: "{{ $value }} rollbacks in the last hour"
      # Helm alerts
      - alert: HeadwindHelmRepositoryErrors
        expr: rate(headwind_helm_repository_errors_total[5m]) > 0
        for: 5m
        annotations:
          summary: "Helm repository query errors"
          description: "Errors querying Helm repositories"
      # Notification alerts
      - alert: HeadwindNotificationFailures
        expr: rate(headwind_notifications_failed_total[5m]) > 0
        for: 5m
        annotations:
          summary: "Notification failures detected"
          description: "Headwind notifications are failing"
      # Reconciliation alerts
      # histogram_quantile never returns more than the largest finite bucket (5s),
      # so the comparison uses >= rather than >
      - alert: HeadwindSlowReconciliation
        expr: histogram_quantile(0.95, rate(headwind_reconcile_duration_seconds_bucket[5m])) >= 5
        for: 10m
        annotations:
          summary: "Slow reconciliation loops"
          description: "95th percentile reconciliation time >= 5s"
      - alert: HeadwindReconciliationErrors
        expr: rate(headwind_reconcile_errors_total[5m]) > 0.1
        for: 5m
        annotations:
          summary: "Reconciliation errors"
          description: "Controller reconciliation errors detected"
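Alert thresholds and "for" windows are easy to get subtly wrong, so it may be worth unit-testing the rules with promtool. The sketch below assumes the rule group above has been saved as headwind-alerts.yaml (a hypothetical file name) and checks that HeadwindStaleUpdateRequests fires only after the gauge has stayed above 10 for a full hour. The input series values are invented for illustration.

```yaml
# headwind-alerts_test.yaml (hypothetical file name)
rule_files:
  - headwind-alerts.yaml   # assumed: the alert rules from this page

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 15 pending UpdateRequests, held constant for 2 hours (illustrative data)
      - series: 'headwind_updates_pending'
        values: '15+0x120'
    alert_rule_test:
      # 30 minutes in, the alert is still pending, so nothing is firing yet
      - eval_time: 30m
        alertname: HeadwindStaleUpdateRequests
        exp_alerts: []
      # After more than 1h above the threshold, the alert fires
      - eval_time: 90m
        alertname: HeadwindStaleUpdateRequests
        exp_alerts:
          - exp_annotations:
              summary: "Many pending UpdateRequests"
              description: "15 UpdateRequests pending for over 1 hour"
```

Run it with `promtool test rules headwind-alerts_test.yaml`. The same pattern can be repeated for the other alerts by swapping the input series and the expected annotations.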
Grafana Dashboard
Example PromQL queries for a Grafana dashboard:
Overview Panel
# Pending updates
headwind_updates_pending
# Watched resources
sum(headwind_deployments_watched + headwind_statefulsets_watched + headwind_daemonsets_watched + headwind_helm_releases_watched)
# Update success ratio (last hour), aggregated so the kind label on applied_total does not prevent matching
sum(rate(headwind_updates_applied_total[1h])) / (sum(rate(headwind_updates_applied_total[1h])) + sum(rate(headwind_updates_failed_total[1h])))
Update Activity Panel
# Updates approved (rate)
rate(headwind_updates_approved_total[5m])
# Updates applied by type
sum by (kind) (rate(headwind_updates_applied_total[5m]))
# Updates rejected (rate)
rate(headwind_updates_rejected_total[5m])
Rollback Panel
# Total rollbacks
rate(headwind_rollbacks_total[1h])
# Automatic vs Manual
rate(headwind_rollbacks_automatic_total[1h])
rate(headwind_rollbacks_manual_total[1h])
# Health check failure rate
rate(headwind_deployment_health_failures_total[5m]) / rate(headwind_deployment_health_checks_total[5m])
Performance Panel
# Reconciliation latency (p95)
histogram_quantile(0.95, rate(headwind_reconcile_duration_seconds_bucket[5m]))
# Helm repository query latency (p95)
histogram_quantile(0.95, rate(headwind_helm_repository_query_duration_seconds_bucket[5m]))
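When several panels reuse the same ratio or quantile, precomputing it with Prometheus recording rules can keep dashboards responsive. The following is a sketch only: the group name and the recorded series names are hypothetical (they follow the common level:metric:operations naming convention) and are not something Headwind ships.

```yaml
groups:
  - name: headwind-recording   # hypothetical group name
    interval: 1m
    rules:
      # Share of update attempts that succeeded over the last hour
      - record: headwind:updates:success_ratio_rate1h
        expr: |
          sum(rate(headwind_updates_applied_total[1h]))
          /
          (
            sum(rate(headwind_updates_applied_total[1h]))
            + sum(rate(headwind_updates_failed_total[1h]))
          )
      # p95 reconciliation latency, aggregated across all scraped instances
      - record: headwind:reconcile_duration_seconds:p95_rate5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(headwind_reconcile_duration_seconds_bucket[5m]))
          )
```

A panel can then query the recorded series by name instead of re-evaluating the full expression on every refresh.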
Scraping Configuration
Configure Prometheus to scrape Headwind metrics:
scrape_configs:
  - job_name: 'headwind'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - headwind-system
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to <pod IP>:<prometheus.io/port annotation>
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
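The relabel rules above key off pod annotations, so the Headwind pods need to carry them for the job to find any targets. A minimal sketch of the relevant pod template fragment follows; where exactly this lives (for example a Deployment manifest or Helm values) depends on how Headwind was installed and is an assumption here.

```yaml
# Fragment of the Headwind pod template (placement is an assumption)
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"   # matched by the keep rule
        prometheus.io/port: "9090"     # metrics port stated on this page
```

Annotation values must be strings, hence the quotes. If the pods are not annotated, the keep rule drops them and no Headwind series will appear in Prometheus.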
Next Steps
- Notifications - Configure notifications
- Rollback Configuration - Set up rollback
- API Reference - REST API documentation