Send Alerts to Slack/Discord When a Kubernetes Pod Restarts
Main Takeaway
Monitor Kubernetes pod restarts using Grafana and Prometheus, then enrich your Slack/Discord alerts with real-time Kubernetes events and recent pod logs by leveraging custom webhook payloads and integration with Loki. This comprehensive guide provides production-ready implementation steps.
Introduction
Effective Kubernetes incident response requires not just detecting pod restarts, but understanding why they occurred. Combining pod restart metrics with Kubernetes cluster events and application logs provides operators with complete context for rapid troubleshooting. This technical deep-dive covers:
- Monitoring pod restarts via Prometheus and kube-state-metrics
- Capturing Kubernetes events using event exporters
- Aggregating pod logs with Grafana Loki
- Crafting rich Slack/Discord webhook payloads that include events and logs
- Building a complete alerting workflow from detection to notification
Part 1: Foundational Monitoring Setup
Step 1: Deploy kube-state-metrics and Prometheus
Deploy kube-state-metrics to expose pod restart counts. If you run the kube-prometheus-stack Helm chart, Prometheus and kube-state-metrics are installed together; to install kube-state-metrics standalone, one option is the prometheus-community chart:
helm install kube-state-metrics kube-state-metrics \
  --repo https://prometheus-community.github.io/helm-charts \
  --namespace monitoring --create-namespace
Verify that Prometheus scrapes the metric:
kube_pod_container_status_restarts_total{namespace="default",pod="nginx-deployment-66b6c48dd5-abc123",container="nginx"}
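If you prefer to verify this programmatically, the minimal sketch below queries the Prometheus HTTP API for the restart metric. The Prometheus URL is an assumption; adjust it for your setup (for example, after kubectl port-forward svc/prometheus 9090).

# Minimal sketch: confirm kube_pod_container_status_restarts_total is being scraped
import requests

PROM_URL = "http://localhost:9090"  # assumption; adjust for your setup

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "kube_pod_container_status_restarts_total"},
    timeout=10,
)
resp.raise_for_status()
series = resp.json()["data"]["result"]
print(f"Found {len(series)} container series reporting restart counts")
for s in series[:5]:
    labels = s["metric"]
    print(labels.get("namespace"), labels.get("pod"), labels.get("container"), "->", s["value"][1])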
Part 2: Collecting Kubernetes Events
Kubernetes events provide critical context about pod state changes. They’re ephemeral (default TTL: 1 hour) and stored in etcd, so exporting them is essential for long-term analysis.
Understanding Kubernetes Events
Kubernetes events capture state transitions:
- Pod Created: When a pod is scheduled
- Pod Failed: When a container exits with non-zero status
- BackOff: When the kubelet backs off restarting a repeatedly crashing container
- Killing: When a pod is terminated
- Liveness Probe Failed: When health checks fail
Each event contains:
- Reason: Event type (e.g., BackOff, Failed)
- Message: Human-readable description
- Count: How many times the event occurred
- Source: Which component reported it (e.g., kubelet, kube-controller-manager)
- Timestamp: When the event occurred
Deploy Event Exporter
Use kubernetes-event-exporter (open-source) or kube-events to bridge Kubernetes events to Prometheus:
# Add Helm repository
helm repo add resmoio https://resmoio.github.io/helm-charts
helm repo update
# Install event exporter
helm install event-exporter resmoio/kubernetes-event-exporter \
  --namespace monitoring \
  --create-namespace \
  --set logLevel=info
Depending on the exporter you choose, event metrics similar to the following become available in Prometheus:
kube_event_count{
involved_object_kind="Pod",
involved_object_name="my-pod",
involved_object_namespace="default",
reason="Backoff",
type="Warning"
}
kube_event_unique_events_total{...}
Query Recent Events for a Pod
In Prometheus, query pod-specific events:
kube_event_count{involved_object_name=~"my-pod.*",involved_object_kind="Pod"}
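The same context can be pulled straight from the Kubernetes API. Here is a minimal sketch using the official Python client with a server-side field selector; the pod name and namespace are placeholders:

# Minimal sketch: list recent events for one pod via a server-side field selector
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

events = v1.list_namespaced_event(
    namespace="default",
    field_selector="involvedObject.kind=Pod,involvedObject.name=nginx-deployment-66b6c48dd5-abc123",
)
for e in events.items:
    print(e.last_timestamp or e.first_timestamp, e.reason, e.message)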
Part 3: Aggregating Pod Logs with Grafana Loki
Loki indexes logs by label, not by content, making it ideal for Kubernetes log aggregation.
Deploy Loki and Promtail (Log Collector)
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki Stack (includes Loki and Promtail)
helm install loki grafana/loki-stack \
--namespace loki \
--create-namespace \
--set loki.persistence.enabled=true \
--set promtail.enabled=true
Configure Loki Datasource in Grafana
- Navigate to Configuration → Data Sources
- Add Loki as a datasource
- Set the URL to http://loki:3100 (adjust for your setup; from another namespace use the full service DNS name, e.g. http://loki.loki.svc.cluster.local:3100)
Query Pod Logs in LogQL
Query recent logs for a restarted pod:
{namespace="default",pod="nginx-deployment-66b6c48dd5-abc123"} | json
Extract error-level logs:
{namespace="default",pod="nginx-deployment-66b6c48dd5-abc123"} | json level="error"
To cap the result (for example, at the last 100 lines), use the Line limit option in Grafana's query editor or the limit parameter of Loki's query API; LogQL itself has no limit stage.
Part 4: Creating Rich Alert Rules in Grafana
Create Alert Rule with Context
In Grafana Alerting, create a rule that triggers on pod restarts:
Rule Configuration:
- Query A (Prometheus): increase(kube_pod_container_status_restarts_total[5m]) > 0 (fires when any pod restarts within the last 5 minutes)
- Alert Condition: Status is firing
- Evaluation Interval: 1 minute
Annotations (Alert Metadata), using Grafana's alert templating to interpolate labels and query values:
summary: "Pod restarted"
description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $values.A }} times in the last 5 minutes."
pod_name: "{{ $labels.pod }}"
namespace: "{{ $labels.namespace }}"
restart_count: "{{ $values.A }}"
Labels (Routing):
severity: "critical"
component: "pod-restart-alert"
Part 5: Webhook Endpoint to Enrich Alerts with Events and Logs
Grafana sends webhook payloads to your custom endpoint. This endpoint fetches additional context (events and logs) before forwarding to Slack/Discord.
Python Webhook Receiver
Create a Flask application to handle Grafana webhooks:
from flask import Flask, request, jsonify
import os
from datetime import datetime

import requests
from kubernetes import client, config
app = Flask(__name__)
# Load Kubernetes config: in-cluster when running as a pod, kubeconfig for local development
try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()
v1 = client.CoreV1Api()
apps_v1 = client.AppsV1Api()
def get_pod_recent_events(namespace, pod_name):
"""Fetch recent Kubernetes events for a pod"""
try:
events = v1.list_namespaced_event(namespace=namespace)
pod_events = [
e for e in events.items
if e.involved_object.name == pod_name
and e.involved_object.kind == "Pod"
]
# Sort by timestamp (most recent first)
pod_events.sort(
key=lambda e: e.last_timestamp or e.first_timestamp,
reverse=True
)
return pod_events[:5] # Return last 5 events
except Exception as e:
print(f"Error fetching events: {e}")
return []
def get_pod_logs(namespace, pod_name, container_name=None, tail_lines=50):
    """Fetch recent pod logs via the Kubernetes API (no kubectl needed in the container image)"""
    try:
        return v1.read_namespaced_pod_log(
            name=pod_name,
            namespace=namespace,
            container=container_name,
            tail_lines=tail_lines,
        )
    except Exception as e:
        return f"Failed to retrieve logs: {e}"

def get_pod_previous_logs(namespace, pod_name, container_name=None, tail_lines=50):
    """Fetch logs from the previous container instance (for crashes)"""
    try:
        return v1.read_namespaced_pod_log(
            name=pod_name,
            namespace=namespace,
            container=container_name,
            tail_lines=tail_lines,
            previous=True,
        )
    except Exception:
        # The previous instance's logs may no longer be available
        return None
def format_events_for_slack(events):
"""Format Kubernetes events as Slack message blocks"""
blocks = []
for event in events:
blocks.append({
"type": "section",
"text": {
"type": "mrkdwn",
"text": f"*Reason:* `{event.reason}`\n*Message:* {event.message}\n*Count:* {event.count}\n*Time:* {event.last_timestamp or event.first_timestamp}"
}
})
return blocks
def send_to_slack(webhook_url, message_payload):
"""Send message to Slack webhook"""
try:
response = requests.post(webhook_url, json=message_payload, timeout=10)
return response.status_code == 200
except Exception as e:
print(f"Slack send error: {e}")
return False
def send_to_discord(webhook_url, message_payload):
"""Send message to Discord webhook"""
try:
response = requests.post(webhook_url, json=message_payload, timeout=10)
return response.status_code == 204
except Exception as e:
print(f"Discord send error: {e}")
return False
@app.route('/alert', methods=['POST'])
def handle_alert():
"""Handle Grafana webhook alert"""
try:
payload = request.json
# Extract pod information from alert
alerts = payload.get('alerts', [])
if not alerts:
return jsonify({"error": "No alerts found"}), 400
alert = alerts[0]
labels = alert.get('labels', {})
pod_name = labels.get('pod', 'unknown')
namespace = labels.get('namespace', 'default')
        # Grafana's webhook payload exposes query values as a refID -> number map
        restart_count = (alert.get('values') or {}).get('A', 'unknown')
# Fetch enrichment data
pod_events = get_pod_recent_events(namespace, pod_name)
current_logs = get_pod_logs(namespace, pod_name, tail_lines=30)
previous_logs = get_pod_previous_logs(namespace, pod_name, tail_lines=30)
# Determine target platform from query parameter
target = request.args.get('target', 'slack') # slack or discord
webhook_url = os.getenv(f'{target.upper()}_WEBHOOK_URL')
if not webhook_url:
return jsonify({"error": f"{target} webhook URL not configured"}), 500
# Build message payload
if target.lower() == 'slack':
message = build_slack_message(
pod_name, namespace, restart_count, pod_events, current_logs, previous_logs
)
success = send_to_slack(webhook_url, message)
else: # discord
message = build_discord_message(
pod_name, namespace, restart_count, pod_events, current_logs, previous_logs
)
success = send_to_discord(webhook_url, message)
return jsonify({"success": success}), 200 if success else 500
except Exception as e:
print(f"Error handling alert: {e}")
return jsonify({"error": str(e)}), 500
def build_slack_message(pod_name, namespace, restart_count, events, logs, previous_logs):
    """Build Slack message with events and logs"""
    # Truncate logs to stay well under Slack's per-block character limits
    if not logs:
        logs_text = "No logs available"
    elif len(logs) > 1000:
        logs_text = logs[:1000] + "\n... (truncated)"
    else:
        logs_text = logs

    previous_logs_text = None
    if previous_logs:
        previous_logs_text = previous_logs[:800]
        if len(previous_logs) > 800:
            previous_logs_text += "\n... (truncated)"

    # Format events as a bulleted list
    events_text = ""
    for event in events:
        events_text += f"• *{event.reason}*: {event.message} (Count: {event.count})\n"
    events_text = events_text or "No recent events"

    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"Pod Restart: {pod_name}",
                "emoji": True
            }
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Pod:*\n{pod_name}"},
                {"type": "mrkdwn", "text": f"*Namespace:*\n{namespace}"},
                {"type": "mrkdwn", "text": f"*Restarts (5m):*\n{restart_count}"},
                {"type": "mrkdwn", "text": f"*Time:*\n{datetime.utcnow().isoformat()}"}
            ]
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Recent Events:*\n{events_text}"}
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Current Logs (Last 30 lines):*\n```{logs_text}```"}
        }
    ]

    # Append the previous container's logs only when they exist (i.e. the container crashed before)
    if previous_logs_text:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Previous Logs (before restart):*\n```{previous_logs_text}```"}
        })

    return {
        "text": f"🚨 Pod Restart Alert: {pod_name}",
        "blocks": blocks
    }
def build_discord_message(pod_name, namespace, restart_count, events, logs, previous_logs):
"""Build Discord embed message with events and logs"""
# Format logs
logs_text = logs[:1000] if logs else "No logs available"
previous_logs_text = previous_logs[:800] if previous_logs else "No previous logs"
# Format events
events_text = ""
for event in events:
events_text += f"• **{event.reason}**: {event.message} (x{event.count})\n"
    events_text = (events_text or "No recent events")[:900]  # stay under Discord's 1024-char field limit
embed = {
"title": f"Pod Restart: {pod_name}",
"description": f"Pod restarted {restart_count} times in the last 5 minutes",
"color": 15158332, # Red
"fields": [
{
"name": "Pod",
"value": pod_name,
"inline": True
},
{
"name": "Namespace",
"value": namespace,
"inline": True
},
{
"name": "Restart Count (5m)",
"value": str(restart_count),
"inline": True
},
{
"name": "Recent Events",
"value": f"```{events_text}```",
"inline": False
},
{
"name": "Current Logs (Last 30 lines)",
"value": f"```{logs_text}```",
"inline": False
}
],
"timestamp": datetime.utcnow().isoformat()
}
if previous_logs:
embed["fields"].append({
"name": "Previous Logs (Before Restart)",
"value": f"```{previous_logs_text}```",
"inline": False
})
return {
"embeds": [embed]
}
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
Deploy Webhook Receiver as Kubernetes Service
Create a deployment manifest for the webhook receiver:
apiVersion: v1
kind: ServiceAccount
metadata:
name: alert-webhook
namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: alert-webhook
rules:
- apiGroups: [""]
resources: ["events"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: alert-webhook
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: alert-webhook
subjects:
- kind: ServiceAccount
name: alert-webhook
namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: alert-webhook
namespace: monitoring
spec:
replicas: 2
selector:
matchLabels:
app: alert-webhook
template:
metadata:
labels:
app: alert-webhook
spec:
serviceAccountName: alert-webhook
containers:
- name: webhook
image: python:3.11-slim
command: ["sh", "-c"]
args:
- |
pip install flask requests kubernetes &&
python /app/webhook.py
volumeMounts:
- name: webhook-code
mountPath: /app
ports:
- containerPort: 5000
env:
- name: SLACK_WEBHOOK_URL
valueFrom:
secretKeyRef:
name: alert-webhooks
key: slack-url
- name: DISCORD_WEBHOOK_URL
valueFrom:
secretKeyRef:
name: alert-webhooks
key: discord-url
volumes:
- name: webhook-code
configMap:
name: alert-webhook-code
---
apiVersion: v1
kind: ConfigMap
metadata:
name: alert-webhook-code
namespace: monitoring
data:
webhook.py: |
# [Paste the Flask code from above]
---
apiVersion: v1
kind: Service
metadata:
name: alert-webhook
namespace: monitoring
spec:
selector:
app: alert-webhook
ports:
- port: 80
targetPort: 5000
type: ClusterIP
---
apiVersion: v1
kind: Secret
metadata:
name: alert-webhooks
namespace: monitoring
type: Opaque
stringData:
slack-url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
discord-url: "https://discord.com/api/webhooks/YOUR/DISCORD/WEBHOOK"
Deploy:
kubectl apply -f alert-webhook-deployment.yaml
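Before wiring up Grafana, you can smoke-test the receiver with a hand-built payload. The sketch below assumes kubectl port-forward -n monitoring svc/alert-webhook 8080:80 is running in another terminal and includes only the fields the handler actually reads; a real Grafana payload carries many more.

# Minimal sketch: post a Grafana-like alert payload to the webhook receiver
import requests

sample_payload = {
    "alerts": [
        {
            "labels": {"pod": "nginx-deployment-66b6c48dd5-abc123", "namespace": "default"},
            "values": {"A": 1},
        }
    ]
}

resp = requests.post(
    "http://localhost:8080/alert?target=slack",
    json=sample_payload,
    timeout=30,
)
print(resp.status_code, resp.text)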
Part 6: Configure Grafana Webhook Contact Point
- Log in to Grafana
- Navigate to Alerting → Contact Points → New Contact Point
- Select Webhook
- Configure:
  - Name: pod-restart-webhook
  - URL: http://alert-webhook.monitoring.svc.cluster.local/alert?target=slack
  - HTTP Method: POST
- Test and save
Part 7: Advanced Webhook Customization
Custom Payload Template
Use Grafana's custom payload feature for fine-grained control. The fields below use Grafana's Go-style notification templating; treat the variable names as a sketch and verify them against your Grafana version:
{
  "alert_name": "{{ .CommonLabels.alertname }}",
  "status": "{{ .Status }}",
  "pod": "{{ .CommonLabels.pod }}",
  "namespace": "{{ .CommonLabels.namespace }}",
  "grafana_url": "{{ .ExternalURL }}",
  "timestamp": "{{ (index .Alerts 0).StartsAt }}"
}
Integration with Loki Logs in Alerts
Embed log queries directly in alert annotations; Grafana's label templating fills in the namespace and pod of the firing series:
annotations:
  logs_link: "https://grafana.example.com/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"refId\":\"A\",\"expr\":\"{namespace=\\\"{{ $labels.namespace }}\\\",pod=\\\"{{ $labels.pod }}\\\"}\"}]}"
Part 8: Production Best Practices
1. Log Retention
Configure Loki retention policies to balance cost and compliance. With boltdb-shipper, the retention_period below only deletes data when the compactor runs with retention enabled (retention_enabled: true):
ingester:
chunk_retain_period: 1m
max_chunk_age: 2h
schema_config:
configs:
- from: 2020-10-24
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
limits_config:
retention_period: 720h # 30 days
2. Alert Routing and Grouping
Define notification policies (shown here in Alertmanager-style routing syntax) to avoid alert fatigue:
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-alert-notification-policy
namespace: monitoring
data:
notification-policy.yaml: |
receiver: 'slack-critical'
group_by: ['namespace', 'pod']
group_wait: 10s
group_interval: 1m
repeat_interval: 4h
routes:
- receiver: 'slack-prod'
match:
environment: 'production'
group_wait: 5s
repeat_interval: 2h
- receiver: 'slack-staging'
match:
environment: 'staging'
3. Rate Limiting
Prevent webhook endpoint overload:
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

# flask-limiter >= 3.x: key function first, app passed as a keyword argument
limiter = Limiter(get_remote_address, app=app)

@app.route('/alert', methods=['POST'])
@limiter.limit("100 per minute")  # Max 100 requests/minute
def handle_alert():
    ...  # handle the alert as shown in Part 5
4. Monitoring the Webhook Receiver
Add Prometheus metrics to track webhook performance:
from prometheus_client import Counter, Histogram
webhook_requests = Counter(
'webhook_requests_total',
'Total webhook requests',
['status']
)
webhook_duration = Histogram(
'webhook_duration_seconds',
'Webhook processing duration'
)
@app.route('/alert', methods=['POST'])
def handle_alert():
    with webhook_duration.time():
        # Handle the alert as shown in Part 5, then record the outcome
        webhook_requests.labels(status='success').inc()
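To make these metrics scrapeable, expose a /metrics endpoint from the same Flask app; a minimal sketch:

from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

@app.route('/metrics')
def metrics():
    # Serve the counters and histograms defined above in the Prometheus text format
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

Point an existing scrape config (or a ServiceMonitor, if you run the Prometheus Operator) at the alert-webhook Service to track request rates and latencies.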
Part 9: Other Scenarios
Scenario 1: CrashLoopBackOff Detection
Combine pod restart metrics with event data to detect crash loops:
# Alert when a pod restarts more than 5 times in 10 minutes
increase(kube_pod_container_status_restarts_total[10m]) > 5
Query related events:
kube_event_count{
involved_object_kind="Pod",
reason=~"BackOff|Failed"
}
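A small helper in the webhook receiver can then flag crash loops explicitly. The sketch below reuses the events returned by get_pod_recent_events() from Part 5; the set of reasons worth flagging is an assumption based on common kubelet event reasons:

CRASH_REASONS = {"BackOff", "CrashLoopBackOff"}  # assumed reasons worth flagging

def is_crash_looping(events):
    """Return True if any recent event looks like a restart back-off."""
    return any(e.reason in CRASH_REASONS for e in events)

# Inside handle_alert(), after fetching pod_events:
# if is_crash_looping(pod_events):
#     message["text"] = f"🔁 Possible CrashLoopBackOff: {pod_name}"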
Scenario 2: Multi-Container Pod Restarts
Identify which container in a multi-container pod is restarting:
kube_pod_container_status_restarts_total{container!=""}
Update webhook to fetch logs per-container:
# Get all containers in the pod
pod = v1.read_namespaced_pod(pod_name, namespace)
containers = [c.name for c in pod.spec.containers]

# Fetch logs for each container, keyed by container name
container_logs = {
    name: get_pod_logs(namespace, pod_name, container_name=name)
    for name in containers
}
Scenario 3: Cross-Namespace Pod Restart Correlation
Alert when restarts accumulate across a namespace (for example, more than three restarts in 5 minutes):
sum by (namespace) (
  increase(kube_pod_container_status_restarts_total[5m])
) > 3
Conclusion
By combining Prometheus metrics, Kubernetes events, and application logs within Grafana alerts, you create a powerful incident response system that surfaces context-rich notifications. This setup enables ops teams to move from reactive firefighting to proactive, informed incident triage.
Key Takeaways:
- Use event-exporter to bridge ephemeral Kubernetes events to Prometheus
- Deploy Loki to centralize and query pod logs at scale
- Build custom webhook receivers to enrich alerts with real-time cluster state
- Implement rate limiting and monitoring on webhook endpoints for production reliability
- Use structured logging and label-based routing to reduce alert fatigue