Monitoring Kubernetes with Prometheus + Grafana + Alertmanager (a pitfall-free walkthrough)
Summary
This assumes a Kubernetes cluster that is already set up and working normally; on top of that you need dynamic storage provisioning (a StorageClass).
My environment:
k8s version   | Kubeadm deployment, v1.18.0
------------- | ---------------------------
k8s-master    | 172.22.254.57
k8s-node1     | 172.22.254.62
k8s-node2     | 172.22.254.63 (NFS server)
StorageClass  | nfs-storage
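A quick way to confirm the dynamic storage is in place before starting (a sketch; `nfs-storage` is the StorageClass name used by every PVC in this post):

kubectl get storageclass
# nfs-storage should appear in the list; the manifests below reference it by name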
k8s-master carries a taint; if you also want the master monitored, remove the taint (optional).
kubectl taint nodes node1 key1=value1:NoSchedule-
 
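The command above is the generic example from the Kubernetes docs. On a kubeadm v1.18 cluster the control-plane node normally carries the `node-role.kubernetes.io/master:NoSchedule` taint (assumption: the kubeadm default was not changed), so removing it in this environment would look like:

kubectl taint nodes k8s-master node-role.kubernetes.io/master:NoSchedule-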
The metric and field names used in prometheus-rules may change as versions move on; if you notice a change, let me know and I will update this document promptly. The fields in the rules below have already been updated for the versions used here, so they can be used as-is.
One more small detail: the Prometheus and Alertmanager ConfigMaps support hot reloading. You can trigger a reload with the command below; the refresh can take a little while, just wait it out.
curl -X POST http://ClusterIP:PORT/-/reload
 
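The ClusterIP and port can be read from the Services created later in the ops namespace, for example:

kubectl get svc -n ops
# reload Prometheus via its CLUSTER-IP on port 9090, Alertmanager via its CLUSTER-IP on port 80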
Resource download: https://github.com/alexclownfish/k8s-monitor
Deployment
Create the ops namespace
Prometheus YAML files
Prometheus configuration: prometheus-configmap.yaml
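All of the manifests below assume this namespace exists, for example:

kubectl create namespace ops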
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: ops
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules

    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node  # discover the nodes in the cluster
      relabel_configs:
      # map each node label (.*) to a new label name, keeping its value
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      # map each node label (.*) to a new label name, keeping its value
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      # the real metrics endpoint is https://NodeIP:10250/metrics/cadvisor, so replace the default metrics path
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints  # discover Pods behind the Endpoints of each Service
      relabel_configs:
      # do not scrape Services that lack the prometheus.io/scrape annotation
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      # rewrite the scrape scheme
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      # rewrite the metrics URL path
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      # rewrite the scrape address
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      # map each Kubernetes label (.*) to a new label name, keeping its value
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      # add a namespace label
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      # add a Service name label
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod  # discover all Pods as targets
      relabel_configs:
      # do not scrape Pods that lack the prometheus.io/scrape annotation
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      # rewrite the metrics URL path
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      # rewrite the scrape address
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      # map each Kubernetes label (.*) to a new label name, keeping its value
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # add a namespace label
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      # add a Pod name label
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name

    alerting:
      alertmanagers:
      - static_configs:
          - targets: ["alertmanager:80"]
 
kube-state-metrics collects state information about the various resource objects in Kubernetes: kube-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: ops
  labels:
    k8s-app: kube-state-metrics
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
      version: v1.3.0
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
        version: v1.3.0
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: lizhenliang/kube-state-metrics:v1.8.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: lizhenliang/addon-resizer:1.8.6
        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 30Mi
        env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        volumeMounts:
          - name: config-volume
            mountPath: /etc/config
        command:
          - /pod_nanny
          - --config-dir=/etc/config
          - --container=kube-state-metrics
          - --cpu=100m
          - --extra-cpu=1m
          - --memory=100Mi
          - --extra-memory=2Mi
          - --threshold=5
          - --deployment=kube-state-metrics
      volumes:
        - name: config-volume
          configMap:
            name: kube-state-metrics-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-config
  namespace: ops
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: ops
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
- apiGroups: ["networking.k8s.io", "extensions"]
  resources:
  - ingresses
  verbs: ["list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources:
  - storageclasses
  verbs: ["list", "watch"]
- apiGroups: ["certificates.k8s.io"]
  resources:
  - certificatesigningrequests
  verbs: ["list", "watch"]
- apiGroups: ["policy"]
  resources:
  - poddisruptionbudgets
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics-resizer
  namespace: ops
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions", "apps"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: ops
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: ops
 
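Once the Pod is running, a quick sanity check that metrics are being served (a sketch using kubectl port-forward; the Service name and port come from the manifest above):

kubectl -n ops port-forward svc/kube-state-metrics 8080:8080
curl -s http://localhost:8080/metrics | head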
Prometheus deployment: prometheus-deploy.yaml (note: use version 2.20)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: ops
  labels:
    k8s-app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
        - name: prometheus-server-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9090/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
            - mountPath: /etc/localtime
              name: timezone
          resources:
            limits:
              cpu: 10m
              memory: 100Mi
            requests:
              cpu: 10m
              memory: 100Mi
        - name: prometheus-server
          image: "prom/prometheus:v2.20.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/prometheus.yml
            - --storage.tsdb.path=/data
            - --web.console.libraries=/etc/prometheus/console_libraries
            - --web.console.templates=/etc/prometheus/consoles
            - --web.enable-lifecycle
          ports:
            - containerPort: 9090
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          resources:
            limits:
              cpu: 500m
              memory: 800Mi
            requests:
              cpu: 200m
              memory: 400Mi
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
              subPath: ""
            - name: prometheus-rules
              mountPath: /etc/config/rules
            - mountPath: /etc/localtime
              name: timezone
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
        - name: prometheus-data
          persistentVolumeClaim:
            claimName: prometheus
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: ops
spec:
  storageClassName: "nfs-storage"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: ops
spec:
  type: NodePort
  ports:
    - name: http
      port: 9090
      protocol: TCP
      targetPort: 9090
      nodePort: 30089
  selector:
    k8s-app: prometheus
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: ops
 
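After applying this file, the prometheus PVC has to be bound by the nfs-storage provisioner or the Pod will not start. A quick check (a sketch):

kubectl get pvc,pod -n ops
# the PVC should show Bound and the prometheus Pod should reach Running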
Prometheus alerting rules: prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: ops
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} (job {{ $labels.job }}) has been down for more than 5 minutes."

  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: |
          100 - (node_filesystem_free_bytes / node_filesystem_size_bytes) * 100 > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: partition {{ $labels.mountpoint }} usage is high"
          description: "{{ $labels.instance }}: partition {{ $labels.mountpoint }} usage is above 60% (current value: {{ $value }})"

      - alert: NodeMemoryUsage
        expr: |
          100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is high"
          description: "{{ $labels.instance }} memory usage is above 60% (current value: {{ $value }})"

      - alert: NodeCPUUsage
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"

      - alert: KubeNodeNotReady
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1m
        labels:
          severity: error
        annotations:
          message: '{{ $labels.node }} has not been Ready for more than 10 minutes.'

  pod.rules: |
    groups:
    - name: pod.rules
      rules:
      - alert: PodCPUUsage
        expr: |
          sum by(pod, namespace) (rate(container_cpu_usage_seconds_total{image!=""}[5m]) * 100) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} CPU usage is above 80% (current value: {{ $value }})"

      - alert: PodMemoryUsage
        expr: |
          sum(container_memory_rss{image!=""}) by(pod, namespace) / sum(container_spec_memory_limit_bytes{image!=""}) by(pod, namespace) * 100 != +inf > 80
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} memory usage is above 80% (current value: {{ $value }})"

      - alert: PodNetworkReceive
        expr: |
          sum(rate(container_network_receive_bytes_total{image!="",name=~"^k8s_.*"}[5m]) / 1000) by (pod,namespace) > 30000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} ingress traffic is above 30MB/s (current value: {{ $value }}K/s)"

      - alert: PodNetworkTransmit
        expr: |
          sum(rate(container_network_transmit_bytes_total{image!="",name=~"^k8s_.*"}[5m]) / 1000) by (pod,namespace) > 30000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} egress traffic is above 30MB/s (current value: {{ $value }}K/s)"

      - alert: PodRestart
        expr: |
          sum(changes(kube_pod_container_status_restarts_total[1m])) by (pod,namespace) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} restarted (current value: {{ $value }})"

      - alert: PodFailed
        expr: |
          sum(kube_pod_status_phase{phase="Failed"}) by (pod,namespace) > 0
        for: 5s
        labels:
          severity: error
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in Failed state (current value: {{ $value }})"

      - alert: PodPending
        expr: |
          sum(kube_pod_status_phase{phase="Pending"}) by (pod,namespace) > 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in Pending state (current value: {{ $value }})"

      - alert: PodErrImagePull
        expr: |
          sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="ErrImagePull"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in ErrImagePull state (current value: {{ $value }})"

      - alert: PodImagePullBackOff
        expr: |
          sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in ImagePullBackOff state (current value: {{ $value }})"

      - alert: PodCrashLoopBackOff
        expr: |
          sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in CrashLoopBackOff state (current value: {{ $value }})"

      - alert: PodInvalidImageName
        expr: |
          sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="InvalidImageName"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in InvalidImageName state (current value: {{ $value }})"

      - alert: PodCreateContainerConfigError
        expr: |
          sum by(namespace,pod) (kube_pod_container_status_waiting_reason{reason="CreateContainerConfigError"}) == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is in CreateContainerConfigError state (current value: {{ $value }})"

  volume.rules: |
    groups:
    - name: volume.rules
      rules:
      - alert: PersistentVolumeClaimLost
        expr: |
          sum by(namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Lost"}) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is lost\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: PersistentVolumeClaimPending
        expr: |
          sum by(namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_status_phase{phase="Pending"}) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: PersistentVolumeFailed
        expr: |
          sum(kube_persistentvolume_status_phase{phase="Failed",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Persistent volume is in Failed state\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: PersistentVolumePending
        expr: |
          sum(kube_persistentvolume_status_phase{phase="Pending",job="kubernetes-service-endpoints"}) by (persistentvolume) == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Persistent volume is in Pending state\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
 
 
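If you edit these rules, you can syntax-check a group before applying it. A sketch: copy the value of one data key (for example general.rules) into a local file with an arbitrary name and run promtool, which ships in the prom/prometheus image:

promtool check rules general.rules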
node-exporter configuration: node-exporter.yaml (note: use version 1.0.1)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: ops
  labels:
    k8s-app: node-exporter
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
      version: v1.0.1
  template:
    metadata:
      labels:
        k8s-app: node-exporter
        version: v1.0.1
    spec:
      containers:
        - name: prometheus-node-exporter
          image: "prom/node-exporter:v1.0.1"
          #imagePullPolicy: "Always"
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - name: metrics
              containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
      hostNetwork: true
      hostPID: true
      hostIPC: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /
        - name: dev
          hostPath:
            path: /dev
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: ops
  annotations:
    prometheus.io/scrape: "true"
spec:
  clusterIP: None
  ports:
    - name: metrics
      port: 9100
      protocol: TCP
      targetPort: 9100
  selector:
    k8s-app: node-exporter
 
 
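Because the DaemonSet uses hostNetwork with hostPort 9100, each node should now expose its metrics directly (a sketch; substitute one of your own node IPs, here k8s-node1 from the environment table):

curl -s http://172.22.254.62:9100/metrics | head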
Alertmanager YAML files
Alertmanager configuration: alertmanager-configmap.yaml
Note: you need to register a NetEase (163.com) mailbox yourself and obtain its SMTP authorization password.
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: ops
data:
  alertmanager.yml: |-
    global:
      # how long to wait before declaring an alert resolved once it stops firing
      resolve_timeout: 5m
      # email sending settings
      smtp_smarthost: 'smtp.163.com:465'
      smtp_from: 'xxx@163.com'
      smtp_auth_username: 'xxx@163.com'
      smtp_auth_password: 'xxxxxx'
      smtp_hello: '163.com'
      smtp_require_tls: false
    # the root route that every alert enters; it defines how alerts are dispatched
    route:
      # regroup incoming alerts by these labels; e.g. alerts sharing cluster=A and alertname=LatencyHigh are aggregated into one group
      group_by: ['alertname', 'cluster']
      # when a new alert group is created, wait at least group_wait before the first notification, so multiple alerts for the same group can be sent together
      group_wait: 30s
      # after the first notification, wait group_interval before notifying about new alerts added to the group
      group_interval: 5m
      # if an alert has already been sent successfully, wait repeat_interval before re-sending it
      repeat_interval: 5m
      # default receiver: used when an alert matches no sub-route
      receiver: default
      # all of the attributes above are inherited by the sub-routes and can be overridden per route
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    templates:
      - '/etc/config/template/email.tmpl'
    receivers:
    - name: 'default'
      email_configs:
      - to: 'xxxx@qq.com'
        html: '{{ template "email.html" . }}'
        headers: { Subject: "[WARN] Prometheus alert email" }
        #send_resolved: true
    - name: 'email'
      email_configs:
      - to: 'xxxx@gmail.com'
        send_resolved: true
 
 
Alertmanager template: alertmanager-template.yaml
# custom alert email template
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-template-volume
  namespace: ops
data:
  email.tmpl: |
    {{ define "email.html" }}
        {{ range .Alerts }}
    <pre>
        ========start==========
        Alerting program: prometheus_alert_email
        Severity: {{ .Labels.severity }}
        Alert name: {{ .Labels.alertname }}
        Affected host: {{ .Labels.instance }}
        Summary: {{ .Annotations.summary }}
        Details: {{ .Annotations.description }}
        Remediation: {{ .Annotations.console }}
        Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
        ========end==========
    </pre>
        {{ end }}
    {{ end }}
 
 
Alertmanager deployment: alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: ops
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
    spec:
      containers:
        - name: prometheus-alertmanager
          image: "prom/alertmanager:v0.14.0"
          imagePullPolicy: "IfNotPresent"
          args:
            - --config.file=/etc/config/alertmanager.yml
            - --storage.path=/data
            - --web.external-url=/
          ports:
            - containerPort: 9093
          readinessProbe:
            httpGet:
              path: /#/status
              port: 9093
            initialDelaySeconds: 30
            timeoutSeconds: 30
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            # custom alert template
            - name: config-template-volume
              mountPath: /etc/config/template
            - name: storage-volume
              mountPath: "/data"
              subPath: ""
            - mountPath: /etc/localtime
              name: timezone
          resources:
            limits:
              cpu: 10m
              memory: 200Mi
            requests:
              cpu: 10m
              memory: 100Mi
        - name: prometheus-alertmanager-configmap-reload
          image: "jimmidyson/configmap-reload:v0.1"
          imagePullPolicy: "IfNotPresent"
          args:
            - --volume-dir=/etc/config
            - --webhook-url=http://localhost:9093/-/reload
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
              readOnly: true
          resources:
            limits:
              cpu: 10m
              memory: 200Mi
            requests:
              cpu: 10m
              memory: 100Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: config-template-volume
          configMap:
            name: alertmanager-template-volume
        - name: storage-volume
          persistentVolumeClaim:
            claimName: alertmanager
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: ops
spec:
  storageClassName: nfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: ops
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  type: "NodePort"
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
      nodePort: 30093
  selector:
    k8s-app: alertmanager
 
 
 
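To check the mail pipeline independently of Prometheus, you can push a synthetic alert straight into Alertmanager through its v1 API (a sketch; adjust the address to the Service's ClusterIP:80 or a NodeIP:30093, and note the team: node label routes it to the email receiver defined in the ConfigMap):

curl -X POST -H 'Content-Type: application/json' \
  http://NodeIP:30093/api/v1/alerts \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","team":"node"}}]'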
Grafana YAML file
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:7.1.0
        ports:
          - containerPort: 3000
            protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
          - name: grafana-data
            mountPath: /var/lib/grafana
            subPath: grafana
          - mountPath: /etc/localtime
            name: timezone
      securityContext:
        fsGroup: 472
        runAsUser: 472
      volumes:
      - name: grafana-data
        persistentVolumeClaim:
          claimName: grafana
      - name: timezone
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: ops
spec:
  storageClassName: "nfs-storage"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: ops
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30030
  selector:
    app: grafana
 
 
Deploy to Kubernetes
Grafana data source and dashboards
Add the data source in Grafana
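Assuming all of the files above are saved in one directory, they can be applied in one go and then checked (a sketch):

kubectl apply -f .
kubectl get pods,svc -n ops
# NodePorts from the manifests: Prometheus 30089, Alertmanager 30093, Grafana 30030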

Click datasource - add datasource and choose Prometheus.

Then click Save & Test; the data source is now added.
Import the dashboard templates
Template download: https://github.com/alexclownfish/k8s-monitor/tree/main/grafana_template
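For the URL field of the data source, either of the addresses below should work (assumptions: the default cluster DNS suffix; Grafana itself is reachable at http://NodeIP:30030 with the stock admin/admin login):

http://prometheus.ops:9090    # in-cluster Service address
http://NodeIP:30089           # NodePort address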


Modify the Prometheus rules to verify that alerts fire and emails are sent
Edit prometheus-rules.yaml

# hot-reload the ConfigMap
kubectl apply -f prometheus-rules.yaml
curl -X POST http://10.1.230.219:9090/-/reload
 
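For example, to force an alert for this test you could temporarily lower one threshold in prometheus-rules.yaml before running the two commands above (a sketch; any rule works, here NodeMemoryUsage is dropped from 60% to 10%):

expr: |
  100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 10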


You can see that the alert has fired and the email has been sent.
That's it.
Thanks to these authors:
https://blog.51cto.com/luoguoling
https://alexcld.com