> Author: 丁辉

# Deploying Kube-Prometheus-Stack with Helm

## Introduction

**Kube-Prometheus-Stack is a comprehensive monitoring solution designed for Kubernetes clusters, integrating Prometheus, Grafana, Alertmanager, and related components.** By shipping a preconfigured deployment, it greatly simplifies setting up a monitoring system in a Kubernetes environment.

## Deployment

[Official repository](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)

1. Add the Helm repository

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

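Optionally, list the available chart versions before installing, so you can pin a specific one:

```bash
helm search repo prometheus-community/kube-prometheus-stack --versions
```
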
2. Create the namespace

```bash
kubectl create namespace monitor
```

3. Write the values.yaml file

```bash
vi kube-prometheus-stack-values.yaml
```

The content is as follows:

```yaml
prometheusOperator:
  admissionWebhooks:
    patch:
      image:
        registry: registry.aliyuncs.com # Image registry mirror
        repository: google_containers/kube-webhook-certgen

# Disable the default alerting rules (recommended: disable them and define your own)
defaultRules:
  create: false

# Configure Alertmanager to send alerts to Feishu via PrometheusAlert
# Deploying PrometheusAlert with Helm
# Docs: https://gitee.com/offends/Kubernetes/blob/main/Helm/Helm%E9%83%A8%E7%BD%B2PrometheusAlert.md
alertmanager:
  tplConfig: true
  stringConfig: |
    global:
      # After an alert is marked resolved, Alertmanager waits 5 minutes before updating its status; if the alert stays gone within this window, it is marked resolved.
      resolve_timeout: 5m
    route:
      # Group alerts that share the same alert name (alertname).
      group_by: ['alertname']
      # After the first alert of a group arrives, wait 30 seconds before sending, so related alerts can be merged.
      group_wait: 30s
      # After a group has been sent, wait 5 minutes before sending the next batch of alerts for that group.
      group_interval: 5m
      # Re-send a still-firing group every 30 minutes, as a reminder of long-standing problems.
      repeat_interval: 30m
      receiver: 'web.hook.prometheusalert'
    receivers:
      - name: 'web.hook.prometheusalert'
        webhook_configs:
          - url: 'http://prometheusalert.monitor.svc.cluster.local:8080/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/****'
            send_resolved: true # Also send a notification when an alert is resolved
    inhibit_rules:
      # Alert inhibition rules.
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
  alertmanagerSpec:
    # Force cluster mode even when running only a single replica.
    forceEnableClusterMode: false
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

grafana:
  # Whether to enable the default dashboards
  defaultDashboardsEnabled: false
  # Grafana timezone
  defaultDashboardsTimezone: cst
  # Grafana admin password
  adminPassword: admin
  # Persistent storage for Grafana
  persistence:
    enabled: true
    storageClassName: "" # Storage class to use; if unset, the cluster must have a default StorageClass
  # Expose Grafana through an Ingress
  ingress:
    enabled: true
    ingressClassName: # Ingress controller to use; if unset, the cluster must have a default ingress controller
    hosts:
      - # your domain
    path: /
    tls:
      - secretName: grafana-general-tls
        hosts:
          - # your domain

prometheus:
  prometheusSpec:
    # Point at an external Alertmanager
    #additionalAlertManagerConfigs:
    #  - static_configs:
    #      - targets:
    #          - "192.168.1.10:9093"
    # Whether to enable the --web.enable-remote-write-receiver feature
    enableRemoteWriteReceiver: false
    # Rule evaluation interval
    evaluationInterval: "30s"
    # Scrape interval
    scrapeInterval: "5s"
    # With these set to false, the selectors (rules, service monitors, pod monitors, probes, and scrape configs) are configured independently of the Helm chart values (otherwise your own ServiceMonitors may not be discovered automatically).
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false
    scrapeConfigSelectorNilUsesHelmValues: false
    # Persistent storage for Prometheus
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: # Storage class to use; if unset, the cluster must have a default StorageClass
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi

# Image mirror for the sub-chart
kube-state-metrics:
  image:
    registry: k8s.mirror.nju.edu.cn
```

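Since `defaultRules.create: false` disables the chart's built-in alerts, you need to supply your own rules. Because `ruleSelectorNilUsesHelmValues: false`, the operator picks up any PrometheusRule in the cluster. A minimal sketch (the rule name, expression, and labels here are illustrative, not from this document):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alert-rules # hypothetical name
  namespace: monitor
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeExporterDown
          # Fires when a node-exporter target has been unreachable for 5 minutes
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.instance }} is unreachable"
```

Apply it with `kubectl apply -f`, and the alert flows through the Alertmanager route configured above.
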
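Likewise, with `serviceMonitorSelectorNilUsesHelmValues: false`, a ServiceMonitor does not need the chart's `release` label to be discovered. A minimal sketch for scraping a hypothetical application Service labeled `app: my-app` that exposes a port named `http-metrics`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app # hypothetical name
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default # namespace where the target Service lives
  endpoints:
    - port: http-metrics # must match the Service's port name
      path: /metrics
      interval: 30s
```
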
4. Create the Nginx certificate secret

> Either a .pem or a .crt file works as the cert

```bash
kubectl create secret tls grafana-general-tls --key nginx.key --cert nginx.pem -n monitor
```

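If you only need a certificate for testing, a self-signed key pair can be generated first (the domain below is a placeholder):

```bash
# Generate a self-signed certificate valid for one year
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout nginx.key -out nginx.pem \
  -subj "/CN=grafana.example.com"
```
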
5. Install

```bash
helm install kube-prometheus-stack -f kube-prometheus-stack-values.yaml \
prometheus-community/kube-prometheus-stack -n monitor
```

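A quick way to confirm the release is coming up (the `release` label matches the release name used above):

```bash
kubectl get pods -n monitor -l "release=kube-prometheus-stack" -w
```
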
> Access the Grafana dashboard. The chart's default login is `admin` / `prom-operator`, but the values above override the password with `adminPassword: admin`, so log in with `admin` / `admin` here.

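If the Ingress or DNS is not ready yet, Grafana can also be reached with a port-forward (the Service name below assumes the chart's `<release>-grafana` naming convention):

```bash
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitor
# Then open http://localhost:3000
```
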
## Uninstall

1. Uninstall kube-prometheus-stack (Helm leaves the chart's CRDs behind; see the cleanup sketch after this list)

```bash
helm uninstall kube-prometheus-stack -n monitor
```

2. Delete the secret

```bash
kubectl delete secret grafana-general-tls -n monitor
```

3. Delete the namespace

```bash
kubectl delete namespace monitor
```

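`helm uninstall` does not remove the CRDs that the chart installed; for a full cleanup they have to be deleted manually. A sketch (the exact CRD list may vary by chart version):

```bash
kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheusagents.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd scrapeconfigs.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com
```
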
## Troubleshooting

- When I proxy Grafana behind Nginx at `https://localhost/monitor`, Grafana cannot be reached through the proxy

**Solution:**

1. Edit the configmap

```bash
kubectl edit configmap kube-prometheus-stack-grafana -n monitor
```

2. Add or change the following under `[server]`

```ini
domain = 'localhost'
root_url = %(protocol)s://%(domain)s:%(http_port)s/monitor
```

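Note that a directly edited ConfigMap is overwritten on the next `helm upgrade`, and Grafana does not reload it until the pod restarts. A more durable sketch is to set the same options through the chart's `grafana.ini` values key and upgrade (the `serve_from_sub_path` line is an assumption for recent Grafana versions when the proxy does not strip the `/monitor` prefix):

```yaml
grafana:
  grafana.ini:
    server:
      domain: localhost
      root_url: "%(protocol)s://%(domain)s:%(http_port)s/monitor"
      serve_from_sub_path: true # assumption: proxy forwards the /monitor prefix unchanged
```

```bash
helm upgrade kube-prometheus-stack -f kube-prometheus-stack-values.yaml \
  prometheus-community/kube-prometheus-stack -n monitor
```
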
- In a Kubernetes cluster deployed with RKE1, some of the Kubernetes components cannot be monitored; additional values parameters are required, as follows

[RKE1 Kubernetes values parameters](https://gitee.com/offends/Kubernetes/blob/main/File/Yaml/rke-kube-prometheus-stack-values.yaml)

After applying this configuration, the targets still could not be reached, because the component metrics endpoints are not exposed for external access. Following the document below to expose them resolved the issue.

[Expose Metrics access for Rancher components](https://gitee.com/offends/Kubernetes/blob/main/%E9%83%A8%E7%BD%B2%E6%96%87%E6%A1%A3/Rancher/Rancher%E7%BB%84%E4%BB%B6%E5%85%AC%E5%BC%80Metrics%E8%AE%BF%E9%97%AE.md)