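1. Background
In Alibaba Cloud ACK, alerts for the liveness, readiness, and startup probes are bundled into the general warn alert, and that alert fires on a single occurrence. As a result, the three probes generate alerts far too often in our projects, so their alerts need to be separated from the general warn alert.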
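2. Configuration
1. Find the ACK cluster alert settings: Alert Configuration → Operations Management → Alert Configuration
2. Click the warn event set → Advanced Settings → search for "通用" (general)
All warn events are in here.
Note: ACK's Kubernetes event alerts work by recording events as SLS logs and sending notifications through SLS alerting. So all you need here is to be familiar with SLS query SQL and know how to modify it.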
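2.1 Separating the three probes from the warn alert
1. Click Edit on the "K8s通用Warn警示事件" (K8s general Warn event) rule
Change the SQL to the following: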
level: Warning and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason:"CIS.ScheduleTask.Warning" and not eventId.reason:"CIS.ScheduleTask.Fail" |
SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt
FROM log
where "eventId.message" not like 'Liveness probe failed%'
  and "eventId.message" not like 'Readiness probe failed:%'
  and "eventId.message" not like 'Startup probe failed:%'
GROUP by namespace, kind, object_name
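In the query above, the three not like conditions in the where clause are what exclude the three probes' alerts from the general warn alert.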
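To make the effect of the three not like conditions concrete, here is a minimal Python sketch that mimics them locally. It assumes that a pattern ending in % behaves as a case-sensitive prefix match; the sample messages are hypothetical and only illustrate the shape of probe failure events, they are not taken from a real cluster.

```python
# Prefixes copied from the three `not like '<prefix>%'` conditions in the rule.
PROBE_PREFIXES = (
    "Liveness probe failed",
    "Readiness probe failed:",
    "Startup probe failed:",
)


def stays_in_general_warn(message: str) -> bool:
    """Return True if the message would still match the general warn rule.

    A LIKE pattern with a single trailing % is treated as a prefix match,
    so `not like` becomes `not startswith`.
    """
    return not message.startswith(PROBE_PREFIXES)


# Hypothetical event messages, for illustration only.
samples = [
    "Liveness probe failed: HTTP probe failed with statuscode: 500",
    "Readiness probe failed: context deadline exceeded",
    "Startup probe failed: connection refused",
    "Back-off restarting failed container",
]

kept = [m for m in samples if stays_in_general_warn(m)]
print(kept)
```

Only the non-probe message remains in `kept`, which is the behavior expected of the general warn alert after the change.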
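2.2 Custom startup, liveness, and readiness probe alerts
A ready-made template already exists, so just copy it and modify the copy.
1. Copy the "K8s通用Warn警示事件" (K8s general Warn event) rule
2. Give it a custom name and select the project and logstore
3. Modify the alert rule SQL
The startup probe query comes first; the liveness and readiness queries are made the same way, by copying it and changing only the like prefix.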
# Startup probe
* and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail" |
SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt
FROM log
where "eventId.message" like 'Startup probe failed%'
GROUP by namespace, kind, object_name

# Liveness probe
* and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail" |
SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt
FROM log
where "eventId.message" like 'Liveness probe failed%'
GROUP by namespace, kind, object_name

# Readiness probe
* and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail" |
SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt
FROM log
where "eventId.message" like 'Readiness probe failed%'
GROUP by namespace, kind, object_name
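Since the three probe queries differ only in the like prefix, they could also be generated from a single template instead of being edited by hand. The following is a small sketch of that idea; the names `SEARCH`, `ANALYTICS`, and `build_probe_query` are hypothetical helpers introduced for illustration, and the script only builds strings to paste into the alert rule, it does not call any SLS API.

```python
# Search part (before the pipe), copied from the queries in this section.
SEARCH = (
    '* and not "Error updating Endpoint Slices for Service" '
    "and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) "
    'and not eventId.reason: "CIS.ScheduleTask.Warning" '
    'and not eventId.reason: "CIS.ScheduleTask.Fail"'
)

# Analytics part (after the pipe); {prefix} is the only piece that varies.
ANALYTICS = (
    'SELECT ARRAY_AGG("eventId.message") as message, '
    '"eventId.metadata.namespace" as namespace, '
    '"eventId.involvedObject.kind" as kind, '
    '"eventId.involvedObject.name" as object_name, '
    "COUNT(*) as cnt "
    "FROM log where \"eventId.message\" like '{prefix}%' "
    "GROUP by namespace, kind, object_name"
)


def build_probe_query(probe: str) -> str:
    """Build the alert query for one probe type, e.g. 'Startup'."""
    prefix = f"{probe} probe failed"
    return f"{SEARCH} | {ANALYTICS.format(prefix=prefix)}"


queries = {p: build_probe_query(p) for p in ("Startup", "Liveness", "Readiness")}

for name, query in queries.items():
    print(f"# {name} probe")
    print(query)
    print()
```

Generating the queries this way keeps the shared exclusions identical across the three rules, so a later change to the search part only has to be made once.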
After modifying the SQL for each probe as described above, the result is as follows:
2.3 Verification