1、Uneven disk usage discovered through a message backlog
Alert: the er-iot-log-queue queue backlog has exceeded the threshold of 500; the current backlog is 5832.
2、Information gathering
1. Check the load on each node with the _cat/nodes?v API
curl -X GET "http://localhost:9200/_cat/nodes?v"
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
18.10.13.197 53 98 35 3.05 2.79 3.09 cdfhilmrstw * node-1
18.10.13.198 23 86 8 0.83 0.75 0.70 cdfhilmrstw - node-2
18.10.13.199 34 95 13 0.97 0.92 0.89 cdfhilmrstw - node-3
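As a side note, the _cat APIs accept an h= parameter for column selection and s= for sorting, so a narrower, sorted view can be requested directly; a minimal sketch (columns taken from the output above):
curl -X GET "http://localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,cpu,load_1m&s=cpu:desc"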
2. Check each node's disk usage with the _cat/allocation API
curl -X GET "http://localhost:9200/_cat/allocation?v"
shards disk.indices disk.used disk.avail disk.total disk.percent host         ip           node
    16      311.9gb   316.6gb    183.2gb    499.9gb           63 18.10.13.197 18.10.13.197 node-1
    15      268.6gb   273.3gb    226.6gb    499.9gb           54 18.10.13.198 18.10.13.198 node-2
    15       57.3gb    61.7gb    438.1gb    499.9gb           12 18.10.13.199 18.10.13.199 node-3
disk.percent: percentage of disk space used
disk.used: disk space used
disk.avail: disk space available
disk.total: total disk space
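For context, disk.percent is what Elasticsearch compares against its disk watermarks (by default roughly 85% low, 90% high, 95% flood_stage); once a node crosses the high watermark the cluster starts moving shards off it on its own. The watermark settings can be inspected the same way as other cluster settings:
curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true" | grep cluster.routing.allocation.disk.watermark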
3. Check how the shards of each index are distributed with the _cat/shards API
curl -X GET "http://localhost:9200/_cat/shards?v"
index shard prirep state docs store ip node
.kibana-event-log-7.17.6-000013 0 p STARTED 18.10.13.197 node-1
.kibana-event-log-7.17.6-000013 0 r STARTED 18.10.13.198 node-2
.ds-.logs-deprecation.elasticsearch-default-2025.01.05-000023 0 p STARTED 18.10.13.197 node-1
.ds-.logs-deprecation.elasticsearch-default-2025.01.05-000023 0 r STARTED 18.10.13.199 node-3
.apm-agent-configuration 0 p STARTED 0 226b 18.10.13.198 node-2
.apm-agent-configuration 0 r STARTED 0 226b 18.10.13.199 node-3
.ds-.logs-deprecation.elasticsearch-default-2024.12.06-000021 0 p STARTED 18.10.13.197 node-1
.ds-.logs-deprecation.elasticsearch-default-2024.12.06-000021 0 r STARTED 18.10.13.199 node-3
.ds-ilm-history-5-2024.11.06-000018 0 p STARTED 18.10.13.197 node-1
.ds-ilm-history-5-2024.11.06-000018 0 r STARTED 18.10.13.198 node-2
iot_ele_data_202501 0 p STARTED 108780659 52.8gb 18.10.13.197 node-1
iot_ele_data_202501 0 r STARTED 108780659 52.7gb 18.10.13.198 node-2
.kibana-event-log-7.17.6-000012 0 p STARTED 18.10.13.197 node-1
.kibana-event-log-7.17.6-000012 0 r STARTED 18.10.13.198 node-2
iot_ele_log_bean 0 p STARTED 8 52.6kb 18.10.13.197 node-1
iot_ele_log_bean 0 r STARTED 8 52.6kb 18.10.13.199 node-3
.apm-custom-link 0 r STARTED 0 226b 18.10.13.198 node-2
.apm-custom-link 0 p STARTED 0 226b 18.10.13.199 node-3
iot_ele_base_202412 0 r STARTED 220751580 102.6gb 18.10.13.197 node-1
iot_ele_base_202412 0 p STARTED 220751580 102.6gb 18.10.13.198 node-2
iot_ele_demo_bean 0 p STARTED 0 226b 18.10.13.197 node-1
iot_ele_demo_bean 0 r STARTED 0 226b 18.10.13.198 node-2
.async-search 0 p STARTED 0 257b 18.10.13.197 node-1
.async-search 0 r STARTED 0 257b 18.10.13.198 node-2
.ds-ilm-history-5-2024.10.07-000016 0 r STARTED 18.10.13.197 node-1
.ds-ilm-history-5-2024.10.07-000016 0 p STARTED 18.10.13.199 node-3
.ds-ilm-history-5-2024.12.06-000020 0 r STARTED 18.10.13.197 node-1
.ds-ilm-history-5-2024.12.06-000020 0 p STARTED 18.10.13.199 node-3
iot_ele_base_202501 0 p STARTED 108792480 50.9gb 18.10.13.197 node-1
iot_ele_base_202501 0 r STARTED 108792480 51gb 18.10.13.199 node-3
.tasks 0 r STARTED 4 21.6kb 18.10.13.197 node-1
.tasks 0 p STARTED 4 21.6kb 18.10.13.199 node-3
.ds-ilm-history-5-2025.01.05-000022 0 p STARTED 18.10.13.198 node-2
.ds-ilm-history-5-2025.01.05-000022 0 r STARTED 18.10.13.199 node-3
.kibana_task_manager_7.17.6_001 0 p STARTED 17 118.7mb 18.10.13.198 node-2
.kibana_task_manager_7.17.6_001 0 r STARTED 17 171.5mb 18.10.13.199 node-3
er_sxf_data_bean 0 p STARTED 4299971 6.1gb 18.10.13.198 node-2
er_sxf_data_bean 0 r STARTED 4299971 6.1gb 18.10.13.199 node-3
.kibana-event-log-7.17.6-000011 0 p STARTED 18.10.13.197 node-1
.kibana-event-log-7.17.6-000011 0 r STARTED 18.10.13.199 node-3
.kibana-event-log-7.17.6-000010 0 r STARTED 18.10.13.198 node-2
.kibana-event-log-7.17.6-000010 0 p STARTED 18.10.13.199 node-3
.kibana_7.17.6_001 0 p STARTED 449 2.5mb 18.10.13.198 node-2
.kibana_7.17.6_001 0 r STARTED 449 2.5mb 18.10.13.199 node-3
iot_ele_data_202412 0 p STARTED 220727742 105.5gb 18.10.13.197 node-1
iot_ele_data_202412 0 r STARTED 220727742 105.4gb 18.10.13.198 node-2
index: index name
shard: shard number
prirep: p = primary shard, r = replica shard
state: shard state (STARTED means the shard is running normally)
store: disk space used by the shard
node: node the shard is located on
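To spot the heavy shards faster, the same _cat/shards API can be asked to sort by store size and print only the relevant columns; a small convenience, not required for the procedure:
curl -X GET "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc"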
3、Conclusion
The load is concentrated on node-1, so the plan is to relocate the iot_ele_data_202501 and iot_ele_base_202412 shard copies held on node-1 over to node-3.
4、Backing up the data
When relocating shards manually, Elasticsearch usually does not require an explicit backup beforehand, because the cluster already provides high availability and data redundancy (replica shards). Depending on your cluster configuration and how involved the operation is, some precautions are still advisable to keep the data safe.
1. When is a backup needed?
Even though Elasticsearch is highly available, a backup before manually relocating shards is recommended in the following cases:
- No replica shards: if the index is configured with number_of_replicas: 0 (a quick check is shown after this list), the risk of data loss is high
- Critical data: if the data is critical and no loss can be tolerated, a backup is necessary
- Complex operations: for large-scale shard relocation or cluster restructuring, a backup protects against mistakes during the operation
- Unstable cluster: if the cluster itself is unhealthy (yellow or red status), a backup is necessary
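A quick way to check the first point, i.e. whether an index actually has replicas, is to read its number_of_replicas setting; a minimal sketch against one of the large indices from this cluster (a value of 0 means no replica copies exist):
curl -s "http://localhost:9200/iot_ele_data_202501/_settings?flat_settings=true" | grep number_of_replicas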
2. Since I had confirmed a backup was not needed, I went straight to relocating the shards; the backup options below are for reference if you do need one.
Elasticsearch provides several backup mechanisms; the common approach is:
- Use snapshots (Snapshot)
Snapshots are the officially recommended backup method and support incremental backup and restore.
- Configure a snapshot repository
First, register a snapshot repository (local filesystem, S3, HDFS, etc.):
PUT /_snapshot/my_backup_repo
{
  "type": "fs",
  "settings": { "location": "/path/to/backup" }
}
- Create a snapshot
Create a snapshot of an index or of the whole cluster:
PUT /_snapshot/my_backup_repo/snapshot_1
{
  "indices": "my_index",          // index name, or * to back up all indices
  "ignore_unavailable": true,
  "include_global_state": false
}
- Restore a snapshot
To restore the data later:
POST /_snapshot/my_backup_repo/snapshot_1/_restore
{
  "indices": "my_index",
  "ignore_unavailable": true,
  "include_global_state": false
}
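Snapshot creation runs asynchronously by default (unless wait_for_completion=true is passed on the PUT), so before relying on a backup it is worth confirming the snapshot actually reached the SUCCESS state; either of the following should do, using the repository and snapshot names from above:
GET /_snapshot/my_backup_repo/snapshot_1/_status
GET /_cat/snapshots/my_backup_repo?v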
5、Uneven shard distribution can be handled either by manual relocation or by relying on automatic rebalancing (e.g. POST _cluster/reroute, with retry_failed=true to retry previously failed allocations)
Because this is a production environment I took the conservative manual route, and before the relocation I had the applications stop writing to ES and write to local files temporarily instead!
1. Check cluster health
curl -X GET "http://localhost:9200/_cluster/health"
{"cluster_name":"es-ny","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":23,"active_shards":46,"relocating_shards":1,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
Make sure the cluster status is green.
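Instead of polling manually, the health API can block until the desired status is reached (it reports "timed_out": true if the status is not reached within the timeout), which is handy in scripts:
curl -X GET "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s"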
2. Disable shard allocation
Temporarily disable automatic shard allocation during the relocation so that Elasticsearch does not reassign shards on its own.
// Check the current shard-allocation setting
curl -XGET "127.0.0.1:9200/_cluster/settings?include_defaults=true&flat_settings=true" | grep cluster.routing.allocation.enable
all
// Disable allocation
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'
// Verify
curl -XGET "127.0.0.1:9200/_cluster/settings?include_defaults=true&flat_settings=true" | grep cluster.routing.allocation.enable
- transient only: the setting takes effect temporarily and is lost after a full cluster restart
- persistent only: the setting survives cluster restarts
- transient and persistent together: transient temporarily overrides persistent; if you set both, make sure the values agree
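Related to this: a cluster setting can be removed again (reverting to its default) by setting it to null, which is an alternative to explicitly writing "all" back in step 7; a minimal sketch:
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": null }
}'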
3. Relocate the shards manually with the _cluster/reroute API
POST _cluster/reroute
{
  "commands": [
    { "move": { "index": "my_index", "shard": 0, "from_node": "node-1", "to_node": "node-2" } }
  ]
}
index: index name
shard: shard number
from_node: node the shard is currently on
to_node: target node
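The reroute API also accepts dry_run=true (compute the resulting state without applying it) and explain=true (return an explanation for each command), which is a safe way to sanity-check a move before running it for real:
POST _cluster/reroute?dry_run=true&explain=true
{
  "commands": [
    { "move": { "index": "my_index", "shard": 0, "from_node": "node-1", "to_node": "node-2" } }
  ]
}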
The following is my actual operation log.
// List the tasks currently running
curl -X GET "http://localhost:9200/_cat/tasks?v"
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    { "move": { "index": "iot_ele_data_202501", "shard": 0, "from_node": "node-1", "to_node": "node-3" } }
  ]
}
'
// Confirm the relocation task has started
curl -X GET "http://localhost:9200/_cat/tasks?v"
// If Elasticsearch appears to hang, the same command shows which tasks are running:
// it returns a list of running tasks with their task ID, node ID, task type, and so on.
// If ES is so stuck that it cannot serve requests at all, check the ES log files (usually in the log directory, e.g. /var/log/elasticsearch/) for recent errors or exceptions that hint at which tasks caused the hang.
// A stuck task can be cancelled via the API, e.g. POST /_tasks/{task_id}/_cancel.
// Once the reroute command is accepted, Elasticsearch starts moving the specified shard to the target node; progress can be monitored with:
curl -X GET "http://localhost:9200/_cat/recovery?v"
// Relocate the second shard
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    { "move": { "index": "iot_ele_base_202412", "shard": 0, "from_node": "node-1", "to_node": "node-3" } }
  ]
}
'
curl -X GET "http://localhost:9200/_cat/recovery?v"
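For the task-cancellation case mentioned in the comments above, the task management API gives more detail than _cat/tasks; a minimal sketch, where {task_id} is a placeholder of the form nodeId:taskNumber:
// List running tasks with full detail (task ID, node, action, running time)
curl -X GET "http://localhost:9200/_tasks?detailed=true&group_by=parents"
// Cancel a specific task by its ID
curl -X POST "http://localhost:9200/_tasks/{task_id}/_cancel"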
4. Monitor the shard relocation with the GET _cat/recovery?v API
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
.apm-agent-configuration 0 18ms peer done 18.10.13.199 node-3 18.10.13.198 node-2 n/a n/a 1 1 100.0% 1 226 226 100.0% 226 0 0 100.0%
.apm-agent-configuration 0 18ms peer done 18.10.13.198 node-2 18.10.13.199 node-3 n/a n/a 1 1 100.0% 1 226 226 100.0% 226 0 0 100.0%
iot_ele_demo_bean 0 16ms peer done 18.10.13.197 node-1 18.10.13.198 node-2 n/a n/a 1 1 100.0% 1 226 226 100.0% 226 0 0 100.0%
iot_ele_demo_bean 0 84ms empty_store done n/a n/a 18.10.13.197 node-1 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
.kibana_7.17.6_001 0 183ms peer done 18.10.13.199 node-3 18.10.13.198 node-2 n/a n/a 33 33 100.0% 33 2549750 2549750 100.0% 2549750 0 0 100.0%
.kibana_7.17.6_001 0 198ms peer done 18.10.13.198 node-2 18.10.13.199 node-3 n/a n/a 45 45 100.0% 45 2567297 2567297 100.0% 2567297 0 0 100.0%
iot_ele_base_202412 0 24.3m peer done 18.10.13.199 node-3 18.10.13.198 node-2 n/a n/a 218 218 100.0% 218 43114300329 43114300329 100.0% 43114300329 608845 608844 100.0%
iot_ele_base_202412 0 86ms peer done 18.10.13.199 node-3 18.10.13.197 node-1 n/a n/a 1 1 100.0% 1 226 226 100.0% 226 17 17 100.0%
iot_ele_base_202501 0 28ms empty_store done n/a n/a 18.10.13.197 node-1 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
iot_ele_base_202501 0 16ms peer done 18.10.13.197 node-1 18.10.13.199 node-3 n/a n/a 1 1 100.0% 1 226 226 100.0% 226 0 0 100.0%
.tasks 0 48ms peer done 18.10.13.198 node-2 18.10.13.197 node-1 n/a n/a 10 10 100.0% 10 22174 22174 100.0% 22174 0 0 100.0%
.tasks 0 26ms peer done 18.10.13.198 node-2 18.10.13.199 node-3 n/a n/a 10 10 100.0% 10 22174 22174 100.0% 22174 0 0 100.0%
.apm-custom-link 0 532ms peer done 18.10.13.197 node-1 18.10.13.198 node-2 n/a n/a 1 1 100.0% 1 226 226 100.0% 226 0 0 100.0%
.apm-custom-link 0 527ms peer done 18.10.13.197 node-1 18.10.13.199 node-3 n/a n/a 1 1 100.0% 1 226 226 100.0% 226 0 0 100.0%
er_sxf_data_bean 0 42.6s peer done 18.10.13.197 node-1 18.10.13.198 node-2 n/a n/a 77 77 100.0% 77 523265300 523265300 100.0% 523265300 59 59 100.0%
er_sxf_data_bean 0 1.6m peer done 18.10.13.198 node-2 18.10.13.199 node-3 n/a n/a 124 124 100.0% 124 1980170906 1980170906 100.0% 1980170906 6829 6829 100.0%
.kibana_task_manager_7.17.6_001 0 975ms peer done 18.10.13.197 node-1 18.10.13.198 node-2 n/a n/a 7 7 100.0% 7 328782 328782 100.0% 328782 840 840 100.0%
.kibana_task_manager_7.17.6_001 0 35.1s peer done 18.10.13.198 node-2 18.10.13.199 node-3 n/a n/a 144 144 100.0% 144 93837421 93837421 100.0% 93837421 362051 362051 100.0%
iot_ele_log_bean 0 532ms peer done 18.10.13.199 node-3 18.10.13.197 node-1 n/a n/a 4 4 100.0% 4 7021 7021 100.0% 7021 0 0 100.0%
iot_ele_log_bean 0 534ms peer done 18.10.13.197 node-1 18.10.13.199 node-3 n/a n/a 4 4 100.0% 4 7021 7021 100.0% 7021 0 0 100.0%
.async-search 0 853ms peer done 18.10.13.197 node-1 18.10.13.198 node-2 n/a n/a 1 1 100.0% 1 257 257 100.0% 257 0 0 100.0%
.async-search 0 535ms peer done 18.10.13.199 node-3 18.10.13.197 node-1 n/a n/a 1 1 100.0% 1 253 253 100.0% 253 0 0 100.0%
iot_ele_data_202501 0 16ms peer done 18.10.13.197 node-1 18.10.13.198 node-2 n/a n/a 1 1 100.0% 1 226 226 100.0% 226 0 0 100.0%
iot_ele_data_202501 0 10ms empty_store done n/a n/a 18.10.13.197 node-1 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
iot_ele_data_202412 0 551ms peer done 18.10.13.197 node-1 18.10.13.198 node-2 n/a n/a 1 1 100.0% 1 226 226 100.0% 226 72 72 100.0%
iot_ele_data_202412 0 24ms empty_store done n/a n/a 18.10.13.197 node-1 n/a n/a 0 0 0.0% 0 0 0 0.0% 0 0 0 100.0%
Check the recovery progress (the files_percent and bytes_percent columns).
Make sure the shard state changes from RELOCATING to STARTED.
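While a large shard is moving, it is easier to watch only the recoveries that are still in progress; _cat/recovery supports active_only=true plus column selection for that (column names taken from the output above):
curl -X GET "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,stage,source_node,target_node,bytes_percent,translog_ops_percent"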
5. Confirm the relocation has finished with the GET _cat/shards?v API
curl -X GET "localhost:9200/_cat/shards?v"
index shard prirep state docs store ip node
iot_ele_base_202412 0 r RELOCATING 220751580 102.6gb 18.10.13.197 node-1 -> 18.10.13.199 Mp2Fk7MVQpG0hpMwWBA7qg node-3
iot_ele_base_202412 0 p STARTED 220751580 102.6gb 18.10.13.198 node-2
iot_ele_base_202501 0 p STARTED 109709708 51.3gb 18.10.13.197 node-1
iot_ele_base_202501 0 r STARTED 109709708 51.3gb 18.10.13.199 node-3
.kibana_task_manager_7.17.6_001 0 p STARTED 17 120.2mb 18.10.13.198 node-2
.kibana_task_manager_7.17.6_001 0 r STARTED 17 173.1mb 18.10.13.199 node-3
iot_ele_log_bean 0 p STARTED 8 52.6kb 18.10.13.197 node-1
iot_ele_log_bean 0 r STARTED 8 52.6kb 18.10.13.199 node-3
.async-search 0 p STARTED 0 257b 18.10.13.197 node-1
.async-search 0 r STARTED 0 257b 18.10.13.198 node-2
iot_ele_data_202501 0 r STARTED 109697802 53.1gb 18.10.13.198 node-2
iot_ele_data_202501 0 p STARTED 109697802 53.1gb 18.10.13.199 node-3
.kibana-event-log-7.17.6-000013 0 p STARTED 18.10.13.197 node-1
.kibana-event-log-7.17.6-000013 0 r STARTED 18.10.13.198 node-2
.ds-ilm-history-5-2024.11.06-000018 0 p STARTED 18.10.13.197 node-1
.ds-ilm-history-5-2024.11.06-000018 0 r STARTED 18.10.13.198 node-2
iot_ele_data_202412 0 p STARTED 220727742 105.5gb 18.10.13.197 node-1
iot_ele_data_202412 0 r STARTED 220727742 105.4gb 18.10.13.198 node-2
.kibana_7.17.6_001 0 p STARTED 475 2.4mb 18.10.13.198 node-2
.kibana_7.17.6_001 0 r STARTED 475 2.4mb 18.10.13.199 node-3
.ds-.logs-deprecation.elasticsearch-default-2025.01.05-000023 0 p STARTED 18.10.13.197 node-1
.ds-.logs-deprecation.elasticsearch-default-2025.01.05-000023 0 r STARTED 18.10.13.199 node-3
.kibana-event-log-7.17.6-000012 0 p STARTED 18.10.13.197 node-1
.kibana-event-log-7.17.6-000012 0 r STARTED 18.10.13.198 node-2
.apm-custom-link 0 r STARTED 0 226b 18.10.13.198 node-2
.apm-custom-link 0 p STARTED 0 226b 18.10.13.199 node-3
.apm-agent-configuration 0 p STARTED 0 226b 18.10.13.198 node-2
.apm-agent-configuration 0 r STARTED 0 226b 18.10.13.199 node-3
.kibana-event-log-7.17.6-000010 0 r STARTED 18.10.13.198 node-2
.kibana-event-log-7.17.6-000010 0 p STARTED 18.10.13.199 node-3
.ds-ilm-history-5-2024.10.07-000016 0 r STARTED 18.10.13.197 node-1
.ds-ilm-history-5-2024.10.07-000016 0 p STARTED 18.10.13.199 node-3
iot_ele_demo_bean 0 p STARTED 0 226b 18.10.13.197 node-1
iot_ele_demo_bean 0 r STARTED 0 226b 18.10.13.198 node-2
.kibana-event-log-7.17.6-000011 0 p STARTED 18.10.13.197 node-1
.kibana-event-log-7.17.6-000011 0 r STARTED 18.10.13.199 node-3
.tasks 0 r STARTED 4 21.6kb 18.10.13.197 node-1
.tasks 0 p STARTED 4 21.6kb 18.10.13.199 node-3
.ds-ilm-history-5-2025.01.05-000022 0 p STARTED 18.10.13.198 node-2
.ds-ilm-history-5-2025.01.05-000022 0 r STARTED 18.10.13.199 node-3
.ds-ilm-history-5-2024.12.06-000020 0 r STARTED 18.10.13.197 node-1
.ds-ilm-history-5-2024.12.06-000020 0 p STARTED 18.10.13.199 node-3
.ds-.logs-deprecation.elasticsearch-default-2024.12.06-000021 0 p STARTED 18.10.13.197 node-1
.ds-.logs-deprecation.elasticsearch-default-2024.12.06-000021 0 r STARTED 18.10.13.199 node-3
er_sxf_data_bean 0 p STARTED 4304100 6.1gb 18.10.13.198 node-2
er_sxf_data_bean 0 r STARTED 4304100 6.1gb 18.10.13.199 node-3
While the move is in progress the new copy on the target node shows as INITIALIZING and the moving copy as RELOCATING; once the state changes to STARTED, the relocation is complete.
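A quick way to check whether anything is still in flight is to filter the shard list for the transient states:
curl -s "http://localhost:9200/_cat/shards?v" | grep -E "RELOCATING|INITIALIZING"
// No output means nothing is currently relocating or initializing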
6. Verify the result
Check that the shards were relocated successfully and that the cluster status is green.
curl -X GET "http://localhost:9200/_cat/recovery?v"
curl -X GET "localhost:9200/_cat/shards?v"
curl -X GET "http://localhost:9200/_cluster/health"
curl -X GET "http://localhost:9200/_cat/allocation?v"
shards disk.indices disk.used disk.avail disk.total disk.percent host         ip           node
    15      267.4gb   271.5gb    228.3gb    499.9gb           54 18.10.13.198 18.10.13.198 node-2
    14      156.9gb   161.3gb    338.5gb    499.9gb           32 18.10.13.197 18.10.13.197 node-1
    17      213.5gb   217.5gb    282.3gb    499.9gb           43 18.10.13.199 18.10.13.199 node-3
7. Re-enable shard allocation
After the relocation completes, re-enable shard allocation.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
// curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
// {
//   "persistent": { "cluster.routing.allocation.enable": "all" }
// }'
// Verify
curl -XGET "127.0.0.1:9200/_cluster/settings?include_defaults=true&flat_settings=true" | grep cluster.routing.allocation.enable
{"persistent":{"cluster.routing.allocation.enable":"all"},"transient":{"cluster.routing.allocation.enable":"all"},
------------------------------- THE END -------------------------------