Per-Node Heartbeat (Ping) Connection Count

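Purpose

The transport.connections_per_node.ping setting controls how many connections each node opens to every other node in the cluster for heartbeat (ping) checks. These connections serve the fault detection mechanism, which periodically checks whether a node is still alive. Ping connections carry the least data of any connection type but are the most critical, because fault detection speed depends on them.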

Setting Type

This is a static setting. It must be set at startup, and a node must be restarted for a change to take effect.

Default Value

1

Required

Optional (a default value is provided)

Valid Range

Any positive integer (minimum 1)

Configuration Format

# Default configuration
transport.connections_per_node.ping: 1

# Increase the number of heartbeat connections (not recommended)
transport.connections_per_node.ping: 2

# Keep the default
transport.connections_per_node.ping: 1

Setting	Default	Description
`transport.connections_per_node.ping`	1	Number of heartbeat connections
`cluster.fault_detection.follower_check.timeout`	10s	Timeout for follower node checks
`cluster.fault_detection.leader_check.timeout`	10s	Timeout for leader (master) node checks

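How It Works

How heartbeat connections are used:

┌─────────────────────────────────────────────────────────────────┐
│           Where heartbeat (ping) connections are used           │
└─────────────────────────────────────────────────────────────────┘

Fault detection mechanism
    │
    ├── Leader checks follower nodes
    │   ├── Sends heartbeats periodically
    │   ├── Uses the ping connection
    │   └── Expects a response
    │
    └── Follower nodes check the leader
        ├── Send heartbeats periodically
        ├── Use the ping connection
        └── Expect a response

Heartbeat characteristics:
    - Very small packets
    - Sent periodically (typically every few seconds or less)
    - Timeout-based detection (typically 10 seconds)
    - The node is marked as failed after repeated failures

Effect of the connection count:
    - 1 connection is enough for heartbeats
    - Adding connections usually does not help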
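Fault Detection Mechanism

Fault detection flow in detail:

1. Send heartbeats periodically
    ↓
2. Wait for a response
    ├── Normal response → node is healthy
    └── Timeout, no response → possible failure
    ↓
3. Retry to confirm
    ├── Retry succeeds → false alarm, node is healthy
    └── Retries fail → failure confirmed
    ↓
4. Handle the failure
    ├── Remove the node from the cluster
    ├── Reallocate its shards
    └── Possibly trigger a master election

Key settings:
    - Ping connections: 1 (normally sufficient)
    - Check timeout: 10s
    - Retry count: 3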
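The timeout and retry count together determine roughly how long an unresponsive node can go unnoticed. The following is a minimal sketch of that arithmetic, assuming a simplified model in which every consecutive check waits for the full timeout and checks are spaced by a fixed interval. The function name and the `interval_s` parameter (assumed to be 1 second) are illustrative only, and the real scheduler may behave differently.

```python
def worst_case_detection_seconds(timeout_s: float, retry_count: int,
                                 interval_s: float = 1.0) -> float:
    """Rough upper bound on the time needed to confirm that a node has failed.

    Simplified model (an assumption): each of the `retry_count` consecutive
    checks runs into the full timeout, and `interval_s` elapses between checks.
    """
    if timeout_s <= 0 or interval_s < 0 or retry_count < 1:
        raise ValueError("need timeout_s > 0, interval_s >= 0, retry_count >= 1")
    return float(retry_count * timeout_s + (retry_count - 1) * interval_s)


# Defaults listed above: 10 s timeout, 3 retries (1 s interval is assumed).
print(worst_case_detection_seconds(10, 3))  # 32.0
```

Under this model, the timeout dominates the detection window. The number of ping connections does not appear in the formula at all.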

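Usage Scenarios

1. Default configuration (strongly recommended)

transport.connections_per_node.ping: 1

Suitable for every cluster setup; 1 heartbeat connection is fully sufficient.

2. Configuration that is not recommended

transport.connections_per_node.ping: 2

Why it is not recommended:

- Heartbeat traffic is tiny
- 1 connection fully meets the need
- Adding connections brings no practical benefit
- It only increases resource consumption

Cluster size	Network conditions	Recommended value	Notes
Any size	Any	1	The default is fine
Large cluster	Any	1	No need to increase
Cross-region cluster	Any	1	No need to increase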

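Why the Default Is 1

Why a single heartbeat connection is enough:

1. Tiny data volume
    - A heartbeat packet is only a few dozen bytes
    - One connection has far more bandwidth than needed

2. Low frequency
    - Typically sent once per second
    - No high concurrency is required

3. One-way checks
    - Each check is a single request and response, so multiple concurrent heartbeats are not needed
    - Sending them sequentially is enough

4. Timeout-based detection
    - The timeout is typically 10 seconds
    - Network latency matters far more than the connection count

Conclusion: increasing the connection count has no practical benefit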
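To put the "tiny data volume" point into numbers, here is a back-of-the-envelope sketch. It assumes 64-byte heartbeat messages (a stand-in for "a few dozen bytes"), one check per second, and that only leader-to-follower and follower-to-leader checks take place. The function and its parameters are hypothetical and only illustrate the arithmetic.

```python
def heartbeat_bytes_per_second(node_count: int, packet_bytes: int = 64,
                               interval_s: float = 1.0) -> float:
    """Estimate total heartbeat traffic across the whole cluster.

    Assumed model: the leader checks each of the (node_count - 1) followers and
    each follower checks the leader, so 2 * (node_count - 1) checks run per
    interval; every check is one request plus one response of `packet_bytes`.
    """
    if node_count < 1 or packet_bytes <= 0 or interval_s <= 0:
        raise ValueError("node_count >= 1, packet_bytes > 0, interval_s > 0 required")
    checks_per_interval = 2 * (node_count - 1)
    messages_per_interval = 2 * checks_per_interval  # request + response
    return messages_per_interval * packet_bytes / interval_s


print(heartbeat_bytes_per_second(100))  # 25344.0
```

Even for a 100-node cluster, the estimate is about 25 KB/s in total, most of it concentrated on the leader. A single connection per node pair handles that easily.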

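Heartbeat-Related Settings

Fault detection settings overview:

1. transport.connections_per_node.ping
   - Number of heartbeat connections
   - Default: 1
   - Usually does not need to be changed

2. cluster.fault_detection.follower_check.timeout
   - Timeout for follower node checks
   - Default: 10s
   - Affects fault detection speed

3. cluster.fault_detection.leader_check.timeout
   - Timeout for leader (master) node checks
   - Default: 10s
   - Affects fault detection speed

4. cluster.fault_detection.follower_check.retry_count
   - Number of retries
   - Default: 3
   - Guards against false positives

Recommendations:
    - Keep the ping connection count at 1
    - Adjust the timeouts to match network conditions
    - Adjust the retry count to match cluster stability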

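Fault Detection Performance Impact

Timeout vs. fault detection speed:

Short timeout (5s):
    Pros:
        ✓ Failures are detected quickly
        ✓ Service recovers quickly

    Cons:
        ✗ Prone to false positives
        ✗ Network jitter triggers false alarms
        ✗ May cause unnecessary shard relocation

Long timeout (30s):
    Pros:
        ✓ Fewer false positives
        ✓ Tolerates network jitter
        ✓ More stable

    Cons:
        ✗ Failures are detected slowly
        ✗ Service recovers slowly
        ✗ May hurt availability

Recommended values:
    - Stable network: 5-10s
    - Average network: 10-15s
    - Unstable network: 15-30s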
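The recommended ranges above can be turned into a small sanity check for a planned timeout. This is only a sketch: the network class names (`stable`, `average`, `unstable`) and the `assess_timeout` helper are made up for illustration, and the ranges simply restate the recommendations above.

```python
# Recommended timeout ranges in seconds, keyed by an illustrative network class.
RECOMMENDED_TIMEOUT_S = {
    "stable": (5, 10),
    "average": (10, 15),
    "unstable": (15, 30),
}


def assess_timeout(network: str, timeout_s: float) -> str:
    """Compare a planned fault detection timeout with the recommended range."""
    if network not in RECOMMENDED_TIMEOUT_S:
        raise ValueError(f"unknown network class: {network!r}")
    low, high = RECOMMENDED_TIMEOUT_S[network]
    if timeout_s < low:
        return "too short: higher risk of false positives"
    if timeout_s > high:
        return "too long: slower failure detection"
    return "within recommended range"


print(assess_timeout("unstable", 5))
print(assess_timeout("stable", 10))
```

A check like this could run in a configuration review step, before a timeout change is rolled out to the cluster.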

Configuration Examples

# Scenario 1: standard cluster (recommended)
cluster.name: prod-cluster
transport.connections_per_node.ping: 1
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.leader_check.timeout: 10s

# Scenario 2: fast-detection cluster
cluster.name: fast-failover-cluster
transport.connections_per_node.ping: 1
cluster.fault_detection.follower_check.timeout: 5s
cluster.fault_detection.leader_check.timeout: 5s

# Scenario 3: stability-first cluster
cluster.name: stable-cluster
transport.connections_per_node.ping: 1
cluster.fault_detection.follower_check.timeout: 15s
cluster.fault_detection.leader_check.timeout: 15s

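Monitoring Recommendations

# View the current setting (node settings are nested under "settings"; only explicitly configured values appear)
GET /_nodes/settings?filter_path=nodes.*.settings.transport.connections_per_node.ping

# View the fault detection settings (only explicitly set values are returned; add include_defaults=true to see defaults)
GET /_cluster/settings?filter_path=*.cluster.fault_detection*

# View node health
GET /_cat/nodes?v

# View cluster health
GET /_cluster/health

# View connection statistics
GET /_nodes/stats/transport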
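The transport statistics response can be condensed into one row per node for quick comparison. The sketch below assumes the response has already been fetched and parsed into a dictionary, and that it follows the usual `nodes.<id>.transport` layout with `server_open`, `rx_count`, and `tx_count` counters. The `summarize_transport` helper and the sample data are hypothetical. These counters cover all transport traffic, not only ping connections.

```python
from typing import Any, Dict, List


def summarize_transport(stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Reduce a parsed /_nodes/stats/transport response to one row per node."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        transport = node.get("transport", {})
        rows.append({
            "node": node.get("name", node_id),  # fall back to the node id
            "server_open": transport.get("server_open", 0),
            "rx_count": transport.get("rx_count", 0),
            "tx_count": transport.get("tx_count", 0),
        })
    return sorted(rows, key=lambda row: row["node"])


# Hypothetical sample response, trimmed to the fields used above.
sample = {
    "nodes": {
        "abc123": {"name": "node-1",
                   "transport": {"server_open": 26, "rx_count": 1500, "tx_count": 1498}},
        "def456": {"name": "node-2",
                   "transport": {"server_open": 26, "rx_count": 900, "tx_count": 905}},
    }
}
for row in summarize_transport(sample):
    print(row)
```

A node whose `rx_count` or `tx_count` stops increasing between two samples is worth a closer look.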

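Troubleshooting

Nodes are frequently marked as failed:

1. Check network stability
   ping -c 100 <node>
   # If packets are lost, fix the network

2. Check system load
   top, htop
   # If CPU or memory is saturated, reduce the load

3. Check garbage collection
   # If GC runs frequently, tune the JVM

4. Check the timeout settings
   GET /_cluster/settings?filter_path=*.cluster.fault_detection*
   # If the timeouts are too short, increase them

5. Check the heartbeat connections
   GET /_nodes/stats/transport
   # Confirm that heartbeats are flowing normally

Remedies:
    - Increase the fault detection timeouts
    - Improve network connectivity
    - Reduce JVM pressure
    - Do not increase the ping connection count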
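For the network stability check in step 1, the packet loss figure can be pulled out of the `ping` summary automatically. This is a minimal sketch that assumes the common "N% packet loss" summary line printed by Linux and macOS `ping`. The helper name and the sample output are illustrative.

```python
import re
from typing import Optional

# Matches summary fragments such as "2% packet loss" or "0.5% packet loss".
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def packet_loss_percent(ping_output: str) -> Optional[float]:
    """Return the packet loss percentage from ping output, or None if absent."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


# Illustrative summary line, not captured from a real host.
sample_output = "100 packets transmitted, 98 received, 2% packet loss, time 99123ms"
print(packet_loss_percent(sample_output))  # 2.0
```

Any persistent non-zero loss between cluster nodes points to the network, rather than the ping connection count, as the thing to fix.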

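Notes

- Static setting: changes require a node restart
- The default is optimal: 1 connection fully meets the need
- Do not change it: adding connections brings no practical benefit
- Critical to fault detection: heartbeat checks affect cluster stability
- Tune the timeouts: control detection speed through the timeout settings
- False-positive risk: timeouts that are too short easily lead to misjudged failures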

Tags

Transport configuration, fault detection, node health