---
title: "Per-Node Heartbeat Connection Count"
date: 2026-01-15
lastmod: 2026-01-15
description: "Reference for the setting that controls the number of heartbeat connections per node"
tags: ["transport settings", "fault detection", "node health"]
summary: "transport.connections_per_node.ping controls how many connections each node opens to every other node for heartbeat (ping) traffic. It is a static, optional setting with a default of 1, and the default is recommended for every cluster. Related settings: cluster.fault_detection.follower_check.timeout (10s) and cluster.fault_detection.leader_check.timeout (10s)."
---


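## Purpose

`transport.connections_per_node.ping` controls how many connections each node keeps open to every other node in the cluster for heartbeat (ping) traffic. The fault detection mechanism uses these connections to check periodically whether a node is still alive. Heartbeats carry the least data of any connection type, yet they are the most critical, because they directly affect how quickly the cluster detects failures.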

## Setting type

This is a **static setting**: it must be set at startup, and a change takes effect only after the node is restarted.

## Default value

```
1
```

## Required

**Optional** (a default value is provided)

## Value range

```
Any positive integer (minimum 1)
```

## Configuration format

```yaml
# Two alternatives; set the key only once in the configuration file.

# Default (recommended)
transport.connections_per_node.ping: 1

# More heartbeat connections (not recommended)
transport.connections_per_node.ping: 2
```

## Related settings

| Setting | Default | Description |
|-------|-------|------|
| `transport.connections_per_node.ping` | 1 | Number of heartbeat connections |
| `cluster.fault_detection.follower_check.timeout` | 10s | Timeout for checks on follower nodes |
| `cluster.fault_detection.leader_check.timeout` | 10s | Timeout for checks on the master node |

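## How it works

How the heartbeat connection is used:

```
┌─────────────────────────────────────────────────────────────────┐
│                   Heartbeat connection usage                    │
└─────────────────────────────────────────────────────────────────┘

Fault detection
    │
    ├── Master checks follower nodes
    │   ├── Sends heartbeats periodically
    │   ├── Uses the ping connection
    │   └── Expects a response
    │
    └── Follower nodes check the master
        ├── Send heartbeats periodically
        ├── Use the ping connection
        └── Expect a response

Heartbeat characteristics:
    - Very small packets
    - Sent periodically (typically on the order of seconds)
    - Timeout-based detection (typically 10 seconds)
    - Node is marked as failed once the checks fail

Effect of the connection count:
    - 1 connection is enough for heartbeats
    - Adding connections usually does not help
```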

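## Fault detection mechanism

```
Fault detection flow in detail:

1. Send a periodic heartbeat
    ↓
2. Wait for the response
    ├── Response received → node healthy
    └── Timeout, no response → possible failure
    ↓
3. Retry to confirm
    ├── Retry succeeds → false alarm, node healthy
    └── Retries fail → failure confirmed
    ↓
4. Handle the failure
    ├── Remove the node from the cluster
    ├── Reallocate shards
    └── Possibly trigger an election

Key settings:
    - ping connections: 1 (usually enough)
    - Check timeout: 10s
    - Retry count: 3
```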
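The retry step in the flow above can be illustrated with a small toy model. This is only a sketch of the consecutive-failure idea, not the cluster's actual implementation; the `FollowerChecker` class and its `record` method are hypothetical names introduced for illustration, and the default of 3 mirrors the retry count listed above.

```python
from dataclasses import dataclass


@dataclass
class FollowerChecker:
    """Toy model of a health checker that tolerates a few consecutive failures."""

    retry_count: int = 3          # consecutive failures needed to confirm a fault
    consecutive_failures: int = 0
    node_failed: bool = False

    def record(self, check_succeeded: bool) -> bool:
        """Record one check result; return True once the node counts as failed."""
        if self.node_failed:
            return True
        if check_succeeded:
            # Any success resets the counter: the earlier timeout was a false alarm.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.retry_count:
                self.node_failed = True
        return self.node_failed


checker = FollowerChecker()
# Two timeouts followed by a success: the node stays in the cluster.
transient = [checker.record(ok) for ok in (False, False, True)]
# Three timeouts in a row: the node is marked as failed.
outage = [checker.record(ok) for ok in (False, False, False)]
print(transient, outage)
```

In this model a single successful response clears the failure count, which is why short network glitches do not remove a node, while the number of heartbeat connections plays no role in the outcome.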

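## Use cases

### 1. Default configuration (strongly recommended)

```yaml
transport.connections_per_node.ping: 1
```

Suitable for every cluster configuration; a single heartbeat connection is sufficient.

### 2. Configuration to avoid

```yaml
transport.connections_per_node.ping: 2
```

**Why it is not recommended:**
- Heartbeat traffic is tiny
- 1 connection fully covers the need
- Extra connections bring no practical benefit
- They only consume more resources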

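## Recommended values

| Cluster size | Network conditions | Recommended value | Notes |
|---------|---------|-------|------|
| Any size | Any | 1 | The default is fine |
| Large cluster | Any | 1 | No need to increase |
| Cross-region cluster | Any | 1 | No need to increase |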

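## Why the default is 1

```
Why a single heartbeat connection is enough:

1. Tiny payload
    - A heartbeat packet is only a few dozen bytes
    - 1 connection has far more bandwidth than needed

2. Low frequency
    - Typically sent once per second
    - No high concurrency required

3. One-way checks
    - No need for several heartbeats in flight at once
    - Sending them sequentially is enough

4. Timeout-based detection
    - The timeout is typically 10 seconds
    - Network latency matters far more than the connection count

Conclusion: adding connections has no practical effect
```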
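The bandwidth argument can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a 50-byte packet (standing in for "a few dozen bytes") and a 1-second interval; both figures are illustrative assumptions, and `heartbeat_bytes_per_second` is a hypothetical helper, not part of any product API.

```python
def heartbeat_bytes_per_second(peer_count: int, packet_bytes: int = 50,
                               interval_s: float = 1.0) -> float:
    """Outbound heartbeat traffic from one node that pings `peer_count` peers."""
    if peer_count < 0 or packet_bytes < 0 or interval_s <= 0:
        raise ValueError("peer_count and packet_bytes must be >= 0, interval_s > 0")
    return peer_count * packet_bytes / interval_s


# Example: a master checking 99 followers in a 100-node cluster.
rate = heartbeat_bytes_per_second(peer_count=99)
print(f"{rate:.0f} B/s (~{rate * 8 / 1000:.1f} kbit/s)")
```

Even under these assumptions the total stays in the range of a few kilobytes per second, far below what one connection per peer can carry.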

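## Heartbeat-related settings

```
Fault detection settings:

1. transport.connections_per_node.ping
   - Number of heartbeat connections
   - Default: 1
   - Usually needs no change

2. cluster.fault_detection.follower_check.timeout
   - Timeout for checks on follower nodes
   - Default: 10s
   - Affects failure detection speed

3. cluster.fault_detection.leader_check.timeout
   - Timeout for checks on the master node
   - Default: 10s
   - Affects failure detection speed

4. cluster.fault_detection.follower_check.retry_count
   - Number of retries
   - Default: 3
   - Guards against false positives

Recommendations:
    - Keep the ping connection count at 1
    - Tune the timeouts to the network conditions
    - Tune the retry count to the cluster's stability
```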

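## Effect on failure detection performance

```
Timeout vs. failure detection speed:

Short timeout (5s):
    Pros:
        ✓ Failures are detected quickly
        ✓ Service recovers quickly

    Cons:
        ✗ Prone to false positives
        ✗ Network jitter triggers false alarms
        ✗ May cause unnecessary shard relocation

Long timeout (30s):
    Pros:
        ✓ Fewer false positives
        ✓ Tolerates network jitter
        ✓ More stable

    Cons:
        ✗ Failures are detected slowly
        ✗ Service recovers slowly
        ✗ May hurt availability

Recommended:
    - Stable network: 5-10s
    - Average network: 10-15s
    - Unstable network: 15-30s
```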
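To make the trade-off concrete, the following sketch estimates an upper bound on the time needed to declare an unresponsive node failed. It uses a deliberately simplified model: every check runs to its full timeout, and the checker waits one interval between attempts. The 1-second interval and the retry count of 3 are taken from the typical values mentioned in this document; `worst_case_detection_seconds` is a hypothetical helper, and a node whose connection is dropped outright may well be detected sooner.

```python
def worst_case_detection_seconds(timeout_s: float, retry_count: int = 3,
                                 interval_s: float = 1.0) -> float:
    """Upper bound on the time to declare an unresponsive node failed.

    Simplified model: each of the `retry_count` checks runs to its full
    timeout, with one `interval_s` pause between consecutive checks.
    """
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")
    return retry_count * timeout_s + (retry_count - 1) * interval_s


for timeout in (5, 10, 30):
    bound = worst_case_detection_seconds(timeout)
    print(f"timeout={timeout}s -> up to {bound:.0f}s before the node is removed")
```

Under this model, moving the timeout from 10s to 30s roughly triples the worst-case detection time, which is the cost paid for tolerating network jitter.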

## Configuration examples

```yaml
# The three scenarios are alternatives; use only one of them per configuration file.

# Scenario 1: standard cluster (recommended)
cluster.name: prod-cluster
transport.connections_per_node.ping: 1
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.leader_check.timeout: 10s

# Scenario 2: fast-failover cluster
cluster.name: fast-failover-cluster
transport.connections_per_node.ping: 1
cluster.fault_detection.follower_check.timeout: 5s
cluster.fault_detection.leader_check.timeout: 5s

# Scenario 3: stability-first cluster
cluster.name: stable-cluster
transport.connections_per_node.ping: 1
cluster.fault_detection.follower_check.timeout: 15s
cluster.fault_detection.leader_check.timeout: 15s
```
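A configuration like the ones above can be sanity-checked before deployment. The sketch below is a hypothetical lint helper, not a tool shipped with the product: it takes the settings as a flat Python mapping, and the 5s and 30s thresholds are simply the ends of the timeout range recommended in this document.

```python
import re

_DURATION = re.compile(r"^(\d+)(ms|s|m)$")
_TO_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def parse_duration_seconds(value: str) -> float:
    """Convert a duration such as '10s' or '500ms' to seconds."""
    match = _DURATION.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _TO_SECONDS[unit]


def lint_fault_detection(settings: dict) -> list:
    """Return a list of warnings for a flat settings mapping (hypothetical helper)."""
    warnings = []
    ping = int(settings.get("transport.connections_per_node.ping", 1))
    if ping < 1:
        warnings.append("ping connections must be at least 1")
    elif ping > 1:
        warnings.append("ping connections above 1 bring no benefit")
    for role in ("follower_check", "leader_check"):
        key = f"cluster.fault_detection.{role}.timeout"
        # Missing timeouts fall back to the documented default of 10s.
        seconds = parse_duration_seconds(str(settings.get(key, "10s")))
        if seconds < 5:
            warnings.append(f"{key} below 5s risks false positives")
        elif seconds > 30:
            warnings.append(f"{key} above 30s delays failure detection")
    return warnings


fast_failover = {
    "transport.connections_per_node.ping": 1,
    "cluster.fault_detection.follower_check.timeout": "5s",
    "cluster.fault_detection.leader_check.timeout": "5s",
}
risky = {
    "transport.connections_per_node.ping": 2,
    "cluster.fault_detection.leader_check.timeout": "3s",
}
print(lint_fault_detection(fast_failover))
print(lint_fault_detection(risky))
```

The fast-failover scenario passes without warnings because 5s is still inside the recommended range, while the second mapping is flagged for both the extra ping connection and the very short leader check timeout.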

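## Monitoring

```bash
# Show the configured value on each node
GET /_nodes/settings?filter_path=nodes.*.settings.transport.connections_per_node.ping

# Show the fault detection settings (defaults included)
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.fault_detection*

# Show node status
GET /_cat/nodes?v

# Show cluster health
GET /_cluster/health

# Show transport connection statistics
GET /_nodes/stats/transport
```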
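Once the node settings have been fetched, the response can be scanned for nodes that deviate from the default. The sketch below assumes the response body has the shape `nodes.<id>.settings.transport.connections_per_node.ping` with nested settings; the sample payload, node IDs, and the `ping_connections_by_node` helper are all hypothetical and serve only to illustrate the parsing.

```python
def ping_connections_by_node(nodes_settings: dict, default: int = 1) -> dict:
    """Map node name -> configured ping connection count.

    Nodes that do not set the value explicitly fall back to `default`.
    """
    result = {}
    for node_id, node in nodes_settings.get("nodes", {}).items():
        per_node = (node.get("settings", {})
                        .get("transport", {})
                        .get("connections_per_node", {}))
        name = node.get("name", node_id)
        result[name] = int(per_node.get("ping", default))
    return result


# Hypothetical response body, trimmed to the relevant fields.
sample = {
    "nodes": {
        "aBcD": {"name": "node-1",
                 "settings": {"transport": {"connections_per_node": {"ping": "2"}}}},
        "eFgH": {"name": "node-2", "settings": {}},
    }
}
counts = ping_connections_by_node(sample)
non_default = sorted(name for name, n in counts.items() if n != 1)
print(counts, non_default)
```

A non-empty `non_default` list points to nodes whose configuration files override the recommended value of 1.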

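## Troubleshooting

```
Nodes are frequently marked as failed:

1. Check network stability
   ping -c 100 <node>
   # If packets are lost, fix the network

2. Check system load
   top, htop
   # If CPU or memory is saturated, reduce the load

3. Check garbage collection
   # If GC runs frequently, tune the JVM

4. Check the timeout settings
   GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.fault_detection*
   # If the timeouts are too short, increase them

5. Check the heartbeat connections
   GET /_nodes/stats/transport
   # Confirm that heartbeats are flowing normally

Remedies:
    - Increase the fault detection timeouts
    - Improve network connectivity
    - Reduce JVM pressure
    - Do not increase the ping connection count
```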

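## Notes

1. **Static setting**: changes require a node restart
2. **The default is optimal**: 1 connection fully covers the need
3. **Leave it unchanged**: extra connections bring no practical benefit
4. **Critical to fault detection**: it affects cluster stability
5. **Tune the timeouts instead**: control detection speed through the timeout settings
6. **Risk of false positives**: timeouts that are too short easily cause misjudgments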