---
title: "Master Node Fault Detection Retry Count Setting"
date: 2026-01-02
lastmod: 2026-01-02
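description: "Reference for the setting that controls the retry count for master node fault detection"
tags: ["Cluster Configuration", "Fault Detection", "High Availability"]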
summary: "The cluster.fault_detection.leader_check.retry_count setting controls how many consecutive health checks may fail before the master node is declared failed and a new election is triggered."
---


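## Purpose

`cluster.fault_detection.leader_check.retry_count` sets how many consecutive health checks may fail before the master node is declared failed. Follower nodes send periodic health checks to the master node; only when the number of consecutive failures reaches this value is the master node considered failed and a new election triggered.

## Setting Type

This is a **static setting**: it must be set at startup, and a node restart is required for changes to take effect.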

## Default Value

```
3
```

## Required

**Optional** (a default value is provided)

## Value Range

```
Any positive integer (minimum 1)
```

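## Configuration Format

```yaml
# The entries below are alternatives; set the key only once in the config file.

# Default
cluster.fault_detection.leader_check.retry_count: 3

# Fast fault detection (high availability)
cluster.fault_detection.leader_check.retry_count: 2

# High fault tolerance (fewer false positives)
cluster.fault_detection.leader_check.retry_count: 5

# Unstable network
cluster.fault_detection.leader_check.retry_count: 7
```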

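## Related Settings

| Setting | Default | Description |
|-------|-------|------|
| `cluster.fault_detection.leader_check.interval` | 1s | Interval between checks |
| `cluster.fault_detection.leader_check.timeout` | 10s | Timeout for each check |
| `cluster.fault_detection.leader_check.retry_count` | 3 | Consecutive failed checks allowed before the master node is declared failed |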

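## How It Works

The retry counter works as follows:

```
┌─────────────────────────────────────────────────────────────────┐
│                   Fault detection retry flow                    │
└─────────────────────────────────────────────────────────────────┘

Start health check
    │
    ▼
Send check request
    │
    ├── Response received ──> reset failure count to 0 ──> wait for next check
    │
    └── No response (timeout)
         │
         ▼
    Failure count +1
         │
         ├── Failure count < retry_count
         │   │
         │   └── Wait for interval, then retry
         │
         └── Failure count >= retry_count
             │
             └── Declare master node failed
                 │
                 └── Trigger election
```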
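The flow above can be sketched as a small consecutive-failure counter. This is a minimal illustrative model, not the actual implementation; the class name `LeaderCheckTracker` and its `record` method are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LeaderCheckTracker:
    """Illustrative model of the consecutive-failure counter (hypothetical names)."""

    retry_count: int = 3
    failures: int = 0

    def record(self, check_succeeded: bool) -> bool:
        """Record one check result; return True when the master should be declared failed."""
        if check_succeeded:
            # Any successful response resets the consecutive-failure counter.
            self.failures = 0
            return False
        self.failures += 1
        # Failure is declared only once the counter reaches retry_count.
        return self.failures >= self.retry_count


tracker = LeaderCheckTracker(retry_count=3)
# Two failures, one success (counter resets), then three failures in a row.
results = [tracker.record(ok) for ok in (False, False, True, False, False, False)]
print(results)  # [False, False, False, False, False, True]
```

Only the final check triggers a failure decision, because the single success in the middle resets the counter.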

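## Calculating Fault Detection Time

The estimate below assumes each failed check fails quickly. When checks fail by timing out instead, each one may take up to `timeout` longer, so detection can take noticeably more time.

```
Total fault detection time ≈ interval × retry_count

Example 1: default values
interval = 1s, retry_count = 3
Detection time = 1s × 3 = 3s

Example 2: fast detection
interval = 300ms, retry_count = 2
Detection time = 300ms × 2 = 600ms

Example 3: high fault tolerance
interval = 1s, retry_count = 5
Detection time = 1s × 5 = 5s

Example 4: cross-region
interval = 3s, retry_count = 3
Detection time = 3s × 3 = 9s
```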
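The same arithmetic as a small helper, for trying other combinations. The function name `detection_time` is hypothetical, and the optional `timeout_s` argument is an assumption: it models the case where every failed check waits for the full timeout before failing, which the formula above does not cover.

```python
def detection_time(interval_s: float, retry_count: int, timeout_s: float = 0.0) -> float:
    """Approximate seconds until the master node is declared failed.

    With timeout_s = 0 this is the document's formula, interval × retry_count.
    A non-zero timeout_s adds the full timeout to every failed check
    (an assumed worst case, not measured behaviour).
    """
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")
    return (interval_s + timeout_s) * retry_count


print(detection_time(1, 3))      # default values: 3 seconds
print(detection_time(3, 3))      # cross-region example: 9 seconds
print(detection_time(1, 3, 10))  # assumed worst case with the 10s default timeout: 33 seconds
```

Under that worst-case assumption, `timeout` affects detection time far more than `retry_count` does at the default values.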

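## Impact of the Retry Count

### Small retry_count (2)

```
Pros:
  ✓ Fast fault detection
  ✓ Fast failover
  ✓ Suits high-availability scenarios

Cons:
  ✗ Higher risk of false positives
  ✗ Network jitter alone can cause a false positive
  ✗ May trigger elections frequently
```

### Large retry_count (5-7)

```
Pros:
  ✓ High fault tolerance
  ✓ Fewer false positives
  ✓ Suits unstable networks

Cons:
  ✗ Slow fault detection
  ✗ Longer recovery time
  ✗ Problems may be discovered late
```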

## Usage Scenarios

### 1. Default configuration (recommended)

```yaml
cluster.fault_detection.leader_check.retry_count: 3
```

Suitable for most clusters.

### 2. High-availability systems

```yaml
cluster.fault_detection.leader_check.interval: 300ms
cluster.fault_detection.leader_check.retry_count: 2
```

**Use cases:**
- Financial trading systems
- Business-critical systems
- Workloads that need fast failover

### 3. Unstable networks

```yaml
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 7
```

**Use cases:**
- Highly variable network conditions
- Cross-region deployments
- Avoiding false positives

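### 4. Stable production environments

```yaml
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.retry_count: 4
```

**Use cases:**
- Stable internal networks
- Tolerating occasional transient failures
- Balancing availability and stability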

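## Recommended Settings

| Scenario | interval | retry_count | Detection time | Notes |
|-----|----------|-------------|----------|------|
| High availability | 300ms | 2 | 600ms | Fast detection |
| Default | 1s | 3 | 3s | Standard configuration |
| Stable | 1s | 4 | 4s | Higher fault tolerance |
| Unstable network | 2-3s | 5-7 | 10-21s | Avoids false positives |
| Cross-region | 3s | 3-5 | 9-15s | Accounts for latency |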
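
The table can also be read in the other direction: given a check interval and a detection-time budget, find the largest `retry_count` that still fits. The sketch below uses the same interval × retry_count approximation; the function name and the millisecond units are assumptions made for illustration.

```python
def max_retry_count(interval_ms: int, budget_ms: int) -> int:
    """Largest retry_count whose approximate detection time fits within budget_ms.

    Uses integer milliseconds to avoid floating-point rounding. The result is
    never below 1, the minimum allowed value of the setting.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return max(1, budget_ms // interval_ms)


print(max_retry_count(1000, 5000))   # 1s interval, 5s budget
print(max_retry_count(300, 600))     # 300ms interval, 600ms budget
print(max_retry_count(3000, 10000))  # 3s interval, 10s budget
```

A larger result is not automatically better; it only marks the upper bound for the chosen budget.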

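## Example Cases

### Case 1: Network jitter causes a false positive

```
Configuration: retry_count=2
20:00:00 - check 1 fails
20:00:01 - check 2 fails → election triggered (false positive)

Adjusted: retry_count=5
20:00:00 - check 1 fails
20:00:01 - check 2 fails
20:00:02 - check 3 succeeds → back to normal
```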
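
Case 1 can be replayed with a short simulation: the same sequence of check results (fail, fail, succeed) is evaluated against both `retry_count` values. The helper name `first_election` is hypothetical, and the sketch models only the counter, not real timing.

```python
from typing import Optional, Sequence


def first_election(outcomes: Sequence[bool], retry_count: int) -> Optional[int]:
    """Return the 1-based index of the check that triggers an election, or None."""
    failures = 0
    for index, succeeded in enumerate(outcomes, start=1):
        # A success resets the counter; a failure increments it.
        failures = 0 if succeeded else failures + 1
        if failures >= retry_count:
            return index
    return None


jitter = [False, False, True]  # two failed checks, then a successful one
print(first_election(jitter, retry_count=2))  # 2
print(first_election(jitter, retry_count=5))  # None
```

With `retry_count=2` the second failed check triggers an election; with `retry_count=5` the brief jitter never reaches the threshold.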

### Case 2: Detecting a real failure

```
Configuration: interval=1s, retry_count=3
10:00:00 - check 1 fails
10:00:01 - check 2 fails
10:00:02 - check 3 fails → election triggered
10:00:05 - new master node elected
Total outage: 5 seconds
```

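## Monitoring

```bash
# View the current settings (include_defaults also returns values not set through the settings API)
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.fault_detection.leader_check.*

# Check which node is currently the master
GET /_cat/nodes?v&h=name,ip,master

# Check cluster health
GET /_cluster/health

# Search the logs for fault detection events
# grep "leader.*failed.*consecutive checks" /path/to/logs
```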

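## Troubleshooting

### Frequent false positives

1. Check network stability
2. Increase retry_count
3. Consider increasing interval

### Slow fault detection

1. Check whether retry_count is too large
2. Consider reducing interval and retry_count
3. Check network latency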

### Balanced configuration suggestions

```yaml
# The groups below are alternatives; use only one of them.

# Stable environment: balance speed and fault tolerance
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.retry_count: 3

# High availability: fast detection
cluster.fault_detection.leader_check.interval: 300ms
cluster.fault_detection.leader_check.retry_count: 2

# Unstable environment: high fault tolerance
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 6
```

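## Notes

1. **Static setting**: changes require a node restart
2. **Works together with interval**: fault detection time ≈ interval × retry_count
3. **Network environment**: unstable networks need a larger retry_count
4. **Business requirements**: use a smaller value for high-availability scenarios
5. **False-positive risk**: a value that is too small can cause frequent false positives
6. **Detection delay**: a value that is too large delays fault discovery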