---
title: "Cluster Fault Detection: Follower Check Timeout"
date: 2026-01-13
lastmod: 2026-01-13
description: "The cluster.fault_detection.follower_check.timeout setting controls how long the Leader node waits for a response to each health check request it sends to a Follower node, so that node state is judged accurately."
tags: ["cluster", "fault detection", "timeout", "follower", "high availability"]
summary: "The cluster.fault_detection.follower_check.timeout setting defines how long the Leader node waits for a response after sending a health check request to each Follower node. If no response arrives within this time, the check counts as a failure."
---


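## What the setting does

`cluster.fault_detection.follower_check.timeout` defines how long the Leader node waits for a response after sending a health check request to each Follower node.

When the Leader needs to confirm that a Follower is still healthy and reachable, it sends a check request. If no response arrives within the timeout, that check counts as a failure.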

## Setting properties

- **Setting path**: `cluster.fault_detection.follower_check.timeout`
- **Data type**: `TimeValue` (a duration)
- **Default**: `10s` (10 seconds)
- **Minimum**: `1ms` (1 millisecond)
- **Optional**: Yes
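
The properties above can be turned into a small validation helper. The sketch below is illustrative only: `parse_time_value` and `validate_follower_check_timeout` are hypothetical names, the accepted unit suffixes (`ms`, `s`, `m`, `h`) are an assumption, and none of this is Easysearch's actual parser.

```python
import re

# Unit multipliers in milliseconds; this unit set is an assumption for illustration.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}

MIN_TIMEOUT_MS = 1           # minimum from the property list (1ms)
DEFAULT_TIMEOUT_MS = 10_000  # default from the property list (10s)


def parse_time_value(text: str) -> int:
    """Convert a string such as '10s' or '500ms' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_MS[unit]


def validate_follower_check_timeout(text=None) -> int:
    """Return the timeout in ms, applying the default and the 1ms minimum."""
    if text is None:  # the setting is optional, so fall back to the default
        return DEFAULT_TIMEOUT_MS
    value_ms = parse_time_value(text)
    if value_ms < MIN_TIMEOUT_MS:
        raise ValueError(f"timeout must be at least 1ms, got {text!r}")
    return value_ms
```

Calling `validate_follower_check_timeout()` with no argument returns the default, while a value such as `"0ms"` is rejected because it falls below the documented minimum.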

## Setting details

## Timeout mechanism

```
Leader node                Follower node
    │                            │
    │ ────── health check ──────>│
    │                            │
    │  (wait up to timeout)      │
    │                            │
    │ <────── response ──────────│  within timeout
    │  check succeeds,           │
    │  failure count reset       │
    │                            │
    │ ────── health check ──────>│
    │                            │
    │  (no response)             │  timeout exceeded
    │                            │
    │  failure count +1          │  no response received
```

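## How it works

1. **Send the check**: The Leader sends a health check request to the Follower.
2. **Wait for the response**: The Leader waits up to `timeout` for the Follower to respond.
3. **Judge the timeout**: If no response arrives within `timeout`, the check counts as a failure.
4. **Accumulate failures**: When the number of consecutive failures reaches `retry_count`, the node is marked as faulty.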
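
The four steps can be sketched as a small simulation. This is a minimal model of the counting logic described above, not the actual implementation; `first_faulty_check` and its input format are hypothetical.

```python
def first_faulty_check(response_times, timeout=10.0, retry_count=3):
    """Return the 1-based index of the check that marks the node faulty, or None.

    response_times holds one entry per check: seconds until the Follower
    answered, or None if it never answered.
    """
    consecutive_failures = 0
    for index, elapsed in enumerate(response_times, start=1):
        if elapsed is not None and elapsed <= timeout:
            consecutive_failures = 0   # a response within timeout resets the counter
        else:
            consecutive_failures += 1  # a late or missing response counts as a failure
            if consecutive_failures >= retry_count:
                return index
    return None
```

For example, `first_faulty_check([0.2, None, None, 0.3, None, None, None])` returns `7`: the successful fourth check resets the counter, so three further consecutive failures are needed before the node is marked faulty.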

## Relationship to other settings

```
fault confirmation time = timeout × retry_count

With the default configuration:
timeout = 10s, retry_count = 3
fault confirmation time = 10s × 3 = 30s
```

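## Configuration recommendations

## Production (standard)

```yaml
cluster.fault_detection.follower_check.timeout: 10s
```

**Recommendation**: Keep the default of `10s`. It suits most production environments and judges node state accurately.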

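## High-latency networks

```yaml
cluster.fault_detection.follower_check.timeout: 30s
```

**Recommendation**: Increase to `20s-60s`. Nodes separated by high network latency (for example, in cross-region deployments) need a longer timeout.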

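## Low-latency, stable networks

```yaml
cluster.fault_detection.follower_check.timeout: 5s
```

**Recommendation**: Reduce to `3s-5s`. On a LAN or in a low-latency cloud environment, a shorter timeout detects failures sooner.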

## Heavily loaded nodes

```yaml
cluster.fault_detection.follower_check.timeout: 15s
```

**Recommendation**: Increase moderately to `15s-20s`. Use this when nodes are often under heavy load and may respond slowly.

## Fast failure detection

```yaml
cluster.fault_detection.follower_check.timeout: 3s
```

**Recommendation**: Reduce to `2s-5s` and pair it with a smaller `retry_count` for fast failover.

## Code examples

## easysearch.yml configuration

```yaml
# Standard production configuration
cluster:
  fault_detection:
    follower_check:
      timeout: 10s
      interval: 1s
      retry_count: 3
```

## Cross-region deployment configuration

```yaml
cluster:
  fault_detection:
    follower_check:
      timeout: 30s       # cross-region links need a longer timeout
      interval: 2s       # longer check interval
      retry_count: 5     # more retries
```

## High-load environment configuration

```yaml
cluster:
  fault_detection:
    follower_check:
      timeout: 15s       # heavily loaded nodes may respond slowly
      interval: 1s
      retry_count: 4
```

## Related settings

| Setting | Purpose | Default |
|--------|------|--------|
| `cluster.fault_detection.follower_check.interval` | Interval between checks | 1s |
| `cluster.fault_detection.follower_check.retry_count` | Number of retries on failure | 3 |
| `cluster.fault_detection.leader_check.timeout` | Leader check timeout | 10s |

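## Full fault detection time calculation

```
total fault detection time = (timeout + interval) × retry_count

Default configuration:
(10s + 1s) × 3 = 33 seconds

High-latency network configuration:
(30s + 2s) × 5 = 160 seconds
```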
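
To compare profiles quickly, both formulas from this page can be wrapped in helper functions. The function names are hypothetical, and the profiles are the three `easysearch.yml` examples shown earlier; this is a sketch for estimation, not a measurement of real cluster behavior.

```python
def confirmation_time(timeout_s, retry_count):
    """Fault confirmation time: timeout × retry_count."""
    return timeout_s * retry_count


def total_detection_time(timeout_s, interval_s, retry_count):
    """Total fault detection time: (timeout + interval) × retry_count."""
    return (timeout_s + interval_s) * retry_count


# (timeout, interval, retry_count) in seconds, taken from the configuration examples.
profiles = {
    "default": (10, 1, 3),
    "cross-region": (30, 2, 5),
    "high-load": (15, 1, 4),
}

for name, (timeout_s, interval_s, retries) in profiles.items():
    print(name, total_detection_time(timeout_s, interval_s, retries))
# prints:
# default 33
# cross-region 160
# high-load 64
```

The cross-region profile takes nearly five times as long as the default to declare a node faulty, which is the cost of tolerating slow links.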

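## Timeout guidance

| Scenario | Recommended timeout | Notes |
|------|----------|------|
| LAN cluster | 3-5s | Low-latency network, fast detection |
| Same-city cloud cluster | 5-10s | Stable cloud environment, standard setting |
| Cross-region cluster | 20-60s | Allows for network latency |
| Heavily loaded nodes | 10-20s | Allows for GC and processing delays |
| Unstable network | 20-30s | Tolerates network jitter |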

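## Performance and reliability impact

| timeout value | Advantages | Drawbacks |
|--------------|------|------|
| Short (1-3s) | Detects node failures quickly | May misjudge; healthy nodes can be removed when network latency is high |
| Medium (5-10s) | Balances accuracy and detection speed | None significant; the standard setting, suitable for most scenarios |
| Long (20-60s) | Tolerates network latency and GC pauses | Slow failure detection, which delays cluster recovery |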

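## Notes

1. **Exceed network latency**: Set timeout to more than 3-5 times the normal network round-trip time (RTT).

2. **Account for GC**: Garbage collection on a node can delay its responses, so timeout should tolerate typical GC pause times.

3. **Relationship to interval**: timeout is usually larger than interval, but this is not required.

4. **Avoid very small values**: If timeout is too small (for example, in the millisecond range), healthy nodes may be misjudged as faulty.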

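5. **Dynamic updates**: The setting can be updated through the cluster settings API without restarting nodes. If the cluster rejects the update as non-dynamic, set the value in `easysearch.yml` and restart the node.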

6. **Monitoring**: Track how often timeouts occur. If they are frequent, consider increasing timeout or checking network quality.
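
The RTT and GC rules of thumb from these notes can be checked mechanically before a value is rolled out. The helper below is a hypothetical sketch: `check_timeout` is not part of any Easysearch tooling, the RTT and GC pause figures are assumed to come from your own measurements, and it uses the upper end of the 3-5× RTT range as its default factor.

```python
def check_timeout(timeout_s, rtt_s, typical_gc_pause_s=0.0, rtt_factor=5):
    """Return a list of warnings for a candidate timeout; empty means no issue found."""
    warnings = []
    # Note 1: timeout should be greater than 3-5 times the normal RTT.
    if timeout_s <= rtt_factor * rtt_s:
        warnings.append(f"timeout is not above {rtt_factor}x the measured RTT")
    # Note 2: timeout should tolerate a typical GC pause.
    if timeout_s <= typical_gc_pause_s:
        warnings.append("timeout does not cover the typical GC pause")
    return warnings
```

With a measured RTT of 50 ms, `check_timeout(10, 0.05)` returns an empty list, whereas `check_timeout(0.1, 0.05, typical_gc_pause_s=0.5)` returns both warnings.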