---
title: "Job Scheduler Retry Count Configuration"
date: 2026-03-27
lastmod: 2026-03-27
description: "The jobscheduler.retry_count setting controls how many times the job scheduler retries a failed sweep operation."
tags: ["job scheduling", "retry mechanism", "fault tolerance", "exponential backoff"]
summary: "The jobscheduler.retry_count setting controls how many times the job scheduler's Job Sweeper retries a failed search request. When a search request hits a transient failure (such as 502 Bad Gateway, 504 Gateway Timeout, or 503 Service Unavailable), the Job Sweeper retries with exponential backoff, and this setting caps the number of retries."
---


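## What the Setting Does

The `jobscheduler.retry_count` setting controls **how many times the job scheduler's Job Sweeper retries a failed search request**.

When a search request hits a transient failure (such as 502 Bad Gateway, 504 Gateway Timeout, or 503 Service Unavailable), the Job Sweeper retries with exponential backoff. This setting caps the number of retries.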

## Setting Properties

- **Setting path**: `jobscheduler.retry_count`
- **Actual name in code**: `jobscheduler.sweeper.backoff_retry_count`
- **Data type**: `Integer`
- **Default**: `3`
- **Minimum**: `0` (no retries)
- **Maximum**: no explicit upper bound (limited by the `Integer` type)
- **Optional**: yes
- **Scope**: NodeScope (node level)
- **Dynamic update**: yes

## Setting Details

### How It Works

```
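Retry flow

retry_count = 3

Request:
├── Attempt 1: failed (502)
│   ↓ back off ~50ms
├── Attempt 2: failed (504)
│   ↓ back off ~62ms
├── Attempt 3: failed (503)
│   ↓ back off ~90ms
└── Attempt 4: failed
    ↓
  Final failure ❌


If any attempt succeeds, the result is returned immediately ✅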
```
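The flow above can be sketched as a small loop. This is an illustrative model only, not the scheduler's actual implementation: the helper name `run_with_retries`, its parameters, and the injected `sleep` function are all assumptions made for the example.

```python
import time

def run_with_retries(attempt, retry_count, delays_ms, sleep=time.sleep):
    """Run attempt() once, then retry up to retry_count more times.

    attempt() returns True on success. delays_ms[i] is the wait (in ms)
    before retry i + 1. Returns (succeeded, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        if attempt():
            return True, attempts        # success: return immediately
        if attempts > retry_count:
            return False, attempts       # retries exhausted: final failure
        sleep(delays_ms[attempts - 1] / 1000.0)

# Two transient failures, then success (sleep is stubbed out for the demo).
outcomes = iter([False, False, True])
print(run_with_retries(lambda: next(outcomes), 3, [50, 62, 90], sleep=lambda s: None))
# prints (True, 3)
```

With `retry_count = 3` the loop makes at most four attempts and waits at most three times, matching the diagram.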

### Exponential Backoff Strategy

```
Backoff delay calculation

Formula:
delay(n) = initialDelay + 10 × (e^(0.8×n) - 1) ms

Where:
- n: index of the retry (starting at 0)
- initialDelay: the backoff_millis setting (default 50ms)


Worked examples (initialDelay = 50ms):

Retry 0 (wait after attempt 1):
delay = 50 + 10 × (e^0 - 1) = 50 + 10 × 0 = 50ms

Retry 1 (wait after attempt 2):
delay = 50 + 10 × (e^0.8 - 1) ≈ 50 + 10 × 1.23 ≈ 62ms

Retry 2 (wait after attempt 3):
delay = 50 + 10 × (e^1.6 - 1) ≈ 50 + 10 × 3.95 ≈ 90ms

Retry 3 (wait after attempt 4; used only when retry_count ≥ 4):
delay = 50 + 10 × (e^2.4 - 1) ≈ 50 + 10 × 10.02 ≈ 150ms
```
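To check these numbers, here is a minimal sketch that evaluates the formula exactly as written on this page. The function name `backoff_delay_ms` and the `truncate` flag are hypothetical. The flag models an assumption this page does not confirm: some implementations cast the exponential to an integer before subtracting 1, which changes the intermediate delays.

```python
import math

def backoff_delay_ms(initial_delay_ms, n, truncate=False):
    """delay(n) = initialDelay + 10 * (e^(0.8 * n) - 1), in milliseconds."""
    growth = math.exp(0.8 * n)
    if truncate:
        # Assumption, not confirmed by this page: integer truncation of e^(0.8n).
        growth = int(growth)
    return initial_delay_ms + 10 * (growth - 1)

for n in range(4):
    print(n, round(backoff_delay_ms(50, n)))
# prints:
# 0 50
# 1 62
# 2 90
# 3 150
```

With `truncate=True` and the same inputs, the sequence would be 50, 60, 80, 150 ms. If delays observed in practice differ from the formula, this kind of rounding is one thing to check.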

### Creating the Retry Policy

```
// Using BackoffPolicy

// Build the policy from the two sweeper settings.
BackoffPolicy policy = BackoffPolicy.exponentialBackoff(
    this.sweepSearchBackoffMillis,     // initial delay
    this.sweepSearchBackoffRetryCount  // number of retries
);


// Apply the policy when running the sweep search.
SearchResponse response = this.retry(
    (searchRequest) -> this.client.search(searchRequest),
    jobSearchRequest,
    this.sweepSearchBackoff  // backoff policy
).actionGet(this.sweepSearchTimeout);
```

## Configuration Recommendations

### Production (Default)

```yaml
jobscheduler:
  retry_count: 3  # default
```

**Recommendation**: Keep the default of `3`. It suits most scenarios.

### High-Load Environments

```yaml
jobscheduler:
  retry_count: 5  # more retries
```

**Recommendation**: Raise the value to between `5` and `10` when the cluster is heavily loaded or unstable.

### Low-Latency Requirements

```yaml
jobscheduler:
  retry_count: 1  # fewer retries
```

**Recommendation**: Lower the value to `1` or `2` when failures must surface quickly.

### Disabling Retries

```yaml
jobscheduler:
  retry_count: 0  # no retries
```

**Recommendation**: Set the value to `0` when a request must fail immediately.

## Code Examples

### Basic easysearch.yml Configuration

```yaml
jobscheduler:
  retry_count: 3
```

### High-Availability Configuration

```yaml
jobscheduler:
  retry_count: 5
  sweeper:
    backoff_millis: 100
    period: 5m
```

### Fail-Fast Configuration

```yaml
jobscheduler:
  retry_count: 1
  request_timeout: 5s
```

### Updating the Setting Dynamically

```json
PUT /_cluster/settings
{
  "transient": {
    "jobscheduler.retry_count": 5
  }
}
```
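The same request can be prepared from a script. The following is a minimal sketch using only the Python standard library. The base URL `http://localhost:9200` and the helper name `build_retry_count_update` are assumptions for illustration, and authentication and TLS settings are omitted.

```python
import json
import urllib.request

def build_retry_count_update(base_url, retry_count):
    """Build (but do not send) a PUT /_cluster/settings request."""
    if retry_count < 0:
        raise ValueError("retry_count must be >= 0")
    body = {"transient": {"jobscheduler.retry_count": retry_count}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Hypothetical local node; adjust the URL for a real cluster.
req = build_retry_count_update("http://localhost:9200", 5)
# To send it: urllib.request.urlopen(req)  (requires a reachable cluster)
```

Building the request separately from sending it makes the payload easy to inspect or log before it touches a live cluster.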

## Related Settings

| Setting | Purpose | Default |
|--------|------|--------|
| `jobscheduler.retry_count` | Number of retries | 3 |
| `jobscheduler.sweeper.backoff_millis` | Initial backoff delay | 50ms |
| `jobscheduler.request_timeout` | Request timeout | 10s |
| `jobscheduler.jitter_limit` | Jitter limit | 0.6 |

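## Performance Impact

| retry_count value | Pros | Cons |
|------------------|------|------|
| 0 | Fails immediately, no added delay | Cannot tolerate transient failures |
| 1 | Fails fast | Weak fault tolerance |
| 2-3 | Balances latency and fault tolerance | None notable (standard setting) |
| 5-10 | High fault tolerance | Slow failure detection and recovery |
| > 10 | Maximum fault tolerance | Very slow failure detection and recovery |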

### Estimated Total Execution Time

```
Assume request_timeout = 10s, backoff_millis = 50ms, and every request times out

retry_count = 0:
Attempt 1: 0s ─── 10s (timeout)
Total: 10s


retry_count = 1:
Attempt 1: 0s ───────── 10s (timeout)
Attempt 2: 10.05s ────── 20.05s (timeout)
Total: ~20s


retry_count = 3:
Attempt 1: 0s ───────── 10s (timeout)
Attempt 2: 10.05s ────── 20.05s (timeout)
Attempt 3: 20.11s ────── 30.11s (timeout)
Attempt 4: 30.20s ────── 40.20s (timeout)
Total: ~40s
```
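The worst case generalizes to any combination of values. The sketch below is a rough estimate only. It assumes that every attempt consumes the full `request_timeout`, that the delay formula is applied exactly as stated on this page, and that no jitter is added. The function name `worst_case_seconds` is hypothetical.

```python
import math

def worst_case_seconds(retry_count, request_timeout_s=10.0, backoff_ms=50):
    """Upper-bound estimate: (retry_count + 1) timeouts plus the backoff waits."""
    attempts = retry_count + 1
    # One wait before each retry: delay(n) for n = 0 .. retry_count - 1.
    waits_ms = sum(backoff_ms + 10 * (math.exp(0.8 * n) - 1)
                   for n in range(retry_count))
    return attempts * request_timeout_s + waits_ms / 1000.0

for retries in (0, 1, 3):
    print(retries, round(worst_case_seconds(retries), 2))
# prints:
# 0 10.0
# 1 20.05
# 3 40.2
```

The timeout term dominates: the backoff waits add only about 0.2s at `retry_count = 3`, so `request_timeout` is the main lever for bounding the total time.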

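## Usage Scenarios

### When to Keep the Default

- **Standard environment**: the cluster runs stably
- **Normal load**: cluster load stays within its normal range
- **Routine fault tolerance**: only basic fault tolerance is needed

### When to Increase the Retry Count

- **High-load environment**: the cluster is frequently under heavy load
- **Unreliable network**: network quality is unstable
- **Critical jobs**: job scheduling must not fail easily
- **Large clusters**: many nodes with complex interactions

### When to Decrease the Retry Count

- **Fail fast**: failures need to be detected quickly
- **Low latency**: the workload is latency-sensitive
- **Stable environment**: the cluster is very stable
- **Manual intervention**: failures can be handled manually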

## Working with the Backoff Delay

```
How retry_count and backoff_millis work together

Config 1: standard
retry_count: 3
backoff_millis: 50ms
├── Backoff: 50ms, 62ms, 90ms
└── Total backoff: ~200ms


Config 2: high fault tolerance
retry_count: 5
backoff_millis: 100ms
├── Backoff: 100ms, 112ms, 140ms, 200ms, 335ms
└── Total backoff: ~890ms


Config 3: fail fast
retry_count: 1
backoff_millis: 50ms
├── Backoff: 50ms
└── Total backoff: 50ms
```

## Retries for Common Error Codes

```
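HTTP status codes that trigger a retry

502 Bad Gateway:
├── Gateway or proxy server error
├── Usually a transient failure
└── Suitable for retry ✅


503 Service Unavailable:
├── Service temporarily unavailable
├── Possibly overloaded or restarting
└── Suitable for retry ✅


504 Gateway Timeout:
├── Gateway timed out
├── The backend may be responding slowly
└── Suitable for retry ✅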
```
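A client-side check for these codes might look like the following. This is a sketch under assumptions: the set contains only the three codes listed on this page (the scheduler may treat other failures as retryable too), and the name `is_retryable` is hypothetical.

```python
# Only the status codes named on this page; the real list may be longer.
RETRYABLE_STATUSES = frozenset({502, 503, 504})

def is_retryable(status_code):
    """Return True if the status indicates a transient, retry-worthy failure."""
    return status_code in RETRYABLE_STATUSES

print([code for code in (200, 400, 502, 503, 504) if is_retryable(code)])
# prints [502, 503, 504]
```

Client errors such as 400 are deliberately excluded, since repeating an unchanged request that the server rejected as invalid is unlikely to succeed.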

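## Notes

1. **Default**: The default is `3`, which suits most scenarios.

2. **Range**: The minimum is 0; the maximum is limited by the Integer type.

3. **Dynamic update**: Supported; changes take effect immediately.

4. **Exponential backoff**: Retries use exponential backoff to avoid retrying too frequently.

5. **Timeout**: Tune this setting together with `request_timeout`.

6. **Total time**: The more retries, the longer the worst-case execution time.

7. **Fault tolerance**: The more retries, the higher the tolerance for transient failures.

8. **Failure detection**: Too many retries delay the detection of persistent failures.

9. **Monitoring**: Track the retry success rate to evaluate the setting.

10. **Testing**: After changing the setting, verify that failure recovery still works.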