Troubleshooting and Resolving Unassigned Primary Shards (index_primary_shard_not_allocated_exception)

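Why this error occurs

index_primary_shard_not_allocated_exception means that a primary shard of the index is not assigned to any node. The primary shard is the copy that processes write operations first, so writes to that shard cannot proceed until it is allocated and active.

Common causes include:

Node offline: the node that held the primary shard has left the cluster or crashed
Insufficient disk space: every eligible node is short on disk space
Allocation disabled: shard allocation was disabled manually
Corrupted shard data: the data of the primary shard is corrupted
Allocation filtering: filtering rules prevent the primary shard from being allocated
Recovery in progress: the primary shard is still being recovered
Too few nodes: there are not enough nodes to hold all primary shards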

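How to fix this error

1. Check shard status

# List all shards with their state and the reason they are unassigned
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node&s=state

# A pipe only works in a shell, not in the console (adjust host and port):
# curl -s "http://localhost:9200/_cat/shards?v" | grep UNASSIGNED

# List the shards of a specific index
GET /_cat/shards/<index>?v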
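On a large cluster the shard listing can be long, so it may help to filter it in a script. The following is a minimal sketch, assuming the listing was requested with `format=json` and the columns `index,shard,prirep,state,unassigned.reason,node`. The function name `unassigned_primaries` and the sample rows are made up for illustration.

```python
def unassigned_primaries(rows):
    """Return (index, shard, reason) for every unassigned primary shard.

    `rows` is the parsed JSON list from GET /_cat/shards?format=json.
    """
    found = []
    for row in rows:
        # prirep is "p" for a primary and "r" for a replica
        if row.get("prirep") == "p" and row.get("state") == "UNASSIGNED":
            found.append((row["index"], int(row["shard"]), row.get("unassigned.reason")))
    return sorted(found, key=lambda item: (item[0], item[1]))


# Hypothetical sample data, not output from a real cluster
sample_rows = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-1"},
    {"index": "logs", "shard": "1", "prirep": "p", "state": "UNASSIGNED",
     "unassigned.reason": "NODE_LEFT", "node": None},
    {"index": "logs", "shard": "1", "prirep": "r", "state": "UNASSIGNED",
     "unassigned.reason": "NODE_LEFT", "node": None},
]

for index, shard, reason in unassigned_primaries(sample_rows):
    print(f"{index}[{shard}] primary is unassigned, reason: {reason}")
```

Only primaries are reported here because unassigned replicas alone do not trigger this error.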

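2. Explain why the shard is unassigned

# Without a request body, explains the first unassigned shard the cluster finds
GET /_cluster/allocation/explain

# Explain a specific primary shard
GET /_cluster/allocation/explain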
{
  "index": "<index>",
  "shard": 0,
  "primary": true
}
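The explain response can be verbose, since it lists a decision for every node. The sketch below pulls out only the deciders that answered NO, per node. It assumes the response carries `node_allocation_decisions` entries with `node_name` and a `deciders` list of `decider`, `decision`, and `explanation` fields; the sample response and its explanation text are invented for illustration.

```python
def blocking_deciders(explain):
    """Map each node name to the (decider, explanation) pairs that said NO."""
    blocked = {}
    for node in explain.get("node_allocation_decisions", []):
        reasons = [
            (d["decider"], d.get("explanation", ""))
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        ]
        if reasons:
            blocked[node.get("node_name", "?")] = reasons
    return blocked


# Hypothetical, heavily shortened response
sample_explain = {
    "index": "logs",
    "shard": 0,
    "primary": True,
    "current_state": "unassigned",
    "node_allocation_decisions": [
        {
            "node_name": "node-1",
            "node_decision": "no",
            "deciders": [
                {"decider": "disk_threshold", "decision": "NO",
                 "explanation": "sample text: node is above the low watermark"},
                {"decider": "same_shard", "decision": "YES", "explanation": "sample text"},
            ],
        }
    ],
}

for node_name, reasons in blocking_deciders(sample_explain).items():
    for decider, explanation in reasons:
        print(f"{node_name}: {decider}: {explanation}")
```

The decider name usually points to the matching step in this guide, for example a disk-related decider to the disk space check and a filter-related decider to the allocation filter step.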

3. Enable shard allocation

# If shard allocation has been disabled, re-enable it
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}
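A persistent value of "all" has no effect while a transient value for the same setting is still "none", because transient settings take precedence over persistent ones. The sketch below shows one way to find the value that actually applies, assuming the input is the parsed response of `GET /_cluster/settings?flat_settings=true&include_defaults=true`. The helper name `effective_setting` and the sample data are hypothetical.

```python
def effective_setting(settings, key, fallback=None):
    """Return (value, scope) for the setting that actually applies.

    Scopes are checked in precedence order: transient, persistent, defaults.
    """
    for scope in ("transient", "persistent", "defaults"):
        value = settings.get(scope, {}).get(key)
        if value is not None:
            return value, scope
    return fallback, "fallback"


# Hypothetical sample: the persistent value is overridden by a transient one
sample_settings = {
    "persistent": {"cluster.routing.allocation.enable": "all"},
    "transient": {"cluster.routing.allocation.enable": "none"},
}

value, scope = effective_setting(sample_settings, "cluster.routing.allocation.enable")
print(f"allocation.enable = {value} (from {scope} settings)")
```

If the transient scope turns out to be the source, setting the same key to null under "transient" in `PUT /_cluster/settings` removes the override.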

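4. Check disk space

# Show disk usage per node
GET /_cat/allocation?v

# Show the disk watermark settings, including defaults that were never overridden
GET /_cluster/settings?include_defaults=true&filter_path=**.disk

# Free up disk space (preferred), or raise the watermarks as a temporary measure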
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}
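To see which nodes are the reason for a disk-related refusal, the allocation listing can be compared against the watermark in a script. This is a rough sketch, assuming the listing was requested as `GET /_cat/allocation?format=json` and that each row has `node` and `disk.percent` fields returned as strings. The threshold of 85 is only an example and should be replaced with the low watermark configured on your cluster; the sample rows are invented.

```python
def nodes_over_threshold(rows, threshold_pct):
    """Return (node, used_percent) for nodes at or above the threshold, fullest first."""
    over = []
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            # The summary row for unassigned shards carries no disk figures
            continue
        used = float(percent)
        if used >= threshold_pct:
            over.append((row["node"], used))
    return sorted(over, key=lambda item: item[1], reverse=True)


# Hypothetical sample data
sample_allocation = [
    {"node": "node-1", "disk.percent": "91"},
    {"node": "node-2", "disk.percent": "42"},
    {"node": "node-3", "disk.percent": "87"},
    {"node": "UNASSIGNED", "disk.percent": None},
]

for node, used in nodes_over_threshold(sample_allocation, threshold_pct=85):
    print(f"{node} is at {used:.0f}% disk usage")
```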

5. Remove allocation filters

# Review the allocation filtering rules (exclude, include, require)
GET /_cluster/settings?flat_settings=true

# Remove the exclusion rules
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": null,
    "cluster.routing.allocation.exclude._ip": null
  }
}
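The request above clears only `_name` and `_ip` exclusions. A filter may also be defined with `include` or `require`, or on another attribute. The sketch below builds a reset body for whatever cluster-level filters are actually present, assuming the input is the parsed response of `GET /_cluster/settings?flat_settings=true`. The function name `filter_reset_body` and the sample settings are hypothetical.

```python
import json

# Cluster-level filter settings start with one of these prefixes
FILTER_PREFIXES = tuple(
    f"cluster.routing.allocation.{kind}." for kind in ("exclude", "include", "require")
)


def filter_reset_body(flat_settings):
    """Build a PUT /_cluster/settings body that sets every filter found to null."""
    body = {}
    for scope in ("persistent", "transient"):
        keys = [k for k in flat_settings.get(scope, {}) if k.startswith(FILTER_PREFIXES)]
        if keys:
            body[scope] = {key: None for key in sorted(keys)}
    return body


# Hypothetical sample settings
sample_flat = {
    "persistent": {
        "cluster.routing.allocation.enable": "all",
        "cluster.routing.allocation.exclude._name": "node-3",
    },
    "transient": {
        "cluster.routing.allocation.require.zone": "a",
    },
}

# json.dumps turns None into null, which removes the setting
print(json.dumps(filter_reset_body(sample_flat), indent=2))
```

Review the generated body before sending it, since some filters may be intentional. Index-level filters (`index.routing.allocation.*`) live in the index settings and are not covered by this sketch.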

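6. Manually allocate the shard

# Retry allocation for shards whose earlier allocation attempts failed
POST /_cluster/reroute?retry_failed=true

# Or force the stale copy on a specific node to become the primary.
# Writes that never reached this copy are lost.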
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "<index>",
        "shard": 0,
        "node": "<node_name>",
        "accept_data_loss": true
      }
    }
  ]
}

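7. Recover from a remaining shard copy

An in-sync replica is promoted to primary automatically. If the primary stays unassigned, any copy still present on a node is stale and has to be promoted explicitly:

# Promote the stale copy on the given node to primary.
# The command is rejected unless accept_data_loss is set to true.
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "<index>",
        "shard": 0,
        "node": "<node_with_replica>",
        "accept_data_loss": true
      }
    }
  ]
}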
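Because `allocate_stale_primary` discards any writes the stale copy never received, and the command has to carry `accept_data_loss: true` to be accepted, it can be worth putting a guard in front of it in any automation. The following is one possible sketch; the function name `stale_primary_command` and its `confirm_data_loss` parameter are made up for illustration and are not part of any client library.

```python
import json


def stale_primary_command(index, shard, node, confirm_data_loss=False):
    """Build a reroute body for allocate_stale_primary, refusing unless confirmed."""
    if not confirm_data_loss:
        raise ValueError(
            "allocate_stale_primary can lose data; pass confirm_data_loss=True to proceed"
        )
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    # Required by the reroute API for this command
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Hypothetical index and node names
body = stale_primary_command("logs", 0, "node-2", confirm_data_loss=True)
print(json.dumps(body, indent=2))
```

The printed body can be sent with `POST /_cluster/reroute`. Making the caller opt in keeps the data-loss decision visible in the script instead of hiding it in a template.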

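8. Restore from a snapshot

# Restore the index from a snapshot (an existing index with the same name must be closed or deleted first)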
POST /_snapshot/<repository>/<snapshot>/_restore
{
  "indices": "<index>"
}
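If the broken index should be kept for inspection, the snapshot can be restored under a different name by adding `rename_pattern` and `rename_replacement` to the restore request. The sketch below builds such a body; the helper name `restore_body` and the `restored_` prefix are assumptions chosen for the example.

```python
import json


def restore_body(index, prefix=None):
    """Build a restore request body, optionally renaming the restored index."""
    body = {"indices": index, "include_global_state": False}
    if prefix is not None:
        # rename_pattern is a regular expression; $1 refers to its first group
        body["rename_pattern"] = "(.+)"
        body["rename_replacement"] = prefix + "$1"
    return body


# Restores "logs" from the snapshot as "restored_logs"
print(json.dumps(restore_body("logs", prefix="restored_"), indent=2))
```

Once the restored copy has been verified, the damaged index can be deleted and the restored one put in its place, for example through an alias.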

9. Delete and recreate the index

If losing the data is acceptable:

# Delete the index
DELETE /<index>

# Recreate the index with its mappings
PUT /<index>
{
  "mappings": {
    "properties": {
      "field": { "type": "text" }
    }
  }
}

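10. Wait for the node to recover

# Check whether the offline node has rejoined the cluster
GET /_cat/nodes?v

# Or wait until the expected number of nodes has joined (3 is an example value)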
GET /_cluster/health?wait_for_nodes=3&timeout=50s
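A single health request returns after its timeout whether or not the cluster has recovered, so a script usually needs to poll. Below is a minimal polling sketch. It assumes a caller-supplied `fetch_health` function that returns the parsed body of `GET /_cluster/health`; that function, `wait_for_status`, and the fake responses are all hypothetical. A status of yellow is treated as good enough because it means every primary shard is allocated.

```python
import time


def wait_for_status(fetch_health, wanted=("yellow", "green"),
                    attempts=10, delay_s=5.0, sleep=time.sleep):
    """Poll cluster health until its status is one of `wanted`.

    Raises TimeoutError if the status is not reached within `attempts` polls.
    """
    for attempt in range(attempts):
        health = fetch_health()
        if health.get("status") in wanted:
            return health
        if attempt < attempts - 1:
            sleep(delay_s)
    raise TimeoutError(f"cluster did not reach {wanted} after {attempts} attempts")


# Demonstration with canned responses instead of a real cluster
fake_responses = iter([
    {"status": "red", "number_of_nodes": 2},
    {"status": "red", "number_of_nodes": 3},
    {"status": "yellow", "number_of_nodes": 3},
])

final = wait_for_status(lambda: next(fake_responses), sleep=lambda seconds: None)
print(final["status"], final["number_of_nodes"])
```

In real use, `fetch_health` would perform the HTTP request with whatever client the environment provides, and the `sleep` argument would be left at its default.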

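Prevention

Configure enough replicas to protect against data loss
Make sure there are enough nodes to hold all shards
Monitor disk space usage
Keep allocation rules sensible and review them after maintenance
Check cluster health regularly
Migrate data off a node before taking it offline
Take regular snapshot backups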
