What to Do When an Elasticsearch Index Shard Is Corrupted?

周银辉
Modified 2024-09-29 10:50:57
This article is part of the column: elasticsearch常见问题 (Common Elasticsearch Problems)

What to Do When an Elasticsearch Index Shard Is Corrupted? (Part 1)

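Background

  • In the earlier article on the causes of abnormal Elasticsearch cluster states (RED, YELLOW), we saw that when a primary shard cannot come online the cluster turns RED, and reads and writes against the affected RED index are severely impacted.
  • Here we look at index shard corruption. When a shard is corrupted, its primary cannot be allocated and the index status is also RED. Corruption comes in many forms, though: some cases are only superficial and can be recovered by certain means, while others are genuine physical damage that cannot be recovered, leaving no choice but to discard part of the data or even the whole shard.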

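Problem

Scenario: shard corruption caused by a physical power loss on the server

This situation is fairly common. It can usually be confirmed with the explain API: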

[root@sh ~]# curl -s -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "index-net-20210902-3",
  "shard" : 3,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2021-09-28T03:10:58.099Z",
    "failed_allocation_attempts" : 5,
    "details" : """
    failed shard on node [LwWiAwmdQCiEibtiF7oqxQ]: failed recovery, 
    failure RecoveryFailedException[[device_search_20201204][3]: 
    Recovery failed on {reading_9.10.126.164_node2}{LwWiAwmdQCiEibtiF7oqxQ}
    {YVadGK2FSDKbR69l0Wu0xg}{9.10.126.164}{9.10.126.164:9300}
    {dil}{ml.machine_memory=539647844352, xpack.installed=true, ml.max_open_jobs=20}];
     nested: IndexShardRecoveryException[failed recovery]; nested: 
     ElasticsearchException[java.io.IOException: 
     failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices
     /b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: 
     IOException[failed to read /data1/containers/1612339152002810932/es/data/nodes/0
     /indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; 
     nested: IOException[org.apache.lucene.index.CorruptIndexException: 
     codec footer mismatch (file truncated?): actual footer=892219961 vs expected 
     footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(
     path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/
     b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st")))]; 
     nested: CorruptIndexException[codec footer mismatch (file truncated?): 
     actual footer=892219961 vs expected footer=-1071082520 
     (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(
     path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/
     b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st")))]; """,
    "last_allocation_status" : "no"
  }
}

Or it can be confirmed from the logs:

[o.e.c.r.a.AllocationService] [1612339152002813032] failing shard [failed shard, shard [index-net-20210902-3][7], node[6sTEWvTlTlWZutgb_sK8ZA], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=y1Gvnr_hTuaaVIqm9TKaFA], unassigned_info[[reason=ALLOCATION_FAILED], at[2021-09-28T03:10:58.099Z], failed_attempts[4], failed_nodes[[6sTEWvTlTlWZutgb_sK8ZA]], delayed=false, details[failed shard on node [6sTEWvTlTlWZutgb_sK8ZA]: failed recovery, failure RecoveryFailedException[[index-net-20210902-3][7]: Recovery failed on {1612339152002810932}{6sTEWvTlTlWZutgb_sK8ZA}{JLZ-DDlmQoiw3MUHcxYydQ}{9.10.126.164}{9.10.126.164:9300}{dil}{ml.machine_memory=67210133504, rack=cvm_1_100003, xpack.installed=true, set=100003, ip=9.10.126.164, temperature=hot, ml.max_open_jobs=20, region=1}]; nested: IndexShardRecoveryException[failed recovery]; nested: ElasticsearchException[java.io.IOException: failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))]; ], allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[index-net-20210902-3][7]: Recovery failed on 
{1612339152002810932}{6sTEWvTlTlWZutgb_sK8ZA}{JLZ-DDlmQoiw3MUHcxYydQ}{9.10.126.164}{9.10.126.164:9300}{dil}{ml.machine_memory=67210133504, rack=cvm_1_100003, xpack.installed=true, set=100003, ip=9.10.126.164, temperature=hot, ml.max_open_jobs=20, region=1}]; nested: IndexShardRecoveryException[failed recovery]; nested: ElasticsearchException[java.io.IOException: failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [index-net-20210902-3][7]: Recovery failed on {1612339152002810932}{6sTEWvTlTlWZutgb_sK8ZA}{JLZ-DDlmQoiw3MUHcxYydQ}{9.10.126.164}{9.10.126.164:9300}{dil}{ml.machine_memory=67210133504, rack=cvm_1_100003, xpack.installed=true, set=100003, ip=9.10.126.164, temperature=hot, ml.max_open_jobs=20, region=1}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2604) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.5.1.jar:7.5.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_181]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_181]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1]
	... 4 more
Caused by: org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st
	at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:167) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadGeneration(MetaDataStateFormat.java:423) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:442) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:463) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.seqno.ReplicationTracker.loadRetentionLeases(ReplicationTracker.java:460) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.loadRetentionLeases(IndexShard.java:2263) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1633) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1607) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:422) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1]
	... 4 more
Caused by: java.io.IOException: failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st
	at org.elasticsearch.gateway.MetaDataStateFormat.loadGeneration(MetaDataStateFormat.java:417) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:442) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:463) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.seqno.ReplicationTracker.loadRetentionLeases(ReplicationTracker.java:460) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.loadRetentionLeases(IndexShard.java:2263) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1633) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1607) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:422) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1]
	... 4 more
Caused by: java.io.IOException: org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))
	at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:316) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadGeneration(MetaDataStateFormat.java:413) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:442) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:463) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.seqno.ReplicationTracker.loadRetentionLeases(ReplicationTracker.java:460) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.loadRetentionLeases(IndexShard.java:2263) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1633) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1607) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:422) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1]
	... 4 more
Caused by: org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))
	at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:523) ~[lucene-core-8.3.0.jar:8.3.0 6305aea4e5929f262e9c07fcf16d3afe2b4bb9f5 - danielhuang - 2020-11-10 17:11:38]
	at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:299) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadGeneration(MetaDataStateFormat.java:413) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:442) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:463) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.seqno.ReplicationTracker.loadRetentionLeases(ReplicationTracker.java:460) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.loadRetentionLeases(IndexShard.java:2263) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1633) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1607) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:422) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1]

The key message they have in common is: file truncated?
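The failure text is long, and what matters most in it is the file that Lucene could not read. As a minimal sketch, assuming the details string has been copied out of the explain output or the log, the path can be pulled out with a regular expression. The function name corrupted_files is hypothetical and not part of any Elasticsearch client.

```python
import re
from typing import Dict, List

# Matches the path="..." fragment embedded in the CorruptIndexException text.
# The quotes may be backslash-escaped when the text was copied from raw JSON.
_PATH_RE = re.compile(r'path=\\?"([^"\\]+)\\?"')


def corrupted_files(details: str) -> List[str]:
    """Return the distinct file paths named in a failure message (hypothetical helper)."""
    # Copied output is often line-wrapped in the middle of a path, so drop
    # each newline together with the whitespace around it before matching.
    flat = re.sub(r"\s*\n\s*", "", details)
    seen: Dict[str, None] = {}
    for match in _PATH_RE.finditer(flat):
        seen.setdefault(match.group(1), None)  # keeps first-seen order
    return list(seen)
```

Because nested exceptions repeat the same path several times, the helper returns each path only once.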

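Problem analysis

So what causes this? Index shards do not get corrupted for no reason; something abnormal must have happened on the node hosting the shard.

  1. We therefore used cerebro to identify the node that had recently gone offline, and logged in to it;
  2. The uptime command showed that the machine hosting the node had apparently restarted at around 12:21:57 that day. An ordinary restart is unlikely to cause this kind of failure, so this was most likely not a graceful restart, that is, the machine was not shut down with shutdown or a similar command before starting again;
  3. We then checked with the colleagues responsible for hardware operations, who confirmed that the host machine had indeed gone through a failover that day. Automatic failover forcibly cuts the power, so the culprit was this physical power loss;
  4. Why does a physical power loss corrupt shards? With reboot, the shutdown is initiated by the operating system, which actively stops the services running inside the guest machine. When the host machine restarts, however, the guest is unaware of it and is stopped forcibly. Files that are being written at that moment cannot be closed properly, so their data can no longer be read correctly.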
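The exception above reports length=0 but footerLength==16, meaning the file is shorter than the 16-byte footer the reader expects. The following read-only sketch walks a shard directory and lists state and checkpoint files that are too short to hold that footer. The function name and the choice of suffixes (.st, .ckp, taken from the file names in the logs) are assumptions made for illustration.

```python
import os

FOOTER_LENGTH = 16  # taken from "footerLength==16" in the exception message


def short_files(shard_dir, suffixes=(".st", ".ckp")):
    """List (path, size) for files too short to contain a codec footer.

    Hypothetical helper: it only reads file sizes and never modifies anything.
    """
    hits = []
    for root, _dirs, files in os.walk(shard_dir):
        for name in files:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size < FOOTER_LENGTH:
                hits.append((path, size))
    return sorted(hits)
```

Pointing it at a shard directory such as .../indices/{index UUID}/{shard ID} gives a quick list of candidates to compare against the paths named in the exception.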

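Solutions

Option 1: Repair the shard

The corruption of retention-leases-1518.st is related to this file having been offline for a period of time, in other words to the machine restart. To recover, delete this file manually and then retry the shard allocation: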

POST _cluster/reroute?retry_failed=true
{
  "acknowledged": true,
  "state": {
    "cluster_uuid": "LOk2L8k5RsmCC7eg2y3h8A",
    "version": 533752,
    "state_uuid": "jVm_8aAIT6ug9NBJazjVig",
    "master_node": "kHbBiclxR5-c-rsra2A5Jg",
    "blocks": {

    },
    "nodes": {
      "m5eloUNuTJak4xDRqf3FeA": {
        "name": "1625799512002116132",
        "ephemeral_id": "dqHmYahLSbuqvSkRXy2IPg",
        "transport_address": "9.27.34.96:9300",
        "attributes": {
          "ml.machine_memory": "134587404288",
          "rack": "cvm_33_330002",
          "xpack.installed": "true",
          "set": "330002",
          "transform.node": "true",
          "ip": "9.27.34.96",
          "temperature": "hot",
          "ml.max_open_jobs": "20",
          "region": "33"
        }
      }
    },
    "security_tokens": {
    }
  }
}

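Option 2: Allocate a stale primary

If deleting the corrupted .st file does not bring the shard online, consider using the reroute API to allocate a stale primary. Before calling this API, we need some information:

  • The index name and shard ID can be read directly from the explain API output;
  • The node name can be obtained from unassigned_info.details.

With this information, we can call the reroute API: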

[root@sh ~]# curl -s -H "Content-Type:application/json" -XPOST "localhost:9200/_cluster/reroute?pretty" -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "{index name}",
        "shard": "{shard ID}",
        "node": "{node name}",
        "accept_data_loss": true
      }
    }
  ]
}'
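Filling in the placeholders by hand is error-prone, so the request body can also be generated. The sketch below builds the body for the two data-loss commands used in this article, allocate_stale_primary and allocate_empty_primary, and renders it as a curl command like the ones shown here. The helper names reroute_body and as_curl and the default host are assumptions for illustration, not part of any Elasticsearch client.

```python
import json
import shlex

# The two reroute commands used in this article; both need accept_data_loss.
ALLOWED = ("allocate_stale_primary", "allocate_empty_primary")


def reroute_body(command, index, shard, node):
    """Build the JSON body for a _cluster/reroute call (hypothetical helper)."""
    if command not in ALLOWED:
        raise ValueError("unsupported command: %s" % command)
    return {
        "commands": [
            {
                command: {
                    "index": index,
                    "shard": int(shard),  # send the shard ID as a number
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


def as_curl(body, host="localhost:9200"):
    """Render the body as a curl command; the host is an assumed default."""
    payload = shlex.quote(json.dumps(body, indent=2))
    return ('curl -s -H "Content-Type:application/json" -XPOST '
            f'"{host}/_cluster/reroute?pretty" -d {payload}')
```

Printing as_curl(reroute_body("allocate_stale_primary", "twitter", 0, "node-1")) gives a command that can be reviewed before it is run, which is worthwhile because both commands accept data loss.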

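Option 3: Discard the shard (think twice, use with caution!)

If allocating a stale primary does not bring the shard online either, the only way to stop index reads and writes from being affected is to discard the corrupted shard. This is the worst case: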

[root@sh ~]# curl -s -H "Content-Type:application/json" -XPOST "localhost:9200/_cluster/reroute?pretty" -d '
{
    "commands" : [
        {
          "allocate_empty_primary" : {
              "index" : "{index name}",
              "shard" : "{shard ID}",
              "node" : "{node name}",
              "accept_data_loss": true
          }
        }
    ]
}'

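What to Do When an Elasticsearch Index Shard Is Corrupted? (Part 2)

Problem

Scenario: checksum errors caused by a disk failure

This situation is also fairly common. It can usually be confirmed with the explain API: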

GET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "twitter",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-11-06T06:11:15.562Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "CxXWE8BiQbS4ThB9AvvGQA",
      "node_name" : "node-1",
      "transport_address" : "10.142.0.2:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "gxegPAMyQa21MH5NxQEACw"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-11-06T06:11:15.562Z], failed_attempts[5], delayed=false, details[failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
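In the response above, the node is rejected by the max_retry decider, whose explanation says to call /_cluster/reroute?retry_failed=true manually. As a small sketch, assuming the explain response has already been parsed into a Python dict (for example with json.loads), the blocking deciders can be listed as follows. The function names are hypothetical.

```python
def blocking_deciders(explain):
    """Return (node_name, decider) pairs for every decider that answered NO."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append((node.get("node_name"), decider.get("decider")))
    return blocked


def needs_retry_failed(explain):
    """True when max_retry is among the blockers, as in the response above."""
    return any(name == "max_retry" for _node, name in blocking_deciders(explain))
```

When needs_retry_failed returns True, retry_failed=true is worth trying first. The retry can only succeed once the underlying read error has been dealt with, which is what the options below are for.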

Or it can be confirmed from the logs:

[o.e.a.a.c.a.TransportClusterAllocationExplainAction] [1624264340001550732] explaining the allocation for [ClusterAllocationExplainRequest[index=qw_cust_group,shard=3,primary?=true,includeYesDecisions?=false], found shard [[qw_cust_group][3], node[null], [P], recovery_source[existing recovery], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2021-09-29T07:10:25.054Z], failed_attempts[13], delayed=false, details[failed recovery, failure RecoveryFailedException[[qw_cust_group][3]: Recovery failed on {1624264340001550832}{bbNH-12CS7uy8dYTYAywgQ}{_PqMaYD_T-yz1u67-ifKdQ}{172.23.15.115}{172.23.15.115:9300}{temperature=hot, rack=cvm_8_800007, set=800007, region=8, ip=9.15.112.197}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1624264340001550832/es/data/nodes/0/indices/d1_CneTOQcCCgatoDMX8Ag/3/translog/translog-2705.ckp"))]; ], allocation_status[deciders_no]]]
[o.e.c.a.s.ShardStateAction] [1624264340001550732] [qw_cust_group][3] received shard failed for shard id [[qw_cust_group][3]], allocation id [HlWMLhDHTDe3hYFjY7oo0g], primary term [0], message [failed recovery], failure [RecoveryFailedException[[qw_cust_group][3]: Recovery failed on {1624264340001550832}{bbNH-12CS7uy8dYTYAywgQ}{_PqMaYD_T-yz1u67-ifKdQ}{172.23.15.115}{172.23.15.115:9300}{temperature=hot, rack=cvm_8_800007, set=800007, region=8, ip=9.15.112.197}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1624264340001550832/es/data/nodes/0/indices/d1_CneTOQcCCgatoDMX8Ag/3/translog/translog-2705.ckp"))]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [qw_cust_group][3]: Recovery failed on {1624264340001550832}{bbNH-12CS7uy8dYTYAywgQ}{_PqMaYD_T-yz1u67-ifKdQ}{172.23.15.115}{172.23.15.115:9300}{temperature=hot, rack=cvm_8_800007, set=800007, region=8, ip=9.15.112.197}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1488) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-5.6.4.jar:5.6.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_181]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_181]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed to recover from gateway
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:365) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:90) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1236) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1484) ~[elasticsearch-5.6.4.jar:5.6.4]
	... 4 more
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: failed to create engine
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:163) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1602) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1584) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:1027) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:987) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:360) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:90) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1236) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1484) ~[elasticsearch-5.6.4.jar:5.6.4]
	... 4 more
Caused by: org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1624264340001550832/es/data/nodes/0/indices/d1_CneTOQcCCgatoDMX8Ag/3/translog/translog-2705.ckp"))
	at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:523) ~[lucene-core-6.6.1.jar:6.6.1 unknown - boicehuang - 2018-11-20 19:03:10]
	at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:98) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:237) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.translog.Translog.<init>(Translog.java:177) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:272) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:160) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1602) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1584) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:1027) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:987) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:360) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:90) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1236) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1484) ~[elasticsearch-5.6.4.jar:5.6.4]
	... 4 more

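The key message they have in common is: file truncated?

Solutions

Option 1: Reopen the index

The purpose of reopening is to trigger the index shards to come online again. Simply call the _close and _open APIs: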

POST localhost:9200/twitter/_close?pretty
{
  "acknowledged": true
}
POST localhost:9200/twitter/_open?pretty
{
  "acknowledged": true,
  "shards_acknowledged": true
}
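The close-then-open sequence can also be scripted. The sketch below only prepares the two POST requests in the right order and sends nothing by itself. The function name reopen_requests is hypothetical, and the base URL passed in is an assumption about where the cluster listens.

```python
from urllib.parse import quote
from urllib.request import Request


def reopen_requests(base_url, index):
    """Prepare the _close and _open requests for one index (hypothetical helper)."""
    base = base_url.rstrip("/")
    name = quote(index, safe="")  # escape characters that are unsafe in a URL path
    return [
        Request(f"{base}/{name}/_close", method="POST"),
        Request(f"{base}/{name}/_open", method="POST"),
    ]
```

Each request can then be sent in order with urllib.request.urlopen, checking that the first response reports "acknowledged": true before sending the second.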

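Option 2: Allocate a stale primary

If reopening the index does not bring the shard online, consider using the reroute API to allocate a stale primary. Before calling this API, we need some information:

  • The index name and shard ID can be read directly from the explain API output;
  • The node name can be obtained from unassigned_info.details.

With this information, we can call the reroute API: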

[root@sh ~]# curl -s -H "Content-Type:application/json" -XPOST "localhost:9200/_cluster/reroute?pretty" -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "{index name}",
        "shard": "{shard ID}",
        "node": "{node name}",
        "accept_data_loss": true
      }
    }
  ]
}'

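Option 3: Remove the corrupt marker files

If the failed shard's directory contains a file whose name starts with corrupt, that file needs to be removed. Such a file records where the corruption was found. As long as it is present, allocating a stale primary cannot recover the shard; recovery only succeeds once the file has been removed. After removing the corrupt file, retry Option 2.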
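Before removing anything, it helps to list exactly which marker files exist. This read-only sketch walks a shard directory and returns every file whose name starts with the given prefix. The default prefix "corrupt" follows the description above, while the function name and the directory passed in are assumptions for illustration.

```python
import os


def find_corruption_markers(shard_dir, prefix="corrupt"):
    """Return the sorted paths of marker files under shard_dir (read-only)."""
    markers = []
    for root, _dirs, files in os.walk(shard_dir):
        for name in files:
            if name.startswith(prefix):
                markers.append(os.path.join(root, name))
    return sorted(markers)
```

The returned paths can be reviewed, and backed up if desired, before the files are removed and Option 2 is retried.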

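Option 4: Discard the shard (think twice, use with caution!)

If allocating a stale primary does not bring the shard online either, the only way to stop index reads and writes from being affected is to discard the corrupted shard. This is the worst case: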

[root@sh ~]# curl -s -H "Content-Type:application/json" -XPOST "localhost:9200/_cluster/reroute?pretty" -d '
{
    "commands" : [
        {
          "allocate_empty_primary" : {
              "index" : "{index name}",
              "shard" : "{shard ID}",
              "node" : "{node name}",
              "accept_data_loss": true
          }
        }
    ]
}'

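What to Do When an Elasticsearch Index Shard Is Corrupted? (Part 3)

Problem

Scenario: shard corruption caused by a file system failure on a cluster node

This situation is also fairly common. It can usually be confirmed with the explain API: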

[root@sh ~]# curl -s -XGET localhost:9200/_cluster/allocation/explain?pretty
{
	"index": "d4f811fc-4a43-40ca-a362-ebdaa9f23a720722",
	"shard": 5,
	"primary": true,
	"current_state": "unassigned",
	"unassigned_info": {
		"reason": "MANUAL_ALLOCATION",
		"at": "2021-07-11T03:05:49.241Z",
		"details": "failed shard on node [IEW657FYSZiiUn53LjBcuAJ]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/data2/containers/1620201319003550632/es/data/nodes/0/indices/g2Zz7YV6FRj6xPETjZAzCOg/5/translog/translog-38.tlog] is corrupted, operation size is corrupted must be [0..65615539] but was: 2065851766]; ",
		"last_allocation_status": "no_valid_shard_copy"
	},
	"can_allocate": "no_valid_shard_copy",
	"allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt",
	"node_allocation_decisions": [
		{
			"node_id": "OGORzwn5T2CknwuCkHSNHA",
			"node_name": "1617024100001322232",
			"transport_address": "9.10.179.196:9300",
			"node_attributes": {
				"ml.machine_memory": "67211821056",
				"rack": "cvm_1_100003",
				"xpack.installed": "true",
				"set": "100003",
				"ip": "9.10.179.196",
				"temperature": "warm",
				"ml.max_open_jobs": "20",
				"region": "1"
			},
			"node_decision": "no",
			"store": {
				"found": false
			}
		}
	]
}

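Solutions

Option 1: Retry allocation of the shard that failed to come online

This is the optimistic case: the shard usually failed to allocate because the cluster was under heavy load, so we simply retry the allocation: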

[root@sh ~]# curl -s -XPOST localhost:9200/_cluster/reroute?retry_failed=true
{
  "acknowledged": true,
  "state": {
    "cluster_uuid": "LOk2L8k5RsmCC7eg2y3h8A",
    "version": 533752,
    "state_uuid": "jVm_8aAIT6ug9NBJazjVig",
    "master_node": "kHbBiclxR5-c-rsra2A5Jg",
    "blocks": {

    },
    "nodes": {
      "m5eloUNuTJak4xDRqf3FeA": {
        "name": "1625799512002116132",
        "ephemeral_id": "dqHmYahLSbuqvSkRXy2IPg",
        "transport_address": "9.27.34.96:9300",
        "attributes": {
          "ml.machine_memory": "134587404288",
          "rack": "cvm_33_330002",
          "xpack.installed": "true",
          "set": "330002",
          "transform.node": "true",
          "ip": "9.27.34.96",
          "temperature": "hot",
          "ml.max_open_jobs": "20",
          "region": "33"
        }
      }
    },
    "security_tokens": {
    }
  }
}

Option 2: Reopen the index

The purpose of reopening is to trigger the index shards to come online again. Simply call the _close and _open APIs:

[root@sh ~]# curl -s -XPOST localhost:9200/twitter/_close?pretty
{
  "acknowledged": true
}
[root@sh ~]# curl -s -XPOST localhost:9200/twitter/_open?pretty
{
  "acknowledged": true,
  "shards_acknowledged": true
}

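Option 3: Allocate a stale primary

If neither retry_failed nor reopening the index brings the shard online, consider using the reroute API to allocate a stale primary. Before calling this API, we need some information:

  • The index name and shard ID can be read directly from the explain API output;
  • The node name can be obtained from unassigned_info.details.

With this information, we can call the reroute API: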

[root@sh ~]# curl -s -H "Content-Type:application/json" -XPOST "localhost:9200/_cluster/reroute?pretty" -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "{index name}",
        "shard": "{shard ID}",
        "node": "{node name}",
        "accept_data_loss": true
      }
    }
  ]
}'

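Option 4: Discard part of the translog

The shard that could not come online holds 2 GB of data, but the error reports that a single file, translog-38, is corrupted. On the node's server this file turned out to be only 6 MB, so we moved it out of the way into /tmp and ran Option 3 again. This way the allocate_stale_primary operation recovers as much of the shard's data as possible: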

[root@sh ~]# mv /data2/containers/1620201319003550632/es/data/nodes/0/indices/g2Zz7YV6FRj6xPETjZAzCOg/5/translog/translog-38.tlog /tmp
[root@sh ~]# curl -s -H "Content-Type:application/json" -XPOST "localhost:9200/_cluster/reroute?pretty" -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "{index name}",
        "shard": "{shard ID}",
        "node": "{node name}",
        "accept_data_loss": true
      }
    }
  ]
}'
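The mv step above can be wrapped so that an earlier backup is never overwritten. This is a sketch only: the function name quarantine_file is hypothetical, and the backup directory is whatever location the operator chooses (the article uses /tmp).

```python
import os
import shutil


def quarantine_file(path, backup_dir):
    """Move a damaged file into backup_dir and return its new location.

    Hypothetical helper: it refuses to overwrite a file of the same name
    that is already present in the backup directory.
    """
    os.makedirs(backup_dir, exist_ok=True)
    dest = os.path.join(backup_dir, os.path.basename(path))
    if os.path.exists(dest):
        raise FileExistsError(dest)
    shutil.move(path, dest)
    return dest
```

Keeping the moved file means it can be put back if the stale-primary allocation turns out not to help.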

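Option 5: Discard the shard (think twice, use with caution!)

If none of the options above brings the shard online, the only way to stop index reads and writes from being affected is to discard the corrupted shard. This is the worst case: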

[root@sh ~]# curl -s -H "Content-Type:application/json" -XPOST "localhost:9200/_cluster/reroute?pretty" -d '
{
    "commands" : [
        {
          "allocate_empty_primary" : {
              "index" : "{index name}",
              "shard" : "{shard ID}",
              "node" : "{node name}",
              "accept_data_loss": true
          }
        }
    ]
}'

This article is a repost.