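Overview: While checking a Ceph cluster today, I found a PG reporting unfound (lost) objects. This article walks through how the problem was resolved.
1. Check the cluster status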
[root@**snode001 ~]# ceph health detail
HEALTH_ERR 1/973013 objects unfound (0.000%); 17 scrub errors; Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair; Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
OBJECT_UNFOUND 1/973013 objects unfound (0.000%)
    pg 2.2b has 1 unfound objects
OSD_SCRUB_ERRORS 17 scrub errors
PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
    pg 2.44 is active+clean+inconsistent, acting [14,8,21]
    pg 2.73 is active+clean+inconsistent, acting [25,14,8]
    pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
    pg 2.83 is active+clean+inconsistent, acting [14,13,6]
    pg 2.ae is active+clean+inconsistent, acting [14,3,2]
    pg 2.c4 is active+clean+inconsistent, acting [8,21,14]
    pg 2.da is active+clean+inconsistent, acting [23,14,15]
    pg 2.fa is active+clean+inconsistent, acting [14,23,25]
PG_DEGRADED Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
The output shows that pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], with 1 unfound object.
Next, let's look at the detailed information for pg 2.2b.
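When the health report is long, it can help to pull the affected PGs out programmatically. The following is a minimal sketch that parses the "pg X is ..." lines exactly as they appear in the output above; `parse_pg_lines` and `pgs_with_state` are hypothetical helper names, and the regular expression assumes the plain-text layout shown in this article rather than any documented, stable format.

```python
import re

# Sample lines copied from the `ceph health detail` output in this article.
SAMPLE = """\
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
    pg 2.44 is active+clean+inconsistent, acting [14,8,21]
    pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
"""

# Assumed line layout: "pg PGID is STATE, acting [OSD,OSD,...]"
PG_LINE = re.compile(r"pg (\S+) is (\S+), acting \[([\d,]+)\]")


def parse_pg_lines(text):
    """Map each PG id to (set of state flags, list of acting OSD ids)."""
    result = {}
    for line in text.splitlines():
        match = PG_LINE.search(line)
        if match:
            pgid, state, acting = match.groups()
            osds = [int(osd) for osd in acting.split(",")]
            result[pgid] = (set(state.split("+")), osds)
    return result


def pgs_with_state(text, flag):
    """Return the sorted PG ids whose state contains the given flag."""
    parsed = parse_pg_lines(text)
    return sorted(pgid for pgid, (states, _) in parsed.items() if flag in states)


if __name__ == "__main__":
    print(pgs_with_state(SAMPLE, "recovery_unfound"))
    print(pgs_with_state(SAMPLE, "inconsistent"))
```

Separating the unfound PG from the merely inconsistent ones is useful here because the two problems are handled differently in the steps that follow.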
[root@**snode001 ~]# ceph pg dump_json pools | grep 2.2b
dumped all
2.2b  2487  1  1  0  1  9533198403  3048  3048  active+recovery_unfound+degraded  2020-07-23 08:56:07.669903  10373'5448370  10373:7312614  [14,22,4]  14  [14,22,4]  14  10371'5437258  2020-07-23 08:56:06.637012  10371'5437258  2020-07-23 08:56:06.637012  0
From this row, it appears that the PG currently has only one replica of the affected object.
2. Check the PG map
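The grep output drops the header row, which makes the counters hard to read. The sketch below labels the first ten fields of the row. The column names are an assumption: they follow the header that `ceph pg dump` printed in releases from around the time of this article, and they are not shown anywhere in the original output, so verify them against the header on your own cluster. `label_row` is a hypothetical helper.

```python
# ASSUMPTION: column order taken from the `ceph pg dump` header of that era;
# the article's output does not include the header, so verify on your cluster.
ASSUMED_COLUMNS = [
    "pg_stat", "objects", "missing_on_primary", "degraded", "misplaced",
    "unfound", "bytes", "log", "disk_log", "state",
]

# First fields of the row shown in the article.
ROW = ("2.2b 2487 1 1 0 1 9533198403 3048 3048 "
       "active+recovery_unfound+degraded 2020-07-23 08:56:07.669903")


def label_row(row):
    """Pair the leading whitespace-separated fields with the assumed names."""
    fields = row.split()
    # zip stops at the shorter sequence, so trailing fields are ignored.
    return dict(zip(ASSUMED_COLUMNS, fields))


if __name__ == "__main__":
    info = label_row(ROW)
    for name in ("objects", "degraded", "unfound", "state"):
        print(name, "=", info[name])
```

Under that assumed column order, the row reads as 2487 objects in the PG, of which 1 is degraded and 1 is unfound.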
[root@**snode001 ~]# ceph pg map 2.2b
osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]
The PG map shows that pg 2.2b is placed on OSDs [14,22,4].
3. Check the pool status
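As a small illustration, the `ceph pg map` line can be split into its up and acting sets so the two can be compared. This is a sketch that assumes the single-line text format shown above; `parse_pg_map` is a hypothetical helper name.

```python
import re

# The line printed by `ceph pg map 2.2b` in this article.
PG_MAP_LINE = "osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]"


def parse_pg_map(line):
    """Extract the PG id plus the up and acting OSD lists from one line."""
    match = re.search(
        r"pg (\S+) \(\S+\) -> up \[([\d,]*)\] acting \[([\d,]*)\]", line
    )
    if match is None:
        raise ValueError("unrecognized pg map line")

    def to_ids(text):
        return [int(part) for part in text.split(",") if part]

    return {
        "pgid": match.group(1),
        "up": to_ids(match.group(2)),
        "acting": to_ids(match.group(3)),
    }


if __name__ == "__main__":
    info = parse_pg_map(PG_MAP_LINE)
    print(info["pgid"], "up == acting:", info["up"] == info["acting"])
```

Here the two sets are identical, so the PG is being served by the same OSDs that CRUSH maps it to; the first OSD in the acting set (14) is the primary.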
[root@**snode001 ~]# ceph osd pool stats **s-1
pool **s-1 id 2
  1/1955664 objects degraded (0.000%)
  1/651888 objects unfound (0.000%)
  client io 271 KiB/s wr, 0 op/s rd, 52 op/s wr

[root@**snode001 ~]# ceph osd pool ls detail | grep **s-1
pool 2 '**s-1' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 88 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
4. Try to recover the lost objects in pg 2.2b
[root@**snode001 ~]# ceph pg repair 2.2b
If the repair keeps failing, query the stuck PG for details, paying particular attention to recovery_state:
[root@**snode001 ~]# ceph pg 2.2b query
{
    "......
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-07-21 14:17:05.855923",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "10370",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2020-07-21 14:17:04.814061"
        }
    ],
    "agent_state": {}
}
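Since `ceph pg <pgid> query` prints JSON, the recovery_state section can be summarized with a few lines of code. The sketch below works on a trimmed copy of the output above (the elided part is left out and the timestamps are shortened); `summarize_recovery_state` is a hypothetical helper, not part of any Ceph tooling.

```python
import json

# Trimmed, valid-JSON copy of the recovery_state section shown above.
SAMPLE_QUERY = """
{
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-07-21 14:17:05.855923",
            "might_have_unfound": [],
            "recovery_progress": {"recovering": []}
        },
        {
            "name": "Started",
            "enter_time": "2020-07-21 14:17:04.814061"
        }
    ],
    "agent_state": {}
}
"""


def summarize_recovery_state(query_json):
    """Return (state name, might_have_unfound list) for each entry."""
    doc = json.loads(query_json)
    return [
        (entry["name"], entry.get("might_have_unfound", []))
        for entry in doc.get("recovery_state", [])
    ]


if __name__ == "__main__":
    for name, candidates in summarize_recovery_state(SAMPLE_QUERY):
        print(name, "-> OSDs that might hold unfound objects:", candidates)
```

In this case might_have_unfound is empty and nothing is listed under recovering, so the PG reports no remaining OSD to probe for the missing object, which is consistent with the repair not making progress.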
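If repair cannot fix the PG, there are two options: revert the unfound objects to a previous version, or delete them outright.
5. Solution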
Revert to the previous version:
[root@**snode001 ~]# ceph pg 2.2b mark_unfound_lost revert
Delete outright:
[root@**snode001 ~]# ceph pg 2.2b mark_unfound_lost delete
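Both commands discard data, so if they are ever scripted it is worth validating the arguments first. The following is a minimal sketch that only builds the argument list and never runs it; `mark_unfound_lost_argv` is a hypothetical helper, and the PG id pattern (pool number, a dot, then hexadecimal digits) is an assumption based on the ids seen in this article.

```python
import re

# The two actions used in this article.
ALLOWED_ACTIONS = ("revert", "delete")

# ASSUMPTION: PG ids look like "2.2b" (decimal pool id, dot, hex PG number).
PGID_PATTERN = re.compile(r"^\d+\.[0-9a-f]+$")


def mark_unfound_lost_argv(pgid, action):
    """Build (but do not execute) the `ceph pg ... mark_unfound_lost` command."""
    if not PGID_PATTERN.match(pgid):
        raise ValueError("unexpected PG id: %r" % pgid)
    if action not in ALLOWED_ACTIONS:
        raise ValueError("action must be one of %s" % (ALLOWED_ACTIONS,))
    return ["ceph", "pg", pgid, "mark_unfound_lost", action]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(" ".join(mark_unfound_lost_argv("2.2b", "delete")))
```

Returning a list rather than a shell string keeps the command safe to hand to `subprocess.run` later without shell quoting issues, once an operator has confirmed that the data really is unrecoverable.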
6. Verify
In my case I chose to delete. The Ceph cluster then rebuilt the PG, and after a short wait the PG state changed to active+clean.
[root@**snode001 ~]# ceph pg 2.2b query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 11069,
    "up": [
        12,
        22,
        4
    ],
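Rather than re-running the query by hand, the wait can be expressed as a small polling loop. This is a sketch under stated assumptions: `wait_for_clean` is a hypothetical helper, the attempt count and delay are arbitrary defaults, and the `fetch_query` callable is injected so that the loop can be exercised without a cluster.

```python
import json
import time


def wait_for_clean(fetch_query, attempts=30, delay=10.0, sleep=time.sleep):
    """Poll until the PG state contains both 'active' and 'clean'.

    fetch_query: callable returning the JSON text of `ceph pg PGID query`.
    Returns True once the PG is active+clean, False if attempts run out.
    """
    for _ in range(attempts):
        state = json.loads(fetch_query()).get("state", "")
        flags = set(state.split("+"))
        if {"active", "clean"}.issubset(flags):
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    # Canned replies standing in for successive query results.
    replies = iter([
        '{"state": "active+recovery_unfound+degraded"}',
        '{"state": "active+clean"}',
    ])
    ok = wait_for_clean(lambda: next(replies), delay=0)
    print("clean" if ok else "still not clean")
```

On a real cluster, `fetch_query` could be something like `lambda: subprocess.check_output(["ceph", "pg", "2.2b", "query"], text=True)`, which is the same command used above.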
Check the cluster status again:
[root@**snode001 ~]# ceph health detail
HEALTH_OK