While adding OSDs to a Ceph cluster the other day I ran into a failure; here are my troubleshooting notes.
Symptom:
Running the add command reports an error:
# ceph-volume lvm prepare --data /dev/sdv --block.wal /dev/nvme0n1p1 --block.db /dev/nvme0n1p7
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new bd07ea2f-9e65-46e2-92b0-42ce1c9796f6
stderr: 2020-03-24 02:00:36.122857 7f0b0c1ba700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
stderr: 2020-03-24 02:00:36.122869 7f0b0c1ba700 -1 monclient: ERROR: missing keyring, cannot use cephx for authentication
stderr: 2020-03-24 02:00:36.122870 7f0b0c1ba700 0 librados: client.bootstrap-osd initialization error (2) No such file or directory
stderr: [errno 2] error connecting to the cluster
--> RuntimeError: Unable to create a new OSD id

Manually creating and deleting files under /var/lib/ceph/bootstrap-osd/ worked fine, and restarting every Ceph service made no difference.
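The failure happens in the `ceph ... osd new` step, which authenticates as client.bootstrap-osd using the keyring path shown in the error. A small helper makes it easy to check that file on each node; this is just a sketch, and the `check_keyring` name is mine, not part of Ceph:

```shell
# check_keyring: report whether a keyring file is readable and, if so,
# print the "key = ..." value it contains (handy for cross-node comparison).
check_keyring() {
    if [ -r "$1" ]; then
        # Keyring lines look like "key = AQ...=="; split on " = " and
        # print the value for the "key" field only.
        awk -F' = ' '$1 ~ /key$/ {print $2}' "$1"
    else
        echo "missing: $1"
    fi
}

check_keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
```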
Checking the other nodes, the ones where OSD creation works all have this file. Comparing md5 checksums, the file is identical on most nodes (one node's MD5 differs):
$ ansible ceph1 -uroot -m shell -a 'md5sum /var/lib/ceph/bootstrap-osd/ceph.keyring'
node5 | SUCCESS | rc=0 >>
5d77927663a4869c757854dd56d9238e /var/lib/ceph/bootstrap-osd/ceph.keyring
node1 | SUCCESS | rc=0 >>
10cd904c67e2bf03adbb9647382b4d09 /var/lib/ceph/bootstrap-osd/ceph.keyring
node3 | SUCCESS | rc=0 >>
10cd904c67e2bf03adbb9647382b4d09 /var/lib/ceph/bootstrap-osd/ceph.keyring
node2 | SUCCESS | rc=0 >>
10cd904c67e2bf03adbb9647382b4d09 /var/lib/ceph/bootstrap-osd/ceph.keyring
node4 | FAILED | rc=1 >>
md5sum: /var/lib/ceph/bootstrap-osd/ceph.keyring: No such file or directory
node6 | FAILED | rc=1 >>
md5sum: /var/lib/ceph/bootstrap-osd/ceph.keyring: No such file or directory

The key value inside the file is the same on every node that has it:
$ ansible ceph1 -uroot -m shell -a 'cat /var/lib/ceph/bootstrap-osd/ceph.keyring'
node3 | SUCCESS | rc=0 >>
[client.bootstrap-osd]
    key = 1AQDzPXdcTh0BDRAA4prnwky9uGO3iVuiZtqsKQ==
node1 | SUCCESS | rc=0 >>
[client.bootstrap-osd]
    key = 1AQDzPXdcTh0BDRAA4prnwky9uGO3iVuiZtqsKQ==
node5 | SUCCESS | rc=0 >>
[client.bootstrap-osd]
    key = 1AQDzPXdcTh0BDRAA4prnwky9uGO3iVuiZtqsKQ==
    caps mon = "allow profile bootstrap-osd"
node2 | SUCCESS | rc=0 >>
[client.bootstrap-osd]
    key = 1AQDzPXdcTh0BDRAA4prnwky9uGO3iVuiZtqsKQ==
node4 | FAILED | rc=1 >>
cat: /var/lib/ceph/bootstrap-osd/ceph.keyring: No such file or directory
node6 | FAILED | rc=1 >>
cat: /var/lib/ceph/bootstrap-osd/ceph.keyring: No such file or directory

So I simply copied the file over from a working node:
# scp /var/lib/ceph/bootstrap-osd/ceph.keyring node004:/var/lib/ceph/bootstrap-osd/ceph.keyring

Running the prepare command again, it went through cleanly this time:
# ceph-volume lvm prepare --data /dev/sdv --block.wal /dev/nvme0n1p1 --block.db /dev/nvme0n1p7
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 12221f9e-06ae-456a-9e47-2408de97ac6f
Running command: vgcreate --force --yes ceph-90f542ea-bd2b-4fbf-bf74-91550627e9d1 /dev/sdv
......
......
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 19 --monmap /var/lib/ceph/osd/ceph-19/activate.monmap --keyfile - --bluestore-block-wal-path /dev/nvme0n1p1 --bluestore-block-db-path /dev/nvme0n1p7 --osd-data /var/lib/ceph/osd/ceph-19/ --osd-uuid 12221f9e-06ae-456a-9e47-2408de97ac6f --setuser ceph --setgroup ceph
--> ceph-volume lvm prepare successful for: /dev/sdv

After the command succeeds, the OSD data directory is mounted automatically:
# df
Filesystem 1K-blocks Used Available Use% Mounted on
......
tmpfs 132026216 48 132026168 1% /var/lib/ceph/osd/ceph-19
......

Because it has not been activated yet, the new OSD's status is down:
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 61.85667 root default
-3 18.19307 host ynode001
0 hdd 3.63860 osd.0 up 1.00000 1.00000
1 hdd 3.63860 osd.1 up 1.00000 1.00000
......
19 0 osd.19 down 0 1.00000

Activate it:
# cat /var/lib/ceph/osd/ceph-19/fsid
12221f9e-06ae-456a-9e47-2408de97ac6f
# ceph-volume lvm activate 19 12221f9e-06ae-456a-9e47-2408de97ac6f
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-19
Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-90f542ea-bd2b-4fbf-bf74-91550627e9d1/osd-block-12221f9e-06ae-456a-9e47-2408de97ac6f --path /var/lib/ceph/osd/ceph-19
.......
Running command: systemctl enable --runtime ceph-osd@19
stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@19.service → /lib/systemd/system/ceph-osd@.service.
Running command: systemctl start ceph-osd@19
--> ceph-volume lvm activate successful for osd ID: 19

The OSD status is now "up", and it has automatically started backfilling data.
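One more note: instead of scp-ing the keyring between nodes, it can be re-exported from the monitors, which also restores the `caps mon` line that most of the copies above were missing. A hedged sketch (it assumes an admin keyring is available on the node; the guard makes the snippet a no-op on machines without the ceph CLI):

```shell
# Re-export the bootstrap-osd keyring from the monitors rather than
# copying a possibly-stale file around. Requires admin credentials.
if command -v ceph >/dev/null 2>&1; then
    ceph auth get client.bootstrap-osd \
        -o /var/lib/ceph/bootstrap-osd/ceph.keyring
else
    echo "ceph CLI not available here"
fi
```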