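When swapping a disk, always verify that you have the right one. The partition labels written by stock ceph-disk are too coarse and easy to misread. In the example below, PARTLABEL distinguishes journal partitions from data partitions, but it does not tell you which OSD a given journal or data partition belongs to.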
[root@Demo]# blkid
/dev/sda3: UUID="45e8c881-ce39-4313-9b75-a0b3acb93906" TYPE="ext4"
/dev/sdb1: UUID="3ef30369-343a-4c7c-97e8-7360a7bc3239" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="e963fe08-04be-42b6-b541-3da67c57ceed"
/dev/sdc1: UUID="7e1024ba-ceb3-4731-b484-689391c2bd5a" TYPE="xfs" PARTLABEL="ceph data" PARTUUID="70a9e47f-bbd0-455e-9fc8-207e33f548b7"
/dev/sda1: UUID="4f08eebe-f3d6-4290-bf84-2744c406fd80" TYPE="swap"
/dev/sda2: UUID="eb439ed8-1b0f-42fe-9653-2fb677833ad9" TYPE="ext4"
/dev/sdb2: PARTLABEL="ceph journal" PARTUUID="dbd5efca-3264-486c-8e38-5a0e0b3547a3"
/dev/sdc2: PARTLABEL="ceph journal" PARTUUID="6f7fed48-621c-48c3-8336-088e34814b4f"
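So I refined the partitioning step and put a descriptive tag in the PARTLABEL field (<OSD id>-journal or <OSD id>-data). The layout is now obvious at a glance, which greatly reduces the risk of human error from misreading during maintenance.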
[root@Demo]# blkid
/dev/sda1: UUID="e499fbea-962e-46e6-8d43-4d79d7b3d0d5" TYPE="swap"
/dev/sda2: UUID="b6e343a0-066c-4640-b5b5-8ee65fc7fa42" TYPE="ext4"
/dev/sda3: UUID="78ada3e7-30da-4bc1-8534-363b587de191" TYPE="ext4"
/dev/sdb1: PARTLABEL="0-journal" PARTUUID="62a622dc-c2e0-50b4-94d1-76705915b88b"
/dev/sdb2: UUID="6677bbc0-ab78-4363-94b7-a49c88a55d17" TYPE="xfs" PARTLABEL="0-data" PARTUUID="e432f6b1-c805-53fd-ac09-ec3f90e6af2d"
/dev/sdc1: PARTLABEL="1-journal" PARTUUID="2a35f318-41f6-5604-a5a4-47403abea2ee"
/dev/sdc2: UUID="4b8fa574-5501-4b47-ba49-4faecfa8fce2" TYPE="xfs" PARTLABEL="1-data" PARTUUID="cab2d0ff-ce5c-53d3-a870-703ea1409bdf"
/dev/sdf1: PARTLABEL="4-journal" PARTUUID="cf4ee515-52ec-550e-8c26-b4b837253b53"
/dev/sdf2: UUID="afe934d1-ec69-4c44-b349-9ac60bf5ef0d" TYPE="xfs" PARTLABEL="4-data" PARTUUID="537fb0a6-31e6-5f4a-8a9d-36f3442f89f9"
/dev/sdd1: PARTLABEL="2-journal" PARTUUID="b635f60f-6652-5548-927c-0f5af88286e6"
/dev/sdd2: UUID="120bc72e-a883-4461-93ed-55932178b616" TYPE="xfs" PARTLABEL="2-data" PARTUUID="8176db3a-3090-5c48-b81a-281eb9d0a5bb"
/dev/sdg1: PARTLABEL="5-journal" PARTUUID="87185c7d-4b80-581a-b206-09103a90f9f7"
/dev/sdg2: UUID="900da833-1cde-4bd4-a70f-e65bfc643e15" TYPE="xfs" PARTLABEL="5-data" PARTUUID="5cc0f85b-86dc-5b98-9f22-e2bd3836015d"
/dev/sde1: PARTLABEL="3-journal" PARTUUID="0a906835-5cd4-5b8d-9bd9-90b3f37a8b91"
/dev/sde2: UUID="6bd951e5-5cfa-4e8c-b5ff-aa54d8a4433e" TYPE="xfs" PARTLABEL="3-data" PARTUUID="786a5edb-b564-5316-8e8a-83ef6bf6966c"
/dev/sdh1: PARTLABEL="6-journal" PARTUUID="c8d638ca-2bd5-51c9-94e6-7f7a5860fd16"
/dev/sdh2: UUID="d821c3e6-2080-4406-962d-3fc96d486f6e" TYPE="xfs" PARTLABEL="6-data" PARTUUID="66676bf1-6db0-57ac-a15f-99c48eefa7f6"
/dev/sdi1: PARTLABEL="7-journal" PARTUUID="7c55967f-b0a8-54ee-b227-d0f2e9637b3b"
/dev/sdi2: UUID="517d2e1f-25ed-4acd-9baa-f3a1257fdaec" TYPE="xfs" PARTLABEL="7-data" PARTUUID="8b065259-010a-56cd-99dc-e27a80fac351"
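For how to apply these labels, see the earlier post on this account, 《systemd下手工部署OSD服务-Jewel版本》 ("Manually deploying the OSD service under systemd, Jewel release").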
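With labels of the form <OSD id>-journal and <OSD id>-data, finding the partitions of one OSD before pulling a disk becomes a simple text filter. The sketch below is a hypothetical helper (osd_partitions is not part of Ceph); it reads blkid-style lines on stdin and assumes exactly the label scheme shown above.

```shell
# Hypothetical helper, not part of Ceph: list the partitions of one OSD.
# Assumes PARTLABEL values of the form "<osd-id>-journal" / "<osd-id>-data".
# Reads blkid-style lines on stdin, prints "<device> <role>" per match.
osd_partitions() {
    awk -v id="$1" '
        match($0, /PARTLABEL="[^"]*"/) {
            # Strip the leading PARTLABEL=" (11 chars) and the closing quote.
            label = substr($0, RSTART + 11, RLENGTH - 12)
            if (label == id "-journal" || label == id "-data") {
                dev = $1
                sub(/:$/, "", dev)          # "/dev/sdb1:" -> "/dev/sdb1"
                role = label
                sub(/^[^-]*-/, "", role)    # "0-journal"  -> "journal"
                print dev, role
            }
        }'
}
```

On the host shown above, `blkid | osd_partitions 0` would list /dev/sdb1 as the journal and /dev/sdb2 as the data partition. The label is compared as a whole string, so OSD 1 does not also match OSD 10 or 11.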
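The command below is the usual way to tell whether a disk is an HDD (1) or an SSD (0), but the value is often unreliable. In the output below, for example, SSDs are reported as HDDs.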
[root@demo]# grep . /sys/block/sd?/queue/rotational
/sys/block/sda/queue/rotational:1
/sys/block/sdb/queue/rotational:1
/sys/block/sdc/queue/rotational:1
/sys/block/sdd/queue/rotational:1
/sys/block/sde/queue/rotational:1
/sys/block/sdf/queue/rotational:1
/sys/block/sdg/queue/rotational:1
/sys/block/sdh/queue/rotational:1
/sys/block/sdi/queue/rotational:1
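The RAID controller's management tool gives an accurate answer, but this approach has to be adapted to each controller model, and you are stuck when the controller is not supported by a tool such as MegaCli.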
[root@demo]# MegaCli64 pdList aall|grep "Media Type"
Media Type: Solid State Device
Media Type: Solid State Device
Media Type: Solid State Device
Media Type: Solid State Device
Media Type: Solid State Device
Media Type: Solid State Device
Media Type: Solid State Device
Media Type: Solid State Device
Media Type: Solid State Device
Media Type: Solid State Device
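The physical and logical sector sizes reported by lsblk make it easy to tell an SSD (4 KB physical, 512 B logical) from an HDD (512 B for both). Treat this as a heuristic rather than proof: Advanced Format (512e) HDDs also report 4096/512.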
[root@demo]# lsblk -d -o NAME,PHY-SEC,LOG-SEC
NAME PHY-SEC LOG-SEC
sda 4096 512
sdb 4096 512
sdc 4096 512
sdd 4096 512
sde 4096 512
sdf 4096 512
sdg 4096 512
sdh 4096 512
sdi 4096 512
One more pitfall: the physical and logical sector sizes reported by the hdparm command below are not accurate.
[root@demo]# hdparm -I /dev/sdb
/dev/sdb:
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ATA device, with non-removable media
Standards:
Likely used: 1
Configuration:
Logical max current
cylinders 0 0
heads 0 0
sectors/track 0 0
--
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 0 MBytes
device size with M = 1000*1000: 0 MBytes
cache/buffer size = unknown
Capabilities:
IORDY not likely
Cannot perform double-word IO
R/W multiple sector transfer: not supported
DMA: not supported
PIO: pio0
lsblk also slips up at times when telling HDDs from SSDs: here its ROTA column shows 1 for the same SSDs.
[root@demo]# lsblk -d -o NAME,ROTA
NAME ROTA
sda 1
sdb 1
sdc 1
sdd 1
sde 1
sdf 1
sdg 1
sdh 1
sdi 1
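Since no single indicator above is reliable behind a RAID controller, it can help to print them side by side and then confirm any suspicious disk with the controller's own tool. The following is a minimal sketch, not a definitive classifier; disk_summary is a hypothetical name, and the sysfs root is a parameter only so that the function can be pointed at a test directory.

```shell
# Hypothetical helper: print the rotational flag and the sector sizes of
# every sd* disk on one line each, read straight from sysfs.
# $1 is the sysfs block directory (defaults to /sys/block).
disk_summary() {
    local sys_block="${1:-/sys/block}" dev q
    for dev in "$sys_block"/sd*; do
        q="$dev/queue"
        [ -r "$q/rotational" ] || continue   # skip an unmatched glob
        printf '%s rota=%s phy=%s log=%s\n' \
            "${dev##*/}" \
            "$(cat "$q/rotational")" \
            "$(cat "$q/physical_block_size")" \
            "$(cat "$q/logical_block_size")"
    done
}
```

Calling `disk_summary` with no argument reads the live /sys/block tree.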
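An SSD works in 4 KB physical sectors, while the operating system still uses 512 B as its smallest unit, so managing an SSD raises the question of aligning logical sectors to physical ones. I won't go into sector alignment here; look it up if you are not familiar with it.
parted is the usual tool for checking partition alignment, but checking partitions one at a time is tedious.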
[root@demo]# parted /dev/sdb print
Model: DELL PERC H730 Mini (scsi)
Disk /dev/sdb: 800GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number Start End Size File system Name Flags
1 1049kB 21.5GB 21.5GB 0-journal
2 21.5GB 800GB 778GB xfs 0-data
[root@demo]# parted /dev/sdb align-check optimal 1
1 aligned
[root@demo]# parted /dev/sdb align-check optimal 2
2 aligned
lsblk does it all in one command. The ALIGNMENT column is the alignment offset, so 0 means the device is aligned.
[root@demo]# lsblk -a -o NAME,ALIGNMENT
NAME ALIGNMENT
sda 0
├─sda1 0
├─sda2 0
└─sda3 0
sdb 0
├─sdb1 0
└─sdb2 0
sdc 0
├─sdc1 0
└─sdc2 0
sdd 0
├─sdd1 0
└─sdd2 0
sde 0
├─sde1 0
└─sde2 0
sdf 0
├─sdf1 0
└─sdf2 0
sdg 0
├─sdg1 0
└─sdg2 0
sdh 0
├─sdh1 0
└─sdh2 0
sdi 0
├─sdi1 0
└─sdi2 0
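The arithmetic behind "aligned" is small enough to check by hand: a partition is aligned when its starting byte offset is a multiple of the physical sector size. The sketch below assumes the start is given in 512-byte sectors, which is how /sys/block/<disk>/<partition>/start reports it; is_aligned is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if a partition that starts at <start_sector>
# (counted in 512-byte units) begins on a multiple of <physical_bytes>
# (default 4096).
is_aligned() {
    local start_sector="$1" physical_bytes="${2:-4096}"
    [ $(( start_sector * 512 % physical_bytes )) -eq 0 ]
}
```

For example, `is_aligned 2048` succeeds because 2048 * 512 = 1048576 bytes (1 MiB, which parted rounds to the 1049kB shown for partition 1 above) is a multiple of 4096. `is_aligned 63` fails, since 63 * 512 = 32256 leaves a remainder of 3584.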
Check each disk's current scheduler (the active one is shown in brackets):
[root@demo]# grep . /sys/block/sd?/queue/scheduler
/sys/block/sda/queue/scheduler:noop [deadline] cfq
/sys/block/sdb/queue/scheduler:[noop] deadline cfq
/sys/block/sdc/queue/scheduler:noop [deadline] cfq
/sys/block/sdd/queue/scheduler:noop [deadline] cfq
/sys/block/sde/queue/scheduler:noop [deadline] cfq
/sys/block/sdf/queue/scheduler:noop [deadline] cfq
/sys/block/sdg/queue/scheduler:noop [deadline] cfq
/sys/block/sdh/queue/scheduler:noop [deadline] cfq
/sys/block/sdi/queue/scheduler:noop [deadline] cfq
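When scripting around this file, the bracketed entry is the part that matters. A minimal sketch for pulling it out (active_scheduler is a hypothetical helper name):

```shell
# Hypothetical helper: extract the active scheduler (the bracketed entry)
# from the contents of a queue/scheduler file.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}
```

For instance, `active_scheduler "$(cat /sys/block/sdb/queue/scheduler)"` prints noop for the sdb line shown above.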
Change a disk's scheduler. Note that the change is lost on reboot.
[root@demo]# echo "deadline" > /sys/block/sdb/queue/scheduler
[root@demo]# grep . /sys/block/sd?/queue/scheduler
/sys/block/sda/queue/scheduler:noop [deadline] cfq
/sys/block/sdb/queue/scheduler:noop [deadline] cfq
/sys/block/sdc/queue/scheduler:noop [deadline] cfq
/sys/block/sdd/queue/scheduler:noop [deadline] cfq
/sys/block/sde/queue/scheduler:noop [deadline] cfq
/sys/block/sdf/queue/scheduler:noop [deadline] cfq
/sys/block/sdg/queue/scheduler:noop [deadline] cfq
/sys/block/sdh/queue/scheduler:noop [deadline] cfq
/sys/block/sdi/queue/scheduler:noop [deadline] cfq
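You could of course rerun the command above at every boot from a startup script such as rc.local, but that is not very elegant. udev is a better fit.
For example, create /etc/udev/rules.d/60-ssd-scheduler.rules with the following content. The rule matches on queue/rotational, so it only takes effect on hosts where that flag is reported correctly, which, as seen above, is not always the case.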
# set deadline scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"
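To roll the rule out without a reboot, it can be written from a script and udev asked to re-evaluate the block devices. The sketch below is one way to do that; write_scheduler_rule is a hypothetical helper, and the udevadm calls are left as comments because they need root and a running udev.

```shell
# Hypothetical helper: write the rule shown above to <rules_file>,
# with the scheduler name as an optional second argument.
write_scheduler_rule() {
    local rules_file="$1" scheduler="${2:-deadline}"
    cat > "$rules_file" <<EOF
# set ${scheduler} scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="${scheduler}"
EOF
}

# Typical use as root (not run here):
#   write_scheduler_rule /etc/udev/rules.d/60-ssd-scheduler.rules deadline
#   udevadm control --reload-rules
#   udevadm trigger --subsystem-match=block --action=change
```

After the trigger, the earlier grep on /sys/block/sd?/queue/scheduler shows whether the rule matched.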
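Alternatively, you can add a kernel boot parameter in /etc/default/grub, but it applies to every disk, so it cannot distinguish HDDs from SSDs on a mixed host. Remember to run update-grub after editing the file.
Before: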
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
After:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=deadline"
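If the edit has to be repeated on many hosts, it can be scripted. The following is a rough sketch that assumes GNU sed and a single-line GRUB_CMDLINE_LINUX_DEFAULT="..." entry like the one above; add_elevator is a hypothetical helper, and it deliberately does nothing if any elevator= setting is already present in the file.

```shell
# Hypothetical helper: append elevator=<scheduler> to the
# GRUB_CMDLINE_LINUX_DEFAULT line of <grub_file>, once.
add_elevator() {
    local grub_file="$1" scheduler="$2"
    # Leave the file alone if an elevator= parameter already exists.
    if grep -q 'elevator=' "$grub_file"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i -E "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"/\\1 elevator=${scheduler}\"/" "$grub_file"
}
```

Running `add_elevator /etc/default/grub deadline` followed by update-grub reproduces the manual change shown above.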