Network Virtualization: An Introduction to RDMA Programming

Posted by: 通信行业搬砖工

Published: 2023-09-07 14:36:06

This article is part of the column: Network Virtualization

Author: 木木大佬, a senior R&D expert at Meituan

Article URL:

https://zhuanlan.zhihu.com/p/646296114

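Preface

This article introduces the basics of RDMA programming. If you spot any mistakes, corrections are very welcome. I just can't sit still: even on my days off I end up studying, going wherever I'm needed and learning whatever is needed. Such is the hard life of a programmer o(╥﹏╥)o.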

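I. Terminology

1 CA (Channel Adapter)

A channel adapter is an end node in an InfiniBand network. It is the equivalent of an Ethernet network interface card (NIC), with additional InfiniBand and RDMA features. On hosts, these InfiniBand network interface cards are called host channel adapters (HCAs).

2 Queue Pair (QP): a send queue (SQ) and a receive queue (RQ), with an associated completion queue (CQ)

HCAs communicate with each other through work queues. There are three kinds of queues: (1) the send queue (SQ), (2) the receive queue (RQ), and (3) the completion queue (CQ). The SQ and RQ are always created and managed together as a queue pair (QP).

We post a work request (WR) by placing a work queue element (WQE) in a work queue: for example, (1) posting a send work request to the SQ to send data to a remote node, or (2) posting a receive work request to the RQ to receive data from a remote node. Posted work requests are processed directly by the hardware (the HCA). Once a request completes, the hardware posts a work completion (WC) to a completion queue (CQ). The programming interface is flexible: we can specify different completion queues for the SQ and RQ, or use a single CQ for the whole QP.

In short, writing an RDMA program is conceptually simple: create a QP and a CQ (plus the other data structures this requires, introduced shortly), connect the QP to a remote node, build work requests (WRs), and post them to the QP.

The HCA then carries out your requests against the connected peer.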
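To make the flow of work requests and work completions concrete, here is a toy in-memory model. It is not verbs code: the names ToyWorkRequest, ToyCompletion, and ToyQueuePair are invented for illustration, and process() merely stands in for what a real HCA does in hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Toy stand-ins (hypothetical names, not libibverbs types).
struct ToyWorkRequest { uint64_t wr_id; };
struct ToyCompletion  { uint64_t wr_id; bool success; };

class ToyQueuePair {
 public:
  // Software side: post a send work request into the SQ.
  void postSend(ToyWorkRequest wr) { send_queue_.push_back(wr); }

  // "Hardware" side: consume one request from the SQ and report a
  // completion into the CQ. Returns false when the SQ is empty.
  bool process() {
    if (send_queue_.empty()) return false;
    ToyWorkRequest wr = send_queue_.front();
    send_queue_.pop_front();
    completion_queue_.push_back({wr.wr_id, true});
    return true;
  }

  // Software side: poll the CQ. Returns false if nothing has completed yet;
  // otherwise removes the oldest completion and copies it into *wc.
  bool pollCompletion(ToyCompletion* wc) {
    if (completion_queue_.empty()) return false;
    *wc = completion_queue_.front();
    completion_queue_.pop_front();
    return true;
  }

 private:
  std::deque<ToyWorkRequest> send_queue_;
  std::deque<ToyCompletion> completion_queue_;
};
```

The point of the model is the ordering: posting a request does not complete it, and the completion only becomes visible to software once the (simulated) hardware has consumed the request.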

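II. The libibverbs API

The libibverbs library provides a user-space API for using InfiniBand HCAs. With this API, a program runs roughly as follows:

1. Create an InfiniBand context (struct ibv_context* ibv_open_device())
2. Create a protection domain (struct ibv_pd* ibv_alloc_pd())
3. Create a completion queue (struct ibv_cq* ibv_create_cq())
4. Create a queue pair (struct ibv_qp* ibv_create_qp())
5. Exchange identifier information to establish the connection
6. Change the queue pair state (ibv_modify_qp()): from RESET to INIT, then RTR (Ready to Receive), and finally RTS (Ready to Send)
7. Register a memory region (ibv_reg_mr())
8. Exchange memory region information to handle operations
9. Perform data communication

Memory region registration is often described as part of initialization (somewhere between steps 2 and 6). Delaying it causes no problems, though: memory regions can be registered and deregistered dynamically at any time after step 6, as long as this happens before the work requests that use them are posted.

I divide the program flow into two groups: steps 1~6 form the initialization phase and steps 7~9 form the runtime phase. I will discuss step 7 in more detail.

1 Create the InfiniBand context

Open the HCA and create a user-space device context.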

#include <infiniband/verbs.h>

#include <iostream>
#include <string>

struct ibv_context* createContext(const std::string& device_name) {
  /* There is no way to directly open the device with its name; we should get the list of devices first. */
  struct ibv_context* context = nullptr;
  int num_devices = 0;
  struct ibv_device** device_list = ibv_get_device_list(&num_devices);
  if (device_list == nullptr) {
    std::cerr << "Unable to get the device list" << std::endl;
    return nullptr;
  }
  for (int i = 0; i < num_devices; i++) {
    /* match device name. open the device and return it */
    if (device_name.compare(ibv_get_device_name(device_list[i])) == 0) {
      context = ibv_open_device(device_list[i]);
      break;
    }
  }

  /* it is important to free the device list; otherwise memory will be leaked. */
  ibv_free_device_list(device_list);
  if (context == nullptr) {
    std::cerr << "Unable to find or open the device " << device_name << std::endl;
  }
  return context;
}

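2 Create a protection domain

As the name implies, this creates a protection domain, which protects resources from arbitrary remote access. The components that can be registered to a protection domain are:

Memory regions (MR)
Memory windows (MW)
Queue pairs (QP)
Shared receive queues (SRQ)
Address handles (AH)

Note that completion queues (CQ) do not belong to a protection domain.

For example, registering a memory region requires a pointer to a protection domain, indicating that the memory region will be registered to that domain.

struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, ...);  (see the ibv_reg_mr manual page)

As mentioned earlier, a memory region does not have to be registered immediately after the protection domain is created. Since queue pairs are also registered to a protection domain, creating the protection domain is an early initialization step.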

struct ibv_context* context = createContext(/* device name */);
struct ibv_pd* protection_domain = ibv_alloc_pd(context);

3 Create a completion queue

Like step 2, this is a prerequisite for creating a queue pair.

struct ibv_context* context = createContext(/* device name */);

int cq_size = 0x10;
struct ibv_cq* completion_queue = ibv_create_cq(context, cq_size, nullptr, nullptr, 0);

The signature of ibv_create_cq is (see the ibv_create_cq manual page):

struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe, void *cq_context, struct ibv_comp_channel *channel, int comp_vector);

4 Create a queue pair

struct ibv_qp* createQueuePair(struct ibv_pd* pd, struct ibv_cq* cq) {
  struct ibv_qp_init_attr queue_pair_init_attr;
  memset(&queue_pair_init_attr, 0, sizeof(queue_pair_init_attr));
  queue_pair_init_attr.qp_type = IBV_QPT_RC;
  queue_pair_init_attr.sq_sig_all = 1;       // if non-zero, every work request submitted to the SQ generates a work completion.
  queue_pair_init_attr.send_cq = cq;         // completion queue can be shared or you can use distinct completion queues.
  queue_pair_init_attr.recv_cq = cq;         // completion queue can be shared or you can use distinct completion queues.
  queue_pair_init_attr.cap.max_send_wr = 1;  // increase if you want to keep more send work requests in the SQ.
  queue_pair_init_attr.cap.max_recv_wr = 1;  // increase if you want to keep more receive work requests in the RQ.
  queue_pair_init_attr.cap.max_send_sge = 1; // increase if send work requests may carry multiple scatter-gather entries (SGEs).
  queue_pair_init_attr.cap.max_recv_sge = 1; // increase if receive work requests may carry multiple scatter-gather entries (SGEs).

  return ibv_create_qp(pd, &queue_pair_init_attr);
}

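qp_type specifies the type of the queue pair. The three commonly used types are: (1) Reliable Connection (RC), (2) Unreliable Connection (UC), and (3) Unreliable Datagram (UD).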

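5 Exchange identifier information to establish the connection

6 Change the queue pair state

Immediately after creation, a queue pair is in the RESET state. In this state the queue pair does not work; we must connect it to another queue pair to make it usable. The queue pair state machine diagram is as follows.

[Figure: queue pair state machine]

To get a working queue pair, we need to use ibv_modify_qp() to move it to RTR (Ready to Receive) or RTS (Ready to Send).

int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, int attr_mask);  (see the ibv_modify_qp manual page)

Following the state machine diagram, we first change the state to INIT.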

bool changeQueuePairStateToInit(struct ibv_qp* queue_pair) {
  struct ibv_qp_attr init_attr;
  memset(&init_attr, 0, sizeof(init_attr));
  init_attr.qp_state = ibv_qp_state::IBV_QPS_INIT;
  init_attr.port_num = device_port_;  // device_port_: class member holding the HCA port number to use (e.g. 1).
  init_attr.pkey_index = 0;
  init_attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;

  // ibv_modify_qp returns 0 on success.
  return ibv_modify_qp(queue_pair, &init_attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS) == 0;
}

Then change the state to RTR.

bool changeQueuePairStateToRTR(struct ibv_qp* queue_pair, int ib_port, uint32_t destination_qp_number, uint16_t destination_local_id) {
  struct ibv_qp_attr rtr_attr;
  memset(&rtr_attr, 0, sizeof(rtr_attr));
  rtr_attr.qp_state = ibv_qp_state::IBV_QPS_RTR;
  rtr_attr.path_mtu = ibv_mtu::IBV_MTU_1024;
  rtr_attr.rq_psn = 0;
  rtr_attr.max_dest_rd_atomic = 1;
  rtr_attr.min_rnr_timer = 0x12;
  rtr_attr.ah_attr.is_global = 0;
  rtr_attr.ah_attr.sl = 0;
  rtr_attr.ah_attr.src_path_bits = 0;
  rtr_attr.ah_attr.port_num = ib_port;
  
  rtr_attr.dest_qp_num = destination_qp_number;
  rtr_attr.ah_attr.dlid = destination_local_id;

  return ibv_modify_qp(queue_pair, &rtr_attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER) == 0 ? true : false;
}

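Note the parameters here:

ib_port is the port number this queue pair uses on the host. You can use ibstat to see how many ports a device has and what their numbers are.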

$ ibstat
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.23.1020
        Hardware version: 0
        Node GUID: /* omitted */
        System image GUID: /* omitted */
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 13
                LMC: 0
                SM lid: 9
                Capability mask: /* omitted */
                Port GUID: /* omitted */
                Link layer: InfiniBand

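Install infiniband-diags if you want to use the ibstat command.

My CA has one port, and its port number is 1. You can pass this information manually when launching the program.

For a queue pair to connect to another queue pair and become ready to receive, it must know some information about the peer QP: destination_qp_number and destination_local_id.

destination_local_id: the local identifier (LID) of the port within the subnet the HCA is attached to. The subnet manager assigns it to each port, and it is unique within the subnet. (For communication across subnets, a GID, or global ID, is used instead.) To look up the local ID, you can use the following function.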

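uint16_t getLocalId(struct ibv_context* context, int ib_port) {
  ibv_port_attr port_attr;
  // ibv_query_port returns 0 on success; on failure port_attr is not valid.
  if (ibv_query_port(context, ib_port, &port_attr) != 0) {
    return 0;  // LID 0 is reserved, so it is used here to signal failure.
  }
  return port_attr.lid;
}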

destination_qp_number: once a queue pair has been created, it has already been assigned a unique identifier, its QP number.

uint32_t getQueuePairNumber(struct ibv_qp* qp) {
  return qp->qp_num;
}

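Note that these functions return the local node's information, not the destination's (the peer's). This means both sides must call them and then exchange the results to learn each other's destination information. For this purpose, all the examples I referred to use TCP sockets: before performing any RDMA operation, the server and client establish a TCP connection and exchange their local IDs and QP numbers. This is why step 5 is about exchanging identifier information. The TCP connection is also used in step 8 to let the peer know about the memory region.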
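The identifier exchange can be sketched as a small fixed-size message. The following is a minimal sketch under stated assumptions: the QueuePairInfo struct, the 6-byte big-endian layout, and the serialize/deserialize helpers are invented for illustration and are not part of libibverbs. Any layout works as long as both sides agree on it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical message: what each side tells the other before going to RTR.
struct QueuePairInfo {
  uint16_t local_id;   // LID, e.g. from getLocalId()
  uint32_t qp_number;  // QP number, e.g. from getQueuePairNumber()
};

// Assumed wire layout: 2-byte LID followed by 4-byte QP number, big-endian.
constexpr std::size_t kQueuePairInfoWireSize = 6;

std::array<uint8_t, kQueuePairInfoWireSize> serialize(const QueuePairInfo& info) {
  std::array<uint8_t, kQueuePairInfoWireSize> bytes{};
  bytes[0] = static_cast<uint8_t>(info.local_id >> 8);
  bytes[1] = static_cast<uint8_t>(info.local_id & 0xFF);
  bytes[2] = static_cast<uint8_t>(info.qp_number >> 24);
  bytes[3] = static_cast<uint8_t>((info.qp_number >> 16) & 0xFF);
  bytes[4] = static_cast<uint8_t>((info.qp_number >> 8) & 0xFF);
  bytes[5] = static_cast<uint8_t>(info.qp_number & 0xFF);
  return bytes;
}

QueuePairInfo deserialize(const std::array<uint8_t, kQueuePairInfoWireSize>& bytes) {
  QueuePairInfo info;
  info.local_id = static_cast<uint16_t>((bytes[0] << 8) | bytes[1]);
  info.qp_number = (static_cast<uint32_t>(bytes[2]) << 24) |
                   (static_cast<uint32_t>(bytes[3]) << 16) |
                   (static_cast<uint32_t>(bytes[4]) << 8) |
                   static_cast<uint32_t>(bytes[5]);
  return info;
}
```

Each side would send its own serialized bytes over the TCP socket, decode what it receives, and pass the peer's values to changeQueuePairStateToRTR() as destination_local_id and destination_qp_number. Fixing the byte order explicitly avoids surprises when the two hosts differ in endianness.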

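After calling changeQueuePairStateToRTR(), the queue pair is able to receive data by posting receive work requests. To make it able to send data as well, the state must be moved one step further, to RTS.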

bool changeQueuePairStateToRTS(struct ibv_qp* queue_pair) {
  struct ibv_qp_attr rts_attr;
  memset(&rts_attr, 0, sizeof(rts_attr));
  rts_attr.qp_state = ibv_qp_state::IBV_QPS_RTS;
  rts_attr.timeout = 0x12;
  rts_attr.retry_cnt = 7;
  rts_attr.rnr_retry = 7;
  rts_attr.sq_psn = 0;
  rts_attr.max_rd_atomic = 1;

  return ibv_modify_qp(queue_pair, &rts_attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC) == 0;
}

Since the peer information was already stored during the RTR step, changing the state to RTS requires no further peer information.
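The bring-up order described in this step can be written down as a tiny transition check. This is only a sketch: ToyQpState and the helper functions are hypothetical names, real code uses enum ibv_qp_state, and states outside the RESET, INIT, RTR, RTS sequence are deliberately not modeled.

```cpp
#include <cassert>

// Hypothetical mirror of the four states discussed here.
enum class ToyQpState { Reset, Init, Rtr, Rts };

// Returns true if `to` is the state that directly follows `from` in the
// bring-up order RESET -> INIT -> RTR -> RTS.
bool isNextBringUpState(ToyQpState from, ToyQpState to) {
  switch (from) {
    case ToyQpState::Reset: return to == ToyQpState::Init;
    case ToyQpState::Init:  return to == ToyQpState::Rtr;
    case ToyQpState::Rtr:   return to == ToyQpState::Rts;
    case ToyQpState::Rts:   return false;  // already at the end of the sequence
  }
  return false;
}

// A receive-only endpoint can stop at RTR; sending requires RTS.
bool canReceive(ToyQpState s) { return s == ToyQpState::Rtr || s == ToyQpState::Rts; }
bool canSend(ToyQpState s) { return s == ToyQpState::Rts; }
```

A wrapper around the three changeQueuePairStateTo...() functions could use such a check to refuse out-of-order calls before they reach ibv_modify_qp().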

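7 Register a memory region

Everything up to step 6 is the initialization phase. After step 6, the queue pairs can communicate with each other: posted work requests are forwarded to the opposite node, whose HCA consumes them. Of course, if no memory region has been registered yet, such requests are rejected with a permission error; a work request cannot read or write anything in the peer's memory. In other words, a memory region only has to be registered before the peer posts work requests that access it.

Before going further, let's look at the operation types InfiniBand supports:

Send / Send with Immediate: [requires RTS state] sends data to the receive queue of the remote QP.
Receive: [requires RTR/RTS state] the counterpart of the send operation; the host is notified when data arrives in the receive buffer.
RDMA Read: [requires RTS state] reads data from remote memory. The remote side is not aware that the operation is taking place.
RDMA Write / RDMA Write with Immediate: [requires RTS state] writes data to remote memory. The remote side is not aware of a plain RDMA Write; the with-immediate variant consumes a receive work request on the remote side and therefore notifies it.
Atomic Fetch and Add / Atomic Compare and Swap

Personally, I think the main reason memory region registration is mostly done in the initialization phase is the RDMA operations. For a receive operation, the remote side actively posts a receive work request, so it can decide when to register the memory region (just before posting that request). RDMA reads and RDMA writes, by contrast, complete without any action on the remote node, so the memory region must be registered in advance. Again, mechanically there is nothing wrong with initializing a queue pair without registering a memory region; the HCA being unable to read from or write to the remote node's memory is a runtime issue. To register a memory region, use ibv_reg_mr():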

struct ibv_mr* registerMemoryRegion(struct ibv_pd* pd, void* buffer, size_t size) {
  return ibv_reg_mr(pd, buffer, size, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE);
}

The access flags specify which kinds of access this memory region permits. Five flags are provided (see the ibv_reg_mr manual page):

IBV_ACCESS_LOCAL_WRITE: enable local write access
IBV_ACCESS_REMOTE_WRITE: enable remote write access
IBV_ACCESS_REMOTE_READ: enable remote read access
IBV_ACCESS_REMOTE_ATOMIC: enable remote atomic operation access
IBV_ACCESS_MW_BIND: enable memory window binding

If IBV_ACCESS_REMOTE_WRITE or IBV_ACCESS_REMOTE_ATOMIC is set, IBV_ACCESS_LOCAL_WRITE must be set as well.
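The rule that remote write or remote atomic access requires local write access can be checked before calling ibv_reg_mr(). The sketch below uses stand-in constants (ToyAccessFlags, with illustrative bit positions) so that it compiles without the verbs headers; real code should test the IBV_ACCESS_* enumerators from <infiniband/verbs.h> instead, and isValidAccessCombination is a hypothetical helper name.

```cpp
#include <cassert>

// Stand-in flags for illustration only; use IBV_ACCESS_* in real code.
enum ToyAccessFlags : unsigned {
  kLocalWrite   = 1u << 0,
  kRemoteWrite  = 1u << 1,
  kRemoteRead   = 1u << 2,
  kRemoteAtomic = 1u << 3,
  kMwBind       = 1u << 4,
};

// Returns true if the combination respects the rule:
// remote write or remote atomic access implies local write access.
bool isValidAccessCombination(unsigned flags) {
  const bool needs_local_write = (flags & (kRemoteWrite | kRemoteAtomic)) != 0;
  const bool has_local_write = (flags & kLocalWrite) != 0;
  return !needs_local_write || has_local_write;
}
```

For instance, the combination used in registerMemoryRegion() above (local write, remote read, remote write) passes this check, while remote write on its own does not.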

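8 Exchange memory region information to handle operations

9 Perform data communication

The memory region information is required for communication, so it too must be sent over the TCP channel.

Here is the data structure of a registered memory region, as returned by ibv_reg_mr():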

struct ibv_mr {
	struct ibv_context     *context;
	struct ibv_pd	       *pd;
	void		       *addr;
	size_t			length;
	uint32_t		handle;
	uint32_t		lkey;
	uint32_t		rkey;
};

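The fields we actually need to transfer are:

address
length
lkey (local key)
rkey (remote key)

The other fields are not required, but simply sending the whole structure as a byte stream is sufficient. With the memory region information in hand, you can use RDMA between the two devices through ibv_post_send() and ibv_post_recv().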
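Once the peer's memory region information has arrived, the requester typically keeps it in a small descriptor and derives the remote address for each RDMA request from it. Here is a minimal sketch, assuming a hypothetical RemoteMemoryRegion struct and resolveRemoteAddress helper (neither is part of libibverbs). It keeps only the fields the requester puts into a work request (remote address and rkey) plus the length for a bounds check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical descriptor holding what the peer told us about its memory region.
struct RemoteMemoryRegion {
  uint64_t addr;    // peer's virtual address of the start of the region
  uint64_t length;  // region size in bytes
  uint32_t rkey;    // remote key to place in the work request
};

// If [offset, offset + size) lies inside the region, writes the absolute
// remote address to *remote_addr and returns true; otherwise returns false.
bool resolveRemoteAddress(const RemoteMemoryRegion& mr, uint64_t offset, uint64_t size,
                          uint64_t* remote_addr) {
  // Written so that offset + size cannot overflow.
  if (offset > mr.length || size > mr.length - offset) return false;
  *remote_addr = mr.addr + offset;
  return true;
}
```

Checking the range locally is optional, since the remote HCA rejects out-of-range accesses anyway, but it turns a failed work completion into an error that is caught before the request is posted.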

Send/Recv：

bool postReceiveRequest(const std::vector<struct ibv_mr*>& sge) {
  struct ibv_recv_wr receive_wr, *bad_wr = nullptr;
  memset(&receive_wr, 0, sizeof(receive_wr));

  // RDMA supports scatter-gather I/O.
  // For a RECV operation, it works as scatter; received data is scattered into the registered MRs.
  std::vector<struct ibv_sge> receive_sge(sge.size());
  for (size_t i = 0; i < sge.size(); i++) {
    receive_sge[i].addr = (uintptr_t) sge[i]->addr;
    receive_sge[i].length = sge[i]->length;
    receive_sge[i].lkey = sge[i]->lkey;
  }

  receive_wr.sg_list = receive_sge.data();
  receive_wr.num_sge = sge.size();
  // wr_id is used for identification.
  // When a request fails, ibv_poll_cq() returns a work completion (struct ibv_wc) with the specified wr_id.
  // If the wr_id is 100, we can easily find out that this RECV request failed.
  receive_wr.wr_id = 100;
  // You can chain several receive requests to reduce software overhead, and hence improve latency.
  receive_wr.next = nullptr;

  // If posting fails, the address of the failed WR among the chained WRs is stored in bad_wr.
  // queue_pair_ is a class member holding the connected queue pair.
  return ibv_post_recv(queue_pair_, &receive_wr, &bad_wr) == 0;
}

bool postSendRequest(const std::vector<struct ibv_mr*>& sge) {
  struct ibv_send_wr send_wr, *bad_wr = nullptr;
  memset(&send_wr, 0, sizeof(send_wr));

  // For a SEND operation, the list works as gather; data is collected from the registered MRs.
  std::vector<struct ibv_sge> send_sge(sge.size());
  for (size_t i = 0; i < sge.size(); i++) {
    send_sge[i].addr = (uintptr_t) sge[i]->addr;
    send_sge[i].length = sge[i]->length;
    send_sge[i].lkey = sge[i]->lkey;
  }

  send_wr.sg_list = send_sge.data();
  send_wr.num_sge = sge.size();
  send_wr.wr_id = 200;

  // All WRs that are posted into the Send Queue (SQ) are posted via ibv_send_wr.
  // The opcode specifies which operation you want to perform.
  send_wr.opcode = IBV_WR_SEND;
  // With the IBV_SEND_SIGNALED flag, the hardware creates a work completion (WC) entry in the completion queue connected to the send queue.
  // You can wait with ibv_poll_cq() until the operation finishes.
  send_wr.send_flags = IBV_SEND_SIGNALED;
  send_wr.next = nullptr;

  // queue_pair_ is a class member holding the connected queue pair.
  return ibv_post_send(queue_pair_, &send_wr, &bad_wr) == 0;
}

RDMA READ / RDMA WRITE：

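RDMA WRITE and RDMA READ requests are posted the same way as in postSendRequest(), but with a different opcode (see the ibv_post_send manual page):

RDMA READ: IBV_WR_RDMA_READ
RDMA WRITE: IBV_WR_RDMA_WRITE

For RDMA read/write, additional parameters must be specified in ibv_send_wr. For an RDMA read, for example: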

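bool postRDMAReadRequest(const std::vector<struct ibv_mr*>& sge, struct ibv_mr* peer_memory_region) {
  struct ibv_send_wr rdma_wr, *bad_wr = nullptr;
  memset(&rdma_wr, 0, sizeof(rdma_wr));
  std::vector<struct ibv_sge> read_sge(sge.size());  // local list, built as in postSendRequest()
  for (size_t i = 0; i < sge.size(); i++) {
    read_sge[i] = {(uintptr_t) sge[i]->addr, (uint32_t) sge[i]->length, sge[i]->lkey};
  }
  rdma_wr.sg_list = read_sge.data();
  rdma_wr.num_sge = sge.size();
  rdma_wr.send_flags = IBV_SEND_SIGNALED;
  // RDMA-specific part: the opcode, plus the remote address and rkey received from the peer.
  rdma_wr.opcode = IBV_WR_RDMA_READ;
  rdma_wr.wr.rdma.remote_addr = (uintptr_t) peer_memory_region->addr;
  rdma_wr.wr.rdma.rkey = peer_memory_region->rkey;
  return ibv_post_send(queue_pair_, &rdma_wr, &bad_wr) == 0;
}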

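The function postRDMAReadRequest reads data from peer_memory_region on the remote peer node and scatters it into the ibv_mrs listed in sge.

Polling for completion

When the device finishes an operation, it creates a corresponding work completion (WC) entry in the connected completion queue (the completion queue is specified when the queue pair is created).

Polling is not the only way to detect work completions. RDMA also provides a notification mechanism; however, polling usually detects completions faster (lower latency), because notification involves several context switches, process scheduling, and so on.

We use ibv_poll_cq to poll the completion queue. This is busy polling, so it occupies a CPU core, but it provides lower latency.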

bool pollCompletion(struct ibv_cq* cq) {
  struct ibv_wc wc;
  int result;

  do {
    // ibv_poll_cq returns the number of WCs it retrieved from the CQ.
    // If it is 0, no new work completion has arrived yet.
    // The second argument is the maximum number of WCs to retrieve, and the third must point to
    // an array with at least that many entries. Here wc is a single struct, so the count must be 1;
    // passing more would write past wc and corrupt the stack.
    result = ibv_poll_cq(cq, 1, &wc);
  } while (result == 0);

  if (result < 0) {
    // The poll itself failed; wc was not filled in.
    printf("ibv_poll_cq failed with return value %d\n", result);
    return false;
  }

  if (wc.status == ibv_wc_status::IBV_WC_SUCCESS) {
    // success
    return true;
  }

  // You can identify which WR failed with wc.wr_id.
  printf("Poll failed with status %s (work request ID: %llu)\n", ibv_wc_status_str(wc.status), (unsigned long long) wc.wr_id);
  return false;
}

ibv_poll_cq returns the number of WCs it retrieved. Since we asked for at most 1 WC, the result must be 0, 1, or a negative value if an error occurred.
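Because the loop in pollCompletion() spins until something arrives, a lost completion would hang the caller forever. One possible variation, sketched here under assumptions, is to bound the number of spins. pollWithSpinLimit is a hypothetical helper, and poll_once stands for any callable that behaves like a single ibv_poll_cq(cq, 1, &wc) call.

```cpp
#include <cassert>

// poll_once must return 0 when nothing has completed, a positive count when
// a completion was retrieved, and a negative value on error.
template <typename PollOnce>
int pollWithSpinLimit(PollOnce poll_once, long max_spins) {
  for (long i = 0; i < max_spins; i++) {
    int result = poll_once();
    if (result != 0) return result;  // completion retrieved (> 0) or error (< 0)
  }
  return 0;  // gave up: nothing completed within the spin budget
}
```

In real code the callable would be a lambda wrapping ibv_poll_cq, and a return value of 0 would be treated as a timeout by the caller. The spin budget is a trade-off: a larger budget keeps the low latency of busy polling for longer, at the cost of more CPU time spent when nothing arrives.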

Originally published: 2023-07-29. In case of infringement, please contact cloudcommunity@tencent.com for removal.