libfabric (OFA): Introduction, Guide, Design Philosophy, High-Performance Networking, Part 5

Original article by 晓兵. Last modified 2025-06-13 10:39:43. Included in the column: DPU.

Previous part: https://cloud.tencent.com/developer/article/2531046

Address Vectors

A primary goal of address vectors is to allow applications to communicate with thousands to millions of peers while minimizing the amount of data needed to store peer addressing information. It pushes fabric specific addressing details away from the application to the provider. This allows the provider to optimize how it converts addresses into routing data, and enables data compression techniques that may be difficult for an application to achieve without being aware of low-level fabric addressing details. For example, providers may be able to algorithmically calculate addressing components, rather than storing the data locally. Additionally, providers can communicate with resource management entities or fabric manager agents to obtain quality of service or other information about the fabric, in order to improve network utilization.

An equally important objective is ensuring that the resulting interfaces, particularly data transfer operations, are fast and easy to use. Conceptually, an address vector converts an endpoint address into an fi_addr_t. The fi_addr_t (fabric interface address datatype) is a 64-bit value that is used in all ‘fast-path’ operations – data transfers and completions.

Address vectors are associated with domain objects. This allows providers to implement portions of an address vector, such as quality of service mappings, in hardware.

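Address vectors are used to map higher-level addresses, which may be more natural for an application to use, into fabric-specific addresses. The mapping is fabric and provider specific, and it may involve lengthy address resolution and fabric management protocols. AV operations are synchronous by default, but they can be made asynchronous by passing the FI_EVENT flag to fi_av_open. When asynchronous operation is requested, the application must bind an event queue to the AV before inserting addresses. For AV restrictions on duplicate addresses, see the notes section of the fi_av man page.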

AV Attributes

Address vectors are created with the following attributes. Select attribute details are discussed below:

struct fi_av_attr {
    enum fi_av_type type;
    int rx_ctx_bits;
    size_t count;
    size_t ep_per_node;
    const char *name;
    void *map_addr;
    uint64_t flags;
};

AV Type

There are two types of address vectors. The type refers to the format of the returned fi_addr_t values for addresses that are inserted into the AV. With type FI_AV_TABLE, returned addresses are simple indices, and developers may think of the AV as an array of addresses. Each address that is inserted into the AV is mapped to the index of the next free array slot. The advantage of FI_AV_TABLE is that applications can refer to peers using a simple index, eliminating an application's need to store any addressing data. That is, the application can generate the fi_addr_t values itself. This type maps well to applications, such as MPI, where a peer is referenced by rank.

The second type is FI_AV_MAP. This type does not define any specific format for the fi_addr_t value. Applications that use type map are required to provide the correct fi_addr_t for a given peer when issuing a data transfer operation. The advantage of FI_AV_MAP is that a provider can use the fi_addr_t to encode the target’s address, which avoids retrieving the data from memory. As a simple example, consider a fabric that uses TCP/IPv4 based addressing. An fi_addr_t is large enough to contain the address, which allows a provider to copy the data from the fi_addr_t directly into an outgoing packet.
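To make the TCP/IPv4 example concrete, here is a minimal sketch of how a 64-bit value can carry a complete IPv4 address and port. The type `sketch_addr_t`, the three helper functions, and the bit layout are all hypothetical and chosen only for illustration. The text does not say how any real provider encodes its fi_addr_t values, and with FI_AV_MAP the application is expected to treat them as opaque.

```c
#include <stdint.h>

/* Hypothetical stand-in for fi_addr_t, which the text describes as 64 bits. */
typedef uint64_t sketch_addr_t;

/* Pack an IPv4 address and port (both in host byte order) into one value.
 * Assumed layout: bits 47..16 hold the address, bits 15..0 hold the port. */
static sketch_addr_t sketch_encode_ipv4(uint32_t ip, uint16_t port)
{
    return ((sketch_addr_t)ip << 16) | port;
}

/* Recover the IPv4 address from an encoded value. */
static uint32_t sketch_decode_ip(sketch_addr_t addr)
{
    return (uint32_t)(addr >> 16);
}

/* Recover the port from an encoded value. */
static uint16_t sketch_decode_port(sketch_addr_t addr)
{
    return (uint16_t)(addr & 0xFFFF);
}
```

With an encoding like this, a provider could build an outgoing packet header from the address value alone, without a table lookup, which is the memory access that the paragraph above says FI_AV_MAP can avoid.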

AV Rx Context Bits

The rx_ctx_bits field is only used with scalable endpoints that have named receive contexts, and is best described using an example. A peer process has allocated a scalable endpoint with two receive contexts. The first receive context will be used for control messages, with data messages targeting the second context. Named contexts are a feature that allows the initiator to select which context will receive a message. If the initiating application wishes to send a data message, it must indicate that the message should be steered to the second context.

The rx_ctx_bits allocates a specific number of bits out of the fi_addr_t value that will be used to indicate which context a given operation will target. Applications should reserve a number of bits large enough to indicate any context at the target. For example, two bits is sufficient to target one of four different receive contexts.

The function fi_rx_addr() converts a target fi_addr_t address, along with the requested receive context index, into a new fi_addr_t value that may be used to transfer data to a scalable endpoint’s receive context.

fi_addr_t fi_rx_addr(fi_addr_t fi_addr, int rx_index, int rx_ctx_bits);

This call is simply a wrapper that writes the rx_index into the space that was reserved in the fi_addr_t value.
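The bit reservation can be sketched in a few lines. The version below is an illustration, not the library's code: `sketch_addr_t`, `sketch_rx_addr()`, and `sketch_rx_index()` are hypothetical names, and placing the context index in the high-order bits is an assumption made for the sketch. Applications should call fi_rx_addr() itself rather than build these values by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the 64-bit fi_addr_t. */
typedef uint64_t sketch_addr_t;

/* Write rx_index into the top rx_ctx_bits bits of addr.
 * Assumes those bits were left clear when addr was returned by the AV. */
static sketch_addr_t sketch_rx_addr(sketch_addr_t addr, int rx_index,
                                    int rx_ctx_bits)
{
    assert(rx_ctx_bits > 0 && rx_ctx_bits < 64);
    assert(rx_index >= 0 && (uint64_t)rx_index < (1ULL << rx_ctx_bits));
    return ((sketch_addr_t)rx_index << (64 - rx_ctx_bits)) | addr;
}

/* Recover the receive context index from a combined address. */
static int sketch_rx_index(sketch_addr_t addr, int rx_ctx_bits)
{
    return (int)(addr >> (64 - rx_ctx_bits));
}
```

With two reserved bits, index 1 selects the second receive context in the scenario above, the one that receives data messages, and the remaining 62 bits still identify the peer.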

Sharing AVs Between Processes

Large scale parallel programs typically run with multiple processes allocated on each node. Because these processes communicate with the same set of peers, the addressing data needed by each process is the same. Libfabric defines a mechanism by which processes running on the same node may share their address vectors. This allows a system to maintain a single copy of addressing data, rather than one copy per process.

Although libfabric does not mandate how sharing an address vector is implemented, the interfaces map well to using shared memory. Address vectors that will be shared are given an application-specific name. How an application selects a name that avoids conflicts with unrelated processes, or how it communicates the name to peer processes, is outside the scope of libfabric.

In addition to having a name, a shared AV also has a base map address -- map_addr. Use of map_addr is only important for address vectors that are of type FI_AV_MAP, and allows applications to share fi_addr_t values. From the viewpoint of the application, the map_addr is the base value for all fi_addr_t values. A common use for map_addr is for the process that creates the initial address vector to request a value from the provider, exchange the returned map_addr with its peers, and for the peers to open the shared AV using the same map_addr. This allows the fi_addr_t values to be stored in shared memory that is accessible by all peers.

Wait and Poll Sets

As mentioned, most libfabric operations involve asynchronous processing, with completions reported to event queues, completion queues, and counters. Wait sets and poll sets were created to help manage and optimize checking for completed requests across multiple objects.

A poll set is a collection of event queues, completion queues, and counters. Applications use a poll set to check if a new completion event has arrived on any of its associated objects. When events occur infrequently or to one of several completion reporting objects, using a poll set can improve application efficiency by reducing the number of calls that the application makes into the libfabric provider. Applications should consider a poll set when they use at least two completion reporting structures and checking them is likely to find that no new events have occurred.

A wait set is similar to a poll set, and is often used in conjunction with one. In ideal implementations, a wait set is associated with a single wait object, such as a file descriptor. All event / completion queues and counters associated with the wait set will be configured to signal that wait object when an event occurs. This minimizes the system resources that are necessary to support applications waiting for events.

Poll Set

The poll set API is fairly straightforward.

int fi_poll_open(struct fid_domain *domain, struct fi_poll_attr *attr,
    struct fid_poll **pollset);
int fi_poll_add(struct fid_poll *pollset, struct fid *event_fid,
    uint64_t flags);
int fi_poll_del(struct fid_poll *pollset, struct fid *event_fid,
    uint64_t flags);
int fi_poll(struct fid_poll *pollset, void **context, int count);

Applications call fi_poll_open() to allocate a poll set. The attribute structure is a placeholder for future extensions and contains a single flags field, which is reserved. To add and remove event queues, completion queues, and counters, the fi_poll_add() and fi_poll_del() calls are used. As with open, the flags parameter is for extensibility and should be 0. Once objects have been associated with the poll set, an app may call fi_poll() to retrieve a list of objects that may have new events available.

struct my_cq *tx_cq, *rx_cq; /* embeds fid_cq, configured separately */
struct fid_poll *pollset;
struct fi_poll_attr attr = {0};
void *cq_context;
int ret;

/* Allocate and add CQs to poll set */
fi_poll_open(domain, &attr, &pollset);
fi_poll_add(pollset, &tx_cq->cq.fid, 0);
fi_poll_add(pollset, &rx_cq->cq.fid, 0);

/* Check for events; at most one context is returned per call here */
ret = fi_poll(pollset, &cq_context, 1);
if (ret == 1) {
    /* CQ had some activity; the read may still find it empty */
    struct my_cq *cq = cq_context;
    struct fi_cq_msg_entry entry;
    fi_cq_read(&cq->cq, &entry, 1);
}

It’s worth noting that fi_poll() returns a list of objects that have experienced some level of activity since they were last checked. However, an object appearing in the poll output does not guarantee that an event is actually available. For example, fi_poll() may return the context associated with a completion queue, but an app may find that queue empty when reading it. This behavior is permissible by the API and is the result of potential provider implementation details.

One reason this can occur is if an entry is added to the completion queue, but that entry should not be reported to the application. For example, the completion may correspond to a message sent or received as part of the provider’s protocol, and may not correspond to an application operation. The fi_poll() routine simply reports whether the queue is empty or not. It is not intended for event processing, which is deferred until the queue is read in order to avoid additional software queuing overhead.

Wait Sets

The wait set API is actually smaller than the poll set.

int fi_wait_open(struct fid_fabric *fabric, struct fi_wait_attr *attr,
    struct fid_wait **waitset);
int fi_wait(struct fid_wait *waitset, int timeout);

struct fi_wait_attr {
    enum fi_wait_obj wait_obj;
    uint64_t flags;
};

The type of wait object that the wait set should use is specified through the wait attribute structure. Unlike poll sets, a wait set is associated with event queues, completion queues, and counters during their creation. This is necessary so that system resources, such as file descriptors, can be properly allocated and configured. Applications can block until the wait set’s wait object is signaled using the fi_wait() call. Or an application can use an fi_control() call to retrieve the native wait object for use directly with system calls, such as poll() and select().

A wait set is signaled whenever an event is added to one of its associated objects which would trigger the signal. In many cases, the wait set is signaled when any new event occurs; however, some objects will delay signaling the wait object until a threshold is crossed.

Because wait sets are responsible for linking completion reporting objects with wait objects, they can only indicate when a wait object has been signaled. A wait set cannot identify which object was responsible for signaling the wait object. Once a wait has been satisfied, applications are responsible for checking all completion structures for events. One simple way to accomplish this is to place all objects sharing a wait set into a peer poll set.

Using Native Wait Objects: TryWait

There is an important difference between using libfabric completion objects, versus sockets, that may not be obvious from the discussions so far. With sockets, the object that is signaled is the same object that abstracts the queues, namely the file descriptor. When data is received on a socket, that data is placed in a queue associated directly with the fd. Reading from the fd retrieves that data. If an application wishes to block until data arrives on a socket, it calls select() or poll() on the fd. The fd is signaled when a message is received, which releases the blocked thread, allowing it to read the fd.

By associating the wait object with the underlying data queue, applications are exposed to an interface that is easy to use and race free. If data is available to read from the socket at the time select() or poll() is called, those calls simply return that the fd is readable.
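The race-free behavior described above can be observed with a short POSIX sketch. It uses a pipe rather than a socket to stay self-contained, which is a substitution made for this illustration, and the function names `sketch_fd_readable()` and `sketch_demo_level_triggered()` are hypothetical. The point it shows is that readiness follows the contents of the kernel queue: data written before poll() is called is still reported.

```c
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd is readable right now, 0 if not, -1 on error.
 * A zero timeout makes poll() report the current state without blocking. */
static int sketch_fd_readable(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, 0);

    if (ret < 0)
        return -1;
    return (ret == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Queue one byte in a pipe before polling it, then drain it.
 * Returns 0 if readiness matched the queue contents at every step. */
static int sketch_demo_level_triggered(void)
{
    int fds[2];
    char byte = 'x';
    int before, queued, drained;
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;

    before = sketch_fd_readable(fds[0]);   /* queue empty: expect 0 */
    if (write(fds[1], &byte, 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    queued = sketch_fd_readable(fds[0]);   /* data already queued: expect 1 */
    n = read(fds[0], &byte, 1);
    drained = sketch_fd_readable(fds[0]);  /* queue empty again: expect 0 */

    close(fds[0]);
    close(fds[1]);
    return (before == 0 && queued == 1 && n == 1 && drained == 0) ? 0 : -1;
}
```

Because the kernel owns both the queue and the wait object, there is no window in which data is queued but the fd reports not ready.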

There are a couple of significant disadvantages to this approach, which have been discussed previously, but from different perspectives. The first is that every socket must be associated with its own fd. There is no way to share a wait object among multiple sockets. (This is a main reason for the development of epoll semantics). The second is that the queue is maintained in the kernel, so that the select() and poll() calls can check them.

Libfabric separates the wait object from the queues. For applications that use libfabric interfaces to wait for events, such as fi_cq_sread and fi_wait, this separation is mostly hidden from the application. The exception is that applications may receive a signal (e.g. fi_wait() returns success), but no events are retrieved when a queue is read. This separation allows the queues to reside in the application's memory space, while wait objects may still use kernel components. A reason for the latter is that wait objects may be signaled as part of system interrupt processing, which would go through a kernel driver.

Applications that want to use native wait objects (e.g. file descriptors) directly in operating system calls must perform an additional step in their processing. In order to handle race conditions that can occur between inserting an event into a completion or event object and signaling the corresponding wait object, libfabric defines a ‘trywait’ function. The fi_trywait implementation is responsible for handling potential race conditions which could result in an application either losing events or hanging. The following example demonstrates the use of fi_trywait.

/* Get the native wait object -- an fd in this case */
fi_control(&cq->fid, FI_GETWAIT, (void *) &fd);

/* fi_trywait() takes an array of fid pointers */
wait_fids[0] = &cq->fid;

while (1) {
    ret = fi_trywait(fabric, wait_fids, 1);
    if (ret == FI_SUCCESS) {
        /* It's safe to block on the fd. select() overwrites its
         * fd sets, so rebuild the set on every iteration. */
        FD_ZERO(&fds);
        FD_SET(fd, &fds);
        select(fd + 1, &fds, NULL, NULL, &timeout);
    } else if (ret == -FI_EAGAIN) {
        /* Read and process all completions from the CQ */
        do {
            ret = fi_cq_read(cq, &comp, 1);
        } while (ret > 0);
    } else {
        /* something really bad happened */
    }
}

In this example, the application has allocated a CQ with an fd as its wait object. It calls select() on the fd. Before calling select(), the application must call fi_trywait() successfully (return code of FI_SUCCESS). Success indicates that a blocking operation can now be invoked on the native wait object without fear of the application hanging or events being lost. If fi_trywait() returns -FI_EAGAIN, it usually indicates that there are queued events to process.
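To see why the check must come before blocking, consider a toy model in which the event queue lives in user space and a pipe plays the role of the native wait object. Everything here is hypothetical: `struct sketch_cq`, the `sketch_*` functions, and in particular the "arming" rule, under which the producer signals the fd only if a waiter has asked to be signaled. This is one possible design assumed for the sketch, not a description of how any libfabric provider implements fi_trywait().

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Toy completion queue: events are counted in user space,
 * and the read end of a pipe acts as the wait object. */
struct sketch_cq {
    int pending;     /* number of queued events */
    int armed;       /* nonzero once a waiter asked to be signaled */
    int wait_fd[2];  /* wait_fd[0] is the "native wait object" */
};

static int sketch_cq_open(struct sketch_cq *cq)
{
    cq->pending = 0;
    cq->armed = 0;
    return pipe(cq->wait_fd);
}

/* Producer: queue an event, signaling the fd only if it is armed. */
static void sketch_cq_push(struct sketch_cq *cq)
{
    char token = 1;

    cq->pending++;
    if (cq->armed && write(cq->wait_fd[1], &token, 1) == 1)
        cq->armed = 0;
}

/* Consumer: returns 1 if an event was dequeued, 0 if the queue was empty. */
static int sketch_cq_read(struct sketch_cq *cq)
{
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}

/* Returns 0 when blocking on the fd is safe, -EAGAIN when events are queued. */
static int sketch_trywait(struct sketch_cq *cq)
{
    if (cq->pending > 0)
        return -EAGAIN;
    cq->armed = 1;
    return 0;
}

/* Nonblocking check of the wait object: 1 if signaled, 0 otherwise. */
static int sketch_fd_signaled(struct sketch_cq *cq)
{
    struct pollfd pfd = { .fd = cq->wait_fd[0], .events = POLLIN };

    return poll(&pfd, 1, 0) == 1;
}
```

If an event is pushed before the queue is armed, the fd stays silent, so an application that blocked on it directly would hang with an event sitting in the queue. Calling `sketch_trywait()` first returns -EAGAIN in that case, which tells the caller to drain the queue instead. Once the queue is empty, the call arms the fd and the next push signals it.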

Putting It All Together

MSG EP pingpong

RDM EP pingpong

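Original-content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com to request removal.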
