libfabric_ofa_简介_指南_设计思想_高性能网络1

原创 · 晓兵 · 修改于 2025-06-13 02:37:40 · 专栏:DPU

OFI指导

参考:

开发指南(设计思想): https://github.com/ofiwg/ofi-guide/blob/master/OFIGuide.md

地址向量: https://ofiwg.github.io/libfabric/v1.15.2/man/fi_av.3.html

编程指南: https://ofiwg.github.io/libfabric/main/man/fi_guide.7.html

Introduction OFI简介

OpenFabrics Interfaces, or OFI, is a framework focused on exporting fabric communication services to applications. OFI is specifically designed to meet the performance and scalability requirements of high-performance computing (HPC) applications, such as MPI, SHMEM, PGAS, DBMS, and enterprise applications, running in a tightly coupled network environment. The key components of OFI are: application interfaces, provider libraries, kernel services, daemons, and test applications.

Libfabric is a core component of OFI. It is the library that defines and exports the user-space API of OFI, and is typically the only software that applications deal with directly. Libfabric is agnostic to the underlying networking protocols, as well as the implementation of the networking devices.

The goal of OFI, and libfabric specifically, is to define interfaces that enable a tight semantic map between applications and underlying fabric services. Specifically, libfabric software interfaces have been co-designed with fabric hardware providers and application developers, with a focus on the needs of HPC users.

This guide describes the libfabric architecture and interfaces. It provides insight into the motivation for its design, and aims to instruct developers on how the features of libfabric may best be employed.

OpenFabrics 接口或 OFI 是一个专注于将结构通信服务导出到应用程序的框架。 OFI 专为满足在紧密耦合的网络环境中运行的高性能计算 (HPC) 应用程序(例如 MPI、SHMEM、PGAS、DBMS 和企业应用程序)的性能和可扩展性要求而设计。 OFI 的关键组件是:应用程序接口、提供程序库、内核服务、守护程序和测试应用程序(如ping_pong, fabtests目录下的测试集)。

Libfabric 是 OFI 的核心组件。它是定义和导出 OFI 的用户空间 API 的库,通常是应用程序直接处理的唯一软件。 Libfabric 与底层网络协议以及网络设备的实现无关。

OFI 的目标,特别是 libfabric,是定义接口,在应用程序和底层结构服务之间实现紧密的语义映射。具体来说,libfabric 软件接口是与 Fabric 硬件提供商和应用程序开发人员共同设计的,重点是 HPC 用户的需求

本指南描述了 libfabric 架构和接口。它提供了对其设计动机的洞察,旨在指导开发人员如何最好地利用 libfabric 的特性。

Review of Sockets Communication 套接字通信回顾

The sockets API is a widely used networking API. This guide assumes that a reader has a working knowledge of programming to sockets. It makes reference to socket based communications throughout in an effort to help explain libfabric concepts and how they relate or differ from the socket API. To be clear, there is no intent to criticize the socket API. The objective is to use sockets as a starting reference point in order to explain certain network features or limitations. The following sections provide a high-level overview of socket semantics for reference.

套接字 API 是一种广泛使用的网络 API。 本指南假定读者具有套接字编程的工作知识。 它在整个过程中都引用了基于套接字的通信,以帮助解释 libfabric 概念以及它们与套接字 API 的关系或不同之处。 需要明确的是,没有批评套接字 API 的意图。 目的是使用套接字作为起始参考点,以解释某些网络功能或限制。 以下部分提供了套接字语义的高级概述以供参考。

Connected (TCP) Communication 面向连接的TCP通信

The most widely used type of socket is SOCK_STREAM. This sort of socket usually runs over TCP/IP, and as a result is often referred to as a 'TCP' socket. TCP sockets are connection-oriented, requiring an explicit connection setup before data transfers can occur. A TCP socket can only transfer data to a single peer socket.

Applications using TCP sockets are typically labeled as either a client or server. Server applications listen for connection requests and accept them when they occur. Clients, on the other hand, initiate connections to the server. After a connection has been established, data transfers between a client and server are similar. The following code segments highlight the general flow for a sample client and server. Error handling and some subtleties of the socket API are omitted for brevity.

最广泛使用的套接字类型是 SOCK_STREAM。 这种套接字通常在 TCP/IP 上运行,因此通常被称为“TCP”套接字。 TCP 套接字是面向连接的,在发生数据传输之前需要明确的连接设置。 TCP 套接字只能将数据传输到单个对等套接字。

使用 TCP 套接字的应用程序通常被标记为客户端或服务器。 服务器应用程序侦听连接请求,并在它们发生时接受它们。 另一方面,客户端启动与服务器的连接。 建立连接后,客户端和服务器之间的数据传输是相似的。 以下代码段突出显示了示例客户端和服务器的一般流程。 为简洁起见,省略了错误处理和套接字 API 的一些细微之处。

/* Example server code flow to initiate listen 服务端 */
struct addrinfo *ai, hints;
int listen_fd;

memset(&hints, 0, sizeof hints);
hints.ai_socktype = SOCK_STREAM;
hints.ai_flags = AI_PASSIVE;
getaddrinfo(NULL, "7471", &hints, &ai);

listen_fd = socket(ai->ai_family, SOCK_STREAM, 0);
bind(listen_fd, ai->ai_addr, ai->ai_addrlen);
freeaddrinfo(ai);

fcntl(listen_fd, F_SETFL, O_NONBLOCK);
listen(listen_fd, 128);

In this example, the server will listen for connection requests on port 7471 across all addresses in the system. The call to getaddrinfo() is used to form the local socket address. The node parameter is set to NULL, which results in a wildcard IP address being returned. The port is hard-coded to 7471. The AI_PASSIVE flag signifies that the address will be used by the listening side of the connection.

This example will work with both IPv4 and IPv6. The getaddrinfo() call abstracts the address format away from the server, improving its portability. Using the data returned by getaddrinfo(), the server allocates a socket of type SOCK_STREAM, and binds the socket to port 7471.

In practice, most enterprise-level applications make use of non-blocking sockets. The fcntl() command sets the listening socket to non-blocking mode. This will affect how the server processes connection requests (shown below). Finally, the server starts listening for connection requests by calling listen. Until listen is called, connection requests that arrive at the server will be rejected by the operating system.

在此示例中,服务器将在端口 7471 上侦听系统中所有地址的连接请求。对 getaddrinfo() 的调用用于形成本地套接字地址。 node 参数设置为 NULL,这将导致返回通配符 IP 地址。该端口被硬编码为 7471。AI_PASSIVE 标志表示该地址将由连接的侦听方使用。

此示例适用于 IPv4 和 IPv6。 getaddrinfo() 调用将地址格式从服务器中抽象出来,提高了它的可移植性。使用 getaddrinfo() 返回的数据,服务器分配一个 SOCK_STREAM 类型的套接字,并将套接字绑定到端口 7471。

实际上,大多数企业级应用程序都使用非阻塞套接字。 fcntl() 命令将侦听套接字设置为非阻塞模式。这将影响服务器处理连接请求的方式(如下所示)。最后,服务器通过调用listen开始监听连接请求。在调用listen之前,到达服务器的连接请求将被操作系统拒绝。

/* Example client code flow to start connection 客户端 */
struct addrinfo *ai, hints;
int client_fd;

memset(&hints, 0, sizeof hints);
hints.ai_socktype = SOCK_STREAM;
getaddrinfo("10.31.20.04", "7471", &hints, &ai);

client_fd = socket(ai->ai_family, SOCK_STREAM, 0);
fcntl(client_fd, F_SETFL, O_NONBLOCK);
connect(client_fd, ai->ai_addr, ai->ai_addrlen);
freeaddrinfo(ai);

Similar to the server, the client makes use of getaddrinfo(). Since the AI_PASSIVE flag is not specified, the given address is treated as that of the destination. The client expects to reach the server at IP address 10.31.20.04, port 7471. For this example the address is hard-coded into the client. More typically, the address will be given to the client via the command line, through a configuration file, or from a service. Often the port number will be well-known, and the client will find the server by name, with DNS (domain name service) providing the name to address resolution. Fortunately, the getaddrinfo call can be used to convert host names into IP addresses.

Whether the client is given the server's network address directly or a name which must be translated into the network address, the mechanism used to provide this information to the client varies widely. A simple mechanism that is commonly used is for users to provide the server's address using a command line option. The problem of telling applications where its peers are located increases significantly for applications that communicate with hundreds to millions of peer processes, often requiring a separate, dedicated application to solve. For a typical client-server socket application, this is not an issue, so we will defer more discussion until later.

Using the getaddrinfo() results, the client opens a socket, configures it for non-blocking mode, and initiates the connection request. At this point, the network stack has sent a request to the server to establish the connection. Because the socket has been set to non-blocking, the connect call will return immediately and not wait for the connection to be established. As a result any attempt to send data at this point will likely fail.

与服务器类似,客户端使用 getaddrinfo()。由于未指定 AI_PASSIVE 标志,因此将给定地址视为目标地址。客户端希望到达 IP 地址为 10.31.20.04 的服务器,端口为 7471。对于此示例,该地址被硬编码到客户端中。更典型的是,该地址将通过命令行、配置文件或服务提供给客户端。通常端口号是众所周知的,客户端将通过名称找到服务器,DNS(域名服务)提供名称到地址的解析。幸运的是,getaddrinfo 调用可用于将主机名转换为 IP 地址。

无论是直接为客户端提供服务器的网络地址,还是必须将名称转换为网络地址,用于向客户端提供此信息的机制各不相同。一个常用的简单机制是用户使用命令行选项提供服务器地址。对于与数以亿计的对等进程进行通信的应用程序,告诉应用程序其对等点所在位置的问题显着增加,通常需要单独的专用应用程序来解决。对于典型的客户端-服务器套接字应用程序,这不是问题,因此我们将稍后再讨论。

使用 getaddrinfo() 结果,客户端打开一个套接字,将其配置为非阻塞模式,并启动连接请求。至此,网络栈已经向服务器发送了建立连接的请求。因为socket已经设置为非阻塞,connect调用会立即返回,不会等待连接建立。因此,此时任何发送数据的尝试都可能失败。
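
Not shown above: because client_fd was set to non-blocking before the call, connect() itself normally returns -1 with errno set to EINPROGRESS rather than completing immediately. A minimal sketch of that check (our addition, reusing client_fd and ai from the example above):

/* Sketch: immediate result of a non-blocking connect().
 * EINPROGRESS simply means the attempt is still in progress; the client
 * then waits for POLLOUT and checks SO_ERROR, as shown further below. */
if (connect(client_fd, ai->ai_addr, ai->ai_addrlen) == -1 &&
    errno != EINPROGRESS) {
    perror("connect");      /* immediate failure, e.g. no route to host */
    close(client_fd);
}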

/* Example server code flow to accept a connection */
struct pollfd fds;
int server_fd;

fds.fd = listen_fd;
fds.events = POLLIN;
poll(&fds, 1, -1);

server_fd = accept(listen_fd, NULL, 0);
fcntl(server_fd, F_SETFL, O_NONBLOCK);

Applications that use non-blocking sockets use select() or poll() to receive notification of when a socket is ready to send or receive data. In this case, the server wishes to know when the listening socket has a connection request to process. It adds the listening socket to a poll set, then waits until a connection request arrives (i.e. POLLIN is true). The poll() call blocks until POLLIN is set on the socket. POLLIN indicates that the socket has data to accept. Since this is a listening socket, the data is a connection request. The server accepts the request by calling accept(). That returns a new socket to the server, which is ready for data transfers.

The server sets the new socket to non-blocking mode. Non-blocking support is particularly important to applications that manage communication with multiple peers.

使用非阻塞套接字的应用程序使用 select() 或 poll() 来接收有关套接字何时准备好发送或接收数据的通知。 在这种情况下,服务器希望知道侦听套接字何时有连接请求要处理。 它将侦听套接字添加到轮询集poll set,然后等待连接请求到达(即 POLLIN 为真)。 poll() 调用阻塞,直到在套接字上设置 POLLIN。 POLLIN 表示套接字有数据要接受。 由于这是一个监听套接字,因此数据是一个连接请求。 服务器通过调用accept() 接受请求。 这会向服务器返回一个新的套接字,该套接字已准备好进行数据传输。

服务器将新套接字设置为非阻塞模式。 非阻塞支持对于管理与多个对等点的通信的应用程序尤其重要。

/* Example client code flow to establish a connection */
struct pollfd fds;
int err;
socklen_t len;

fds.fd = client_fd;
fds.events = POLLOUT;
poll(&fds, 1, -1);

len = sizeof err;
getsockopt(client_fd, SOL_SOCKET, SO_ERROR, &err, &len);

The client is notified that its connection request has completed when its connecting socket is 'ready to send data' (i.e. POLLOUT is true). The poll() call blocks until POLLOUT is set on the socket, indicating the connection attempt is done. Note that the connection request may have completed with an error, and the client still needs to check if the connection attempt was successful. That is not conveyed to the application by the poll() call. The getsockopt() call is used to retrieve the result of the connection attempt. If err in this example is set to 0, then the connection attempt succeeded. The socket is now ready to send and receive data.

After a connection has been established, the process of sending or receiving data is the same for both the client and server. The examples below differ only by name of the socket variable used by the client or server application.

当其连接套接字“准备好发送数据”(即 POLLOUT 为真)时,客户端会收到其连接请求已完成的通知。 poll() 调用阻塞,直到在套接字上设置 POLLOUT,表示连接尝试已完成。 请注意,连接请求可能已经完成但有错误,客户端仍然需要检查连接尝试是否成功。 poll() 调用不会将其传达给应用程序。 getsockopt() 调用用于检索连接尝试的结果。 如果此示例中的 err 设置为 0,则连接尝试成功。 套接字现在已准备好发送和接收数据。

建立连接后,客户端和服务器发送或接收数据的过程是相同的。 下面的示例仅在客户端或服务器应用程序使用的套接字变量的名称上有所不同。

/* Example of client sending data to server */
struct pollfd fds;
size_t offset, size, ret;
char buf[4096];

fds.fd = client_fd;
fds.events = POLLOUT;

size = sizeof(buf);
for (offset = 0; offset < size; ) {
    poll(&fds, 1, -1);

    ret = send(client_fd, buf + offset, size - offset, 0);
    offset += ret;
}

Network communication involves buffering of data at both the sending and receiving sides of the connection. TCP uses a credit based scheme to manage flow control to ensure that there is sufficient buffer space at the receive side of a connection to accept incoming data. This flow control is hidden from the application by the socket API. As a result, stream based sockets may not transfer all the data that the application requests to send as part of a single operation.

In this example, the client maintains an offset into the buffer that it wishes to send. As data is accepted by the network, the offset increases. The client then waits until the network is ready to accept more data before attempting another transfer. The poll() operation supports this. When the client socket is ready for data, it sets POLLOUT to true. This indicates that send will transfer some additional amount of data. The client issues a send() request for the remaining amount of buffer that it wishes to transfer. If send() transfers less data than requested, the client updates the offset, waits for the network to become ready, then tries again.

网络通信涉及在连接的发送端和接收端缓冲数据。 TCP 使用基于信用(credit)的方案来管理流量控制,以确保连接的接收端有足够的缓冲区空间来接收传入数据。套接字 API 对应用程序隐藏了这种流控制。因此,基于流的套接字可能不会传输应用程序请求作为单个操作的一部分发送的所有数据。

在此示例中,客户端在它希望发送的缓冲区中维护一个偏移量。随着网络接受数据,偏移量增加。然后,客户端等待网络准备好接受更多数据,然后再尝试另一次传输。 poll() 操作支持这一点。当客户端套接字准备好接收数据时,它将 POLLOUT 设置为 true。这表明发送将传输一些额外的数据量。客户端针对它希望传输的剩余缓冲区量发出 send() 请求。如果 send() 传输的数据少于请求的数据,客户端会更新偏移量,等待网络准备好,然后重试。

/* Example of server receiving data from client */
struct pollfd fds;
size_t offset, size, ret;
char buf[4096];

fds.fd = server_fd;
fds.events = POLLIN;

size = sizeof(buf);
for (offset = 0; offset < size; ) {
    poll(&fds, 1, -1);

    ret = recv(server_fd, buf + offset, size - offset, 0);
    offset += ret;
}

The flow for receiving data is similar to that used to send it. Because of the streaming nature of the socket, there is no guarantee that the receiver will obtain all of the available data as part of a single call. The server instead must wait until the socket is ready to receive data (POLLIN), before calling receive to obtain what data is available. In this example, the server knows to expect exactly 4 KB of data from the client.

It is worth noting that the previous two examples are written so that they are simple to understand. They are poorly constructed when considering performance. In both cases, the application always precedes a data transfer call (send or recv) with poll(). The impact is even if the network is ready to transfer data or has data queued for receiving, the application will always experience the latency and processing overhead of poll(). A better approach is to call send() or recv() prior to entering the for() loops, and only enter the loops if needed.

接收数据的流程与发送数据的流程相似。 由于套接字的流式传输特性,不能保证接收方将获得所有可用数据作为单个调用的一部分。 相反,服务器必须等到套接字准备好接收数据(POLLIN),然后再调用接收来获取可用的数据。 在此示例中,服务器知道从客户端准确地期望 4 KB 的数据。

值得注意的是,前面两个例子都是为了便于理解而编写的。 在考虑性能时,它们的构造很差。 在这两种情况下,应用程序总是在数据传输调用(send 或 recv)之前使用 poll()。 其影响是,即使网络已准备好传输数据或有数据排队等待接收,应用程序也将始终经历 poll() 的延迟和处理开销。 更好的方法是在进入 for() 循环之前调用 send() 或 recv(),并且仅在需要时才进入循环。
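
As a rough illustration of that point, the client send loop shown earlier could be restructured so that poll() is only called when send() actually reports that the socket is not ready. This is only a sketch, reusing client_fd, buf, offset, size and fds from the earlier example, with ret declared as ssize_t so errors are visible:

/* Sketch: attempt the send first; block in poll() only when the socket
 * is genuinely not ready to accept more data. */
ssize_t ret;

for (offset = 0; offset < size; ) {
    ret = send(client_fd, buf + offset, size - offset, 0);
    if (ret == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            poll(&fds, 1, -1);   /* wait for POLLOUT, then retry */
            continue;
        }
        break;                   /* a real error */
    }
    offset += ret;
}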

Connection-less (UDP) Communication 无连接(UDP)通信

TODO

Advantages 优点

The socket API has two significant advantages. First, it is available on a wide variety of operating systems and platforms, and works over the vast majority of available networking hardware. It is easily the de facto networking API. This by itself makes it appealing to use.

The second key advantage is that it is relatively easy to program to. The importance of this should not be overlooked. Networking APIs that offer access to higher performing features, but are difficult to program to correctly or well, often result in lower application performance. This is not unlike coding an application in a higher-level language such as C or C++, versus assembly. Although writing directly to assembly language offers the promise of being better performing, for the vast majority of developers, their applications will perform better if written in C or C++, and using an optimized compiler. Applications should have a clear need for high-performance networking before selecting an alternative API to sockets.

套接字 API 有两个显着的优点。首先,它适用于各种操作系统和平台,并且适用于绝大多数可用的网络硬件。它很容易成为事实上的网络 API。这本身就使其具有吸引力。

第二个关键优势是相对容易编程。不应忽视这一点的重要性。提供对更高性能功能的访问但难以正确或良好编程的网络 API 通常会导致较低的应用程序性能。这与使用 C 或 C++ 等高级语言对应用程序进行编码与汇编没有什么不同。尽管直接用汇编语言编写提供了性能更好的承诺,但对于绝大多数开发人员来说,如果用 C 或 C++ 编写并使用优化的编译器,他们的应用程序将性能更好。在选择套接字的替代 API 之前,应用程序应该明确需要高性能网络。

Disadvantages 不足

When considering the problems with the socket API, we limit our discussion to the two most common sockets types: streaming (TCP) and datagram (UDP).

Most applications require that network data be sent reliably. This invariably means using a connection-oriented TCP socket. TCP sockets transfer data as a stream of bytes. However, many applications operate on messages. The result is that applications often insert headers that are simply used to convert to/from a byte stream. These headers consume additional network bandwidth and processing. The streaming nature of the interface also results in the application using loops as shown in the examples above to send and receive larger messages. The complexity of those loops can be significant if the application is managing sockets to hundreds or thousands of peers.
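
To make the header overhead concrete, the sketch below (our own illustration, not from the guide) shows the kind of length-prefix framing an application typically layers on top of a TCP byte stream; send_framed() is an invented helper name.

#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* Sketch: length-prefixed framing over a TCP byte stream.  The 4-byte
 * header exists only so the receiver can recover message boundaries; it
 * costs extra bandwidth and processing on every message.  Assumes a
 * blocking socket; partial-send handling is omitted for brevity. */
int send_framed(int fd, const void *msg, uint32_t len)
{
    uint32_t hdr = htonl(len);

    if (send(fd, &hdr, sizeof hdr, 0) != (ssize_t) sizeof hdr)
        return -1;
    return send(fd, msg, len, 0) == (ssize_t) len ? 0 : -1;
}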

Another issue highlighted by the above examples deals with the asynchronous nature of network traffic. When using a reliable transport, it is not enough to place an application's data onto the network. If the network is busy, it could drop the packet, or the data could become corrupted during a transfer. The data must be kept until it has been acknowledged by the peer, so that it can be resent if needed. The socket API is defined such that the application owns the contents of its memory buffers after a socket call returns.

As an example, if we examine the socket send() call, once send() returns the application is free to modify its buffer. The network implementation has a couple of options. One option is for the send call to place the data directly onto the network. The call must then block before returning to the user until the peer acknowledges that it received the data, at which point send() can then return. The obvious problem with this approach is that the application is blocked in the send() call until the network stack at the peer can process the data and generate an acknowledgment. This can be a significant amount of time where the application is blocked and unable to process other work, such as responding to messages from other clients.

A better option is for the send() call to copy the application's data into an internal buffer. The data transfer is then issued out of that buffer, which allows retrying the operation in case of a failure. The send() call in this case is not blocked, but all data that passes through the network will result in a memory copy to a local buffer, even in the absence of any errors.

Allowing immediate re-use of a data buffer helps keep the socket API simple. However, such a feature can potentially have a negative impact on network performance. For network or memory limited applications, an alternative API may be attractive.

Because the socket API is often considered in conjunction with TCP and UDP, that is, with protocols, it is intentionally detached from the underlying network hardware implementation, including NICs, switches, and routers. Access to available network features is therefore constrained by what the API can support.

在考虑套接字 API 的问题时,我们将讨论限制在两种最常见的套接字类型:流 (TCP) 和数据报 (UDP)。

大多数应用程序要求可靠地发送网络数据。这总是意味着使用面向连接的 TCP 套接字。 TCP 套接字以字节流的形式传输数据。但是,许多应用程序对消息进行操作。结果是应用程序经常插入仅用于转换为/从字节流转换的标头。这些标头消耗额外的网络带宽和处理。接口的流式传输特性还导致应用程序使用如上例所示的循环来发送和接收更大的消息。如果应用程序正在管理数百或数千个对等点的套接字,那么这些循环的复杂性可能会很大。

上述示例强调的另一个问题涉及网络流量的异步性质。使用可靠传输时,仅将应用程序的数据放到网络上是不够的。如果网络繁忙,它可能会丢弃数据包,或者数据可能会在传输过程中损坏。数据必须保存到对方确认为止,以便在需要时可以重新发送。套接字 API 被定义为在套接字调用返回后应用程序拥有其内存缓冲区的内容。

例如,如果我们检查套接字 send() 调用,一旦 send() 返回,应用程序就可以自由修改其缓冲区。网络实现有几个选项。一种选择是发送调用将数据直接放到网络上。然后调用必须在返回给用户之前阻塞,直到对等方确认它接收到数据,此时 send() 可以返回。这种方法的明显问题是应用程序在 send() 调用中被阻塞,直到对端的网络堆栈可以处理数据并生成确认。这可能是应用程序被阻止并且无法处理其他工作(例如响应来自其他客户端的消息)的大量时间。

更好的选择是调用 send() 将应用程序的数据复制到内部缓冲区中。然后从该缓冲区发出数据传输,这允许在失败的情况下重试操作。在这种情况下,send() 调用不会被阻塞,但是所有通过网络的数据都会导致内存复制到本地缓冲区,即使没有任何错误。

允许立即重用数据缓冲区有助于保持套接字 API 简单。但是,这样的功能可能会对网络性能产生负面影响。对于网络或内存受限的应用程序,替代 API 可能很有吸引力。

由于套接字 API 经常被考虑与 TCP 和 UDP 结合,即与协议结合,因此有意将其与底层网络硬件实现分离,包括 NIC、交换机和路由器。因此,对可用网络功能的访问受限于 API 可以支持的内容。

High-Performance Networking 高性能网络

By analyzing the socket API in the context of high-performance networking, we can start to see some features that are desirable for a network API.

通过在高性能网络环境中分析套接字 API,我们可以开始看到网络 API 所需要的一些特性。

Avoiding Memory Copies 避免内存拷贝

The socket API implementation usually results in data copies occurring at both the sender and the receiver. This is a trade-off between keeping the interface easy to use, versus providing reliability. Ideally, all memory copies would be avoided when transferring data over the network. There are techniques and APIs that can be used to avoid memory copies, but in practice, the cost of avoiding a copy can often be more than the copy itself, in particular for small transfers (measured in bytes, versus kilobytes or more).

To avoid a memory copy at the sender, we need to place the application data directly onto the network. If we also want to avoid blocking the sending application, we need some way for the network layer to communicate with the application when the buffer is safe to re-use. This would allow the buffer to be re-used in case the data needs to be re-transmitted. This leads us to crafting a network interface that behaves asynchronously. The application will need to issue a request, then receive some sort of notification when the request has completed.

Avoiding a memory copy at the receiver is more challenging. When data arrives from the network, it needs to land into an available memory buffer, or it will be dropped, resulting in the sender re-transmitting the data. If we use socket recv() semantics, the only way to avoid a copy at the receiver is for the recv() to be called before the send(). Recv() would then need to block until the data has arrived. Not only does this block the receiver, it is impractical to use outside of an application with a simple request-reply protocol.

Instead, what is needed is a way for the receiving application to provide one or more buffers to the network for received data to land. The network then needs to notify the application when data is available. This sort of mechanism works well if the receiver does not care where in its memory space the data is located; it only needs to be able to process the incoming message.
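
A purely hypothetical sketch of what such an interface could look like follows; post_recv_buffer(), wait_for_completion(), process_message() and the net handle are invented names used only to illustrate the flow.

/* Hypothetical asynchronous receive flow (all names are illustrative). */
struct completion {
    void   *buf;    /* which posted buffer the data landed in */
    size_t  len;    /* how many bytes arrived */
};

char bufs[8][4096];

for (int i = 0; i < 8; i++)
    post_recv_buffer(net, bufs[i], sizeof bufs[i]);   /* hand buffers to the network */

struct completion comp;
while (wait_for_completion(net, &comp) == 0) {
    process_message(comp.buf, comp.len);              /* data is processed in place */
    post_recv_buffer(net, comp.buf, sizeof bufs[0]);  /* return the buffer for re-use */
}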

As an alternative, it is possible to reverse this flow, and have the network layer hand its buffer to the application. The application would then be responsible for returning the buffer to the network layer when it is done with its processing. While this approach can avoid memory copies, it suffers from a few drawbacks. First, the network layer does not know what size of messages to expect, which can lead to inefficient memory use. Second, many would consider this a more difficult programming model to use. And finally, the network buffers would need to be mapped into the application process' memory space, which negatively impacts performance.

In addition to processing messages, some applications want to receive data and store it in a specific location in memory. For example, a database may want to merge received data records into an existing table. In such cases, even if data arriving from the network goes directly into an application's receive buffers, it may still need to be copied into its final location. It would be ideal if the network supported placing data that arrives from the network into a specific memory buffer, with the buffer determined based on the contents of the data.

套接字 API 实现通常会导致发送方和接收方都发生数据副本。这是保持界面易于使用与提供可靠性之间的权衡。理想情况下,通过网络传输数据时会避免所有内存副本。有一些技术和 API 可用于避免内存复制,但在实践中,避免复制的成本通常可能超过复制本身,特别是对于小传输(以字节为单位,而不是千字节或更多)。

为了避免发送方的内存复制,我们需要将应用程序数据直接放到网络上。如果我们还想避免阻塞发送应用程序,我们需要一些方法让网络层在缓冲区可以安全重用时与应用程序通信。这将允许在需要重新传输数据的情况下重新使用缓冲区。这导致我们设计了一个异步行为的网络接口。应用程序需要发出请求,然后在请求完成时收到某种通知。

在接收器处避免内存复制更具挑战性。当数据从网络到达时,需要降落到可用的内存缓冲区中,否则会被丢弃,导致发送方重新传输数据。如果我们使用套接字 recv() 语义,避免在接收方复制的唯一方法是在 send() 之前调用 recv()。然后,Recv() 将需要阻塞,直到数据到达。这不仅会阻塞接收器,而且在应用程序之外使用简单的请求-回复协议也是不切实际的。

相反,需要一种方法让接收应用程序向网络提供一个或多个缓冲区,以便接收到的数据到达。然后,网络需要在数据可用时通知应用程序。如果接收方不关心数据在其内存空间中的位置,这种机制就可以很好地工作。它只需要能够处理传入的消息。

作为替代方案,可以反转此流程,让网络层将其缓冲区交给应用程序。然后,应用程序将负责在完成处理后将缓冲区返回给网络层。虽然这种方法可以避免内存复制,但它也有一些缺点。首先,网络层不知道预期的消息大小,这会导致内存使用效率低下。其次,许多人会认为这是一个更难使用的编程模型。最后,网络缓冲区需要映射到应用程序进程的内存空间,这会对性能产生负面影响。

除了处理消息之外,一些应用程序还希望接收数据并将其存储在内存中的特定位置。例如,数据库可能希望将接收到的数据记录合并到现有表中。在这种情况下,即使来自网络的数据直接进入应用程序的接收缓冲区,也可能仍需要将其复制到其最终位置。如果网络支持将来自网络的数据放置到特定的内存缓冲区中,那将是理想的,缓冲区根据数据的内容确定。

Network Buffers

Based on the problems described above, we can start to see that avoiding memory copies depends upon the ownership of the memory buffers used for network traffic. With socket based transports, the network buffers are owned and managed by the networking stack. This is usually handled by the operating system kernel. However, this results in the data 'bouncing' between the application buffers and the network buffers. By putting the application in control of managing the network buffers, we can avoid this overhead. The cost for doing so is additional complexity in the application.

Note that even though we want the application to own the network buffers, we would still like to avoid the situation where the application implements a complex network protocol. The trade-off is that the app provides the data buffers to the network stack, but the network stack continues to handle things like flow control, reliability, and segmentation and reassembly.

基于上述问题,我们可以开始看到避免内存复制取决于用于网络流量的内存缓冲区的所有权。 使用基于套接字的传输,网络缓冲区由网络堆栈拥有和管理。 这通常由操作系统内核处理。 但是,这会导致数据在应用程序缓冲区和网络缓冲区之间“反弹”。 通过让应用程序控制管理网络缓冲区,我们可以避免这种开销。 这样做的代价是应用程序的额外复杂性。

请注意,即使我们希望应用程序拥有网络缓冲区,我们仍然希望避免应用程序实现复杂网络协议的情况。 权衡是应用程序向网络堆栈提供数据缓冲区,但网络堆栈继续处理流量控制、可靠性以及分段和重组等事情。

Resource Management 资源管理

We define resource management to mean properly allocating network resources in order to avoid overrunning data buffers or queues. Flow control is a common aspect of resource management. Without proper flow control, a sender can overrun a slow or busy receiver. This can result in dropped packets, re-transmissions, and increased network congestion. Significant research and development has gone into implementing flow control algorithms. Because of its complexity, it is not something that an application developer should need to deal with. That said, there are some applications where flow control simply falls out of the network protocol. For example, a request-reply protocol naturally has flow control built in.

For our purposes, we expand the definition of resource management beyond flow control. Flow control typically only deals with available network buffering at a peer. We also want to be concerned about having available space in outbound data transfer queues. That is, as we issue commands to the local NIC to send data, that those commands can be queued at the NIC. When we consider reliability, this means tracking outstanding requests until they have been acknowledged. Resource management will need to ensure that we do not overflow that request queue.

Additionally, supporting asynchronous operations (described in detail below) will introduce potential new queues. Those queues must not overflow as well.

我们将资源管理定义为正确分配网络资源以避免超出数据缓冲区或队列。流控制是资源管理的一个常见方面。如果没有适当的流量控制,发送方可能会超出慢速或繁忙的接收方。这可能导致丢包、重传和网络拥塞增加。大量的研究和开发已经进入实施流控制算法。由于它的复杂性,它不是应用程序开发人员应该需要处理的事情。也就是说,在某些应用程序中,流量控制完全脱离了网络协议。例如,请求-回复协议自然具有内置的流量控制。

出于我们的目的,我们将资源管理的定义扩展到流控制之外。流控制通常只处理对等点的可用网络缓冲。我们还希望关注出站数据传输队列中的可用空间。也就是说,当我们向本地 NIC 发出命令以发送数据时,这些命令可以在 NIC 处排队。当我们考虑可靠性时,这意味着跟踪未完成的请求,直到它们被确认。资源管理需要确保我们不会溢出该请求队列。

此外,支持异步操作(下面详细描述)将引入潜在的新队列。这些队列也不能溢出。
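
A minimal sketch of the kind of accounting this implies is shown below; tx_credits, post_send(), reap_completions() and struct request are invented names, and the queue depth of 128 is an arbitrary assumption.

/* Sketch: never post more sends than the transmit queue can hold. */
static int tx_credits = 128;          /* assumed transmit queue depth */

int try_send(struct request *req)
{
    while (tx_credits == 0)
        reap_completions();           /* each completion returns one credit */

    tx_credits--;
    return post_send(req);            /* command is guaranteed to fit in the queue */
}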

Asynchronous Operations 异步操作

Arguably, the key feature of achieving high-performance is supporting asynchronous operations. The socket API supports asynchronous transfers with its non-blocking mode. However, because the API itself operates synchronously, the result is additional data copies. For an API to be asynchronous, an application needs to be able to submit work, then later receive some sort of notification that the work is done. In order to avoid extra memory copies, the application must agree not to modify its data buffers until the operation completes.

There are two main ways to notify an application that it is safe to re-use its data buffers. One mechanism is for the network layer to invoke some sort of callback or send a signal to the application that the request is done. Some asynchronous APIs use this mechanism. The drawback of this approach is that signals interrupt an application's processing. This can negatively impact the CPU caches, plus requires interrupt processing. Additionally, it is often difficult to develop an application that can handle processing a signal that can occur at anytime.

An alternative mechanism for supporting asynchronous operations is to write events into some sort of completion queue when an operation completes. This provides a way to indicate to an application when a data transfer has completed, plus gives the application control over when and how to process completed requests. For example, it can process requests in batches to improve code locality and performance.

可以说,实现高性能的关键特性是支持异步操作。套接字 API 以非阻塞模式支持异步传输。但是,由于 API 本身是同步操作的,因此结果是额外的数据副本。对于异步的 API,应用程序需要能够提交工作,然后接收某种通知,表明工作已完成。为了避免额外的内存副本,应用程序必须同意在操作完成之前不修改其数据缓冲区。

有两种主要方法可以通知应用程序可以安全地重用其数据缓冲区。一种机制是网络层调用某种回调或向应用程序发送请求完成的信号。一些异步 API 使用这种机制。这种方法的缺点是信号会中断应用程序的处理。这会对 CPU 缓存产生负面影响,并且需要中断处理。此外,开发一个可以处理随时可能发生的信号的应用程序通常很困难。

支持异步操作的另一种机制是在操作完成时将事件写入某种完成队列。这提供了一种在数据传输完成时向应用程序指示的方法,并让应用程序控制何时以及如何处理已完成的请求。例如,它可以批量处理请求以提高代码局部性和性能。
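
libfabric exposes exactly this model through completion queues. Roughly, and omitting error handling, reading completions with fi_cq_read() (see fi_cq(3)) looks like the sketch below; cq is an already opened completion queue and handle_completion() is our placeholder for application processing.

/* Sketch: drain a completion queue in batches. */
struct fi_cq_entry entry[16];
ssize_t n;

do {
    n = fi_cq_read(cq, entry, 16);    /* number of completions, or -FI_EAGAIN if empty */
    for (ssize_t i = 0; i < n; i++)
        handle_completion(entry[i].op_context);   /* application-defined (illustrative) */
} while (n > 0);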

Interrupts and Signals 中断和信号

Interrupts are a natural extension to supporting asynchronous operations. However, when dealing with an asynchronous API, they can negatively impact performance. Interrupts, even when directed to a kernel agent, can interfere with application processing.

If an application has an asynchronous interface with completed operations written into a completion queue, it is often sufficient for the application to simply check the queue for events. As long as the application has other work to perform, there is no need for it to block. This alleviates the need for interrupt generation. A NIC merely needs to write an entry into the completion queue and update a tail pointer to signal that a request is done.

If we follow this argument, then it can be beneficial to give the application control over when interrupts should occur and when to write events to some sort of wait object. By having the application notify the network layer that it will wait until a completion occurs, we can better manage the number and type of interrupts that are generated.

中断是支持异步操作的自然扩展。但是,在处理异步 API 时,它们会对性能产生负面影响。中断,即使指向内核代理,也会干扰应用程序处理。

如果应用程序有一个将已完成操作写入完成队列的异步接口,则应用程序通常只需检查队列中的事件就足够了。只要应用程序有其他工作要执行,就没有必要阻塞。这减轻了对中断生成的需求。 NIC 只需将一个条目写入完成队列并更新尾指针以表示请求已完成。

如果我们遵循这个论点,那么让应用程序控制何时应该发生中断以及何时将事件写入某种等待对象可能是有益的。通过让应用程序通知网络层它将等到完成发生,我们可以更好地管理生成的中断的数量和类型。
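
The resulting application pattern might look like the sketch below, where poll_cq(), have_other_work(), do_other_work() and wait_on_cq() are invented placeholders for "check the completion queue", "other application work" and "ask the network layer to arm interrupts and block".

/* Sketch: spin on the completion queue while there is useful work;
 * only arm interrupts and block when the application would otherwise idle. */
for (;;) {
    if (poll_cq(cq) > 0)
        continue;                 /* completions handled without any interrupt */

    if (have_other_work())
        do_other_work();          /* keep the CPU busy instead of blocking */
    else
        wait_on_cq(cq);           /* network layer may now generate an interrupt */
}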

Event Queues 事件队列

As outlined above, there are performance advantages to having an API that reports completions or provides other types of notification using an event queue. A very simple type of event queue merely tracks completed operations. As data is received or a send completes, an entry is written into the event queue.

如上所述,使用事件队列报告完成或提供其他类型通知的 API 具有性能优势。 一种非常简单的事件队列仅跟踪已完成的操作。 当数据被接收或发送完成时,一个条目被写入事件队列。

Direct Hardware Access 直接硬件访问

When discussing the network layer, most software implementations refer to kernel modules responsible for implementing the necessary transport and network protocols. However, if we want network latency to approach sub-microsecond speeds, then we need to remove as much software between the application and its access to the hardware as possible. One way to do this is for the application to have direct access to the network interface controller's command queues. Similarly, the NIC requires direct access to the application's data buffers and control structures, such as the above mentioned completion queues.

Note that when we speak about an application having direct access to network hardware, we're referring to the application process. Naturally, an application developer is highly unlikely to code for a specific hardware NIC. That work would be left to some sort of network library specifically targeting the NIC. The actual network layer, which implements the network transport, could be part of the network library or offloaded onto the NIC's hardware or firmware.

在讨论网络层时,大多数软件实现都是指负责实现必要的传输和网络协议的内核模块。但是,如果我们希望网络延迟接近亚微秒级的速度,那么我们需要在应用程序与其对硬件的访问之间尽可能多地删除软件。一种方法是让应用程序直接访问网络接口控制器的命令队列。同样,NIC 需要直接访问应用程序的数据缓冲区和控制结构,例如上面提到的完成队列。

请注意,当我们谈到应用程序可以直接访问网络硬件时,我们指的是应用程序进程。自然,应用程序开发人员极不可能为特定的硬件 NIC 编写代码。这项工作将留给某种专门针对 NIC 的网络库。实现网络传输的实际网络层可以是网络库的一部分,也可以卸载到 NIC 的硬件或固件上。

Kernel Bypass 绕过内核

Kernel bypass is a feature that allows the application to avoid calling into the kernel for data transfer operations. This is possible when it has direct access to the NIC hardware. Complete kernel bypass is impractical because of security concerns and resource management constraints. However, it is possible to avoid kernel calls for what are called 'fast-path' operations, such as send or receive.

For security and stability reasons, operating system kernels cannot rely on data that comes from user space applications. As a result, even a simple kernel call often requires acquiring and releasing locks, coupled with data verification checks. If we can limit the effects of a poorly written or malicious application to its own process space, we can avoid the overhead that comes with kernel validation without impacting system stability.

绕过内核是一项允许应用程序避免调用内核进行数据传输操作的功能。 当它可以直接访问 NIC 硬件时,这是可能的。 由于安全问题和资源管理限制,完全绕过内核是不切实际的。 但是,可以避免内核调用所谓的“快速路径”操作,例如发送或接收。

出于安全和稳定性的原因,操作系统内核不能依赖来自用户空间应用程序的数据。 因此,即使是简单的内核调用也经常需要获取和释放锁,再加上数据验证检查。 如果我们可以将编写不佳或恶意应用程序的影响限制在它自己的进程空间中,我们就可以避免内核验证带来的开销,而不会影响系统稳定性。

Direct Data Placement 直接数据放置

Direct data placement means avoiding data copies when sending and receiving data, plus placing received data into the correct memory buffer where needed. On a broader scale, it is part of having direct hardware access, with the application and NIC communicating directly with shared memory buffers and queues.

Direct data placement is often thought of by those familiar with RDMA - remote direct memory access. RDMA is a technique that allows reading and writing memory that belongs to a peer process that is running on a node across the network. Advanced RDMA hardware is capable of accessing the target memory buffers without involving the execution of the peer process. RDMA relies on offloading the network transport onto the NIC in order to avoid interrupting the target process.

The main advantages of supporting direct data placement is avoiding memory copies and minimizing processing overhead.

直接数据放置意味着在发送和接收数据时避免数据复制,并在需要时将接收到的数据放入正确的内存缓冲区。 在更广泛的范围内,它是直接硬件访问的一部分,应用程序和 NIC 直接与共享内存缓冲区和队列通信。

熟悉 RDMA(远程直接内存访问)的人通常会想到直接数据放置。 RDMA 是一种允许读写属于在网络上的节点上运行的对等进程的内存的技术。 高级 RDMA 硬件能够访问目标内存缓冲区,而不涉及对等进程的执行。 RDMA 依赖于将网络传输卸载到 NIC 上以避免中断目标进程。

支持直接数据放置的主要优点是避免内存复制和最小化处理开销。
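
As a purely illustrative sketch (the descriptor layout and function names below are invented, not a real API), an RDMA-style write names the exact remote buffer, so the NIC can place the data without any receive-side copy or CPU involvement:

/* Hypothetical RDMA-write descriptor: the initiator supplies the remote
 * location and access key, so no copy or processing is needed on the peer. */
struct rdma_write_desc {
    void     *local_buf;    /* registered local memory to read from */
    size_t    len;
    uint64_t  remote_addr;  /* where the data lands on the peer */
    uint64_t  remote_key;   /* access key the peer granted for that region */
};

struct rdma_write_desc wr = {
    .local_buf   = table_row,        /* illustrative application buffer */
    .len         = row_len,
    .remote_addr = peer_table_addr,
    .remote_key  = peer_key,
};

post_rdma_write(ep, &wr);            /* NIC places the data directly */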

Designing Interfaces for Performance 为性能设计API

We want to design a network interface that can meet the requirements outlined above. Moreover, we also want to take into account the performance of the interface itself. It is often not obvious how an interface can adversely affect performance, versus performance being a result of the underlying implementation. The following sections describe how interface choices can impact performance. Of course, when we begin defining the actual APIs that an application will use, we will need to trade off raw performance for ease of use where it makes sense.

When considering performance goals for an API, we need to take into account the target application use cases. For the purposes of this discussion, we want to consider applications that communicate with thousands to millions of peer processes. Data transfers will include millions of small messages per second per peer, and large transfers that may be up to gigabytes of data. At such extreme scales, even small optimizations are measurable, in terms of both performance and power. If we have a million peers sending a million messages per second, eliminating even a single instruction from the code path quickly multiplies to saving billions of instructions per second from the overall execution, when viewing the operation of the entire application.

We once again refer to the socket API as part of this discussion in order to illustrate how an API can affect performance.

我们想设计一个能够满足上述要求的网络接口。而且,我们还要考虑到接口本身的性能。接口如何对性能产生不利影响通常并不明显,而性能是底层实现的结果。以下部分描述了接口选择如何影响性能。当然,当我们开始定义应用程序将使用的实际 API 时,我们需要在合理的情况下权衡原始性能以换取易用性。

在考虑 API 的性能目标时,我们需要考虑目标应用程序用例。出于本次讨论的目的,我们希望考虑与成千上万个对等进程进行通信的应用程序。数据传输将包括每个对等方每秒数百万条小消息,以及可能高达千兆字节数据的大传输。在如此极端的规模下,就性能和功率而言,即使是很小的优化也是可以衡量的。如果我们有一百万对等方每秒发送数百万条消息,那么在查看整个应用程序的操作时,即使从代码路径中消除一条指令也会迅速成倍地从整体执行中节省数十亿条指令。

我们再次将套接字 API 作为讨论的一部分,以说明 API 如何影响性能。

/* Notable socket function prototypes */
/* "control" functions */
int socket(int domain, int type, int protocol);
int bind(int socket, const struct sockaddr *addr, socklen_t addrlen);
int listen(int socket, int backlog);
int accept(int socket, struct sockaddr *addr, socklen_t *addrlen);
int connect(int socket, const struct sockaddr *addr, socklen_t addrlen);
int shutdown(int socket, int how);
int close(int socket);

/* "fast path" data operations - send only (receive calls not shown) */
ssize_t send(int socket, const void *buf, size_t len, int flags);
ssize_t sendto(int socket, const void *buf, size_t len, int flags,
    const struct sockaddr *dest_addr, socklen_t addrlen);
ssize_t sendmsg(int socket, const struct msghdr *msg, int flags);
ssize_t write(int socket, const void *buf, size_t count);
ssize_t writev(int socket, const struct iovec *iov, int iovcnt);

/*
 * Note: send() and write() perform the same transfer on a connected socket.
 * send() takes an extra flags argument and works only on socket descriptors,
 * while write() is a general-purpose call that operates on any type of
 * descriptor.
 */

/* "indirect" data operations */
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
int select(int nfds, fd_set *readfds, fd_set *writefds,
    fd_set *exceptfds, struct timeval *timeout); 

Examining this list, there are a couple of features to note. First, there are multiple calls that can be used to send data, as well as multiple calls that can be used to wait for a non-blocking socket to become ready. This will be discussed in more detail further on. Second, the operations have been split into different groups (terminology is ours). Control operations are those functions that an application seldom invokes during execution. They often only occur as part of initialization.

Data operations, on the other hand, may be called hundreds to millions of times during an application's lifetime. They deal directly or indirectly with transferring or receiving data over the network. Data operations can be split into two groups. Fast path calls interact with the network stack to immediately send or receive data. In order to achieve high bandwidth and low latency, those operations need to be as fast as possible. Non-fast path operations that still deal with data transfers are those calls, that while still frequently called by the application, are not as performance critical. For example, the select() and poll() calls are used to block an application thread until a socket becomes ready. Because those calls suspend the thread execution, performance is a lesser concern. (Performance of those operations is still of a concern, but the cost of executing the operating system scheduler often swamps any but the most substantial performance gains.)

检查此列表,有几个功能需要注意。首先,有多个调用可用于发送数据,以及多个调用可用于等待非阻塞套接字就绪。这将在后面更详细地讨论。其次,操作被分成不同的组(术语是我们的)。控制操作是应用程序在执行期间很少调用的那些功能。它们通常仅作为初始化的一部分出现。

另一方面,数据操作在应用程序的生命周期中可能会被调用数百到数百万次。它们直接或间接地处理通过网络传输或接收数据。数据操作可以分为两组。快速路径调用与网络堆栈交互以立即发送或接收数据。为了实现高带宽和低延迟,这些操作需要尽可能快。仍然处理数据传输的非快速路径操作是那些调用,虽然仍然经常被应用程序调用,但对性能的要求并不高。例如,select() 和 poll() 调用用于阻塞应用程序线程,直到套接字准备好。因为这些调用会暂停线程执行,所以性能是一个不太关心的问题。 (这些操作的性能仍然是一个问题,但是执行操作系统调度程序的成本通常会超过除了最实质性的性能提升之外的任何东西。)

Call Setup Costs 连接建立/初始化设置的开销

The amount of work that an application needs to perform before issuing a data transfer operation can affect performance, especially message rates. Obviously, the more parameters an application must push onto the stack to call a function, the higher its instruction count. However, replacing stack variables with a single data structure does not help to reduce the setup costs.

Suppose that an application wishes to send a single data buffer of a given size to a peer. If we examine the socket API, the best fit for such an operation is the write() call. That call takes only those values which are necessary to perform the data transfer. The send() call is a close second, and send() is a more natural function name for network communication, but send() requires one extra argument over write(). Other functions are even worse in terms of setup costs. The sendmsg() function, for example, requires that the application format a data structure, the address of which is passed into the call. This requires significantly more instructions from the application if done for every data transfer.

Even though all other send functions can be replaced by sendmsg(), it is useful to have multiple ways for the application to issue send requests. Not only are the other calls easier to read and use (which lower software maintenance costs), but they can also improve performance.

应用程序在发出数据传输操作之前需要执行的工作量会影响性能,尤其是消息速率。显然,应用程序必须压入堆栈以调用函数的参数越多,它的指令数就会增加。但是,用单个数据结构替换堆栈变量并不能帮助降低设置成本。

假设应用程序希望将给定大小的单个数据缓冲区发送到对等点。如果我们检查套接字 API,最适合这种操作的是 write() 调用。该调用仅采用执行数据传输所需的那些值。 send() 调用紧随其后,send() 是用于网络通信的更自然的函数名称,但 send() 需要一个比 write() 额外的参数。就设置成本而言,其他功能甚至更糟。例如,sendmsg() 函数要求应用程序格式化一个数据结构,并将其地址传递给调用。如果每次数据传输都完成,这需要来自应用程序的更多指令。

尽管所有其他发送函数都可以用 sendmsg() 代替,但有多种方式让应用程序发出发送请求还是很有用的。其他调用不仅更易于阅读和使用(这降低了软件维护成本),而且还可以提高性能。
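
The difference in per-call setup is easy to see by writing the same transfer three ways; the snippet below assumes a connected socket fd, a buffer buf of len bytes, and an ssize_t ret.

/* Same transfer, with increasing amounts of per-call setup. */
ret = write(fd, buf, len);               /* minimal argument list */

ret = send(fd, buf, len, 0);             /* one extra (flags) argument */

struct iovec iov = { .iov_base = buf, .iov_len = len };
struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
ret = sendmsg(fd, &msg, 0);              /* caller must format a msghdr first */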

Branches and Loops 分支和循环

When designing an API, developers rarely consider how the API impacts the underlying implementation. However, the selection of API parameters can require that the underlying implementation add branches or use control loops. Consider the difference between the write() and writev() calls. The latter passes in an array of I/O vectors, which may be processed using a loop such as this:

在设计 API 时,开发人员很少考虑 API 如何影响底层实现。 但是,API 参数的选择可能需要底层实现添加分支或使用控制循环。 考虑 write() 和 writev() 调用之间的区别。 后者传入一个 I/O 向量数组,可以使用如下循环进行处理:

/* Sample implementation for processing an array */
for (i = 0; i < iovcnt; i++) {
    ...
}

In order to process the iovec array, the natural software construct would be to use a loop to iterate over the entries. Loops result in additional processing. Typically, a loop requires initializing a loop control variable (e.g. i = 0), adds ALU operations (e.g. i++), and a comparison (e.g. i < iovcnt). This overhead is necessary to handle an arbitrary number of iovec entries. If the common case is that the application wants to send a single data buffer, write() is a better option.

In addition to control loops, an API can result in the implementation needing branches. Branches can change the execution flow of a program, impacting processor pipe-lining techniques. Processor branch prediction helps alleviate this issue. However, while branch prediction can be correct nearly 100% of the time while running a micro-benchmark, such as a network bandwidth or latency test, with more realistic network traffic, the impact can become measurable.

We can easily see how an API can introduce branches into the code flow if we examine the send() call. Send() takes an extra flags parameter over the write() call. This allows the application to modify the behavior of send(). From the viewpoint of implementing send(), the flags parameter must be checked. In the best case, this adds one additional check (flags are non-zero). In the worst case, every valid flag may need a separate check, resulting in potentially dozens of checks.

Overall, the sockets API is well designed considering these performance implications. It provides complex calls where they are needed, with simpler functions available that can avoid some of the overhead inherent in other calls.

为了处理 iovec 数组,自然的软件构造将使用循环来迭代条目。循环导致额外的处理。通常,循环需要初始化循环控制变量(例如 i = 0)、添加 ALU 操作(例如 i++)和比较(例如 i < iovcnt)。此开销对于处理任意数量的 iovec 条目是必需的。如果常见情况是应用程序想要发送单个数据缓冲区,那么 write() 是一个更好的选择。

除了控制循环之外,API 还可能导致实现需要分支。分支可以改变程序的执行流程,影响处理器流水线技术。处理器分支预测有助于缓解这个问题。然而,虽然在运行微基准测试(例如网络带宽或延迟测试)时,分支预测几乎 100% 的时间都是正确的,但网络流量更真实,其影响可以变得可衡量。

如果我们检查 send() 调用,我们可以很容易地看到 API 如何将分支引入代码流。 Send() 在 write() 调用上采用额外的标志参数。这允许应用程序修改 send() 的行为。从实现 send() 的角度来看,必须检查 flags 参数。在最好的情况下,这会增加一项额外的检查(标志非零)。在最坏的情况下,每个有效标志都可能需要单独检查,从而可能导致数十次检查。

总体而言,考虑到这些性能影响,套接字 API 设计得很好。它在需要的地方提供复杂的调用,并提供更简单的功能,可以避免其他调用中固有的一些开销。
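
For illustration only, the kind of branching that a flags argument forces onto an implementation's fast path might look like the sketch below; SUPPORTED_FLAGS, nonblocking and hold_partial_frame are invented names, not actual kernel code.

/* Sketch: every supported flag is another branch on the send fast path. */
if (flags) {                          /* best case: a single non-zero test */
    if (flags & MSG_DONTWAIT)
        nonblocking = 1;
    if (flags & MSG_MORE)
        hold_partial_frame = 1;
    if (flags & ~SUPPORTED_FLAGS)
        return -EINVAL;               /* reject unknown flags */
}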

Command Formatting 命令格式

The ultimate objective of invoking a network function is to transfer or receive data from the network. In this section, we're dropping to the very bottom of the software stack to the component responsible for directly accessing the hardware. This is usually referred to as the network driver, and its implementation is often tied to a specific piece of hardware, or a series of NICs by a single hardware vendor.

In order to signal a NIC that it should read a memory buffer and copy that data onto the network, the software driver usually needs to write some sort of command to the NIC. To limit hardware complexity and cost, a NIC may only support a couple of command formats. This differs from the software interfaces that we've been discussing, where we can have different APIs of varying complexity in order to reduce overhead. There can be significant costs associated with formatting the command and posting it to the hardware.

With a standard NIC, the command is formatted by a kernel driver. That driver sits at the bottom of the network stack servicing requests from multiple applications. It must typically format each command only after a request has passed through the network stack.

With devices that are directly accessible by a single application, there are opportunities to use pre-formatted command structures. The more of the command that can be initialized prior to the application submitting a network request, the more streamlined the process, and the better the performance.

As an example, a NIC needs to have the destination address as part of a send operation. If an application is sending to a single peer, that information can be cached and be part of a pre-formatted network header. This is only possible if the NIC driver knows that the destination will not change between sends. The closer that the driver can be to the application, the greater the chance for optimization. An optimal approach is for the driver to be part of a library that executes entirely within the application process space.

调用网络函数的最终目标是从网络传输或接收数据。在本节中,我们将下降到软件堆栈的最底层,即负责直接访问硬件的组件。这通常称为网络驱动程序,其实现通常与特定硬件或单个硬件供应商的一系列 NIC 相关联。

为了向 NIC 发出信号,它应该读取内存缓冲区并将该数据复制到网络上,软件驱动程序通常需要向 NIC 写入某种命令。为了限制硬件复杂性和成本,一个 NIC 可能只支持几种命令格式。这与我们一直在讨论的软件接口不同,我们可以拥有不同复杂度的不同 API 以减少开销。格式化命令并将其发布到硬件可能会产生大量成本。

对于标准 NIC,命令由内核驱动程序格式化。该驱动程序位于网络堆栈的底部,为来自多个应用程序的请求提供服务。它通常必须仅在请求通过网络堆栈后才格式化每个命令。

对于可由单个应用程序直接访问的设备,有机会使用预先格式化的命令结构。在应用程序提交网络请求之前可以初始化的命令越多,流程就越精简,性能也越好。

例如,NIC 需要将目标地址作为发送操作的一部分。如果应用程序正在发送到单个对等点,则该信息可以被缓存并成为预先格式化的网络标头的一部分。这只有在 NIC 驱动程序知道目标在发送之间不会改变的情况下才有可能。驱动程序离应用程序越近,优化的机会就越大。一种最佳方法是让驱动程序成为完全在应用程序进程空间内执行的库的一部分。
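
A sketch of the pre-formatting idea follows; the command layout and names (struct nic_send_cmd, init_send_cmd(), ring_doorbell()) are invented for illustration. Everything that does not change between sends to a given peer is filled in once, and only the per-operation fields are written on the fast path.

/* Hypothetical NIC send command, cached per destination. */
struct nic_send_cmd {
    uint32_t opcode;        /* constant: SEND */
    uint32_t dest_qp;       /* constant for a given peer */
    uint8_t  route_hdr[16]; /* constant network header for that peer */
    uint64_t buf_addr;      /* per-operation */
    uint32_t length;        /* per-operation */
};

/* At setup time (once per peer): */
init_send_cmd(&cached_cmd, peer);

/* On the fast path, only two fields are touched before ringing the doorbell: */
cached_cmd.buf_addr = (uintptr_t) buf;
cached_cmd.length   = len;
ring_doorbell(nic, &cached_cmd);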

下文:

https://cloud.tencent.com/developer/article/2531004

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。
