From the NVIDIA website:
NVIDIA GPUDirect is a family of technologies that enhances data movement and access between GPUs (P2P) and between GPUs and third-party devices (RDMA). Whether you are exploring massive datasets, researching scientific problems, training neural networks, or modeling financial markets, you need a computing platform with the highest possible data throughput. GPUs process data far faster than CPUs, and as GPU compute capability grows, so does the demand for I/O bandwidth. NVIDIA GPUDirect®, part of Magnum IO, enhances data movement and access for NVIDIA data center GPUs. With GPUDirect, network adapters and storage drives can read and write GPU memory directly, eliminating unnecessary memory copies, reducing CPU overhead and latency, and thus significantly improving performance. These technologies, including GPUDirect Storage (GDS), GPUDirect RDMA (GDR), GPUDirect Peer-to-Peer (P2P), and GPUDirect Video, are exposed through a comprehensive set of APIs.
Remote Direct Memory Access (RDMA) enables peripheral PCIe devices to access GPU memory directly. Designed for the needs of GPU acceleration, GPUDirect RDMA provides direct communication between NVIDIA GPUs in remote systems. It eliminates the system CPU and the data-buffer copies through system memory, improving performance by up to 10x. GPUDirect RDMA is available in the CUDA Toolkit. See the topology below:
From paper [1]:
NVIDIA GPUDirect technology allows peer GPUs, network adapters, and other devices to read and write GPU device memory directly. This removes extra copies to host memory, reducing latency and CPU overhead, and significantly shortens data-transfer times for applications running on NVIDIA Tesla, GeForce, and Quadro GPUs.
The first GPUDirect version, released in 2010 with CUDA 3.1, accelerated communication with third-party PCIe network and storage device drivers via shared pinned host memory.
In 2011, starting with CUDA 4.0, GPUDirect Peer-to-Peer (P2P) enabled direct access and transfers between GPUs under the same PCIe root port.
Finally, with CUDA 5.0, NVIDIA released GPUDirect RDMA, which establishes a direct data path over the PCIe bus between a GPU and a third-party device (such as a network interface controller) without CPU involvement.
CUDA 6.0 introduced Unified Memory (UM, shown below). Before CUDA 6.0, you had to allocate host and GPU memory separately and copy between them explicitly with cudaMemcpy; in other words, you managed CPU and GPU memory yourself. With UM, the programmer works through a single unified pointer and the system migrates data automatically. Under UM you no longer allocate GPU memory with cudaMalloc; you use cudaMallocManaged instead. UM greatly reduces code size and programmer effort. Before UM, because the CPU and GPU address spaces were separate, complex data structures required many manual allocations and frequent cudaMemcpy calls back and forth between CPU and GPU memory, which made memory management cumbersome.
Starting with device driver version OFED 2.1, Mellanox supports GPUDirect RDMA on ConnectX-3 and Connect-IB host channel adapters (HCAs).
MPI implementations added GPUDirect support through so-called CUDA-aware MPI: MPI send and receive functions can read and write GPU memory buffers directly, without extra device-to-host or host-to-device copies.
GPUDirect Async (GDA): offloading communication control logic in GPU-accelerated applications.
Before CUDA 8.0, the GPUDirect technologies were all about moving data into and out of GPU memory efficiently. GPUDirect Async, introduced in CUDA 8.0, optimizes the control path between the GPU and the NIC, allowing the GPU to trigger and synchronize network transfers. This reduces CPU involvement on the application's critical path. Several researchers have explored offloading the communication control path onto the GPU.
The GPUnet project lets CUDA kernel threads send and receive data over sockets with standard send/receive functions; the GPU does not talk to the HCA directly, and the communication calls are executed by the CPU. Lena Oden et al. attempted to build a GPU-side RDMA library by modifying the open-source parts of the NVIDIA device driver and patching the InfiniBand user-space driver to map the entire InfiniBand context onto the GPU. Their results were unsatisfying, leading the authors to conclude that GPU-native designs are inferior to traditional CPU-controlled network transfers. GPUrdma is a GPU-side library for high-performance RDMA communication from GPU code, in which the GPU talks to the HCA directly: the InfiniBand structures (QPs and CQs) are allocated in GPU device memory and exposed to the HCA via a peer DMA API and modified HCA initialization routines, and communication is prepared and triggered inside CUDA kernel code. All of this prior work required modifying parts of the GPU or HCA drivers to expose verbs as device functions that CUDA kernel threads use to prepare and submit network commands. This approach has two drawbacks: GPU resources are consumed during every send/receive device function, since GPU cores execute serial code inefficiently; and synchronizing communication inside an already-running CUDA kernel raises GPU memory-consistency problems. GPUDirect Async differs in that it follows the CUDA programming model while reducing CPU involvement in the GPU application's critical path. It offloads the transitions between compute and communication phases onto the GPU, so the CPU neither waits for GPU computation to finish before issuing communication nor polls for communication completion before issuing computation.
The most recent advance in GPU-GPU communication is GPUDirect RDMA. It provides a direct peer-to-peer data path between GPU memory and NVIDIA network adapter devices, significantly lowering GPU-GPU communication latency and fully offloading the CPU, which no longer participates in any GPU-GPU traffic on the network. GPUDirect builds on the PeerDirect RDMA and PeerDirect ASYNC™ capabilities of NVIDIA network adapters.
GPUDirect Peer-to-Peer supports GPU-to-GPU copies as well as loads and stores directly over the memory fabric (PCIe, NVLink). It is natively supported by the CUDA driver. Developers should use the latest CUDA Toolkit and drivers on a system with two or more compatible devices.
Overall architecture topology; see reference [2]:
https://github.com/NVIDIA/open-gpu-kernel-modules
Static:
Dynamic:
The terms GDR MR and DMA-BUF MR refer to memory-registration techniques common in high-performance computing (HPC), especially in scenarios involving GPUs and other accelerators. They usually appear in the context of DMA (direct memory access) and RDMA (remote direct memory access), where the goal is to optimize data transfers among devices such as GPUs, NICs, and CPUs while minimizing latency.
GDR (GPUDirect RDMA) is an NVIDIA technology that allows memory buffers to be accessed directly by the GPU or by peer devices, without CPU involvement. It is particularly useful in GPUDirect RDMA use cases, where the GPU communicates directly with network devices or other GPUs.
A GDR MR (memory registration) registers a memory buffer for direct access by the GPU or by peer devices. This registration is a key part of HPC performance tuning, particularly for applications such as scientific computing or machine learning that must move large datasets quickly between the GPU and other devices.
Use case: typically used in RDMA-based communication, letting the GPU send and receive data directly to a NIC (network interface card) or another GPU, bypassing the CPU to reduce overhead.
DMA-BUF is a Linux kernel framework for sharing memory buffers among different devices (GPU, CPU, NIC, and so on) in a uniform way. The idea is to let different subsystems access a shared buffer without copying data back and forth.
A DMA-BUF MR (memory registration) registers a buffer that is shared between devices via the DMA-BUF framework. It is typically used when a GPU must share memory with other devices and you want those devices to access the memory efficiently without duplication.
Use case: commonly used for sharing GPU memory among multiple GPUs, or between a GPU and a device such as a NIC that may use RDMA. The DMA-BUF framework is often combined with GPUDirect RDMA and other zero-copy techniques for efficient cross-device memory sharing.
Main differences. Technology:
GDR MR is tied to NVIDIA's GPUDirect technology and focuses on direct memory access involving GPUs. DMA-BUF MR is a more general Linux kernel mechanism for sharing memory across devices (not just GPUs) and works with a broader range of hardware. Scope:
GDR MR is usually dedicated to GPU-to-GPU, GPU-to-NIC, and other high-performance, low-latency scenarios, RDMA in particular. DMA-BUF MR serves memory sharing across all kinds of device interconnects and is not necessarily tied to RDMA or GPU-specific tasks, though it can be used in those settings. Use cases:
GDR MR is common in RDMA-based environments such as HPC and machine learning, where large datasets must be transferred with minimal latency. DMA-BUF MR is common on Linux systems where devices (including GPUs) need to share buffers efficiently, for example video decode/encode, multi-GPU systems, or complex multimedia pipelines. When to use which? GDR MR: when you want direct GPU access with minimal CPU involvement, especially in RDMA scenarios or when moving data directly between a GPU and a NIC. It suits systems where the GPU does heavy computation and must share data with other devices at low latency.
DMA-BUF MR: when you need to share memory among devices such as GPUs, NICs, or other hardware that are not necessarily tightly coupled or involved in high-performance RDMA operations. Useful for multimedia applications, general-purpose memory sharing, and heterogeneous systems.
Example scenarios: GDR MR: an HPC application using GPUs and RDMA to move data quickly between GPUs on different nodes, or between a GPU and a network device.
DMA-BUF MR: a multimedia application that shares video buffers among a GPU, a video decoder, and a NIC for real-time video streaming.
struct ib_umem_dmabuf {
        struct ib_umem umem;
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
        struct scatterlist *first_sg;
        struct scatterlist *last_sg;
        unsigned long first_sg_offset;
        unsigned long last_sg_trim;
        void *private;
        u8 pinned : 1;
};
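The first_sg_offset and last_sg_trim fields record how far the registered byte window [offset, offset + length) reaches into the first and stops short of the end of the last scatter-gather entry. A minimal pure-C sketch of that bookkeeping, with entry lengths modeled as a plain array (the names here are illustrative, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Map a byte window [offset, offset+length) over a dma-buf onto trim
 * values for the first and last scatterlist entries, in the spirit of
 * first_sg_offset / last_sg_trim above. sg_len[] models the byte
 * length of each SG entry. */
struct sg_window {
    size_t first_sg;        /* index of first SG entry in the window */
    size_t last_sg;         /* index of last SG entry in the window */
    size_t first_sg_offset; /* bytes to skip at the head of first_sg */
    size_t last_sg_trim;    /* bytes to drop at the tail of last_sg */
};

static int sg_window_compute(const size_t *sg_len, size_t nents,
                             size_t offset, size_t length,
                             struct sg_window *w)
{
    size_t pos = 0, end = offset + length, i;
    int have_first = 0;

    for (i = 0; i < nents; i++) {
        size_t sg_end = pos + sg_len[i];
        if (!have_first && offset < sg_end) {
            w->first_sg = i;
            w->first_sg_offset = offset - pos;
            have_first = 1;
        }
        if (have_first && end <= sg_end) {
            w->last_sg = i;
            w->last_sg_trim = sg_end - end;
            return 0;
        }
        pos = sg_end;
    }
    return -1; /* window exceeds the SG table */
}
```

For example, a window of 8000 bytes starting at offset 100 over three 4 KiB entries begins 100 bytes into entry 0 and leaves 92 unused bytes at the tail of entry 1.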
.reg_user_mr_dmabuf = mlx5_ib_reg_user_mr_dmabuf,
...
struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset,
                                         u64 length, u64 virt_addr,
                                         int fd, int access_flags,
                                         struct ib_udata *udata)
{
        struct mlx5_ib_dev *dev = to_mdev(pd->device);
        struct mlx5_ib_mr *mr = NULL;
        struct ib_umem_dmabuf *umem_dmabuf;
        int err;

        if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM) ||
            !IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
                return ERR_PTR(-EOPNOTSUPP);

        mlx5_ib_dbg(dev,
                    "offset 0x%llx, virt_addr 0x%llx, length 0x%llx, fd %d, access_flags 0x%x\n",
                    offset, virt_addr, length, fd, access_flags);

        /* dmabuf requires xlt update via umr to work. */
        if (!mlx5_ib_can_load_pas_with_umr(dev, length))
                return ERR_PTR(-EINVAL);

        umem_dmabuf = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd,
                                         access_flags,
                                         &mlx5_ib_dmabuf_attach_ops);
        if (IS_ERR(umem_dmabuf)) {
                mlx5_ib_dbg(dev, "umem_dmabuf get failed (%ld)\n",
                            PTR_ERR(umem_dmabuf));
                return ERR_CAST(umem_dmabuf);
        }

        mr = alloc_cacheable_mr(pd, &umem_dmabuf->umem, virt_addr,
                                access_flags);
        if (IS_ERR(mr)) {
                ib_umem_release(&umem_dmabuf->umem);
                return ERR_CAST(mr);
        }

        mlx5_ib_dbg(dev, "mkey 0x%x\n", mr->mmkey.key);

        atomic_add(ib_umem_num_pages(mr->umem), &dev->mdev->priv.reg_pages);
        umem_dmabuf->private = mr;

        err = mlx5r_store_odp_mkey(dev, &mr->mmkey);
        if (err)
                goto err_dereg_mr;

        err = mlx5_ib_init_dmabuf_mr(mr);
        if (err)
                goto err_dereg_mr;
        return &mr->ibmr;

err_dereg_mr:
        mlx5_ib_dereg_mr(&mr->ibmr, NULL);
        return ERR_PTR(err);
}
...
Initialization of the dmabuf MR is implemented in drivers/infiniband/hw/mlx5/odp.c:
int mlx5_ib_init_dmabuf_mr(struct mlx5_ib_mr *mr)
{
        int ret;

        ret = pagefault_dmabuf_mr(mr, mr->umem->length, NULL,
                                  MLX5_PF_FLAGS_ENABLE);

        return ret >= 0 ? 0 : ret;
}
...
static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, size_t bcnt,
                               u32 *bytes_mapped, u32 flags)
{
        struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem);
        u32 xlt_flags = 0;
        int err;
        unsigned int page_size;

        if (flags & MLX5_PF_FLAGS_ENABLE)
                xlt_flags |= MLX5_IB_UPD_XLT_ENABLE;

        dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
        err = ib_umem_dmabuf_map_pages(umem_dmabuf);
        if (err) {
                dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
                return err;
        }

        page_size = mlx5_umem_find_best_pgsz(&umem_dmabuf->umem, mkc,
                                             log_page_size, 0,
                                             umem_dmabuf->umem.iova);
        if (unlikely(page_size < PAGE_SIZE)) {
                ib_umem_dmabuf_unmap_pages(umem_dmabuf);
                err = -EINVAL;
        } else {
                err = mlx5_ib_update_mr_pas(mr, xlt_flags);
        }
        dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);

        if (err)
                return err;

        if (bytes_mapped)
                *bytes_mapped += bcnt;

        return ib_umem_num_pages(mr->umem);
}
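The page_size < PAGE_SIZE check above rejects buffers whose layout cannot be described even with the CPU page size. A simplified, hypothetical sketch of the "best page size" idea behind mlx5_umem_find_best_pgsz() (the real kernel logic also examines every SG entry; here only IOVA and length alignment are considered):

```c
#include <assert.h>
#include <stdint.h>

/* Pick the largest power-of-two page size, up to max_pgsz, to which
 * both the IOVA and the length are aligned. A result smaller than the
 * CPU page size would be rejected by the caller, mirroring the
 * `page_size < PAGE_SIZE` check above. */
static uint64_t best_page_size(uint64_t iova, uint64_t length,
                               uint64_t max_pgsz)
{
    uint64_t bits = iova | length;
    uint64_t pgsz = max_pgsz;

    /* halve until the low bits of (iova | length) no longer intrude */
    while (pgsz > 1 && (bits & (pgsz - 1)))
        pgsz >>= 1;
    return pgsz;
}
```

A 2 MiB-aligned IOVA with a 4 MiB length can use 2 MiB pages; an IOVA ending in 0x234 can only be described with 4-byte granularity and would fail the PAGE_SIZE check.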
RDMA/irdma: add support for dmabuf pinned memory regions. This is a follow-up to EFA dmabuf [https://lore.kernel.org/lkml/20211007114018.GD2688930@ziepe.ca/t/]. The irdma driver currently does not support on-demand paging (ODP), so it uses habanalabs as the dmabuf exporter and irdma as the importer to allow peer-to-peer (P2P) access through libibverbs. This commit uses ib_umem_dmabuf_get_pinned(), introduced in EFA dmabuf [1], which lets a driver obtain a dmabuf umem that is pinned and does not require a move_notify callback implementation. The returned umem is pinned and DMA-mapped like a standard CPU umem and is released with ib_umem_release().
static struct ib_mr *irdma_reg_user_mr_dmabuf(struct ib_pd *pd, u64 start,
                                              u64 len, u64 virt,
                                              int fd, int access,
                                              struct uverbs_attr_bundle *attrs)
{
        struct irdma_device *iwdev = to_iwdev(pd->device);
        struct ib_umem_dmabuf *umem_dmabuf;
        struct irdma_mr *iwmr;
        int err;

        if (len > iwdev->rf->sc_dev.hw_attrs.max_mr_size)
                return ERR_PTR(-EINVAL);

        umem_dmabuf = ib_umem_dmabuf_get_pinned(pd->device, start, len, fd, access);
        if (IS_ERR(umem_dmabuf)) {
                err = PTR_ERR(umem_dmabuf);
                ibdev_dbg(&iwdev->ibdev, "Failed to get dmabuf umem[%d]\n", err);
                return ERR_PTR(err);
        }

        iwmr = irdma_alloc_iwmr(&umem_dmabuf->umem, pd, virt, IRDMA_MEMREG_TYPE_MEM);
        if (IS_ERR(iwmr)) {
                err = PTR_ERR(iwmr);
                goto err_release;
        }

        err = irdma_reg_user_mr_type_mem(iwmr, access, true);
        if (err)
                goto err_iwmr;

        return &iwmr->ibmr;

err_iwmr:
        irdma_free_iwmr(iwmr);

err_release:
        ib_umem_release(&umem_dmabuf->umem);

        return ERR_PTR(err);
}
https://lore.kernel.org/lkml/20211007114018.GD2688930@ziepe.ca/t/
Hi all, this is a follow-up to my previous RFC; it now uses the dynamic-attachment API implemented in the RDMA subsystem, but calls dma_buf_pin() to ensure the move_notify callback is never used, as Christian suggested. As noted in the previous RFC, move_notify requires the RDMA device to support on-demand paging (ODP), which is uncommon among such devices (only mlx5 supports it). While the dynamic requirement makes sense for some GPUs, the device memory of some devices (e.g. habanalabs) is always "pinned" and does not need or use the move_notify operation. The first patch changes the dmabuf documentation to clarify that pinning does not necessarily mean the memory must be moved to system memory; that decision is up to the exporter. The motivation for this RFC is to use habanalabs as the dmabuf exporter and EFA as the importer to allow peer access through libibverbs.
https://github.com/torvalds/linux/commit/682358fd35dece838e6ae2d9d6a69fc0b9a9d411
https://patch.msgid.link/038aad36a43797e5591b20ba81051fc5758124f9.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky
RDMA/umem: add support for creating pinned DMABUF umems with a given DMA device, i.e. with a specified DMA device rather than the given IB device's DMA device. This API will be used by later patches in the series, when multi-path DMA is implemented.
struct ib_umem_dmabuf *
ib_umem_dmabuf_get_pinned_with_dma_device(struct ib_device *device,
struct device *dma_device,
unsigned long offset, size_t size,
int fd, int access);
struct ib_umem_dmabuf *
ib_umem_dmabuf_get_pinned_with_dma_device(struct ib_device *device,
                                          struct device *dma_device,
                                          unsigned long offset, size_t size,
                                          int fd, int access)
{
        struct ib_umem_dmabuf *umem_dmabuf;
        int err;

        umem_dmabuf = ib_umem_dmabuf_get_with_dma_device(device, dma_device, offset,
                                                         size, fd, access,
                                                         &ib_umem_dmabuf_attach_pinned_ops);
        if (IS_ERR(umem_dmabuf))
                return umem_dmabuf;

        dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
        err = dma_buf_pin(umem_dmabuf->attach);
        if (err)
                goto err_release;
        umem_dmabuf->pinned = 1;

        err = ib_umem_dmabuf_map_pages(umem_dmabuf);
        if (err)
                goto err_unpin;
        dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);

        return umem_dmabuf;

err_unpin:
        dma_buf_unpin(umem_dmabuf->attach);
err_release:
        dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
        ib_umem_release(&umem_dmabuf->umem);
        return ERR_PTR(err);
}
EXPORT_SYMBOL(ib_umem_dmabuf_get_pinned_with_dma_device);
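The function above follows the classic kernel error-unwind idiom: lock, pin, map, and on any failure undo exactly the steps already taken, in reverse order, via goto labels. A runnable mock of that flow (plain C, no kernel APIs; the struct fields just record state, and the fail_* parameters simulate failures):

```c
#include <assert.h>

/* Hypothetical mock of the error-unwind pattern used by
 * ib_umem_dmabuf_get_pinned_with_dma_device(): each goto label undoes
 * one step, so every exit path leaves the object consistent. */
struct mock_buf {
    int locked, pinned, mapped;
};

static int mock_pin(struct mock_buf *b, int fail_pin, int fail_map)
{
    b->locked = 1;                 /* dma_resv_lock() */
    if (fail_pin)
        goto err_unlock;           /* dma_buf_pin() failed */
    b->pinned = 1;
    if (fail_map)
        goto err_unpin;            /* ib_umem_dmabuf_map_pages() failed */
    b->mapped = 1;
    b->locked = 0;                 /* dma_resv_unlock() */
    return 0;

err_unpin:
    b->pinned = 0;                 /* dma_buf_unpin() */
err_unlock:
    b->locked = 0;                 /* dma_resv_unlock() */
    return -1;
}
```

Note how the labels are ordered so that falling through err_unpin also executes err_unlock, mirroring the kernel function.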
static struct ib_umem_dmabuf *
ib_umem_dmabuf_get_with_dma_device(struct ib_device *device,
                                   struct device *dma_device,
                                   unsigned long offset, size_t size,
                                   int fd, int access,
                                   const struct dma_buf_attach_ops *ops)
{
        struct dma_buf *dmabuf;
        struct ib_umem_dmabuf *umem_dmabuf;
        struct ib_umem *umem;
        unsigned long end;
        struct ib_umem_dmabuf *ret = ERR_PTR(-EINVAL);

        if (check_add_overflow(offset, (unsigned long)size, &end))
                return ret;

        if (unlikely(!ops || !ops->move_notify))
                return ret;

        dmabuf = dma_buf_get(fd);
        if (IS_ERR(dmabuf))
                return ERR_CAST(dmabuf);

        if (dmabuf->size < end)
                goto out_release_dmabuf;

        umem_dmabuf = kzalloc(sizeof(*umem_dmabuf), GFP_KERNEL);
        if (!umem_dmabuf) {
                ret = ERR_PTR(-ENOMEM);
                goto out_release_dmabuf;
        }

        umem = &umem_dmabuf->umem;
        umem->ibdev = device;
        umem->length = size;
        umem->address = offset;
        umem->writable = ib_access_writable(access);
        umem->is_dmabuf = 1;

        if (!ib_umem_num_pages(umem))
                goto out_free_umem;

        umem_dmabuf->attach = dma_buf_dynamic_attach(
                                        dmabuf,
                                        dma_device,
                                        ops,
                                        umem_dmabuf);
        if (IS_ERR(umem_dmabuf->attach)) {
                ret = ERR_CAST(umem_dmabuf->attach);
                goto out_free_umem;
        }
        return umem_dmabuf;

out_free_umem:
        kfree(umem_dmabuf);
out_release_dmabuf:
        dma_buf_put(dmabuf);
        return ret;
}
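The check_add_overflow() guard at the top is what keeps a hostile (offset, size) pair from wrapping around before the dmabuf->size comparison. A user-space sketch of the same guard, using the GCC/Clang __builtin_add_overflow primitive that the kernel macro builds on:

```c
#include <assert.h>

/* Validate a registration range against a dma-buf size, mirroring the
 * two checks in ib_umem_dmabuf_get_with_dma_device(): reject a range
 * whose end wraps, then a range that does not fit in the buffer.
 * Returns 1 if the range is acceptable. */
static int umem_range_ok(unsigned long offset, unsigned long size,
                         unsigned long dmabuf_size)
{
    unsigned long end;

    if (__builtin_add_overflow(offset, size, &end))
        return 0;               /* offset + size wrapped around */
    return dmabuf_size >= end;  /* range must fit in the dma-buf */
}
```

Without the overflow check, offset = ULONG_MAX and size = 2 would produce a tiny `end` and pass the size comparison.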
https://github.com/linux-rdma/perftest/commit/49755436f022683456c06846550075775fa68aa5
Add a memory-type abstraction: refactor perftest to allow easy extension with different memory types (mainly for direct-memory support on AI accelerators). Move the existing implementations for DRAM (host memory), mmap, CUDA, and ROCm to the new design. Signed-off-by: Michael Margolin
Memory-type definition:
enum memory_type {
        MEMORY_HOST,
        MEMORY_MMAP,
        MEMORY_CUDA,
        MEMORY_ROCM
};
Taking memory creation (memory_create) as an example, memory is supported for CUDA, ROCm, Neuron (AWS), hl (Habana Labs devices), mlu (Cambricon MLU), mmap, and other device types:
struct memory_ctx {
        int (*init)(struct memory_ctx *ctx);
        int (*destroy)(struct memory_ctx *ctx);
        int (*allocate_buffer)(struct memory_ctx *ctx, int alignment, uint64_t size, int *dmabuf_fd,
                               uint64_t *dmabuf_offset, void **addr, bool *can_init);
        int (*free_buffer)(struct memory_ctx *ctx, int dmabuf_fd, void *addr, uint64_t size);
        void *(*copy_host_to_buffer)(void *dest, const void *src, size_t size);
        void *(*copy_buffer_to_host)(void *dest, const void *src, size_t size);
        void *(*copy_buffer_to_buffer)(void *dest, const void *src, size_t size);
};
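memory_ctx is a hand-rolled C "vtable": each memory type embeds the base struct in its own context and fills in the function pointers, as host_memory_create() and cuda_memory_create() do below. A self-contained, simplified backend showing the pattern (no dmabuf or hugepage support; the interface is trimmed accordingly and is not the exact perftest signature):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified memory_ctx-style interface (illustrative, not perftest's). */
struct memory_ctx {
    int (*init)(struct memory_ctx *ctx);
    int (*destroy)(struct memory_ctx *ctx);
    int (*allocate_buffer)(struct memory_ctx *ctx, int alignment,
                           uint64_t size, void **addr, bool *can_init);
    int (*free_buffer)(struct memory_ctx *ctx, void *addr);
};

/* A malloc-backed "host memory" backend embedding the base struct. */
struct demo_host_ctx {
    struct memory_ctx base;
};

static int demo_init(struct memory_ctx *ctx) { (void)ctx; return 0; }
static int demo_destroy(struct memory_ctx *ctx) { free(ctx); return 0; }

static int demo_alloc(struct memory_ctx *ctx, int alignment,
                      uint64_t size, void **addr, bool *can_init)
{
    (void)ctx;
    if (posix_memalign(addr, alignment, size))
        return -1;
    memset(*addr, 0, size);
    *can_init = true;           /* host memory: the CPU may touch it */
    return 0;
}

static int demo_free(struct memory_ctx *ctx, void *addr)
{
    (void)ctx;
    free(addr);
    return 0;
}

static struct memory_ctx *demo_host_memory_create(void)
{
    struct demo_host_ctx *ctx = calloc(1, sizeof(*ctx));
    if (!ctx)
        return NULL;
    ctx->base.init = demo_init;
    ctx->base.destroy = demo_destroy;
    ctx->base.allocate_buffer = demo_alloc;
    ctx->base.free_buffer = demo_free;
    return &ctx->base;
}
```

Callers hold only the struct memory_ctx pointer, so a CUDA or ROCm backend can be swapped in without touching the benchmark code.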
/* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
/*
* Copyright 2023 Amazon.com, Inc. or its affiliates. All rights reserved.
*/
#ifndef CUDA_MEMORY_H
#define CUDA_MEMORY_H
#include "memory.h"
#include "config.h"
struct perftest_parameters;
bool cuda_memory_supported();
bool cuda_memory_dmabuf_supported();
struct memory_ctx *cuda_memory_create(struct perftest_parameters *params);
#ifndef HAVE_CUDA
inline bool cuda_memory_supported() {
        return false;
}

inline bool cuda_memory_dmabuf_supported() {
        return false;
}

inline struct memory_ctx *cuda_memory_create(struct perftest_parameters *params) {
        return NULL;
}
#endif
#endif /* CUDA_MEMORY_H */
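Note the #ifndef HAVE_CUDA stubs above: when a backend is not compiled in, the header supplies inline fallbacks, so callers can probe support without scattering #ifdefs of their own. A minimal standalone illustration of the same pattern (HAVE_FOO and the foo_* names are made up for this sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Compile-time feature-stub pattern: when the optional backend is not
 * built (HAVE_FOO undefined), the header provides inline stubs, so
 * call sites stay free of preprocessor conditionals. */
#ifndef HAVE_FOO
static inline bool foo_supported(void) { return false; }
static inline void *foo_create(void) { return NULL; }
#endif

/* A caller can now select a backend unconditionally. */
static const char *pick_backend(void)
{
    return foo_supported() ? "foo" : "host";
}
```

Since HAVE_FOO is not defined here, pick_backend() falls back to the host backend; building with -DHAVE_FOO and a real implementation would flip it.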
Implementation:
struct cuda_memory_ctx {
        struct memory_ctx base;
        int device_id;
        char *device_bus_id;
        CUdevice cuDevice;
        CUcontext cuContext;
        bool use_dmabuf;
};
static int init_gpu(struct cuda_memory_ctx *ctx)
{
int cuda_device_id = ctx->device_id;
int cuda_pci_bus_id;
int cuda_pci_device_id;
int index;
CUdevice cu_device;
printf("initializing CUDA\n");
CUresult error = cuInit(0);
if (error != CUDA_SUCCESS) {
printf("cuInit(0) returned %d\n", error);
return FAILURE;
}
int deviceCount = 0;
error = cuDeviceGetCount(&deviceCount);
if (error != CUDA_SUCCESS) {
printf("cuDeviceGetCount() returned %d\n", error);
return FAILURE;
}
/* This function call returns 0 if there are no CUDA capable devices. */
if (deviceCount == 0) {
printf("There are no available device(s) that support CUDA\n");
return FAILURE;
}
if (cuda_device_id >= deviceCount) {
fprintf(stderr, "No such device ID (%d) exists in system\n", cuda_device_id);
return FAILURE;
}
printf("Listing all CUDA devices in system:\n");
for (index = 0; index < deviceCount; index++) {
CUCHECK(cuDeviceGet(&cu_device, index));
cuDeviceGetAttribute(&cuda_pci_bus_id, CU_DEVICE_ATTRIBUTE_PCI_BUS_ID , cu_device);
cuDeviceGetAttribute(&cuda_pci_device_id, CU_DEVICE_ATTRIBUTE_PCI_DEVICE_ID , cu_device);
printf("CUDA device %d: PCIe address is %02X:%02X\n", index, (unsigned int)cuda_pci_bus_id, (unsigned int)cuda_pci_device_id);
}
printf("\nPicking device No. %d\n", cuda_device_id);
CUCHECK(cuDeviceGet(&ctx->cuDevice, cuda_device_id));
char name[128];
CUCHECK(cuDeviceGetName(name, sizeof(name), cuda_device_id));
printf("[pid = %d, dev = %d] device name = [%s]\n", getpid(), ctx->cuDevice, name);
printf("creating CUDA Ctx\n");
/* Create context */
error = cuCtxCreate(&ctx->cuContext, CU_CTX_MAP_HOST, ctx->cuDevice);
if (error != CUDA_SUCCESS) {
printf("cuCtxCreate() error=%d\n", error);
return FAILURE;
}
printf("making it the current CUDA Ctx\n");
error = cuCtxSetCurrent(ctx->cuContext);
if (error != CUDA_SUCCESS) {
printf("cuCtxSetCurrent() error=%d\n", error);
return FAILURE;
}
return SUCCESS;
}
static void free_gpu(struct cuda_memory_ctx *ctx)
{
printf("destroying current CUDA Ctx\n");
CUCHECK(cuCtxDestroy(ctx->cuContext));
}
int cuda_memory_init(struct memory_ctx *ctx) {
struct cuda_memory_ctx *cuda_ctx = container_of(ctx, struct cuda_memory_ctx, base);
int return_value = 0;
if (cuda_ctx->device_bus_id) {
int err;
printf("initializing CUDA\n");
CUresult error = cuInit(0);
if (error != CUDA_SUCCESS) {
printf("cuInit(0) returned %d\n", error);
return FAILURE;
}
printf("Finding PCIe BUS %s\n", cuda_ctx->device_bus_id);
err = cuDeviceGetByPCIBusId(&cuda_ctx->device_id, cuda_ctx->device_bus_id);
if (err != 0) {
fprintf(stderr, "We have an error from cuDeviceGetByPCIBusId: %d\n", err);
}
printf("Picking GPU number %d\n", cuda_ctx->device_id);
}
return_value = init_gpu(cuda_ctx);
if (return_value) {
fprintf(stderr, "Couldn't init GPU context: %d\n", return_value);
return FAILURE;
}
#ifdef HAVE_CUDA_DMABUF
if (cuda_ctx->use_dmabuf) {
int is_supported = 0;
CUCHECK(cuDeviceGetAttribute(&is_supported, CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, cuda_ctx->cuDevice));
if (!is_supported) {
fprintf(stderr, "DMA-BUF is not supported on this GPU\n");
return FAILURE;
}
}
#endif
return SUCCESS;
}
int cuda_memory_destroy(struct memory_ctx *ctx) {
struct cuda_memory_ctx *cuda_ctx = container_of(ctx, struct cuda_memory_ctx, base);
free_gpu(cuda_ctx);
free(cuda_ctx);
return SUCCESS;
}
int cuda_memory_allocate_buffer(struct memory_ctx *ctx, int alignment, uint64_t size, int *dmabuf_fd,
uint64_t *dmabuf_offset, void **addr, bool *can_init) {
CUdeviceptr d_A;
int error;
size_t buf_size = (size + ACCEL_PAGE_SIZE - 1) & ~(ACCEL_PAGE_SIZE - 1);
printf("cuMemAlloc() of a %lu bytes GPU buffer\n", size);
error = cuMemAlloc(&d_A, buf_size);
if (error != CUDA_SUCCESS) {
printf("cuMemAlloc error=%d\n", error);
return FAILURE;
}
printf("allocated GPU buffer address at %016llx pointer=%p\n", d_A, (void *)d_A);
*addr = (void *)d_A;
*can_init = false;
#ifdef HAVE_CUDA_DMABUF
{
struct cuda_memory_ctx *cuda_ctx = container_of(ctx, struct cuda_memory_ctx, base);
if (cuda_ctx->use_dmabuf) {
CUdeviceptr aligned_ptr;
const size_t host_page_size = sysconf(_SC_PAGESIZE);
uint64_t offset;
size_t aligned_size;
// Round down to host page size
aligned_ptr = d_A & ~(host_page_size - 1);
offset = d_A - aligned_ptr;
aligned_size = (size + offset + host_page_size - 1) & ~(host_page_size - 1);
printf("using DMA-BUF for GPU buffer address at %#llx aligned at %#llx with aligned size %zu\n", d_A, aligned_ptr, aligned_size);
*dmabuf_fd = 0;
error = cuMemGetHandleForAddressRange((void *)dmabuf_fd, aligned_ptr, aligned_size, CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0);
if (error != CUDA_SUCCESS) {
printf("cuMemGetHandleForAddressRange error=%d\n", error);
return FAILURE;
}
*dmabuf_offset = offset;
}
}
#endif
return SUCCESS;
}
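The dma-buf handle must cover a host-page-aligned range, so the code above rounds the GPU pointer down to a page boundary, keeps the intra-page offset for the MR, and rounds the covered size up. That arithmetic in isolation (page_size stands in for sysconf(_SC_PAGESIZE)):

```c
#include <assert.h>
#include <stdint.h>

/* Compute the page-aligned range that must back a dma-buf handle for
 * a GPU allocation at `ptr` of `size` bytes, mirroring the alignment
 * math in cuda_memory_allocate_buffer(). page_size must be a power
 * of two. */
static void dmabuf_align(uint64_t ptr, uint64_t size, uint64_t page_size,
                         uint64_t *aligned_ptr, uint64_t *offset,
                         uint64_t *aligned_size)
{
    *aligned_ptr = ptr & ~(page_size - 1);          /* round down */
    *offset = ptr - *aligned_ptr;                   /* intra-page offset */
    *aligned_size = (size + *offset + page_size - 1)
                    & ~(page_size - 1);             /* round up */
}
```

The resulting aligned_ptr and aligned_size are what cuMemGetHandleForAddressRange() receives, while offset is reported back as dmabuf_offset so the MR still starts at the original pointer.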
int cuda_memory_free_buffer(struct memory_ctx *ctx, int dmabuf_fd, void *addr, uint64_t size) {
CUdeviceptr d_A = (CUdeviceptr)addr;
printf("deallocating RX GPU buffer %016llx\n", d_A);
cuMemFree(d_A);
return SUCCESS;
}
void *cuda_memory_copy_host_buffer(void *dest, const void *src, size_t size) {
cuMemcpy((CUdeviceptr)dest, (CUdeviceptr)src, size);
return dest;
}
void *cuda_memory_copy_buffer_to_buffer(void *dest, const void *src, size_t size) {
cuMemcpyDtoD((CUdeviceptr)dest, (CUdeviceptr)src, size);
return dest;
}
bool cuda_memory_supported() {
return true;
}
bool cuda_memory_dmabuf_supported() {
#ifdef HAVE_CUDA_DMABUF
return true;
#else
return false;
#endif
}
struct memory_ctx *cuda_memory_create(struct perftest_parameters *params) {
struct cuda_memory_ctx *ctx;
ALLOCATE(ctx, struct cuda_memory_ctx, 1);
ctx->base.init = cuda_memory_init;
ctx->base.destroy = cuda_memory_destroy;
ctx->base.allocate_buffer = cuda_memory_allocate_buffer;
ctx->base.free_buffer = cuda_memory_free_buffer;
ctx->base.copy_host_to_buffer = cuda_memory_copy_host_buffer;
ctx->base.copy_buffer_to_host = cuda_memory_copy_host_buffer;
ctx->base.copy_buffer_to_buffer = cuda_memory_copy_buffer_to_buffer;
ctx->device_id = params->cuda_device_id;
ctx->device_bus_id = params->cuda_device_bus_id;
ctx->use_dmabuf = params->use_cuda_dmabuf;
return &ctx->base;
}
struct host_memory_ctx {
        struct memory_ctx base;
        int use_hugepages;
};
#define HUGEPAGE_ALIGN (2*1024*1024)
#define SHMAT_ADDR (void *)(0x0UL)
#define SHMAT_FLAGS (0)
#define SHMAT_INVALID_PTR ((void *)-1)
#if !defined(__FreeBSD__)
int alloc_hugepage_region(int alignment, uint64_t size, void **addr)
{
int huge_shmid;
uint64_t buf_size;
uint64_t buf_alignment = (((alignment + HUGEPAGE_ALIGN -1) / HUGEPAGE_ALIGN) * HUGEPAGE_ALIGN);
buf_size = (((size + buf_alignment -1 ) / buf_alignment ) * buf_alignment);
/* create hugepage shared region */
huge_shmid = shmget(IPC_PRIVATE, buf_size, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
if (huge_shmid < 0) {
fprintf(stderr, "Failed to allocate hugepages. Please configure hugepages\n");
return FAILURE;
}
/* attach shared memory */
*addr = (void *)shmat(huge_shmid, SHMAT_ADDR, SHMAT_FLAGS);
if (*addr == SHMAT_INVALID_PTR) {
fprintf(stderr, "Failed to attach shared memory region\n");
return FAILURE;
}
/* Mark shmem for removal */
if (shmctl(huge_shmid, IPC_RMID, 0) != 0) {
fprintf(stderr, "Failed to mark shm for removal\n");
return FAILURE;
}
return SUCCESS;
}
#endif
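alloc_hugepage_region() first rounds the requested alignment up to a multiple of the 2 MiB hugepage size, then rounds the size up to a multiple of that alignment, so the shmget(SHM_HUGETLB) segment always spans whole hugepages. The rounding in isolation:

```c
#include <assert.h>
#include <stdint.h>

/* Same constant as HUGEPAGE_ALIGN above (2 MiB). */
#define DEMO_HUGEPAGE_ALIGN (2ULL * 1024 * 1024)

/* Round x up to the next multiple of m (m > 0). */
static uint64_t round_up_mult(uint64_t x, uint64_t m)
{
    return ((x + m - 1) / m) * m;
}

/* Size of the hugepage shm segment for a requested (alignment, size)
 * pair, mirroring the buf_alignment / buf_size math above. */
static uint64_t hugepage_buf_size(uint64_t alignment, uint64_t size)
{
    uint64_t buf_alignment = round_up_mult(alignment, DEMO_HUGEPAGE_ALIGN);
    return round_up_mult(size, buf_alignment);
}
```

So a 3 MiB request with 4 KiB alignment still consumes two full 2 MiB hugepages.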
int host_memory_init(struct memory_ctx *ctx) {
return SUCCESS;
}
int host_memory_destroy(struct memory_ctx *ctx) {
struct host_memory_ctx *host_ctx = container_of(ctx, struct host_memory_ctx, base);
free(host_ctx);
return SUCCESS;
}
int host_memory_allocate_buffer(struct memory_ctx *ctx, int alignment, uint64_t size, int *dmabuf_fd,
uint64_t *dmabuf_offset, void **addr, bool *can_init) {
#if defined(__FreeBSD__)
posix_memalign(addr, alignment, size);
#else
struct host_memory_ctx *host_ctx = container_of(ctx, struct host_memory_ctx, base);
if (host_ctx->use_hugepages) {
if (alloc_hugepage_region(alignment, size, addr) != 0){
fprintf(stderr, "Failed to allocate hugepage region.\n");
return FAILURE;
}
} else {
*addr = memalign(alignment, size);
}
#endif
if (!*addr) {
fprintf(stderr, "Couldn't allocate work buf.\n");
return FAILURE;
}
memset(*addr, 0, size);
*can_init = true;
return SUCCESS;
}
int host_memory_free_buffer(struct memory_ctx *ctx, int dmabuf_fd, void *addr, uint64_t size) {
struct host_memory_ctx *host_ctx = container_of(ctx, struct host_memory_ctx, base);
if (host_ctx->use_hugepages) {
shmdt(addr);
} else {
free(addr);
}
return SUCCESS;
}
struct memory_ctx *host_memory_create(struct perftest_parameters *params) {
struct host_memory_ctx *ctx;
ALLOCATE(ctx, struct host_memory_ctx, 1);
ctx->base.init = host_memory_init;
ctx->base.destroy = host_memory_destroy;
ctx->base.allocate_buffer = host_memory_allocate_buffer;
ctx->base.free_buffer = host_memory_free_buffer;
ctx->base.copy_host_to_buffer = memcpy;
ctx->base.copy_buffer_to_host = memcpy;
ctx->base.copy_buffer_to_buffer = memcpy;
ctx->use_hugepages = params->use_hugepages;
return &ctx->base;
}
https://github.com/linux-rdma/perftest/commit/b304311dd9227f6048e523a9099a7399b8361d97
Signed-off-by: Gil Rockah
Build:
./configure CUDA_H_PATH="<cuda.h path>"
Run (at this point only bandwidth (BW) tests are supported):
./ib_write_bw --use_cuda
Core implementation:
static int pp_init_gpu(struct pingpong_context *ctx, size_t _size)
{
int ret = 0;
const size_t gpu_page_size = 64*1024;
size_t size = (_size + gpu_page_size - 1) & ~(gpu_page_size - 1);
printf("initializing CUDA\n");
CUresult error = cuInit(0);
if (error != CUDA_SUCCESS) {
printf("cuInit(0) returned %d\n", error);
exit(1);
}
int deviceCount = 0;
error = cuDeviceGetCount(&deviceCount);
if (error != CUDA_SUCCESS) {
printf("cuDeviceGetCount() returned %d\n", error);
exit(1);
}
// This function call returns 0 if there are no CUDA capable devices.
if (deviceCount == 0) {
printf("There are no available device(s) that support CUDA\n");
return 1;
} else if (deviceCount == 1)
printf("There is 1 device supporting CUDA\n");
else
printf("There are %d devices supporting CUDA, picking first...\n", deviceCount);
int devID = 0;
// pick up device with zero ordinal (default, or devID)
CUCHECK(cuDeviceGet(&cuDevice, devID));
char name[128];
CUCHECK(cuDeviceGetName(name, sizeof(name), devID));
printf("[pid = %d, dev = %d] device name = [%s]\n", getpid(), cuDevice, name);
printf("creating CUDA Ctx\n");
// Create context
error = cuCtxCreate(&cuContext, CU_CTX_MAP_HOST, cuDevice);
if (error != CUDA_SUCCESS) {
printf("cuCtxCreate() error=%d\n", error);
return 1;
}
printf("making it the current CUDA Ctx\n");
error = cuCtxSetCurrent(cuContext);
if (error != CUDA_SUCCESS) {
printf("cuCtxSetCurrent() error=%d\n", error);
return 1;
}
printf("cuMemAlloc() of a %zu bytes GPU buffer\n", size);
CUdeviceptr d_A;
error = cuMemAlloc(&d_A, size);
if (error != CUDA_SUCCESS) {
printf("cuMemAlloc error=%d\n", error);
return 1;
}
printf("allocated GPU buffer address at %016llx pointer=%p\n", d_A, (void *)d_A);
ctx->buf = (void *)d_A; /* store the GPU buffer address in the perftest context */
return 0;
}
static int pp_free_gpu(struct pingpong_context *ctx)
{
int ret = 0;
CUdeviceptr d_A = (CUdeviceptr) ctx->buf;
printf("deallocating RX GPU buffer\n");
cuMemFree(d_A);
d_A = 0;
printf("destroying current CUDA Ctx\n");
CUCHECK(cuCtxDestroy(cuContext));
return ret;
}
GPUDirect usage:
To utilize GPUDirect feature, perftest should be compiled as:
./autogen.sh && ./configure CUDA_H_PATH=<path to cuda.h> && make -j, e.g.:
./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
The --use_cuda=<gpu index> flag will then be available to add to a command line:
./ib_write_bw -d ib_dev --use_cuda=<gpu index> -a
CUDA DMA-BUF requirements:
1) CUDA Toolkit 11.7 or later.
2) NVIDIA Open-Source GPU Kernel Modules version 515 or later.
installation instructions: http://us.download.nvidia.com/XFree86/Linux-x86_64/515.43.04/README/kernel_open.html
3) Configuration / Usage:
export the following environment variables:
1- export LD_LIBRARY_PATH.
e.g: export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
2- export LIBRARY_PATH.
e.g: export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
Perform the compilation as described at the beginning of section 4 (GPUDirect usage).
enum ibv_mr_type {
        IBV_MR_TYPE_MR,
        IBV_MR_TYPE_NULL_MR,
        IBV_MR_TYPE_IMPORTED_MR,
        IBV_MR_TYPE_DMABUF_MR,
};
struct ibv_mr *(*reg_dmabuf_mr)(struct ibv_pd *pd, uint64_t offset,
size_t length, uint64_t iova,
int fd, int access);
int ibv_cmd_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length,
uint64_t iova, int fd, int access,
struct verbs_mr *vmr);
...
verbs memory-registration ioctl path: UVERBS_METHOD_REG_DMABUF_MR -> kernel
...
ibv_reg_mr registers GPU memory allocated via the nvidia-peermem module, while ibv_reg_dmabuf_mr registers memory backed by a DMA-BUF.
perftest > ib_write_bw:
cuDeviceGetAttribute(&is_supported, CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, cuDevice)
root@gdr114:~/big/ofed# run_cmd "uname -a"
2025/03/10 16:33:40 s114 uname -a
Linux gdr114 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
2025/03/10 16:33:40 s116 uname -a
Linux gdr116 6.11.0-19-generic #19~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb 17 11:51:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
root@gdr114:~/big/ofed# run_cmd "cat /etc/*-release"
2025/03/10 16:33:56 s114 cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
2025/03/10 16:33:56 s116 cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
root@gdr114:~/big/ofed#
root@xw:~/big/ofed/MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu24.04-x86_64# ./mlnxofedinstall --add-kernel-support --with-nvmf --force -vvv
or
./mlnxofedinstall --with-nfsrdma --with-nvmf --enable-gds --add-kernel-support -vvv
Check the OFED version:
root@gdr114:~# ofed_info -s
MLNX_OFED_LINUX-24.10-2.1.8.0:
root@gdr114:~# dpkg -l | grep ofed
ii mlnx-ofed-kernel-modules 24.10.OFED.24.10.2.1.8.1-1.kver.6.8.0-41-generic amd64 mlnx-ofed kernel modules
ii mlnx-ofed-kernel-utils 24.10.OFED.24.10.2.1.8.1-1.kver.6.8.0-41-generic amd64 Userspace tools to restart and tune mlnx-ofed kernel modules
ii ofed-scripts 24.10.OFED.24.10.2.1.8-1 amd64 MLNX_OFED utilities
root@xw:~/big/ofed/MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu24.04-x86_64# cat /etc/netplan/01-network-manager-all.yaml
network:
version: 2
renderer: networkd
ethernets:
enp8s0f0np0:
dhcp4: no
addresses:
- 192.168.1.114/24
nameservers:
addresses: [8.8.8.8, 8.8.4.4]
root@xw:~/big/ofed/MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu24.04-x86_64# netplan apply
A new script, nvidia-driver-assistant, detects and installs the NVIDIA driver package best suited to the system. It helps users decide which NVIDIA graphics driver to install based on the detected hardware:
apt install nvidia-driver-assistant -y
Install the recommended driver:
nvidia-driver-assistant --install
nvidia-driver-assistant --list-supported-distros
nvidia-driver-assistant --supported-gpus
Install the full CUDA and NVIDIA driver as recommended (legacy):
apt-get install -y cuda-drivers   (driver installation for Tesla V100 GPUs) or
sudo apt-get install -y nvidia-open   (RTX A6000 GPUs)
The NVIDIA Linux GPU driver consists of several kernel modules: nvidia.ko, nvidia-modeset.ko, nvidia-uvm.ko, nvidia-drm.ko, and nvidia-peermem.ko.
NVIDIA Tesla driver installation guide: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html
List currently installed packages:
apt list --installed|grep nvidia
Remove previously installed NVIDIA packages:
sudo apt-get --purge remove nvidia*
sudo apt autoremove
reboot
cp /var/nvidia-driver-local-repo-ubuntu2404-550.144.03/nvidia-driver-local-B583B818-keyring.gpg /usr/share/keyrings/
apt install ./nvidia-driver-local-repo-ubuntu2404-550.144.03_1.0-1_amd64.deb
Install the driver:
apt-get install -y cuda-drivers   (driver installation for Tesla V100 GPUs) or
sudo apt-get install -y nvidia-open   (RTX A6000 GPUs)
NVIDIA provides a user-space daemon on Linux to keep driver state persistent across CUDA jobs. The daemon approach is a more elegant and robust solution to this problem than persistence mode. See the NVIDIA persistence daemon documentation for details.
All distribution packages include a systemd preset that enables the daemon automatically when it is installed as a dependency of other driver components. It can be restarted manually with:
systemctl restart persistenced   [not currently working here]
cat /proc/driver/nvidia/version
Load the nvidia_peermem module:
modprobe nvidia_peermem
Load nvidia_peermem at boot:
echo "modprobe nvidia_peermem" >>/etc/rc.local
Inspect the nvidia driver:
root@gdr114:~# modinfo nvidia
filename: /lib/modules/6.8.0-41-generic/updates/dkms/nvidia.ko.zst
alias: char-major-195-*
version: 570.124.06
supported: external
license: NVIDIA
firmware: nvidia/570.124.06/gsp_tu10x.bin
firmware: nvidia/570.124.06/gsp_ga10x.bin
srcversion: F3571BE77DC2A8306835926
alias: pci:v000010DEd*sv*sd*bc06sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends:
retpoline: Y
name: nvidia
vermagic: 6.8.0-41-generic SMP preempt mod_unload modversions
sig_id: PKCS#7
signer: gdr114 Secure Boot Module Signature key
sig_key: 0B:D5:4E:5F:6F:EE:2F:28:7E:64:A8:8C:C3:74:40:BE:47:FF:86:51
sig_hashalgo: sha512
2025/03/11 15:02:44 s116 modinfo nvidia
filename: /lib/modules/6.11.0-19-generic/updates/dkms/nvidia.ko.zst
alias: char-major-195-*
version: 570.124.06
supported: external
license: NVIDIA
firmware: nvidia/570.124.06/gsp_tu10x.bin
firmware: nvidia/570.124.06/gsp_ga10x.bin
srcversion: F3571BE77DC2A8306835926
alias: pci:v000010DEd*sv*sd*bc06sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends:
retpoline: Y
name: nvidia
vermagic: 6.11.0-19-generic SMP preempt mod_unload modversions
root@gdr114:~# run_cmd "cat /proc/driver/nvidia/version"
2025/03/10 21:21:47 s114 cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 570.124.06 Wed Feb 26 02:12:04 UTC 2025
GCC version: gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
2025/03/10 21:21:47 s116 cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 570.124.06 Wed Feb 26 02:12:04 UTC 2025
GCC version: gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
root@gdr114:~# run_cmd "apt list --installed|grep nvidia"
2025/03/10 21:21:59 s114 apt list --installed|grep nvidia
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libnvidia-cfg1-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-common-570/unknown,now 570.124.06-0ubuntu1 all [installed,automatic]
libnvidia-compute-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-compute-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-decode-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-decode-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-encode-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-encode-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-extra-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-fbc1-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-fbc1-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-gl-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-gl-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-ml-dev/noble,now 12.0.140~12.0.1-4build4 amd64 [installed,automatic]
nvidia-compute-utils-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-cuda-dev/noble,now 12.0.146~12.0.1-4build4 amd64 [installed,automatic]
nvidia-cuda-gdb/noble,now 12.0.140~12.0.1-4build4 amd64 [installed,automatic]
nvidia-cuda-toolkit-doc/noble,noble,now 12.0.1-4build4 all [installed,automatic]
nvidia-cuda-toolkit/noble,now 12.0.140~12.0.1-4build4 amd64 [installed] // CUDA version
nvidia-dkms-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-driver-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-driver-assistant/unknown,now 0.20.124.06-1 all [installed]
nvidia-driver-local-repo-ubuntu2404-550.144.03/now 1.0-1 amd64 [installed,local]
nvidia-firmware-570-570.124.06/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-common-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-source-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-modprobe/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-opencl-dev/noble,now 12.0.140~12.0.1-4build4 amd64 [installed,automatic]
nvidia-profiler/noble,now 12.0.146~12.0.1-4build4 amd64 [installed,automatic]
nvidia-settings/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-visual-profiler/noble,now 12.0.146~12.0.1-4build4 amd64 [installed,automatic]
xserver-xorg-video-nvidia-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
2025/03/10 21:21:59 s116 apt list --installed|grep nvidia
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libnvidia-cfg1-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-common-570/unknown,now 570.124.06-0ubuntu1 all [installed,automatic]
libnvidia-compute-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-compute-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-decode-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-decode-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-encode-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-encode-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-extra-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-fbc1-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-fbc1-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-gl-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
libnvidia-gl-570/unknown,now 570.124.06-0ubuntu1 i386 [installed,automatic]
libnvidia-ml-dev/noble,now 12.0.140~12.0.1-4build4 amd64 [installed,automatic]
nvidia-compute-utils-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-cuda-dev/noble,now 12.0.146~12.0.1-4build4 amd64 [installed,automatic]
nvidia-cuda-gdb/noble,now 12.0.140~12.0.1-4build4 amd64 [installed,automatic]
nvidia-cuda-toolkit-doc/noble,noble,now 12.0.1-4build4 all [installed,automatic]
nvidia-cuda-toolkit/noble,now 12.0.140~12.0.1-4build4 amd64 [installed] // CUDA version
nvidia-dkms-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-driver-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-driver-assistant/unknown,now 0.20.124.06-1 all [installed]
nvidia-driver-local-repo-ubuntu2404-550.144.03/now 1.0-1 amd64 [installed,local]
nvidia-firmware-570-570.124.06/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-common-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-source-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-modprobe/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-opencl-dev/noble,now 12.0.140~12.0.1-4build4 amd64 [installed,automatic]
nvidia-profiler/noble,now 12.0.146~12.0.1-4build4 amd64 [installed,automatic]
nvidia-settings/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
nvidia-visual-profiler/noble,now 12.0.146~12.0.1-4build4 amd64 [installed,automatic]
xserver-xorg-video-nvidia-570/unknown,now 570.124.06-0ubuntu1 amd64 [installed,automatic]
root@gdr114:~#
Check the NVCC compiler version:
root@gdr114:~# run_cmd "nvcc -V"
2025/03/10 21:26:54 s114 nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
2025/03/10 21:26:54 s116 nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
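A quick way to confirm the toolchain above actually works end-to-end is to compile and run a trivial kernel with nvcc. This is a minimal sketch: the file name hello.cu is arbitrary, and the compile/run step is guarded so it is skipped on machines where nvcc is not on PATH (running the binary additionally requires a CUDA-capable GPU).

```shell
# Write a minimal CUDA program; the file name hello.cu is arbitrary.
cat > hello.cu <<'EOF'
#include <cstdio>
__global__ void hello_kernel() { printf("hello from the GPU\n"); }
int main() {
    hello_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
EOF

# Compile and run only if nvcc is available.
if command -v nvcc >/dev/null 2>&1; then
    nvcc hello.cu -o hello && ./hello
fi
```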
root@gdr114:~# run_cmd "lsmod | grep nvidia"
2025/03/12 14:22:03 s114 lsmod | grep nvidia
nvidia_peermem 16384 0
nvidia_drm 131072 3
nvidia_modeset 1724416 3 nvidia_drm
nvidia 11640832 32 nvidia_peermem,nvidia_modeset
drm_ttm_helper 16384 1 nvidia_drm
video 77824 1 nvidia_modeset
ib_uverbs 200704 3 nvidia_peermem,rdma_ucm,mlx5_ib
2025/03/12 14:22:03 s116 lsmod | grep nvidia
nvidia_peermem 16384 0
nvidia_uvm 2154496 4
nvidia_drm 131072 3
nvidia_modeset 1724416 3 nvidia_drm
nvidia 11640832 40 nvidia_uvm,nvidia_peermem,nvidia_modeset
drm_ttm_helper 16384 1 nvidia_drm
video 77824 1 nvidia_modeset
ib_uverbs 200704 3 nvidia_peermem,rdma_ucm,mlx5_ib
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export CUDA_HOME=/usr/local/cuda
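The verification steps above (driver module, nvidia_peermem, nvcc, driver version) can be collected into a single script. This is a sketch under the assumptions of this post (Linux host, NVIDIA driver and MLNX_OFED installed); the script name gdr_check.sh and the check helper are made up for illustration.

```shell
#!/bin/sh
# gdr_check.sh (hypothetical name): aggregate the GDR readiness checks
# shown above. Each check prints PASS/FAIL instead of aborting.

check() {
    # check <label> <command...>: run the command quietly and report.
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label"
    fi
}

check "nvidia kernel module loaded"      sh -c 'lsmod | grep -q "^nvidia "'
check "nvidia_peermem loaded (for GDR)"  sh -c 'lsmod | grep -q "^nvidia_peermem"'
check "nvcc on PATH"                     sh -c 'command -v nvcc'
check "driver version readable"          cat /proc/driver/nvidia/version
```

If the nvidia_peermem check fails, `modprobe nvidia_peermem` typically loads it once the NVIDIA driver and MLNX_OFED are installed, as reflected in the lsmod output above.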
Next article (RDMA - GDR GPUDirect RDMA Quick Start, Part 2): https://cloud.tencent.com/developer/article/2508959
Original-work statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
For infringement concerns, contact cloudcommunity@tencent.com for removal.