Introduction
RDMA verbs programming uses a number of verbs interfaces, such as:
ibv_advise_mr
ibv_alloc_dm
ibv_open_device
ibv_get_device_list
ibv_query_device_ex
ibv_get_device_name
ibv_req_notify_cq
ibv_query_gid
ibv_memcpy_to_dm
ibv_get_cq_event
ibv_start_poll
ibv_end_poll
ibv_next_poll
ibv_wc_read_completion_ts
ibv_ack_cq_events
ibv_free_device_list
ibv_cq_ex_to_cq
ibv_modify_qp
...
The first of these to be called are the device-list query and the device open. Below, taking Ceph together with the mlx5 user-space and kernel drivers as examples, we walk through these calls in detail.
Ceph example
Taking the RDMA code on the Ceph server side as an example: before listening for client connections, the server first queries the device list (ibv_get_device_list) and opens a device (ibv_open_device) to obtain a context.
server.start()
msgr->bind(addr)
AsyncMessenger::bind
bindv -> int r = p->bind
int Processor::bind
listen_sockets.resize
conf->ms_bind_retry_count 3 retries
worker->center.submit_to lambda []()->void anonymous function
c->in_thread()
pthread_equal(pthread_self(), owner) current thread
C_submit_event<func> event(std::move(f), false) f=listen
void do_request -> f() -> listen -> worker->listen(listen_addr, k, opts, &listen_sockets[k]) -> int RDMAWorker::listen executed when the event fires
ib->init() -> void Infiniband::init
new DeviceList(cct)
ibv_get_device_list 4 ports
if (cct->_conf->ms_async_rdma_cm)
new Device(cct, device_list[i]) -> Device::Device
ibv_open_device
ibv_get_device_name
ibv_query_device see the device attributes: device_attr
get_device looks up the configured device name in the device list, defaulting to the first one, e.g. mlx5_0
binding_port -> void Device::binding_port
new Port(cct, ctxt, port_id) port IDs start at 1 -> Port::Port
ibv_query_port(ctxt, port_num, &port_attr)
ibv_query_gid(ctxt, port_num, gid_idx, &gid)
ib_physical_port = device->active_port->get_port_num() get the physical port
new ProtectionDomain(cct, device) -> Infiniband::ProtectionDomain::ProtectionDomain -> ibv_alloc_pd(device->ctxt)
support_srq = cct->_conf->ms_async_rdma_support_srq shared receive queue (SRQ)
rx_queue_len = device->device_attr.max_srq_wr ends up as 4096
tx_queue_len = device->device_attr.max_qp_wr - 1 one WR of the send queue is reserved for the beacon, e.g. 1024 (1_K, an overloaded operator)
device->device_attr.max_cqe the device allows up to 4194303 completion entries
memory_manager = new MemoryManager(cct, device, pd) -> Infiniband::MemoryManager::MemoryManager 128K -> mem_pool -> boost::pool
memory_manager->create_tx_pool(cct->_conf->ms_async_rdma_buffer_size, tx_queue_len) -> void Infiniband::MemoryManager::create_tx_pool
send = new Cluster(*this, size)
send->fill(tx_num) -> int Infiniband::MemoryManager::Cluster::fill
base = (char*)manager.malloc(bytes) -> void* Infiniband::MemoryManager::malloc -> std::malloc(size) standard allocation, or huge pages via huge_pages_malloc
ibv_reg_mr register the memory
new(chunk) Chunk
free_chunks.push_back(chunk)
create_shared_receive_queue
ibv_create_srq
post_chunks_to_rq -> int Infiniband::post_chunks_to_rq
chunk = get_memory_manager()->get_rx_buffer() -> return reinterpret_cast<Chunk *>(rxbuf_pool.malloc())
ibv_post_srq_recv
dispatcher->polling_start() -> void RDMADispatcher::polling_start
ib->get_memory_manager()->set_rx_stat_logger(perf_logger) -> void PerfCounters::set
tx_cc = ib->create_comp_channel(cct) -> Infiniband::CompletionChannel* Infiniband::create_comp_channel -> new Infiniband::CompletionChannel
tx_cq = ib->create_comp_queue(cct, tx_cc)
cq->init() -> int Infiniband::CompletionChannel::init
ibv_create_comp_channel create the completion channel -> NetHandler(cct).set_nonblock(channel->fd) set non-blocking
t = std::thread(&RDMADispatcher::polling, this) start the polling thread rdma-polling -> void RDMADispatcher::polling
tx_cq->poll_cq(MAX_COMPLETIONS, wc)
handle_tx_event -> tx_chunks.push_back(chunk) -> post_tx_buffer
tx -> void RDMAWorker::handle_pending_message()
handle_rx_event -> void RDMADispatcher::handle_rx_event
conn->post_chunks_to_rq(1) replenish one buffer (WR) to the receive queue -> int Infiniband::post_chunks_to_rq
ibv_post_srq_recv | ibv_post_recv
polled[conn].push_back(*response)
qp->remove_rq_wr(chunk)
chunk->clear_qp()
pass_wc -> void RDMAConnectedSocketImpl::pass_wc(std::vector<ibv_wc> &&v) -> notify() -> void RDMAConnectedSocketImpl::notify
eventfd_write(notify_fd, event_val) -> eventfd_read(notify_fd, &event_val) <- ssize_t RDMAConnectedSocketImpl::read <- process
new RDMAServerSocketImpl(cct, ib, dispatcher, this, sa, addr_slot)
int r = p->listen(sa, opt) -> int RDMAServerSocketImpl::listen
server_setup_socket = net.create_socket(sa.get_family(), true) -> socket_cloexec
net.set_nonblock
net.set_socket_options
::bind(server_setup_socket, sa.get_sockaddr(), sa.get_sockaddr_len()) system call
::listen backlog=512
*sock = ServerSocket(std::unique_ptr<ServerSocketImpl>(p))
cond.notify_all() -> wake up the waiting threads
dispatch_event_external -> void EventCenter::dispatch_event_external
external_events.push_back(e)
wakeup()
write(notify_send_fd, &buf, sizeof(buf)) buf=c -> notify_receive_fd, wakes up epoll_wait
event.wait()
msgr->add_dispatcher_head(&dispatcher)
ready()
p->start() -> void Processor::start()
worker->center.create_file_event listen_handler -> pro->accept() -> void Processor::accept()
msgr->start() -> int AsyncMessenger::start()
msgr->wait() -> void AsyncMessenger::wait()
Querying the device list (ibv_get_device_list)
Code path: rdma-core, libibverbs/device.c
Call stack
ibv_get_device_list -> returns the device list
ibverbs_get_device_list -> verbs: rebuild the cached ibv_device list on every call. Problem: libibverbs builds its cached ibv_device list only on the first call to ibv_get_device_list, so the list is never refreshed no matter how the hardware in the system changes. Solution: change ibv_get_device_list() so that subsequent calls rescan sysfs the same way the first one does and produce a fresh ibv_device list each time; to support this, the cached device list becomes a real linked list instead of a dynamic array. How new devices are identified: by the creation timestamp of /sys/class/infiniband_verbs/uverbs%d/ibdev, read with the stat system call and compared via the st_mtime field. On each rescan of the sysfs devices, every device is checked against the previous scan; if it was not there, a new ibv_device is allocated and added to the cached list. The next patch in the series handles devices that are no longer in use. Note: as required by the comment on struct verbs_device, this patch changes the IBVERBS_PRIVATE symbols.
find_sysfs_devs_nl -> verbs: use netlink rather than sysfs to discover uverbs devices. The netlink query gives us the ibdev index, which is mostly unique per device and acts as a stable id across renames, making verbs more robust if a device is renamed during a verbs user operation. Netlink also returns only the devices actually visible in the process's network namespace, which simplifies discovery.
rdmanl_socket_alloc
nl_socket_alloc
nl_socket_disable_auto_ack
nl_socket_disable_msg_peek
nl_connect(nl, NETLINK_RDMA)
rdmanl_get_devices find_sysfs_devs_nl_cb
nl_send_simple RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_GET) NLM_F_DUMP
nl_socket_modify_err_cb
nl_socket_modify_cb
nl_recvmsgs_default
find_uverbs_nl find_uverbs_sysfs try_access_device
rdmanl_get_chardev(nl, sysfs_dev->ibdev_idx, "uverbs", find_uverbs_nl_cb
nlmsg_alloc_simple RDMA_NLDEV_CMD_GET_CHARDEV
...
check_snprintf(path, sizeof(path), "%s/device/infiniband_verbs",
setup_sysfs_uverbs
abi_version
...
stat(devpath, &cdev_stat)
nl_socket_free
find_sysfs_devs
%s/class/infiniband_verbs
ibv_read_sysfs_file_at(uv_dirfd, "ibdev",
check_abi_version
"class/infiniband_verbs/abi_version"
try_all_drivers
try_drivers
match_driver_id -> VERBS_MATCH_SENTINEL -> verbs: provide common code to match providers against kernel devices. Checking PCI devices against a table was duplicated in essentially every driver. Following the kernel's pattern, a match table describing all the kernel devices a provider can handle is attached to the verbs_device_ops driver entry point, and the core code matches against it; the driver gets a pointer to the matched table entry in its allocation function. Matching is based on modalias rather than reading the PCI-specific vendor and device files, which supports ACPI and OF providers and gives providers a simple path to demand-loading keyed on their supported modalias strings, as in the kernel.
try_driver
match_device
alloc_device -> mlx5_device_alloc
dev->transport_type = IBV_TRANSPORT_IB -> transport type
...
load_drivers
dlhandle = dlopen(so_name, RTLD_NOW)
ibverbs_device_hold
This call uses the NetLink messaging mechanism to query and filter IB devices, returning the result to user space.
Opening the device (ibv_open_device)
Call stack:
ibv_open_device -> LATEST_SYMVER_FUNC(ibv_open_device -> verbs_open_device
verbs_get_device
cmd_fd = open_cdev -> verbs: enable verbs_open_device() to work on non-sysfs devices; starting with mlx5, verbs_open_device() can open non-sysfs devices via VFIO. Any other API on a verbs_sysfs_dev should then fail cleanly.
verbs_device->ops->alloc_context -> mlx5_alloc_context -> verbs: always allocate a verbs_context. With everything in one tree, the legacy init_context path can be changed to always allocate a verbs_context, swapping ibv_context for verbs_context in every provider's wrapper struct. To keep the per-provider diff small, the patch does several things at once: it introduces the verbs_init_and_alloc_context() macro, which allocates, zeroes and initializes the verbs_context for each driver and, notably, sets errno correctly on failure; it removes the boilerplate calloc/malloc/memset, cmd_fd and device assignments from all drivers; and alongside the verbs_init scheme it necessarily adds a verbs_uninit scheme that pushes the uninit calls down into the providers rather than common code, so the init error path can uninit correctly. Overall this follows the driver-initialization pattern that has worked well in the kernel. It also changes ibv_cmd_get_context to take a verbs_context, since most callers now have one, which keeps the diff small and makes the whole flow more consistent, paving the way to eliminate the init_context flow.
mlx5_init_context
verbs_init_and_alloc_context -> _verbs_init_and_alloc_context
verbs_init_context
ibverbs_device_hold
verbs_set_ops(context_ex, &verbs_dummy_ops) -> the RDMA verbs operations -> installed on the context. If this struct changes, the private IBVERBS_PRIVATE_ symbols must change with it. It is the union of every operation a driver can support; any new element added here must also be added to verbs_dummy_ops. Keep it sorted.
SET_OP -> installs the operations one by one
...
use_ioctl_write = has_ioctl_write(context)
mlx5_open_debug_file
mlx5_set_debug_mask
single_threaded_app
get_uar_info
get_total_uuars
get_num_low_lat_uuars
mlx5_cmd_get_context
ibv_cmd_get_context
...
execute_write_bufs(context, IB_USER_VERBS_CMD_GET_CONTEXT
mlx5_set_context
adjust_uar_info
cl_qmap_init
mlx5_mmap
mlx5_read_env
verbs_set_ops(v_ctx, &mlx5_ctx_common_ops)
mlx5_query_device_ctx
get_hca_general_caps
mlx5dv_devx_general_cmd MLX5_CMD_OP_QUERY_HCA_CAP
ibv_cmd_query_device_any
execute_cmd_write_ex IB_USER_VERBS_EX_CMD_QUERY_DEVICE
execute_cmd_write(context, IB_USER_VERBS_CMD_QUERY_DEVICE -> via the command-execution framework, crosses into the kernel
mlx5_set_singleton_nc_uar
set_lib_ops
ibv_cmd_alloc_async_fd
When the device is opened, SET_OP binds a series of verbs operations to the context (listed below), the debug file and debug mask are set up, and the device's hardware capability set is obtained by querying registers (e.g. MLX5_CMD_OP_QUERY_HCA_CAP below), filled in, and returned to user space.
mlx5 context common operations:
Setting the RDMA verbs operations:
static const struct verbs_context_ops mlx5_ctx_common_ops = {
.query_port = mlx5_query_port,
.alloc_pd = mlx5_alloc_pd,
.async_event = mlx5_async_event,
.dealloc_pd = mlx5_free_pd,
.reg_mr = mlx5_reg_mr,
.reg_dmabuf_mr = mlx5_reg_dmabuf_mr,
.rereg_mr = mlx5_rereg_mr,
.dereg_mr = mlx5_dereg_mr,
.alloc_mw = mlx5_alloc_mw,
.dealloc_mw = mlx5_dealloc_mw,
.bind_mw = mlx5_bind_mw,
.create_cq = mlx5_create_cq,
.poll_cq = mlx5_poll_cq,
.req_notify_cq = mlx5_arm_cq,
.cq_event = mlx5_cq_event,
.resize_cq = mlx5_resize_cq,
.destroy_cq = mlx5_destroy_cq,
.create_srq = mlx5_create_srq,
.modify_srq = mlx5_modify_srq,
.query_srq = mlx5_query_srq,
.destroy_srq = mlx5_destroy_srq,
.post_srq_recv = mlx5_post_srq_recv,
.create_qp = mlx5_create_qp,
.query_qp = mlx5_query_qp,
.modify_qp = mlx5_modify_qp,
.destroy_qp = mlx5_destroy_qp,
.post_send = mlx5_post_send,
.post_recv = mlx5_post_recv,
.create_ah = mlx5_create_ah,
.destroy_ah = mlx5_destroy_ah,
.attach_mcast = mlx5_attach_mcast,
.detach_mcast = mlx5_detach_mcast,
.advise_mr = mlx5_advise_mr,
.alloc_dm = mlx5_alloc_dm,
.alloc_parent_domain = mlx5_alloc_parent_domain,
.alloc_td = mlx5_alloc_td,
.attach_counters_point_flow = mlx5_attach_counters_point_flow,
.close_xrcd = mlx5_close_xrcd,
.create_counters = mlx5_create_counters,
.create_cq_ex = mlx5_create_cq_ex,
.create_flow = mlx5_create_flow,
.create_flow_action_esp = mlx5_create_flow_action_esp,
.create_qp_ex = mlx5_create_qp_ex,
.create_rwq_ind_table = mlx5_create_rwq_ind_table,
.create_srq_ex = mlx5_create_srq_ex,
.create_wq = mlx5_create_wq,
.dealloc_td = mlx5_dealloc_td,
.destroy_counters = mlx5_destroy_counters,
.destroy_flow = mlx5_destroy_flow,
.destroy_flow_action = mlx5_destroy_flow_action,
.destroy_rwq_ind_table = mlx5_destroy_rwq_ind_table,
.destroy_wq = mlx5_destroy_wq,
.free_dm = mlx5_free_dm,
.get_srq_num = mlx5_get_srq_num,
.import_dm = mlx5_import_dm,
.import_mr = mlx5_import_mr,
.import_pd = mlx5_import_pd,
.modify_cq = mlx5_modify_cq,
.modify_flow_action_esp = mlx5_modify_flow_action_esp,
.modify_qp_rate_limit = mlx5_modify_qp_rate_limit,
.modify_wq = mlx5_modify_wq,
.open_qp = mlx5_open_qp,
.open_xrcd = mlx5_open_xrcd,
.post_srq_ops = mlx5_post_srq_ops,
.query_device_ex = mlx5_query_device_ex,
.query_ece = mlx5_query_ece,
.query_rt_values = mlx5_query_rt_values,
.read_counters = mlx5_read_counters,
.reg_dm_mr = mlx5_reg_dm_mr,
.alloc_null_mr = mlx5_alloc_null_mr,
.free_context = mlx5_free_context,
.set_ece = mlx5_set_ece,
.unimport_dm = mlx5_unimport_dm,
.unimport_mr = mlx5_unimport_mr,
.unimport_pd = mlx5_unimport_pd,
.query_qp_data_in_order = mlx5_query_qp_data_in_order,
};
Device registers. From net/mlx5_core: use a hardware-register description header file. An auto-generated header describes the hardware registers, along with a set of set/get macros; the macros are statically checked to avoid overflow, handle byte order, and overall provide a clean way to encode commands. The header starts out small, and structs are added as the macros come into use. Some commands were removed from the command enum because they are not currently supported; they will be added when support becomes available.
Mellanox command opcodes (registers):
enum {
MLX5_CMD_OP_QUERY_HCA_CAP = 0x100,
MLX5_CMD_OP_QUERY_ADAPTER = 0x101,
MLX5_CMD_OP_INIT_HCA = 0x102,
MLX5_CMD_OP_TEARDOWN_HCA = 0x103,
MLX5_CMD_OP_ENABLE_HCA = 0x104,
MLX5_CMD_OP_DISABLE_HCA = 0x105,
MLX5_CMD_OP_QUERY_PAGES = 0x107,
MLX5_CMD_OP_MANAGE_PAGES = 0x108,
MLX5_CMD_OP_SET_HCA_CAP = 0x109,
MLX5_CMD_OP_QUERY_ISSI = 0x10a,
MLX5_CMD_OP_SET_ISSI = 0x10b,
MLX5_CMD_OP_SET_DRIVER_VERSION = 0x10d,
MLX5_CMD_OP_QUERY_SF_PARTITION = 0x111,
MLX5_CMD_OP_ALLOC_SF = 0x113,
MLX5_CMD_OP_DEALLOC_SF = 0x114,
MLX5_CMD_OP_SUSPEND_VHCA = 0x115,
MLX5_CMD_OP_RESUME_VHCA = 0x116,
MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE = 0x117,
MLX5_CMD_OP_SAVE_VHCA_STATE = 0x118,
MLX5_CMD_OP_LOAD_VHCA_STATE = 0x119,
MLX5_CMD_OP_CREATE_MKEY = 0x200,
MLX5_CMD_OP_QUERY_MKEY = 0x201,
MLX5_CMD_OP_DESTROY_MKEY = 0x202,
MLX5_CMD_OP_QUERY_SPECIAL_CONTEXTS = 0x203,
MLX5_CMD_OP_PAGE_FAULT_RESUME = 0x204,
MLX5_CMD_OP_ALLOC_MEMIC = 0x205,
MLX5_CMD_OP_DEALLOC_MEMIC = 0x206,
MLX5_CMD_OP_MODIFY_MEMIC = 0x207,
MLX5_CMD_OP_CREATE_EQ = 0x301,
MLX5_CMD_OP_DESTROY_EQ = 0x302,
MLX5_CMD_OP_QUERY_EQ = 0x303,
MLX5_CMD_OP_GEN_EQE = 0x304,
MLX5_CMD_OP_CREATE_CQ = 0x400,
MLX5_CMD_OP_DESTROY_CQ = 0x401,
MLX5_CMD_OP_QUERY_CQ = 0x402,
MLX5_CMD_OP_MODIFY_CQ = 0x403,
MLX5_CMD_OP_CREATE_QP = 0x500,
MLX5_CMD_OP_DESTROY_QP = 0x501,
MLX5_CMD_OP_RST2INIT_QP = 0x502,
MLX5_CMD_OP_INIT2RTR_QP = 0x503,
MLX5_CMD_OP_RTR2RTS_QP = 0x504,
MLX5_CMD_OP_RTS2RTS_QP = 0x505,
MLX5_CMD_OP_SQERR2RTS_QP = 0x506,
MLX5_CMD_OP_2ERR_QP = 0x507,
MLX5_CMD_OP_2RST_QP = 0x50a,
MLX5_CMD_OP_QUERY_QP = 0x50b,
MLX5_CMD_OP_SQD_RTS_QP = 0x50c,
MLX5_CMD_OP_INIT2INIT_QP = 0x50e,
MLX5_CMD_OP_CREATE_PSV = 0x600,
MLX5_CMD_OP_DESTROY_PSV = 0x601,
MLX5_CMD_OP_CREATE_SRQ = 0x700,
MLX5_CMD_OP_DESTROY_SRQ = 0x701,
MLX5_CMD_OP_QUERY_SRQ = 0x702,
MLX5_CMD_OP_ARM_RQ = 0x703,
MLX5_CMD_OP_CREATE_XRC_SRQ = 0x705,
MLX5_CMD_OP_DESTROY_XRC_SRQ = 0x706,
MLX5_CMD_OP_QUERY_XRC_SRQ = 0x707,
MLX5_CMD_OP_ARM_XRC_SRQ = 0x708,
MLX5_CMD_OP_CREATE_DCT = 0x710,
MLX5_CMD_OP_DESTROY_DCT = 0x711,
MLX5_CMD_OP_DRAIN_DCT = 0x712,
MLX5_CMD_OP_QUERY_DCT = 0x713,
MLX5_CMD_OP_ARM_DCT_FOR_KEY_VIOLATION = 0x714,
MLX5_CMD_OP_CREATE_XRQ = 0x717,
MLX5_CMD_OP_DESTROY_XRQ = 0x718,
MLX5_CMD_OP_QUERY_XRQ = 0x719,
MLX5_CMD_OP_ARM_XRQ = 0x71a,
MLX5_CMD_OP_QUERY_XRQ_DC_PARAMS_ENTRY = 0x725,
MLX5_CMD_OP_SET_XRQ_DC_PARAMS_ENTRY = 0x726,
MLX5_CMD_OP_QUERY_XRQ_ERROR_PARAMS = 0x727,
MLX5_CMD_OP_RELEASE_XRQ_ERROR = 0x729,
MLX5_CMD_OP_MODIFY_XRQ = 0x72a,
MLX5_CMD_OP_QUERY_ESW_FUNCTIONS = 0x740,
MLX5_CMD_OP_QUERY_VPORT_STATE = 0x750,
MLX5_CMD_OP_MODIFY_VPORT_STATE = 0x751,
MLX5_CMD_OP_QUERY_ESW_VPORT_CONTEXT = 0x752,
MLX5_CMD_OP_MODIFY_ESW_VPORT_CONTEXT = 0x753,
MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT = 0x754,
MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT = 0x755,
MLX5_CMD_OP_QUERY_ROCE_ADDRESS = 0x760,
MLX5_CMD_OP_SET_ROCE_ADDRESS = 0x761,
MLX5_CMD_OP_QUERY_HCA_VPORT_CONTEXT = 0x762,
MLX5_CMD_OP_MODIFY_HCA_VPORT_CONTEXT = 0x763,
MLX5_CMD_OP_QUERY_HCA_VPORT_GID = 0x764,
MLX5_CMD_OP_QUERY_HCA_VPORT_PKEY = 0x765,
MLX5_CMD_OP_QUERY_VNIC_ENV = 0x76f,
MLX5_CMD_OP_QUERY_VPORT_COUNTER = 0x770,
MLX5_CMD_OP_ALLOC_Q_COUNTER = 0x771,
MLX5_CMD_OP_DEALLOC_Q_COUNTER = 0x772,
MLX5_CMD_OP_QUERY_Q_COUNTER = 0x773,
MLX5_CMD_OP_SET_MONITOR_COUNTER = 0x774,
MLX5_CMD_OP_ARM_MONITOR_COUNTER = 0x775,
MLX5_CMD_OP_SET_PP_RATE_LIMIT = 0x780,
MLX5_CMD_OP_QUERY_RATE_LIMIT = 0x781,
MLX5_CMD_OP_CREATE_SCHEDULING_ELEMENT = 0x782,
MLX5_CMD_OP_DESTROY_SCHEDULING_ELEMENT = 0x783,
MLX5_CMD_OP_QUERY_SCHEDULING_ELEMENT = 0x784,
MLX5_CMD_OP_MODIFY_SCHEDULING_ELEMENT = 0x785,
MLX5_CMD_OP_CREATE_QOS_PARA_VPORT = 0x786,
MLX5_CMD_OP_DESTROY_QOS_PARA_VPORT = 0x787,
MLX5_CMD_OP_ALLOC_PD = 0x800,
MLX5_CMD_OP_DEALLOC_PD = 0x801,
MLX5_CMD_OP_ALLOC_UAR = 0x802,
MLX5_CMD_OP_DEALLOC_UAR = 0x803,
MLX5_CMD_OP_CONFIG_INT_MODERATION = 0x804,
MLX5_CMD_OP_ACCESS_REG = 0x805,
MLX5_CMD_OP_ATTACH_TO_MCG = 0x806,
MLX5_CMD_OP_DETACH_FROM_MCG = 0x807,
MLX5_CMD_OP_GET_DROPPED_PACKET_LOG = 0x80a,
MLX5_CMD_OP_MAD_IFC = 0x50d,
MLX5_CMD_OP_QUERY_MAD_DEMUX = 0x80b,
MLX5_CMD_OP_SET_MAD_DEMUX = 0x80c,
MLX5_CMD_OP_NOP = 0x80d,
MLX5_CMD_OP_ALLOC_XRCD = 0x80e,
MLX5_CMD_OP_DEALLOC_XRCD = 0x80f,
MLX5_CMD_OP_ALLOC_TRANSPORT_DOMAIN = 0x816,
MLX5_CMD_OP_DEALLOC_TRANSPORT_DOMAIN = 0x817,
MLX5_CMD_OP_QUERY_CONG_STATUS = 0x822,
MLX5_CMD_OP_MODIFY_CONG_STATUS = 0x823,
MLX5_CMD_OP_QUERY_CONG_PARAMS = 0x824,
MLX5_CMD_OP_MODIFY_CONG_PARAMS = 0x825,
MLX5_CMD_OP_QUERY_CONG_STATISTICS = 0x826,
MLX5_CMD_OP_ADD_VXLAN_UDP_DPORT = 0x827,
MLX5_CMD_OP_DELETE_VXLAN_UDP_DPORT = 0x828,
MLX5_CMD_OP_SET_L2_TABLE_ENTRY = 0x829,
MLX5_CMD_OP_QUERY_L2_TABLE_ENTRY = 0x82a,
MLX5_CMD_OP_DELETE_L2_TABLE_ENTRY = 0x82b,
MLX5_CMD_OP_SET_WOL_ROL = 0x830,
MLX5_CMD_OP_QUERY_WOL_ROL = 0x831,
MLX5_CMD_OP_CREATE_LAG = 0x840,
MLX5_CMD_OP_MODIFY_LAG = 0x841,
MLX5_CMD_OP_QUERY_LAG = 0x842,
MLX5_CMD_OP_DESTROY_LAG = 0x843,
MLX5_CMD_OP_CREATE_VPORT_LAG = 0x844,
MLX5_CMD_OP_DESTROY_VPORT_LAG = 0x845,
MLX5_CMD_OP_CREATE_TIR = 0x900,
MLX5_CMD_OP_MODIFY_TIR = 0x901,
MLX5_CMD_OP_DESTROY_TIR = 0x902,
MLX5_CMD_OP_QUERY_TIR = 0x903,
MLX5_CMD_OP_CREATE_SQ = 0x904,
MLX5_CMD_OP_MODIFY_SQ = 0x905,
MLX5_CMD_OP_DESTROY_SQ = 0x906,
MLX5_CMD_OP_QUERY_SQ = 0x907,
MLX5_CMD_OP_CREATE_RQ = 0x908,
MLX5_CMD_OP_MODIFY_RQ = 0x909,
MLX5_CMD_OP_SET_DELAY_DROP_PARAMS = 0x910,
MLX5_CMD_OP_DESTROY_RQ = 0x90a,
MLX5_CMD_OP_QUERY_RQ = 0x90b,
MLX5_CMD_OP_CREATE_RMP = 0x90c,
MLX5_CMD_OP_MODIFY_RMP = 0x90d,
MLX5_CMD_OP_DESTROY_RMP = 0x90e,
MLX5_CMD_OP_QUERY_RMP = 0x90f,
MLX5_CMD_OP_CREATE_TIS = 0x912,
MLX5_CMD_OP_MODIFY_TIS = 0x913,
MLX5_CMD_OP_DESTROY_TIS = 0x914,
MLX5_CMD_OP_QUERY_TIS = 0x915,
MLX5_CMD_OP_CREATE_RQT = 0x916,
MLX5_CMD_OP_MODIFY_RQT = 0x917,
MLX5_CMD_OP_DESTROY_RQT = 0x918,
MLX5_CMD_OP_QUERY_RQT = 0x919,
MLX5_CMD_OP_SET_FLOW_TABLE_ROOT = 0x92f,
MLX5_CMD_OP_CREATE_FLOW_TABLE = 0x930,
MLX5_CMD_OP_DESTROY_FLOW_TABLE = 0x931,
MLX5_CMD_OP_QUERY_FLOW_TABLE = 0x932,
MLX5_CMD_OP_CREATE_FLOW_GROUP = 0x933,
MLX5_CMD_OP_DESTROY_FLOW_GROUP = 0x934,
MLX5_CMD_OP_QUERY_FLOW_GROUP = 0x935,
MLX5_CMD_OP_SET_FLOW_TABLE_ENTRY = 0x936,
MLX5_CMD_OP_QUERY_FLOW_TABLE_ENTRY = 0x937,
MLX5_CMD_OP_DELETE_FLOW_TABLE_ENTRY = 0x938,
MLX5_CMD_OP_ALLOC_FLOW_COUNTER = 0x939,
MLX5_CMD_OP_DEALLOC_FLOW_COUNTER = 0x93a,
MLX5_CMD_OP_QUERY_FLOW_COUNTER = 0x93b,
MLX5_CMD_OP_MODIFY_FLOW_TABLE = 0x93c,
MLX5_CMD_OP_ALLOC_PACKET_REFORMAT_CONTEXT = 0x93d,
MLX5_CMD_OP_DEALLOC_PACKET_REFORMAT_CONTEXT = 0x93e,
MLX5_CMD_OP_QUERY_PACKET_REFORMAT_CONTEXT = 0x93f,
MLX5_CMD_OP_ALLOC_MODIFY_HEADER_CONTEXT = 0x940,
MLX5_CMD_OP_DEALLOC_MODIFY_HEADER_CONTEXT = 0x941,
MLX5_CMD_OP_QUERY_MODIFY_HEADER_CONTEXT = 0x942,
MLX5_CMD_OP_FPGA_CREATE_QP = 0x960,
MLX5_CMD_OP_FPGA_MODIFY_QP = 0x961,
MLX5_CMD_OP_FPGA_QUERY_QP = 0x962,
MLX5_CMD_OP_FPGA_DESTROY_QP = 0x963,
MLX5_CMD_OP_FPGA_QUERY_QP_COUNTERS = 0x964,
MLX5_CMD_OP_CREATE_GENERAL_OBJECT = 0xa00,
MLX5_CMD_OP_MODIFY_GENERAL_OBJECT = 0xa01,
MLX5_CMD_OP_QUERY_GENERAL_OBJECT = 0xa02,
MLX5_CMD_OP_DESTROY_GENERAL_OBJECT = 0xa03,
MLX5_CMD_OP_CREATE_UCTX = 0xa04,
MLX5_CMD_OP_DESTROY_UCTX = 0xa06,
MLX5_CMD_OP_CREATE_UMEM = 0xa08,
MLX5_CMD_OP_DESTROY_UMEM = 0xa0a,
MLX5_CMD_OP_SYNC_STEERING = 0xb00,
MLX5_CMD_OP_QUERY_VHCA_STATE = 0xb0d,
MLX5_CMD_OP_MODIFY_VHCA_STATE = 0xb0e,
MLX5_CMD_OP_SYNC_CRYPTO = 0xb12,
MLX5_CMD_OP_ALLOW_OTHER_VHCA_ACCESS = 0xb16,
MLX5_CMD_OP_MAX
};
Summary
Querying the device list uses the kernel's NetLink messaging mechanism to discover and filter devices.
Opening the device uses the safer ioctl path to read device registers and obtain the device capability set, giving the application hints for its decisions.
The drivers are also loaded dynamically (dlopen) as shared objects.
The implementation involves a large amount of code with real history behind it (much of it took shape five or ten years ago); the engineering and the design thinking behind it deserve respect, and there is still plenty for us to catch up on.
References
rdmamojo on opening a device: https://www.rdmamojo.com/2012/06/29/ibv_open_device/
rdma-core and the Linux kernel source
Xiaobing (ssbandjl)
Blog: https://cloud.tencent.com/developer/user/5060293/articles | https://logread.cn | https://blog.csdn.net/ssbandjl | https://www.zhihu.com/people/ssbandjl/posts
DPU column
https://cloud.tencent.com/developer/column/101987