author: Suren A. Chilingaryan <> 2020-09-03 03:00:30 +0200
committer: Suren A. Chilingaryan <> 2020-09-03 03:00:30 +0200
commit: 5172421d248250b4ab3b69eb57fd83656e23a4da
tree: a499d9f1dd0b74b754816884a59927b3171656fc /docs/mellanox.txt
parent: 7b2e6168b049be9e7852b2d364d897592eff69fc
This is unfinished work implementing out-of-UFO network servers
Diffstat (limited to 'docs/mellanox.txt')
1 files changed, 88 insertions, 0 deletions
diff --git a/docs/mellanox.txt b/docs/mellanox.txt
new file mode 100644
index 0000000..ed20048
--- /dev/null
+++ b/docs/mellanox.txt
@@ -0,0 +1,88 @@
+ - Send/Receive Queues
+ QP (Queue Pair): Combines RQ and SQ. Generally, irrelevant for the following
+ RQ (Receive Queue):
+ SQ (Send Queue):
+ CQ (Completion Queue): Completed operations reported here
+ EQ (Event Queue): Completions generate events (at specified rate) which in turn generate IRQs
+ WR/WQ (Work Request / Work Queue): These are basically the buffers (SG-lists) which should be either sent or used for data reception
+ *QE (* Queue Element/Entry): A single entry in the corresponding queue (WQE, CQE, EQE)
+ Flow: WQE --submit work--> WQ --execute--> SQ/RQ --on completion-> CQ --signal--> EQ -> IRQ
+ * Completion Event Moderation: Reduce the number of reported events (EQ)
+ - Offloads
+ RSS (Receive Side Scaling): Distribute load across CPU cores
+ LRO (Large Receive Offload): Group packets and deliver them to user-space as a single large packet [ ethtool -k shows whether LRO is on/off ]
+ - Various
+ AEV (Asynchronous Event): Errors, etc.
+ SRQ (Shared Receive Queue):
+ ICM (Interconnect Context Memory): Address Translation Tables, Control Objects, User Access Region (registers)
+ MPT (Memory Protection Table):
+ RMP (Receive Memory Pool):
+ TIR (Transport Interface Receive):
+ RQT (RQ Table):
+ MCG (Multicast Group):
+ - Network packets are streamed to ring buffers (with all Ethernet, IP, UDP/TCP headers).
+ The number of ring buffers depends on the VMA_RING_ALLOCATION parameter:
+ 0 - per network interface
+ 1 - per IP
+ => 10 - per socket
+ 20 - per thread (which was used to create the socket)
+ 30 - per core
+ 31 - per core (with some affinity of threads to cores)
+ - The memory for the ring buffers is allocated based on VMA_MEM_ALLOC_TYPE:
+ 0 - malloc (this will be very slow if large buffers are requested)
+ 1 - contiguous
+ => 2 - HugePages
+ - The number of buffers is controlled with VMA_RX_BUFS (this is the total across all rings)
+ * Each buffer is VMA_MTU bytes
+ * Recommended: VMA_RX_BUFS ~ #rings * VMA_RX_WRE (number of WRE allocated on all interfaces)
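The sizing rule above can be turned into a quick back-of-the-envelope calculation. The following is only a sketch: the function name `rx_buffer_plan` and the example numbers (4 rings, 16000 WRE per ring, 1500-byte MTU) are illustrative assumptions, not values taken from libvma.

```python
def rx_buffer_plan(num_rings, rx_wre, mtu):
    """Apply the rule of thumb VMA_RX_BUFS ~ #rings * VMA_RX_WRE.

    Returns the suggested VMA_RX_BUFS value and the memory (in bytes)
    those buffers occupy, assuming each buffer is VMA_MTU bytes.
    All inputs are supplied by the caller; nothing is queried from VMA.
    """
    rx_bufs = num_rings * rx_wre
    total_bytes = rx_bufs * mtu
    return rx_bufs, total_bytes


# Hypothetical example: 4 rings, 16000 WRE per ring, MTU of 1500 bytes.
bufs, total = rx_buffer_plan(num_rings=4, rx_wre=16000, mtu=1500)
print(f"VMA_RX_BUFS ~ {bufs}, about {total / 2**20:.1f} MiB of buffer memory")
```

Such an estimate is mainly useful for checking that the requested buffers fit into the available HugePages before starting the application.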
+ There are 3 interfaces:
+ - MP-RQ (Multi-packet Receive Queue): vma_cyclic_buffer_read
+ This is useful for processing data streams when the packet size stays constant and the packet flow doesn't change
+ drastically over time. Requires ConnectX-5 or newer.
+ * Use 'vma_add_ring_profile' to configure the size of ring buffer (specifies buffer size & the packet size)
+ * Set per-socket SO_VMA_RING_ALLOC_LOGIC using setsockopt
+ * Call 'vma_cyclic_buffer_read' to access the raw ring buffer, specifying the minimum and maximum number of packets to return
+ * The returned 'completion' structure references the position in the ring buffer. Packets in the ring buffer
+ include all headers (ethernet - 14 bytes, ip - 20 bytes, udp - 8 bytes).
+ * New packets meanwhile are written in the remaining part of the ring buffer (until the linear end of the
+ buffer - consequently the returned data is not overwritten).
+ * The buffer is rewound only on a call to 'vma_cyclic_buffer_read'. Fewer packets than the specified minimum
+ can be returned if the current position is near the end of the buffer and there is not enough space to fulfill
+ the minimum requirement.
+ * To ensure enough space for the follow-up packets, the buffer size has to be synchronized with the min/max packet
+ counts. It should never happen that space for only a few packets is left when the end of the buffer is
+ close.
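Two points from the list above can be illustrated with a toy model: locating the UDP payload behind the 14 + 20 + 8 header bytes, and seeing how the last read before the linear end of the buffer shrinks when the buffer size is not a multiple of the read size. This is a sketch in plain Python and does not call libvma. The helper names `udp_payload` and `read_sizes` are hypothetical, and the model assumes fixed-size packet slots, no VLAN tag, no IP options, and a reader that always takes as many packets as it is allowed.

```python
ETH_HDR = 14   # Ethernet header, no VLAN tag assumed
IP_HDR = 20    # IPv4 header without options assumed
UDP_HDR = 8    # UDP header
HEADER_BYTES = ETH_HDR + IP_HDR + UDP_HDR


def udp_payload(packet: bytes) -> bytes:
    """Skip the Ethernet/IP/UDP headers stored in front of the payload."""
    return packet[HEADER_BYTES:]


def read_sizes(slots: int, per_read: int) -> list:
    """Packets returned by successive reads over one pass of the buffer.

    Toy model: the buffer holds `slots` fixed-size packets, each read takes
    `per_read` packets, and a read never crosses the linear end of the buffer.
    """
    sizes = []
    pos = 0
    while pos < slots:
        n = min(per_read, slots - pos)  # short read near the end
        sizes.append(n)
        pos += n
    return sizes


# Buffer size is a multiple of the read size: every read is full.
print(read_sizes(1024, 64)[-1])
# Buffer size is not a multiple: the last read before rewinding is short.
print(read_sizes(1000, 64)[-1])
```

In this simplified model, choosing the number of packet slots as a multiple of the per-read packet count avoids the short read at the end of the buffer.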
+ - SocketXtreme: socketxtreme_poll
+ A more complex interface allowing more control over processing, particularly for packets of varying size.
+ Requires ConnectX-5 or newer.
+ * Get the ring buffers associated with the socket using 'get_socket_rings_num' and 'get_socket_rings_fds'
+ * Get ready completions on the specified ring buffer with 'socketxtreme_poll' (pass 'fd' returned with 'get_socket_rings_fds')
+ * For the second type, process an associated list of buffers and keep reference counting with 'socketxtreme_ref_vma_buf',
+ 'socketxtreme_free_vma_buf'.
+ * Clean/unreference received packets with socketxtreme_free_vma_packets
+ - Zero Copy: recvfrom_zcopy
+ The simplest interface, which also works with ConnectX-3 cards. The packets are still written to the ring buffers, but the data is
+ not copied out of them. Instead, this interface provides a way to get pointers to locations in the ring buffer. There is a slight
+ overhead compared to the MP-RQ approach for preparing the list of packet pointers.