From 860ca5277c37cc93d8e44e5b7a7757b930b83603 Mon Sep 17 00:00:00 2001
From: "Suren A. Chilingaryan"
Date: Wed, 13 May 2015 04:56:05 +0200
Subject: Add BIOS and kernel optimization instructions

---
 docs/HARDWARE | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)
 create mode 100644 docs/HARDWARE

(limited to 'docs/HARDWARE')

diff --git a/docs/HARDWARE b/docs/HARDWARE
new file mode 100644
index 0000000..aaa7c59
--- /dev/null
+++ b/docs/HARDWARE
@@ -0,0 +1,88 @@
+BIOS
+====
+ The important options in the BIOS:
+ - IOMMU (Intel VT-d) - enables hardware translation between physical and bus addresses
+ - No Snoop - disables hardware cache coherency between DMA and CPU
+ - Max Payload (MMIO Size) - the maximal (useful) payload for the PCIe protocol
+ - Above 4G Decoding - this seems to allow bus addresses wider than 32 bit
+ - Memory performance - frequency, channel interleaving, and the hardware prefetcher
+   affect memory performance
+
+
+IOMMU
+=====
+ - As many PCI devices can address only 32-bit memory, some address translation
+   mechanism is required for DMA operation (it also helps with security by limiting
+   PCI devices to the allowed address range). There are several ways to achieve this.
+   * Linux provides so-called bounce buffers (SWIOTLB). This is just a small memory
+     buffer in the lower 4 GB of memory. The DMA is actually performed into this buffer
+     and the data is then copied to the appropriate location. One problem with SWIOTLB
+     is that it does not guarantee 4K-aligned addresses when mapping memory pages (in
+     order to use the space optimally). This is not properly supported by either NWLDMA
+     or IPEDMA.
+   * Alternatively, the hardware IOMMU can be used, which provides hardware address
+     translation between physical and bus addresses. To use it, the technology has to
+     be enabled both in the BIOS and in the kernel.
+     + Intel VT-d or AMD-Vi (AMD IOMMU) virtualization technology has to be enabled in the BIOS
+     + The Intel IOMMU is enabled with the "intel_iommu=on" kernel parameter (the
+       alternative is to build the kernel with CONFIG_INTEL_IOMMU_DEFAULT_ON)
+     + Checking: dmesg | grep -e IOMMU -e DMAR -e PCI-DMA
+
+DMA Cache Coherency
+===================
+ The DMA API distinguishes two types of memory: coherent and non-coherent.
+ - For coherent memory, the hardware takes care of cache consistency. This is often
+   achieved by snooping (No Snoop should be disabled in the BIOS). Alternatively, the
+   same effect can be achieved by using non-cached memory. There are architectures with
+   100% cache-coherent memory and others where only part of the memory is kept cache
+   coherent. On such architectures coherent memory can be allocated with
+   dma_alloc_coherent(...) / dma_alloc_attrs(...)
+   * However, coherent memory can be slow (especially on large SMP systems). Also, the
+     minimal allocation unit may be restricted to a page. Therefore, it is useful to
+     group consistent mappings together.
+ - On the other hand, it is possible to allocate streaming DMA memory, which is
+   synchronized using:
+   pci_dma_sync_single_for_device / pci_dma_sync_single_for_cpu
+   (see the sketch after this section)
+ - It may happen that all memory is coherent anyway and we do not need to call these
+   two functions. Currently, this seems not to be required on x86_64, which may indicate
+   that snooping is performed for all available memory. On the other hand, it may be
+   only because, luckily, nothing has been cached so far.
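+
+ The sketch below illustrates both styles as they could look in a driver. It is only an
+ example and not code from this project: the device pointer, the buffer and the sizes
+ are placeholders, and the generic dma_* API is shown (the pci_dma_* helpers above are
+ thin wrappers around it).
+
+   #include <linux/pci.h>
+   #include <linux/dma-mapping.h>
+
+   #define DMA_BUF_SIZE (4UL << 20)                /* 4 MB example buffer */
+
+   /* Coherent buffer: the hardware keeps CPU caches consistent, no explicit syncs. */
+   static int example_coherent(struct pci_dev *pdev)
+   {
+           dma_addr_t bus_addr;
+           void *vaddr = dma_alloc_coherent(&pdev->dev, DMA_BUF_SIZE, &bus_addr, GFP_KERNEL);
+           if (!vaddr)
+                   return -ENOMEM;
+           /* ... program the DMA engine with bus_addr, access vaddr freely ... */
+           dma_free_coherent(&pdev->dev, DMA_BUF_SIZE, vaddr, bus_addr);
+           return 0;
+   }
+
+   /* Streaming buffer: mapped for DMA and synchronized around CPU accesses. */
+   static int example_streaming(struct pci_dev *pdev, void *kbuf)
+   {
+           dma_addr_t bus_addr = dma_map_single(&pdev->dev, kbuf, DMA_BUF_SIZE, DMA_FROM_DEVICE);
+           if (dma_mapping_error(&pdev->dev, bus_addr))
+                   return -ENOMEM;
+           /* ... the device DMAs into the buffer ... */
+           dma_sync_single_for_cpu(&pdev->dev, bus_addr, DMA_BUF_SIZE, DMA_FROM_DEVICE);
+           /* ... the CPU may now read kbuf ... */
+           dma_sync_single_for_device(&pdev->dev, bus_addr, DMA_BUF_SIZE, DMA_FROM_DEVICE);
+           /* ... further DMA transfers ... */
+           dma_unmap_single(&pdev->dev, bus_addr, DMA_BUF_SIZE, DMA_FROM_DEVICE);
+           return 0;
+   }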
+
+
+PCIe Payload
+============
+ - A kind of MTU for the PCIe protocol. The higher the value, the lower the slowdown
+   caused by protocol headers while streaming large amounts of data. The current values
+   can be checked with 'lspci -vv'. For each device, there are two values:
+   * MaxPayload under DevCap indicates the MaxPayload supported by the device
+   * MaxPayload under DevCtl indicates the MaxPayload negotiated between the device and
+     the chipset. The negotiated MaxPayload is the minimum over all the infrastructure
+     between the device and the chipset. Normally, it is limited by the MaxPayload
+     supported by the PCIe root port of the chipset. Most systems are currently
+     restricted to 256 bytes.
+
+
+Memory Performance
+==================
+ - Memory performance is quite critical as we currently triple the PCIe bandwidth:
+   the DMA writes to memory, we read that memory (it is not in the cache), and we write
+   it back to memory.
+ - The most important point is to enable Channel Interleaving (otherwise a single-channel
+   copy will be performed). On the other hand, Rank Interleaving does not matter much.
+ - On some motherboards (ASRock X79, for instance), interleaving in AUTO mode is switched
+   off when the memory speed is set manually. So, it is safer to turn interleaving on
+   explicitly.
+ - Hardware prefetching helps a little bit and should be turned on.
+ - A faster memory frequency helps. As we are streaming, I guess this is more important
+   than even slightly higher CAS & RAS latencies, but I have not checked.
+ - Memory bank conflicts may sometimes harm performance significantly. A bank conflict
+   happens if we read and write from/to different rows of the same bank (there could
+   also be a conflict with the DMA operation). I don't have a good idea how to prevent
+   this at the moment.
+ - The most efficient memcpy implementation depends on the CPU generation. For the latest
+   models, AVX seems to be the most efficient. Filling all AVX registers before writing
+   increases performance. Copying multiple pages in parallel also gains quite a lot of
+   performance (still, we first read from multiple pages and then write to multiple
+   pages, see ssebench).
+ - Usage of HugePages makes performance more stable. Using page-locked memory does not
+   help at all.
+ - This still gives about 10-15 GB/s at most. On multiprocessor systems it is about
+   5 GB/s because of performance penalties due to snooping. Therefore, copying with
+   multiple threads is preferable (see the sketch below).
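+
+ As an illustration of the last points, a minimal userspace sketch of a multi-threaded
+ copy out of a hugepage-backed buffer follows. It is only an example (the buffer size,
+ the thread count and the plain memcpy() are arbitrary choices, not the ssebench code);
+ compile with gcc -O2 -pthread.
+
+   #define _GNU_SOURCE
+   #include <pthread.h>
+   #include <stdio.h>
+   #include <stdlib.h>
+   #include <string.h>
+   #include <sys/mman.h>
+
+   #define BUF_SIZE  (1UL << 30)                 /* 1 GB buffer */
+   #define N_THREADS 4
+
+   struct chunk { const char *src; char *dst; size_t len; };
+
+   static void *copy_chunk(void *arg)
+   {
+           struct chunk *c = arg;
+           memcpy(c->dst, c->src, c->len);       /* glibc picks an AVX path on recent CPUs */
+           return NULL;
+   }
+
+   int main(void)
+   {
+           /* Hugepages (if configured) reduce TLB pressure and make performance more stable. */
+           char *src = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
+                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+           if (src == MAP_FAILED)                /* fall back to normal pages */
+                   src = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
+                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+           char *dst = malloc(BUF_SIZE);
+           if (src == MAP_FAILED || !dst) { perror("alloc"); return 1; }
+
+           pthread_t tid[N_THREADS];
+           struct chunk c[N_THREADS];
+           size_t part = BUF_SIZE / N_THREADS;   /* each thread copies its own slice */
+           for (int i = 0; i < N_THREADS; i++) {
+                   c[i] = (struct chunk){ src + i * part, dst + i * part, part };
+                   pthread_create(&tid[i], NULL, copy_chunk, &c[i]);
+           }
+           for (int i = 0; i < N_THREADS; i++)
+                   pthread_join(tid[i], NULL);
+           return 0;
+   }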