High-Performance Linpack Benchmark

HPL Implementation: The open source BLIS library (BLAS-like Library Instantiation Software) is used for DGEMM, which performs the majority of the computations for HPL. This library can optionally be configured with threading support (POSIX threads or OpenMP). It comes with several pre-built optimized kernels, including an optimized AVX2 kernel for AMD Zen.

The ideal scalable implementation for an EPYC system is a hybrid approach: OpenMPI is used at the top level(s) of the hierarchy, and a multi-threaded (OpenMP) client is used for each OpenMPI rank, targeting a single shared L3 cache instance within the EPYC architecture. Within a cache-coherent domain (for example, within a server), the zero-copy KNEM shared memory module is used for OpenMPI communication among clients. Outside of this domain (for example, from server to server), RDMA or Ethernet would be used.

To take maximum advantage of the Zen architecture, the OpenMPI rank running the multi-threaded client is bound to an L3 cache domain, with one computational child thread per physical core sharing the cache (SMT disabled or unused). The AVX2 DGEMM computation is SIMD-bound, so an additional active thread per core hinders rather than helps performance. If SMT is enabled, each child thread is bound to only the lower numbered of the two logical CPUs per core.

Note that the script depends on the logical CPU numbering specified by the ACPI MADT supplied by the UEFI firmware. Dell 14G PowerEdge systems use a different CPU numbering than other systems on the market, and the script will require updates to accommodate them. The Hardware Locality library (hwloc) is used to allow OpenMPI to automatically extract platform topology information for process and memory binding.
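The binding rule described above (one rank per L3 cache domain, one thread per physical core, and the lower numbered logical CPU when SMT is enabled) can be illustrated with a small sketch. This is not the AMD script. The function names, the two numbering schemes, the value of four cores per L3, and the assumption that cores sharing an L3 are numbered consecutively are all hypothetical and chosen only for illustration. On a real system the mapping should come from hwloc or the operating system, since it depends on the MADT-defined numbering.

```python
def l3_domain_bindings(physical_cores, cores_per_l3, smt_siblings):
    """Return one list of logical CPU ids per L3 domain (that is, per MPI rank).

    smt_siblings maps a physical core index to the logical CPU ids on that core.
    Assumption: cores sharing an L3 cache have consecutive core indices.
    """
    if physical_cores % cores_per_l3 != 0:
        raise ValueError("physical_cores must be a multiple of cores_per_l3")
    bindings = []
    for first in range(0, physical_cores, cores_per_l3):
        cores = range(first, first + cores_per_l3)
        # One thread per physical core, pinned to the lowest numbered sibling.
        bindings.append([min(smt_siblings[core]) for core in cores])
    return bindings


def split_numbering(physical_cores):
    """Hypothetical scheme: core c owns logical CPUs c and c + physical_cores."""
    return {c: [c, c + physical_cores] for c in range(physical_cores)}


def interleaved_numbering(physical_cores):
    """Hypothetical scheme: core c owns logical CPUs 2c and 2c + 1."""
    return {c: [2 * c, 2 * c + 1] for c in range(physical_cores)}


if __name__ == "__main__":
    # Illustrative values only: 8 physical cores, 4 cores per L3 domain.
    for name, scheme in (("split", split_numbering),
                         ("interleaved", interleaved_numbering)):
        for rank, cpus in enumerate(l3_domain_bindings(8, 4, scheme(8))):
            print(f"{name}: rank {rank} -> logical CPUs {cpus}")
```

The two numbering schemes give different CPU lists for the same physical layout, which is why a binding script written for one platform's enumeration may need changes on another.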

1 AMD High Performance Linpack Benchmark

This document describes building the components to run the High Performance Linpack (HPL) benchmark using the AMD xhpl binary, HPL.dat, and script files.