Internship on RDMA-based FPGA peer-to-peer collective communication
Internships at the Xilinx Research Labs in Dublin, Ireland
Xilinx Research Labs is a small, diverse and dynamic part of Xilinx. Through customization and tailored solutions, we investigate how programmable logic and FPGAs can help make data centers faster, cheaper and greener by accelerating common applications and reducing the energy consumption of a given workload. Our team conducts cutting-edge research in topics such as machine learning, HPC and video processing to push the performance envelope of what’s possible with today’s devices and to help shape the next big thing in computing. In particular, the team in Dublin is focused on deep neural networks, including training paradigms and techniques, novel hardware-friendly topologies, quantization techniques, and custom hardware architectures that help support the enormous computational workloads associated with the roll-out of AI, even in energy-constrained compute environments. Fulfilling this goal requires top talent, and thus we are looking to enrich our team with the finest engineers with bold, collaborative and creative personalities from top universities worldwide.
MPI is a popular communication API in high-performance computing (HPC). MPI enables remote compute nodes to identify each other and exchange data. It includes simple send and receive operations, as well as collective operations, which include, among others:
• Broadcast: send one data structure to all other nodes
• Scatter: divide up a data structure and send each piece to a different node
• Gather: the reverse of scatter
• Reduce: gather data from other nodes and compute the elementwise sum of the data received
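These collective semantics can be illustrated, independently of any real MPI implementation, with a short Python sketch that models each rank's data as a plain list. The function names here are illustrative, not MPI API calls:

```python
# Illustrative sketch (not real MPI): models each collective's data
# movement with plain Python lists, one list entry per "rank".

def broadcast(data, ranks):
    # Root's data structure is copied to every rank.
    return [list(data) for _ in range(ranks)]

def scatter(data, ranks):
    # Data is divided into equal chunks, one chunk per rank.
    chunk = len(data) // ranks
    return [data[i * chunk:(i + 1) * chunk] for i in range(ranks)]

def gather(pieces):
    # Reverse of scatter: concatenate each rank's piece at the root.
    return [x for piece in pieces for x in piece]

def reduce_sum(pieces):
    # Elementwise sum of the data received from every rank.
    return [sum(vals) for vals in zip(*pieces)]

print(broadcast([1, 2], 3))            # [[1, 2], [1, 2], [1, 2]]
print(scatter([1, 2, 3, 4], 2))        # [[1, 2], [3, 4]]
print(gather([[1, 2], [3, 4]]))        # [1, 2, 3, 4]
print(reduce_sum([[1, 2], [10, 20]]))  # [11, 22]
```

In real MPI these operations additionally designate a root rank and run across a communicator; the sketch only captures the data movement.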
The performance of MPI communication is key to the scalability of many HPC applications. Because of this, compute accelerator vendors typically provide specialized MPI-like libraries that optimize communication between their accelerators; examples are NCCL for Nvidia GPUs and the equivalent RCCL for AMD GPUs. FPGA offloading of MPI enables communication between FPGA-resident compute kernels without CPU intervention, reducing communication latency. On FPGA accelerator boards equipped with network interfaces (e.g. Alveo), MPI communication can utilize these interfaces directly, eliminating the latency of moving data across the PCIe bus to the host NIC. ACCL is such an MPI-like library for Xilinx Alveo FPGAs.
Description of Work
The internship is focused on the implementation and optimization of MPI offloading for Alveo. The Xilinx team has developed ACCL, an MPI-like offload system using UDP/TCP transport. ACCL is implemented as a Vitis RTL kernel. More specifically, the system itself is a Vivado block design utilizing a Microblaze for control, together with DMAs and other AXI blocks available in Vivado; some blocks are implemented in HLS. The kernel is linked to the network stack using Vitis.
Internship Duration: 6-9 Months
• Migrate the MPI offloading solution to RDMA using existing Xilinx IP (ERNIC). More specifically, the work involves:
- identifying and planning required work for the migration,
- implementing interface blocks in RTL or HLS,
- writing control software for the embedded Microblaze soft processor,
- developing and implementing a verification plan
• Measure and optimize the performance of existing and new variants of the MPI offload system:
- Profiling and optimizing the control code running on an embedded Microblaze
- Identifying functionality which can be moved from software into FPGA, and implementing the blocks in Vivado HLS or Verilog RTL
- Measuring and optimizing the interface between the MPI offload, FPGA kernels, other accelerator kernels, and the host
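As an illustration of the kind of functionality that can migrate from control software into a hardware datapath, the elementwise sum at the heart of a reduce collective is a natural candidate. The sketch below is hypothetical: in HLS such a block would consume two AXI streams word by word, while here plain Python iterators stand in for the streams:

```python
# Hypothetical sketch of a streaming elementwise-sum datapath, the kind
# of block a reduce collective might offload to the FPGA. Python
# iterators stand in for the two input AXI streams.

def reduce_stream(local_stream, incoming_stream):
    # Combine the local contribution with data arriving from the
    # network, emitting one summed word per iteration (one word per
    # clock cycle in a pipelined hardware implementation).
    for local_word, remote_word in zip(local_stream, incoming_stream):
        yield local_word + remote_word

local = iter([1, 2, 3, 4])
remote = iter([10, 20, 30, 40])
print(list(reduce_stream(local, remote)))  # [11, 22, 33, 44]
```

The streaming formulation matters because it needs no intermediate buffering: results can be forwarded to the next rank as soon as each word is summed, which is what makes a hardware implementation lower-latency than a software loop over completed buffers.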
The outcome of the project is a 100Gbps RDMA-enabled ACCL implementation which can compete in performance with MPI over 100Gbps InfiniBand.
Skills and Tools
Due to the nature of the ACCL implementation, the work requires (and builds) experience with Vivado block designs, programming and debugging Microblaze code, and developing FPGA circuits with Vivado HLS from C++. The work also builds experience with the Vitis FPGA acceleration framework; XRT, the software interface framework for FPGA kernels; and FPGA-based networking solutions. Interns will be exposed to several high-level applications accelerated on FPGA (DNN inference, HPCG).