Doctoral Dissertation Defense: Jonathan Graf
Advisor: Dr. Matthias Gobbert
Friday, April 14, 2017 · 3 - 5 PM
Title: Parallel Performance of Numerical Simulations for Applied Partial Differential Equation Models on the Intel Xeon Phi Knights Landing Processor
Abstract:
Current high-performance computing clusters feature CPUs with 8 to 16 cores. The many-integrated-core (MIC) Intel Xeon Phi processors feature 60 or more cores on a single chip, with lower power consumption per core than CPUs. The Intel Xeon Phi Knights Landing (KNL) is the second-generation Xeon Phi processor, released in 2016. It represents a significant improvement over the first-generation Knights Corner (KNC), since the KNL can serve as a standalone processor and has a 2D mesh interconnect on the chip to connect the cores to the 16 GB of high-performance memory on the chip. Because each Xeon Phi core is x86 compatible, this architecture is very accessible to researchers, who need only add a compiler flag to their code. But the different configurations available for the KNL add a layer of decisions for researchers on how to run their code. We use the Stampede cluster at the Texas Advanced Computing Center (TACC) for all hardware choices, since it is accessible to many researchers via an Extreme Science and Engineering Discovery Environment (XSEDE) allocation.
This work is inspired by the calcium-induced calcium release (CICR) model of calcium dynamics in a three-dimensional heart cell. This application problem is modeled by a system of coupled, non-linear, time-dependent advection-diffusion-reaction partial differential equations. The model now includes eight species and connections between the electrical excitation, calcium signaling, and mechanical contraction systems. Parameter studies on modern CPUs examine the feedback strength from the calcium signaling to the electrical excitation system and motivate the need for parameter studies on meshes that fit in the memory of a KNL.
The elliptic Poisson equation in two dimensions serves as a prototypical test problem, since the linear system solution by the conjugate gradient method mimics the computational kernels in many applications that use Krylov subspace methods. Our tests assess the configurations possible with the KNL and demonstrate the distinct advantage of the 16 GB of on-chip memory over the main memory of the node. For this problem, with localized communication and carefully managed memory efficiency, the performance of the main configuration choices is equivalent. We include a comparison to the first-generation KNC and modern CPU nodes currently available in Stampede and note the performance improvement.
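As an illustration of the kind of kernel this test problem exercises, the following is a minimal serial sketch of a matrix-free conjugate gradient solve for the 2D Poisson equation with the standard 5-point stencil and zero Dirichlet boundary values. It is not the dissertation's code: the function names, the unit-square domain, the manufactured right-hand side, and the tolerance are all assumptions chosen for the example, and the parallel (MPI/OpenMP) aspects are omitted entirely.

```python
import math


def apply_stencil(u, n):
    """Apply the 5-point stencil matrix (h^2 times the negative Laplacian)
    to a vector u stored row by row on an n-by-n interior grid.
    Neighbors outside the grid are zero (homogeneous Dirichlet boundary)."""
    v = [0.0] * (n * n)
    for j in range(n):
        for i in range(n):
            k = j * n + i
            s = 4.0 * u[k]
            if i > 0:
                s -= u[k - 1]
            if i < n - 1:
                s -= u[k + 1]
            if j > 0:
                s -= u[k - n]
            if j < n - 1:
                s -= u[k + n]
            v[k] = s
    return v


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(b, n, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for A x = b, with A applied matrix-free.
    Returns the approximate solution and the number of iterations used."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(r)          # first search direction
    rr = dot(r, r)
    b_norm = math.sqrt(rr)
    if b_norm == 0.0:
        return x, 0
    for it in range(1, max_iter + 1):
        q = apply_stencil(p, n)
        alpha = rr / dot(p, q)
        x = [xi + alpha * pk for xi, pk in zip(x, p)]
        r = [ri - alpha * qk for ri, qk in zip(r, q)]
        rr_new = dot(r, r)
        if math.sqrt(rr_new) <= tol * b_norm:
            return x, it
        beta = rr_new / rr
        p = [ri + beta * pk for ri, pk in zip(r, p)]
        rr = rr_new
    return x, max_iter


def solve_poisson(n):
    """Solve -Laplace(u) = f on the unit square with u = 0 on the boundary,
    where f is chosen so that u = sin(pi x) sin(pi y) is the exact solution
    (an assumed manufactured test case). Returns (max error, CG iterations)."""
    h = 1.0 / (n + 1)
    exact = [math.sin(math.pi * (i + 1) * h) * math.sin(math.pi * (j + 1) * h)
             for j in range(n) for i in range(n)]
    b = [h * h * 2.0 * math.pi ** 2 * e for e in exact]
    u, iters = conjugate_gradient(b, n)
    err = max(abs(a - e) for a, e in zip(u, exact))
    return err, iters


if __name__ == "__main__":
    err, iters = solve_poisson(15)
    print("max error:", err, "CG iterations:", iters)
```

Each CG iteration consists of one stencil application, two dot products, and three vector updates, which are the memory-bound operations whose performance depends on where the vectors reside in memory.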
Finally, we study the performance of the KNL when used for the CICR code. The CICR code requires more demanding communication and is more computationally intensive than the Poisson problem. We demonstrate performance and scalability on a single KNL node with MPI only and hybrid MPI+OpenMP code. We also test scalability using multiple KNL nodes and carefully consider the number and placement of OpenMP threads relative to the number of MPI processes used with hybrid MPI+OpenMP code. The scalability for multiple KNL nodes is good for both MPI and MPI+OpenMP code. The balance of OpenMP threads to MPI processes influences performance for this problem. Overall, the KNL demonstrates significant performance benefit when used appropriately on various application problems.
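To make the balance between MPI processes and OpenMP threads concrete, the sketch below enumerates the ways a node's hardware threads can be split evenly into processes times threads per process. The function name is hypothetical, and the example values of 68 cores with 4 hardware threads per core are assumptions for illustration; the abstract itself does not state the core count of the KNL model used.

```python
def hybrid_layouts(cores, threads_per_core):
    """List every (mpi_processes, openmp_threads_per_process) pair whose
    product equals the total number of hardware threads on one node."""
    total = cores * threads_per_core
    return [(p, total // p) for p in range(1, total + 1) if total % p == 0]


if __name__ == "__main__":
    # Assumed example values, not taken from the abstract.
    for procs, threads in hybrid_layouts(68, 4):
        print(procs, "MPI processes x", threads, "OpenMP threads")
```

The two extremes of the list correspond to a single process with all threads under OpenMP and to MPI-only execution with one thread per process; the hybrid runs explore the combinations in between.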