Hardware Counter-Based Performance Analysis, Modelling, and Improvement through Thread Migration in NUMA Systems
While for most of the history of computing, programming and execution were sequential, one instruction following another with few exceptions, in the last ten years this paradigm has completely changed. Modern computer systems are based on multicore processors, which means parallel programming and execution are now dominant, whereas before they were mostly the realm of high performance computing. This change has brought many challenges, not all of them solved. Parallel programming is inherently more difficult than sequential programming. If, during the 20\textsuperscript{th} century, it could be taken for granted that anyone who needed parallel programming would at least have access to expert knowledge, this is no longer the case. As such, all approaches that make parallel programming more accessible are welcome. A parallel computer system in which all components are equal and have the same performance is simpler to program. Unfortunately, such systems do not allow for the highest peak performance or for enough flexibility to carry out a variety of tasks. This is why, nowadays, many computers are heterogeneous in nature, mixing different architectural approaches in the same system. But even apparently simpler computers, such as shared memory systems with multiple processors, suffer from imbalances that negatively affect performance. These systems are prevalent in internet servers and workstations, and are the foundation of high performance supercomputers. In this work, a series of tools, applications and models designed to help the programming of these systems, and even to improve their performance without direct user intervention, is presented. The performance monitoring facilities of modern processors allow users to gain insight into the execution of their applications. Nevertheless, the performance information processors provide cannot be used to guide program improvements in a straightforward way.
This information may be complex and, by virtue of its detail, extensive. The tools and models presented in this work take advantage of these facilities to offer users a clear view of the behaviour of their codes and to tackle actual issues that affect performance. With the experience acquired developing these tools and models, an application to automatically improve the performance of parallel applications or mixed workloads was implemented and tested. First, a set of memory access analysis tools, designed to allow users to understand the data locality and data placement of their codes, is presented, and its usefulness is shown in a series of cases. Second, a new performance model, based on the Berkeley Roofline Model, is introduced, alongside a set of tools that simplify the task of obtaining it. A series of applications of the model are presented to highlight its usefulness. Finally, a tool to improve performance in parallel computers is presented. This tool automatically places and migrates threads during execution, using different strategies. These proposals are detailed and tested, clearly showing the importance of thread and data placement for performance.
Keywords: performance, NUMA, hardware counters, thread migration, Berkeley Roofline Model, 3DyRM