Multicore systems present on-board memory hierarchies and communication networks that influence performance when executing shared-memory parallel codes.
Characterising this influence is complex, and understanding the effect of particular hardware configurations on different codes is of paramount importance.
In this context, precise monitoring information can be extracted from hardware counters (HC) at runtime to characterise the behaviour of each thread of a parallel code. This technology provides high accuracy with low overhead. In particular, we introduce a new tool that uses hardware counters to obtain four parameters: the number of floating-point operations per second, the operational intensity, the latency of memory accesses, and the energy consumption. Note that the first two parameters define the well-known Roofline Model, an intuitive visual performance model used to provide performance estimates for applications running on multicore architectures. The third parameter quantifies data locality, and the fourth is related to the load of each node of the system. All this information is accessed through the perf_events interface provided by Linux, with the aid of the libpfm library. The monitoring information provided by the tool can be used to optimise execution efficiency in NUMA systems by balancing or scheduling workloads and by guiding thread and page migration strategies in order to increase locality and affinity.
These migrations are driven by optimisation strategies that rely on the runtime information provided by the hardware counters.
Overall, the profiling tool is launched from a terminal as a background process, does not require superuser permissions, and can lead to performance optimisation in multithreaded applications and power savings in NUMA systems.
Keywords: Roofline Model, Performance, Hardware Counters, PEBS, Energy Usage
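
As an illustration of how the first two parameters relate to the Roofline Model, the following is a minimal sketch and not the tool's implementation. It assumes that a floating-point operation count, a count of bytes moved to and from memory, and an elapsed time are already available as plain numbers for one thread. The function names and the peak values in the usage note are hypothetical.

```python
def thread_metrics(flop_count, bytes_moved, elapsed_seconds):
    """Derive the two Roofline parameters from raw per-thread counts.

    Returns (gflops, operational_intensity), where operational
    intensity is measured in floating-point operations per byte.
    All three inputs are assumed to come from counter readings.
    """
    if elapsed_seconds <= 0 or bytes_moved <= 0:
        raise ValueError("elapsed_seconds and bytes_moved must be positive")
    gflops = flop_count / elapsed_seconds / 1e9
    operational_intensity = flop_count / bytes_moved
    return gflops, operational_intensity


def attainable_gflops(operational_intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline bound: performance is capped either by the peak compute
    rate or by memory bandwidth multiplied by operational intensity."""
    if operational_intensity < 0:
        raise ValueError("operational_intensity must be non-negative")
    return min(peak_gflops, peak_bandwidth_gbs * operational_intensity)


def is_memory_bound(operational_intensity, peak_gflops, peak_bandwidth_gbs):
    """True when the thread sits under the sloped (bandwidth) part of the roof."""
    return peak_bandwidth_gbs * operational_intensity < peak_gflops
```

For example, with a hypothetical peak of 100 GFLOPS and 40 GB/s of memory bandwidth, a thread that performs 2e9 floating-point operations while moving 8e9 bytes in one second has an operational intensity of 0.25 and is bounded by bandwidth rather than by compute.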