Using an extended Roofline Model to understand data and thread affinities on NUMA systems

Today’s microprocessors are multicore designs that feature a diverse set of compute cores and onboard memory subsystems connected by complex communication networks and protocols. Analysing the factors that affect performance in such complex systems is far from easy. It is nevertheless clear that increasing data locality and affinity is one of the main challenges in reducing data access latency. As the number of cores increases, the influence of this issue on the performance of parallel codes becomes more and more important. Models that characterise performance in such systems are therefore in high demand. This paper shows the use of an extension of the well-known Roofline Model adapted to the main features of the memory hierarchy present in most current multicore systems. The Roofline Model was also extended to show the dynamic evolution of the execution of a given code. To reduce the overhead of gathering the information needed to build this dynamic Roofline Model, the hardware counters present in most current microprocessors are used. To illustrate its use, two simple parallel vector operations, SAXPY and SDOT, were considered. Different access strides and initial locations of the vectors in memory modules were used to show the influence of different scenarios in terms of locality and affinity. The effect of thread migration was also considered. We conclude that the proposed Roofline Model is a useful tool to understand and characterise the behaviour of parallel codes executing on multicore systems.
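As a point of reference, the classic Roofline bound (not the extended model proposed here) can be sketched for the two kernels as follows. This is a minimal illustration under stated assumptions: the peak performance and the local and remote bandwidth figures are hypothetical placeholders, the function names are invented for this sketch, and the stride model simply assumes that every strided access drags in up to one full 64-byte cache line. Write-allocate traffic is ignored.

```python
# Minimal sketch of the classic Roofline bound for SAXPY and SDOT.
# All machine figures below are hypothetical placeholders, not measurements.

def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Classic Roofline: performance is capped by compute or by memory."""
    return min(peak_gflops, bandwidth_gbs * intensity)

def bytes_per_element(stride, elem_bytes=4, line_bytes=64):
    """Assumed traffic per accessed element of one vector.

    Unit stride uses every byte of a cache line; larger strides waste
    part of each line, up to fetching a whole line per element.
    """
    return min(stride * elem_bytes, line_bytes)

def saxpy_intensity(stride=1):
    # y[i] = a * x[i] + y[i]: 2 flops; read x, read y, write y.
    return 2.0 / (3 * bytes_per_element(stride))

def sdot_intensity(stride=1):
    # s += x[i] * y[i]: 2 flops; read x, read y.
    return 2.0 / (2 * bytes_per_element(stride))

PEAK_GFLOPS = 100.0    # hypothetical compute roof
LOCAL_BW_GBS = 40.0    # hypothetical bandwidth to the local memory module
REMOTE_BW_GBS = 20.0   # hypothetical bandwidth to a remote memory module

if __name__ == "__main__":
    for name, intensity_of in (("SAXPY", saxpy_intensity), ("SDOT", sdot_intensity)):
        for stride in (1, 4, 16):
            oi = intensity_of(stride)
            local = attainable_gflops(PEAK_GFLOPS, LOCAL_BW_GBS, oi)
            remote = attainable_gflops(PEAK_GFLOPS, REMOTE_BW_GBS, oi)
            print(f"{name} stride={stride:2d} OI={oi:.4f} flop/byte "
                  f"local={local:.2f} remote={remote:.2f} GFLOP/s")
```

Under these assumptions both kernels sit far to the left of the ridge point, so the bound is set entirely by the bandwidth of the memory module that holds the vectors, which is why placement, stride and thread migration are the natural parameters to vary.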