Thread migration techniques based on dynamic Roofline models and latency information

Current multicore systems have on-board memory hierarchies that influence their performance when they execute shared memory codes, so the question of how to support the shared memory model efficiently is of paramount importance. In this paper, hardware counters are used to extract, at runtime, performance information about the execution of shared memory codes on multicore systems. Data from the hardware counters are used to characterize the behaviour of a code in terms of the Roofline Model, extended with additional information about memory access latencies. We propose using this information to guide thread migration strategies that improve the efficiency of code execution by increasing locality and affinity. Different configurations of the SAXPY and SDOT kernels on multicore systems were used to validate the benefits of the proposed thread migration strategies. The results show that our strategy yields improvements of up to 25% in scenarios where locality and affinity are low. In addition, because the performance information is obtained from hardware counters, the overhead of the strategy is low.

keywords: 3DyRM, Hardware Counters, Performance
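
For readers unfamiliar with the Roofline Model mentioned in the abstract, the following minimal sketch shows the classic bound (attainable performance is the minimum of peak compute throughput and operational intensity times memory bandwidth) applied to single-precision SAXPY and SDOT. The peak and bandwidth figures, and the names `roofline_gflops`, `PEAK_GFLOPS` and `BANDWIDTH_GBS`, are hypothetical placeholders and are not taken from the paper; the sketch also omits the latency dimension that the paper adds to the model.

```python
def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Classic Roofline bound: min(peak compute, intensity * memory bandwidth).

    intensity is the operational intensity in FLOP per byte of memory traffic.
    """
    return min(peak_gflops, intensity * bandwidth_gbs)


# Operational intensity per element, single precision (4-byte floats),
# assuming no cache reuse:
#   SAXPY (y = a*x + y): 2 FLOP, reads x and y, writes y -> 12 bytes
#   SDOT  (s += x*y):    2 FLOP, reads x and y           ->  8 bytes
SAXPY_INTENSITY = 2 / 12
SDOT_INTENSITY = 2 / 8

# Hypothetical machine parameters, chosen only for illustration.
PEAK_GFLOPS = 100.0
BANDWIDTH_GBS = 40.0

saxpy_bound = roofline_gflops(SAXPY_INTENSITY, PEAK_GFLOPS, BANDWIDTH_GBS)
sdot_bound = roofline_gflops(SDOT_INTENSITY, PEAK_GFLOPS, BANDWIDTH_GBS)

print(f"SAXPY bound: {saxpy_bound:.2f} GFLOP/s")
print(f"SDOT bound:  {sdot_bound:.2f} GFLOP/s")
```

With these placeholder parameters both kernels fall on the bandwidth-limited part of the roof, which is why their performance is sensitive to where threads run relative to their data.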