
Thermal-aware Task Management for High Performance Systems


Thermal overheating is a serious concern in modern supercomputing systems. Elevated temperatures reduce the reliability and lifetime of the underlying hardware and increase its power consumption. Previous studies on mitigating thermal hotspots at the hardware and run-time system levels have typically used approaches that trade off performance for reduced operating temperatures. In this thesis, we first study a two-node system and show that physical attributes cause an uneven temperature distribution. We then develop a model that characterizes the thermal behavior of the two-node system using machine learning methods. We propose to improve application placement by incorporating thermal awareness into the decision-making process. Specifically, our system predicts the thermal condition of the system based on the application mapping and uses these predictions to mitigate thermal hotspots without any performance loss. We provide two versions of our prediction mechanism. On a two-node configuration, these models achieve 72.5% and 78.8% success rates, respectively, in predicting the better task assignment scheme. In other words, the scheduling decisions of our models result in a task placement that has a lower maximum average temperature. Overall, the more aggressive scheme reduces the average peak temperature by up to 11.9°C (2.3°C on average) without any performance degradation. We then move our focus to a dynamic setup with more nodes in the system, where the overhead of task migration must be considered. We also improve our framework to make predictions without prior profiling of the applications. When applied to a two-node system that suffers from thermal throttling, our method improves average performance by up to 25.58% (4.27% on average). When applied to a four-node system with aggressive and costly cooling, our task placement reduces power consumption: the framework reduces overall system power by up to 3.7% and cooling power by up to 33.7% without sacrificing performance.
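
The placement idea in the abstract (predict the thermal outcome of each candidate application mapping, then pick the one with the lowest predicted peak) can be illustrated with a minimal sketch. This is not the thesis's implementation: the node names, the coefficients, and the linear `predict_temp` function are hypothetical stand-ins for the learned models the abstract mentions, and the sketch assumes one task per node with no migration cost.

```python
from itertools import permutations

# Hypothetical per-node thermal parameters (illustrative values, not from the thesis).
# Each node is modelled as an idle temperature plus a heating rate per watt of load,
# so that the two nodes heat unevenly under the same task.
NODES = {
    "node0": {"idle_c": 40.0, "c_per_watt": 0.25},
    "node1": {"idle_c": 45.0, "c_per_watt": 0.50},
}


def predict_temp(node, power_w):
    """Stand-in predictor: temperature is linear in task power (an assumption)."""
    params = NODES[node]
    return params["idle_c"] + params["c_per_watt"] * power_w


def best_placement(task_power):
    """Try every one-task-per-node mapping and return (mapping, predicted_peak)
    for the mapping with the lowest predicted peak temperature.
    Returns None if there are more tasks than nodes."""
    tasks = list(task_power)
    best = None
    for nodes in permutations(NODES, len(tasks)):
        mapping = dict(zip(tasks, nodes))
        peak = max(predict_temp(n, task_power[t]) for t, n in mapping.items())
        if best is None or peak < best[1]:
            best = (mapping, peak)
    return best
```

Calling `best_placement({"hot": 100.0, "cool": 40.0})` compares the two possible mappings and returns the one that places the power-hungry task on the node that heats more slowly. Exhaustive enumeration is only practical for very small systems; it is used here purely to keep the sketch short.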

Last modified: 04/02/2018