
Durham e-Theses

Task-Based Parallelism for General Purpose Graphics Processing Units and Hybrid Shared-Distributed Memory Systems.

CHALK, AIDAN, BERNARD, GERARD (2017) Task-Based Parallelism for General Purpose Graphics Processing Units and Hybrid Shared-Distributed Memory Systems. Doctoral thesis, Durham University.

PDF - Accepted Version


Modern computers can no longer rely on increasing CPU clock speed to improve their performance, because further increasing the clock speed of single-CPU machines would make them too difficult to cool, or would make cooling them require too much power. Hardware manufacturers must now use parallelism to drive performance to the levels expected by Moore's Law. More recently, High Performance Computers (HPCs) have adopted heterogeneous architectures, i.e. having multiple types of computing hardware (such as CPUs and GPUs) on a single node. These architectures offer the opportunity to extract performance from non-CPU hardware while still providing a general-purpose platform for older codes.

In this thesis we investigate Task-Based Parallelism, a shared-memory paradigm for parallel computing. Task-Based Parallelism requires the programmer to divide the work into chunks (known as tasks) and to describe the data dependencies between those tasks. The task-based scheduler then automatically distributes the tasks amongst the threads. We examine how Task-Based Parallelism can be used with GPUs and with hybrid shared-distributed memory systems; in particular, we examine how data transfer, either from the host to the GPU or between separate nodes, can be incorporated into a task-based framework. We also examine how the task graph can be used to load balance the computation between multiple nodes or GPUs.
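The model described above (tasks, dependencies between them, and a scheduler that hands ready tasks to threads) can be illustrated with a minimal Python sketch. It assumes a simple dependency-counter scheme: each task counts its unfinished prerequisites and becomes ready once that count reaches zero. The names `Task`, `depends_on` and `run_tasks` are hypothetical and are not the interface of the scheduler developed in the thesis. Because of CPython's global interpreter lock, the sketch demonstrates the scheduling logic rather than real parallel speed-up.

```python
import threading
from collections import deque


class Task:
    """A unit of work plus bookkeeping for its dependencies (hypothetical API)."""

    def __init__(self, name, func):
        self.name = name
        self.func = func      # callable with no arguments: the task's work
        self.unlocks = []     # tasks that may only run after this one finishes
        self.wait = 0         # number of prerequisites that have not finished yet


def depends_on(task, prerequisite):
    """Record that `task` must not start until `prerequisite` has finished."""
    prerequisite.unlocks.append(task)
    task.wait += 1


def run_tasks(tasks, num_threads=4):
    """Run every task on a pool of worker threads, respecting dependencies.

    Returns the task names in completion order. Assumes the dependency graph
    is acyclic; a cycle would leave the workers waiting forever.
    """
    cond = threading.Condition()
    ready = deque(t for t in tasks if t.wait == 0)  # tasks with no pending prerequisites
    remaining = len(tasks)
    finished = []

    def worker():
        nonlocal remaining
        while True:
            with cond:
                # Sleep until a task becomes ready or all work is done.
                while not ready and remaining > 0:
                    cond.wait()
                if remaining == 0:
                    return
                task = ready.popleft()
            task.func()  # do the actual work outside the lock
            with cond:
                finished.append(task.name)
                remaining -= 1
                # Release dependants whose last prerequisite just finished.
                for dependant in task.unlocks:
                    dependant.wait -= 1
                    if dependant.wait == 0:
                        ready.append(dependant)
                cond.notify_all()

    workers = [threading.Thread(target=worker) for _ in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return finished


if __name__ == "__main__":
    # A diamond-shaped graph: A -> {B, C} -> D.
    a, b, c, d = (Task(n, lambda: None) for n in "ABCD")
    depends_on(b, a)
    depends_on(c, a)
    depends_on(d, b)
    depends_on(d, c)
    print(run_tasks([a, b, c, d]))  # A first, D last; B and C in either order
```

In the diamond example, B and C both wait on A and D waits on both, so any valid execution runs A first and D last while B and C may run in either order or concurrently. Counters keep the per-task bookkeeping cheap when a task completes; note that this sketch does not detect cycles.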

We test our task-based methods with Molecular Dynamics, a tiled QR decomposition, and a new task-based Barnes-Hut algorithm. These problems have different dependency structures, which test the ability of the scheduler to handle a variety of types of computation. The results for these test cases show improved performance when asynchronous data transfer to and from the GPU is used, and show reasonable parallel efficiency over a small number of MPI ranks.
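To give a concrete picture of one of these dependency structures, the following sketch builds the task graph of a tiled QR decomposition on an nt by nt grid of tiles. It assumes the standard four-kernel tiled QR formulation (labels GEQRT, ORMQR, TSQRT and TSMQR, following the naming common in tiled-QR libraries) and infers dependencies from each task's tile reads and writes in sequential program order. No numerical kernels are implemented, the function name `tiled_qr_task_graph` is hypothetical, and this is not necessarily how the thesis specifies its QR dependencies.

```python
from collections import defaultdict


def tiled_qr_task_graph(nt):
    """Build the tasks and dependencies of a tiled QR on an nt x nt tile grid.

    Dependencies are inferred from tile accesses in sequential order: a read
    waits for the tile's last writer (read-after-write), and a write waits for
    the last writer and every reader since that write (write-after-read).
    """
    tasks = []  # (name, tiles read, tiles written), in sequential order
    for k in range(nt):
        tasks.append((f"GEQRT({k},{k})", [], [(k, k)]))
        for j in range(k + 1, nt):
            tasks.append((f"ORMQR({k},{j})", [(k, k)], [(k, j)]))
        for i in range(k + 1, nt):
            tasks.append((f"TSQRT({i},{k})", [], [(k, k), (i, k)]))
            for j in range(k + 1, nt):
                tasks.append((f"TSMQR({i},{j},{k})", [(i, k)], [(k, j), (i, j)]))

    last_writer = {}             # tile -> index of the task that last wrote it
    readers = defaultdict(list)  # tile -> tasks that read it since the last write
    deps = defaultdict(set)      # task index -> indices of tasks it must wait for
    for t, (_, reads, writes) in enumerate(tasks):
        for tile in reads:
            if tile in last_writer:
                deps[t].add(last_writer[tile])
            readers[tile].append(t)
        for tile in writes:
            if tile in last_writer:
                deps[t].add(last_writer[tile])
            deps[t].update(r for r in readers[tile] if r != t)
            last_writer[tile] = t
            readers[tile] = []
    return [name for name, _, _ in tasks], deps


if __name__ == "__main__":
    names, deps = tiled_qr_task_graph(2)
    for t, name in enumerate(names):
        print(name, "waits for", sorted(names[p] for p in deps[t]))
```

For nt = 2 this yields five tasks, and, for example, TSMQR(1,1,0) waits for ORMQR(0,1) and TSQRT(1,0). Treating a whole tile as the unit of access is conservative: TSQRT(k+1,k) must wait for every ORMQR(k,j) because both touch tile (k,k), even though they use different parts of it. The number of tasks grows as 1² + 2² + ... + nt², so even modest tile grids produce enough tasks to keep many threads busy.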

Item Type: Thesis (Doctoral)
Award: Doctor of Philosophy
Faculty and Department: Faculty of Science > Engineering and Computing Science, School of (2008-2017)
Thesis Date: 2017
Copyright: Copyright of this thesis is held by the author
Deposited On: 05 Sep 2017 12:46
