Graphics memory is the GPU's version of host memory. On the downside, software implementations usually incur a performance penalty compared to hardware. Modern GPU architectures have a memory hierarchy that must be explicitly programmed to obtain good performance.
However, the performance and energy overhead of Kilo TM may deter GPU vendors from incorporating it into future designs. Accelerating GPU hardware transactional memory with snapshot. For a set of TM-enhanced GPU applications, Kilo TM captures 59% of the performance of fine-grained locking and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0. Matt software transactional memory, Herlihy's hardware accelerator concept. Modern APUs implement CPU-GPU platform atomics for simple data types.
Efficient transactional-memory-based implementation of morph. Computing without processors, August 2011, Communications. Hardware transactional memory for GPU architectures, Wilson W. Transactional Synchronization Extensions (TSX), also called Transactional Synchronization Extensions New Instructions (TSX-NI), is an extension to the x86 instruction set architecture (ISA) that adds hardware transactional memory support, speeding up execution of multithreaded software through lock elision. Many TM systems have been proposed in the last two decades for multicore architectures [7], implemented in hardware, in software, or as a combination of both. Software transactional memory for GPU architectures, IEEE Xplore.
Secondly, the conflict detection mechanism is based on unified read/write signatures. Hardware support for local memory transactions on GPU. Toward a software transactional memory for heterogeneous CPU. It is only accessible by the GPU and not by the CPU. Next-generation CUDA architecture, code-named Fermi.
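The snippet above mentions conflict detection based on unified read/write signatures without describing them further. As a rough illustration only, the following C++ sketch assumes a signature is a small hashed bit vector (Bloom-filter style) in which reads and writes set bits in the same vector. The class name `UnifiedSignature`, the 64-bit size, and the word-granular hash are hypothetical choices for this sketch, not details taken from any of the cited systems.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical unified signature: reads and writes share one bit vector,
// so the signature cannot tell a read-read overlap from a real conflict.
class UnifiedSignature {
public:
    // Record any access (read or write) to the given address.
    void record_access(std::uintptr_t address) { bits_.set(slot(address)); }

    // True if the two signatures share at least one bit. This is
    // conservative: hash collisions can report conflicts that are not real.
    bool may_conflict_with(const UnifiedSignature& other) const {
        return (bits_ & other.bits_).any();
    }

private:
    static constexpr std::size_t kBits = 64;  // assumed size for the sketch

    // Assumed hash: drop the two low bits (4-byte words), then wrap.
    static std::size_t slot(std::uintptr_t address) {
        return static_cast<std::size_t>(address >> 2) % kBits;
    }

    std::bitset<kBits> bits_;
};

// Demonstrates a false positive: 0x1000 and 0x1100 are different words
// but hash to the same slot under the assumed hash function.
bool demo_false_positive() {
    UnifiedSignature first, second;
    first.record_access(0x1000);
    second.record_access(0x1100);
    assert(first.may_conflict_with(first));  // a signature overlaps itself
    return first.may_conflict_with(second);
}
```

The trade-off in a design like this is that membership tests are cheap and the metadata has a fixed size, at the cost of occasional spurious aborts when unrelated addresses collide.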
Hardware support for local memory transactions on GPU architectures, Alejandro Villegas, Angeles Navarro. Software transactional memory provides transactional memory semantics in a software runtime library or the programming language, and requires minimal hardware support, typically an atomic compare-and-swap operation or equivalent. Nov 11, 20: Compiler, Architecture and Tools Conference, program abstracts. Yunlong Xu, Rui Wang, Nilanjan Goswami, Tao Li and Depei Qian. Programming GPUs is challenging for applications with irregular fine-grained communication between threads.
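To make the compare-and-swap requirement mentioned above concrete, here is a minimal C++ sketch of an optimistic read-compute-commit loop on a single word. It is not a full STM: a real system tracks read and write sets across many locations. The names `optimistic_update` and `run_conflict_demo` are hypothetical, and the conflicting writer is simulated inside the update function so the behaviour is deterministic.

```cpp
#include <atomic>
#include <cassert>

struct DemoResult {
    int final_value;
    int retries;
};

// Hypothetical helper: read a snapshot, compute a new value, and commit it
// with compare-and-swap. If the cell changed in between, retry on the
// fresh value. Returns the number of failed commit attempts.
template <typename UpdateFn>
int optimistic_update(std::atomic<int>& cell, UpdateFn update) {
    int retries = 0;
    int seen = cell.load();
    for (;;) {
        int desired = update(seen);
        // On failure, compare_exchange_strong writes the current value
        // of the cell back into `seen`, so the next pass uses fresh data.
        if (cell.compare_exchange_strong(seen, desired)) {
            return retries;
        }
        ++retries;
    }
}

// Simulates one conflicting write during the first attempt.
DemoResult run_conflict_demo() {
    std::atomic<int> cell{10};
    bool interfered = false;
    int retries = optimistic_update(cell, [&](int value) {
        if (!interfered) {
            interfered = true;
            cell.store(100);  // stands in for another thread's write
        }
        return value + 1;
    });
    assert(interfered);
    return {cell.load(), retries};
}
```

In this demo the first commit fails because the cell no longer holds the snapshot value, and the second attempt succeeds on the updated value.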
With TM, the programmer does not need to write code with locks to ensure mutual exclusion. Towards a software transactional memory for graphics processors. Energy efficiency of software transactional memory in a.
And now, having read about Intel's HW TM, I have many curious questions. On the GPU, main memory is accessed via a cache hierarchy where, in most cases, the L1 data cache is not coherent. Or would these kinds of building blocks be just what we want? To make applications with dynamic data sharing among threads benefit from GPU acceleration, we propose a novel software transactional memory system for GPU architectures (GPU-STM). Advanced computer architecture and systems detailed. The unconverted parts of the Java program could use up the CPU multicore resources with its multithreaded workload. Evaluation of AMD's Advanced Synchronization Facility within a complete transactional memory stack. Performance evaluation of Intel Transactional Synchronization Extensions for high-performance computing. Software transactional memory. This dissertation aims to reduce the burden on GPU software developers with two major enhancements to GPU architectures. First, thread block compaction (TBC) is a microarchitecture innovation that reduces the performance penalty caused by branch divergence in GPU applications. Transactional memory for heterogeneous systems, arXiv. Modern GPUs have shown promising results in accelerating computation-intensive and numerical workloads with limited dynamic data sharing. Data layout transformation for enhancing locality on NUCA chip multiprocessors. To improve GPU programmability and thus extend GPU usage to a wider range of applications, the authors propose to enable transactional memory (TM) on GPUs via Kilo TM, a novel hardware TM system that scales to thousands of concurrent transactions.
On the hardware side, Kilo TM was proposed in 2011. An STM system that supports per-thread transactions faces new challenges. TM simplifies software development for parallel architectures by providing the programmer with the illusion that code blocks, called transactions, execute atomically and in isolation. However, ensuring atomicity for complex data types is a task delegated to programmers. Abbreviations: TM, transactional memory; STM, software transactional memory; HTM, hardware transactional memory; HyTM, hybrid transactional memory; TSX, Intel's Transactional Synchronization Extensions; HLE, hardware lock elision; RTM, restricted transactional memory; GPU, graphics processing unit; GPGPU, general-purpose computation on graphics processing units; CPU, central processing unit. An efficient software transactional memory using commit-time invalidation. In addition, it ensures forward progress through an automatic serialization mechanism. Both hardware and software transactional memories have been proposed for GPU architectures.
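The text above mentions an automatic serialization mechanism that ensures forward progress but gives no details. One common pattern, shown here purely as an assumed sketch and not as the mechanism of any cited system, is to retry a transaction optimistically a bounded number of times and then fall back to running it under a global lock. The names `run_transaction` and `Outcome` and the attempt limit are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

struct Outcome {
    int optimistic_attempts;  // how many optimistic tries were made
    bool serialized;          // true if the serial fallback was used
};

// Hypothetical fallback policy: `try_optimistic` returns true when the
// transaction committed and false when it aborted. After `max_attempts`
// aborts, the work runs once under `serial_lock`, which cannot abort.
// A real system would also have to keep optimistic transactions from
// interfering with the serialized one; that part is omitted here.
Outcome run_transaction(const std::function<bool()>& try_optimistic,
                        const std::function<void()>& run_serial,
                        int max_attempts,
                        std::mutex& serial_lock) {
    assert(max_attempts >= 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (try_optimistic()) {
            return {attempt, false};
        }
    }
    std::lock_guard<std::mutex> guard(serial_lock);
    run_serial();
    return {max_attempts, true};
}
```

The attempt limit bounds how long a transaction can keep aborting before it is serialized, which trades some parallelism for a guarantee that the work eventually completes.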
To reduce this effort, prior work has proposed supporting transactional memory on GPU architectures. Improvements in hardware transactional memory for GPU. We propose GPU-LocalTM, a hardware transactional memory (TM), as an alternative to data locking mechanisms in local memory. To evaluate TLLL, we use it to implement six widely used programs, and compare it with the state-of-the-art ad hoc GPU synchronization, GPU software transactional memory (STM), and CPU hardware. His research interests include parallel programming, software transactional memory, and distributed architectures. To appear in the 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2014. If this mechanism is required very often, it may harm performance. System-wide data consistency issues can be handled by a GPU-friendly design of software transactional memory. GPU-LocalTM allocates transactional metadata in the existing memory resources, minimizing the storage requirements for TM support. The major challenges include ensuring good scalability with respect to the massive multithreading of GPUs, and preventing livelocks caused by the SIMT execution paradigm of GPUs. Aamodt, University of British Columbia, Canada.
Thesis, Department of Electrical and Computer Engineering, University of Colorado. A CUDA program starts on a CPU and then launches parallel compute kernels onto a GPU. Each kernel launch dispatches a hierarchy of threads (a grid of blocks). While transactional memory for processors with hundreds of cores is likely to require hardware support, software implementations will be required for backward compatibility with current and near. Transactional memory (TM) is an optimistic approach to achieve this goal. CPU and GPU architectures, memory subsystem design, hardware/software co-design.
Exploration of lock-based software transactional memory, Justin Gottschlich. The ability of the GPU to handle considerably more threads than the CPU has recently led to increased interest in utilising transactional memory for GPUs. Today most people who make effective use of GPUs undergo a steep learning curve and are forced to program close to the machine using special GPU programming languages. GPU computing architecture for irregular parallelism, UBC. Rafael Ubal, David Kaeli, Department of Electrical and Computer Engineering. A question that arises in our smart highways use case is this. One hardware proposal, Kilo TM, can scale to thousands of concurrent transactions.
Transactional Synchronization Extensions, Wikipedia. Sadayappan, Yongjian Chen, Haibo Lin and Tin-Fook Ngai. Heterogeneous accelerated processing units (APUs) integrate a multicore CPU and a GPU on the same chip.
Ennals, Efficient Software Transactional Memory, technical report, Intel Research Cambridge, UK, 2005. Hardware support for scratchpad memory transactions on GPU. Scheduling techniques for GPU architectures with processing-in-memory capabilities, Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kay.