Computing Reviews

Coordinated CTA combination and bandwidth partitioning for GPU concurrent kernel execution
Lin Z., Dai H., Mantor M., Zhou H. ACM Transactions on Architecture and Code Optimization 16(3):1-27, 2019. Type: Article
Date Reviewed: 10/25/19

High throughput of processors (or, more broadly, computers) has always been a key concern in computer systems architecture. For many years, the dominant way to increase throughput was to raise the processor clock frequency, but around 15 years ago the field shifted to the multicore (multiprocessor) paradigm. Processors with 16 or more cores are now available to almost everyone. Significant advances in technology allow many central processing unit (CPU) cores, or many concurrently executing graphics processing unit (GPU) kernels, on one chip, and many problems previously in the domain of (spatially) distributed processing are now handled on a single device.

In this paper, the authors focus on GPU kernels, and observe the following:

  • Bandwidth-intensive kernels can cause bandwidth oversubscription, which leads to high queueing delays and significantly increases memory latency. This performance degradation, in turn, affects co-running kernels.
  • In the GPU memory system, the “cache L2-to-L1 network-on-chip (NoC) bandwidth can be [a] critical resource.”
  • Algorithms managing the cooperative thread array (CTA) cannot fully address the issues.
  • “Different kernels favor different GPU resources.”

These issues are described in detail in sections 2 through 4 of the article.

To address these issues, in section 5 the authors propose a coordinated approach that combines bandwidth partitioning with CTA combination. Its main feature is “dynamically detect[ing] co-running kernels [that are] latency sensitive or bandwidth intensive.” Because both the L2-to-L1 NoC bandwidth and the DRAM bandwidth can be critical resources, both are partitioned, together with “selecting proper CTA combinations.” This is done by the proposed algorithm. As a result, the solution allocates more CTA resources to latency-sensitive kernels and more NoC/DRAM bandwidth to NoC-/DRAM-intensive kernels.
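The intuition behind this coordinated allocation can be illustrated with a minimal sketch. This is not the authors’ algorithm; the classification thresholds, metric names, and allocation numbers below are hypothetical assumptions chosen only to show the biasing idea (more CTA slots to the latency-sensitive kernel, more bandwidth share to the bandwidth-intensive one):

```python
# Illustrative sketch only -- not the authors' implementation.
# Thresholds, metric names, and allocation values are hypothetical.

def classify(kernel):
    """Classify a kernel from profiled utilization metrics (assumed thresholds)."""
    if kernel["noc_util"] > 0.8:
        return "noc-intensive"
    if kernel["dram_util"] > 0.8:
        return "dram-intensive"
    return "latency-sensitive"

def partition(kernel_a, kernel_b, total_ctas=32):
    """Split CTA slots and a normalized bandwidth share between two kernels."""
    cls_a, cls_b = classify(kernel_a), classify(kernel_b)
    if cls_a == "latency-sensitive" and cls_b != "latency-sensitive":
        # More CTA slots for the latency-sensitive kernel A,
        # more NoC/DRAM bandwidth share for the bandwidth-intensive kernel B.
        return {"ctas": {"a": 24, "b": 8}, "bandwidth": {"a": 0.25, "b": 0.75}}
    if cls_b == "latency-sensitive" and cls_a != "latency-sensitive":
        return {"ctas": {"a": 8, "b": 24}, "bandwidth": {"a": 0.75, "b": 0.25}}
    # Same class: split evenly.
    half = total_ctas // 2
    return {"ctas": {"a": half, "b": half}, "bandwidth": {"a": 0.5, "b": 0.5}}

# Example: a latency-sensitive kernel paired with a DRAM-intensive one.
a = {"noc_util": 0.1, "dram_util": 0.2}
b = {"noc_util": 0.3, "dram_util": 0.9}
print(partition(a, b))
```

In the actual hardware proposal, detection is dynamic and the CTA combination is chosen at run time rather than from fixed constants as above.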

Section 7 evaluates the proposed approach. The results show that it “improves the average harmonic speed-up,” even when compared against the best possible CTA combinations obtained from an exhaustive search.

This paper is recommended to readers interested in computer architecture, especially those in the single instruction, multiple data (SIMD) field.

Reviewer:  Dominik Strzalka Review #: CR146744 (2002-0030)
