Computing Reviews
Coordinated CTA combination and bandwidth partitioning for GPU concurrent kernel execution
Lin Z., Dai H., Mantor M., Zhou H. ACM Transactions on Architecture and Code Optimization 16(3): 1-27, 2019. Type: Article
Date Reviewed: Oct 25, 2019

High processor (or, more broadly, computer) throughput has always been a key concern in computer systems architecture. For many years, the dominant way to increase throughput was to raise the processor clock frequency, but around 15 years ago the field shifted to a multicore (multiprocessor) paradigm. Processors with 16 or more cores are now available to almost everyone. Advances in process technology allow many central processing unit (CPU) cores, or many concurrently executing graphics processing unit (GPU) kernels, to share a single chip, and problems that were once the domain of (spatially) distributed processing are now commonplace.

In this paper, the authors focus on GPU kernels and observe the following:

  • Bandwidth-intensive kernels can oversubscribe memory bandwidth, which leads to long queueing delays and significantly increased memory latency; this degradation, in turn, hurts the other co-running kernels (see the queueing sketch below).
  • In the GPU memory system, the “cache L2-to-L1 network-on-chip (NoC) bandwidth can be [a] critical resource.”
  • Algorithms that manage cooperative thread array (CTA) allocation alone cannot fully address these issues.
  • “Different kernels favor different GPU resources.”

These issues are described in detail in sections 2 through 4 of the article.
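
To see why oversubscription hurts latency so sharply, a simple queueing approximation is illustrative (this is general queueing theory, not a formula from the paper, and it assumes roughly M/M/1-like behavior at the memory controller): the mean queueing delay of a memory request grows without bound as the offered load approaches the available bandwidth,

    W_q = \frac{\rho}{1-\rho}\,\bar{s}, \qquad \rho = \frac{\text{offered bandwidth}}{\text{available bandwidth}},

where \bar{s} is the mean service time of a request. Even moderate oversubscription (\rho approaching 1) therefore inflates memory latency far beyond the raw DRAM access time, which is exactly the degradation the co-running kernels observe.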

Taking these issues into account, in section 5 the authors propose a coordinated approach that combines CTA combination with bandwidth partitioning. Its main feature is “dynamically detect[ing] co-running kernels [that are] latency sensitive or bandwidth intensive.” Because both the L2-to-L1 NoC bandwidth and the DRAM bandwidth can be critical resources, both are partitioned, together with “selecting proper CTA combinations,” using the proposed algorithm. As a result, the solution allocates more CTA resources to latency-sensitive kernels and more NoC/DRAM bandwidth to NoC-/DRAM-intensive kernels.
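
The paper's mechanism is implemented in hardware, but the coordination idea can be sketched in a few lines of Python. The sketch below is only illustrative and assumes hypothetical profiling counters (noc_util, dram_util), an arbitrary 0.6 classification threshold, and a 2:1 CTA-slot bias; none of these names or values come from the paper.

    # Illustrative sketch: classify each co-running kernel from profiled
    # utilization counters, then bias CTA slots toward latency-sensitive
    # kernels (bandwidth-intensive kernels instead get larger bandwidth shares).
    from dataclasses import dataclass

    @dataclass
    class KernelProfile:
        name: str
        noc_util: float   # assumed L2-to-L1 NoC utilization counter (0..1)
        dram_util: float  # assumed DRAM bandwidth utilization counter (0..1)

    def classify(k, threshold=0.6):
        """Label a kernel by the resource it stresses most (threshold is illustrative)."""
        if k.noc_util >= threshold and k.noc_util >= k.dram_util:
            return "noc-intensive"
        if k.dram_util >= threshold:
            return "dram-intensive"
        return "latency-sensitive"

    def partition(kernels, total_cta_slots=32):
        """Split CTA slots with a 2:1 bias toward latency-sensitive kernels."""
        labels = {k.name: classify(k) for k in kernels}
        weights = {k.name: 2 if labels[k.name] == "latency-sensitive" else 1
                   for k in kernels}
        total_weight = sum(weights.values())
        return {k.name: {"label": labels[k.name],
                         "cta_slots": total_cta_slots * weights[k.name] // total_weight}
                for k in kernels}

    # Example: a NoC-intensive kernel co-running with a latency-sensitive one.
    print(partition([KernelProfile("K1", 0.8, 0.3), KernelProfile("K2", 0.2, 0.1)]))

In the paper itself, detection is dynamic and the partitioning applies directly to the L2-to-L1 NoC and DRAM bandwidth, not to a software model as in this sketch.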

Section 7 provides an evaluation of the proposed approach. The results show that the authors’ approach “improves the average harmonic speed-up,” even when compared with the best possible CTA combinations obtained from an exhaustive search.
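
The review does not restate the metric, but the harmonic speed-up conventionally used for concurrent kernel (and multiprogram) evaluation is the harmonic mean of each kernel's performance normalized to running it alone,

    HS = \frac{n}{\sum_{i=1}^{n} \mathit{IPC}_i^{\text{alone}} / \mathit{IPC}_i^{\text{shared}}},

which rewards both high aggregate throughput and fairness among the n co-running kernels.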

This paper is recommended to readers interested in computer architecture, especially those in the single instruction, multiple data (SIMD) field.

Reviewer: Dominik Strzalka | Review #: CR146744 (2002-0030)
Categories: General (C.0); Architectures (H.5.4 ...); Parallel (B.2.1 ...)
Other reviews under "General":
  • Structured computer organization (3rd ed.), Tanenbaum A., Prentice-Hall, Inc., Upper Saddle River, NJ, 1989. Type: Book (9780138546625). Date: Oct 1 1991
  • Principles of computer systems, Karam G., Bryant J., Prentice-Hall, Inc., Upper Saddle River, NJ, 1992. Type: Book (9780131594685). Date: Sep 1 1992
  • Computer organization, Scragg G., McGraw-Hill, Inc., New York, NY, 1992. Type: Book (9780070558434). Date: May 1 1994
