Kernelet (2014-2016)

Publications

Jianlong Zhong, Bingsheng He. Kernelet: High-Throughput GPU Kernel Execution with Dynamic Slicing and Scheduling. IEEE Transactions on Parallel and Distributed Systems, vol.25, no.6, pp.1522-1532, June 2014.

Mochi Xue, Kun Tian, Yaozu Dong, Jiajun Wang, Zhengwei Qi, Bingsheng He, Haibing Guan. gScale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics Memory Space. USENIX Annual Technical Conference (ATC) 2016.

Abstract

Graphics processors, or GPUs, have recently been widely used as accelerators in shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total ownership cost. Despite recently improved runtime support for concurrent GPU kernel executions, the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system to improve the throughput of concurrent kernel executions on the GPU. Kernelet embraces transparent memory management and PCI-e data transfer techniques, and dynamic slicing and scheduling techniques for kernel executions. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other slices for high GPU utilization. We develop a novel Markov chain-based performance model to guide the scheduling decision. Our experimental results demonstrate up to 31 percent and 23 percent performance improvement on NVIDIA Tesla C2050 and GTX680 GPUs, respectively.
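The slicing idea in the abstract can be illustrated with a small sketch. It is not Kernelet's implementation: the function names `slice_kernel` and `interleave` are hypothetical, a kernel is modeled only as a count of thread blocks, and a plain round-robin order stands in for the Markov chain-based scheduling decision described in the paper.

```python
from itertools import zip_longest


def slice_kernel(name, total_blocks, slice_size):
    """Split a kernel's thread-block range into slices.

    Each slice is a (kernel name, first block index, block count) tuple.
    The last slice may be smaller than slice_size.
    """
    if total_blocks < 0 or slice_size <= 0:
        raise ValueError("total_blocks must be >= 0 and slice_size > 0")
    return [
        (name, offset, min(slice_size, total_blocks - offset))
        for offset in range(0, total_blocks, slice_size)
    ]


def interleave(*slice_lists):
    """Round-robin merge of slices from several kernels.

    Assumption: this fixed order is only a placeholder for a real
    co-scheduling policy, which would pick slice pairs using a
    performance model.
    """
    order = []
    for group in zip_longest(*slice_lists):
        order.extend(s for s in group if s is not None)
    return order


# Two hypothetical kernels: A has 10 thread blocks, B has 6.
slices_a = slice_kernel("A", 10, 4)
slices_b = slice_kernel("B", 6, 3)
schedule = interleave(slices_a, slices_b)
```

Smaller slices give the scheduler more points at which to mix kernels, at the cost of more launches; the slice sizes used here are arbitrary illustration values.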

Highlights

Our research has had significant practical impact on GPU virtualization, which is now an important infrastructure component of GPU-based cloud computing.

In the following, we present more details on the "impact factors" of this project (see definition of "impact factors").

Citations

The Kernelet paper has already received over 125 citations since 2014.

Example quotes on citations.

Relevance to Industry and Open-Source Community

This system has inspired other open-source and industry systems.

gScale has been integrated into Intel's Open GPU virtualization platform. Screenshot on Intel, Linux Kernel

[TPDS] Zhang, Haitao, Xin Geng, and Huadong Ma
Learning-driven Interference-aware Workload Parallelization for Streaming Applications in Heterogeneous Cluster, TPDS 2020.

[TACO] Wu, Hao, Weizhi Liu, Huanxin Lin, and Cho-Li Wang
A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs, TACO 2020.

[TC] Li, Zhifang, Beicheng Peng, and Chuliang Weng
XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform, TC 2020.

[TC] Zahaf, Houssam-Eddine, Nicola Capodieci, Roberto Cavicchioli, Giuseppe Lipari, and Marko Bertogna
The HPC-DAG Task Model for Heterogeneous Real-Time Systems, TC 2020.

[DAC] Kim, Jiho, John Kim, and Yongjun Park
Navigator: dynamic multi-kernel scheduling to improve GPU performance, DAC 2020.

[The Journal of Supercomputing] Mohamad Beheshti Roui, S. Kazem Shekofteh, Hamid Noori
Efficient scheduling of streams on GPGPUs, The Journal of Supercomputing 2020.

[TC] Peng, Bo, Jianguo Yao, Yaozu Dong, and Haibing Guan
MDev-NVMe: Mediated Pass-Through NVMe Virtualization Solution with Adaptive Polling, TC 2020.

[TPDS] Shekofteh, Seyed Kazem, Hamid Noori, Mahmoud Naghibzadeh, Holger Froening, and Hadi Sadoghi Yazdi.
cCUDA: Effective Co-Scheduling of Concurrent Kernels on GPUs, TPDS 2020.

[CGO] Jiao, Qing, Mian Lu, Huynh Phung Huynh, and Tulika Mitra.
Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS, CGO 2015.

[RTAS] Otterness, Nathan, Ming Yang, Sarah Rust, Eunbyung Park, James H. Anderson, F. Donelson Smith, Alex Berg, and Shige Wang.
An Evaluation of the NVIDIA TX1 for Supporting Real-Time Computer-Vision Workloads, RTAS 2017.

[ICESS] Li, Junke, Bing Guo, Yan Shen, Deguang Li, and Yanhui Huang.
Low-Energy Kernel Scheduling Approach for Energy Saving, ICESS 2016.

[ISCA] Xu, Qiumin, Hyeran Jeon, Keunsoo Kim, Won Woo Ro, and Murali Annavaram.
Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming, ISCA 2016.

[Scientific Programming] Park, Younghun, Minwoo Gu, and Sungyong Park.
Ballooning Graphics Memory Space in Full GPU Virtualization Environments, Scientific Programming 2019.

System Repeatability and Academic Impacts

The system is used in the evaluation of the following papers:

Educational Adoptions

[Book] Hamid Sarbazi-Azad (editor).
Advances in GPU Research and Practice, Morgan Kaufmann, 15 Sep 2016.

[Course] Sogang University.
CSE 6468, Distributed Systems

[Seminar] Indian Institute of Technology Bombay.
Multitasking Support in GPUs

Media Coverage

[News] Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling, 29 Nov 2018. Screenshot on CSDN