Author:
Fuentes-Alventosa Antonio, Gómez-Luna Juan, González-Linares José Maria, Guil Nicolás, Medina-Carnicer R.
Abstract
CAVLC (Context-Adaptive Variable Length Coding) is a high-performance entropy method for video and image compression. It is the most commonly used entropy method in the video standard H.264. In recent years, several hardware accelerators for CAVLC have been designed. In contrast, high-performance software implementations of CAVLC (e.g., GPU-based) are scarce. A high-performance GPU-based implementation of CAVLC is desirable in several scenarios. On the one hand, it can be exploited as the entropy component in GPU-based H.264 encoders, which are a very suitable solution when GPU built-in H.264 hardware encoders lack certain necessary functionality, such as data encryption and information hiding. On the other hand, a GPU-based implementation of CAVLC can be reused in a wide variety of GPU-based compression systems for encoding images and videos in formats other than H.264, such as medical images. This is not possible with hardware implementations of CAVLC, as they are non-separable components of hardware H.264 encoders. In this paper, we present CAVLCU, an efficient implementation of CAVLC on GPU, which is based on four key ideas. First, we use only one kernel to avoid the long-latency global memory accesses required to transmit intermediate results among different kernels, as well as the costly launches and terminations of additional kernels. Second, we apply an efficient synchronization mechanism that lets thread-blocks processing adjacent frame regions (in the horizontal and vertical dimensions) share results in global memory space. (In this paper, to prevent confusion, a block of pixels of a frame is referred to simply as a block and a GPU thread block as a thread-block.) Third, we fully exploit the available global memory bandwidth by using vectorized loads to move the quantized transform coefficients directly to registers. Fourth, we use register tiling to implement the zigzag sorting, thus obtaining high instruction-level parallelism.
An exhaustive experimental evaluation showed that our approach is between 2.5× and 5.4× faster than the only state-of-the-art GPU-based implementation of CAVLC.
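As a rough illustration of the zigzag sorting mentioned in the abstract, the following is a minimal CPU-side sketch in C++, not the CAVLCU GPU kernel. It assumes a single 4x4 block of quantized coefficients in raster order and uses the standard H.264 frame-mode zigzag order; it then derives three of the symbols that CAVLC encodes (the number of nonzero coefficients, the trailing ±1 count, and the zeros before the last nonzero coefficient). All names (`kZigzag4x4`, `zigzag_scan`, `cavlc_stats`, `CavlcStats`, `self_check`) are hypothetical and chosen for this example only. Vectorized loads, register tiling, and thread-block synchronization are GPU-specific and are not shown here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Standard H.264 zigzag order for a 4x4 block (frame mode):
// kZigzag4x4[i] is the raster index (row * 4 + col) of the i-th scanned coefficient.
constexpr std::array<int, 16> kZigzag4x4 = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15};

// Hypothetical container for three CAVLC symbols of one block.
struct CavlcStats {
    int total_coeff;    // number of nonzero coefficients
    int trailing_ones;  // trailing +/-1 coefficients in reverse scan order, capped at 3
    int total_zeros;    // zeros that precede the last nonzero coefficient
};

// Reorder a raster-order 4x4 block into zigzag order.
// The trip count and index table are fixed, so a compiler is free to unroll the loop.
std::array<int16_t, 16> zigzag_scan(const std::array<int16_t, 16>& block) {
    std::array<int16_t, 16> out{};
    for (int i = 0; i < 16; ++i) {
        out[i] = block[kZigzag4x4[i]];
    }
    return out;
}

// Compute the symbols from an already zigzag-ordered block.
CavlcStats cavlc_stats(const std::array<int16_t, 16>& scanned) {
    CavlcStats s{0, 0, 0};
    int last = -1;  // position of the last nonzero coefficient, -1 if none
    for (int i = 0; i < 16; ++i) {
        if (scanned[i] != 0) {
            ++s.total_coeff;
            last = i;
        }
    }
    if (last < 0) {
        return s;  // all-zero block
    }
    s.total_zeros = (last + 1) - s.total_coeff;
    // Walk backwards over the nonzero coefficients, skipping zeros, and count
    // +/-1 values until a larger magnitude appears or the cap of 3 is reached.
    for (int i = last; i >= 0 && s.trailing_ones < 3; --i) {
        if (scanned[i] == 0) continue;
        if (std::abs(scanned[i]) != 1) break;
        ++s.trailing_ones;
    }
    return s;
}

// Usage sketch on an illustrative block (values are made up for this example).
void self_check() {
    const std::array<int16_t, 16> block = {
        0,  3, -1, 0,
        0, -1,  1, 0,
        1,  0,  0, 0,
        0,  0,  0, 0};
    const auto scanned = zigzag_scan(block);  // begins 0, 3, 0, 1, -1, -1, 0, 1
    const CavlcStats s = cavlc_stats(scanned);
    assert(s.total_coeff == 5);
    assert(s.trailing_ones == 3);
    assert(s.total_zeros == 3);
}
```

In a GPU setting, each thread would presumably hold its coefficients in registers and apply the same fixed permutation, which is where a fully unrolled, constant-index form of `zigzag_scan` becomes relevant.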
Publisher
Springer Science and Business Media LLC
Subject
Hardware and Architecture, Information Systems, Theoretical Computer Science, Software
Cited by
1 article.