Tile-level and Frame-level Parallel Encoding for HEVC
Journal of Broadcast Engineering. 2015. May, 20(3): 388-397
Copyright © 2015, The Korean Society of Broadcast Engineers
  • Received : March 12, 2015
  • Accepted : April 29, 2015
  • Published : May 30, 2015
About the Authors
연희 김
kimyounhee@etri.re.kr
진욱 석
순흥 정
휘용 김
진수 최

Abstract
High Efficiency Video Coding (HEVC)/H.265 is a new video coding standard known for its high compression ratio compared with the previous standard, Advanced Video Coding (AVC)/H.264. HEVC achieves this efficiency at the cost of higher computational complexity. To apply HEVC in market applications, fast encoding is one of the key requirements. Exploiting thread-level parallelism is a widely chosen way to achieve it, since multi-threading is commonly supported on multi-core computer architectures. In this paper, we implement both tile-level parallelism and frame-level parallelism for HEVC encoding on a multi-core platform. Based on this implementation, we present two approaches to combining tile-level parallelism with frame-level parallelism. The first approach creates a fixed number of tiles per frame, while the second approach sets the number of tiles per frame adaptively according to the number of frames encoded in parallel and the number of available worker threads. Experimental results show that both approaches improve parallel scalability compared with using tile-level parallelism alone, and that the second approach achieves a good trade-off between parallel scalability and coding efficiency for both Full-HD (1920 x 1080) and 4K UHD (3840 x 2160) sequences.
I. Introduction
High Efficiency Video Coding (HEVC) [1, 2] is a new video codec that was finalized in January 2013. HEVC is known to provide about twice the compression ratio of the previous video coding standard, which is promising for the multimedia industry. As the electronics industry pushes large-display products and consumers pursue high-resolution video content, the broadcast community and other key players in the video market have moved quickly to apply the HEVC codec to deliver advanced services. HEVC achieves its high efficiency at the cost of higher computational complexity. To apply HEVC in market applications, fast encoding is one of the key requirements [3]. Exploiting thread-level parallelism is a widely chosen way to achieve fast encoding, since multi-threading is commonly supported on multi-core computer architectures. HEVC offers several picture partitioning schemes, such as slices, tiles [4], and wavefront parallel processing (WPP) [5, 6, 7]. Since slices result in larger coding loss than tiles, we consider the tile scheme for picture-partitioning parallelism in this paper.
- 1. Tile-level parallelism
HEVC supports picture-partitioning parallelism through a structure called the tile: a picture is divided into rectangular regions that are encoded independently. There is no dependency between tiles, so several tiles can be encoded in parallel. However, because neighboring blocks across a tile boundary cannot be referenced during encoding, coding efficiency decreases. We have implemented tile-level parallelism based on the HEVC test model (HM) encoder. Since the HM encoder is designed to run on a single core, it encodes multiple tiles serially using a single CABAC engine. We parallelize the encoding of multiple tiles by giving each tile its own independent CABAC engine.
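The tile partitioning and per-tile encoding described above can be sketched as follows. This is a minimal illustration, not the HM-based implementation: the 64 x 64 CTU size, the uniform tile spacing, and the names `tile_grid`, `encode_tile`, and `encode_frame_in_tiles` are assumptions made for the example, and `encode_tile` is a hypothetical stand-in that only counts CTUs. Python threads show the structure of the scheme but do not by themselves deliver the CPU speedup of a native multi-threaded encoder.

```python
from concurrent.futures import ThreadPoolExecutor

CTU_SIZE = 64  # assumed CTU size in luma samples


def uniform_boundaries(size_in_ctus, parts):
    """Split size_in_ctus into `parts` nearly equal spans (uniform spacing)."""
    edges = [(i * size_in_ctus) // parts for i in range(parts + 1)]
    return [(edges[i], edges[i + 1]) for i in range(parts)]


def tile_grid(width, height, cols, rows, ctu=CTU_SIZE):
    """Return tiles as (x0, y0, x1, y1) rectangles in CTU units, raster order."""
    w_ctus = -(-width // ctu)   # ceiling division
    h_ctus = -(-height // ctu)
    return [(x0, y0, x1, y1)
            for (y0, y1) in uniform_boundaries(h_ctus, rows)
            for (x0, x1) in uniform_boundaries(w_ctus, cols)]


def encode_tile(tile):
    """Hypothetical stand-in: 'encodes' one tile with its own entropy state."""
    x0, y0, x1, y1 = tile
    entropy_state = {"ctus_coded": 0}  # private per-tile state, never shared
    for _ in range((x1 - x0) * (y1 - y0)):
        entropy_state["ctus_coded"] += 1
    return entropy_state["ctus_coded"]


def encode_frame_in_tiles(width, height, cols, rows):
    """Encode all tiles of one frame concurrently, one worker per tile."""
    tiles = tile_grid(width, height, cols, rows)
    with ThreadPoolExecutor(max_workers=len(tiles)) as pool:
        return list(pool.map(encode_tile, tiles))  # results kept in tile order
```

For a 1920 x 1080 frame split into 2 x 2 tiles, the picture is 30 x 17 CTUs, so the two tile rows are 8 and 9 CTUs high. Because each worker owns its entropy state, no locking is needed between tiles.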
- 2. Frame-level parallelism
Figure 1 illustrates the random access GOP structure defined in [10]. With this GOP structure, the frames in the fourth GOP level can be encoded with frame-level parallelism. The GOP structure shown in Fig. 1 has a shortcoming: only the GOP level 4 frames can be involved in parallel processing. To improve frame-level parallel scalability, we propose to change the GOP structure as shown in Fig. 2, where the frames in the third GOP level do not reference each other. In this way, we can encode the two frames in GOP level 3 in parallel and then the four frames in GOP level 4 in parallel. We call this scheme frame-level parallelism of GOP level 3&4; Fig. 3 illustrates it.
Fig. 1. HEVC random access GOP structure defined in [10]
Fig. 2. The GOP structure proposed in this paper to improve frame-level parallelism
Fig. 3. Frame-level parallelism of GOP level 3&4
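The encoding order implied by frame-level parallelism of GOP level 3&4 can be sketched as below. The sketch assumes a dyadic random access GOP of eight frames, in which the position of a picture inside the GOP determines its level, and it assumes that frames in a parallel level do not reference each other, as in the proposed structure. The names `gop_level` and `parallel_batches` are illustrative only.

```python
GOP_SIZE = 8  # assumed dyadic random access GOP of eight frames


def gop_level(poc, gop_size=GOP_SIZE):
    """Return the 1-based GOP level of a picture from its position in the GOP."""
    pos = poc % gop_size
    if pos == 0:
        return 1
    level, step = 2, gop_size // 2
    while pos % step != 0:
        level += 1
        step //= 2
    return level


def parallel_batches(first_poc, gop_size=GOP_SIZE, parallel_levels=(3, 4)):
    """Group the pictures of one GOP into batches encoded one after another.

    Pictures inside a batch are encoded concurrently; this is only valid if
    frames of a parallel level do not reference each other.
    """
    by_level = {}
    for poc in range(first_poc + 1, first_poc + gop_size + 1):
        by_level.setdefault(gop_level(poc, gop_size), []).append(poc)
    batches = []
    for level in sorted(by_level):
        if level in parallel_levels:
            batches.append(by_level[level])                    # one concurrent batch
        else:
            batches.extend([poc] for poc in by_level[level])   # serial frames
    return batches
```

With levels 3 and 4 both parallel, `parallel_batches(0)` yields `[[8], [4], [2, 6], [1, 3, 5, 7]]`. Restricting `parallel_levels` to `(4,)` models the original structure, where the two level 3 frames must be encoded one after the other.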
For the previous video coding standard, frame-level parallelism, slice-level parallelism, and approaches combining the two levels have been proposed. Since the tile is a newly adopted tool for parallel HEVC encoding, we combine tile-level parallelism and frame-level parallelism in HEVC encoding. To the best of our knowledge, this is the first report of combining tile-level parallelism and frame-level parallelism for HEVC. In this paper, we implement tile-level parallel encoding [8, 9] and frame-level parallel encoding separately, and propose an effective combined approach that adapts the number of tiles to the number of frames encoded in parallel and the number of available cores.
II. Improvement of parallelism
HEVC allows up to 25 tiles for Full-HD video (1920 x 1080) and up to 110 tiles for 4K UHD video (3840 x 2160). Multi-core architectures have improved significantly in recent years, so many systems now have a large number of cores. Employing two or more parallelism schemes together improves parallel scalability. Based on our implementations of tile-level parallelism and frame-level parallelism, we combine the two to improve parallel scalability and to utilize the multi-core system more effectively.
- 1. Combined Approach 1: Fixed number of tiles in a frame
First, we combine tile-level parallelism and frame-level parallelism as shown in Fig. 4. As an example, we set the number of tiles in a frame to four. When the encoder takes the input frames, it creates as many frame-encoding threads as there are frames to be encoded in parallel, and each frame-encoding thread then creates as many worker threads as there are tiles.
Fig. 4. Combined Approach 1: Tile-level parallelism combined with the frame-level parallelism of GOP level 3&4. The number of tiles is the same for all frames, regardless of frame-level parallelism; four tiles per frame are used as an example.
Combined Approach 1, illustrated in Fig. 4, has a shortcoming. When frame-level parallelism is employed, the number of tiles encoded in parallel increases by a factor equal to the number of frames in parallel. The resulting imbalance in the number of worker threads along the encoding timeline makes it difficult to run multi-threading effectively.
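The thread structure of Combined Approach 1 and the resulting imbalance can be sketched as follows. The nested thread pools mirror the description of frame-encoding threads that each spawn tile workers; the mapping `FRAMES_IN_PARALLEL` assumes a GOP of eight frames with levels 3 and 4 encoded in parallel, and `encode_tile` is a hypothetical placeholder rather than the actual encoder.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed number of frames encoded concurrently at each GOP level
# (levels 3 and 4 in parallel, GOP of eight frames).
FRAMES_IN_PARALLEL = {1: 1, 2: 1, 3: 2, 4: 4}


def encode_tile(frame_id, tile_id):
    """Hypothetical placeholder for the real tile encoder."""
    return (frame_id, tile_id)


def encode_frame(frame_id, tiles_per_frame):
    """One frame-encoding thread spawns one worker thread per tile."""
    with ThreadPoolExecutor(max_workers=tiles_per_frame) as tile_pool:
        futures = [tile_pool.submit(encode_tile, frame_id, t)
                   for t in range(tiles_per_frame)]
        return [f.result() for f in futures]


def encode_level_fixed(level, tiles_per_frame=4):
    """Approach 1: every frame uses the same tile count at every GOP level."""
    n_frames = FRAMES_IN_PARALLEL[level]
    with ThreadPoolExecutor(max_workers=n_frames) as frame_pool:
        futures = [frame_pool.submit(encode_frame, f, tiles_per_frame)
                   for f in range(n_frames)]
        return [task for f in futures for task in f.result()]


def workers_per_level(tiles_per_frame=4):
    """Number of tile workers that run at the same time at each GOP level."""
    return {lvl: n * tiles_per_frame for lvl, n in FRAMES_IN_PARALLEL.items()}
```

With four tiles per frame, `workers_per_level()` gives 4, 4, 8, and 16 concurrent tile workers for GOP levels 1 to 4, which is the fluctuation along the encoding timeline noted above.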
- 2. Combined Approach 2: Adaptive number of tiles in a frame
To address the problem described in the previous section, we propose to adapt the number of tiles to the number of frames encoded in parallel. Fig. 5 illustrates the proposed method for an example that uses four tiles per frame when frames are encoded serially. With this second approach, we expect less loss in coding efficiency and well-balanced CPU usage.
Fig. 5. Combined Approach 2: Tile-level parallelism combined with the frame-level parallelism of GOP level 3&4. The number of tiles is adjusted adaptively according to the number of frames processed in parallel; the initial number of tiles in a frame is four as an example.
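One way to realize the adaptive tile count is sketched below. The text does not give an explicit formula, so the rule used here, dividing the available worker threads by the number of frames encoded in parallel, is an assumption chosen to match the four-tile example; the names `adaptive_tiles` and `tile_plan` are illustrative. The sketch also ignores the constraint that a tile count must map onto a grid of tile columns and rows.

```python
def adaptive_tiles(frames_in_parallel, worker_threads):
    """Tiles per frame so that frames * tiles stays close to worker_threads.

    Assumed rule: integer division, with at least one tile per frame.
    """
    return max(1, worker_threads // frames_in_parallel)


def tile_plan(worker_threads=4, frames_per_level=None):
    """Tile count and concurrent worker count for each GOP level."""
    # Default assumes a GOP of eight frames with levels 3 and 4 in parallel.
    frames_per_level = frames_per_level or {1: 1, 2: 1, 3: 2, 4: 4}
    plan = {}
    for level, frames in frames_per_level.items():
        tiles = adaptive_tiles(frames, worker_threads)
        plan[level] = {"frames": frames,
                       "tiles_per_frame": tiles,
                       "workers": frames * tiles}
    return plan
```

With four worker threads, the plan uses 4, 4, 2, and 1 tiles per frame for GOP levels 1 to 4, so four workers are busy at every level. Frames at the deeper levels also have fewer tile boundaries, which is consistent with the expectation of lower coding loss.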
III. Experimental Results
- 1. Test sequences and environments
We implemented the parallel encoding approaches described so far on top of the HEVC reference software. Multithreading is implemented using the Windows thread APIs. Two sets of test sequences are used in the experiments. The first set consists of five Full-HD (1920 x 1080) videos of 100 frames each (Kimono, Park Scene, Cactus, Basketball Drive, and BQ Terrace), taken from the HEVC test sequences. The second set consists of five 4K UHD (3840 x 2160) videos of 100 frames each (Jockey, YachtRide, ReadySteadyGo, ShakeNDry, and HoneyBee), taken from the Kvazaar encoder [11, 12] test sequences. We used separate platforms according to the resolution of the test sequences. The platform for the Full-HD test set has one Intel Xeon E5-2690 processor with eight physical cores, whereas the platform for the 4K UHD test set has two Intel Xeon E5-2690 processors with sixteen physical cores in total. For the speedup measurement, sequences encoded serially with one tile per frame are used as the anchor. We select the Main profile and the HEVC common test condition random access setting with two modifications: the GOP structure is changed as shown in Fig. 2 so that the frames in level 3 can be encoded in parallel, and the asymmetric motion partitioning (AMP) coding tool is disabled.
- 2. Coding efficiency analysis
As described in the section on tile-level parallelism, picture partitioning schemes result in coding loss. However, as shown in Fig. 6, tiles cause less coding loss than slices. As shown in Fig. 7, slices produce longer partition boundaries than tiles, so the coding loss for slices is higher.
Fig. 6. Coding loss comparison for tile and slice encoding (test sequence: BasketballDrive)
Fig. 7. Picture partitioning using slices
When designing an encoder with various forms of parallelism, coding loss and parallel scalability should be carefully considered. Coding efficiency is measured using the Bjøntegaard delta (BD) bitrate as described in [13]. As the reference for the BD bitrate (BDBR), the test sequences are encoded with no parallelization and no picture partitioning. Tables 1 and 2 show the coding losses of tile-level parallelism. These losses occur mainly at tile boundaries, where the coding information of neighboring blocks cannot be used. The coding loss from tile partitioning increases when the global motion in a sequence is large, as in Jockey. Since a race horse runs very fast in Jockey, coding efficiency drops significantly at tile boundaries. Note that for the Jockey sequence, the proposed Combined Approach 2 decreases the coding loss significantly compared with the fixed tile partitioning method. Tables 3 and 4 show the coding losses when adaptive picture partitioning according to the GOP level is applied, as illustrated in Fig. 5. The results show that Combined Approach 2 produces less coding loss than tile-level parallelism.
Table 1. FHD: Tile-level parallelism coding efficiency (Y-BDBR)
Table 2. 4K: Tile-level parallelism coding efficiency (Y-BDBR)
Table 3. FHD: Combined Approach 2 parallelism coding efficiency (Y-BDBR)
Table 4. 4K: Combined Approach 2 parallelism coding efficiency (Y-BDBR)
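For readers unfamiliar with the BD bitrate, the calculation can be sketched as follows. The sketch follows the commonly used form of the method in [13]: each rate-distortion curve is represented by a third-order polynomial of log bitrate as a function of PSNR, the two polynomials are integrated over the overlapping PSNR range, and the average difference is converted to a percentage. It assumes exactly four rate points per curve, so the cubic passes through all points, and the sample values in the usage note are hypothetical, not measurements from this paper.

```python
import math


def _interp(xs, ys, x):
    """Lagrange polynomial through the points (xs[i], ys[i]), evaluated at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _integral(xs, ys, lo, hi):
    """Simpson's rule on [lo, hi]; exact for the cubic through four points."""
    mid = 0.5 * (lo + hi)
    return (hi - lo) / 6.0 * (_interp(xs, ys, lo)
                              + 4.0 * _interp(xs, ys, mid)
                              + _interp(xs, ys, hi))


def bd_rate(anchor, test):
    """BD bitrate in percent; anchor and test hold four (bitrate, psnr) points."""
    psnr_a = [p for _, p in anchor]
    psnr_t = [p for _, p in test]
    log_a = [math.log10(r) for r, _ in anchor]
    log_t = [math.log10(r) for r, _ in test]
    lo = max(min(psnr_a), min(psnr_t))   # overlapping PSNR interval
    hi = min(max(psnr_a), max(psnr_t))
    avg_diff = (_integral(psnr_t, log_t, lo, hi)
                - _integral(psnr_a, log_a, lo, hi)) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0
```

As a sanity check, if a hypothetical test encoder needs 10% more bitrate than the anchor at every PSNR point, for example `anchor = [(1000, 34.0), (2000, 36.5), (4000, 38.8), (8000, 40.6)]` and `test = [(1.1 * r, p) for r, p in anchor]`, then `bd_rate(anchor, test)` evaluates to 10 up to floating-point rounding. A positive value therefore indicates coding loss relative to the anchor.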
- 3. Parallel scalability analysis
Tables 5 to 10 show the speedup achieved by tile-level parallelism, Combined Approach 1, and Combined Approach 2, respectively. The results in Tables 7 to 10 show that the combined parallelism improves parallel scalability significantly.
Table 5. FHD: Tile-level parallelism scalability (Speedup)
Table 6. 4K: Tile-level parallelism scalability (Speedup)
Table 7. FHD: Combined Approach 1 parallelism scalability (Speedup)
Table 8. 4K: Combined Approach 1 parallelism scalability (Speedup)
Table 9. FHD: Combined Approach 2 parallelism scalability (Speedup)
Table 10. 4K: Combined Approach 2 parallelism scalability (Speedup)
When designing the encoding parallelism, we carefully consider the trade-off between speedup and coding efficiency. We aggregate the speedup results against coding loss for each parallelization scheme, as shown in Fig. 8. Both Combined Approach 1 and Combined Approach 2 provide better parallel scalability than tile-level parallelism. Combined Approach 2 shows better speedup relative to coding loss for both the Full-HD and 4K test sequences. In addition, as shown in Fig. 9, the comparison of speedup between Combined Approach 2 and tile-level parallelism follows a similar pattern regardless of the number of physical cores. However, the speedup scalability differs with the number of available cores: it decreases as the number of cores increases. This decrease is caused by parallelization inefficiencies such as frame synchronization.
Fig. 8. Speedup against BD bitrate
Fig. 9. Speedup against number of threads and number of cores
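The speedup figures above are ratios of the serial anchor encoding time to the parallel encoding time. A minimal sketch of this measure, together with the per-core efficiency that makes the loss of scalability visible, is given below. The function names and the timings in the usage note are hypothetical and are not taken from the tables.

```python
def speedup(anchor_seconds, parallel_seconds):
    """Speedup relative to the anchor (one tile per frame, encoded serially)."""
    return anchor_seconds / parallel_seconds


def parallel_efficiency(anchor_seconds, parallel_seconds, cores):
    """Fraction of ideal linear speedup obtained on `cores` physical cores."""
    return speedup(anchor_seconds, parallel_seconds) / cores
```

For example, a hypothetical encoder that takes 100 s serially and 25 s on eight cores has a speedup of 4.0 and an efficiency of 0.5. An efficiency that falls as cores are added corresponds to the decreasing scalability described above.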
IV. Conclusion
In this paper, we presented two effective approaches to combining tile-level parallelism and frame-level parallelism. Both approaches provide better parallel scalability than tile-level parallelism alone. The experimental results show that, when combining the two, adapting the number of tiles to the number of frames encoded in parallel and the number of available cores results in a better trade-off between coding loss and parallel scalability for Full-HD and 4K UHD video sequences.
BIO
김 연 희
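- 2000 : B.S., Department of Information and Computer Engineering, Ajou University
- 2002 : M.S., Department of Information and Computer Engineering, Ajou University
- 2009 : Ph.D., Computer Science, George Mason University
- 2009 ~ present : Senior Researcher, Visual Media Research Section, Electronics and Telecommunications Research Institute (ETRI)
- ORCID: http://orcid.org/0000-0003-0658-6762
- Research interests : video compression, video signal processing, information hiding, realistic media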
석 진 욱
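- 1993 : B.S., Department of Electrical and Control Engineering, Hongik University
- 1995 : M.S., Department of Electrical and Control Engineering, Hongik University
- 1998 : Ph.D., Department of Electrical Engineering, Hongik University
- 2000 ~ present : Principal Researcher, Visual Media Research Section, Electronics and Telecommunications Research Institute (ETRI)
- Research interests : nonlinear stochastic systems, video compression, UHDTV broadcasting systems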
정 순 흥
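- 2001 : B.S., Department of Electronics Engineering, Pusan National University
- 2003 : M.S., Department of Electrical and Electronic Engineering, Korea Advanced Institute of Science and Technology (KAIST)
- 2003 ~ 2005 : Associate Research Engineer, LG Electronics
- 2005 ~ present : Senior Researcher, Visual Media Research Section, Electronics and Telecommunications Research Institute (ETRI)
- 2010 ~ present : Ph.D. program, Department of Electrical and Electronic Engineering, Korea Advanced Institute of Science and Technology (KAIST)
- Research interests : image processing, video communication, realistic broadcasting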
김 휘 용
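- 1994 : B.S., Department of Electrical and Electronic Engineering, Korea Advanced Institute of Science and Technology (KAIST)
- 1998 : M.S., Department of Electrical and Electronic Engineering, Korea Advanced Institute of Science and Technology (KAIST)
- 2004 : Ph.D., Department of Electrical and Electronic Engineering, Korea Advanced Institute of Science and Technology (KAIST)
- 2003 ~ 2005 : Multimedia Team Leader, R&D Center, AddPac Technology Co., Ltd.
- 2005 ~ present : Director, Visual Media Research Section, Electronics and Telecommunications Research Institute (ETRI)
- Research interests : video compression, computer vision, multimedia systems, realistic media services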
최 진 수
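- 1990 : B.S., Department of Electronics Engineering, Kyungpook National University
- 1992 : M.S., Department of Electronics Engineering, Kyungpook National University
- 1996 : Ph.D., Department of Electronics Engineering, Kyungpook National University
- 1996 ~ present : Director, Realistic Broadcasting Media Research Department, Electronics and Telecommunications Research Institute (ETRI)
- Research interests : video communication, UHDTV broadcasting, realistic media services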
References
[1] High Efficiency Video Coding, document ITU-T Rec. H.265 and ISO/IEC 23008-2 (HEVC), ITU-T and ISO/IEC, 2013.
[2] Sullivan G. J., “Overview of the High Efficiency Video Coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., 22(12), 1649-1668, 2012. DOI: 10.1109/TCSVT.2012.2221191
[3] Kim Y., “A Fast Intra-Prediction Method in HEVC Using Rate-Distortion Estimation Based on Hadamard Transform,” ETRI Journal, 35(2), 270-280, 2013. DOI: 10.4218/etrij.13.0112.0223
[4] Fuldseth A., “Tiles,” JCT-VC document JCTVC-F335, Torino, 2011.
[5] Gordon C., Henry F., Pateux S., “Wavefront Parallel Processing for HEVC Encoding and Decoding,” JCT-VC document JCTVC-F274, 2011.
[6] Thiesse J. M., Viéron J., “On Tiles and Wavefront tools for parallelism,” JCT-VC document JCTVC-I0198, Geneva, 2012.
[7] Zhang S., Zhang X., Gao Z., “Implementation and Improvement of Wavefront Parallel Processing for HEVC Encoding on Many-core Platform,” Proceedings of IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1-6, 2014.
[8] Chi C., “Parallel Scalability and Efficiency of HEVC Parallelization Approaches,” IEEE Trans. Circuits Syst. Video Technol., 22(12), 1827-1838, 2012. DOI: 10.1109/TCSVT.2012.2223056
[9] Chi C. C., “Parallel Scalability and Efficiency of WPP and Tiles,” JCT-VC document JCTVC-I0520, Geneva, 2012.
[10] Bossen F., “Common test conditions and software reference configurations,” JCT-VC document JCTVC-I1100, 2012.
[11] Viitanen M., “Kvazaar HEVC encoder for efficient intra coding,” in Proc. IEEE Int. Symp. Circuits Syst., Lisbon, Portugal, May 2015.
[12] Ultra Video Group. [Online]. Available: .
[13] Bjontegaard G., “Calculation of Average PSNR Differences between RD-Curves,” Document VCEG-M33, Austin, TX, USA, 2001.