Joint Pixel and Frequency Feature Learning and Fusion via Channel-Wise Transformer for High-Efficiency Learned In-Loop Filter in VVC

Birendra Kathariya; Zhu Li; Geert Van Der Auwera

Journal Article, Open Access

IEEE Transactions on Circuits and Systems for Video Technology (2024) 34(5) 4070-4083

DOI: 10.1109/TCSVT.2023.3323483

10 Citations

12 Readers

Abstract

Block-based video codecs such as Versatile Video Coding (VVC)/H.266, High Efficiency Video Coding (HEVC)/H.265, and Advanced Video Coding (AVC)/H.264 inherently introduce compression artifacts. Although these codecs include in-loop filters to correct these distortions, the filters are not always effective because of the complexity of the noise. Recently, deep-learning approaches have emerged as a promising solution for in-loop filtering. However, most previous approaches were designed solely to learn from images and neglected the high-frequency signals present in the reconstructed video frames. Furthermore, some previous methods employed a multi-level feature-extraction and feature-fusion strategy to enhance performance, but they paired complex feature extractors with naive feature-fusion methods. In this article, we propose a novel framework called TSF-Net, which jointly learns from both pixel (spatial) and frequency-decomposed information and fuses the two through a channel-wise transformer to improve performance. Our approach departs from previous ones by coupling a simple feature extractor with an advanced transformer-based feature-fusion module. In addition, TSF-Net introduces a few fundamental modifications to the multi-head self-attention module of the channel-wise transformer to make it computationally efficient. Our experimental results show that the proposed TSF-Net achieves a Bjøntegaard Delta (BD) bitrate saving of up to 10.258% for the luma (Y) component under the all-intra (AI) profile, outperforming the VVC baseline and other state-of-the-art methods. Moreover, TSF-Net with the efficient channel-wise transformer is twice as efficient as TSF-Net with a vanilla channel-wise transformer.
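The abstract does not specify how the channel-wise transformer is built, so the following is only a minimal sketch of the general idea of attention over channels, not the TSF-Net module itself. It assumes two already-extracted feature maps of shape (C, H, W), one from the pixel branch and one from the frequency branch, treats each channel as a token, and uses a single attention head. The function name `channel_wise_attention` and the random projection matrices (standing in for learned weights) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_wise_attention(pixel_feat, freq_feat, rng):
    """Fuse two (C, H, W) feature maps with single-head attention over channels.

    Assumption: this is a generic illustration, not the paper's design.
    Each channel is flattened to one token of length H*W, so the attention
    map is (2C, 2C) instead of (H*W, H*W).
    """
    c, h, w = pixel_feat.shape
    # Stack both sources along the channel axis and flatten the spatial dims.
    tokens = np.concatenate([pixel_feat, freq_feat], axis=0).reshape(2 * c, h * w)
    d = tokens.shape[1]
    # Hypothetical random projections stand in for learned Q/K/V weights.
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Channel-to-channel similarity, normalised per query channel.
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    fused = attn @ v
    return fused.reshape(2 * c, h, w), attn


rng = np.random.default_rng(0)
pixel = rng.standard_normal((4, 4, 4))  # toy pixel-branch features
freq = rng.standard_normal((4, 4, 4))   # toy frequency-branch features
fused, attn = channel_wise_attention(pixel, freq, rng)
print(fused.shape, attn.shape)
```

Because the attention map here scales with the number of channels rather than the number of pixels, its size does not grow with frame resolution, which is one common motivation for attending over channels. The efficiency modifications the authors mention are not described in the abstract and are not reproduced here.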

Cite

Kathariya, B., Li, Z., & Van Der Auwera, G. (2024). Joint Pixel and Frequency Feature Learning and Fusion via Channel-Wise Transformer for High-Efficiency Learned In-Loop Filter in VVC. IEEE Transactions on Circuits and Systems for Video Technology, 34(5), 4070–4083. https://doi.org/10.1109/TCSVT.2023.3323483