Joint Pixel and Frequency Feature Learning and Fusion via Channel-Wise Transformer for High-Efficiency Learned In-Loop Filter in VVC

10Citations
Citations of this article
12Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Block-based video codecs such as Versatile Video Coding (VVC)/H.266, High Efficiency Video Coding (HEVC)/H.265, Advanced Video Coding (AVC)/H.264 etc. inherently introduces compression artifacts. Although these codecs have in-loop filters to correct these distortions, they are not always effective due to the complexity of the noise. Recently, deep-learning approaches emerged as a promising solution for in-loop filtering. However, most of the previous approaches were designed solely for learning from images and neglected the high-frequency signals present in the reconstructed video frames. Furthermore, some previous methods employed a multi-level feature-extraction and feature-fusion strategy to enhance performance. However, they utilized complex feature-extractors while relying on naive feature-fusion methods. In this article, we propose a novel framework called TSF-Net, which jointly learns from both the pixel (spatial) and frequency-decomposed information and through powerful capability of a channel-wise transformer, it fuses both these information to improve performance. Our approach deviates from previous approaches by employing a simple feature-extractor coupled with an advanced transformer-based feature-fusion module. Simultaneously, TSF-Net introduces a few fundamental modifications in the multi-head self-attention module of the channel-wise transformer to make it computationally efficient. Our experimental results show that the proposed TSF-Net achieves a Bjontegaard Delta (BD) - bitrate saving of up to 10.258% for the luma (Y) component under all-intra (AI) profile outperforming the VVC baseline and other state-of-the-art methods. Moreover, the proposed TSF-Net with an efficient channel-wise transformer is twice as efficient as TSF-Net with a vanilla channel-wise transformer.

Cite

CITATION STYLE

APA

Kathariya, B., Li, Z., & Auwera, G. V. D. (2024). Joint Pixel and Frequency Feature Learning and Fusion via Channel-Wise Transformer for High-Efficiency Learned In-Loop Filter in VVC. IEEE Transactions on Circuits and Systems for Video Technology, 34(5), 4070–4083. https://doi.org/10.1109/TCSVT.2023.3323483

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free