A method typically used in video coding is to down-sample the input video frames to half resolution before encoding. This reduces the computational cost but requires an up-sampling filter that restores the original resolution in the decoded video while keeping the loss in visual quality as small as possible. The filters currently used are bicubic and Lanczos.
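As an illustration, here is a minimal sketch of this half-resolution pipeline around an existing codec, assuming OpenCV and an 8-bit frame held in a NumPy array; the function names and the `encode_and_decode` placeholder are hypothetical, not part of any standard.

```python
import cv2  # OpenCV, assumed available


def downsample_for_encoding(frame):
    """Halve both dimensions before passing the frame to the encoder."""
    h, w = frame.shape[:2]
    return cv2.resize(frame, (w // 2, h // 2), interpolation=cv2.INTER_AREA)


def upsample_after_decoding(frame, target_size, use_lanczos=False):
    """Recover the original resolution from the decoded low-resolution frame."""
    interp = cv2.INTER_LANCZOS4 if use_lanczos else cv2.INTER_CUBIC
    return cv2.resize(frame, target_size, interpolation=interp)


# Hypothetical usage around an encoder/decoder pair:
# low_res  = downsample_for_encoding(original_frame)
# decoded  = encode_and_decode(low_res)   # placeholder for the actual codec
# restored = upsample_after_decoding(decoded, (original_w, original_h))
```

The EVC-UFV approach described below replaces the last step, the up-sampling filter, with a learned network.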

Figure 1 – Up-sampling Filters for Video application (EVC-UFV)

 

In the last few years, Artificial Intelligence (AI), Machine Learning (ML), and especially Deep Learning (DL) techniques have demonstrated their capability to enhance the performance of various image and video processing tasks. MPAI has carried out an investigation to assess how video coding performance could be improved by replacing traditional coding blocks with deep-learning ones. This study has shown that deep-learning based up-sampling filters significantly improve the performance of existing video codecs.

MPAI issued a Call for Technologies for up-sampling filters for video applications in October 2024. This was followed by an intense phase of development that enabled MPAI to approve Technical Specification: AI-Enhanced Video Coding (MPAI-EVC) – Up-sampling Filter for Video application (EVC-UFV) V1.0 with a request for Community Comments at its 58th General Assembly (MPAI-58).

The EVC-UFV standard specifies efficient, low-complexity up-sampling filters applicable to video with bit depths of 8 and 10 bits per component, in the standard YCbCr colour space with 4:2:0 sub-sampling, encoded with a variety of encoding technologies and encoding features such as random access and low delay.

As depicted in Figure 2, the filter is based on the Densely Residual Laplacian Super-Resolution network (DRLN), a novel deep-learning approach to super-resolution.

Figure 2 – Densely Residual Laplacian Super-Resolution network (DRLN).

The complexity of the filter is reduced in two steps. First, a drastic simplification of the deep-learning structure that reduces the number of blocks yields a much lighter network while keeping performance close to that of the baseline DRLN. This is achieved by identifying the DRLN’s principal components and understanding the impact of each component on output video frame quality, memory size, and computational cost.

As shown in Figure 2, the main component of the DRLN architecture is a Residual Block, which is composed of Densely Residual Laplacian Modules (DRLM) and a convolutional layer. Each DRLM contains three Residual Units, as well as one compression unit and one Laplacian attention unit (a set of convolutional layers with a square filter size and a dilation greater than or equal to the filter size). Each Residual Unit consists of two convolutional layers and two ReLU layers. All DRLMs in each Residual Block and all Residual Units in each DRLM are densely connected. The Laplacian attention unit consists of three convolutional layers with filter size 3×3 and dilations (dilation is a technique for expanding a convolutional kernel by inserting gaps between its elements) equal to 3, 5, and 7. All convolutional layers in the network, except those of the Laplacian attention unit, have filter size 3×3 with dilation equal to 1. Throughout the network, the number of feature maps (the outputs of the convolutional layers) is 64.
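To make this structure concrete, the following is a minimal PyTorch sketch of a Residual Unit and a Laplacian attention unit as described above; the class names, channel counts, and the fusion step are illustrative assumptions, not the normative EVC-UFV code.

```python
import torch
import torch.nn as nn


class ResidualUnit(nn.Module):
    """Two 3x3 convolutions with ReLU activations and a skip connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)


class LaplacianAttention(nn.Module):
    """Pools the features, applies three dilated 3x3 convolutions
    (dilations 3, 5, 7), and uses the fused result to re-weight the input."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.c3 = nn.Conv2d(channels, mid, kernel_size=3, padding=3, dilation=3)
        self.c5 = nn.Conv2d(channels, mid, kernel_size=3, padding=5, dilation=5)
        self.c7 = nn.Conv2d(channels, mid, kernel_size=3, padding=7, dilation=7)
        self.fuse = nn.Conv2d(3 * mid, channels, kernel_size=1)

    def forward(self, x):
        y = self.pool(x)
        y = torch.cat([self.c3(y), self.c5(y), self.c7(y)], dim=1)
        return x * torch.sigmoid(self.fuse(y))


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    y = LaplacianAttention(64)(ResidualUnit(64)(x))
    print(y.shape)  # torch.Size([1, 64, 32, 32])
```

In the full network these units are densely connected inside each DRLM, with the compression unit bringing the concatenated feature maps back to the working width.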

Based on this structural analysis, reducing the number of main Residual Blocks, adding more DRLMs per block, simplifying the Residual Unit, and reducing the number of hidden convolutional layers and feature maps drastically accelerates execution and reduces the memory footprint without substantially affecting the network’s visual quality.

Figure 3 depicts the resulting EVC-UFV Up-sampling Filter.

Figure 3 – Structure of the EVC-UFV Up-sampling Filter

 

The parameters of the original and complexity-reduced network are given in Table 1.

 

Table 1 – Parameters of the original and the complexity-reduced network

Parameter                                       Original   Final
Residual Blocks                                 6          2
DRLMs per Residual Block                        3          6
Residual Units per DRLM                         3          3
Hidden Convolutional Layers per Residual Unit   2          1
Input Feature Maps                              64         32
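As a rough indication of why the choices in Table 1 matter, the number of weights in a single 3×3 convolution is kernel area × input channels × output channels, so halving the feature maps from 64 to 32 alone shrinks each such layer by a factor of four. The back-of-the-envelope check below is illustrative only and not an exact measure of either network's size.

```python
def conv3x3_weights(in_ch, out_ch, k=3):
    """Number of weights in a single k x k convolution (biases ignored)."""
    return k * k * in_ch * out_ch


original_layer = conv3x3_weights(64, 64)   # 36,864 weights
reduced_layer  = conv3x3_weights(32, 32)   #  9,216 weights

print(original_layer, reduced_layer, original_layer / reduced_layer)  # 36864 9216 4.0
```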

 

Further, by pruning the parameters and weights of the network, the network complexity is reduced by 40%, with a loss in performance of less than 1% in BD-rate. This is achieved by first applying the well-known DeepGraph technique, modified to work with a deep-learning based up-sampling filter, to understand the dependencies among the layers of the simplified network’s components. This makes it possible to group components that share a common pruning approach, which can then be applied without introducing dimensional inconsistencies among the inputs and outputs of the layers.
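As a general illustration of structured pruning (not the normative, dependency-graph-driven EVC-UFV procedure), the sketch below uses PyTorch's built-in pruning utilities to zero out 40% of the output channels of a stand-in convolutional layer, selected by L1 norm.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A stand-in convolutional layer; in practice this would be a layer of the
# simplified up-sampling network, chosen according to the dependency analysis.
conv = nn.Conv2d(32, 32, kernel_size=3, padding=1)

# Structured pruning: zero the 40% of output channels (dim=0) whose weights
# have the smallest L1 norm.
prune.ln_structured(conv, name="weight", amount=0.4, n=1, dim=0)

# Fold the pruning mask into the weight tensor permanently.
prune.remove(conv, "weight")

zeroed = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"{zeroed} of {conv.out_channels} output channels zeroed")
```

Zeroing channels only masks the weights; actually removing them, and thereby realising the complexity reduction, requires the input dimensions of downstream layers to be adjusted consistently, which is precisely what the dependency grouping described above takes care of.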

Verification Tests of the technology have been performed on:

 

Standard sequences       CatRobot, FoodMarket4, ParkRunning3.
Bits/sample              8 and 10 bit depth per component.
Colour space             YCbCr with 4:2:0 sub-sampling.
Encoding technologies    AVC, HEVC, and VVC.
Encoding settings        Random Access and Low Delay at QPs 22, 27, 32, 37, 42, 47.
Up-sampling              SD to HD and HD to UHD.
Metrics                  BD-Rate, BD-PSNR, and BD-VMAF (a BD-Rate computation sketch is given below).
Deep-learning structure  Same for all QPs.
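For reference, the Bjøntegaard delta-rate (BD-Rate) metric compares two rate-distortion curves (here, the codec with bicubic up-sampling against the codec with the EVC-UFV filter) by fitting each with a cubic polynomial over log-bitrate and integrating the gap between them. The sketch below is a minimal, generic implementation with made-up example points, not the test harness used for the verification.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test curve versus the anchor,
    following the Bjontegaard delta-rate construction."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)

    # Fit log-rate as a cubic polynomial of quality for each curve.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)

    # Integrate over the overlapping quality range.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    # Average log-rate difference, converted back to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100


# Hypothetical rate-distortion points (kbps, dB) at four QPs:
print(bd_rate([1000, 2000, 4000, 8000], [32.0, 34.5, 37.0, 39.5],
              [900, 1800, 3600, 7200], [32.2, 34.8, 37.3, 39.8]))
```

A negative BD-Rate indicates a bitrate saving at equal quality; BD-PSNR and BD-VMAF follow the same construction with the roles of rate and quality exchanged.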

 

Results show an impressive improvement for all coding technologies and encoding options, and for all three objective metrics, when compared with the currently used traditional bicubic interpolation. The results in Table 2 were obtained for the low-delay coding mode.

 

Table 2 – Performance of the EVC-UFV Up-sampling Filter

                                        AVC      HEVC     VVC
SD to HD (using own trained filter)     14.4%    12.2%    13.8%
HD to UHD (using own trained filter)    5.6%     6%       6.5%
SD to HD (using HD to UHD filter)       14%      11.6%    11.4%

 

All results were obtained with the 40%-pruned network.