This chapter specifies the steps enabling a User to design a neural network that up-samples a video sequence to a resolution higher than the current one:
- Selection of video sequences for use in the development of the Training Dataset.
- Creation of the Training Dataset.
- Selection of Training Approach.
- Selection of the NN Architecture (initial architecture and modification).
- Training of the Neural Network (e.g., choosing the number of epochs for SDtoHD).
Data Preparation
Assuming a target resolution of m rows by n columns, the training dataset consists of pairs of input frames of resolution m/2 by n/2 and output frames of resolution m by n. If the input frames are not available, they may be obtained from the output frames by applying a down-sampling filter, as sketched below.
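The following minimal sketch shows how input frames may be derived when only output frames are available. The choice of bicubic interpolation is an assumption, since the down-sampling filter is not specified here.

```python
import cv2

def make_input_frame(output_frame):
    """Derive an (m/2 x n/2) input frame from an (m x n) output frame.
    Bicubic interpolation is an assumption; any suitable down-sampling
    filter may be used instead."""
    m, n = output_frame.shape[:2]
    # cv2.resize takes the destination size as (width, height).
    return cv2.resize(output_frame, (n // 2, m // 2),
                      interpolation=cv2.INTER_CUBIC)
```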
To reduce the computing time required for training, as well as to overcome memory management issues, patches extracted from the input and output frames may be used. The resolutions of the patches are h/2 by k/2 and h by k for the input and output patches, respectively. The number of patches extracted from a frame shall be appropriately smaller than the total number of patches in the frame, and h and k shall be appropriately smaller than m and n, respectively.
Patches may be extracted with different methods, e.g., randomly or based on features, as in the sketch below.
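As an illustration, the sketch below extracts one randomly positioned pair of co-located patches; h and k are assumed even so that the corresponding input-frame coordinates are integers.

```python
import numpy as np

def extract_patch_pair(input_frame, output_frame, h, k, rng=None):
    """Extract one aligned pair: an (h/2 x k/2) input patch and the
    co-located (h x k) output patch. h and k are assumed even."""
    if rng is None:
        rng = np.random.default_rng()
    m, n = output_frame.shape[:2]
    # Pick an even top-left corner in the output frame so that the
    # co-located input-frame coordinates are integral.
    r = int(rng.integers(0, (m - h) // 2 + 1)) * 2
    c = int(rng.integers(0, (n - k) // 2 + 1)) * 2
    out_patch = output_frame[r:r + h, c:c + k]
    in_patch = input_frame[r // 2:r // 2 + h // 2, c // 2:c // 2 + k // 2]
    return in_patch, out_patch
```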
To ensure that the trained filter is applicable to a wide range of video material beyond that used for training, Augmentation may be used: the size of the training dataset is increased by transforming patches or frames, e.g., by rotating, adding noise, or mirroring, as in the sketch below.
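A sketch of such Augmentation follows. The same geometric transform is applied to both patches of a pair to keep them aligned; adding the noise to the input patch only is an assumption about where noise would be applied.

```python
import numpy as np

def augment_pair(in_patch, out_patch, rng=None):
    """Randomly rotate by a multiple of 90 degrees, optionally mirror,
    and optionally add noise. Rotation assumes square patches."""
    if rng is None:
        rng = np.random.default_rng()
    turns = int(rng.integers(0, 4))          # 0, 90, 180 or 270 degrees
    in_patch, out_patch = np.rot90(in_patch, turns), np.rot90(out_patch, turns)
    if rng.random() < 0.5:                   # horizontal flipping
        in_patch, out_patch = np.fliplr(in_patch), np.fliplr(out_patch)
    if rng.random() < 0.5:                   # vertical flipping
        in_patch, out_patch = np.flipud(in_patch), np.flipud(out_patch)
    if rng.random() < 0.5:                   # noise on the input patch only
        in_patch = in_patch + rng.normal(0.0, 1.0, in_patch.shape)
    return in_patch, out_patch
```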
Training
Although training can start either from an untrained model or from an already trained one, the latter provides better results: a model pre-trained using the method specified below is then fine-tuned.
The pre-training method is performed with the following process:
- The pre-training set shall have a size of at least.
- The images are diversified through data Augmentation with the following process:
- Selection of square patches.
- Each patch is randomly changed by applying one or more of the following:
- Rotations by multiples of 90°.
- Horizontal flipping.
- Vertical flipping.
- The pre-training uses the following (see the sketch after this list):
- Batch size of 4.
- Backpropagation algorithm according to ADAM with default parameters β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸.
- The learning rate is initially fixed to 10⁻⁴ and then halved after every 24 iterations.
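A minimal PyTorch sketch of this pre-training configuration follows. TinyUpsampler is a placeholder for the selected NN Architecture, the random tensors stand in for the pre-training set, and the L1 loss is an assumption, since the loss function is not specified here.

```python
import torch
from torch import nn

class TinyUpsampler(nn.Module):
    """Placeholder for the selected NN Architecture: Residual Blocks
    followed by a 2x PixelShuffle up-sampler. Purely illustrative."""
    def __init__(self, n_blocks=8, ch=16):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.residual_blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(ch, ch, 3, padding=1))
            for _ in range(n_blocks))
        self.tail = nn.Sequential(nn.Conv2d(ch, 12, 3, padding=1),
                                  nn.PixelShuffle(2))

    def forward(self, x):
        x = self.head(x)
        for block in self.residual_blocks:
            x = x + block(x)          # residual connection
        return self.tail(x)

model = TinyUpsampler()
# Placeholder pre-training data: 32x32 inputs paired with 64x64 targets.
pretrain_set = torch.utils.data.TensorDataset(
    torch.randn(64, 3, 32, 32), torch.randn(64, 3, 64, 64))
loader = torch.utils.data.DataLoader(pretrain_set, batch_size=4, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
# Halve the learning rate every 24 iterations, stepping once per batch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=24, gamma=0.5)
loss_fn = nn.L1Loss()   # the training loss is not specified; L1 is an assumption

for inputs, targets in loader:
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()
    optimizer.step()
    scheduler.step()    # one scheduler step per iteration
```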
The fine-tuning is performed with the following process (sketches of the patch selection and of the training configuration follow the list):
- Select a fine-tuning dataset for the specific application domain, e.g., in the case of video applications, encoded and decoded video sequences.
- Compute the Saliency Value of each patch.
- Retain a patch if its Saliency Value is adequately separated in the Cumulative Distribution Function of the Saliency Values.
- Augment the dataset size by randomly changing the patch, applying one or more of the following:
- Rotations by multiples of 90°.
- Horizontal flipping.
- Vertical flipping.
- The first four Residual Blocks are frozen while the rest of the Residual Blocks are trained.
- The fine-tuning is applied for 200 epochs using a batch size of 4.
- The learning rate is initially set to 10⁻⁵ and then reduced during training with a ReduceLROnPlateau scheduler with patience 15 and a learning-rate factor of 0.5.
- The ADAM optimization is used with initial parameters β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸.
- The extracted pairs of patches for the training set have a size of 64×64 pixels for the input and 128×128 pixels for the output (2x up-sampling).
- The dataset is split into training and validation sets, with 20% of the data used for validation.
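The Saliency Value is defined elsewhere in this document; the sketch below uses the variance of the Laplacian as an assumed stand-in, and retains patches at equally spaced positions of the empirical Cumulative Distribution Function so that the kept Saliency Values are adequately separated.

```python
import cv2
import numpy as np

def saliency_value(patch):
    """Assumed stand-in for the Saliency Value: the variance of the
    Laplacian, a simple measure of local detail."""
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def select_patches(patches, n_keep):
    """Keep n_keep patches whose Saliency Values are adequately
    separated in the empirical CDF: one per equally spaced quantile."""
    order = np.argsort([saliency_value(p) for p in patches])
    picks = np.linspace(0, len(patches) - 1, n_keep).round().astype(int)
    return [patches[order[i]] for i in picks]
```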
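Continuing the pre-training sketch above, the following illustrates the fine-tuning configuration; the residual_blocks attribute, the placeholder data, and the L1 loss are assumptions tied to that sketch.

```python
import torch
from torch import nn

# Placeholder fine-tuning data: 64x64 inputs, 128x128 outputs (2x).
finetune_set = torch.utils.data.TensorDataset(
    torch.randn(40, 3, 64, 64), torch.randn(40, 3, 128, 128))
# 80/20 split into training and validation sets.
n_val = len(finetune_set) // 5
train_set, val_set = torch.utils.data.random_split(
    finetune_set, [len(finetune_set) - n_val, n_val])
train_loader = torch.utils.data.DataLoader(train_set, batch_size=4, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=4)

# Freeze the first four Residual Blocks; train the rest.
for block in model.residual_blocks[:4]:
    for p in block.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5, betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=15)
loss_fn = nn.L1Loss()   # loss choice is an assumption, as in pre-training

for epoch in range(200):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
    scheduler.step(val_loss)   # plateau detection drives the LR reduction
```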