This chapter specifies the steps enabling a User to design a neural network that up-samples a video sequence to a resolution higher than the current one:
- Selection of video sequences for use in the development of the Training Dataset.
- Creation of the Training Dataset.
- Selection of Training Approach.
- Selection of the NN Architecture (initial architecture and modification).
- Training of the Neural Network (e.g., choosing the number of epochs for SDtoHD).
Data Preparation
Assuming a target resolution of m rows by n columns, the training dataset consists of pairs of input frames of resolution m/2 by n/2 and output frames of resolution m by n. If the input frames are not available, they may be obtained from the output frames by applying a down-sampling filter, as sketched below.
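The following minimal sketch shows how input frames may be derived when only output frames are available. The choice of bicubic interpolation is an assumption, since the down-sampling filter is not specified here.

```python
import cv2

def make_input_frame(output_frame):
    """Derive an (m/2 x n/2) input frame from an (m x n) output frame.
    Bicubic interpolation is an assumption; any suitable down-sampling
    filter may be used instead."""
    m, n = output_frame.shape[:2]
    # cv2.resize takes the destination size as (width, height).
    return cv2.resize(output_frame, (n // 2, m // 2),
                      interpolation=cv2.INTER_CUBIC)
```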
To reduce the computing time required for training, as well as to overcome memory management issues, patches extracted from the input and output frames may be used. The resolutions of the patches are h/2 by k/2 and h by k for the input and output patches, respectively. The number of patches extracted from a frame shall be appropriately smaller than the total number of patches in the frame, and h and k shall be appropriately smaller than m and n, respectively.
Patches may be extracted with different methods, e.g., randomly or based on features, as in the sketch below.
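As an illustration, the sketch below extracts one randomly positioned pair of co-located patches; h and k are assumed even so that the corresponding input-frame coordinates are integers.

```python
import numpy as np

def extract_patch_pair(input_frame, output_frame, h, k, rng=None):
    """Extract one aligned pair: an (h/2 x k/2) input patch and the
    co-located (h x k) output patch. h and k are assumed even."""
    if rng is None:
        rng = np.random.default_rng()
    m, n = output_frame.shape[:2]
    # Pick an even top-left corner in the output frame so that the
    # co-located input-frame coordinates are integral.
    r = int(rng.integers(0, (m - h) // 2 + 1)) * 2
    c = int(rng.integers(0, (n - k) // 2 + 1)) * 2
    out_patch = output_frame[r:r + h, c:c + k]
    in_patch = input_frame[r // 2:r // 2 + h // 2, c // 2:c // 2 + k // 2]
    return in_patch, out_patch
```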
To ensure that the trained filter is applicable to a wide range of video material beyond that used for training, Augmentation may be used: the size of the training dataset is increased by transforming patches or frames, e.g., by rotating, adding noise, or mirroring, as in the sketch below.
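A sketch of such Augmentation follows. The same geometric transform is applied to both patches of a pair to keep them aligned; adding the noise to the input patch only is an assumption about where noise would be applied.

```python
import numpy as np

def augment_pair(in_patch, out_patch, rng=None):
    """Randomly rotate by a multiple of 90 degrees, optionally mirror,
    and optionally add noise. Rotation assumes square patches."""
    if rng is None:
        rng = np.random.default_rng()
    turns = int(rng.integers(0, 4))          # 0, 90, 180 or 270 degrees
    in_patch, out_patch = np.rot90(in_patch, turns), np.rot90(out_patch, turns)
    if rng.random() < 0.5:                   # horizontal flipping
        in_patch, out_patch = np.fliplr(in_patch), np.fliplr(out_patch)
    if rng.random() < 0.5:                   # vertical flipping
        in_patch, out_patch = np.flipud(in_patch), np.flipud(out_patch)
    if rng.random() < 0.5:                   # noise on the input patch only
        in_patch = in_patch + rng.normal(0.0, 1.0, in_patch.shape)
    return in_patch, out_patch
```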
Training
Although training can start either from an untrained model or from an already trained one, the latter provides better results: a model pre-trained using the method specified below is then fine-tuned.
The pre-training method is performed with the following process:
- The pre-training set shall have a size of at least.
- The images are diversified through data Augmentation with the following process:
- Selection of square patches.
- Each patch is randomly changed by applying one or more of the following:
- Rotations by multiples of 90°.
- Horizontal flipping.
- Vertical flipping.
- The pre-training uses the following (see the sketch after this list):
- Batch size of 4.
- Backpropagation algorithm according to ADAM with default parameters β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸.
- The learning rate is initially fixed to 10⁻⁴ and then halved after every 24 iterations.
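A minimal PyTorch sketch of this pre-training configuration follows. TinyUpsampler is a placeholder for the selected NN Architecture, the random tensors stand in for the pre-training set, and the L1 loss is an assumption, since the loss function is not specified here.

```python
import torch
from torch import nn

class TinyUpsampler(nn.Module):
    """Placeholder for the selected NN Architecture: Residual Blocks
    followed by a 2x PixelShuffle up-sampler. Purely illustrative."""
    def __init__(self, n_blocks=8, ch=16):
        super().__init__()
        self.head = nn.Conv2d(3, ch, 3, padding=1)
        self.residual_blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(ch, ch, 3, padding=1))
            for _ in range(n_blocks))
        self.tail = nn.Sequential(nn.Conv2d(ch, 12, 3, padding=1),
                                  nn.PixelShuffle(2))

    def forward(self, x):
        x = self.head(x)
        for block in self.residual_blocks:
            x = x + block(x)          # residual connection
        return self.tail(x)

model = TinyUpsampler()
# Placeholder pre-training data: 32x32 inputs paired with 64x64 targets.
pretrain_set = torch.utils.data.TensorDataset(
    torch.randn(64, 3, 32, 32), torch.randn(64, 3, 64, 64))
loader = torch.utils.data.DataLoader(pretrain_set, batch_size=4, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
# Halve the learning rate every 24 iterations, stepping once per batch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=24, gamma=0.5)
loss_fn = nn.L1Loss()   # the training loss is not specified; L1 is an assumption

for inputs, targets in loader:
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()
    optimizer.step()
    scheduler.step()    # one scheduler step per iteration
```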
The fine-tuning is performed with the following process (sketches of the patch selection and of the training configuration follow the list):
- Select a fine-tuning dataset for the specific application domain, e.g., in the case of video applications, encoded and decoded video sequences.
- Compute the Saliency Value of each patch.
- Retain a patch if its Saliency Value is adequately separated in the Cumulative Distribution Function of the Saliency Values.
- Augment the dataset size by randomly changing the patch, applying one or more of the following:
- Rotations by multiples of 90°.
- Horizontal flipping.
- Vertical flipping.
- The first four Residual Blocks are frozen while the rest of the Residual Blocks are trained.
- The fine-tuning is applied for 200 epochs using a batch size of 4.
- The learning rate is initially set to 10⁻⁵ and then reduced during training with a ReduceLROnPlateau scheduler with patience 15 and a learning-rate factor of 0.5.
- The ADAM optimization is used with initial parameters β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸.
- The extracted pairs of patches for the training set have a size of 64×64 pixels for the input and 128×128 pixels for the output (2x up-sampling).
- The dataset is split into training and validation sets, with 20% of the data used for validation.
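The Saliency Value is defined elsewhere in this document; the sketch below uses the variance of the Laplacian as an assumed stand-in, and retains patches at equally spaced positions of the empirical Cumulative Distribution Function so that the kept Saliency Values are adequately separated.

```python
import cv2
import numpy as np

def saliency_value(patch):
    """Assumed stand-in for the Saliency Value: the variance of the
    Laplacian, a simple measure of local detail."""
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def select_patches(patches, n_keep):
    """Keep n_keep patches whose Saliency Values are adequately
    separated in the empirical CDF: one per equally spaced quantile."""
    order = np.argsort([saliency_value(p) for p in patches])
    picks = np.linspace(0, len(patches) - 1, n_keep).round().astype(int)
    return [patches[order[i]] for i in picks]
```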
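Continuing the pre-training sketch above, the following illustrates the fine-tuning configuration; the residual_blocks attribute, the placeholder data, and the L1 loss are assumptions tied to that sketch.

```python
import torch
from torch import nn

# Placeholder fine-tuning data: 64x64 inputs, 128x128 outputs (2x).
finetune_set = torch.utils.data.TensorDataset(
    torch.randn(40, 3, 64, 64), torch.randn(40, 3, 128, 128))
# 80/20 split into training and validation sets.
n_val = len(finetune_set) // 5
train_set, val_set = torch.utils.data.random_split(
    finetune_set, [len(finetune_set) - n_val, n_val])
train_loader = torch.utils.data.DataLoader(train_set, batch_size=4, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=4)

# Freeze the first four Residual Blocks; train the rest.
for block in model.residual_blocks[:4]:
    for p in block.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-5, betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=15)
loss_fn = nn.L1Loss()   # loss choice is an assumption, as in pre-training

for epoch in range(200):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
    scheduler.step(val_loss)   # plateau detection drives the LR reduction
```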