The MPAI EEV group has reached another milestone toward a fully neural-network-based video codec. The objective and vision of MPAI EEV is to bridge "traditional" video coding and neural video coding. The latest version of the MPAI EEV reference model, EEV-0.6, has completed its research phase and been finalized. This model reflects state-of-the-art (SOTA) neural video coding technology in the area of video compression based on bi-directional motion-compensated prediction.
Ultra-high-resolution videos contain large quantities of fine-grained textural detail, complicated motion dynamics, and significant signal-intensity variation, affecting both local and global content modelling. In conventional frameworks, coding ultra-high-resolution (UHD) videos has long been a challenging task for several reasons: the large computational load, the difficulty of modelling diverse content with a unified compact representation, and the limited capacity of models to cover a wide range of motion characteristics. Conventional coding tools have achieved only limited gains in rate-distortion efficiency on such content. As a result, previous solutions tend to rely on a simple combination of coding methods with different functionalities to cover the distinct difficulties of UHD video coding.
Prior EEV models studied coding methods under the low-delay configuration. Significant coding gains were obtained, and the EEV models outperformed conventional video coding standards.
Exploiting bi-directional context prediction has long been recognized as a key direction for improving compression efficiency in neural video coding. Since EEV-0.5, major attention has been paid to B-frame-based end-to-end video coding. However, existing neural B-frame codecs still exhibit limited performance gains, particularly on high-resolution videos with large motion, where optical flow estimation becomes unreliable and evenly balanced prediction fusion introduces distortions. To address these challenges, EEV-0.6 presents the first high-resolution bi-directional neural video coding method, termed HR-NVC, which non-uniformly integrates confidence-guided predictive cues from both temporal directions to achieve more reliable and efficient compression. Specifically, EEV-0.6 designs Spatio-Temporal Anchored Motion Estimation, which introduces virtual anchor frames and low-resolution priors to significantly improve estimation robustness under large displacements. This is followed by a novel Hierarchical Motion Representation that combines multi-scale motion with temporal references, enabling compact and adaptive modelling of motion reliability across resolutions. EEV-0.6 further presents a Bi-Contextual Asymmetric Harmonization module that performs confidence-guided fusion of bi-directional references, effectively suppressing unreliable contexts and restoring structural consistency near occlusion and scene-transition regions.
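To give an intuition for confidence-guided, non-uniform fusion of the two temporal directions, the sketch below blends forward and backward predictions with per-pixel confidence weights. It is a minimal illustration only, not the actual EEV-0.6 module: the function name `fuse_bidirectional` and the normalized-weight scheme are assumptions for demonstration; the real Bi-Contextual Asymmetric Harmonization module is a learned network.

```python
import numpy as np

def fuse_bidirectional(pred_fwd, pred_bwd, conf_fwd, conf_bwd, eps=1e-6):
    """Illustrative asymmetric fusion: weight each direction's prediction
    by its (normalized) per-pixel confidence, so unreliable contexts
    (e.g. near occlusions) contribute less to the fused frame.

    pred_*: H x W x C prediction arrays from each temporal direction.
    conf_*: H x W x 1 non-negative confidence maps (broadcast over channels).
    """
    w_fwd = conf_fwd / (conf_fwd + conf_bwd + eps)  # normalize to [0, 1]
    return w_fwd * pred_fwd + (1.0 - w_fwd) * pred_bwd
```

When one direction's confidence drops to zero (e.g. a region occluded in that reference), the fused result falls back entirely on the other direction instead of averaging in an unreliable prediction.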
By introducing the concept of a trustworthy factor for generating enhanced bi-directional predictors, EEV-0.6 adopts a self-contained, robust compression architecture that supports scaled-hierarchical B-frame inter prediction and contextual coding. Moreover, the motion representation scheme has been extensively studied to arrive at a highly efficient representation. This design not only enables flexible and versatile adaptation to diverse motion characteristics but also supports fine-grained quality adjustment of bi-directional prediction accuracy.
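The hierarchical B-frame structure underlying such inter prediction can be sketched as a recursive bisection of a group of pictures (GOP): the two anchors are coded first, then the middle frame as a B-frame with references on both sides, and so on. The helper below is a generic illustration of this coding order, not EEV-0.6's actual scheduler; the function name and GOP handling are assumptions.

```python
def hierarchical_coding_order(gop_size):
    """Return the coding order of frame indices 0..gop_size for a
    hierarchical B-frame GOP: anchors first, then recursively the
    midpoint of each interval, which is predicted bi-directionally
    from the two already-coded frames that bracket it."""
    order = [0, gop_size]  # the two anchor frames are coded first

    def split(lo, hi):
        if hi - lo < 2:
            return          # no frame left between the two references
        mid = (lo + hi) // 2
        order.append(mid)   # B-frame referencing frames lo and hi
        split(lo, mid)      # recurse into the two sub-intervals
        split(mid, hi)

    split(0, gop_size)
    return order

# Example: an 8-frame GOP is coded as 0, 8, 4, 2, 1, 3, 6, 5, 7,
# so every B-frame has a coded reference on each side.
```

Frames deeper in this hierarchy sit between closer references, which is one place a scaled design can spend fewer bits or tolerate lower prediction confidence.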
Regarding rate-distortion efficiency, EEV-0.6 achieves SOTA compression performance and outperforms related neural models as well as conventional video coding standards such as H.266/VVC and H.265/HEVC, marking a solid milestone for research on neural video coding and its associated standardization activities. Notably, the model is the first end-to-end-optimized video codec evaluated on 4K-resolution videos, establishing a new benchmark for higher-resolution NVC and achieving state-of-the-art performance among neural B-frame codecs. The enabling technology behind EEV-0.6 has been accepted to this year's CVPR conference as a highlight presentation and will be made publicly available in the future.