Neural Network Approaches for Video Transmission, Motion Estimation and Error Concealment

Abstract

The work presented in this dissertation addresses three broad areas of video signal processing: video transmission, motion estimation and error concealment. In the first category, focused on the source side, we present two machine learning models for efficient content-aware resource allocation and packet prioritization for video transmission over shared, constrained, lossy networks. In the second and third categories, we present three deep learning models, operating on the destination side, for optical flow estimation and motion-compensated error concealment. The solutions presented in the first category are classical machine learning approaches, which can be viewed as single-layer implementations of simple neural networks.

In the category of resource allocation and packet prioritization, we begin by presenting an extension to an existing content-aware rate-distortion optimization (RDO) method. We modify the algorithm to evaluate the expected receiver distortion with a new metric on the source side, utilizing encoder information, transmission channel characteristics and motion compensation. Specifically, distinct video transmission units, called Group-of-Blocks (GOBs) or slices, are iteratively built at the source by considering various macroblock coding and error concealment modes for three different combinations of packet loss. The overall expected distortion of the reconstructed unit is then computed stochastically with the Structural SIMilarity (SSIM) metric, weighted by the probability of occurrence of each case. The proposed SSIM-based model provides a more accurate estimate of distortion, one that closely models quality as perceived through the human visual system. When incorporated into the content-aware utility function, experimental results show improved packet ordering and scheduling efficiency and higher overall quality of the video at the receiver.

Next, to provide a more complete distortion metric that accurately reflects the impact of packet loss in video, we enhance a metric defined for still images and propose a comprehensive measure of the overall impact on perceptual quality due to packet loss in predictively coded video frames. The proposed metric, referred to henceforth as Cumulative Distortion using Structural Similarity (CDSSIM), quantifies the total structural distortion resulting from the propagation of reconstruction errors to frames in the past and the future, in motion-estimated, bi-directionally predicted, intra- and inter-frame coded video sequences. Furthermore, to mitigate the cost of real-time computation of this metric, we implement a No-Reference (NR) sparse-regression machine learning model that predicts CDSSIM from pre-defined features present within the lost packet. The Least Absolute Shrinkage and Selection Operator (LASSO) technique, viewed as a single-layer neural network implementation, is applied to two video frame resolution formats with features extracted solely from the encoded transmission bitstream. Standardized statistical performance measures show that the feature-based model predicts the cumulative distortion to a high degree of accuracy. We further evaluate the results using a Quartile-Based Prioritization (QBP) scheme and demonstrate that the model offers an effective framework for predicting the proposed CDSSIM metric and prioritizing packets in real-time streaming video applications.
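
As a rough, illustrative sketch only (not the feature set, data, or configuration used in this work), the following Python fragment shows how a LASSO regressor could predict a CDSSIM-like score from bitstream-derived features and how the predictions could then be mapped to quartile-based packet priorities; the feature matrix and target values below are synthetic stand-ins.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Synthetic stand-ins: one row per lost packet, columns are features parsed
    # from the encoded bitstream (e.g., motion-vector statistics, residual
    # energy, frame type); real features would come from the bitstream parser.
    rng = np.random.default_rng(0)
    X_train, X_test = rng.normal(size=(800, 20)), rng.normal(size=(200, 20))
    # Stand-in target: in practice, the measured CDSSIM value for each lost packet.
    y_train = X_train[:, :5].sum(axis=1) + 0.1 * rng.normal(size=800)

    # LASSO: an L1-regularized linear model (a sparse single-layer network).
    model = Lasso(alpha=0.05).fit(X_train, y_train)
    cdssim_pred = model.predict(X_test)

    # Quartile-Based Prioritization (QBP): bucket packets by predicted distortion,
    # so the most damaging losses (top quartile) receive the highest priority.
    q1, q2, q3 = np.quantile(cdssim_pred, [0.25, 0.50, 0.75])
    priority = np.digitize(cdssim_pred, [q1, q2, q3])  # 0 (lowest) .. 3 (highest)
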
Turning to the works addressing motion estimation and error concealment at the destination, we present three “spatiotemporal” deep learning models built on different classes of neural networks: traditional Convolutional Neural Networks (CNNs), Convolutional Long Short-Term Memory modules (ConvLSTMs), and a newly introduced class of networks called Capsule Networks. Two of these models operate directly on texture and pixel intensities, while the third operates explicitly on motion field information extracted a priori from pairs of video frames. In the first of the three works under this umbrella, we explore the efficacy of multiple deep neural network architectures with conventional convolutional and fully connected layers, using either single-frame input, resembling still-image inpainting solutions, or stacked, temporally consecutive frames as input, which enables the network to implicitly learn the embedded motion and reconstruct the degraded portion of a future frame. The effectiveness of these architectures is evaluated by measuring the quality of the neural network reconstructions using standard metrics.

In the second part of our motion-compensated error concealment work, we present a novel optical flow prediction model using an adaptable deep neural network architecture for performing blind and non-blind error concealment in videos degraded by transmission packet loss. The two-stream network model is trained by separating the horizontal and vertical motion fields, each of which is passed through one of two similarly configured parallel pipelines that include traditional convolutional (Conv) and convolutional long short-term memory (ConvLSTM) layers. The ConvLSTM layers extract temporally correlated motion information, while the conventional convolutional layers correlate motion spatially. The outputs from the two pipelines are combined to form the overall predicted flow, which is used to reconstruct the degraded future video frame. In the non-blind error concealment mode, which requires knowledge of the exact location of the lost packet, slice-based datasets that include rows of macroblocks above and below the degraded portion from three previous optical flows are used to predict the missing portion. In the blind version, we take three previous full-frame optical flows to predict the flow of the entire degraded frame. In both models, motion-compensated reconstruction of the full video frame is performed using the predicted optical flow, and the degraded portion is replaced from the recovered frame. Unlike current models that perform optical flow “estimation” using information in the pixel/voxel domain, the proposed network uses optical flow inputs, obtained through a flow generation step, to “predict” a future flow. To design an end-to-end neural-network-based architecture, we utilize a state-of-the-art deep learning optical flow estimation model for the flow generation step; the generator can be easily interchanged, thereby increasing the adaptability of the overall two-pipeline model. The network is trained in supervised mode, and its performance is evaluated by comparing frames reconstructed with predicted flows from the proposed model against those reconstructed with the estimated “ground-truth” flows (obtained from the generator), using standard video quality metrics such as PSNR and SSIM.
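
The following is a minimal sketch, in Keras/TensorFlow, of the kind of two-stream Conv/ConvLSTM arrangement described above; the layer counts, filter widths, input resolution, and three-flow history length are illustrative assumptions rather than the network configuration actually used.

    from tensorflow.keras import layers, Model

    def flow_pipeline(inp):
        # ConvLSTM layers extract temporally correlated motion from the flow history
        x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(inp)
        x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=False)(x)
        # Conventional convolutional layers correlate the motion spatially
        x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
        return layers.Conv2D(1, 3, padding="same")(x)  # one predicted flow component

    # Assumed dimensions: three previous flows at 128x128, one channel per stream
    T, H, W = 3, 128, 128
    u_in = layers.Input(shape=(T, H, W, 1))  # horizontal motion-field history
    v_in = layers.Input(shape=(T, H, W, 1))  # vertical motion-field history

    # Two similarly configured parallel pipelines, one per motion component,
    # combined into the overall predicted flow for the next frame
    pred_flow = layers.Concatenate(axis=-1)([flow_pipeline(u_in), flow_pipeline(v_in)])
    model = Model([u_in, v_in], pred_flow)
    model.compile(optimizer="adam", loss="mse")  # supervised against generator flows
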
Furthermore, we evaluated the proposed architecture under real-world video transmission scenarios in which packet loss is random and can result in frames with overlapping losses. Specifically, we conducted tests that fed predicted flows, i.e., the output generated by the model, back into the proposed optical flow prediction network to produce “second-order” predictions. The quality of these predictions was assessed using the PSNR and SSIM metrics, which further corroborated the original results and provided additional insight into the efficacy of the training process.

Finally, we present a temporal Capsule Network architecture that encodes motion in videos as an instantiation parameter; the extracted motion is used to perform motion-compensated error concealment. We modify the original architecture and use a carefully curated dataset to enable the training of capsules spatially and temporally. First, we add the temporal dimension by taking co-located “patches” from three consecutive frames of standard video sequences to form input data “cubes”. Second, the network is designed with an initial feature-extraction layer that operates on all three dimensions to generate spatiotemporal features. Additionally, we implement the PrimaryCaps module with a recurrent layer, instead of a conventional convolutional layer, to extract short-term motion-related temporal dependencies and encode them as activation vectors in the capsule output. Finally, the capsule output is combined with the most recent past frame and passed through a fully connected reconstruction network to perform motion-compensated error concealment. We study the effectiveness of temporal capsules by comparing the proposed model with architectures that do not include capsules. Although the quality of the reconstruction leaves room for improvement, we successfully demonstrate that capsule-based architectures can be designed to operate in the temporal dimension and encode motion-related attributes as instantiation parameters. The accuracy of motion estimation is evaluated by comparing both the reconstructed frame outputs and the corresponding optical-flow estimates with ground-truth data.
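
For reference, reconstruction quality in these evaluations is reported with PSNR and SSIM; a simple frame-level comparison of that kind could be computed with scikit-image as sketched below (the frames here are placeholder arrays, not data from these experiments).

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def evaluate_reconstruction(reference, reconstructed):
        # PSNR and SSIM for one 8-bit grayscale frame against its reference;
        # for RGB frames, pass channel_axis=-1 to structural_similarity.
        psnr = peak_signal_noise_ratio(reference, reconstructed, data_range=255)
        ssim = structural_similarity(reference, reconstructed, data_range=255)
        return psnr, ssim

    # Placeholder frames standing in for ground-truth and concealed output
    ref_frames = [np.full((128, 128), 128, dtype=np.uint8)]
    rec_frames = [np.full((128, 128), 127, dtype=np.uint8)]
    scores = [evaluate_reconstruction(r, p) for r, p in zip(ref_frames, rec_frames)]
    print("mean PSNR:", np.mean([s[0] for s in scores]),
          "mean SSIM:", np.mean([s[1] for s in scores]))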
