VideoGigaGAN: Towards Detail-rich Video Super-Resolution

VideoGigaGAN is a new generative video super-resolution (VSR) model that can produce videos with high-frequency details and temporal consistency. It builds upon a large-scale image upsampler, GigaGAN, and introduces techniques that significantly improve the temporal consistency of upsampled videos.

8× Upsampling results (128×128→1024×1024)

Our model can upsample a video by up to 8× while producing rich details.

Abstract

Video restoration techniques have made impressive progress in maintaining temporal consistency during the upsampling of videos. However, these approaches often produce results that are blurrier than their image-based counterparts because their generative capability is limited. This raises a fundamental question: can the success of generative image upsamplers be extended to video super-resolution (VSR) tasks while preserving temporal consistency? To address this challenge, we introduce a new generative VSR model, named VideoGigaGAN, which can produce videos with high-frequency details and consistent temporal behavior. VideoGigaGAN builds upon a large-scale image upsampler, GigaGAN, as its foundation. However, simply incorporating temporal modules into the GigaGAN architecture to create a video model leads to severe temporal flickering artifacts. To overcome this issue, we identify several key problems and develop techniques that significantly improve the temporal consistency of the upsampled videos. Our experiments demonstrate that, unlike previous VSR methods, VideoGigaGAN generates temporally consistent videos with more fine-grained visual details. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with up to 8× super-resolution.

The key contributions of our work are:

1. The introduction of VideoGigaGAN, a novel generative VSR model that can produce videos with high-frequency details and temporal consistency.
2. The identification of critical issues that cause temporal flickering in video upsampling and the proposal of techniques to address them.
3. Comprehensive experiments demonstrating the superiority of VideoGigaGAN over existing state-of-the-art VSR methods in terms of visual quality and temporal stability.

Method Overview

Our video super-resolution (VSR) model is built upon the asymmetric U-Net architecture of the GigaGAN image upsampler. To enforce temporal consistency, we first inflate the image upsampler into a video upsampler by adding temporal attention layers to the decoder blocks. We further enhance consistency by incorporating features from a flow-guided propagation module. To suppress aliasing artifacts, we use anti-aliasing (BlurPool) blocks in the downsampling layers of the encoder. Finally, we directly shuttle high-frequency features through skip connections to the decoder layers, compensating for the loss of detail caused by the BlurPool operation.

The key aspects of our VSR model architecture are:

1. Asymmetric U-Net structure inherited from the GigaGAN image upsampler as the foundation.
2. Incorporation of temporal attention layers in the decoder to enforce temporal consistency.
3. Integration of flow-guided propagation features to further enhance temporal coherence.
4. Use of anti-aliasing blocks in the encoder's downsampling layers to suppress aliasing artifacts.
5. Direct feature shuttling via skip connections to compensate for detail loss during BlurPool operations.

This multi-pronged approach enables our VSR model to generate temporally consistent videos with high-frequency visual details, outperforming previous state-of-the-art methods.
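The interplay between anti-aliased downsampling and the high-frequency shuttle can be illustrated with a toy 1-D example. This is only a sketch under stated assumptions, not the model's implementation: it operates on a plain list instead of 2-D feature maps, it assumes a fixed [1, 2, 1] / 4 low-pass kernel with reflect padding, and it assumes the high-frequency component is the residual between a signal and its blurred version. All function names are hypothetical.

```python
# Toy 1-D illustration of BlurPool-style downsampling and a high-frequency
# residual. Names, kernel, and padding are assumptions made for this sketch.

def blur(signal):
    """Low-pass filter with a [1, 2, 1] / 4 kernel and reflect padding."""
    n = len(signal)
    if n < 2:
        raise ValueError("signal must contain at least two samples")
    out = []
    for i in range(n):
        left = signal[i - 1] if i > 0 else signal[1]
        right = signal[i + 1] if i < n - 1 else signal[n - 2]
        out.append((left + 2 * signal[i] + right) / 4)
    return out

def naive_downsample(signal):
    """Stride-2 subsampling with no pre-filter (prone to aliasing)."""
    return signal[::2]

def blurpool_downsample(signal):
    """Low-pass filter first, then stride-2 subsampling."""
    return blur(signal)[::2]

def high_frequency(signal):
    """Residual removed by the blur; a skip connection could carry this."""
    low = blur(signal)
    return [s - l for s, l in zip(signal, low)]

# A signal oscillating at the highest representable frequency.
alternating = [1.0, -1.0] * 4

# Without a pre-filter the oscillation aliases into a constant signal.
naive = naive_downsample(alternating)      # [1.0, 1.0, 1.0, 1.0]

# With the blur the oscillation is removed instead of being aliased.
pooled = blurpool_downsample(alternating)  # [0.0, 0.0, 0.0, 0.0]

# The detail removed by the blur is preserved in the residual.
detail = high_frequency(alternating)       # equals the input here
```

In this toy case the blur removes the oscillation entirely, which avoids aliasing but also discards the detail. The residual keeps exactly that discarded part, which is the motivation for passing high-frequency features to the decoder through skip connections.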