Video Completion Benchmark

What’s new

30.09.16

The official opening.

Overview

Introduction

The VideoCompletion project introduces the first benchmark for video-completion methods. We present results for different methods on a range of diverse test sequences which are available for viewing on a player equipped with a movable zoom region. Additionally, we provide the results of an objective analysis using quality metrics that were carefully selected in our study on video-completion perceptual quality. We believe that our work can help rank existing methods and assist developers of new general-purpose video-completion methods.

Data set

Our current data set consists of 7 video sequences with ground-truth completion results. We consider object removal, so the test sequences are constructed by composing various foreground objects over a set of background videos. Some of these background videos include left-view sequences from the stereoscopic-video data set RMIT3dv [1]. As foreground objects we use those employed in the video-matting benchmark [2] as well as several 3D models. To seamlessly insert a 3D model in a background video we use Blender [3] motion-tracking tools. Each video-completion method takes the composited sequence and the corresponding object mask as input.

A 3D model inserted in the background video using motion tracking to construct a test sequence with ground truth.
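The compositing step can be illustrated with a minimal sketch. It assumes the standard "over" operator with a per-pixel alpha value and a mask defined as alpha above a threshold. The function names, the threshold, and the flat-list pixel representation are hypothetical, and the benchmark's actual pipeline may differ.

```python
def composite_over(foreground, alpha, background):
    """Blend foreground over background: C = alpha * F + (1 - alpha) * B."""
    return [a * f + (1.0 - a) * b
            for f, a, b in zip(foreground, alpha, background)]


def object_mask(alpha, threshold=0.0):
    """Mark every pixel the foreground object touches (assumed mask rule)."""
    return [a > threshold for a in alpha]


# Three pixels: fully background, half covered, fully foreground.
foreground = [200.0, 200.0, 200.0]
background = [100.0, 100.0, 100.0]
alpha = [0.0, 0.5, 1.0]

composited = composite_over(foreground, alpha, background)
mask = object_mask(alpha)
print(composited)  # [100.0, 150.0, 200.0]
print(mask)        # [False, True, True]
```

Under these assumptions the untouched background video serves as ground truth, and the composited frames together with the mask form the input to a completion method.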

Evaluation Methodology

Video-completion results are seldom expected to adhere to ground truth explicitly; they are usually judged only by their plausibility, as assessed by a human observer. This makes objective quality assessment of video completion inherently difficult. However, by relaxing the requirement of complete adherence to ground truth we can increase correlation with perceptual completion quality. This benchmark employs four quality metrics: MS-DSSIM, MS-DSSIMdt, MS-C_DSSIM, and MS-C_DSSIMdt. A thorough description and comparative analysis of these and other metrics can be found in our paper (to be published soon).

The MS-DSSIM metric measures the adherence of a completion result V to the ground-truth video V_ref in a multi-scale fashion, with scale weights determined using perceptual quality data. It is based on structural similarity index (SSIM) [4] values computed for all 9×9 luminance patches P(x) within the spatio-temporal hole $\Omega$.

$\mathrm{DSSIM}(V,V_{ref}) = \frac{1}{|\Omega|} \sum_{x \in \Omega} \Big(1 - \mathrm{SSIM}\big(P(x),P_{ref}(x)\big)\Big),\\ \mathrm{MS-DSSIM}(V,V_{ref}) = \sum_{i=0}^{M-1} w_{i}\cdot\mathrm{DSSIM}(V^i,V^i_{ref}),\\ [w_{i}] = [0.05, 0.12, 0.23, 0.30, 0.30].$

Here superscript i denotes the level of the Gaussian pyramid—that is, V_ref⁰ is the original ground-truth video, and V_ref¹ is the video blurred and subsampled by a factor of two in both spatial dimensions.
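The following sketch shows how the pieces of MS-DSSIM fit together. It is not the benchmark's implementation: it assumes unweighted patch statistics, the customary SSIM constants K1 = 0.01 and K2 = 0.03 for 8-bit luminance, and patches that have already been extracted at each pyramid level. All function names are hypothetical.

```python
# Scale weights for MS-DSSIM as quoted in the text (level 0 = full resolution).
MS_DSSIM_WEIGHTS = [0.05, 0.12, 0.23, 0.30, 0.30]


def ssim(p, q, dynamic_range=255.0):
    """SSIM of two equally sized luminance patches given as flat lists.

    Assumption: plain (unweighted) statistics over the patch and the
    customary stabilizing constants K1 = 0.01, K2 = 0.03.
    """
    n = len(p)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_p = sum(p) / n
    mu_q = sum(q) / n
    var_p = sum((a - mu_p) ** 2 for a in p) / n
    var_q = sum((b - mu_q) ** 2 for b in q) / n
    cov = sum((a - mu_p) * (b - mu_q) for a, b in zip(p, q)) / n
    numerator = (2 * mu_p * mu_q + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_q ** 2 + c1) * (var_p + var_q + c2)
    return numerator / denominator


def dssim(patch_pairs):
    """Mean of 1 - SSIM over (result, ground-truth) patch pairs in the hole."""
    return sum(1.0 - ssim(p, q) for p, q in patch_pairs) / len(patch_pairs)


def ms_dssim(pairs_per_level, weights=MS_DSSIM_WEIGHTS):
    """Weighted sum of per-level DSSIM values, one list of pairs per level."""
    return sum(w * dssim(pairs) for w, pairs in zip(weights, pairs_per_level))


# Toy data: a single 9x9 patch (81 values) reused at all five levels.
gradient = [float(i % 9) * 20.0 for i in range(81)]
flat = [80.0] * 81
identical = ms_dssim([[(gradient, gradient)]] * 5)
different = ms_dssim([[(gradient, flat)]] * 5)
print(identical, different)
```

Because the weights sum to one, a result that matches the ground truth at every level scores zero, while the structureless flat patch scores close to one.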

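The MS-DSSIMdt metric captures temporal coherency along ground-truth optical-flow vectors $s_x = (v_x, v_y, -1)$. Only a loss of coherency relative to the ground truth is penalized, since the difference is clamped at zero.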

$\mathrm{DSSIMdt}(V,V_{ref}) =\\ \frac{1}{|\Omega|} \sum_{x \in \Omega} \max\Big( \mathrm{SSIM}\big(P_{ref}(x),P_{ref}(x+s_x)\big)-\mathrm{SSIM}\big(P(x),P(x+s_x)\big),0\Big),\\ \mathrm{MS-DSSIMdt}(V,V_{ref}) = \sum_{i=0}^{M-1} w_{i}\cdot\mathrm{DSSIMdt}(V^i,V^i_{ref}),\\ [w_{i}] = [0.00, 0.00, 0.30, 0.32, 0.38].$
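A small sketch of the single-scale DSSIMdt term may help. It assumes the temporal SSIM values along the flow vectors have already been computed for the ground truth and for the result; the function name and input layout are hypothetical.

```python
def dssim_dt(ref_temporal_ssim, result_temporal_ssim):
    """Mean over hole positions of max(SSIM_ref - SSIM_result, 0).

    Each list holds, per position x, the SSIM between the patch at x and
    the patch at x + s_x (the position reached by following s_x).
    """
    terms = [max(r - v, 0.0)
             for r, v in zip(ref_temporal_ssim, result_temporal_ssim)]
    return sum(terms) / len(terms)


# Position 0: the result is more coherent than the ground truth (no penalty).
# Position 1: the result is less coherent (penalty of about 0.3).
reference = [0.9, 0.8]
result = [0.95, 0.5]
score = dssim_dt(reference, result)
print(round(score, 3))  # 0.15
```

The clamp means a result cannot earn credit for being smoother over time than the ground truth itself.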

The MS-C_DSSIM metric relies on the assumption that the completion result should be locally similar to the ground truth: each patch P(x) within the spatio-temporal hole $\Omega$ should have a similar ground-truth patch P_ref(y).

$\mathrm{C_{DSSIM}}(V,V_{ref}) = \frac{1}{|\Omega|} \sum_{x \in \Omega} \min_{y}\Big(1 - \mathrm{SSIM}\big(P(x),P_{ref}(y)\big)\Big),\\ \mathrm{MS-C_{DSSIM}}(V,V_{ref}) = \sum_{i=0}^{M-1} w_{i}\cdot\mathrm{C_{DSSIM}}(V^i,V^i_{ref}),\\ [w_{i}] = [0.04, 0.11, 0.21, 0.29, 0.35].$
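The difference between DSSIM and C_DSSIM can be sketched on a precomputed distance matrix. The example assumes that entry `dist[x][y]` holds 1 - SSIM between result patch x and ground-truth patch y, and that column x is the ground-truth patch co-located with result patch x. The matrix values and function names are made up for illustration.

```python
def dssim_from_matrix(dist):
    """DSSIM: compare each result patch with the co-located ground-truth patch."""
    return sum(dist[i][i] for i in range(len(dist))) / len(dist)


def c_dssim_from_matrix(dist):
    """C_DSSIM: compare each result patch with its best-matching ground-truth patch."""
    return sum(min(row) for row in dist) / len(dist)


# Rows: result patches; columns: candidate ground-truth patches.
dist = [
    [0.4, 0.1, 0.3],
    [0.2, 0.5, 0.6],
    [0.7, 0.3, 0.0],
]
strict = dssim_from_matrix(dist)
relaxed = c_dssim_from_matrix(dist)
print(round(strict, 3), round(relaxed, 3))  # 0.3 0.1
```

Whenever the co-located patch is among the candidates, the relaxed score cannot exceed the strict one, which is the sense in which the requirement of adherence to ground truth is relaxed.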

MS-C_DSSIMdt is a temporal-stability metric that uses the same assumption as MS-C_DSSIM. Essentially, it captures changes in patch appearance from frame to frame, whereas MS-DSSIMdt evaluates consistency with ground-truth optical flow. For a given patch we find the most similar patch in the previous frame within a certain window, compute the distance from each of these two patches to its most similar ground-truth patch, and then compare the two distances.

$\mathrm{C_{DSSIMdt}}(V,V_{ref}) =\\ \frac{1}{|\Omega|} \sum_{x \in \Omega} \left|\min_{y} \Big(1-\mathrm{SSIM}\big(P(x), P_{ref}(y)\big)\Big) - \min_{y} \Big(1-\mathrm{SSIM}\big(P(x_{prev}), P_{ref}(y)\big)\Big)\right|, \\ x_{prev} = \underset{y \in \Omega^{w\times w}_{prev}(x)}{\arg\min} \Big(1-\mathrm{SSIM}\big(P(x),P(y)\big)\Big),\\ \mathrm{MS-C_{DSSIMdt}}(V,V_{ref}) = \sum_{i=0}^{M-1} w_{i}\cdot\mathrm{C_{DSSIMdt}}(V^i,V^i_{ref}),\\ [w_{i}] = [0.00, 0.00, 0.30, 0.32, 0.38].$

Here $\Omega^{w\times w}_{prev}(x)$ denotes a square window of $w\times w$ pixels (we use $w$ equal to 1/10th of the frame width) spatially centered at $x$ and located in the previous frame.
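The single-scale C_DSSIMdt term can be sketched as follows, assuming all patch distances have been computed beforehand. The input layout and function name are hypothetical; a real implementation would search the $w\times w$ window itself.

```python
def c_dssim_dt(positions):
    """Mean absolute change in distance-to-ground-truth between x and x_prev.

    Each entry of `positions` is (d_x, candidates), where d_x is
    min_y (1 - SSIM(P(x), P_ref(y))) and `candidates` lists, for every
    location in the previous-frame window, the pair
    (1 - SSIM(P(x), P(candidate)), d_candidate).
    """
    total = 0.0
    for d_x, candidates in positions:
        # x_prev is the window candidate most similar to the current patch.
        _, d_prev = min(candidates, key=lambda c: c[0])
        total += abs(d_x - d_prev)
    return total / len(positions)


positions = [
    # x_prev is the second candidate (0.05); |0.10 - 0.12| = 0.02
    (0.10, [(0.30, 0.50), (0.05, 0.12)]),
    # x_prev is the first candidate (0.20); |0.40 - 0.10| = 0.30
    (0.40, [(0.20, 0.10), (0.60, 0.40)]),
]
score = c_dssim_dt(positions)
print(round(score, 3))  # 0.16
```

A patch that stays equally close to the ground truth from one frame to the next contributes nothing, regardless of how it moves.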

Exact computation of MS-C_DSSIM and MS-C_DSSIMdt quickly becomes impractical for larger spatio-temporal holes, so we resort to approximate solutions based on the PatchMatch [5] algorithm.

Participate

We invite developers of video-completion methods to use our benchmark. We can evaluate the submitted data and report quality scores to the developer. In cases where the developer specifically grants permission, we will publish the results on our site. The test sequences with the respective completion masks are available for download: Deck, Library, Fountain, Wires, Tower, Skyscrapers, Sign.

For evaluation requests, questions, or suggestions, please contact us by email: abokov@graphics.cs.msu.ru.

Evaluation

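Objective metric values

Lower values are better. The superscript on each value is the method's rank on that sequence, and the rank column is the mean of these ranks. The four six-row blocks correspond to the four quality metrics, presumably in the order they are introduced above (MS-DSSIM, MS-DSSIMdt, MS-C_DSSIM, MS-C_DSSIMdt).

Method	Mean rank	Deck	Library	Fountain	Wires	Tower	Skyscrapers	Sign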
Background Reconstruction⁺ [6]	1.9	0.221¹	0.192²	0.217⁴	0.070¹	0.090²	0.108²	0.083¹
PFClean Remove Rig⁺ [7]	2.7	0.307⁴	0.187¹	0.077¹	0.094²	0.143⁴	0.163⁴	0.106³
Planar Structure Guidanceⁱ [8]	5.4	0.318⁵	0.603⁵	0.682⁶	0.240⁶	0.177⁵	0.302⁵	0.438⁶
Nuke F_RigRemoval⁺ [9]	2.0	0.291²	0.211³	0.078²	0.120³	0.068¹	0.091¹	0.104²
Telea Inpaintingⁱ [10]	4.7	0.333⁶	0.623⁶	0.614⁵	0.206⁵	0.141³	0.133³	0.367⁵
Complex Scenes^m [11]	4.3	0.307³	0.252⁴	0.116³	0.162⁴	0.195⁶	0.355⁶	0.237⁴
Background Reconstruction⁺ [6]	1.7	0.013¹	0.005¹	0.036⁴	0.007¹	0.007²	0.009²	0.004¹
PFClean Remove Rig⁺ [7]	2.9	0.018³	0.007²	0.003¹	0.011³	0.013⁵	0.016⁴	0.009²
Planar Structure Guidanceⁱ [8]	6.0	0.156⁶	0.301⁶	0.458⁶	0.092⁶	0.067⁶	0.145⁶	0.186⁶
Nuke F_RigRemoval⁺ [9]	2.9	0.020⁵	0.012⁴	0.004²	0.012⁴	0.003¹	0.006¹	0.009³
Telea Inpaintingⁱ [10]	4.4	0.013²	0.092⁵	0.197⁵	0.016⁵	0.010⁴	0.019⁵	0.046⁵
Complex Scenes^m [11]	3.1	0.018⁴	0.009³	0.008³	0.011²	0.009³	0.016³	0.021⁴
Background Reconstruction⁺ [6]	2.0	0.118¹	0.056³	0.119⁴	0.040¹	0.055²	0.081²	0.043¹
PFClean Remove Rig⁺ [7]	2.4	0.141⁴	0.045¹	0.020¹	0.056²	0.088³	0.122⁴	0.046²
Planar Structure Guidanceⁱ [8]	5.9	0.200⁶	0.293⁶	0.409⁶	0.158⁶	0.118⁶	0.227⁵	0.288⁶
Nuke F_RigRemoval⁺ [9]	2.4	0.140³	0.075⁴	0.027²	0.072³	0.041¹	0.069¹	0.050³
Telea Inpaintingⁱ [10]	4.6	0.183⁵	0.286⁵	0.313⁵	0.130⁵	0.092⁴	0.102³	0.212⁵
Complex Scenes^m [11]	3.7	0.119²	0.053²	0.028³	0.090⁴	0.117⁵	0.278⁶	0.099⁴
Background Reconstruction⁺ [6]	1.9	0.015¹	0.007²	0.024⁴	0.009¹	0.009²	0.010²	0.007¹
PFClean Remove Rig⁺ [7]	2.0	0.017²	0.007¹	0.004¹	0.011²	0.015³	0.013³	0.009²
Planar Structure Guidanceⁱ [8]	6.0	0.071⁶	0.105⁶	0.128⁶	0.060⁶	0.039⁶	0.081⁶	0.091⁶
Nuke F_RigRemoval⁺ [9]	2.6	0.020⁴	0.015⁴	0.006²	0.014³	0.008¹	0.009¹	0.011³
Telea Inpaintingⁱ [10]	4.9	0.023⁵	0.065⁵	0.096⁵	0.027⁵	0.019⁵	0.016⁴	0.049⁵
Complex Scenes^m [11]	3.7	0.018³	0.011³	0.008³	0.018⁴	0.017⁴	0.023⁵	0.021⁴

⁺ Regions that were not reconstructed by the algorithm were filled afterwards using Telea image inpainting [10].

^m Owing to prohibitively high memory consumption, the test sequences were downscaled to 1280×720 resolution.

ⁱ Image-inpainting algorithms.
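The rank column in the table is consistent with the mean of the per-sequence ranks shown as superscripts. The sketch below reproduces this for values copied from the first block of the table; the helper names are hypothetical, and ties are broken by input order, which is an assumption.

```python
def ranks_ascending(values):
    """Rank values from 1 (lowest, best) upward; ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def mean_rank(ranks):
    """Average of a method's per-sequence ranks."""
    return sum(ranks) / len(ranks)


# "Wires" column of the first block, in table row order.
wires = [0.070, 0.094, 0.240, 0.120, 0.206, 0.162]
wires_ranks = ranks_ascending(wires)
print(wires_ranks)  # [1, 2, 6, 3, 5, 4]

# Superscript ranks of Background Reconstruction in the first block.
bgr_ranks = [1, 2, 4, 1, 2, 2, 1]
bgr_mean = round(mean_rank(bgr_ranks), 1)
print(bgr_mean)  # 1.9
```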

References

[1]	E. Cheng, P. Burton, J. Burton, A. Joseski, and I. Burnett. RMIT3DV: Pre-announcement of a creative commons uncompressed HD 3D video database. Fourth International Workshop on Quality of Multimedia Experience (QoMEX), pages 212–217, 2012.
[2]	M. Erofeev, Y. Gitman, D. Vatolin, A. Fedorov, and J. Wang. Perceptually Motivated Benchmark for Video Matting. British Machine Vision Conference (BMVC), pages 99.1–99.12, 2015.
[3]	Blender https://www.blender.org/
[4]	Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), pages 600–612, 2004.
[5]	C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 2009.
[6]	YUVSoft Background Reconstruction http://www.yuvsoft.com/stereo-3d-technologies/background-reconstruction/
[7]	Pixel Farm PFClean http://www.thepixelfarm.co.uk/pfclean/
[8]	J.-B. Huang, S. B. Kang, N. Ahuja, and J. Kopf. Image completion using planar structure guidance. ACM Transactions on Graphics (TOG), 2014.
[9]	The Foundry Nuke https://www.thefoundry.co.uk/products/nuke/
[10]	A. Telea. An image inpainting technique based on the fast marching method. Journal of graphics tools, pages 23–34, 2004.
[11]	A. Newson, A. Almansa, M. Fradet, Y. Gousseau, and P. Perez. Video inpainting of complex scenes. SIAM Journal on Imaging Sciences, pages 1993–2019, 2014.