|30.09.16||The official opening.|
The VideoCompletion project introduces the first benchmark for video-completion methods. We present results for a range of methods on diverse test sequences, which can be viewed in a player equipped with a movable zoom region. Additionally, we provide the results of an objective analysis using quality metrics that were carefully selected in our study of video-completion perceptual quality. We believe that our work can help rank existing methods and assist developers of new general-purpose video-completion methods.
Our current data set consists of 7 video sequences with ground-truth completion results. Because we consider object removal, the test sequences are constructed by compositing various foreground objects over a set of background videos. Some of these background videos are left-view sequences from the stereoscopic-video data set RMIT3DV. As foreground objects we use those employed in the video-matting benchmark as well as several 3D models. To seamlessly insert a 3D model into a background video we use Blender's motion-tracking tools. Each video-completion method takes the composited sequence and the corresponding object mask as input.
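The construction of a test pair can be sketched as plain alpha compositing; the function names and toy frame below are illustrative, not the benchmark's actual tooling:

```python
import numpy as np

def composite(foreground, background, alpha):
    """Alpha-blend a foreground object over a background frame.
    foreground/background: float arrays in [0, 1] of shape (H, W, 3);
    alpha: matte in [0, 1] of shape (H, W, 1), 1 = fully foreground.
    The clean background frame doubles as the ground-truth completion."""
    return alpha * foreground + (1.0 - alpha) * background

def object_mask(alpha):
    """Mask handed to a completion method: every pixel the foreground
    touches at all must be filled in."""
    return alpha[..., 0] > 0.0

# Toy 2x2 frame: a black object covers the left column.
background = np.full((2, 2, 3), 0.8)
foreground = np.zeros((2, 2, 3))
alpha = np.array([[[1.0], [0.0]],
                  [[1.0], [0.0]]])

frame = composite(foreground, background, alpha)
mask = object_mask(alpha)
```

Because the composited object hides a known background, the benchmark can later compare each method's fill against that background directly.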
Video-completion results are seldom expected to adhere to ground truth explicitly; they are usually judged only by their plausibility, as assessed by a human observer. This makes objective quality assessment of video completion an inherently difficult problem. By relaxing the requirement of complete adherence to ground truth, however, we can increase a metric's correlation with perceived completion quality. This benchmark employs four quality metrics: MS-DSSIM, MS-DSSIMdt, MS-CDSSIM and MS-CDSSIMdt. A thorough description and comparative analysis of these and other metrics can be found in our paper (to be published soon).
The MS-DSSIM metric measures the adherence of a completion result V to the ground-truth video Vref in a multi-scale fashion, with scale weights determined using perceptual-quality data. It is based on structural similarity index (SSIM) values computed for all 9×9 luminance patches P(x) within the spatio-temporal hole.
Here the superscript i denotes the level of the Gaussian pyramid: Vref0 is the original ground-truth video, and Vref1 is that video blurred and subsampled by a factor of two in both spatial dimensions.
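A rough single-frame sketch of this computation follows; the box-filter pyramid, uniform scale weights, and the (1 − SSIM)/2 aggregation are our simplifications of the benchmark's perceptually tuned version, which also aggregates over the temporal extent of the hole:

```python
import numpy as np

def ssim_patch(p, q, c1=0.01**2, c2=0.03**2):
    """SSIM between two same-size luminance patches in [0, 1]."""
    mp, mq = p.mean(), q.mean()
    vp, vq = p.var(), q.var()
    cov = ((p - mp) * (q - mq)).mean()
    return ((2 * mp * mq + c1) * (2 * cov + c2)) / \
           ((mp ** 2 + mq ** 2 + c1) * (vp + vq + c2))

def downsample(v):
    """Blur and subsample by two in both spatial dimensions
    (a box filter stands in for the Gaussian blur)."""
    h, w = v.shape[0] // 2 * 2, v.shape[1] // 2 * 2
    v = v[:h, :w]
    return 0.25 * (v[0::2, 0::2] + v[1::2, 0::2] +
                   v[0::2, 1::2] + v[1::2, 1::2])

def ms_dssim_frame(v, vref, hole, levels=3, patch=9):
    """Average DSSIM over all patches centred in the hole, summed
    across pyramid levels with uniform (placeholder) weights."""
    r = patch // 2
    score = 0.0
    for _ in range(levels):
        dssim = []
        for y, x in zip(*np.nonzero(hole)):
            if r <= y < v.shape[0] - r and r <= x < v.shape[1] - r:
                dssim.append((1.0 - ssim_patch(v[y-r:y+r+1, x-r:x+r+1],
                                               vref[y-r:y+r+1, x-r:x+r+1])) / 2.0)
        if dssim:
            score += np.mean(dssim) / levels
        v, vref = downsample(v), downsample(vref)
        hole = downsample(hole.astype(float)) > 0.5
    return float(score)
```

A completion identical to the ground truth scores exactly zero, and any distortion inside the hole raises the score.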
The MS-DSSIMdt metric captures temporal coherence along ground-truth optical-flow vectors.
MS-CDSSIM relies on the assumption that the completion result should be locally similar to the ground truth: each patch P(x) within the spatio-temporal hole should have a similar ground-truth patch Pref(y), though not necessarily at the same location.
MS-CDSSIMdt is a temporal-stability metric that rests on the same assumption as MS-CDSSIM. Rather than evaluating consistency with ground-truth optical flow, as MS-DSSIMdt does, it captures changes in patch appearance from frame to frame: for a given patch we find the most similar patch in the previous frame within a certain window, compute the distance from each of these two patches to its most similar ground-truth patch, and then compare the two distances.
The search window is a square region of pixels (with side length equal to 1/10th of the frame width) spatially centered at the given patch's position and located in the previous frame.
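For a single patch, this comparison can be sketched as follows; the patch radius, window size, and squared-difference patch distance are toy simplifications (the benchmark uses SSIM-based distances, a window of 1/10th the frame width, and PatchMatch for the ground-truth searches):

```python
import numpy as np

def patch_dist(a, b):
    """Mean squared patch distance (a stand-in for the SSIM-based
    distance the benchmark uses)."""
    return float(((a - b) ** 2).mean())

def nearest_dist(patch, frame, r):
    """Distance from `patch` to the most similar same-size patch
    anywhere in `frame` (brute force)."""
    H, W = frame.shape
    return min(patch_dist(patch, frame[y:y + 2*r + 1, x:x + 2*r + 1])
               for y in range(H - 2 * r)
               for x in range(W - 2 * r))

def best_match(patch, frame, cy, cx, half, r):
    """Centre of the most similar patch inside a square window of
    half-width `half` around (cy, cx)."""
    H, W = frame.shape
    best, arg = np.inf, (cy, cx)
    for y in range(max(r, cy - half), min(H - r, cy + half + 1)):
        for x in range(max(r, cx - half), min(W - r, cx + half + 1)):
            dc = patch_dist(patch, frame[y - r:y + r + 1, x - r:x + r + 1])
            if dc < best:
                best, arg = dc, (y, x)
    return arg

def cdssim_dt_term(cur, prev, gt_cur, gt_prev, cy, cx, r=2, half=4):
    """One per-patch term of the temporal-stability score: match the
    patch at (cy, cx) to the previous completed frame, measure how far
    each of the two patches is from its most similar ground-truth
    patch, and compare the two distances."""
    p_cur = cur[cy - r:cy + r + 1, cx - r:cx + r + 1]
    yb, xb = best_match(p_cur, prev, cy, cx, half, r)
    p_prev = prev[yb - r:yb + r + 1, xb - r:xb + r + 1]
    return abs(nearest_dist(p_cur, gt_cur, r) - nearest_dist(p_prev, gt_prev, r))

# A perfect completion reproduces the ground truth, so both patches
# find exact ground-truth matches and the term vanishes.
rng = np.random.default_rng(2)
gt_cur, gt_prev = rng.random((16, 16)), rng.random((16, 16))
term = cdssim_dt_term(gt_cur, gt_prev, gt_cur, gt_prev, cy=8, cx=8)
```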
Exact computation of MS-CDSSIM and MS-CDSSIMdt quickly becomes impractical for larger spatio-temporal holes, so we resort to approximate solutions based on the PatchMatch algorithm.
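For reference, here is a minimal single-image sketch of the PatchMatch idea: random initialisation followed by alternating propagation and an exponentially shrinking random search. The benchmark's implementation differs in that it works on spatio-temporal patches with SSIM-based distances; squared difference is used below:

```python
import numpy as np

def patch_ssd(src, ref, y, x, yr, xr, r):
    """Mean squared distance between the (2r+1)-sized patches centred
    at (y, x) in src and (yr, xr) in ref."""
    a = src[y - r:y + r + 1, x - r:x + r + 1]
    b = ref[yr - r:yr + r + 1, xr - r:xr + r + 1]
    return float(((a - b) ** 2).mean())

def patchmatch_nnf(src, ref, r=3, iters=4, seed=0):
    """Approximate nearest-neighbour field from src patches to ref patches."""
    rng = np.random.default_rng(seed)
    H, W = src.shape
    Hr, Wr = ref.shape
    yy = rng.integers(r, Hr - r, (H, W))   # random initial matches
    xx = rng.integers(r, Wr - r, (H, W))
    d = np.full((H, W), np.inf)
    for y in range(r, H - r):
        for x in range(r, W - r):
            d[y, x] = patch_ssd(src, ref, y, x, yy[y, x], xx[y, x], r)
    for it in range(iters):
        step = 1 if it % 2 == 0 else -1     # alternate scan direction
        coords = [(y, x) for y in range(r, H - r) for x in range(r, W - r)]
        for y, x in coords[::step]:
            # propagation: adopt a neighbour's match, shifted by one pixel
            for dy, dx in ((step, 0), (0, step)):
                yn, xn = y - dy, x - dx
                if r <= yn < H - r and r <= xn < W - r:
                    yc = min(max(yy[yn, xn] + dy, r), Hr - r - 1)
                    xc = min(max(xx[yn, xn] + dx, r), Wr - r - 1)
                    dc = patch_ssd(src, ref, y, x, yc, xc, r)
                    if dc < d[y, x]:
                        yy[y, x], xx[y, x], d[y, x] = yc, xc, dc
            # random search around the current best, radius halving each probe
            rad = max(Hr, Wr)
            while rad >= 1:
                yc = min(max(yy[y, x] + rng.integers(-rad, rad + 1), r), Hr - r - 1)
                xc = min(max(xx[y, x] + rng.integers(-rad, rad + 1), r), Wr - r - 1)
                dc = patch_ssd(src, ref, y, x, yc, xc, r)
                if dc < d[y, x]:
                    yy[y, x], xx[y, x], d[y, x] = yc, xc, dc
                rad //= 2
    return yy, xx, d
```

Each update only ever replaces a match with a strictly better one, so a few iterations drive the field's distances monotonically down at a small fraction of the cost of an exhaustive search.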
We invite developers of video-completion methods to use our benchmark. We can evaluate the submitted data and report quality scores to the developer. In cases where the developer specifically grants permission, we will publish the results on our site. The test sequences with the respective completion masks are available for download: Deck, Library, Fountain, Wires, Tower, Skyscrapers, Sign.
For evaluation requests, questions, or suggestions, please feel free to contact us by email: firstname.lastname@example.org.
|Objective metric values|
Lower scores are better; the parenthesized value is the method's rank on that sequence, and the Avg. rank column averages these ranks over all seven sequences.

|MS-DSSIM|
|Method|Avg. rank|Deck|Library|Fountain|Wires|Tower|Skyscrapers|Sign|
|Background Reconstruction +|1.9|0.221 (1)|0.192 (2)|0.217 (4)|0.070 (1)|0.090 (2)|0.108 (2)|0.083 (1)|
|PFClean Remove Rig +|2.7|0.307 (4)|0.187 (1)|0.077 (1)|0.094 (2)|0.143 (4)|0.163 (4)|0.106 (3)|
|Planar Structure Guidance i|5.4|0.318 (5)|0.603 (5)|0.682 (6)|0.240 (6)|0.177 (5)|0.302 (5)|0.438 (6)|
|Nuke F_RigRemoval +|2.0|0.291 (2)|0.211 (3)|0.078 (2)|0.120 (3)|0.068 (1)|0.091 (1)|0.104 (2)|
|Telea Inpainting i|4.7|0.333 (6)|0.623 (6)|0.614 (5)|0.206 (5)|0.141 (3)|0.133 (3)|0.367 (5)|
|Complex Scenes m|4.3|0.307 (3)|0.252 (4)|0.116 (3)|0.162 (4)|0.195 (6)|0.355 (6)|0.237 (4)|

|MS-DSSIMdt|
|Method|Avg. rank|Deck|Library|Fountain|Wires|Tower|Skyscrapers|Sign|
|Background Reconstruction +|1.7|0.013 (1)|0.005 (1)|0.036 (4)|0.007 (1)|0.007 (2)|0.009 (2)|0.004 (1)|
|PFClean Remove Rig +|2.9|0.018 (3)|0.007 (2)|0.003 (1)|0.011 (3)|0.013 (5)|0.016 (4)|0.009 (2)|
|Planar Structure Guidance i|6.0|0.156 (6)|0.301 (6)|0.458 (6)|0.092 (6)|0.067 (6)|0.145 (6)|0.186 (6)|
|Nuke F_RigRemoval +|2.9|0.020 (5)|0.012 (4)|0.004 (2)|0.012 (4)|0.003 (1)|0.006 (1)|0.009 (3)|
|Telea Inpainting i|4.4|0.013 (2)|0.092 (5)|0.197 (5)|0.016 (5)|0.010 (4)|0.019 (5)|0.046 (5)|
|Complex Scenes m|3.1|0.018 (4)|0.009 (3)|0.008 (3)|0.011 (2)|0.009 (3)|0.016 (3)|0.021 (4)|

|MS-CDSSIM|
|Method|Avg. rank|Deck|Library|Fountain|Wires|Tower|Skyscrapers|Sign|
|Background Reconstruction +|2.0|0.118 (1)|0.056 (3)|0.119 (4)|0.040 (1)|0.055 (2)|0.081 (2)|0.043 (1)|
|PFClean Remove Rig +|2.4|0.141 (4)|0.045 (1)|0.020 (1)|0.056 (2)|0.088 (3)|0.122 (4)|0.046 (2)|
|Planar Structure Guidance i|5.9|0.200 (6)|0.293 (6)|0.409 (6)|0.158 (6)|0.118 (6)|0.227 (5)|0.288 (6)|
|Nuke F_RigRemoval +|2.4|0.140 (3)|0.075 (4)|0.027 (2)|0.072 (3)|0.041 (1)|0.069 (1)|0.050 (3)|
|Telea Inpainting i|4.6|0.183 (5)|0.286 (5)|0.313 (5)|0.130 (5)|0.092 (4)|0.102 (3)|0.212 (5)|
|Complex Scenes m|3.7|0.119 (2)|0.053 (2)|0.028 (3)|0.090 (4)|0.117 (5)|0.278 (6)|0.099 (4)|

|MS-CDSSIMdt|
|Method|Avg. rank|Deck|Library|Fountain|Wires|Tower|Skyscrapers|Sign|
|Background Reconstruction +|1.9|0.015 (1)|0.007 (2)|0.024 (4)|0.009 (1)|0.009 (2)|0.010 (2)|0.007 (1)|
|PFClean Remove Rig +|2.0|0.017 (2)|0.007 (1)|0.004 (1)|0.011 (2)|0.015 (3)|0.013 (3)|0.009 (2)|
|Planar Structure Guidance i|6.0|0.071 (6)|0.105 (6)|0.128 (6)|0.060 (6)|0.039 (6)|0.081 (6)|0.091 (6)|
|Nuke F_RigRemoval +|2.6|0.020 (4)|0.015 (4)|0.006 (2)|0.014 (3)|0.008 (1)|0.009 (1)|0.011 (3)|
|Telea Inpainting i|4.9|0.023 (5)|0.065 (5)|0.096 (5)|0.027 (5)|0.019 (5)|0.016 (4)|0.049 (5)|
|Complex Scenes m|3.7|0.018 (3)|0.011 (3)|0.008 (3)|0.018 (4)|0.017 (4)|0.023 (5)|0.021 (4)|
+ Regions that the algorithm left unreconstructed were filled afterward using Telea image inpainting.
m Owing to prohibitively high memory consumption, the test sequences were downscaled to 1280×720 resolution.
i Image-inpainting algorithms.
|||E. Cheng, P. Burton, J. Burton, A. Joseski, and I. Burnett. RMIT3DV: Pre-announcement of a creative commons uncompressed HD 3D video database. Fourth International Workshop on Quality of Multimedia Experience (QoMEX), pages 212–217, 2012.|
|||M. Erofeev, Y. Gitman, D. Vatolin, A. Fedorov, and J. Wang. Perceptually motivated benchmark for video matting. British Machine Vision Conference (BMVC), pages 99.1–99.12, 2015. [doi, project page]|
|||Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), pages 600–612, 2004.|
|||C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 2009.|
|||YUVSoft Background Reconstruction http://www.yuvsoft.com/stereo-3d-technologies/background-reconstruction/|
|||Pixel Farm PFClean http://www.thepixelfarm.co.uk/pfclean/|
|||J.-B. Huang, S. B. Kang, N. Ahuja, and J. Kopf. Image completion using planar structure guidance. ACM Transactions on Graphics (TOG), 2014.|
|||The Foundry Nuke https://www.thefoundry.co.uk/products/nuke/|
|||A. Telea. An image inpainting technique based on the fast marching method. Journal of graphics tools, pages 23–34, 2004.|
|||A. Newson, A. Almansa, M. Fradet, Y. Gousseau, and P. Perez. Video inpainting of complex scenes. SIAM Journal on Imaging Sciences, pages 1993–2019, 2014.|