SVD: A Large-Scale Short Video Dataset for Near-Duplicate Video Retrieval

Qing-Yuan Jiang†, Yi He‡, Gen Li‡, Jian Lin†, Lei Li‡ and Wu-Jun Li†
†National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing, China
‡ByteDance AI Lab, Beijing, China
jiangqy@lamda.nju.edu.cn, {heyi,ligen.lab,lileilab}@bytedance.com, linj@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

With the explosive growth of video data in real applications, near-duplicate video retrieval (NDVR) has become indispensable and challenging, especially for short videos. However, all existing NDVR datasets are introduced for long videos. Furthermore, most of them are small-scale and lack diversity due to the high cost of collecting and labeling near-duplicate videos. In this paper, we introduce a large-scale short video dataset, called SVD, for the NDVR task. SVD contains over 500,000 short videos and over 30,000 labeled videos of near-duplicates. We use multiple video mining techniques to construct positive/negative pairs. Furthermore, we design temporal and spatial transformations to mimic user-attack behavior in real applications for constructing more difficult variants of SVD. Experiments show that existing state-of-the-art NDVR methods, including real-value based and hashing based methods, fail to achieve satisfactory performance on this challenging dataset. The release of the SVD dataset will foster research and system engineering in the NDVR area. The SVD dataset is available at https://svdbase.github.io.

1. Introduction

Over the past decades, we have witnessed the explosive growth of video data on a variety of video sharing websites like YouTube¹, Instagram², and TikTok³. For example, 400 hours of new videos were uploaded to YouTube every minute and one billion hours of content was watched on YouTube every day in February 2017⁴. With billions of videos available on the internet, it becomes a major challenge to perform near-duplicate video retrieval (NDVR) from a large-scale video database. NDVR aims to retrieve the near-duplicate videos from a massive video database, where near-duplicate videos are defined as videos that are visually close to the original videos [32]. For example, the videos might be slightly modified by the users to bypass detection, and the modified videos can be treated as near-duplicate videos of the original videos. These modifications can be caption insertion, border insertion and so on. An NDVR system has become a necessity on content platforms with many applications, including video recommendation, video search, and copyright infringement detection. Hence, NDVR has become a hot research topic, and many methods have been proposed for NDVR [32, 10, 8, 4, 33, 29, 1, 24, 16, 18, 2, 23, 13, 30, 19, 6].

¹ https://www.youtube.com
² https://www.instagram.com
³ https://www.tiktok.com
⁴ https://en.wikipedia.org/wiki/YouTube

Existing NDVR methods can be divided into video-level methods and frame-level methods. Video-level methods, including layer-wise convolutional neural network (CNNL) [12], vector-wise convolutional neural network (CNNV) [12] and deep metric learning (DML) [13], try to represent each video as a global feature. Frame-level methods, including spatio-temporal post-filtering [4], circulant temporal encoding (CTE) [24] and temporal matching kernel (TMK) [23], extract features for each frame of the video. In the meantime, to advance research on NDVR, several video datasets have been introduced in recent years, including CCWEB [32], UQ_VIDEO [29], VCDB [9], MUSCLE_VCD [14], TRECVID [22] and so on. However, all of them are for long videos with average duration longer than 60 seconds.

In recent years, short videos with duration less than 60 seconds have become increasingly popular on social media platforms. Users have a strong incentive to copy a hot short video and upload a modified version on these platforms to gain attention. With the increase in short video data, new difficulties and challenges arise in detecting near-duplicate short videos. Some of them are listed as follows. Firstly, most long videos are generated by professional photographers with
cameras, while most short videos are generated by amateurs with mobile devices. Hence, the short videos might contain some new types of near-duplicates, e.g., horizontal/vertical screen videos and camera shaking videos. Secondly, as editing a short video is cheaper, users might prefer to edit a short video. Hence, the number of near-duplicate short videos is larger than that of near-duplicate long videos. Therefore, there is an urgent need for a large-scale short video dataset for the NDVR task.

In this paper, we introduce a new large-scale short video dataset, called SVD, to foster research on NDVR for short videos. The main contributions of this paper are listed as follows:

• The introduced SVD dataset contains over 500,000 short videos and over 30,000 labeled videos for the NDVR task. To the best of our knowledge, SVD is the first large-scale short video dataset for the NDVR task. Compared with existing NDVR datasets, the SVD dataset is the largest one.

• With hard labeled positive/negative videos mined by multiple strategies, the SVD dataset is challenging for NDVR. Furthermore, we design some temporal and spatial transformations to mimic user behavior in real applications and construct more difficult variants of SVD.

• We perform two categories of retrieval to evaluate the performance of existing state-of-the-art NDVR methods on the SVD dataset, i.e., real-value based retrieval and hashing based retrieval. Experiments demonstrate that these NDVR methods cannot achieve satisfactory retrieval performance on the SVD dataset. Hence, the release of the SVD dataset will foster research in the NDVR area.

The rest of this paper is organized as follows. In Section 2, we briefly review the related work. In Section 3, we describe the dataset collection strategies in detail. In Section 4, we introduce some temporal and spatial transformations applied to the SVD dataset. In Section 5, we carry out experiments on the SVD dataset. Finally, we conclude our paper in Section 6.

2. Related Work

We briefly review the datasets for the NDVR task in this section. Specifically, related datasets include the CCWEB [32], UQ_VIDEO [29], VCDB [9], MUSCLE_VCD [14], and TRECVID [22] datasets.

The CCWEB [32] dataset contains 24 query videos and 12,790 labeled videos. The authors utilize 24 text queries, e.g., “The lion sleeps tonight” and “Evolution of dance”, to retrieve the videos from YouTube, Google Video, and Yahoo! Video. The returned videos contain 27% redundant videos. Then the authors collect 12,790 videos as the labeled set. The average duration for this dataset is 151.02 seconds. In this dataset, over half of the queries are about dancing and singing, so the dataset lacks diversity.

UQ_VIDEO [29] is an extended dataset of CCWEB. The authors utilize the 24 query videos and 12,790 labeled videos of CCWEB as the query set and labeled set for the UQ_VIDEO dataset, respectively. Then the authors construct a background distraction set with 119,833 videos. The videos in the background distraction set are usually treated as negative, but the labels are not verified by humans. In the end, the authors collect 132,647 videos in total. Although UQ_VIDEO is larger than CCWEB, it also lacks diversity due to the limited number of queries. Furthermore, for all background distraction videos, this dataset only provides HSV [26] features and LBP [7] features of all key frames, and the original videos are not publicly available.

The VCDB [9] dataset utilizes the same 528 videos to construct both the query set and the labeled set. Furthermore, the authors provide 100,000 background distraction videos. Thus this dataset contains 100,528 videos in total. The VCDB dataset was originally proposed for the copyright detection task, and only provides 9,236 copied segment labels. However, for the NDVR task, we need video-level pairwise labels to denote whether a candidate video is a near-duplicate video of the query video or not. Hence, we filter redundant copied segment pairwise labels and get 6,139 video-level pairwise labels for the NDVR task. Please note that all 6,139 video-level pairwise labels are positive. The average duration of the VCDB dataset is 72.77 seconds.

MUSCLE_VCD [14] collects 18 videos to construct the query set. Then the authors utilize the query videos to generate 101 videos as the labeled set based on some predefined transformations. Thus the MUSCLE_VCD dataset collects 119 videos in total.

The TRECVID [22] dataset utilizes 11,256 query videos to construct the query set. Then the authors use the query videos to generate 11,503 videos as the labeled set based on some predefined transformations. Thus the TRECVID dataset collects 22,759 videos in total.

The above datasets have been widely used for the NDVR task. All of these datasets are long video datasets and have different shortcomings. Specifically, the videos of the TRECVID and UQ_VIDEO datasets are not publicly available. The MUSCLE_VCD and TRECVID datasets are small-scale, and the labeled videos of these two datasets are generated by the authors of the datasets rather than the users of real video platforms. The CCWEB and UQ_VIDEO datasets lack diversity. The VCDB dataset only contains positive pairwise labels. The second to the sixth columns of Table 1 list the statistics of the aforementioned datasets. Table 1 shows that all existing NDVR datasets consist of long videos, with average durations longer than 60 seconds.
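The totals quoted in this section can be recomputed from the per-subset counts. The Python sketch below is only a sanity check added for illustration: the dictionary layout and the function name are assumptions, while the numbers are those reported in this section and in Table 1. Because VCDB uses the same 528 videos as both query set and labeled set, they are counted once.

```python
# Counts as reported in the text and Table 1:
# (query videos, labeled videos, background distraction videos, reported total).
DATASETS = {
    "CCWEB": (24, 12_790, 0, 12_814),
    "UQ_VIDEO": (24, 12_790, 119_833, 132_647),
    "VCDB": (528, 528, 100_000, 100_528),
    "MUSCLE_VCD": (18, 101, 0, 119),
    "TRECVID": (11_256, 11_503, 0, 22_759),
}

# Datasets whose query set and labeled set are the same videos (counted once).
SHARED_QUERY_AND_LABELED = {"VCDB"}


def total_videos(name: str) -> int:
    """Recompute the total number of videos in a dataset from its subsets."""
    query, labeled, background, _ = DATASETS[name]
    if name in SHARED_QUERY_AND_LABELED:
        return labeled + background
    return query + labeled + background


for name, (_, _, _, reported) in DATASETS.items():
    status = "matches" if total_videos(name) == reported else "differs from"
    print(f"{name}: recomputed {total_videos(name):,} {status} reported {reported:,}")
```

Under these assumptions, every recomputed total agrees with the reported one.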
Table 1. Comparison between SVD and existing datasets. As the original videos in the background distraction set of UQ_VIDEO are not publicly available and we cannot access the MUSCLE_VCD and TRECVID datasets, some statistics of these three datasets are N/A.

| Item | CCWEB | UQ_VIDEO | VCDB | MUSCLE_VCD | TRECVID | SVD |
|---|---|---|---|---|---|---|
| #query videos | 24 | 24 | 528 | 18 | 11,256 | 1,206 |
| #labeled videos | 12,790 | 12,790 | 528 | 101 | 11,503 | 34,020 |
| #positive pairs | 3,481 | 3,481 | 6,139 | N/A | N/A | 10,211 |
| #negative pairs | 9,311 | 9,311 | 0 | N/A | N/A | 26,927 |
| #background distraction videos | 0 | 119,833 | 100,000 | 0 | 0 | 0 |
| #probable negative unlabeled videos | 0 | 0 | 0 | 0 | 0 | 526,787 |
| #total videos | 12,814 | 132,647 | 100,528 | 119 | 22,759 | 562,013 |
| Average duration (seconds) | 151.02 | N/A | 72.77 | 3,564.36 | 131.44 | 17.33 |
| Total duration (hours) | 539.95 | N/A | 2,027.60 | 100 | 420 | 2,704.96 |
| Videos publicly available | √ | × | √ | √ | × | √ |

[Figure 1: three histograms, one each for the CCWEB, VCDB and SVD datasets; x-axis: video duration, y-axis: #videos.]

Figure 1. Video duration comparison on CCWEB, VCDB and SVD datasets. Note the average duration of our constructed SVD is significantly shorter than that of CCWEB and VCDB.

3. SVD: A Large-Scale Short Video Dataset

In this section, we describe the dataset collection strategies for constructing our large-scale short video dataset called SVD.

All videos in the SVD dataset are crawled from a large video website, Douyin⁵, and the video format is “.mp4”. The duration of most videos is less than 60 seconds. We crawled an ambient set containing over 100 million short videos, from which we select videos and construct SVD. The SVD dataset is divided into three subsets, i.e., the query set, the labeled set and the probable negative unlabeled set. First, we collect 1,206 videos as the query set. Then we utilize multiple strategies to mine hard positive/negative candidate videos for annotation. Unlike the candidate videos in existing datasets, which are crawled randomly, the candidate videos in SVD are selected by multiple strategies designed to make them hard. Hence, we call these candidate videos hard positive/negative candidate videos. After human annotation, we collect 34,020 labeled videos to form the labeled set, which includes 10,211/26,927 labeled positive/negative video pairs. Besides this, by utilizing a pairwise similarity filtering strategy, we collect 526,787 videos as a probable negative unlabeled set rather than a background distraction set. Here, the videos in the probable negative unlabeled set are negative videos which are not verified by humans. Unlike the background distraction videos in the UQ_VIDEO and VCDB datasets, which are crawled randomly, we utilize a filtering strategy to ensure that, with high probability, the videos in the probable negative unlabeled set are not near-duplicate videos of the query videos. Hence, the videos in the probable negative unlabeled set are more suitable to be treated as negative than those in a background distraction set. In the last column of Table 1, we present the statistics of the SVD dataset. Table 1 shows that the average duration of the SVD dataset is only 17.33 seconds, which is shorter than that of the other datasets. Furthermore, SVD is the largest dataset among all datasets in Table 1. In Figure 1, we further illustrate the distribution of durations for the CCWEB, VCDB and SVD datasets. Figure 1 shows that most of the videos in the SVD dataset are short compared with those in CCWEB and VCDB. In the rest of this section, we describe the detailed construction strategies.

⁵ http://www.douyin.com

3.1. Query Set

We crawl 1,206 videos, each with more than 30,000 “likes”, as the query set. All of these queries were uploaded in November 2018. To ensure diversity, the contents and types of these query videos are made as diverse as possible. Specifically, the contents of the query videos include portrait, landscape, game video, animation and so on. The query videos also cover a variety of video types, including vertical screen video, horizontal screen video and so on. Figure 2 illustrates some randomly sampled query videos.
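As a quick consistency check on the SVD statistics reported in this section, the sketch below sums the three subsets and compares the total duration implied by the average duration with the reported total. The variable names are illustrative assumptions; the numbers are taken from Table 1. Attributing the small gap between the two duration figures to the average being rounded to two decimals is also an assumption, not something stated about the dataset.

```python
# Subset sizes and duration statistics for SVD, as reported in Table 1.
QUERY_VIDEOS = 1_206
LABELED_VIDEOS = 34_020
PROBABLE_NEGATIVE_UNLABELED_VIDEOS = 526_787
REPORTED_TOTAL_VIDEOS = 562_013
REPORTED_AVERAGE_SECONDS = 17.33
REPORTED_TOTAL_HOURS = 2704.96

total_videos = QUERY_VIDEOS + LABELED_VIDEOS + PROBABLE_NEGATIVE_UNLABELED_VIDEOS

# Estimate the total duration from the (rounded) average duration.
estimated_hours = total_videos * REPORTED_AVERAGE_SECONDS / 3600

# Largest error that rounding the average to two decimals could introduce.
rounding_slack_hours = total_videos * 0.005 / 3600

gap_hours = abs(estimated_hours - REPORTED_TOTAL_HOURS)

print(f"total videos: {total_videos:,} (reported {REPORTED_TOTAL_VIDEOS:,})")
print(f"estimated total duration: {estimated_hours:.2f} h")
print(f"reported total duration:  {REPORTED_TOTAL_HOURS:.2f} h")
print(f"gap within rounding slack: {gap_hours <= rounding_slack_hours}")
```

The three subset sizes add up to the reported total of 562,013 videos, and the estimated and reported total durations differ by roughly half an hour, which is within the slack that the rounded average allows.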
Portrait, multiple screens Landscape, horizontal screen Game video, vertical screen Building, vertical screen Animation, vertical screen Pet, vertical screen Portrait, vertical screen Animation, horizontal screen Figure 2. Example of query videos in SVD. Each block represents a video with multiple frames. 3.2. Labeled Set To construct the labeled set, we first choose some videos as candidate videos for annotation. All candidate videos are divided into positive (near-duplicate) candidate videos and negative candidate videos, which respectively denote the videos we expect to be annotated (labeled) as positive and negative videos of the corresponding query videos. To mine hard positive/negative candidate videos for annotation, we utilize multiple strategies to select candidate videos from the ambient set. The strategies include iterative retrieval, transformed retrieval, and feature based mining. Among these strategies, the first two strategies are mainly used for mining hard positive candidate videos and the last strategy is used for mining hard negative candidate videos. We collect nearly 50,000 video pairs for annotation. These video pairs are labeled by human annotators. Annotation costs over 800 hours in total. After removing the videos inappropriate for public release, we collect 1,206 queries and 34,020 labeled videos. In the rest of this subsection, we will describe the details of the three strategies for selecting candidate videos. Iterative Retrieval To mine hard positive candidate videos, we utilize an interactive retrieval method to annotate the positive candidate videos. This method can be divided into the following three steps. Firstly, for a given query video, it retrieves through the ambient set to get the candidates by using a variety of methods, including LBP [21] and BSIFT [35] feature based retrieval methods. Secondly, human annotators label these candidates for each query and select the positive ones. 
Lastly, the selected positive videos are fed back into the first step to retrieve more positive candidates. The whole process is repeated several times until no more positive videos can be found for a given query.

Because the iterative retrieval procedure requires low latency, we only employ LBP [21] and BSIFT [35] features during this procedure. More advanced features and similarity calculation methods are used in the subsequent transformed retrieval procedure.

Transformed Retrieval We also apply various transformations, such as rotation and cropping, to the query videos to get transformed videos. We then use the transformed videos as queries to search over the ambient set. Specifically, we utilize LBP, BSIFT, and deep feature based retrieval methods to select the candidate videos. Then we select the top-5 to top-10 results as candidate videos for further human annotation.

Figure 3. Example of hard positive candidate videos (each pair shows a query video and a positive candidate). Top row: side mirrored, color-filtered, and watermarked. Middle row: horizontal screen changed to vertical screen with large black margins. Bottom row: rotated.

In Figure 3, we show some query videos and their hard positive candidate videos mined by iterative retrieval and transformed retrieval. The candidate videos are near-duplicate videos produced by various transformations, including mirror transformation, color-filtered transformation, black border insertion, and rotation transformation.

Feature based Mining To mine hard negative candidate videos, we select 30,000 videos from the ambient set that were uploaded from June 2018 to August 2018 as candidate videos. As the uploading dates of these candidate videos are earlier than those of the videos in our query set, we can expect that most candidate videos are not near-duplicate videos of the query videos.
We extract different types of features to calculate the similarity between candidates and query videos. The features include hand-crafted features (LBP and BSIFT) and deep features. For each
query video, we select the top-5 to top-10 most similar videos as candidate videos for human annotation.

Figure 4. Example of hard negative videos (each pair shows a query video and a negative candidate). All the candidates are visually similar to the query but not near-duplicates.

Figure 4 illustrates some examples of query videos and the corresponding negative candidate videos, where the candidate videos are mined based on deep features. In the example in the top row, a man is casting a net into the water. In the example in the middle row, a girl is having her hair styled in a barbershop. In the example in the bottom row, a girl is playing in a room decorated with lights. However, as the persons in each video pair are different, none of these video pairs are near-duplicates, although they are visually very similar.

3.3. Probable Negative Unlabeled Set
We first select a subset of 700,000 videos from the ambient set as candidates for probable negative unlabeled videos, which are defined as negative videos without human annotation. After extracting a variety of frame and video features, we calculate the pairwise similarity between the query videos and the candidate videos. Candidate videos that are likely to be near-duplicates of query videos are filtered out, and the remaining candidate videos are selected as probable negative unlabeled videos. Specifically, we utilize BSIFT features and aggregated deep features to calculate the similarity between query videos and candidate videos. The BSIFT features are used to calculate the Jaccard similarity, and only those videos whose similarities to all queries are 0 can be selected as candidate videos. Then the aggregated deep features are used to calculate video-level similarity based on Euclidean distance, and we further filter out about 5% of the videos, namely those which have the smallest similarities to all queries.
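The two-stage filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build SVD: BSIFT features are modelled as sets of integer codes, each video has a single aggregated deep feature vector, and the function name `filter_probable_negatives` and its `drop_fraction` parameter are hypothetical. In particular, the sketch interprets the roughly 5% of videos removed in the second stage as those closest to any query in Euclidean distance, which is one reading of the text.

```python
import math


def jaccard(a, b):
    """Jaccard similarity between two sets of feature codes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_probable_negatives(candidates, queries, drop_fraction=0.05):
    """Return ids of candidates kept as probable negatives.

    candidates / queries map a video id to a pair
    (codes, feature): a set of codes standing in for BSIFT features
    and a list of floats standing in for the aggregated deep feature.
    Both representations are assumptions made for this sketch.
    """
    # Stage 1: keep only candidates with zero Jaccard similarity to every query.
    stage1 = [
        vid
        for vid, (codes, _) in candidates.items()
        if all(jaccard(codes, q_codes) == 0.0 for q_codes, _ in queries.values())
    ]
    # Stage 2: distance from each surviving candidate to its nearest query.
    nearest = {
        vid: min(math.dist(candidates[vid][1], q_feat) for _, q_feat in queries.values())
        for vid in stage1
    }
    # Drop the fraction of candidates that lie closest to a query (assumed reading).
    ranked = sorted(stage1, key=lambda vid: nearest[vid])
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[n_drop:]


# Toy data: one query and four candidates.
queries = {"q1": ({1, 2, 3}, [0.0, 0.0])}
candidates = {
    "a": ({2, 9}, [5.0, 5.0]),   # shares code 2 with q1, removed in stage 1
    "b": ({7, 8}, [0.1, 0.0]),   # closest to q1, removed in stage 2
    "c": ({10}, [3.0, 4.0]),
    "d": ({11}, [6.0, 8.0]),
}
kept = filter_probable_negatives(candidates, queries, drop_fraction=0.34)
print(kept)  # ['c', 'd']
```

The toy call uses a large `drop_fraction` only so that the second stage removes one of the three surviving candidates; the text reports a value of about 5%.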
In the end, we obtain 526,787 videos for the probable negative unlabeled set.

To verify that the videos obtained by the above procedure are truly probable negatives, we randomly sample 100 videos from the probable negative unlabeled set and invite human annotators to label them against each of the query videos. None of these videos is labeled as a near-duplicate of the queries. Therefore, with high probability, the videos in the probable negative unlabeled set are not near-duplicates of the query videos.

4. Transformations
In real applications, users might copy popular videos to gain attention. At the same time, these users usually modify the copied videos slightly to bypass detection. These modifications include video cropping, border insertion and so on.

To mimic such user behavior, we define one temporal transformation, i.e., video speeding, and three spatial transformations, i.e., video cropping, black border insertion, and video rotation. Specifically, the video speeding transformation covers both speeding up and slowing down, and is designed to simulate video acceleration or deceleration. In real applications, users might crop the videos to zoom in or out of the original videos, which can be performed by frame cropping. Furthermore, users might insert borders, such as black borders, to fit different video sizes. In addition, there exist many mobile-phone videos which are taken horizontally or vertically. When users upload these videos, they might rotate them.

These transformations are widely applied in the video re-creation procedure. By performing these transformations, harder candidates can be generated and we can construct more challenging datasets. Please note that the above transformations are used as illustrative examples, and users can define their own transformations based on their needs.

5. Experiments
We perform experiments to study the retrieval performance on the SVD dataset and other NDVR datasets.
We adopt two categories of NDVR methods, i.e., real-value based NDVR methods and hashing based NDVR methods. In real applications, real-value based NDVR methods usually suffer from high storage cost and low query speed. To avoid high storage cost and enable fast query speed, hashing based methods [3, 31, 34, 29, 11, 27, 6] have also been adopted for NDVR.

5.1. Datasets
As TRECVID and MUSCLE_VCD are too small and the original videos in the background distraction set of UQ_VIDEO are not available, we select CCWEB [32] and VCDB [9] for comparison with SVD. We adopt the four transformations defined in Section 4 to construct more challenging variants of SVD. Specifi-