IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 7, NOVEMBER 2010

Multi-View Video Summarization

Yanwei Fu, Yanwen Guo, Yanshu Zhu, Feng Liu, Chuanming Song, and Zhi-Hua Zhou, Senior Member, IEEE

Abstract—Previous video summarization studies focused on monocular videos, and their results are unsatisfactory when applied directly to multi-view videos, due to problems such as the redundancy across multiple views. In this paper, we present a method for summarizing multi-view videos. We construct a spatio-temporal shot graph and formulate the summarization problem as a graph labeling task. The spatio-temporal shot graph is derived from a hypergraph, which encodes correlations of different attributes among multi-view video shots in its hyperedges. We then partition the shot graph and identify clusters of event-centered shots with similar content via random walks. The summarization result is generated by solving a multi-objective optimization problem based on shot importance evaluated with a Gaussian entropy fusion scheme. Different summarization objectives, such as minimum summary length and maximum information coverage, can be accomplished in this framework. Moreover, multi-level summarization can be achieved easily by configuring the optimization parameters. We also propose the multi-view storyboard and the event-board for presenting multi-view summaries. The storyboard naturally reflects correlations among multi-view summarized shots that describe the same important event. The event-board serially assembles event-centered multi-view shots in temporal order. A single video summary that facilitates quick browsing of the summarized multi-view video can easily be generated from the event-board representation.

Index Terms—Multi-objective optimization, multi-view video, random walks, spatio-temporal graph, video summarization.

Manuscript received August 28, 2009; revised December 21, 2009 and March 26, 2010; accepted May 18, 2010. Date of publication June 07, 2010; date of current version October 15, 2010. This work was supported in part by the National Science Foundation of China under Grants 60703084, 60723003, and 60721002, the National Fundamental Research Program of China (2010CB327903), the Jiangsu Science Foundation (BK2009081), and the Jiangsu 333 High-Level Talent Cultivation Program. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Zhu Liu.

Y. Fu, Y. Zhu, C. Song, and Z.-H. Zhou are with the National Key Lab for Novel Software Technology, Nanjing University, Nanjing 210093, China (e-mail: ztwztq2006@gmail.com; yszhu@cs.hku.hk; chmsong@graphics.nju.edu.cn; zhouzh@nju.edu.cn).

Y. Guo is with the National Key Lab for Novel Software Technology, Nanjing University, Nanjing 210093, China, and also with the Jiangyin Information Technology Research Institute of Nanjing University (e-mail: ywguo@nju.edu.cn).

F. Liu is with the Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53562 USA (e-mail: fliu@cs.wisc.edu).

Digital Object Identifier 10.1109/TMM.2010.2052025

I. INTRODUCTION

WITH the rapid development of computation, communication, and storage infrastructures, multi-view video systems have become more and more popular. Such systems simultaneously capture a group of videos, recording the occurrence of events with considerably overlapping fields of view (FOVs) across multiple cameras. In contrast to the rapid development of video collection and storage techniques, consuming these multi-view videos still remains a problem. For instance, watching a large number of videos to grasp important information quickly is a big challenge.

Video summarization, as an important video content service, produces a condensed and succinct representation of video content, which facilitates the browsing, retrieval, and storage of the original videos. There is a rich literature on summarizing a long video into a concise representation, such as a key-frame sequence [1]–[6] or a video skim [7]–[20]. These existing methods provide effective solutions to summarization; however, they focus on monocular videos. Multi-view video summarization has rarely been addressed, though multi-view videos are widely used in surveillance systems deployed in offices, banks, factories, and city crossroads for private and public security. For such all-weather, day-and-night multi-view surveillance systems, the amount of recorded video data increases dramatically every day. In addition to surveillance, multi-view videos are also popular in sports broadcasting. For example, in a soccer match, goals are usually replayed from different cameras distributed around the stadium. Multi-view video summarization refers to the problem of summarizing multi-view videos into informative video summaries, usually presented as dynamic video shots coming from multiple views, by considering content correlations within each view and among multiple views. The multi-view summaries present salient events with richer information than less salient ones. This allows the user to grasp the important information from multiple perspectives of the multi-view videos without watching them in their entirety. Multi-view summarization will also benefit the storage, analysis, and management of multi-view video content.

Applying existing monocular video summarization methods to each component of a multi-view video group could lead to a redundant summarization result, as each component has overlapping information with the others. To generate a concise multi-view video summary, information correlations as well as discrepancies among the multi-view videos should be taken into account. Nor is it adequate to directly apply previous methods to the single sequence formed by simply concatenating the multi-view videos. Furthermore, since multi-view videos often suffer from different lighting conditions in different views, it is nontrivial to evaluate the importance of shots in each view and to merge the components into an integral video summary in a robust way, especially when the multi-view videos are captured asynchronously. It is thus important to have effective multi-view summarization techniques.

In this paper, we present a method for the summarization of multi-view videos. We first parse the video from each view into shots. Content correlations among multi-view shots are important for producing an informative and compact summary. We use a hypergraph to model such correlations, in which each kind of hyperedge characterizes one kind of correlation among shots. By converting the hypergraph into a spatio-temporal shot graph,
the edge weights can quantitatively measure similarities among shots. We associate each graph node with a value: the shot importance computed by a Gaussian entropy fusion scheme. Such a scheme can calculate the importance of shots in the presence of brightness differences and conspicuous noise, by emphasizing useful information and suppressing redundancy among video features. With the graph representation, the final summary is generated through event clustering based on random walks and a multi-objective optimization process.

To the best of our knowledge, this paper presents the first multi-view video summarization method. It has the following features.

• A spatio-temporal shot graph is used for the representation of multi-view videos. Such a representation makes the multi-view summarization problem tractable in the light of graph theory. The shot graph is derived from a hypergraph which embeds different correlations among video shots within each view as well as across multiple views.

• Random walks are used to identify event-centered shot clusters, and the final summary is generated by multi-objective optimization. The multi-objective optimization can be flexibly configured to meet different summarization requirements. Additionally, multi-level summaries can be achieved easily by setting different parameters. In contrast, most previous methods can only summarize videos from one fixed perspective.

• The multi-view video storyboard and the event-board are presented for representing the multi-view video summary. The storyboard naturally reflects correlations among multi-view summarized shots that describe the same important event. The event-board serially assembles event-centered multi-view shots in temporal order. With the event-board, a single video summary that facilitates quick browsing of the summarized video can be easily generated.

The rest of this paper is organized as follows. We briefly review previous work in Section II. In Section III, we present a high-level overview of our method. The two key components of our method, spatio-temporal shot graph construction and multi-view summarization, are presented in Sections IV and V, respectively. We evaluate our method in Section VI and conclude the paper in Section VII.

II. RELATED WORK

This paper draws on many inspirations from previous work on video summarization. A comprehensive review of state-of-the-art video summarization methods can be found in [21]. In general, two basic forms of video summaries exist, i.e., static key frames and the dynamic video skim. The former consists of a collection of salient images fetched from the original video sequence, while the latter is composed of the most representative video segments extracted from the video source.

Key frame extraction should take into account the underlying dynamics of video content. DeMenthon et al. [1] regarded a video sequence as a curve in a high-dimensional space. The curve is recursively simplified with a tree structure representation, and the frames corresponding to junctions between curve segments at different tree levels are taken as key frames. Hanjalic et al. [3] divided a video sequence into clusters and selected the optimal ones using an unsupervised procedure for cluster-validity analysis; the centroids of the clusters are chosen as key frames. Li et al. [4] formulated key frame extraction as a rate-distortion min-max optimization problem, whose optimal solution is found by dynamic programming. Besides, Orriols et al. [5] addressed summarization under a Bayesian framework: an EM algorithm with a generative model is developed to generate representative frames. Note that key frames can be transformed into a skim by joining up the segments that enclose them, and vice versa.

In contrast to key frames, an advantage of the video skim is that signals in other modalities, such as audio, can be included. Furthermore, a skim preserves the time-evolving nature of the original video, making it more interesting and impressive. Video saliency is necessary for summarization to produce a representative skim. For static images, Ma et al. [22] calculated visual feature contrast as saliency, computing a normalized saliency value for each pixel. To evaluate the saliency of a video sequence, multi-modal features such as motion vectors and audio frequency should be considered [11], [16], [19]. Ma et al. [11] presented a generic framework of user attention modeling through multiple sensory perceptions. Visual and aural attentions are fused into an attention curve, based on which key frames and video skims are extracted around the crests. Recently, You et al. [19] also introduced a method for human perception analysis that combines motion, contrast, special scenes, and statistical rhythm cues. They constructed a perception curve for labeling a three-level summary, namely, keywords, key frames, and video skim.

Various mechanisms have been used to generate video skims. Nam et al. [12] proposed to adaptively sample the video with a sampling rate based on visual activity. Semantically meaningful summaries are achieved through an event-oriented abstraction. By measuring shots' visual complexity and analyzing speech data, Sundaram et al. [17] generated audio-visual skims with a constrained utility maximization that maximizes information content and coherence. Since summarization can be viewed as a dimension reduction problem, Gong and Liu proposed to summarize video using singular value decomposition (SVD) [9]; the SVD properties they derived help to output a skim of user-specified length. Another method by Gong [8] produces a video summary by minimizing the visual content redundancy of the input video. Previous viewers' browsing logs can assist future viewers: Yu et al.'s method [20] learns users' understanding of video content, and a ShotRank is constructed to measure the importance of video shots, with the top-ranking shots chosen as the video skim.

Some techniques for generating video skims are domain-dependent. For example, Babaguchi [7] presented an approach for abstracting soccer game videos by highlights. Using event-based indexing, an abstracted video clip is automatically created based on the impact factors of events. Soccer events can be detected using temporal logic models [23] or goalmouth detection [24]. Much attention has also been paid to rush video summarization [25]–[27]; rush videos often contain redundant and repetitive content, and exploiting this property allows a concise summary to be generated. The methods in [15] and [18] focus on summarizing music videos via the analysis of audio, visual, and textual information. The summary is generated based on the alignment of the boundaries of the chorus, shot classes, and repeated lyrics of the music video.
Besides, automatic music summarization has been considered in [28].

Graph models have also been used for video summarization. Lu et al. [10] developed a graph optimization method that computes an optimal video skim in each scene via dynamic programming. Ngo et al. [13] used temporal graph analysis to effectively encapsulate information about video structure and highlights. By modeling video evolution with a temporal graph, their method can automatically detect scene changes and generate summaries. Lee et al. [29] presented a scenario-based dynamic video abstraction method using graph matching. Multi-level scenarios generated by graph-based video segmentation and hierarchical segmentation are used to divide a video into shots, and dynamic video abstractions are accomplished by accessing the hierarchy level by level. Another graph-based video summarization method is given by Peng and Ngo [14], in which highlighted events are detected by a graph clustering algorithm that incorporates an effective similarity metric for video clips. Compared with these methods, we focus on multi-view videos. Due to content correlations among multiple views, the spatio-temporal shot graph we construct has more complicated node connections, which makes summarization more challenging.

The above methods provide many effective solutions to mono-view video summarization. However, to the best of our knowledge, few methods are dedicated to multi-view video summarization. Multi-view video coding (MVC) algorithms [30]–[32] also deal with multi-view videos. Using techniques such as motion estimation and disparity estimation, MVC removes information redundancy in the spatial and temporal domains. The video content, however, is unchanged; MVC therefore cannot remove redundancy at the semantic level. In contrast, our multi-view video summarization method paves the way toward this goal by exploring the content correlations among multi-view video shots and selecting the most representative shots for the summary.

III. OVERVIEW

We construct a spatio-temporal shot graph to represent the multi-view videos. Multi-view summarization is achieved through event-centered shot clustering via random walks followed by multi-objective optimization. Spatio-temporal shot graph construction and multi-view summarization are the two key components. An overview of our method is shown in Fig. 1.

Fig. 1. Overview of our multi-view video summarization method (video parsing, multi-view video shots, importance computation, spatio-temporal shot graph, random walks, shot clusters, multi-objective optimization).

To construct the shot graph, we first parse the input multi-view videos into content-consistent video shots. Dynamic shots and important static shots are retained as a result. The preserved shots are used as graph nodes, and the corresponding shot importance values are used as node values. To evaluate the importance, a Gaussian entropy fusion model is developed to fuse together a set of intrinsic video features. The multi-view shots usually have diverse correlations with different attributes, such as temporal adjacency and content similarity. We use a hypergraph to systematically characterize the correlations among shots. A hypergraph is a graph in which an edge, usually called a hyperedge, can link any subset of nodes. Each kind of correlation among multi-view shots is thus represented with one kind of hyperedge in the hypergraph. The hypergraph is further converted into a spatio-temporal shot graph in which the correlations of shots within each view and across multiple views are mapped to edge weights.
To implement multi-view summarization on the spatio-temporal graph, we employ random walks to cluster the event-centered similar shots. Using these clusters as anchor points, the final summarized multi-view shots are generated by a multi-objective optimization model that supports different user requirements as well as multi-level summarization.

We use the multi-view video storyboard and the event-board to represent the multi-view summaries. The multi-view storyboard presents the event-centered summarized shots in a multi-view manner, as shown in Fig. 5. In contrast, the event-board shown in Fig. 6 assembles the summarized shots along the timeline.

IV. SPATIO-TEMPORAL SHOT GRAPH

It is difficult to directly generate a summary, especially video skims, from multi-view videos. A common idea is to first parse the videos into shots. In this way, video summarization is transformed into a problem of selecting a set of representative shots. Obviously, the selected shots should favor interesting events. Meanwhile, these shots should not be redundant with one another. To achieve this, content correlations as well as disparities among shots are taken into account. In previous methods for mono-view video summarization, each shot correlates only with similar shots along the temporal axis; such correlations are simple and easily modeled. For multi-view videos, however, each shot correlates closely not only with the temporally adjacent shots in its own view but also with the spatially neighboring shots in other views. The number of relationships among shots grows rapidly with the number of views, and the correlations are thus much more complicated. To better explore such correlations, we consider them separately according to different attributes, for instance, temporal adjacency, content similarity, and high-level semantic correlation. A hypergraph is first introduced to systematically model the correlations: each graph node denotes a shot resulting from video parsing, while each type of hyperedge characterizes one relationship among shots. We then transform the hypergraph into a weighted spatio-temporal shot graph whose edge weights quantitatively evaluate the correlations among multi-view shots, as sketched below.
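As an illustration of this representation, the following Python sketch builds a toy hypergraph over multi-view shots and collapses it into the pairwise weighted shot graph. The shot identifiers, hyperedge memberships, per-type weights, and the clique-expansion conversion rule are all illustrative assumptions for this sketch; the actual edge weights are derived from the similarity measures of Section IV-C.

```python
from collections import defaultdict
from itertools import combinations

# Toy shots: (view, shot index) pairs serve as node ids; in the real system
# each node also carries its Gaussian-entropy importance score.
shots = [("v1", 0), ("v1", 1), ("v2", 0), ("v3", 0)]

# One hyperedge type per correlation attribute; each hyperedge links the
# subset of shots sharing that correlation (memberships are made up here).
hyperedges = {
    "temporal_adjacency":   [{("v1", 0), ("v1", 1)}],
    "visual_similarity":    [{("v1", 0), ("v2", 0), ("v3", 0)}],
    "semantic_correlation": [{("v1", 1), ("v2", 0)}],
}
type_weight = {"temporal_adjacency": 0.5,
               "visual_similarity": 0.3,
               "semantic_correlation": 0.2}

# Convert to the weighted shot graph: expand every hyperedge into a clique
# and accumulate the weights of shared pairs (an assumed conversion rule).
edge_weights = defaultdict(float)
for etype, edges in hyperedges.items():
    for members in edges:
        for a, b in combinations(sorted(members), 2):
            edge_weights[(a, b)] += type_weight[etype]

for (a, b), w in sorted(edge_weights.items()):
    print(a, "--", b, "weight", round(w, 2))
```

Each pair of shots inside a hyperedge thus receives an ordinary weighted edge, which is what lets standard graph algorithms, such as random walks and graph labeling, operate on the multi-view structure.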
A. Graph Construction

We first parse the multi-view videos into shots. Various algorithms have been proposed for shot detection [33]–[37]. In [34], Ngo et al. proposed an approach based on analyzing slices extracted by partitioning the video and collecting their temporal signatures. It has proven effective in detecting camera breaks such as cuts, wipes, and dissolves. Xiang et al. [36] used a cumulative multi-event histogram over time to represent video content; an online segmentation algorithm named forward-backward relevance is developed to detect breaks in video content. For multi-view videos, especially surveillance videos, the cameras remain nearly static, and the recorded videos mostly contain the same scene. A shot here mainly contains temporally contiguous frames that share the same semantic concept with relatively high probability. To detect shots, we basically adopt the algorithm proposed in [36], and further discard shots with low activity. In particular, for every detected shot, we first compute the differential image sequence of adjacent frames. Each differential image is converted into a binary image by comparing the absolute value of each pixel against a threshold. We then compute for each shot a normalized activity value by counting the total number of its nonzero pixels and dividing it by the product of the frame count and the frame resolution, as sketched below. We sort the activity values of all shots and select the activity threshold interactively.
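The activity measure described above admits a direct implementation. The following is a minimal numpy sketch of it; the grayscale input format and the binarization threshold value are illustrative assumptions rather than values specified in the text.

```python
import numpy as np

def shot_activity(frames, pixel_thresh=25):
    """Normalized activity value of a shot, as described above.

    frames: array of shape (n_frames, height, width) with grayscale
            intensities. The activity is the number of pixels whose
            absolute frame-to-frame difference exceeds a threshold,
            normalized by (frame count) x (frame resolution).
    """
    frames = np.asarray(frames, dtype=np.int16)   # avoid uint8 wrap-around
    diff = np.abs(np.diff(frames, axis=0))        # differential image sequence
    binary = diff > pixel_thresh                  # per-pixel binarization
    n, h, w = frames.shape
    return binary.sum() / float(n * h * w)

# Toy usage: a mostly static 10-frame shot with one small moving block.
frames = np.zeros((10, 120, 160), dtype=np.uint8)
for t in range(10):
    frames[t, 40:48, 10 * t:10 * t + 8] = 255     # block slides to the right
print(shot_activity(frames))  # low-activity shots below a cutoff are discarded
```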
Parsing the multi-view videos into shots allows us to seek the summarization solution in a more compact shot space. Actually, each shot correlates with similar shots in its own view as well as with those in other views. This characteristic makes a weighted graph a suitable representation of multi-view videos, with shots viewed as nodes and the correlations between shots converted into edge weights. We extend the graph model for mono-view video summarization [10], [13], [14] and segmentation [38] to a spatio-temporal shot graph. The connectivity of the graph we construct is inherently complicated due to the spatio-temporal correlations among multi-view shots.

Fig. 2. Spatio-temporal shot graph G(V, E, W). Each node in G represents a shot, and its value is the shot importance. Each edge connects a pair of correlated nodes (shots), whose correlation is evaluated by the shots' similarity. Without loss of generality, only three shots per view are shown; the edges are drawn in several colors for clarity, and each shot is represented as an orange segment.

The multi-view videos are treated as a weighted undirected shot graph G(V, E, W), as illustrated in Fig. 2. Each node in V represents a shot resulting from video parsing; its value is the importance of the shot calculated by the Gaussian entropy fusion model. The edge set E connects every pair of nodes that are closely correlated. The edge weight W measures node similarity by taking into account their correlations in terms of different attributes. We model such correlations among shots with a hypergraph in which each type of hyperedge denotes a kind of correlation. By converting the hypergraph into the spatio-temporal graph, the edge weights quantitatively evaluate the correlations among shots. Note that the shot graph is called a spatio-temporal graph in the sense that it embeds scene information coming from different spatial views. The "spatio-temporal" here differs from its traditional definition on a monocular video sequence.

By representing the multi-view videos as the spatio-temporal shot graph, correlations among shots are naturally and intuitively reflected in the graph. Moreover, the graph nodes carry shot importance, which is necessary to create a concise and representative summary. We describe the Gaussian entropy fusion model and the hypergraph in Sections IV-B and IV-C, respectively.

B. Shot Importance Computation

By representing multi-view videos with a graph, multi-view video summarization is converted into the task of selecting the most representative video shots. The selection of representative shots often varies from person to person. In this sense, detecting representative shots generally involves understanding video content based on human perception and is very difficult. To make it computationally tractable, we instead quantitatively evaluate shot importance by considering low-level image features as well as high-level semantics. We introduce a Gaussian entropy fusion model that fuses a set of low-level features, such as color histograms and wavelet coefficients, and computes an importance score. For high-level semantics, we currently mainly consider human faces. Moreover, we take into account interesting events for specific types of videos, since video summarization is often domain-specific.

1) Shot Importance by Low-Level Features: We develop a Gaussian entropy fusion model to measure shot information by integrating low-level features. In contrast, previous mono-view video summarization methods generally combine features with linear or nonlinear fusion schemes. Such schemes do not necessarily lead to optimal performance for multi-view videos when the videos are contaminated by noise. This is especially true for multi-view surveillance videos, which often suffer from different lighting conditions across views. Under such circumstances, we should robustly and fairly evaluate the importance of shots that may capture the same interesting event in multiple views under different illuminations. To account for this, we need to emphasize the shot-related useful information in the multi-view videos while depressing the influence of noise. Based upon this observation, we
first extract from the videos a set of intrinsic low-level features, which are often correlated with each other.

We currently focus on visual features: the color histogram, the edge histogram, and wavelet features [9], [39], [40]. Features in other modalities, such as the textual and aural features used in previous video analysis methods [11], [19], can also be integrated into our method. Without loss of generality, for a shot $S$ with $n$ frames, suppose that overall $M$ feature vector sets are extracted. We expand each feature $F_i$ into a one-column vector; two arbitrary features $F_i$ and $F_j$ may have different dimensions. We denote the feature sets by $\{F_i\}_{i=1}^{M}$.

The feature sets contain shot-related useful information, but they are often contaminated by noise. The Gaussian entropy fusion model aims at emphasizing the useful information of the feature sets while minimizing the influence of noise. We can relatively safely assume that different features have uncorrelated noise characteristics. The interaction of the feature sets is the shot-related information, expressed as

$$\mathrm{Inf}(S) \approx \sum_{i=1}^{M} \mathrm{Inf}(F_i) - \mathrm{Inf}(F_1, F_2, \ldots, F_M). \qquad (1)$$

In this formula, the importance of shot $S$ is measured by adding up the amount of information of the individual features and subtracting the amount of information of their union. Since the noises contained in different features are uncorrelated, the formula weakens the influence of noise while the useful information is emphasized. According to information theory, entropy is a measure of the amount of information. We therefore add up the entropy values of all feature sets and subtract the entropy of their union from the sum:

$$H(S) = \sum_{i=1}^{M} H(F_i) - H(F_1, F_2, \ldots, F_M) \qquad (2)$$

where $F_i = (f_{i1}, f_{i2}, \ldots, f_{in})^T$, $f_{ij}$ is the $i$th feature set for the $j$th frame of shot $S$, and $H(\cdot)$ denotes the entropy of a feature. To estimate the probabilities $p(F_i)$ and $p(F_1, F_2, \ldots, F_M)$, a common idea is to approximate them with Gaussian distributions:

$$p(F_i) \sim \mathcal{N}(0, \Sigma_i) \qquad (3)$$

$$p(F_1, F_2, \ldots, F_M) \sim \mathcal{N}(0, \Sigma) \qquad (4)$$

where $\Sigma_i$ is the covariance matrix of $f_i$ $(i = 1, \ldots, M)$ and $\Sigma$ is that of $(F_1, \ldots, F_M)$. To match the zero-mean assumption, the feature sets $\{F_i\}$ are normalized by

$$\hat{F}_i = F_i - \bar{F}_i \qquad (5)$$

where $\bar{F}_i$ is the mean of $F_i$ over the frames of the shot. By virtue of nonlinear time series analysis [41], the Gaussian entropy of shot $S$ is finally expressed as

$$H(S) = \frac{1}{2} \left( \sum_{j} \log_2 \sigma_{jj} - \sum_{j} \log_2 \lambda_j \right) \qquad (6)$$

where $\sigma_{jj}$ is the $j$th element on the diagonal of the matrix $\Sigma$ and $\lambda_j$ is the $j$th eigenvalue of $\Sigma$.

The entropy $H(S)$ is a measure of the information encoded by the shot $S$; we take it as the shot importance. An additional advantage of the Gaussian entropy fusion scheme is that it works well as long as the union of the feature vector groups covers most of the useful information of the multi-view videos. Therefore, instead of using all possible feature sets, some well-defined feature sets are sufficient.
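The computation in (2)–(6) can be sketched in a few lines of numpy. This is a minimal illustration under the stated Gaussian assumptions: the zero-mean normalization and the use of the diagonal of $\Sigma$ for the first term follow the reconstruction of (5) and (6) above, and the random toy features merely stand in for real color, edge, and wavelet descriptors.

```python
import numpy as np

def gaussian_entropy_importance(features, eps=1e-8):
    """Shot importance H(S) following (2)-(6) under Gaussian assumptions.

    features: list of M arrays, each of shape (n_frames, d_i); one entry
              per low-level feature set (color histogram, edge histogram,
              wavelet coefficients, ...).
    """
    # (5): zero-mean normalization of each per-frame feature set.
    normed = [np.asarray(F, float) - np.asarray(F, float).mean(axis=0)
              for F in features]

    # (3)-(4): zero-mean Gaussian models; Sigma is the covariance of the union.
    union = np.concatenate(normed, axis=1)   # shape (n_frames, sum of d_i)
    cov = np.cov(union, rowvar=False)

    # (6): half the summed log-variances (diagonal of Sigma) minus half the
    # summed log-eigenvalues of Sigma; eps guards the logarithms.
    diag = np.clip(np.diag(cov), eps, None)
    lam = np.clip(np.linalg.eigvalsh(cov), eps, None)
    return 0.5 * (np.log2(diag).sum() - np.log2(lam).sum())

# Toy usage: two correlated feature sets for a 30-frame shot.
rng = np.random.default_rng(0)
base = rng.normal(size=(30, 8))
feat_a = base + 0.1 * rng.normal(size=(30, 8))                 # stand-in feature set
feat_b = base @ rng.normal(size=(8, 6)) + 0.1 * rng.normal(size=(30, 6))
print(gaussian_entropy_importance([feat_a, feat_b]))           # shot importance score
```

Under this reconstruction, (6) equals $\frac{1}{2}\log_2(\prod_j \sigma_{jj} / \det \Sigma)$, which is nonnegative by Hadamard's inequality: the more information the feature sets share, the higher the score, while uncorrelated noise contributes little.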
2) Shot Importance by High-Level Semantics: Humans are usually important content in a video sequence. We employ the Viola-Jones face detector [42] to detect faces in each frame. In addition, video summarization is often domain-specific, and the definition of shot importance may vary across video genres. For instance, in a baseball game video, shots that contain a "home run", "catch", or "hit" usually attract much user attention. Many methods have been proposed to detect interesting events in specific types of videos, such as abnormality detection in surveillance video [43], excitement and interestingness detection in sports video [7], [23], [24], brilliant music detection [28], and so on. A detailed description of these methods is beyond the scope of this paper; however, for specific types of multi-view videos, interesting event detection can be integrated into our method. For shots that contain faces in most frames or that contain interesting events, the importance scores are set to 1.

C. Correlations Among Shots by Hypergraph

Basically, three kinds of relationships among multi-view shots are considered:
• Temporal adjacency. Two shots are likely to describe the same event if one is temporally adjacent to the other.
• Visual similarity. Two shots are related to each other if they are visually similar.
• Semantic correlation. Two shots may correlate with each other when the same event or semantic object, such as a face, occurs in both.

Temporal adjacency implies that adjacent video shots share the same semantic concepts with relatively higher probability. For two shots $S_i$ and $S_j$, the temporal similarity is defined as

$$W_t(S_i, S_j) = \frac{1}{\sigma_1 + \sigma_2 d_{ij} + \sigma_3 d_{ij}^2} \qquad (7)$$

where $d_{ij} = |t_i - t_j|$ computes the temporal distance, and $t_i$ and $t_j$ are the time stamps of the middle frames of the two shots. The distance $d_{ij}$ is thus integrated into a light attenuation function, in which the parameters $\sigma_1$, $\sigma_2$, and $\sigma_3$ control the temporal similarity. We set their values as follows. Given training shots parsed from a mono-view video, we first compute the temporal similarities of each shot pair with initial parameter values. We then adjust the values until the computed temporal similarities accord with our observation. Through experiments, $\sigma_1$, $\sigma_2$, and $\sigma_3$ are set to 1, 0.01, and 0.0001, respectively. Different settings of the three parameters have the same effect on summarization as long as their relative magnitudes are preserved. We set the other parameters in the similarity computations in similar ways.

Visual similarity is computed by

$$W_v(S_i, S_j) = e^{-k_w \, \mathrm{VisSim}(S_i, S_j)} \qquad (8)$$
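To make (7) and (8) concrete, the sketch below evaluates both weights in Python under this reading of the formulas: the temporal weight is the reciprocal ("light attenuation") function of the mid-frame distance with the parameter values reported above, and the visual weight decays exponentially with a visual dissimilarity score. The histogram-difference stand-in for $\mathrm{VisSim}$ and the decay constant $k_w = 1$ are illustrative assumptions, not values given in the text.

```python
import numpy as np

SIGMA1, SIGMA2, SIGMA3 = 1.0, 0.01, 0.0001   # parameters of (7) from the experiments
K_W = 1.0                                    # decay constant of (8); assumed here

def temporal_similarity(t_i, t_j):
    """Eq. (7): light-attenuation weight of mid-frame time stamps (in frames)."""
    d = abs(t_i - t_j)
    return 1.0 / (SIGMA1 + SIGMA2 * d + SIGMA3 * d * d)

def visual_similarity(hist_i, hist_j):
    """Eq. (8): exponential decay in a visual dissimilarity score; the L1
    difference of normalized color histograms is an illustrative stand-in."""
    dis = np.abs(hist_i / hist_i.sum() - hist_j / hist_j.sum()).sum()
    return float(np.exp(-K_W * dis))

# Toy usage: two shots 150 frames apart with similar color distributions.
rng = np.random.default_rng(1)
h1 = rng.random(64)
h2 = h1 + 0.05 * rng.random(64)
print(temporal_similarity(100, 250))   # about 0.21 for d = 150
print(visual_similarity(h1, h2))       # close to 1 for similar histograms
```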