ExchNet: A Unified Hashing Network for Large-Scale Fine-Grained Image Retrieval

Quan Cui1,3, Qing-Yuan Jiang2, Xiu-Shen Wei3(B), Wu-Jun Li2, and Osamu Yoshie1

1 Graduate School of IPS, Waseda University, Fukuoka, Japan
cui-quan@toki.waseda.jp, yoshie@waseda.jp
2 National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing, China
qyjiang24@gmail.com, liwujun@nju.edu.cn
3 Megvii Research Nanjing, Megvii Technology, Nanjing, China
weixs.gm@gmail.com

Abstract. Retrieving content-relevant images from a large-scale fine-grained dataset could suffer from intolerably slow query speed and highly redundant storage cost, due to the high-dimensional real-valued embeddings which aim to distinguish subtle visual differences of fine-grained objects. In this paper, we study the novel fine-grained hashing topic to generate compact binary codes for fine-grained images, leveraging the search and storage efficiency of hash learning to alleviate the aforementioned problems. Specifically, we propose a unified end-to-end trainable network, termed ExchNet. Based on attention mechanisms and the proposed attention constraints, ExchNet first obtains both local and global features to represent object parts and the whole fine-grained objects, respectively. Furthermore, to ensure the discriminative ability and semantic consistency of these part-level features across images, we design a local feature alignment approach by performing a feature exchanging operation. Later, an alternating learning algorithm is employed to optimize the whole ExchNet and then generate the final binary hash codes. Validated by extensive experiments, our ExchNet consistently outperforms state-of-the-art generic hashing methods on five fine-grained datasets. Moreover, compared with other approximate nearest neighbor methods, ExchNet achieves the best speed-up and storage reduction, revealing its efficiency and practicality.

Keywords: Fine-Grained Image Retrieval · Learning to hash · Feature alignment · Large-scale image search

Q. Cui, Q.-Y. Jiang—Equal contribution.
Electronic supplementary material: The online version of this chapter (https://doi.org/10.1007/978-3-030-58580-8_12) contains supplementary material, which is available to authorized users.
© Springer Nature Switzerland AG 2020
A. Vedaldi et al. (Eds.): ECCV 2020, LNCS 12348, pp. 189–205, 2020.
https://doi.org/10.1007/978-3-030-58580-8_12
Fig. 1. Illustration of the fine-grained hashing task. Fine-grained images could share large intra-class variances but small inter-class variances. Fine-grained hashing aims to generate compact binary codes with tiny Hamming distances for images of the same sub-category, as well as distinct codes for images from different sub-categories. (The figure shows example images of Artic Tern, Common Tern, and Green Jay mapped by a feature extractor and hashing network to similar or dissimilar binary codes.)

1 Introduction

Fine-Grained Image Retrieval (FGIR) [19,26,31,36,41,42] is a practical but challenging computer vision task. It aims to retrieve images belonging to various sub-categories of a certain meta-category (e.g., birds, cars and aircrafts) and return images of the same sub-category as the query image. In real FGIR applications, previous methods could suffer from slow query speed and redundant storage costs due to both the explosive growth of fine-grained data and the high-dimensional real-valued features.

Learning to hash [3,6,7,10,14,16,17,21,22,34,35] has proven to be a promising solution for large-scale image retrieval because it can greatly reduce the storage cost and increase the query speed. As a representative research area of approximate nearest neighbor (ANN) search [1,6,13], hashing aims to embed data points as similarity-preserving binary codes. Recently, hashing has been successfully applied in a wide range of image retrieval tasks, e.g., face image retrieval [18] and person re-identification [5,43]. We hereby explore the effectiveness of hashing for fine-grained image retrieval.

To the best of our knowledge, this is the first work to study the fine-grained hashing problem, which refers to designing hashing methods for fine-grained objects. As shown in Fig. 1, the task is to generate compact binary codes for fine-grained images that exhibit large intra-class variances but small inter-class variances. To deal with this challenging task, we propose a unified end-to-end trainable network, ExchNet, to first learn fine-grained tailored features and then generate the final binary hash codes.

Concretely, our ExchNet consists of three main modules, including representation learning, local feature alignment and hash code learning, as shown in Fig. 2. In the representation learning module, beyond obtaining the holistic image representation (i.e., global features), we also employ the attention mechanism to capture part-level features (i.e., local features) for representing fine-grained objects' parts. Localizing parts and embedding part-level cues are crucial for fine-grained tasks, since these discriminative but subtle parts (e.g., bird heads or tails) play a major role in distinguishing different sub-categories.
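The query-speed and storage gains referred to above come from the fact that binary codes can be compared by Hamming distance, i.e., an XOR followed by a popcount over packed bits, instead of floating-point distance computations over high-dimensional embeddings. The following minimal sketch illustrates Hamming-distance ranking once codes are available; it is a generic illustration rather than part of ExchNet, and the function name and toy data are our own.

```python
import numpy as np

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Rank database items by Hamming distance to the query.

    query_code: (k,) array of {0, 1} bits for one image.
    db_codes:   (n, k) array of {0, 1} bits for the database.
    """
    # Pack bits into uint8 words so each comparison is a cheap XOR + popcount.
    q = np.packbits(query_code.astype(np.uint8))
    db = np.packbits(db_codes.astype(np.uint8), axis=1)
    # Popcount via an 8-bit lookup table.
    popcount = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)
    dists = popcount[np.bitwise_xor(db, q)].sum(axis=1)
    return np.argsort(dists, kind="stable")

# Toy example: 48-bit codes for a database of 5 images and one query.
rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(5, 48))
query_code = db_codes[2].copy()            # the query matches item 2 exactly
print(hamming_rank(query_code, db_codes))  # item 2 should rank first
```

Because the distance reduces to bit operations on packed words, both the memory footprint per image and the per-comparison cost shrink by orders of magnitude relative to real-valued embeddings, which is exactly the efficiency that fine-grained hashing seeks to exploit.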
Fig. 2. Framework of our proposed ExchNet, which consists of three modules. (1) The representation learning module, as well as the attention mechanism with spatial and channel diversity learning constraints, is designed to obtain both local and global features of fine-grained objects. (2) The local feature alignment module is used to align the obtained local features w.r.t. object parts across different fine-grained images. (3) The hash codes learning module is performed to generate the compact binary codes. (Figure panels: backbone CNN, attention generation with M attention maps, global/local feature refinement with GAP, anchor local features used in the training phase only, and the hashing network.)

Moreover, we also develop two kinds of attention constraints, i.e., spatial and channel constraints, which work collaboratively to further improve the discriminative ability of these local features. Then, to ensure that these part-level features correspond to their own object parts across different fine-grained images, we design an anchor-based feature alignment approach to align the local features. Specifically, in the local feature alignment module, we treat the anchored local feature of each part as the "prototype" of its sub-category, obtained by averaging all local features of that part across images. Once the local features are well aligned to their parts, exchanging one specific part's local feature of an input image with the same part's local feature of the prototype should barely change the image meaning derived from the image representation, and hence the final hash code. Inspired by this motivation, we perform a feature exchanging operation between the anchored local features and the other learned local features, as illustrated in Fig. 3. After that, to effectively train the network under this feature alignment fashion, we utilize an alternating algorithm to solve the hash learning problem and update the anchor features simultaneously.

To quantitatively prove both the effectiveness and efficiency of our ExchNet, we conduct comprehensive experiments on five fine-grained benchmark datasets, including the large-scale ones, i.e., NABirds [11], VegFru [12] and Food101 [23]. Particularly, compared with competing approximate nearest neighbor methods, our ExchNet achieves up to hundreds of times speed-up for large-scale fine-grained image retrieval without significant accuracy drops. Meanwhile, compared with state-of-the-art generic hashing methods, ExchNet consistently outperforms them by a large margin on all the fine-grained datasets. Additionally, ablation studies and visualization results justify the effectiveness of our tailored model designs, such as the local feature alignment and the proposed attention approach.
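To make the exchanging idea concrete, the PyTorch-style sketch below shows one plausible way to implement the operation described above: for each training image, some parts' local features are replaced by the class-level anchor ("prototype") features before the hashing head is applied, and the anchors are refreshed from the observed local features. This is an illustrative sketch under our own assumptions; the tensor shapes, the exchange probability, and the moving-average anchor update are ours and are not the paper's exact formulation.

```python
import torch

def exchange_local_features(local_feats, anchor_feats, labels, p_exchange=0.5):
    """Randomly replace part-level features with class anchors ("prototypes").

    local_feats:  (B, M, C) part-level features of a batch, M parts per image.
    anchor_feats: (num_classes, M, C) per-part class prototypes.
    labels:       (B,) sub-category labels of the batch.
    """
    anchors = anchor_feats[labels]                       # (B, M, C) prototypes
    mask = (torch.rand(local_feats.shape[:2], device=local_feats.device)
            < p_exchange).unsqueeze(-1)                  # (B, M, 1): which parts to swap
    return torch.where(mask, anchors, local_feats)

@torch.no_grad()
def update_anchors(anchor_feats, local_feats, labels, momentum=0.9):
    """Refresh prototypes as a moving average of the observed local features."""
    for cls in labels.unique():
        cls_mean = local_feats[labels == cls].mean(dim=0)   # (M, C)
        anchor_feats[cls] = momentum * anchor_feats[cls] + (1 - momentum) * cls_mean
    return anchor_feats
```

In the paper's training scheme, the (possibly exchanged) local features together with the global feature are fed into the hashing network, and anchor updating alternates with network optimization; the sketch only isolates the exchange and anchor-update steps themselves.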
The contributions of this paper are summarized as follows:

– We study the novel fine-grained hashing topic to leverage the search and storage efficiency of hash codes for solving the challenging large-scale fine-grained image retrieval problem.
– We propose a unified end-to-end trainable network, i.e., ExchNet, to first learn fine-grained tailored features and then generate the final binary hash codes. In particular, the proposed attention constraints, local feature alignment and anchor-based learning fashion contribute to obtaining discriminative fine-grained representations.
– We conduct extensive experiments on five fine-grained datasets to validate both the effectiveness and efficiency of our proposed ExchNet. Especially on the large-scale datasets, ExchNet delivers superior retrieval performance in terms of speed-up, memory usage, and retrieval accuracy.

2 Related Work

Fine-Grained Image Retrieval. Fine-Grained Image Retrieval (FGIR) is an active research topic that has emerged in recent years, where the database and query images could share small inter-class variance but large intra-class variance. In earlier works [36], handcrafted features were utilized to tackle the FGIR problem. Powered by deep learning techniques, more and more deep learning based FGIR methods [19,26,31–33,36,41,42] have been proposed. These deep methods can be roughly divided into two groups, i.e., supervised and unsupervised methods. In supervised methods, FGIR is defined as a metric learning problem. Zheng et al. [41] designed a novel ranking loss and a weakly-supervised attractive feature extraction strategy to facilitate the retrieval performance. Zheng et al. [42] improved their former work [41] with a normalize-scale layer and a decorrelated ranking loss. As for unsupervised methods, Selective Convolutional Descriptor Aggregation (SCDA) [31] was proposed to first localize the main object in fine-grained images, and then discard the noisy background while keeping useful deep descriptors for fine-grained image retrieval.

Deep Hashing. Hashing methods can be divided into two categories, i.e., data-independent methods [6] and data-dependent methods [10,17], based on whether training points are used to learn hash functions. Generally speaking, data-dependent methods, also known as Learning to Hash (L2H) methods, can achieve better retrieval performance thanks to learning on training data. With the rise of deep learning, some L2H methods integrate deep feature learning into hashing frameworks and achieve promising performance; many deep hashing methods [2,3,7,14,16,17,21,22,30,35,38,39] for large-scale image retrieval have been proposed. Compared with deep unsupervised hashing methods [7,14,21], deep supervised hashing methods [14,16,17,35] can achieve superior retrieval accuracy as they can fully explore the semantic information. Specifically, the earlier work [35] was essentially a two-stage method, which learned binary codes in the first stage and then performed feature learning guided by the learned binary codes in the second stage.
Fig. 3. Key idea of our local feature alignment approach: given an image pair of a fine-grained sub-category, exchanging their local features of the same object parts should not change their corresponding hash codes, i.e., the resulting hash codes should be the same as those generated without local feature exchanging, and their Hamming distance should remain small.

Subsequently, numerous one-stage deep supervised hashing methods appeared, including Deep Pairwise Supervised Hashing (DPSH) [17], Deep Supervised Hashing (DSH) [22], and Deep Cauchy Hashing (DCH) [3], which aim to integrate feature learning and hash code learning into an end-to-end framework.

3 Methodology

The framework of our ExchNet is presented in Fig. 2. It contains three key modules, i.e., the representation learning module, the local feature alignment module, and the hash code learning module.

3.1 Representation Learning

Learning discriminative and meaningful local features is closely tied to fine-grained tasks [9,15,20,37,40], since such local features greatly benefit the distinguishing of sub-categories with subtle visual differences deriving from discriminative fine-grained parts (e.g., bird heads or tails). Consequently, as shown in Fig. 2, beyond the global feature extractor, we also introduce a local feature extractor in the representation learning module. Specifically, considering model efficiency, we propose to learn local features with an attention mechanism, rather than with other fine-grained techniques that incur a tremendous computation cost, e.g., second-order representations [15,20] or complicated network architectures [9,37,40].

Given an input image $x_i$, a backbone CNN is utilized to extract a holistic deep feature $E_i \in \mathbb{R}^{H \times W \times C}$, which serves as the shared input for both the local feature extractor and the global feature extractor.
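As a rough illustration of this design, the sketch below generates M attention maps from the backbone feature map and pools them into M part-level local features plus one global feature. It is a minimal PyTorch sketch under our own assumptions: the 1x1 convolution for attention generation and the simple attention-weighted pooling are ours, and the paper's refinement blocks and spatial/channel diversity constraints are omitted.

```python
import torch
import torch.nn as nn

class LocalGlobalExtractor(nn.Module):
    """Produce M part-level local features and one global feature from E_i."""

    def __init__(self, channels: int, num_parts: int):
        super().__init__()
        # Assumption: attention maps come from a 1x1 convolution over E_i.
        self.attention = nn.Conv2d(channels, num_parts, kernel_size=1)

    def forward(self, feat_map: torch.Tensor):
        # feat_map: (B, C, H, W), the holistic deep feature E_i (channels first).
        b, c, h, w = feat_map.shape
        attn = torch.sigmoid(self.attention(feat_map))      # (B, M, H, W)
        attn_flat = attn.flatten(2)                          # (B, M, H*W)
        feat_flat = feat_map.flatten(2).transpose(1, 2)      # (B, H*W, C)
        # Attention-weighted average pooling for each of the M parts.
        local = torch.bmm(attn_flat, feat_flat) / (h * w)    # (B, M, C)
        glob = feat_map.mean(dim=(2, 3))                     # (B, C) global average pooling
        return local, glob

# Example: a ResNet-style feature map with C=2048 channels, 7x7 spatial size, M=4 parts.
extractor = LocalGlobalExtractor(channels=2048, num_parts=4)
local_feats, global_feat = extractor(torch.randn(2, 2048, 7, 7))
print(local_feats.shape, global_feat.shape)  # torch.Size([2, 4, 2048]) torch.Size([2, 2048])
```

In ExchNet, the local features obtained this way are further refined and constrained by the spatial and channel diversity losses before entering the alignment and hashing modules; the sketch only shows the basic attention-and-pool step.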