194 Q. Cui et al.

It is worth mentioning that the attention module is engaged in the middle of the feature extractor, since, in the shallow layers of deep neural networks, low-level context information (e.g., colors and edges) is well preserved, which is crucial for distinguishing the subtle visual differences of fine-grained objects. Then, by feeding $E_i$ into the attention generation module, $M$ attention maps $A_i \in \mathbb{R}^{M \times H \times W}$ are generated, and we use $A_i^j \in \mathbb{R}^{H \times W}$ to denote the attentive region of the $j$-th ($j \in \{1, \ldots, M\}$) part cue for $x_i$. After that, the obtained part-level attention map $A_i^j$ is multiplied element-wise with $E_i$ to select the attentive local feature corresponding to the $j$-th part, which is formulated as:

$$\hat{E}_i^j = E_i \otimes A_i^j, \qquad (1)$$

where $\hat{E}_i^j \in \mathbb{R}^{H \times W \times C}$ represents the $j$-th attentive local feature of $x_i$, and "$\otimes$" denotes the Hadamard product applied on each channel. For simplicity, we use $\hat{E}_i = \{\hat{E}_i^1, \ldots, \hat{E}_i^M\}$ to denote the set of local features; subsequently, $\hat{E}_i$ is fed into the Local Features Refinement (LFR) network, composed of a stack of convolution layers, to embed these attentive local features into higher-level semantics:

$$F_i = f_{\mathrm{LFR}}(\hat{E}_i), \qquad (2)$$

where the output $F_i = \{F_i^1, \ldots, F_i^M\}$ represents the final local feature maps w.r.t. high-level semantics. We denote by $f_i^j \in \mathbb{R}^{C'}$ the local feature vector obtained by applying global average pooling (GAP) on $F_i^j \in \mathbb{R}^{H' \times W' \times C'}$:

$$f_i^j = f_{\mathrm{GAP}}(F_i^j). \qquad (3)$$

On the other hand, for the global feature extractor, given $x_i$ we directly adopt a Global Features Refinement (GFR) network composed of conventional convolutional operations to embed $E_i$:

$$F_i^{\mathrm{global}} = f_{\mathrm{GFR}}(E_i). \qquad (4)$$

We use $F_i^{\mathrm{global}} \in \mathbb{R}^{H' \times W' \times C'}$ and $f_i^{\mathrm{global}} \in \mathbb{R}^{C'}$ to denote the learned global feature and the corresponding holistic feature vector after GAP, respectively.
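To make the part-selection pipeline of Eqs. (1)–(3) concrete, the following is a minimal NumPy sketch. All shapes are illustrative assumptions, the attention maps are random stand-ins for the attention generation module, and the LFR network (a stack of conv layers in the paper) is mocked by an identity function:

```python
# Sketch of Eqs. (1)-(3): select M attentive local features via
# per-channel Hadamard products, then pool each into a vector.
import numpy as np

H, W, C, M = 8, 8, 16, 4                      # illustrative sizes
rng = np.random.default_rng(0)

E_i = rng.standard_normal((H, W, C))          # mid-level feature map E_i
A_i = rng.random((M, H, W))                   # M attention maps A_i^j (stand-ins)

# Eq. (1): E_hat^j = E_i (x) A_i^j -- each attention map is broadcast
# over the channel dimension (Hadamard product on each channel)
E_hat = A_i[:, :, :, None] * E_i[None]        # (M, H, W, C)

# Eq. (2): refine with the LFR network (identity stand-in here)
f_LFR = lambda x: x
F_i = f_LFR(E_hat)

# Eq. (3): global average pooling over the spatial dimensions
f_i = F_i.mean(axis=(1, 2))                   # (M, C) local feature vectors
print(f_i.shape)                              # (4, 16)
```

The broadcast `A_i[:, :, :, None] * E_i[None]` is exactly the per-channel Hadamard product: every channel of $E_i$ is re-weighted by the same spatial attention map.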
Furthermore, to facilitate the learning of local feature cues (i.e., capturing fine-grained parts), we impose spatial diversity and channel diversity constraints on the local features in $F_i$. Specifically, a natural choice for increasing the diversity of local features is to differentiate the distributions of the attention maps [40]. However, this can cause a problem: under over-applied constraints on the learned attention maps, the holistic feature may fail to activate at some spatial positions even though the attention map has large activation values there. Instead, in our method, we design and apply the constraints on the local features themselves. Concretely, for the local feature $F_i^j$, we obtain its "aggregation map" $\hat{A}_i^j \in \mathbb{R}^{H' \times W'}$ by adding all $C'$ feature maps along the channel dimension and applying the softmax function on it for
ExchNet: A Unified Hashing Network for Large-Scale Fine-Grained Retrieval 195

converting it into a valid distribution; we then flatten it into a vector $\hat{a}_i^j$. Based on the Hellinger distance, we propose a spatial diversity induced loss:

$$\mathcal{L}_{sp}(x_i) = 1 - \frac{1}{\sqrt{2}\binom{M}{2}} \sum_{l,k=1}^{M} \left\| \hat{a}_i^l - \hat{a}_i^k \right\|_2, \qquad (5)$$

where $\binom{M}{2}$ denotes the combinatorial number of ways to pick 2 unordered outcomes from $M$ possibilities. The spatial diversity constraint drives the aggregation maps to be activated at spatial positions as diverse as possible.

As to the channel diversity constraint, we first convert the local feature vector $f_i^j$ into a valid distribution, which can be formulated by

$$p_i^j = \mathrm{softmax}(f_i^j), \qquad \forall j \in \{1, \ldots, M\}. \qquad (6)$$

Subsequently, we propose a constraint loss over $\{p_i^j\}_{j=1}^{M}$:

$$\mathcal{L}_{cp}(x_i) = \left[\, t - \frac{1}{\sqrt{2}\binom{M}{2}} \sum_{l,k=1}^{M} \left\| p_i^l - p_i^k \right\|_2 \right]_+, \qquad (7)$$

where $t \in [0, 1]$ is a hyper-parameter adjusting the desired diversity and $[\cdot]_+$ denotes $\max(\cdot, 0)$. Equipping the network with the channel diversity constraint helps it suppress redundancies in features along the channel dimension. Overall, our spatial diversity and channel diversity constraints work collaboratively to obtain discriminative local features.

3.2 Learning to Align by Local Feature Exchanging

On top of the representation learning module, alignment of the local features is necessary to ensure that they represent, and more importantly correspond to, common fine-grained parts across images, which is essential for fine-grained tasks. Hence, we propose an anchor-based local feature alignment approach assisted by our feature exchanging operation.

Intuitively, local features from the same object part (e.g., the heads of birds of the same species) should be embedded with almost the same semantic meaning. As illustrated in Fig. 3, our key idea is that, if the local features were well aligned, exchanging the features of identical parts between two input images belonging to the same sub-category should not change the generated hash codes.
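Returning to the diversity constraints of Sect. 3.1, the two losses in Eqs. (5)–(7) can be sketched in NumPy as below. This is a minimal sketch, not the paper's implementation: summing over unordered pairs $l < k$ with the $\binom{M}{2}$ normalizer is an assumption, and all tensor shapes are illustrative:

```python
# Sketch of the spatial (Eq. 5) and channel (Eq. 7) diversity losses.
import numpy as np
from itertools import combinations
from math import comb, sqrt

def softmax(x):
    e = np.exp(x - x.max())                    # numerically stable softmax
    return e / e.sum()

def spatial_diversity_loss(F):
    """F: (M, Hp, Wp, Cp) local feature maps -> scalar loss, Eq. (5)."""
    # Aggregation maps: sum over channels, softmax over flattened positions
    a_hat = [softmax(F[j].sum(axis=-1).ravel()) for j in range(F.shape[0])]
    M = len(a_hat)
    pair = sum(np.linalg.norm(a_hat[l] - a_hat[k])
               for l, k in combinations(range(M), 2))
    return 1.0 - pair / (sqrt(2) * comb(M, 2))

def channel_diversity_loss(f, t=0.5):
    """f: (M, Cp) local feature vectors -> scalar loss, Eq. (7)."""
    p = [softmax(f[j]) for j in range(f.shape[0])]         # Eq. (6)
    M = len(p)
    pair = sum(np.linalg.norm(p[l] - p[k])
               for l, k in combinations(range(M), 2))
    return max(t - pair / (sqrt(2) * comb(M, 2)), 0.0)     # [.]_+ hinge

rng = np.random.default_rng(0)
L_sp = spatial_diversity_loss(rng.standard_normal((4, 8, 8, 16)))
L_cp = channel_diversity_loss(rng.standard_normal((4, 16)))
print(round(L_sp, 4), round(L_cp, 4))
```

Since the L2 distance between two distributions is at most $\sqrt{2}$, the $1/\sqrt{2}$ factor bounds each normalized pairwise term in $[0, 1]$, so $\mathcal{L}_{sp} \in [0, 1]$ and the hinge in $\mathcal{L}_{cp}$ only fires when the average pairwise distance falls below $t$.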
Inspired by this, we propose a local feature alignment strategy that leverages a feature exchanging operation taking place between learned local features and anchored local features. As a foundation for feature exchanging, a set of dynamic anchored local features $C_{y_i} = \{c_{y_i}^1, \ldots, c_{y_i}^M\}$ should be maintained for each class $y_i$, in which the $j$-th anchored local feature $c_{y_i}^j$ is obtained by averaging the $j$-th part's local features over all training samples of class $y_i$. At the end of each training epoch, the anchored local features are recalculated and updated. Subsequently, as shown in Fig. 4,