An improved deep hashing model for image retrieval with binary code similarities



Introduction
The advent of information technology has led to an exponential increase in data accumulation across a multitude of domains [1]. For example, Facebook users upload over a billion images each month, and the daily generation of log files approximates 300 TB. Furthermore, the duration of videos uploaded by YouTube users in the past month surpasses the total video content broadcast by ABC, NBC, and CBS since 1948. This surge in big data presents a formidable challenge to both academia and industry: how to effectively and efficiently harness such a vast amount of data to extract and analyze valuable information or knowledge for various applications [2]. This challenge has catalyzed an increasing interest in information retrieval, i.e., the development of novel techniques to explore and analyze big data effectively and efficiently [1,3]. These techniques have demonstrated potential advantages in assisting businesses and scholars to extract valuable insights from big data across a wide range of real-world applications, from healthcare [4], privacy protection [5], and financial analysis [6] to image retrieval and natural language processing [7], among others.
It is important to note that the mere collection and storage of massive amounts of data are not the ultimate objectives. The full potential benefits of big data should be exploited to address specific real-world problems. However, as the scale of data increases exponentially, the expansion of useful information underlying big data is relatively mild. This implies that the density of information or knowledge deeply embedded in big data is extremely low, leading to a scenario characterized by data bloom but knowledge scarcity [8].
The extraction of interesting or valuable information from large-scale data is a challenging yet fundamental task in big data analysis and information retrieval [8]. Given a query, it can be relatively straightforward to precisely identify its proximate or similar objects from a small-scale and low-dimensional data collection using the technique of k nearest neighbors (kNN), which is a classic and popular neighbor search technique in information retrieval. However, as the scale of data increases, locating desirable information precisely using traditional search techniques becomes infeasible. For instance, the efficiency of kNN deteriorates dramatically even on a mid-scale data collection, even though its computational complexity is linear. This greatly limits its practical applications. In many scenarios, approximate search methods are preferred over exact search techniques as they offer high efficiency and robustness to large-scale data without significantly compromising retrieval performance.
Hash learning is a representative and popular approximate search technique for big data. It primarily transforms high-dimensional data into binary representations via projection technology [8]. In the context of binary representations, the storage cost of big data can be significantly reduced, making it possible to store big data within the main memory. Moreover, the objective of neighbor search can be achieved by computing the distance or similarity between binary codes using bit operations such as XOR and POPCNT, thereby making the search process extremely fast. Owing to these advantages, hash learning has attracted increasing attention in various domains, including image retrieval, information retrieval, and natural language processing [9]. Over the past decades, numerous hashing methods have been developed. Generally, they can be categorized into two types: data-oblivious (also known as data-independent) and data-aware (also known as data-dependent) hashing techniques [9].
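To make the bit-operation argument concrete, the following minimal sketch computes Hamming distances between binary codes with XOR followed by a population count. It assumes that each code is packed into a Python integer; the function names and the toy database are illustrative only.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two binary codes packed into Python ints."""
    # XOR leaves a 1 exactly where the two codes differ;
    # counting those 1s (a population count) gives the distance.
    return bin(code_a ^ code_b).count("1")


def nearest_codes(query: int, database: list, top_k: int = 3) -> list:
    """Return the indices of the top_k database codes closest to the query."""
    order = sorted(range(len(database)),
                   key=lambda i: hamming_distance(query, database[i]))
    return order[:top_k]


# Toy 8-bit database (illustrative values only).
database = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
query = 0b10110110

print([hamming_distance(query, code) for code in database])
print(nearest_codes(query, database, top_k=2))
```

On real hardware the same computation maps onto one XOR and one POPCNT instruction per machine word, which is why ranking by Hamming distance is so cheap compared with floating-point distance computations.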
Data-oblivious hashing techniques project data objects from the Euclidean space into binary representations in the Hamming space in a straightforward manner, while data-aware hashing techniques derive binary representations based on the inherent properties of data objects [9]. Notable examples of the two kinds are LSH (Locality-Sensitive Hashing) [10] and ITQ (ITerative Quantization) [11], respectively. Notably, both construct retrieval models based on handcrafted features, which greatly constrains their retrieval performance.
In contrast, deep hashing leverages semantic features of data to construct retrieval models using deep neural networks, such as CNN, AlexNet, ResNet, and BERT [12,13].
Deep neural networks are known to effectively extract high-level and rich semantic features from data in an unsupervised manner [14,15]. For instance, Fig. 1 visualizes the feature map (i.e., the output) of the fifth layer of a convolutional neural network. It is evident that the extracted features effectively capture sketch information of images. This suggests that they contain rich semantic information, which cannot be fully captured by handcrafted features [16]. Owing to this fact, deep hashing usually outperforms conventional hashing models and can derive more compact binary codes [17].
Deep hashing has emerged as a new trend, and a variety of deep hashing techniques have been developed [13]. Prominent deep hashing methods include CNNH (Convolutional Neural Network Hashing) [18], DPSH (Deep Pairwise Supervised Hashing) [19] and DHN (Deep Hashing Network) [20]. For instance, UDQH-IQ (Unsupervised Deep Quadruplet Hashing with Isometric Quantization) [21] uses the quadruplet-based loss as the input of a deep network to explore the underlying semantic similarity of images and employs Hamming-isometric quantization to maximize the consistency of semantic similarity. SPL-UDH (Soft-Pseudo-Label-based Unsupervised Deep Hashing) [22] utilizes an auto-encoder to derive soft pseudo-labels and local similarities of images simultaneously. Based on these, inter- and intra-cluster similarities of images can be further learned by a deep hashing network. Despite the popularity of deep hashing, there are limitations that require further exploration and more endeavors. For example, early deep hashing techniques separate deep feature learning and binary code generation. Moreover, unsupervised deep hashing does not fully consider semantic information. Although supervised ones take label and semantic similarity information into consideration when generating binary codes, they do not consider semantic structures in Hamming space [17].
Fig. 1 The visualization of the feature map output by the fifth layer of a convolutional neural network
In this work, we propose a novel end-to-end deep hashing model for image retrieval with binary code similarities, dubbed CSDH, to address the problems above. It extracts deep features to capture semantic structural information of images using a pre-trained deep convolutional neural network (CNN). To generate binary codes, a hidden and fully connected layer is attached at the end of the deep network. This hidden layer is used to construct hash functions and is referred to as a hash layer, where the activation status of each unit serves as a hash bit. Unlike deep hashing models that directly use a classification layer to generate binary codes, our hashing model with the hash layer can better preserve the semantic similarities of images because the hash layer can capture high-level semantic information related to labels. Furthermore, to enhance the consistency of similarity preservation, the embedding distances of images in Hamming space are incorporated into the loss function. This embedding property ensures that the generated binary codes are of higher quality and possess greater capability.
To sum up, the main contributions of this work are briefly highlighted as follows:
• The proposed model, CSDH, is an end-to-end deep hashing model for image retrieval that uses a pre-trained deep convolutional neural network (CNN) to extract deep features and capture semantic structural information of images.
• A hidden and fully connected layer, referred to as a hash layer, is attached at the end of the deep network to generate binary codes. This approach better preserves the semantic similarities of images compared to models that directly use a classification layer; that is, images with more similar codes have a higher probability of belonging to the same category.
• The model incorporates the embedding distances of images in Hamming space into the loss function to enhance the consistency of similarity preservation, resulting in higher quality binary codes with greater capability.
The remainder of the paper is organized as follows. The next section briefly discusses related studies about hash learning. The proposed hashing method for image retrieval is then presented in detail, followed by the experimental results and discussions. Finally, the conclusion and future studies are given.

Related work
Hash learning, with its many potential advantages including high efficiency and low storage cost, has quickly become a leading technique in image retrieval and big data analysis. A multitude of hash learning algorithms have been proposed to date. These hash learning techniques can be grouped into different categories, such as supervised and unsupervised hashing, or data-oblivious and data-aware (also known as data-independent and data-dependent, respectively) hashing, based on their properties or perspectives [13,17]. Depending on which type of features is adopted, hashing models can also be broadly classified into shallow hashing and deep hashing. Shallow hashing techniques rely solely on handcrafted features for the training of hashing models and the generation of binary codes. One of the most notable examples is Locality-Sensitive Hashing (LSH), which encodes data into a binary representation via random projections [10]. LSH is extremely efficient, but the retrieved neighbors are random to some extent and may not be exact neighbors. To this end, Iterative Quantization (ITQ) generates binary codes based on the principal components of data [11], while Spectral Hashing (SH) seeks the eigenvectors of the graph Laplacian to achieve data projection [23]. Unfortunately, the similarity structure of data may not be preserved in ITQ, and the optimization problem of SH is NP-hard. Kernelized Supervised Hashing (KSH) employs kernel functions to handle non-linear data when designing hash functions [24]. Although it requires less supervised information, its performance heavily relies on kernel functions, which makes model training cumbersome. Fast Supervised Hashing (FastH) utilizes boosting trees, a simple yet effective regression of the class labels, to tackle the high-dimensional problem [25], and Robust Supervised Discrete Hashing (RSDH) employs the Cauchy loss to measure the error of label matrix decomposition, thereby enhancing model robustness [26]. However, large quantization errors and suboptimal solutions may inevitably be induced when relaxing the discrete constraint on hash codes. Moreover, as previously discussed, the shallow hashing algorithms operate on handcrafted features, which are designed for specific tasks during data collection and carry little semantic information. Thus, the performance of the constructed hashing models is limited, even though some of them take label information into account.
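As an illustration of the data-oblivious family discussed above, the sketch below encodes feature vectors with sign random projections, one common LSH variant. It is only a sketch under stated assumptions: the hash family used in [10] may differ, and the function name, the Gaussian projections, and the toy data are illustrative choices.

```python
import numpy as np


def lsh_codes(features, num_bits, seed=0):
    """Encode rows of `features` into {-1, 1} codes via random hyperplanes."""
    rng = np.random.default_rng(seed)
    # Each column is the normal vector of one random hyperplane (one hash bit).
    projections = rng.standard_normal((features.shape[1], num_bits))
    # The bit records on which side of the hyperplane the sample falls.
    return np.where(features @ projections >= 0, 1, -1).astype(np.int8)


rng = np.random.default_rng(42)
features = rng.standard_normal((5, 32))     # 5 samples with 32 handcrafted features
codes = lsh_codes(features, num_bits=16)
print(codes.shape)
```

Because the hyperplanes are drawn without looking at the data, nearby samples agree on most bits only in expectation, which is the randomness in the retrieved neighbors noted above.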
In contrast, deep hashing leverages deep features extracted by deep neural networks, such as CNN, VGG, AlexNet, ResNet, and BERT, to construct hashing models and generate binary codes [12,13]. These deep features encapsulate rich semantic and structural information, exhibiting strong discriminative capabilities. As a result, deep hashing has been extensively studied and widely used in image retrieval. Representative examples of deep hashing include CNNH [18], DPSH [19], DHN [20] and DQN (Deep Quantization Network) [27]. Although these deep hashing models achieve competitive performance, they still have limitations that need to be addressed. For example, CNNH cannot handle images with different scales and positions. DPSH requires a large amount of labeled data to train hashing models, while the optimal policy of DQN is non-deterministic.
Recently, HashSIM [28] guides the generation of binary codes by using semantically structural similarities derived from highly confident images. Besides, the independent property of hash bits has also been considered. However, it likely requires a great number of highly confident images, which are difficult to obtain in reality. Cui et al. [29] first extracted binarized representation embeddings of data via metric learning, then constructed a hashing model using a group similarity preservation strategy. A limitation of this hashing model is that when the length of binary codes is extremely short, the discriminability of deep representations is relatively poor. DUDH (Deep Uncoupled Discrete Hashing) [30] adopts a similarity-transfer matrix to bridge the gap between query and image similarities, thereby reducing quantization error and preserving image semantic similarities. However, only the process of preserving similarity takes the quantization error into account, resulting in limited improvement of retrieval performance. It should be pointed out that the unsupervised hashing models above, despite enhancing retrieval performance to some extent, have not considered label information.
Supervised deep hashing methodologies, which leverage both deep features and expert-provided labels as semantic information to help the generation of binary codes, have been demonstrated to outperform their unsupervised counterparts in retrieval performance, thereby garnering significant interest in the field of image retrieval. Noteworthy supervised hashing algorithms include DSHTL (Deep Supervised Hashing with Triplet Labels) [31] and DSDH (Deep Supervised Discrete Hashing) [32]. DSHTL maximizes the likelihood of triplet similarities of labels to learn deep features and binary codes simultaneously. However, obtaining the triplet similarities of labels is non-trivial. DSDH harnesses pairwise supervised information to directly extract deep features and promote the generation of discrete codes.
DDH-LDL (Deep Discrete Hashing for Label Distribution Learning) [33] incorporates label distribution learning to model implicit semantic relationships, which are subsequently preserved through message aggregation operations on a graph convolutional network. As we know, the feature distribution may be inevitably distorted by the aggregation operations, leading to a decline in retrieval performance. DAHP (Deep Attention-Guided Hashing With Pairwise Labels) [34] utilized anchors as supervised information to extract the contextual information of features, thereby enhancing the representational capacity of the hashing model constructed on the ResNet with position and channel attention mechanisms. Although the attention-guided hash codes can instruct the training of the hashing network, they may contain repetitive and highly correlated information. Hu et al. [35] employed the cosine similarities of images to preserve semantic distributions and utilized cosine-distance entropy to mitigate quantization errors for imbalanced data. However, local features had not been considered when learning the contextual information. Much more recently, SPL-UDH (Soft-Pseudo-Label-based Unsupervised Deep Hashing) [36] obtains binary representations by performing Bayesian theory on local similarities and soft pseudo-labels, which are derived by a deep auto-encoder network. PLDH (Pseudo-Labels Deep Hashing) [37] also exploits pseudo-labels extracted by a deep neural network to guide the generation of hash codes. It is worth noting that pseudo-labels are not true labels and may contain inconsistent semantic information. Moreover, existing supervised algorithms primarily focus on the semantic information in Euclidean space, neglecting that in Hamming space.

Problem statement
Assume that $X = \{(x_i, y_i)\}_{i=1}^{n}$ is an image (or data) collection comprising $n$ images (or data objects), where $x_i \in \mathbb{R}^d$ ($i = 1..n$) is the $i$-th image, represented as a vector of $d$ dimensions. The vector $y_i \in \{0, 1\}^l$ refers to the label information corresponding to the $i$-th image $x_i$, where $l$ is the number of labels. $X$ is called a single-label or normal collection if there is only one label marked to $x_i$ ($i = 1..n$); that is, for each label vector $y_i$ of $x_i$, we have $\sum_{j=1}^{l} y_{ij} = 1$. Otherwise, $X$ is a multi-label collection for supervised learning. Let $h(X)$ be a hash function of $X$. Mathematically, it is defined as follows:
$$h(X) = b, \quad b \in \{0, 1\},$$
where $b$ is a binary value. From the definition, we know that the hash function $h(X)$ encodes the image $X$ into a binary value, i.e., 0 or 1, which is also called a hash bit in the literature. If we have $m$ hash functions $h_i$ ($i = 1..m$) and perform them on $X$, we obtain $m$ binary values (i.e., hash bits) $b_i$ ($i = 1..m$). In this case, the image $X$ can be transformed to a vector of binary representations $b = [b_1, b_2, .., b_m]$. Hash learning aims to construct a variety of hash functions $H = \{h_1, h_2, .., h_m\}$, so that the images in $X$ can be represented as binary vectors $B = \{b_i\}_{i=1}^{n}$, where $b_i$ is the binary representation of $x_i$, after the hash functions $H$ are performed on $X$. Generally, the number of binary values is far less than the number of image dimensions, i.e., $m \ll d$. From this perspective, hash learning can effectively benefit big data analysis in the aspects of storage cost and computational efficiency. For the sake of discussion, hereafter the binary values are represented as −1 and 1, rather than 0 and 1; that is, $B \in \{-1, 1\}^{n \times m}$.
Generally, the objective function of hash learning is formally represented as follows.
$$\min_{h \in H} \ell_h(X, b),$$
where $b = h(X)$ and $\ell_h(X, b)$ is the quantization error of $X$ after the hash function $h$ is applied. This definition implies that the error should be minimized when learning the hash functions $H$; that is, the information loss should be as small as possible.

Model architecture
It is still a formidable challenge to learn and construct effective hashing functions. A multitude of techniques for constructing hash functions have been proposed, including Locality-Sensitive Hashing (LSH), which employs projection techniques to randomly generate $H$, and Iterative Quantization (ITQ), which utilizes the principal components of $X$ as $H$. However, these shallow hashing methods do not take into account the inherent properties of the data, resulting in the relatively poor quality of binary codes generated by them. Furthermore, these methods rely solely on handcrafted features, which contain limited semantic information, making their retrieval performance less competitive. In contrast, deep hashing techniques extract deep features from data, encapsulating rich semantic and structural information for hash function generation. Consequently, deep hashing methods typically outperform their shallow counterparts.
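For readers unfamiliar with ITQ, the following sketch shows how principal components can serve as hash functions, followed by the alternating rotation refinement that gives the method its name. It is a simplified illustration rather than the reference implementation of [11]; the function name, the iteration count, and the toy data are assumptions.

```python
import numpy as np


def itq_codes(features, num_bits, num_iters=20, seed=0):
    """Simplified ITQ: PCA projection, then alternate between codes and rotation."""
    rng = np.random.default_rng(seed)
    centered = features - features.mean(axis=0)
    # PCA via SVD: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:num_bits].T               # shape (n, num_bits)
    # Start from a random orthogonal rotation.
    rotation, _ = np.linalg.qr(rng.standard_normal((num_bits, num_bits)))
    errors = []
    for _ in range(num_iters):
        # Fix the rotation and update the codes by taking signs.
        codes = np.where(projected @ rotation >= 0, 1.0, -1.0)
        errors.append(float(np.linalg.norm(codes - projected @ rotation)))
        # Fix the codes and update the rotation (orthogonal Procrustes problem).
        u, _, wt = np.linalg.svd(projected.T @ codes)
        rotation = u @ wt
    codes = np.where(projected @ rotation >= 0, 1, -1).astype(np.int8)
    return codes, errors


rng = np.random.default_rng(1)
features = rng.standard_normal((200, 16))
codes, errors = itq_codes(features, num_bits=8)
print(codes.shape, errors[0] >= errors[-1])
```

Each of the two alternating steps can only lower the quantization error between the rotated projections and their signs, so the recorded error sequence is non-increasing.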
In this work, we introduce a novel end-to-end deep hashing method, termed Code Similarity-based Deep Hashing (CSDH), for image retrieval. CSDH employs a tailored deep convolutional neural network to represent images and extract deep features for binary code generation. The specific framework of CSDH, illustrated in Fig. 2, serves two primary objectives: data representation and similarity preservation. The former extracts high-level semantic features from images, while the latter ensures that the semantic similarities of images can be preserved when generating corresponding binary codes. In other words, the embedded similarities of images should be preserved and consistent when transitioning from Euclidean space to Hamming space.
Specifically, we utilize a tailored AlexNet architecture, a renowned eight-layer Convolutional Neural Network (CNN), as the backbone of CSDH to extract high-level semantic features from images. Owing to its superior performance, AlexNet has grown rapidly in popularity since its introduction. Indeed, the features captured by the AlexNet network contain significantly more semantic information than manually designed features. Traditionally, the architecture of AlexNet comprises five convolutional layers and two fully connected layers, along with one prediction layer. In this work, we use the AlexNet network to represent images and extract their deep features for discrimination.
To achieve the purpose of hashing, we further customize AlexNet by inserting a hidden hash layer between the second fully connected layer and the prediction layer. This hash layer takes the output of the second fully connected layer as input and outputs binary values by transforming continuous values into binary ones via a given quantization function. The hash layer contains $k$ units, where $k$ refers to the length of the desired binary codes. Each unit within the hash layer is assumed to be associated with a latent attribute, which will be used to determine the ultimate category of images. If a unit is activated, its output is 1; otherwise, its output is −1. Consequently, each image can be represented as a binary code with $k$ bits based on the activation status of the $k$ units. For two similar images, we expect their binary codes to be similar or proximate after the binary representation or activation operations are performed; conversely, if they are dissimilar, their corresponding codes should also be dissimilar or far from each other in Hamming space. With this kind of similarity preservation, the probability that two images belong to the same category becomes higher if their binary codes are similar.
Notably, information is inevitably lost during the quantization process. To mitigate quantization errors and preserve similarities, selecting an appropriate activation function for the hash layer plays a crucial role. By now, a plethora of activation functions, such as Tanh, Sigmoid, ReLU, Softmax, and Swish, among others, have been proposed and are readily available in the literature [38]. Considering gradient vanishing and smoothness, we take the softsign function, whose definition is given as follows [38], as the activation function for the hash layer.
$$\mathrm{softsign}(x) = \frac{x}{1 + |x|}, \tag{3}$$
where $x$ is a continuous value.
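The behaviour of the hash layer can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the parameter names `weights` and `bias`, the layer sizes, and the thresholding at zero are assumptions made for the example, while the softsign function and its derivative follow the standard definition.

```python
import numpy as np


def softsign(x):
    """Softsign activation: maps any real value into the open interval (-1, 1)."""
    return x / (1.0 + np.abs(x))


def softsign_grad(x):
    """Derivative of softsign; it decays polynomially rather than exponentially."""
    return 1.0 / (1.0 + np.abs(x)) ** 2


def hash_layer(fc_output, weights, bias):
    """Hypothetical hash layer: affine map, softsign relaxation, then sign.

    fc_output: (n, d) output of the second fully connected layer
    weights:   (d, k) hash-layer weights (assumed name)
    bias:      (k,)   hash-layer bias (assumed name)
    """
    relaxed = softsign(fc_output @ weights + bias)   # continuous values in (-1, 1)
    codes = np.where(relaxed >= 0, 1, -1)            # activated unit -> 1, otherwise -1
    return relaxed, codes


rng = np.random.default_rng(0)
fc_output = rng.standard_normal((4, 16))             # 4 images, 16-d features (toy sizes)
relaxed, codes = hash_layer(fc_output, rng.standard_normal((16, 12)), np.zeros(12))
print(codes.shape)
```

Compared with Tanh, whose gradient shrinks exponentially for large inputs, the softsign gradient shrinks only quadratically, which is one reading of the gradient-vanishing argument made above.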

Loss function
The final layer of AlexNet is utilized to predict the category to which an image belongs via a loss function. From this perspective, loss functions play a pivotal role in deep learning as they directly influence the semantic information of deep features and the prediction performance of deep neural networks. To generate high-quality binary codes, here we adopt a cross-entropy loss, in conjunction with a Hamming-embedding loss, for the prediction layer of the customized AlexNet network.
The cross-entropy loss quantifies the degree of divergence between two probability distributions. Given two distinct probability distributions of a feature (a.k.a. variable) $X$, denoted as $p(x)$ and $q(x)$, their cross-entropy $H(p, q)$ is formally defined as:
$$H(p, q) = -\sum_{x} p(x) \log q(x).$$
From this definition, it can be inferred that if two distributions are similar or proximate, their cross-entropy is small. Based on this rationale, we incorporate this concept into our loss function to predict the category labels of images. Specifically, let $X$ be a normal image collection, where each image is tagged with a single label. For the $i$-th image $x_i \in X$, the output of the prediction layer is the affine map $W b_i + v$, where $W \in \mathbb{R}^{l \times k}$ is the weight matrix of the prediction layer, $b_i$ is the output of the hash layer for $x_i$, and $v \in \mathbb{R}^l$ is the bias of the prediction layer. Thus, the cross-entropy loss of the neural network is the cross-entropy between the label vector $y_i$ and the class distribution predicted for $x_i$, accumulated over all images.
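A minimal numerical sketch of the single-label case is given below. It assumes that the prediction layer output $W b_i + v$ is turned into a class distribution with a softmax and that the loss is averaged over the images; both choices, as well as all function names, are assumptions for illustration.

```python
import numpy as np


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_x p(x) log q(x) of two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def single_label_loss(hash_outputs, labels_onehot, W, v):
    """Mean cross-entropy between one-hot labels and softmax(W b_i + v)."""
    logits = hash_outputs @ W.T + v          # (n, k) @ (k, l) -> (n, l)
    probs = softmax(logits)
    per_image = -np.sum(labels_onehot * np.log(probs + 1e-12), axis=1)
    return float(per_image.mean())


hash_outputs = np.ones((3, 4))               # n = 3 images, k = 4 hash units
labels = np.eye(3)                           # l = 3 classes, one label per image
loss = single_label_loss(hash_outputs, labels, np.zeros((3, 4)), np.zeros(3))
print(round(loss, 4))
```

With all-zero weights the predicted distribution is uniform, so the loss equals $\log l$; training lowers it by moving probability mass toward the labelled class.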
In a similar vein, if $X$ is a multi-label image collection, where each image can be associated with multiple labels, we can treat the prediction layer as $l$ independent binary classifiers. In this case, the cross-entropy loss is the sum of the binary cross-entropy losses of the $l$ classifiers.
As the cross-entropy loss only concerns the prediction performance of deep features, it alone cannot guarantee the quality of binary codes generated by the hashing model. As discussed earlier, the property of similarity preservation is also crucial for the generation of binary codes. To achieve this purpose, we also take Hamming embedding into consideration when constructing the hashing model. Assume $b_i$ and $b_j$ are two binary codes derived from the hash layer for the images $x_i$ and $x_j$, respectively. The Hamming distance between them is
$$\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}\left(k - b_i^{\top} b_j\right),$$
where $k$ is the length of binary codes. According to the definition, the Hamming distance decreases linearly as the cosine similarity of the two codes increases, since $b_i^{\top} b_j = k \cos(b_i, b_j)$ for codes in $\{-1, 1\}^k$. This property can be exploited to measure the similarity of binary codes. In Hamming space, two binary codes, $b_i$ and $b_j$, are considered to be semantically similar if $\mathrm{dist}_H(b_i, b_j)$ is small enough; otherwise, they are semantically dissimilar to each other. Let $s_{ij}$ be a semantic label between $b_i$ and $b_j$, where $s_{ij} = 1$ when $b_i$ is semantically similar to $b_j$; otherwise, $s_{ij} = 0$. In this context, this kind of Hamming embedding can also be used to estimate the quality of binary codes.
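The relation between the Hamming distance and the inner product of codes in $\{-1, 1\}^k$ can be checked numerically. The short sketch below compares the inner-product form with a direct count of differing bits; the function names and the random codes are illustrative.

```python
import numpy as np


def hamming_from_inner_product(b_i, b_j):
    """Hamming distance of two {-1, 1} codes computed as (k - <b_i, b_j>) / 2."""
    k = b_i.shape[0]
    return (k - int(b_i @ b_j)) // 2


def hamming_by_counting(b_i, b_j):
    """Reference implementation: count the positions where the codes differ."""
    return int(np.sum(b_i != b_j))


rng = np.random.default_rng(7)
b_i = rng.choice(np.array([-1, 1]), size=48)
b_j = rng.choice(np.array([-1, 1]), size=48)

cosine = float(b_i @ b_j) / 48               # both codes have norm sqrt(k)
print(hamming_from_inner_product(b_i, b_j), hamming_by_counting(b_i, b_j), cosine)
```

Each agreeing bit contributes +1 and each differing bit contributes −1 to the inner product, which is why the two computations coincide and why a larger cosine value means a smaller Hamming distance.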
Suppose that $S \in \{0, 1\}^{n \times n}$ is the semantic pairwise similarity of binary codes, where each entry $s_{ij} \in S$ denotes the semantic label between $b_i$ and $b_j$. The Hamming embedding loss refers to the sum of the pairwise Hamming distances, i.e., where $|S|$ denotes the total number of semantic similarity labels. $H(b_i, b_j)$ is the Hamming embedded distance shown as follows.
According to Eq. (3), we know that the output of the hash layer is a vector of continuous values. Thus, the Hamming embedded distance above can be represented as where $h_i$ is the output vector of the hash layer for the $i$-th image $x_i$.
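One plausible form of a pairwise Hamming embedding loss on the continuous hash-layer outputs is sketched below. It should be read as an assumption rather than as the loss used in CSDH: the contrastive form (pull similar pairs together, push dissimilar pairs beyond a margin), the `margin` parameter, and the averaging over all $|S|$ entries are illustrative choices.

```python
import numpy as np


def relaxed_hamming(H):
    """Pairwise relaxed Hamming distances (k - <h_i, h_j>) / 2 for rows of H."""
    k = H.shape[1]
    return 0.5 * (k - H @ H.T)


def hamming_embedding_loss(H, S, margin):
    """Contrastive-style loss (assumed form, not taken from the paper).

    H:      (n, k) continuous hash-layer outputs in [-1, 1]
    S:      (n, n) semantic labels, 1 for similar pairs and 0 otherwise
    margin: hypothetical distance that dissimilar pairs should exceed
    """
    dist = relaxed_hamming(H)
    similar_term = S * dist                                   # small distance wanted
    dissimilar_term = (1 - S) * np.maximum(0.0, margin - dist)
    return float((similar_term + dissimilar_term).sum() / S.size)


H = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0, 1.0],
              [-1.0, -1.0, -1.0, -1.0]])
S = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
print(hamming_embedding_loss(H, S, margin=2.0))
```

In this toy case the similar pair shares an identical code and the dissimilar pairs are farther apart than the margin, so the loss vanishes; labelling every pair as similar instead would penalize the four distant pairs.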
Based on the Hamming embedded loss, the customized AlexNet network can be iteratively updated by the technique of gradient descent.For dist H (h i , h j ) , its gradient can be easily estimated.When s ij = 1 , its gradient is when h i is not semantically similar to h j , i.e., s ij = 0 , the gradient of

Algorithm details
In summary, the objective of our deep hashing model is to minimize a weighted combination of the cross-entropy loss and the Hamming embedded loss with respect to the hyper-parameters of the deep neural network, where $\gamma$ is a trade-off factor that balances the cross-entropy loss and the Hamming embedded loss.
Based on the statement above, the implementation details of the deep hashing model with binary code similarities are given as follows.

Experimental results and discussion
To evaluate the competitiveness of CSDH, we conducted a series of comparative experiments with five classical hashing algorithms and six popular deep hashing algorithms on two public image collections. This section elucidates the experimental results and discussions.

Experimental settings
Two frequently-used benchmark image datasets, CIFAR-10 and NUS-WIDE, were employed to evaluate the performance of CSDH against the baseline algorithms. The CIFAR-10 dataset comprises 60,000 color images, each of size 32 × 32 pixels, distributed evenly across ten classes. These classes involve various animals (e.g., bird, dog, cat, deer, horse and frog) and vehicles (like truck, car, airplane and ship). Each class contains 6000 images. As CIFAR-10 is a single-label dataset, each image is associated with only a single label. For our experiments, we randomly selected 100 images per class as query images and 500 images per class for training. Consequently, the training dataset comprised 5000 color images while the query dataset contained 1000 images.
NUS-WIDE is a multi-label image dataset where each image is simultaneously associated with multiple labels. It consists of 269,648 color images tagged with eighty-one class labels in total, such as dog, bird, and car. These images were collected from Flickr. Following conventional practices in the literature, we selected 195,834 images tagged with the top-21 labels from the eighty-one ground-truth labels for our experiments. Each class contained at least 5000 images. We adopted the same strategy as with CIFAR-10 to generate query and training data: 100 images were randomly selected as queries and 500 images were used as training data for each class label. As a result, the query dataset contained 2100 images while the database included 193,734 images, of which 10,500 were designated as training data.
Following traditional strategies in comparative experiments for hash learning, we considered image similarities as the ground truth in the following manner: two single-label images were deemed to be similar if they were tagged with the same label; otherwise they were considered dissimilar. For multi-label images, they were considered similar if they shared at least one class label; otherwise they were dissimilar to each other.
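The ground-truth rule above covers both the single-label and the multi-label case when labels are stored as 0/1 vectors: two images are similar exactly when their label vectors have a positive inner product. A minimal sketch, with an illustrative function name and toy labels, is:

```python
import numpy as np


def similarity_matrix(query_labels, database_labels):
    """Ground-truth similarity: 1 if two images share at least one label, else 0."""
    query_labels = np.asarray(query_labels)
    database_labels = np.asarray(database_labels)
    # The inner product of two 0/1 label vectors counts their shared labels.
    shared = query_labels @ database_labels.T
    return (shared > 0).astype(np.int8)


# Toy example with three labels: the first query is single-label,
# the second query carries two labels.
queries = [[1, 0, 0],
           [0, 1, 1]]
database = [[1, 0, 0],
            [0, 0, 1],
            [0, 1, 0],
            [1, 1, 0]]
print(similarity_matrix(queries, database))
```

For one-hot label vectors the inner product is 1 only when the two labels coincide, so the same function reproduces the single-label rule.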
Three frequently-used evaluation protocols were adopted to assess the retrieval performance of the hashing models in the experiments: mean average precision (mAP), precision, and the precision-recall curve. Let $Q = \{q_i\}_{i=1}^{t}$ be a query collection. For each query $q \in Q$, its retrieval precision refers to the ratio of the quantity of similar images to the total number of retrieval results, i.e.,
$$\mathrm{Pre}_q(R) = \frac{\sum_{r=1}^{R} \delta(\ell_r = \ell_q)}{R},$$
where $R$ is the total number of retrieval results, and $\ell_r$ denotes the label of the $r$-th retrieval result. $\delta(\cdot)$ is an indicator function: $\delta(\ell_r = \ell_q) = 1$ if the $r$-th retrieval result has the same label as the query $q$, and $\delta(\ell_r = \ell_q) = 0$ if they share no label. In our experiments, we retrieved 5,000 images for each query; that is, $R = 5000$. Based on the above formula, the mean average precision of the query set $Q$ is obtained by averaging the average precision over all $t$ queries.
The Precision-Recall curve delineates the interplay between precision and recall, two pivotal metrics in information retrieval. Precision quantifies the relevance of retrieved results, while recall measures the proportion of truly relevant results that are successfully retrieved. An optimal retrieval model is characterized by high precision, ensuring the accuracy of results, and high recall, guaranteeing the retrieval of a substantial fraction of positive results. The Precision-Recall curve, therefore, serves as a critical tool for evaluating the performance of retrieval models.
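The metrics can be sketched as follows, given for each query a 0/1 relevance list ordered by increasing Hamming distance. The average-precision normalization used here (dividing by the number of relevant results among the top $R$) is one common convention in the hashing literature and is an assumption, as are the function names and the toy relevance lists.

```python
def precision_at_r(relevant, R):
    """Fraction of the top-R retrieval results that are similar to the query."""
    return sum(relevant[:R]) / R


def average_precision(relevant, R):
    """Average of the precision values at the ranks where a relevant result occurs."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevant[:R], start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists, R):
    """Mean of the average precision over all queries."""
    return sum(average_precision(rel, R) for rel in relevance_lists) / len(relevance_lists)


# Relevance of ranked results for two toy queries (1 = same label as the query).
ranked = [[1, 0, 1, 0],
          [0, 0, 0, 0]]
print(precision_at_r(ranked[0], R=4))
print(round(mean_average_precision(ranked, R=4), 4))
```

Unlike plain precision, average precision rewards rankings that place relevant images early, which is why mAP is sensitive to the ordering induced by the Hamming distances.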
We implemented the CSDH model using PyTorch, an open-source machine learning framework. The hyperparameters of the deep neural network were fine-tuned via back-propagation. Specifically, we utilized mini-batch stochastic gradient descent as the optimizer for CSDH. Throughout the experimental process, we set the batch size, weight decay, and momentum of the optimizer to 32, 0.0005, and 0.9, respectively. Meanwhile, the learning rate was initially set to 0.001 and subsequently decayed after 40 training epochs.

Results and discussion
As we know, the mean Average Precision (mAP) is one of the most widely-used metrics for evaluating the retrieval performance of hashing models. In line with this convention, here we also adopted the mAP metric to compare the retrieval performance of CSDH and the baselines. Figures 3 and 4 present the comparison scores of mAP for 5000 query results returned by the shallow baseline models and the deep baseline models with varying quantities of hash bits, respectively.
From the experimental results presented in Figs. 3 and 4, we can conclude that the proposed hashing method exhibits competitive performance compared to the baseline models. For example, when compared to FastH, CSDH boosted the retrieval performance of mAP on the CIFAR-10 and NUS-WIDE collections by 46.77% and 17.92%, respectively. In a similar vein, the mAP scores of CSDH were higher than those of DSDH, a state-of-the-art deep hashing model, by 3.27% and 2.67% on these two image collections, respectively. Another interesting fact is that the mAP scores of supervised hashing algorithms, e.g., RSDH, KSH and FastH, were significantly higher than those of unsupervised hashing ones, e.g., SH and ITQ. This can be attributed to the fact that the class labels embody a kind of semantic information that aids hashing models in generating informative hash bits. Note that the deep hashing models were generally superior to the shallow ones, particularly on the single-label data. This is reasonable because the deep techniques leveraged high-level semantic features to train hashing models. Among the deep models, DSDH, DAHP, DSHTL and DPSH achieved comparable performance to CNNH, DHN, PLDH and DQN. This can be attributed to their use of data or label similarities to guide binary code generation. Since our hashing model, CSDH, exploited the cross-entropy information, as well as semantic similarities, within the loss function, it achieved better retrieval performance than other models, especially when fewer hash bits (e.g., 12 or 24 bits) were used. Indeed, the cross-entropy loss has been shown to effectively capture rich semantic information in the literature.
Fig. 3 The mAP comparison of CSDH to the shallow hashing algorithms with different quantities of hash bits
As stated above, convolutional neural networks can extract deep features that capture semantic information, thereby enhancing model performance. To validate this assertion, we carried out additional experiments using the shallow hashing algorithms with deep features on the image collections. Specifically, we extracted 4096-dimensional deep features from images using VGG-F, i.e., the output of the last layer of VGG-F. The shallow hashing algorithms were then applied to these deep features to generate hash bits. The experimental results are provided in Fig. 5.
According to the mAP scores in Fig. 5, the deep features significantly strengthened the retrieval performance of the shallow hashing models. Broadly speaking, the shallow hashing algorithms with deep features significantly outperformed those without deep features. For example, the mAP score of FastH increased from 0.305 to 0.553 on CIFAR-10 when the number of hash bits was 12. Particularly on multi-label collections like NUS-WIDE, the performance of the shallow hashing algorithms with deep features was comparable to that of the deep hashing ones (see Fig. 4).

Ablation analysis
The discussion above shows that image similarity can benefit the performance of hashing models, as evidenced by DSDH, DAHP, DSHTL and DPSH. Unlike those deep hashing models, which rely only on label similarities, our model goes further by integrating Hamming embedding distances into the loss function.
To verify the contribution of Hamming embedding distances to retrieval performance, we conducted additional experiments on the image collections using our proposed model with and without Hamming embedding distances, denoted as CSDH and CSDH-, respectively. Figure 6 illustrates the mAP and precision of both variants with different quantities of hash bits. From the experimental results in Fig. 6, we can observe that the Hamming embedding distances significantly strengthen retrieval performance in terms of both mAP and precision. This is reasonable and consistent with the intuition that the Hamming embedding distances preserve the consistency of semantic similarity of data to some extent. For the multi-label collection, i.e., NUS-WIDE, the performance improvement was particularly pronounced due to the rich semantic information contained in multi-label images, which makes the model more effective.
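As background for why Hamming distances can be used inside a loss function at all, recall the standard identity for codes in {-1, +1}^K: the Hamming distance equals (K - <b_i, b_j>) / 2, so it can be expressed through an inner product. The sketch below only checks this identity numerically. It is not the CSDH loss, and the helper names are hypothetical.

```python
def to_signed(code):
    """Map a {0, 1} code to the equivalent {-1, +1} code."""
    return [1 if bit else -1 for bit in code]


def hamming_distance(a, b):
    """Count the positions where two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_from_inner_product(a, b):
    """Hamming distance via the identity d_H = (K - <b_i, b_j>) / 2."""
    signed_a, signed_b = to_signed(a), to_signed(b)
    inner = sum(x * y for x, y in zip(signed_a, signed_b))
    # K - inner equals twice the number of differing bits, so it is always even.
    return (len(signed_a) - inner) // 2


code_a = (0, 1, 1, 0)
code_b = (1, 1, 0, 0)
print(hamming_distance(code_a, code_b), hamming_from_inner_product(code_a, code_b))
```

Both calls return the same value for any pair of equal-length codes, which is what allows a distance defined on discrete codes to be written in terms of a product of (relaxed) code vectors.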
Figure 7 presents the precision-recall curves of CSDH and its counterpart without the Hamming embedding loss (i.e., CSDH-) on the NUS-WIDE collection when 12 and 48 hash bits were used. The precision-recall curves further confirm the effectiveness of the Hamming embedding loss, which preserves the semantic similarities of data during hash projection. This conclusion holds for other quantities of hash bits as well; however, due to space constraints, we do not report those results individually.
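One common way to obtain such precision-recall curves is to sweep the Hamming radius and record precision and recall of the items falling inside it. The sketch below illustrates this for a single query under that assumption; the protocol actually used for Fig. 7 may differ, and the function name and relevance rule (at least one shared label) are illustrative choices.

```python
def hamming_distance(a, b):
    """Count the positions where two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def precision_recall_by_radius(query_code, query_labels, db_codes, db_labels):
    """Return (radius, precision, recall) for every Hamming radius 0..K."""
    distances = [hamming_distance(query_code, code) for code in db_codes]
    # Assumption: an item is relevant if it shares at least one label with the query.
    relevant = [bool(query_labels & labels) for labels in db_labels]
    total_relevant = sum(relevant)
    curve = []
    for radius in range(len(query_code) + 1):
        retrieved = [rel for dist, rel in zip(distances, relevant) if dist <= radius]
        true_positives = sum(retrieved)
        # Convention: precision is taken as 1.0 when nothing is retrieved.
        precision = true_positives / len(retrieved) if retrieved else 1.0
        recall = true_positives / total_relevant if total_relevant else 0.0
        curve.append((radius, precision, recall))
    return curve


# Toy 4-bit database with label sets (multi-label images would hold several labels).
db_codes = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (1, 1, 1, 1)]
db_labels = [{0}, {1}, {0}, {1}]
for point in precision_recall_by_radius((0, 0, 0, 0), {0}, db_codes, db_labels):
    print(point)
```

A full curve would average these per-query points over all queries before plotting precision against recall.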

Conclusions
In this work, we proposed a novel end-to-end deep hashing model, CSDH, for image retrieval that leverages binary code similarities. The hashing model first employs a pre-trained deep convolutional neural network to extract deep features, capturing the semantic structural information of images. A hidden, fully connected layer is attached to the end of the deep network. This hidden layer, referred to as the hash layer, transforms the continuous values output by the last layer into binary ones via an activation function. To maintain consistency in similarity preservation, the Hamming embedding distances are also introduced into the loss function. The superiority of CSDH was validated through extensive experiments on two public image collections. The experimental results verify that CSDH exhibits competitive performance compared to popular deep hashing models.
Note that the proposed hashing model, taking AlexNet as its foundational architecture, may inevitably encounter the well-documented issue of vanishing gradients. Besides, the depth of AlexNet is relatively shallow, potentially limiting the complexity of the deep features it can extract. Thus, our future work will explore the integration of more modern neural network architectures, such as GoogLeNet, ResNet, and DenseNet, into our hashing model. These models offer increased depth and innovative structures, suggesting that they may enhance the feature extraction capabilities of our model, as well as mitigate the vanishing gradient problem. By incorporating these advanced architectures, we aim to improve the robustness and performance of our hashing model.

Fig. 7 The precision-recall curves of our model with (or without) the Hamming embedding loss, denoted as CSDH and CSDH-, respectively, on NUS-WIDE
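To illustrate the role of the hash layer, the following is a minimal sketch of a fully connected layer whose outputs are squashed by an activation and then thresholded into a binary code. The choice of tanh as the activation, the thresholding at zero, and all names and toy weights are assumptions made for illustration; the summary above does not specify these details.

```python
import math


def hash_layer(features, weights, biases, binarize=False):
    """Map a feature vector to K relaxed values in (-1, 1), or to a {-1, +1} code.

    weights holds K rows, each as long as the feature vector; biases holds K values.
    """
    activations = []
    for row, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(row, features)) + bias
        # Assumed activation: tanh, a smooth surrogate for the sign function.
        activations.append(math.tanh(pre_activation))
    if binarize:
        # Threshold at zero to obtain the final binary code.
        return [1 if value >= 0 else -1 for value in activations]
    return activations


# Toy example: a 2-dimensional feature vector hashed to 2 bits.
features = [1.0, -2.0]
weights = [[1.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0]
print(hash_layer(features, weights, biases))
print(hash_layer(features, weights, biases, binarize=True))
```

The relaxed outputs keep the layer differentiable during training, while the thresholded outputs are the codes that would be stored and compared by Hamming distance at retrieval time.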

Fig. 2 The deep hashing model framework of CSDH

Fig. 4 The mAP comparison of CSDH to the deep hashing algorithms with different quantities of hash bits

Fig. 5 The mAP comparison of CSDH to the shallow hashing methods with deep features by VGG-F