An Approach for the Cross Modality Content-Based Image Retrieval between Different Image Modalities
Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography. 2013. Nov, 31(6_2): 585-592
Copyright © 2013, Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : November 25, 2013
  • Accepted : December 30, 2013
  • Published : November 28, 2013
About the Authors
Inseong Jeong
Data Solutions & Technology Incorporated, USA
Gihong Kim
Corresponding Author, Regular Member, Dept. of Civil Engineering, Gangneung-Wonju National Univ., Korea

CBIR is an effective tool for searching and extracting image content from a large remote sensing image database in response to a query by an operator or end user. However, because imaging principles differ among sensors, the visual representation of the same content varies with image modality. Since images of various modalities are archived in a database, the modality difference has to be tackled for a successful CBIR implementation. This topic has seldom been dealt with and thus still poses a practical challenge. This study suggests a cross modality CBIR (termed CM-CBIR) method that transforms a given query feature vector by a supervised procedure in order to link modalities. The procedure leverages the skill of the analyst in a training step, after which a transformed query vector is created for searching target images of a different modality. Initial results show the potential of the proposed CM-CBIR method, which delivers the image content of interest from images of a different modality. Although its retrieval capability is outperformed by that of same modality CBIR (abbreviated SM-CBIR), the shortfall in retrieval performance can be compensated for by employing the user’s relevancy feedback, a conventional technique for retrieval enhancement.
1. Introduction
With the rapidly increasing volume of remote sensing imagery, along with its increasing dimensionality and number of modalities, retrieving image content of interest from an image database has become an important task for database administrators and users in many geospatial communities. For this purpose, CBIR (content-based image retrieval) provides an effective way to search for image content queried by an operator or user. In general, CBIR first extracts principal image features and characteristics from a query image and then seeks to retrieve other image occurrences having similar features in the image database (Jeong, 2012). In some CBIR approaches (Newsam et al., 2004; Streilein et al., 2000; Li and Narayanan, 2004), the implementation involves 1) quantifying the image content of a small image tile by a feature vector whose elements are designed to characterize low-level primitives of image content (e.g. texture), 2) constructing a feature vector database from all images in the database, which turns the CBIR application into a vector space problem, and 3) finally, based on the feature vector of the query image (i.e. the query feature vector or query vector), searching for the relevant feature vectors that maximize a similarity measure. In such CBIR systems, if two feature vectors created from images of the same modality are close to each other in terms of the similarity measure, they are expected to represent similar image features.
However, this premise is not valid for image sources of different modalities, since each sensor’s imaging characteristics (e.g. imaging mechanism, sensor response to the incoming signal, wavelength) cause differences in visual representation and, subsequently, among feature vectors. There is thus a visual distinction among image modality types (e.g. panchromatic, multispectral, hyperspectral, SAR and Lidar). Such a modality difference is an obstacle to a successful CBIR implementation. For example, the image pair in Fig. 1 shows a blue-band Quickbird image and a SAR image, both of which contain similar residential area features. One could imagine selecting a part of the Quickbird image as a “template” and searching for similar occurrences in the SAR image. However, due to the disparity in modality between the two images, a feature vector of the Quickbird scene is generally not close or similar in form to that of the SAR image. If a traditional similarity metric (e.g. Euclidean distance) is used to compare the two feature vectors, one from each image, it is like comparing apples and oranges. Therefore, the feature vectors and the similarity or proximity metric used for a single image type do not work for heterogeneous imagery.
Fig. 1. Modality difference between two images containing similar features: Quickbird blue band (left) and SAR image (right)
Considering the various modalities of images archived in an image database, handling the image modality difference is significant for the practical application of CBIR. In this paper, this type of CBIR task is termed Cross Modality CBIR or CM-CBIR (cross modality content-based image retrieval). In other words, for an image content of interest in one image, one attempts to search for similar image content in other images of a different modality. However, much CBIR-related research has studied cross-modal search between text and image (Zhang et al., 2001; Rasiwasia et al., 2010; Jia et al., 2011), for example, image search using text keywords. To the best of the authors’ knowledge, no literature has been published that explicitly deals with cross-modal search between different image modalities, which is the purpose of this paper.
One of the key factors in a successful CM-CBIR implementation is how to link different modalities. For example, a functional relationship or mapping between image intensity values can be set up in various ways. In this study, however, it is assumed that the image content abstraction is sufficiently achieved by the texture-based feature vector. Consequently, no effort is needed to establish a direct relationship between the intensity values of different modality images, only between their feature vectors.
The chosen strategy is therefore to set up a relationship at the feature vector level: how to transform a feature vector representing an image content into another feature vector of a different modality representing similar image content, in order to achieve cross modality retrieval. This study thus aims at developing a CM-CBIR approach that exploits a relationship between feature vectors of dissimilar modalities for the transformation of the query vector. As an initial investigation, our approach is applied solely to a CBIR system that adopts a texture-based feature vector to represent image content, and results from sample CM-CBIR tests are presented.
2. Methods and approach
One challenge in CM-CBIR is that the conventional similarity metrics adopted for SM-CBIR fail to measure the actual image content similarity. There are two possible ways to handle this problem: 1) inventing a new similarity metric between two cross modality feature vectors, and 2) transforming a feature vector of one image so that it captures the image content of the other image. In this research, the latter approach is adopted. A straightforward but effective approach is suggested which leverages the skill of analysts in a training step. The suggested strategy for the feature vector transformation is demonstrated in the following example for the Quickbird blue band and SAR test images shown in Fig. 2.
Fig. 2. Quickbird blue band (left) and SAR test image (right) for the CM-CBIR experiments
The Quickbird test image has an approximate ground resolution of 2.5 m and the SAR image 1.5 m. The middle part of each test image is divided into square tiles: 1089 tiles (i.e. 33 tiles per row by 33 tiles per column) for the Quickbird image and 600 tiles (i.e. 25 by 24) for the SAR image. The tile size is 200 by 200 pixels for both images. One feature vector is created for each tile by applying a Gabor filter bank in order to capture texture content. The Gabor filter setting includes 6 frequencies (i.e. 1/4, 1/2, 1, 2, 4 and 8) and 9 orientations (i.e. 0˚, 10˚, 20˚, 30˚, 40˚, 50˚, 60˚, 70˚ and 80˚), so Gabor filtering yields 54 output images (i.e. 6 frequencies times 9 orientations). For each output image, the mean and standard deviation of its histogram are computed, so a feature vector consists of 108 elements.
In Fig. 3, one image patch is selected from the SAR test image and two image patches from the Quickbird image, referred to as Training patch SAR #1, Training patch QB #1 and Training patch QB #2, respectively. All three patches are chosen by the analyst to contain the same feature, ‘residential subdivision’. In the plots of the feature vector elements on the right side of Fig. 3, a distinct difference in feature vector pattern is found between FVsar1 and FVqb1, and between FVsar1 and FVqb2, owing to their modality difference. Also, judging from the pattern of the feature vector plots, an explicit and general mathematical relationship between FVsar1 and either FVqb1 or FVqb2 hardly seems available. However, as expected, FVqb1 and FVqb2 are much more similar to each other than either is to FVsar1. This indicates that the Quickbird feature vector represents the image content satisfactorily for the feature class ‘residential subdivision’.
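The feature extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that the text does not specify: the filters are applied in the frequency domain as Gaussians centred on each (frequency, orientation) pair; the listed frequencies are read as cycles per `FREQ_SCALE` pixels (a hypothetical normalization); the Gaussian width `bandwidth` is tied to the centre frequency; the tile mean is removed before filtering; statistics are taken on the response magnitude; and the mean and standard deviation are stored interleaved. The function name `gabor_feature_vector` is likewise hypothetical.

```python
import numpy as np

FREQUENCIES = (0.25, 0.5, 1.0, 2.0, 4.0, 8.0)            # six values listed in the text
ORIENTATIONS_DEG = (0, 10, 20, 30, 40, 50, 60, 70, 80)   # nine orientations from the text
FREQ_SCALE = 32.0  # ASSUMPTION: frequencies are cycles per 32 pixels (not stated in the source)


def gabor_feature_vector(tile):
    """Return a 108-element texture vector (mean and std of 54 filter responses)."""
    tile = np.asarray(tile, dtype=float)
    rows, cols = tile.shape
    v = np.fft.fftfreq(rows)[:, None]   # vertical frequency axis, cycles per pixel
    u = np.fft.fftfreq(cols)[None, :]   # horizontal frequency axis, cycles per pixel
    # ASSUMPTION: remove the tile mean so the DC term does not leak into the filters.
    spectrum = np.fft.fft2(tile - tile.mean())

    features = []
    for freq in FREQUENCIES:
        f0 = freq / FREQ_SCALE          # centre frequency in cycles per pixel
        bandwidth = f0 / 2.0            # ASSUMPTION: Gaussian width tied to f0
        for deg in ORIENTATIONS_DEG:
            theta = np.deg2rad(deg)
            u0, v0 = f0 * np.cos(theta), f0 * np.sin(theta)
            # Gabor filter expressed as a Gaussian centred at (u0, v0) in frequency space.
            transfer = np.exp(-((u - u0) ** 2 + (v - v0) ** 2) / (2.0 * bandwidth ** 2))
            response = np.abs(np.fft.ifft2(spectrum * transfer))
            features.extend((response.mean(), response.std()))
    return np.array(features)


# Example on a synthetic 200 x 200 tile (random values, for shape checking only).
rng = np.random.default_rng(0)
example_vector = gabor_feature_vector(rng.random((200, 200)))
print(example_vector.shape)
```

Running the function once per tile would give 1089 vectors for the Quickbird image and 600 for the SAR image, each with 6 x 9 x 2 = 108 elements.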
Fig. 3. Image patches are selected for training on the feature type of “residential subdivision”. Their feature vectors, a composite feature vector and a modality difference vector are displayed.
Now questions arise: can a ‘feature class’ in one modality be represented by one primary feature vector? Can a meaningful modality difference or relation between modalities be derived from feature vectors of different modalities? If so, can the difference be applied to transform feature vectors between modalities? As an approach to answering these questions, a composite feature vector, FVqbC, intended to represent the residential subdivision feature class of the Quickbird image, is generated from FVqb1 and FVqb2. To create the composite feature vector, averaging, a weighted mean of the given feature vectors or other algorithms can be applied. Here, for demonstration purposes, the average of FVqb1 and FVqb2 is chosen to create FVqbC. Then, using the composite feature vector FVqbC, a difference vector FVdiff is obtained by subtracting FVsar1 from FVqbC, as depicted in Fig. 3. This vector is called a modality difference vector and is designed to supply the missing link between two modalities. Note that, in general, a modality difference vector is defined as the composite feature vector of the training patches from the search (target) images minus the composite feature vector of the training patches from the query image. Also, a modality difference vector differs by image content, so it should be derived for each class.
When transforming a query feature vector of one image into that of an image of a different modality, simply adding the modality difference vector accomplishes the conversion. With the transformed query feature vector, CM-CBIR can then be performed as if it were SM-CBIR. There is no need to devise a novel similarity metric exclusively for CM-CBIR since the SM-CBIR framework is retained. Compared with remote sensing classification techniques, the suggested procedure for producing a composite feature vector and a modality difference vector resembles characterizing a thematic class using the statistics of training samples and performing supervised classification. Though the proposed CM-CBIR strategy needs the analyst’s intervention, the idea of employing training patches has some advantages: 1) it is simple but effective since no sophisticated effort is required to solve for a complicated functional relation between modalities, and 2) the relevancy of the CM-CBIR results can be enhanced to some degree if more training data are provided. Details of the experimental results are presented in the next section.
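The composite feature vector and the modality difference vector can be illustrated with a small sketch. The 4-element vectors below are toy values chosen for readability, not data from the experiments (the real vectors have 108 elements), and the helper names `composite_vector` and `modality_difference` are hypothetical.

```python
import numpy as np


def composite_vector(training_vectors, weights=None):
    """Average (optionally weighted) of the training feature vectors of one modality."""
    stacked = np.asarray(training_vectors, dtype=float)
    return np.average(stacked, axis=0, weights=weights)


def modality_difference(target_training, query_training):
    """Composite of the target-modality patches minus composite of the query-modality patches."""
    return composite_vector(target_training) - composite_vector(query_training)


# Toy stand-ins for FVqb1, FVqb2 and FVsar1 (illustrative values only).
fv_qb1 = np.array([1.0, 2.0, 3.0, 4.0])
fv_qb2 = np.array([3.0, 2.0, 1.0, 0.0])
fv_sar1 = np.array([0.5, 0.5, 0.5, 0.5])

# FVqbC is the plain average of the two Quickbird training vectors.
fv_qbc = composite_vector([fv_qb1, fv_qb2])

# FVdiff as in Fig. 3: Quickbird composite minus the SAR training vector,
# i.e. the Quickbird image is the target and the SAR image is the query.
fv_diff = modality_difference([fv_qb1, fv_qb2], [fv_sar1])

print(fv_qbc, fv_diff)
```

Passing `weights` to `composite_vector` gives the weighted-mean variant mentioned in the text; with a single training patch the composite is simply that patch's feature vector.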
3. Experimental results and discussion
In order to evaluate the applicability of the modality difference vector and the transformed query vector for CM-CBIR, an experiment is carried out to retrieve the ‘residential subdivision’ feature class in the SAR test image (i.e. the target image) given the Quickbird query image patch shown in Fig. 4. To search among the SAR image tiles, qFV (QB) (i.e. the query vector computed from the Quickbird query patch) is transformed into qFV (SAR) (i.e. the query vector converted for the SAR image) using FVdiff (i.e. the modality difference vector), expressed as qFV (SAR) = qFV (QB) + FVdiff. Note that FVdiff was derived using training patches as explained in the previous section.
Fig. 4. Given query patch from the Quickbird test image and feature vector transformation for searching in the SAR test image
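The transformation qFV (SAR) = qFV (QB) + FVdiff amounts to a single vector addition. The sketch below uses made-up 3-element vectors (not values from the experiment) to illustrate why the Euclidean distance to a SAR tile only becomes meaningful after the query has been shifted; `transform_query` is a hypothetical helper name.

```python
import numpy as np


def transform_query(query_vector, modality_diff):
    """Shift a query vector into the target modality: qFV(target) = qFV(query) + FVdiff."""
    return np.asarray(query_vector, dtype=float) + np.asarray(modality_diff, dtype=float)


# Toy values (assumptions for illustration only).
q_qb = np.array([2.0, 1.0, 4.0])        # query vector from a Quickbird patch
fv_diff = np.array([-1.0, 3.0, -2.5])   # SAR composite minus Quickbird composite
sar_tile = np.array([1.1, 3.9, 1.4])    # SAR tile assumed to show the same content

q_sar = transform_query(q_qb, fv_diff)

distance_before = np.linalg.norm(q_qb - sar_tile)    # raw cross-modality comparison
distance_after = np.linalg.norm(q_sar - sar_tile)    # comparison after transformation
print(q_sar, distance_before, distance_after)
```

In this toy case the transformed query lies much closer to the SAR tile than the raw Quickbird query does, which is the behaviour the approach relies on when the training patches represent the class well.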
Using qFV (SAR) (i.e. the transformed query vector), a CM-CBIR result is obtained. In Fig. 5, up to ten retrieved image tiles are displayed, ranked by the Euclidean distance (i.e. L2 norm), the similarity metric chosen for all experiments. Note that the Euclidean distance is now usable as a similarity metric because the query feature vector has been transformed in our approach. The Euclidean distance is computed between the transformed query vector and the feature vector of each retrieved image tile. Once the ranks are determined by the similarity criterion, the relevancy of the CM-CBIR results is evaluated by the operator’s decision: an image tile is judged relevant if it includes the queried image content fully or partially. Table 1 summarizes the relevant retrieval cases for all experiments in the study. As shown in Fig. 5 and listed in Table 1, seven relevant tiles are retrieved in the first experiment. Except for the tile ranked 8th, which only partially contains the subdivision feature, the other relevant tiles show strong relevancy (i.e. the image tile is almost full of the feature of interest).
Table 1. Relevant retrieval cases
Fig. 5. CM-CBIR result for ‘residential subdivision’: retrieved SAR image tiles by using the Quickbird query patch (relevant tiles are marked with a dotted line)
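The ranking step, in which tiles are ordered by Euclidean distance to the (transformed) query vector and up to ten are kept, might look like the following sketch. The function name `rank_tiles` and the 2-element toy vectors are assumptions for illustration; the relevancy judgment itself is made by the operator and is not modelled here.

```python
import numpy as np


def rank_tiles(query_vector, tile_vectors, top_k=10):
    """Return indices and distances of the top_k tiles closest to the query (L2 norm)."""
    tiles = np.asarray(tile_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    distances = np.linalg.norm(tiles - query, axis=1)
    order = np.argsort(distances, kind="stable")[:top_k]   # smallest distance = rank 1
    return order, distances[order]


# Toy database of four tile vectors (illustrative values only).
tile_db = np.array([[0.0, 0.0],
                    [3.0, 4.0],
                    [1.0, 0.0],
                    [0.0, 2.0]])
ranked_indices, ranked_distances = rank_tiles([0.0, 0.0], tile_db, top_k=3)
print(ranked_indices, ranked_distances)
```

The same function serves both SM-CBIR (query vector used directly) and CM-CBIR (query vector transformed first), which reflects the point that the SM-CBIR framework is retained.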
The next experiment is performed in the opposite direction, i.e. searching the Quickbird test image (i.e. the target image) given the SAR query image patch. For this purpose, a transformed query vector, qFV (QB), is derived as seen in Fig. 6. Note that the FVdiff vectors in Figs. 4 and 6 have the same magnitude but opposite signs because they are derived from the same training patches with the target and query images reversed. In the second experiment, three relevant tiles are obtained, as marked in Fig. 7.
Fig. 6. Given query patch from the SAR test image and feature vector transformation for searching in the Quickbird test image
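The sign relationship between the two modality difference vectors follows directly from the definition (target composite minus query composite). Writing C_QB and C_SAR for the composite feature vectors of the Quickbird and SAR training patches (notation introduced here for illustration, not used in the original figures), the two experiments can be summarized as:

```latex
\begin{aligned}
\text{Experiment 1 (Quickbird query, SAR target):}\quad
  \mathrm{FVdiff}_{\mathrm{QB}\to\mathrm{SAR}} &= C_{\mathrm{SAR}} - C_{\mathrm{QB}}, &
  qFV^{(\mathrm{SAR})} &= qFV^{(\mathrm{QB})} + \mathrm{FVdiff}_{\mathrm{QB}\to\mathrm{SAR}}, \\
\text{Experiment 2 (SAR query, Quickbird target):}\quad
  \mathrm{FVdiff}_{\mathrm{SAR}\to\mathrm{QB}} &= C_{\mathrm{QB}} - C_{\mathrm{SAR}}
    = -\,\mathrm{FVdiff}_{\mathrm{QB}\to\mathrm{SAR}}, &
  qFV^{(\mathrm{QB})} &= qFV^{(\mathrm{SAR})} + \mathrm{FVdiff}_{\mathrm{SAR}\to\mathrm{QB}}.
\end{aligned}
```

The query vectors on the right-hand sides come from different query patches in the two experiments; only the difference vectors are shared up to sign.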
Fig. 7. CM-CBIR result for ‘residential subdivision’: retrieved Quickbird image tiles by using the SAR query patch (relevant tiles are marked with a dotted line)
As one way to evaluate the retrieval performance of the two CM-CBIR results, they are compared with SM-CBIR results. The same query images used for CM-CBIR are now used for searching in images of the same modality. The Quickbird query patch in Fig. 4 is used directly for retrieval in the Quickbird test image, and likewise the SAR query patch in Fig. 6 is used for retrieval in the SAR test image. These two SM-CBIR results are displayed in Figs. 8 and 9 and listed in Table 1.
Fig. 8. SM-CBIR result for ‘residential subdivision’ retrieved in the Quickbird test image (relevant tiles are marked with a dotted line)
Fig. 9. SM-CBIR result for ‘residential subdivision’ retrieved in the SAR test image (relevant tiles are marked with a dotted line)
Table 1 shows that SM-CBIR outperforms CM-CBIR in terms of the number of relevant retrievals, a quantity-wise criterion. With SM-CBIR, 8 relevant image tiles are retrieved in the SAR test scene and 6 in the Quickbird test scene. With CM-CBIR, 7 and 3 relevant tiles are obtained from the SAR and Quickbird test scenes, respectively. The overall lower retrieval achievement of CM-CBIR is probably because its query vector, obtained indirectly via the proposed transformation procedure, is degraded if an irrelevant residual or offset smears into it during the procedure, which can be likened to a low signal-to-noise ratio. In contrast, because the query vector of SM-CBIR is extracted directly from a query image patch, it may be less noisy and yield a better retrieval outcome than CM-CBIR.
The retrieval capability of CM-CBIR can be enhanced by employing the user’s relevancy feedback, a conventional way of refining the retrieval result in many CBIR systems (Brocker et al., 2001; Zhang et al., 2001; Gelasca et al., 2007). In this technique, the user designates the most successful image patches returned from the query, and the initial query vector is updated using the feature vectors of the selected image tiles. Using this refined query vector, the previous retrieval result is generally improved. This technique is applied to both CM-CBIR experiments shown in Fig. 5 and Fig. 7. For the Quickbird (query)-to-SAR (target) case, the image patches ranked 1st, 2nd, 3rd and 4th in the top figure of Fig. 10 (i.e. the CM-CBIR result in Fig. 5) are designated as the most relevant CM-CBIR cases. Similarly, for the SAR (query)-to-Quickbird (target) experiment, the 4th and 8th image patches in the top figure of Fig. 11 (i.e. the CM-CBIR result in Fig. 7) are designated. Then, the feature vectors of the designated tiles and the transformed query vector are averaged to generate a refined query vector.
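The query refinement described above can be sketched as a plain average of the transformed query vector and the feature vectors of the tiles the user designates. Equal weighting of all vectors is an assumption here (the text says only that they are averaged), and `refine_query` and the toy values are hypothetical.

```python
import numpy as np


def refine_query(transformed_query, designated_tile_vectors):
    """Average the transformed query with the user-designated tile vectors (equal weights assumed)."""
    stacked = np.vstack([np.asarray(transformed_query, dtype=float),
                         *np.asarray(designated_tile_vectors, dtype=float)])
    return stacked.mean(axis=0)


# Toy example: one transformed query and two tiles marked relevant by the user.
q_transformed = np.array([0.0, 0.0])
designated = np.array([[2.0, 2.0],
                       [4.0, 1.0]])
q_refined = refine_query(q_transformed, designated)
print(q_refined)
```

The refined vector would then be passed back to the same distance-based ranking used for the initial retrieval.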
Fig. 10. Demonstration of the CM-CBIR result when applying the user’s relevancy feedback: the tiles ranked 1st, 2nd, 3rd and 4th, marked with a solid line, are chosen from the Fig. 5 result to refine the query feature vector (top figure). In this case, the refined query vector yields the same number of relevant tiles. Relevant tiles are marked with a dotted line (bottom figure).
Fig. 11. Demonstration of the CM-CBIR result when applying the user’s relevancy feedback: the tiles ranked 4th and 8th, marked with a solid line, are chosen from the Fig. 7 result to refine the query feature vector (top figure). The refined query vector yields an improved result, i.e. two more relevant tiles are retrieved than in Fig. 7. Relevant tiles are marked with a dotted line (bottom figure).
The bottom figures of Fig. 10 and Fig. 11 show the CM-CBIR results after incorporating the user’s relevancy feedback. In the SAR-to-Quickbird experiment (Fig. 11), the user’s feedback yields an enhanced result in terms of the quantity-wise criterion: two more relevant tiles are retrieved, from three in the previous CM-CBIR result to five, as summarized in Table 1. In the Quickbird-to-SAR experiment (Fig. 10), the same number of relevant tiles is retrieved. Though the quantity-wise result is the same, this CM-CBIR case was already comparable to the SM-CBIR performance (i.e. 7 vs. 8 relevant tiles, respectively) even before applying the user’s feedback. In contrast, for the SAR-to-Quickbird case, the CM-CBIR result was clearly surpassed by SM-CBIR (i.e. 3 vs. 6 relevant tiles, respectively). Considering the inherent limitation of the proposed CM-CBIR method (i.e. a transformed query vector is used rather than a direct one), CM-CBIR is unlikely to outperform SM-CBIR, but a comparable performance is acceptable. Therefore, when the performance gap between CM-CBIR and SM-CBIR is larger, as in the SAR-to-Quickbird case, the user’s relevancy feedback is expected to improve the CM-CBIR result more.
The above results show that 1) the suggested CM-CBIR approach is able to deliver the image content of interest despite the modality difference, but 2) the retrieval performance of CM-CBIR is lower than that of SM-CBIR. However, the shortfall in retrieval performance of the CM-CBIR method appears to be reduced, approaching the level of SM-CBIR, by incorporating the user’s feedback, which makes the capability of the proposed CM-CBIR scheme more satisfactory. This improvement is expected to be greater when the initial performance gap between CM-CBIR and SM-CBIR is larger.
4. Conclusions
With the aim of enabling CBIR between different modalities, this study presented a CM-CBIR approach that transforms a given query feature vector in order to use it as a new query vector for searching target images of a different modality. The suggested transformation is a supervised procedure in which the operator chooses training image patches from heterogeneous images and sets the correspondence between them.
The initial results show the feasibility of the proposed CM-CBIR method, which is able to retrieve some image features of interest from scenes of a different modality. However, its retrieval capability seems less effective than that of SM-CBIR. This shortfall is thought to be inevitable since the query feature vector of the proposed CM-CBIR method is derived indirectly through the transformation procedure, while SM-CBIR computes its query vector directly from the given query patch. Despite this inherent weakness, the retrieval performance of the presented CM-CBIR approach can be compensated for by exploiting the user’s relevancy feedback (when only a small number of relevant tiles are retrieved), a conventional CBIR enhancement technique that retrieves more relevant image patches (i.e. quantity-wise improvement) and puts them in higher ranks (i.e. quality-wise improvement). Therefore, it is recommended to use the suggested CM-CBIR scheme for an initial retrieval and then augment the initial search result through the user’s feedback, particularly when the initial performance gap between CM-CBIR and SM-CBIR is larger.
In the CM-CBIR framework of this research, the training step should be done very carefully to ensure a good retrieval result. As the suggested approach depends heavily on the correspondence between image patches of the same image content but different modalities, the operator’s supervision in choosing training patches and matching them is very important for ensuring the quality of the transformed query vector as well as satisfactory CM-CBIR performance.
To streamline the suggested methodology, several issues remain to be studied further. Supplemental feature vector components (e.g. structure and object pattern) that might be invariant to image modality can be added to the texture-based feature vector and tested. Also, when generating a composite feature vector during the training step, a weighting strategy that gives a different weight to each training patch, rather than simple averaging, can be tested for its potential to better represent an image feature class. Finally, future research will involve additional extensive experiments with other image types and feature classes, which would be useful to confirm or refine the initial results.
References
Brocker L. , Bogen M. , Cremers A. B. 2001 Improving the retrieval performance of content-based image retrieval systems: The GIVBAC approach Fifth International Conference on Information Visualisation London, England 25-27 July 659 - 664
Gelasca E. D. , Guzman J. D. , Gauglitz S. , Ghosh P. , Xu J. , Moxley E. , Rahimi A. M. , Bi Z. , Manjunath B. S. 2007 CORTINA: Searching a 10 Million + Images Database, Technical Report, VRL, ECE University of California Santa Barbara
Jeong I. 2012 An approach for improving the performance of the Content-Based Image Retrieval (CBIR) Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography 30 (6-2) 665 - 672    DOI : 10.7848/ksgpc.2012.30.6-2.665
Jia Y. , Salzmann M. , Darrell T. 2011 Learning cross-modality similarity for multinomial data 2011 IEEE International Conference on Computer Vision (ICCV) Barcelona, Spain 6-13 November 2407 - 2414
Li J. , Narayanan R. M. 2004 Integrated spectral and spatial information mining in remote sensing imagery IEEE Transactions on Geoscience and Remote Sensing 42 (3) 673 - 685    DOI : 10.1109/TGRS.2004.824221
Newsam S. , Wang L. , Bhagavathy S. , Manjunath B. S. 2004 Using texture to analyze and manage large collections of remote sensed image and video data Journal of Applied Optics: Information Processing 43 (2) 210 - 217
Rasiwasia N. , Pereira J. C. , Coviello E. , Doyle G. , Lanckriet R.G. , Levy R. , Vasconcelos N. 2010 A new approach to cross-modal multimedia retrieval ACM Proceedings of the international conference on Multimedia Firenze, Italy 25-29 October 251 - 260
Streilein W. , Waxman A. , Ross W. , Liu F. , Braun M. , Fay D. , Harmon P. , Read C. H. 2000 Fused multi-sensor image mining for feature foundation data Proceedings of the 3rd International Conference on Information Fusion Paris, France 10-13 July 1 TuC3/18 - TuC3/25
Zhang H. J. , Chens Z. , Liu W.Y. , Li M. 2001 Relevance feedback in content-based image search Proceedings of 12th International Conference on New Information Technology (NIT) Beijing, China 29-31 May