Key Technologies in Mobile Visual Search and MPEG Standardization Activities


Abstract

Visual search has been a long-standing problem in applications such as location recognition and product search. Much research has been done on image representation, matching, indexing, and retrieval. Key component technologies for visual search have been developed, and numerous real-world applications are emerging. To ensure application interoperability, the Moving Picture Experts Group (MPEG) has begun standardizing visual search technologies and is developing the compact descriptors for visual search (CDVS) standard. MPEG seeks to develop a collaborative platform for evaluating existing visual search technologies. Peking University has participated in this standardization since the 94th MPEG meeting, and significant progress has been made through the various proposals. A test model (TM) has been selected to determine the basic pipeline and key components of visual search. However, the first-version TM has high computational complexity and imperfect retrieval and matching performance. Core experiments have therefore been set up to improve the TM. In this article, we summarize key technologies for visual search and report the progress of MPEG CDVS. We discuss Peking University’s efforts in CDVS and also discuss unresolved issues.

Keywords

visual search; mobile visual descriptors; low bit rate; compression

1 Visual Search

Searching for a specific object in a collection of images has been a long-standing problem with a wide range of applications in location recognition, scene retrieval, product search, and CD and book cover search. Smartphones and tablets have great potential for visual search because they have the integrated functionality of high-resolution embedded cameras, color displays, natural user interfaces, and 3G wireless connections. Most existing mobile visual search systems are deployed in a client-server architecture. The server end has a visual search system in which image representation and inverted index are based on a bag of words (BoW) model. In an online search, a mobile user can take a query photo and send it to a remote server that identifies the query image and its related description (Fig. 1a).

Much research has been done on robust image representation, matching, indexing, retrieval, and geometric re-ranking. Fig. 2 shows a typical flow chart for a visual search. Global features [1], [2] and/or local features [3]-[7] are extracted from database images at a server. The server looks for relevant images based on visual similarity between the query and database images, and inverted indexing speeds up the process (Fig. 2a). Local features are particularly important in a visual search. Even though detecting global features is fast and requires little memory, matching and retrieval based on global features is often less accurate than matching and retrieval based on local features [8]. Recently, Chen et al. showed that a visual search using both global and local features yields better results than one using local features only [9].

Using brute force to match query features with database features is infeasible for large-scale databases, and a feature index is needed to improve search efficiency (Fig. 2b) [3], [10]-[12]. Solutions include approximate nearest-neighbor search, such as the k-D tree [3], [10] and locality-sensitive hashing [11], [12]. In addition, feature-space quantization schemes, such as k-means and the scalable vocabulary tree [13]-[15], have been widely used for scalable image search; two individual features are considered the same visual word if they fall into the same cluster.
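To make the quantization idea concrete, the Python sketch below (a toy illustration with our own function names, not code from any particular search system) learns a flat k-means vocabulary and maps descriptors to visual-word indexes; a scalable vocabulary tree replaces the flat codebook with a hierarchical one to speed up assignment.

```python
import numpy as np

def train_vocabulary(descriptors, num_words=1000, iters=20, seed=0):
    """Plain k-means on training descriptors; each cluster center is one visual word."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), num_words, replace=False)].copy()
    for _ in range(iters):
        # brute-force assignment: every descriptor goes to its nearest center
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for w in range(num_words):
            members = descriptors[labels == w]
            if len(members):
                centers[w] = members.mean(axis=0)
    return centers

def quantize(descriptors, centers):
    """Map each descriptor to the index of its nearest visual word."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```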

In online search, local features are extracted from a query image and used to search for their nearest neighbors in a database of reference images. Database images containing those nearest neighbors are quickly collected using the index and are ranked according to a similarity score (Fig. 2c) [13]-[15]. Finally, the top returned images are re-ranked through geometric verification, which takes into account the locations of local features (Fig. 2d) [14], [16], [17].

Both academia and industry have made significant progress on visual search, and there is now a growing number of systems and prototypes, including Google Goggles and Nokia Point & Find. To ensure application interoperability, the Moving Picture Experts Group (MPEG) has begun standardizing visual search technologies, in particular, compact descriptors for visual search (CDVS) [18]. CDVS is considered a core technique in augmented reality.

In section 2, we discuss MPEG’s progress on CDVS and Peking University’s (PKU’s) responses to the MPEG CDVS call for proposals (CfP). We also describe the setup of the core experiments. In section 3, we discuss compact descriptors and related issues such as indexing, retrieval, and geometric re-ranking. In section 4, we discuss open problems and challenges. Section 5 concludes the paper.

2 MPEG Compact Descriptors for Visual Search

2.1 Introduction to CDVS

Academia and industry have made progress on key components for visual search; however, several issues still remain [2]-[4], [13]-[15]. It is not clear, for example, how to make visual search applications compatible across a broad range of devices and platforms. This is one of the motivations for MPEG CDVS standardization, which was discussed during the first and second Workshops on Mobile Visual Search hosted by Stanford University [19] and PKU [20], respectively. These workshops led to a formal request being made to MPEG for CDVS standardization and a fleshing out of CDVS requirements.

At the 92nd to 96th MPEG meetings, CDVS experts investigated potential applications, scope of standardization, requirements, and evaluation framework [21]-[24]. At the 97th MPEG meeting, the CfP was issued [25]. To ensure interoperability, the CDVS standard aims to define the format of compact visual descriptors as well as the pipeline for feature extraction and search process [22].

The visual descriptors need to be robust, compact, and easy to compute on a wide range of platforms. High matching accuracy must be achieved at least for images of rigid, textured objects; landmarks; and documents. Matching should also be accurate despite partial occlusions and changes in vantage point, camera parameters, and lighting. To reduce the amount of information transferred from the client to the server, and to alleviate query transmission latency, the descriptor length must be minimized. Descriptor extraction must allow the descriptor length to be adapted so that the required performance level can be satisfied for different databases. Extracting descriptors must not be too complex in terms of memory and time so that descriptors can be computed on most common platforms.

2.2 Evaluation Framework

Here, we summarize the CDVS evaluation framework [26], which closely aligns with the CDVS requirements [22]. Different proposals are evaluated using two types of experiments: retrieval and pairwise matching. The retrieval experiment determines descriptor performance during image retrieval; mean average precision (mAP) and the success rate for top matches are the evaluation criteria. The pairwise matching experiment determines the performance of descriptors when matching image pairs; localization precision is assessed against ground-truth localization annotations. Performance is measured by the success rate at a given false-alarm rate (for example, 1%) and by localization precision.
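For reference, the retrieval criterion can be computed as in the short Python sketch below; it follows the standard definition of mean average precision rather than the evaluation framework’s own implementation, and the data layout (ranked result lists plus ground-truth sets) is assumed.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: ranked_ids is the retrieval order, relevant_ids the ground truth."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at each recall point
    return precision_sum / max(len(relevant), 1)

def mean_average_precision(per_query_results):
    """per_query_results: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in per_query_results) / len(per_query_results)
```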

Each proposal sent to MPEG needs to demonstrate the descriptor’s scalability, and this is done by reporting performance at six operating points of different query sizes: 512 bytes, 1 kB, 2 kB, 4 kB, 8 kB, and 16 kB. Interoperability of the descriptors is also evaluated: a descriptor generated at any of these six operating points should allow matching with descriptors generated at any other operating point. For each proposal, an experiment is conducted to evaluate pairwise matching of a 1 kB descriptor versus a 4 kB descriptor and of a 2 kB descriptor versus a 4 kB descriptor. To limit time complexity, feature extraction must take no longer than two seconds, pairwise matching no longer than one second, and retrieval no longer than five seconds in a single thread on the reference platform (a Dell Precision 7400-E5440 workstation).

Eight datasets are collected in the CDVS evaluation framework. These datasets are ZuBud, UKBench, Stanford, ETRI, PKU, Telecom Italia (TI), Telecom SudParis, and Huawei. There are 30,256 images categorized as mixed text and graphics, paintings, frames captured from video clips, landmarks, and common objects. Fig. 3 shows these categories and the number of reference images. Each dataset provides ground-truth annotation of pairs of matching and non-matching images as well as ground-truth annotation of query images and corresponding reference images. For localization, the mixed text and graphics category provides bounding boxes for each matching pair. In the retrieval experiment, the discriminability of a descriptor is assessed using a distractor image set containing one million images of varying resolutions and content (collected from Flickr).

2.3 Progress on CDVS and Evidence from CDVS Research

Remarkable progress has been made at MPEG CDVS Ad-Hoc Group meetings. At the 94th MPEG meeting, PKU proposed an extremely compact descriptor based on a bag-of-features histogram rather than on the features themselves. This compact descriptor provides a much higher compression rate without serious loss of discriminability [27]-[30]. To determine the lowest operating point for a promising visual search, geo-tags (a kind of side information) are used to produce very compact descriptors for visual searches of landmarks [29], [30]. A photo collection of landmarks is first segmented into discrete geographical regions using a Gaussian mixture model. Then, ranking-sensitive vocabulary boosting is done to create a very compact codebook within each region. When the user enters a new geographical region, the codebooks on the mobile device are adapted over the downlink, which ensures compactness and discriminative power. This technique is called location-discriminative vocabulary coding (LDVC). With only hundreds of bits per query image, LDVC performs descriptor coding over the bag of scale-invariant feature transform (SIFT) features [29]. Based on such small bit budgets, the CDVS Ad-Hoc Group has determined that 512 bytes is the lowest operating point (including space for coordinate information).

Beyond landmark search, visual statistics and rich contextual cues at the mobile end are exploited over the reference image database to determine multiple coding channels [28]. A compression function is determined for each channel. A query is first described by a high-dimensional visual signature, which is mapped to one or more channels for further compression. The compression function within each channel is determined according to a robust principal component analysis (PCA) scheme, and the retrieval ranking capability of the original signature is maintained. With hundreds of bits, the resulting descriptor has search accuracy comparable to that of the original SIFT features [28]. In summary, PKU’s research has shown that extremely low bit-rate descriptors (128 or 256 bits per query image) are possible using quantization and by establishing a small but discriminative codebook.

At the 98th MPEG meeting, proponents of CDVS submitted 11 proposals for compact descriptors. These proposals broadly fall into two categories. In the first category, compact descriptors are produced by compressing local descriptors, for example SIFT descriptors, in a data-driven or data-independent way. Raw descriptors are first extracted using existing techniques. Then, raw descriptors are compressed to produce a compact descriptor. Proposals from TI [26], PKU [31], [32], and Stanford & Aptina [33] fall into this category. In the second category, distinct compact descriptors are produced at the first stage of raw-descriptor extraction; there is no additional and separate compression stage. NEC’s proposal belongs to this class [34].

Both TI and PKU proposed quantizing raw SIFT descriptors with product quantization, in which quantization tables are pre-learned on a training dataset [35], [36]. The difference between the TI and PKU proposals lies in how a raw SIFT descriptor is subdivided. PKU divides a SIFT descriptor into more segments, and each segment is quantized with a smaller codebook. In contrast, TI divides a SIFT descriptor into fewer segments, and each segment is quantized with a larger codebook. TI’s solution has greater matching and retrieval accuracy than PKU’s; however, PKU’s solution has greater localization accuracy than TI’s. Moreover, TI’s quantization tables may take up 500 MB, which is much larger than PKU’s approximately 10 MB.

Stanford & Aptina [33] use type coding to compress SIFT-like descriptors. Compared with the product quantization of TI and PKU, type coding is data-independent and requires no pre-learned quantization tables. Data-independent methods usually perform worse than data-driven methods, especially at lower operating points such as 512 bytes and 1 kB.

NEC [34] proposed a binary descriptor created by binarizing the histogram of quantized gradient orientations. The pairwise matching and localization of NEC’s binary descriptor outperform those of TI and PKU. However, its retrieval is much worse than that of the other proposals at almost all operating points and datasets. This significant degradation in retrieval can be attributed to severe loss of discriminative information when there is a huge number of distractors.

After the CDVS CfP was issued at the Torino meeting, a test model under consideration (TMuC) based on product quantization was selected at the 98th MPEG meeting in Geneva [37]. Crucial stages such as local descriptor extraction, feature selection, compression, and location coding have been identified in the TMuC [37]. Fig. 4 shows the modules and dataflow in the test model. The first two blocks extract raw local descriptors: a multiscale key-point detector based on difference of Gaussians (DoG) and a SIFT feature descriptor based on gradient-orientation histograms. Key-point selection retains a subset of important key points to fulfill the compactness and scalability requirements. Product quantization compresses each SIFT descriptor in a data-driven way, and coordinate coding compresses the key-point locations. In the TMuC, SIFT is used as the local descriptor because it has excellent scale and rotation invariance. More than 1000 key points can be detected in a 640 × 480 VGA image. To create a very compact descriptor, only a subset of useful key points is kept. The TMuC has shown that key-point selection is critical to maintaining search accuracy at lower operating points because noisy points can be filtered out without loss of performance. The descriptor at each key point is product quantized with pre-learned tables. Descriptor length can be significantly reduced; for example, in PKU’s proposal, more than 85% of bits are saved. Feature locations are finally compressed by quantization and context-based arithmetic coding.

Several issues remain open in the first version of the TMuC. First, the quantization tables may consume up to 500 MB of memory, which is infeasible for most mobile phones. Second, the current TMuC is built on DoG-based interest-point detection and the raw SIFT descriptor; however, both these techniques are patented [38]. UBC has granted both non-exclusive and exclusive field-of-use licenses to multiple companies. If UBC has already granted exclusive rights to commercialize SIFT on mobile devices (where MPEG CDVS would most likely be used), the patents would prevent widespread use of the standard and stymie competition. Fortunately, experts have shown that the current TMuC can be enhanced by replacing individual components with new technologies such as alternative interest-point detectors and raw local descriptors.

2.4 Core Experiments

At the 99th MPEG meeting, six core experiments [39] were set up to investigate

・the effects of combining local descriptors with a global descriptor

・low-memory solutions for descriptor compression

・improved feature-location compression efficiency

・key-point detection techniques to bypass intellectual property issues

・better uncompressed (raw) local descriptors

・more efficient retrieval pipeline.

Improvements in memory use, time complexity, search accuracy, and other aspects were verified at the 100th MPEG meeting. Table 1 shows the timeline and milestones of MPEG CDVS.

3 Compact Descriptors

To minimize the amount of information to be transferred, compact descriptors are desirable. Descriptor compactness relates to the number of local features and the size of each local feature. Typically, each local feature consists of a visual descriptor and its location coordinates. Hence, descriptor size is determined by the compression rates of both the visual descriptors and the locations. Key-point selection is necessary to identify a subset of important key points (Fig. 4). Product quantization minimizes the length of local descriptors, and coordinate coding minimizes the length of location codes. In the following subsections, we introduce techniques for local descriptor compression, location coding, and key-point selection. We also discuss PKU’s compact descriptors.

3.1 Local Descriptor Compression

Feature extraction starts by selecting key points in an image. State-of-the-art visual search algorithms are built on a DoG detector followed by SIFT description [3]. Such algorithms were used (sometimes in slightly modified form) in the CDVS proposals. For low bit-rate visual search, however, SIFT descriptors and other popular descriptors, such as speeded-up robust features (SURF) [4] and PCA-SIFT [40], are too large: hundreds of these local descriptors often take up more space than the original image. The objective of descriptor compression is to significantly reduce descriptor size without deteriorating visual discriminative power.

In general, there are two groups of algorithms for compressing local descriptors. Algorithms in the first group quantize raw descriptors in a data-independent way. Chandrasekhar et al. proposed a descriptor called compressed histogram of gradients (CHoG), which uses Huffman trees, Gagie trees [41], or type coding [42] to compress a local feature into approximately 50 bits. Type coding involves constructing a lattice of distributions (types) q = (q1, ..., qm) whose coordinates are given by

qi = ki / n,

where ki is the number of points falling into the i-th cell of a predefined 2-D gradient partition and n = k1 + ... + km is the total number of points within the gradient partition. The index of the type closest to the original distribution is then selected and transmitted [42]. Prior to type coding, descriptors are L1-normalized so that each descriptor can be treated as a probability distribution.
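The following sketch illustrates snapping an L1-normalized histogram to a nearby type using a simple rounding rule; CHoG’s actual nearest-type search and the enumeration of the transmitted type index are more involved [42], so this is only an approximation of the idea.

```python
import numpy as np

def nearest_type(p, n):
    """Approximate a histogram p with a type q, where q_i = k_i / n and sum(k_i) = n.

    Rounds n*p_i down, then hands the leftover counts to the bins with the largest
    remainders; this simple rule stands in for the exact nearest-type search in [42]."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                          # L1-normalize so p is a distribution
    k = np.floor(n * p).astype(int)
    leftover = n - k.sum()                   # counts still to distribute
    if leftover > 0:
        order = np.argsort(-(n * p - k))     # largest fractional parts first
        k[order[:leftover]] += 1
    return k                                 # the type (k_1, ..., k_m); its index is what is coded
```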

Algorithms in the second group are data-driven. A codebook that partitions the descriptor space is first learned from training data, and quantized descriptors are represented by the nearest code word [13], [26], [28]-[32], [35]. PKU’s proposal [31], [32], [35] and TI’s proposal [26] fall into this category. Fig. 5 shows PKU’s descriptor compression using product quantization. Generally speaking, a data-driven approach may result in better pairwise matching and retrieval than a data-independent approach, but storing a codebook consumes more memory.

3.1.1 PKU’s Proposal for Product Quantizing SIFT Descriptors

In product quantization, an input vector is divided into k segments, and those segments are independently quantized using k subquantizers [31]. Each compressed descriptor is thus represented by k indexes, namely the indexes of the nearest code words of the k segments. A low-complexity quantizer qj is associated with the j-th segment of the input vector. Product quantization compresses a raw descriptor h = (h1, ..., hk) as

q(h) = (q1(h1), q2(h2), ..., qk(hk)),

where hj = (hj1, ..., hjm) is the j-th subsegment of length m. The descriptor dimension is n = m × k; that is, h is structured into k parts, each having m dimensions, and the k parts are quantized independently using the k subquantizers. All subquantizer dictionaries have the same size z, so the size of the Cartesian product representation space is z^k. Each subquantizer qj maps an m-dimensional vector hj to a code word

qj(hj) ∈ Cj = {cj,1, ..., cj,z},

where Cj is the set of reproduction values used to represent the original m-dimensional vector hj, and the size of each dictionary is z.

Quantizer quality is measured by the mean squared error (MSE) between an input vector and its reproduction value. Subquantizers are greedily learned by minimizing the MSE using Lloyd’s algorithm. Consequently, the compressed descriptor is represented by a short code comprising its indexes in all the subquantizers. However, descriptor compression in the TMuC is imperfect because product quantization requires large memory to store several tables. The quantization tables may reach up to 500 MB, which is unacceptable on mobile platforms. Reducing the size of the quantization tables has been set up as a core experiment [39].
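As a rough illustration of this scheme (a minimal sketch, not the TMuC or any proposal’s implementation; the segment count k and codebook size z below are illustrative), the following Python code learns k independent sub-codebooks with Lloyd’s algorithm and encodes a descriptor as k code-word indexes.

```python
import numpy as np

def train_subquantizers(train_descriptors, k=8, z=256, iters=15, seed=0):
    """Learn k codebooks of size z, one per segment, with Lloyd's algorithm (k-means)."""
    rng = np.random.default_rng(seed)
    n_dim = train_descriptors.shape[1]
    m = n_dim // k                                   # segment length, so n = m * k
    codebooks = []
    for j in range(k):
        seg = train_descriptors[:, j * m:(j + 1) * m]
        centers = seg[rng.choice(len(seg), z, replace=False)].copy()
        for _ in range(iters):
            # assign each training segment to its nearest center, then update the centers
            d2 = ((seg[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            labels = d2.argmin(axis=1)
            for c in range(z):
                members = seg[labels == c]
                if len(members):
                    centers[c] = members.mean(axis=0)
        codebooks.append(centers)
    return codebooks

def pq_encode(descriptor, codebooks):
    """Represent one descriptor by k code-word indexes (one byte each when z = 256)."""
    k = len(codebooks)
    m = len(descriptor) // k
    codes = []
    for j, centers in enumerate(codebooks):
        seg = descriptor[j * m:(j + 1) * m]
        codes.append(int(((centers - seg) ** 2).sum(axis=1).argmin()))
    return codes
```

With z = 256, each index fits in one byte, so a 128-dimensional SIFT descriptor stored as 128 bytes shrinks to k bytes, which is roughly consistent with the bit savings reported above.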

3.2 Location Coordinate Coding

Each local feature comprises a visual descriptor and location coordinates. The visual descriptor is used to compute visual similarity, and the location part is used for geometric verification when re-ranking returned images. Location compression is important in low bit-rate visual search. Take, for example, the lowest operating point of 512 bytes per query in the CDVS evaluation framework. For a 640 × 480 VGA image, 20 bits are needed to encode a feature’s location without compression. For an image comprising 500 local features, it would take about 1250 bytes to code the location coordinates, which already exceeds 512 bytes.

Tsai et al. proposed lossy compression of feature locations [43]. Location coordinates are quantized with a uniform partition of the 2D location space prior to coding (Fig. 6b). The quantized locations are converted into a location histogram comprising a histogram map (Fig. 6c) and a histogram count (Fig. 6d). The histogram map indicates empty and non-empty blocks, and the histogram count is the number of features in each non-empty block. Different block sizes lead to different quantization levels. The histogram map and histogram count are coded separately with context-based arithmetic coding. This scheme reduces bits by a factor of 2.8 compared with a fixed-point representation. When run-length, quad-tree, and context-based algorithms are compared, context-based coding performs the best, and its gain decreases as the block size becomes smaller. Run-length coding for a larger block size does not provide any gain.
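The construction of the histogram map and histogram count can be sketched as follows; the block size is illustrative, and the context-based arithmetic coding stage that follows is omitted.

```python
import numpy as np

def location_histogram(points, width, height, block=8):
    """Quantize (x, y) feature locations to a block grid and split the result into
    a binary histogram map plus per-block counts, in the spirit of [43]."""
    cols = (width + block - 1) // block
    rows = (height + block - 1) // block
    counts = np.zeros((rows, cols), dtype=int)
    for x, y in points:
        counts[int(y) // block, int(x) // block] += 1
    hist_map = (counts > 0).astype(np.uint8)   # empty / non-empty blocks
    hist_count = counts[counts > 0]            # counts for the non-empty blocks only
    return hist_map, hist_count                # each part is then entropy-coded separately
```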

Tsai’s location coding algorithm treats all features equally; however, in real-world applications, a subset of features can be more important for robustness and information fidelity. The compression distortion of important features can be reduced at the cost of increasing the distortion of unimportant features. Therefore, an importance map can be placed over the image: important features are quantized at finer levels, and unimportant features are quantized at coarser levels (Fig. 7). The definition of the importance map depends on the application. In our experiments, the importance map is a Gaussian function so that key points near the center of the image are emphasized. This definition is validated in the subsequent key-point selection experiments, which show that, compared with Tsai’s approach, the importance-map approach can reduce bits by 15% without seriously affecting performance (Fig. 8) [44]. Location coding was one of the core experiments set up at the 99th MPEG meeting.
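One possible reading of the importance map is sketched below: a Gaussian weight centered on the image decides whether a feature falls on a fine or a coarse quantization grid. The block sizes, Gaussian width, and threshold are our own illustrative values, not the ones used in [44].

```python
import numpy as np

def importance_block_size(x, y, width, height, fine=4, coarse=16, sigma_frac=0.25):
    """Gaussian importance map: features near the image center get the finer grid."""
    cx, cy = width / 2.0, height / 2.0
    sigma = sigma_frac * max(width, height)
    weight = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
    return fine if weight > 0.5 else coarse    # important features are quantized more finely
```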

3.3 Key-Point Selection

Selection criteria may include a key point’s location, peak value of DoG response, scale, dominant orientation, or descriptors. We conducted a systematic study [45], and Fig. 9 shows some results of feature-point selection using the following approaches:

・ Sequential selection. This involves selecting key points continuously from top left to bottom right.

・ Spread selection. This involves selecting key points to maximize the spread of the points.

・ Center selection. This involves selecting key points near the image center.

・ Scale selection or peak selection. This involves selecting key points of larger scales or peak values.

・ Fusion selection. This involves using a supervised approach to combine factors for better performance. Key points that can be repeatedly detected and matched are regarded as stable. The stability score of a key point is the number of images containing correctly matched points. A RankSVM ranking model is trained to sort key points according to their stability scores [46].

For operating points from 512 bytes to 16 kB, we set the maximum number of key points to 200, 350, 450, 550, and 1000. If the number of key points in an image exceeds the limit, key points must be discarded to meet it. Four hundred images, including 100 objects from the UKBench dataset, were used to generate features for RankSVM training.

The fusion selection approach performs the best, but center selection is also effective. In matching experiments, center selection and scale selection yield about 10-30% gains at 512 bytes and outperform sequential selection. The gain is less significant when the operating point goes up to 4 kB because there is no need to cut down the number of key points. The performance of peak selection is comparable to that of spread selection, with the true-positive rate improved by about 5-10%. Sequential selection performs the worst. Fusion based on RankSVM brings a slight additional gain, and similar results have been shown in retrieval experiments; however, scale selection performs much worse than center selection.

In summary, key-point selection significantly contributes to pairwise matching and retrieval, especially at very low operating points. Center selection basically works well; however, when the query object is far from the center, this approach is poor. A solution is to combine different factors.

In our model, each key point is characterized by scale, dominant orientation, peak value of DoG, and distance to center. Using the training model, for each individual factor, we estimate the probability that a feature can be matched correctly to features in other images. Fig. 10 shows these estimations based on scale and peak values. The probability of a correct match increases as the scale increases within a certain range, and this is consistent with our hand-crafted rule. The probability decreases as the scale exceeds the limit, and this is not taken into account in our current model.

These factor models are fused with a naive Bayesian approach; that is, the probabilities for all factors are multiplied to produce a score. Key points with higher scores are selected.
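The fusion step can be sketched as follows; the factor names and the per-factor probability models (which in practice would come from the training just described, as in Fig. 10) are placeholders.

```python
import numpy as np

def fusion_scores(factors, factor_models):
    """Naive-Bayes style fusion: multiply per-factor matching probabilities.

    factors: dict of arrays, e.g. {'scale': ..., 'peak': ..., 'dist_to_center': ...}
    factor_models: dict of callables returning, for each factor value, the estimated
    probability that the key point will be matched correctly."""
    n = len(next(iter(factors.values())))
    scores = np.ones(n)
    for name, values in factors.items():
        scores *= factor_models[name](np.asarray(values, dtype=float))
    return scores

def select_keypoints(factors, factor_models, limit):
    """Keep at most `limit` key points, ranked by their fused scores."""
    order = np.argsort(-fusion_scores(factors, factor_models))
    return order[:limit]
```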

Instead of using hand-crafted rules, as in [45], Gianluca et al. [26] exploit the statistical differences between correctly and incorrectly matched key points. These statistical differences are consistent across various large datasets [26]. The training data contains images from the Pasadena buildings dataset, the INRIA holidays dataset, and additional images of food labels and Italian buildings, as well as video frames from movies. The training set is organized into a list of pairwise matching images. Each image undergoes key-point detection, matching, and a geometric consistency check. Matched pairs of features are grouped into inliers and outliers: inliers are key points that have been correctly matched, and outliers are key points that have not been matched or have been incorrectly matched. Each feature is labeled according to whether it has been matched correctly (value 1) or not (value 0).

Key-point selection is important; however, more in-depth work is needed on how to make optimal combinations of factors. Key-point selection might be a core experiment in the upcoming 100th MPEG meeting.

3.4 Global Descriptors

State-of-the-art visual search applications usually use a BoW model based on local features. Chen et al. showed that visual search accuracy can be improved by combining global and local features [9]. A global feature is generated from local SIFT descriptors. Some 190 centroids are generated by applying k-means clustering to SIFT descriptors. Given an image, each SIFT descriptor is quantized to its nearest centroid, and the residual between the descriptor and the centroid is computed. For each centroid, the mean residual over all descriptors quantized to that centroid is computed. To reduce the dimensionality of the residuals, linear discriminant analysis (LDA) is applied, and the 32 most discriminative LDA eigenvectors are retained. The transformed values are binarized by sign. Experiments show that incorporating global descriptors may improve performance, especially at lower operating points.

PKU has provided evidence that visual search accuracy can be improved by leveraging global descriptors [47]. A global descriptor is generated by aggregating SIFT descriptors (Fig. 11). For a given image, a subset of SIFT key points is selected, and each 128-D SIFT descriptor is reduced to 64 dimensions by applying PCA. A Gaussian mixture model (GMM) with 256 Gaussians is trained on an independent dataset. To form the global descriptor, the image is represented as a fixed-size Fisher vector generated from the gradient of the likelihood of the selected 64-D SIFT descriptors with respect to the means of the learned GMM. The Fisher vector’s dimensionality is then reduced with PCA, and product quantization is used to encode the global feature. Under the CDVS evaluation framework, experimental results show that applying global descriptors in the retrieval step and then applying local descriptors in the re-ranking step greatly improves performance at all operating points (Fig. 12) [47].
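As a much simplified stand-in for these aggregation pipelines (the PCA, GMM training, LDA, binarization, and product-quantization stages are omitted, and the codebook here is a plain k-means vocabulary), the sketch below accumulates mean residuals per centroid into a single global vector.

```python
import numpy as np

def aggregate_residuals(descriptors, centroids):
    """VLAD-like aggregation: mean residual per centroid, concatenated and L2-normalized."""
    k, d = centroids.shape
    agg = np.zeros((k, d))
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assignment = d2.argmin(axis=1)                       # nearest centroid per descriptor
    for c in range(k):
        members = descriptors[assignment == c]
        if len(members):
            agg[c] = (members - centroids[c]).mean(axis=0)
    vec = agg.ravel()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```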

4 Other Issues

Feature indexing, retrieval, and geometric re-ranking may not be included in the MPEG CDVS standard; however, proper handling of these issues greatly improves retrieval scalability and accuracy. For fast search, database features must be indexed [3], [10]-[12]. There are two kinds of indexing approaches. The first involves searching for approximate nearest neighbors, for example, with k-D trees [3], [10] or locality-sensitive hashing (LSH) [11], [12]. The second involves a BoW model [13]-[15], in which greater speed is achieved by quantizing the feature space. k-D tree methods outperform BoW methods in terms of run time and recognition accuracy. However, BoW methods require much less storage space than k-D tree and LSH methods because inverted indexing based on BoW does not require feature vectors to be stored. LSH methods generally give better search accuracy than k-D tree methods but are inferior in terms of search time, which increases sharply as the database size increases. In the CDVS evaluation framework, memory use is limited to 16 GB. Neither k-D trees nor LSH can load the descriptors of all reference images (including one million distractors) into 16 GB of memory, so the current TMuC uses the BoW model.

In online search, local features in a query image are used to perform nearest-neighbor search. Database images are ranked by a scoring function that represents their visual similarity to the query’s nearest neighbors [13]-[15]. Post-processing can be used to verify geometric consistency within a subset of top-ranked images [14], [16], [17], and these images are re-ranked accordingly. Various re-ranking algorithms have been proposed to trade off accuracy against time complexity. In [14], spatial verification is used to estimate a transformation between a query region and each target image, and the returned images are re-ranked according to the discriminability of the spatially verified visual words (inliers). In [16], Jegou et al. propose using angle and scale to adjust the similarity score of an image: the score decreases when a visual word is inconsistent with the estimated transformation and increases in the presence of consistently transformed points. Compared with the re-ranking in [14], the re-ranking in [16] is much faster, but there is a risk of a higher false-positive rate.
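To make the retrieval flow concrete, the toy inverted-index scorer below indexes images by their visual-word counts and scores candidates with a simple tf-idf sum; real systems add normalization, stronger weighting schemes, and embeddings, so this is only a schematic sketch with made-up class and method names.

```python
import numpy as np
from collections import defaultdict

class InvertedIndex:
    """Toy BoW inverted index: one posting list of (image_id, count) per visual word."""

    def __init__(self, num_words, num_images):
        self.postings = defaultdict(list)
        self.doc_freq = np.zeros(num_words)
        self.num_images = num_images

    def add_image(self, image_id, word_ids):
        counts = defaultdict(int)
        for w in word_ids:
            counts[w] += 1
        for w, c in counts.items():
            self.postings[w].append((image_id, c))
            self.doc_freq[w] += 1

    def query(self, word_ids, top_k=50):
        """Rank database images by a tf-idf similarity to the query's visual words."""
        scores = defaultdict(float)
        for w in set(word_ids):
            idf = np.log(self.num_images / (1.0 + self.doc_freq[w]))
            for image_id, count in self.postings[w]:
                scores[image_id] += count * idf
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```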

In the TMuC, the fast re-ranking algorithm in [17] is slightly modified. Feature pairs are formed by comparing descriptor classification paths in a scalable vocabulary tree, and a geometric verification score is computed for these pairs. A histogram of logarithmic distance ratios (LDRs) for the pairs is used in the TMuC. The distribution of LDRs for inlier pairs differs from that for outlier pairs, and this difference is used to rate geometric consistency.
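The LDR idea can be illustrated with the short sketch below; the histogram range, bin count, and the use of the dominant-bin fraction as the score are simplifications of the actual re-ranking procedure in [17].

```python
import numpy as np

def ldr_score(query_pts, db_pts, bins=32, lo=-4.0, hi=4.0):
    """Geometric consistency from log distance ratios (LDRs) of matched feature pairs.

    query_pts[i] and db_pts[i] are the locations of the i-th matched feature in the
    query and in the database image. Consistent matches make the LDR histogram peak
    sharply, so the fraction of pairs in the dominant bin is used as the score."""
    q = np.asarray(query_pts, dtype=float)
    d = np.asarray(db_pts, dtype=float)
    ldrs = []
    for i in range(len(q)):
        for j in range(i + 1, len(q)):
            dq = np.linalg.norm(q[i] - q[j])
            dd = np.linalg.norm(d[i] - d[j])
            if dq > 0 and dd > 0:
                ldrs.append(np.log(dq / dd))
    if not ldrs:
        return 0.0
    hist, _ = np.histogram(ldrs, bins=bins, range=(lo, hi))
    return hist.max() / len(ldrs)
```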

5 Conclusion

In this paper, we have discussed developments in visual search technologies, including the state-of-the-art low bit-rate visual search pipeline and important components such as compact descriptors and efficient image-matching and retrieval algorithms. To facilitate interoperability between visual search applications, MPEG has made great efforts to standardize compact descriptors for visual search. We have described the CDVS evaluation framework, which is a competitive and collaborative platform for evaluating visual search technologies and solutions and a benchmark for crucial modules.

Despite significant progress on visual search within academia and industry, a few challenges remain. The size of visual queries transmitted from a mobile device to servers needs to be minimized because of the bandwidth constraints of 3G wireless networks. Under the CDVS umbrella, descriptor quantization [26], [27], [31], [32], [37], [41], [42], location coding [43], [44], and key-point selection [37], [45] have been investigated in order to produce more compact descriptors. Approximately 90% of bits can be saved by using compact descriptors instead of uncompressed SIFT descriptors, while search performance is well maintained. In addition, a mobile device is constrained by battery life, so energy saving is crucial for feature extraction and compression. Because mobile phones usually have limited computing capability, operations on a mobile phone must not be too complex. A core experiment has been set up to reduce the quantization table size to less than 5 MB. Moreover, computing DoG is time-consuming. Aptina [48] provided fast algorithms for key-point detection and descriptor generation that are targeted at system-on-chip (SoC) implementation; however, the robustness of the detector still needs to be evaluated. More importantly, the ongoing MPEG CDVS standardization has attracted much interest from hardware manufacturers such as Aptina, Nvidia, and STMicroelectronics.

Acknowledgments:

We would like to thank the reviewers for their useful comments. This work was supported by the National Basic Research (“973”) Program of China (2009CB320902) and in part by the National Natural Science Foundation of China (60902057).

References

[1] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. Conf. Vision and Pattern Recognition (CVPR), San Diego, CA, 2005, vol. 1, pp. 886-893.

[2] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” in Int. J. Comput. Vision, vol. 42, no. 3, pp. 145-175, 2001.

[3] D.G. Lowe, “Distinctive image features from scale-invariant keypoints,” in Int. J. Comput. Vision, vol. 60, no. 2, pp. 91-110, 2004.

[4] H. Bay, T. Tuytelaars, and L. V. Gool, “Surf: speeded up robust features,” in J. Comput. Vision Image Understanding, vol. 110, no. 3, pp. 346-359, Jun. 2006.

[5] K. Mikolajczyk and C. Schmid, “An affine invariant interest point detector,” in Proc. 7th European Conf. Comput. Vision (ECCV’02), Copenhagen, 2002, pp. 128-142.

[6] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions,” in Proc. British Machine Vision Conf., Cardiff, Wales, 2002, pp. 384-393.

[7] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, Oct. 2005.

[8] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid, “Evaluation of gist descriptors for web-scale image search,” in Proc. Int. Conf. Image and Video Retrieval (CIVR’09), Santorini, Greece, July 2009.

[9] D. Chen, V. Chandrasekhar et al., MPEG, “Improvements to the Test Model Under Consideration with a Global Descriptor”, ISO/IEC JTC1/SC29/WG11/ M23578, 2012/02.

[10] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching fixed dimensions,” in J. of the ACM, vol. 45, no. 6, pp. 891-923, Nov. 1998.

[11] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” in Commun. of the ACM, vol. 51, no. 1, pp. 117-122, 2008.

[12] Y. Ke, R. Sukthankar, and L. Huston, “An efficient parts-based near-duplicate and sub-image retrieval system,” in Proc. 12th Annu. ACM Int. Conf. Multimedia, New York, NY, 2004, pp. 869-876.

[13] J. Sivic and A. Zisserman, “Video Google: a text retrieval approach to object matching in videos,” in Proc. 9th IEEE Int. Conf. on Comput, Vision, Nice, France, 2003, pp. 1470.

[14] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proc. IEEE Conf. Comput Vision and Pattern Recognition, Minneapolis, MN, 2007, pp. 1-8.

[15] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recognition, New York, NY, Jun. 2006, vol. 2, pp. 2161-2168.

[16] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in Proc. 10th European Conf. Comput. Vision: Part I, Marseille, 2008, pp. 304-317.

[17] S. S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, R. Vedantham, R. Grzeszczuk, and B. Girod, “Fast geometric re-ranking for image-based retrieval,” in Proc. 17th IEEE Int. Conf. Image Processing, Hong Kong, 2010, pp. 1029-1032.

[18] MPEG, "Call for Proposals for Compact Descriptors for Visual Search," ISO/IEC JTC1/SC29/WG11 N12201, 2011/07.

[19] Stanford Center for Image Systems Engineering. Workshop on mobile visual search (2009) [Online]. Available: http://scien.stanford.edu/pages/conferences/mvs/

[20] Peking University. The 2nd workshop on mobile visual search (2011) [Online]. Available: http:///mvs/talksabstract.asp

[21] MPEG, “Compact Descriptors for Visual Search: Evaluation Framework,” ISO/IEC JTC1/SC29/WG11/N12202, 2011/07.

[22] MPEG, “Compact Descriptors for Visual Search: Requirements,” ISO/IEC JTC1/SC29/WG11/N11531, 2010/07.

[23] MPEG, “Compact Descriptors for Visual Search: Context and Objectives,” ISO/IEC JTC1/SC29/WG11/N11530, 2010/07.

[24] MPEG, “Compact Descriptors for Visual Search: Applications and Use Scenarios,” ISO/IEC JTC1/SC29/WG11/N11529, 2010/07.

[25] MPEG, "Call for Proposals for Compact Descriptors for Visual Search," ISO/IEC JTC1/SC29/WG11 N12201, 2011/07.

[26] MPEG, “Telecom Italia’s response to the MPEG CfP for Compact Descriptors for Visual Search,” ISO/IEC JTC1/SC29/WG11/ M22672, 2011/11.

[27] MPEG, “A Case Study for Compact Descriptors for Visual Search - Location Discriminative Mobile Landmark Search,” ISO/IEC JTC1/SC29/WG11/ M18542, 2010/10.

[28] R. Ji, L. Y. Duan, J. Chen, H. Yao, Y. Rui, S. F. Chang, and W. Gao, “Towards low bit rate mobile visual search with multiple-channel coding,” in Proc. 19th ACM Int. Conf. on Multimedia, Scottsdale, AZ, 2011, pp. 573-582.

[29] R. Ji, L. Y. Duan, J. Chen, H. Yao, T. Huang, and W. Gao, “Learning Compact Visual Descriptor for Low Bit Rate Mobile Landmark Search,” in Proc. Int. Joint Conf. on Artificial Intelligence, Barcelona, 2011, pp. 2456-2463.

[30] R. Ji, L. Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, and W. Gao. “Location Discriminative Vocabulary Coding for Mobile Landmark Search,” in Int. J. Comput. Vision, vol. 96, no. 3, pp. 290-314, 2012.

[31] MPEG, “Peking Compact Descriptor-PQ-SIFT,” ISO/IEC JTC1/SC29/WG11/M22620, 2011/11.

[32] MPEG, “Peking compact descriptor-PQ-WGLOH,” ISO/IEC JTC1/SC29/WG11/M22619, 2011/11.

[33] MPEG, “CDVS Proposal: Stanford Nokia Aptina Features,” ISO/IEC JTC1/SC29/WG11/ M22554, 2011/11.

[34] MPEG, “NEC's Response to CfP for Compact Descriptor for Visual Search” ISO/IEC JTC1/SC29/WG11/ M22717, 2011/11.

[35] Chunyu Wang, Ling-Yu Duan, Yi-Zhou Wang, and Wen Gao, “PQ-WGLOH: A bit-rate scalable local feature descriptor,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Kyoto, Mar. 2012.

[36] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” in IEEE Trans. Pattern Analysis & Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011.

[37] MPEG, "Description of Test Model under Consideration for CDVS," ISO/IEC JTC1/SC29/WG11 W12367, 2011/12.

[38] David G. Lowe, “Method and apparatus for identifying scale invariant features in an image and use of same for locating an object in an image,” US Patent 6 711 293, March 23, 2004.

[39] MPEG, "Description of Core Experiments on Compact descriptors for Visual Search," ISO/IEC JTC1/SC29/WG11/ N12551, 2012/02.

[40] Y. Ke and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in Proc. IEEE Comput. Soc. Conf. Computer Vision and Pattern Recognition, Washington DC, 2004, vol. 2, pp. 506-513.

[41] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod, “CHoG: compressed histogram of gradients―a low bit-rate descriptor,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 2504-2511.

[42] V. Chandrasekhar, G. Takacs, D. M. Chen, S. S. Tsai, Y. Reznik, R. Grzeszczuk, and B. Girod, “Compressed histogram of gradients: a low bitrate descriptor,” in Int. J. Comput. Vision, vol. 96, no. 3, pp. 384-399, 2011.

[43] S. S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, J. P. Singh, and B. Girod, “Location coding for mobile image retrieval systems,” in Proc. 5th Int. ICST Mobile Multimedia Commun. Conf., London, Sep. 2009, paper 8.

[44] Chunyu Wang et al., “Scalable location coding towards low-bit rate visual search,” Institute of Digital Media, Peking University, Tech. Rep. 2012.

[45] MPEG, “Reference results of key point reduction,” ISO/IEC JTC1/SC29/WG11/M23929, 2012/02.

[46] T. Joachims, “Optimizing search engines using click through data,” in Proc. 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, New York, NY, 2002, pp. 133-142.

[47] MPEG, “Peking University’s Response to CDVS Core Experiment 1”, JTC1/SC29/WG11/ m24781, 2012/4.

[48] MPEG, “Aptina Interest Point and Descriptor Method Summary,” ISO/IEC JTC1/SC29/WG11/ m22775, 2011/11.

Manuscript received: March 8, 2012

Biographies

Ling-Yu Duan () received his MSc degree in automation from the University of Science and Technology, China, in 1999. He received his MSc degree in computer science from the National University of Singapore in 2002 and his PhD degree in information technology from The University of Newcastle, Australia, in 2007. From 2003 to 2008, he was a research scientist at the Institute for Infocomm Research, Singapore. Since 2008, he has been an associate professor at the School of Electrical Engineering and Computer Science at Peking University. His current research interests include visual search and augmented reality, multimedia content analysis, and mobile media computing. He has authored more than 70 papers in these areas.

Jie Chen () is a PhD candidate at the School of Electrical Engineering and Computer Science, Peking University. His research interests include mobile visual search, low bit-rate visual descriptors, and vector quantization. He has published more than 10 journal and conference papers.

Chunyu Wang () is a PhD candidate at the School of Electrical Engineering and Computer Science, Peking University. His research interests include visual search, object recognition, and activity recognition.

Rongrong Ji () received his PhD degree in computer science from Harbin Institute of Technology, China. He is currently a postdoctoral research fellow at Columbia University. His research interests include image and video search, content analysis and understanding, mobile visual search and recognition, and interactive human-computer interface. Dr. Ji received the Best Paper Award at ACM Multimedia 2011 and received a Microsoft Fellowship in 2007.

Tiejun Huang () received his BSc and MSc degrees in computer science from Wuhan University of Technology in 1992 and 1995. He received his PhD degree in pattern recognition and image analysis from Huazhong University of Science and Technology, China, in 1998. He is currently a professor in the School of Electrical Engineering and Computer Science, Peking University. He is also vice director of the National Engineering Laboratory for Video Technology of China. His research interests include video coding, image understanding, digital rights management (DRM), and digital libraries. He has published more than sixty peer-reviewed papers and has authored or co-authored three books. He is a member of the board of directors for the Digital Media Project; he is on the advisory board for IEEE Computing Now; he is on the editorial board of the Journal on 3D Research; and he is on the board of the Chinese Institute of Electronics.

Wen Gao () received his MSc degree in computer science from Harbin Institute of Technology, China, in 1985. He received his PhD degree in electronics engineering from the University of Tokyo in 1991. He is a professor in the School of Electronics Engineering and Computer Science, Peking University. He has led research efforts in video coding, face recognition, sign language recognition and synthesis, and multimedia retrieval. Professor Gao was elected an Academician of the Chinese Academy of Engineering in 2011 and became an IEEE Fellow in 2010 for his contributions to video coding technology. He has been on the editorial boards of IEEE Trans. on Multimedia, IEEE Trans. Circuits Syst. for Video Tech., and several other top international academic journals. He was the chair of IEEE Int. Conf. Multimedia & Expo (ICME) 2007 and ACM Int. Conf. Multimedia (ACM-MM) 2009. He has authored four books and published more than 500 research papers on video coding, signal processing, computer vision, and pattern recognition.
