Visual Business Recognition – A Multimodal Approach
A business recognition system can automatically identify businesses in an image and retrieve additional relevant information such as reviews, ratings, and similar nearby businesses. Such a system can provide users of smartphones or wearable computers, such as Google Glass, with extensive information about a particular business of interest in an automatic and convenient fashion. It can also be used to enhance the user experience in browsing maps, or for location-aware image understanding.
We developed a multimodal approach which incorporates business directories, textual information, and web images in a unified framework. Our method works for both franchise and non-franchise businesses, as well as those whose storefronts contain no text or logos. We assume the query image is associated with a coarse location tag and utilize business directories to extract an over-complete list of nearby businesses which may be visible in the image. We use the names of the nearby businesses as search keywords in order to automatically collect a set of relevant images from the web and perform image matching between them and the query. Additionally, we employ a text processing method customized for business recognition which is assisted by the nearby business names. Finally, we fuse the information acquired from image matching and text processing in a probabilistic framework to recognize the businesses. The above figure shows more details for each step and the fusion process for a sample query image:
Web Image Matching
In order to leverage the images on the web for business recognition, especially for storefronts which do not include any text, we use the list of nearby business names as search keywords and collect a set of web images for each one. The query image is expected to share some similarity with the web images of the business which is visible in it. Therefore, we match the query image against the collected web images in order to identify the similar ones. This process yields a PDF (probability density function) which represents how well the web images of each nearby business match the query image.
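The scoring step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the raw match scores (e.g., inlier keypoint counts from a feature matcher such as SIFT with geometric verification) are assumed to be precomputed, and the business names and numbers are made up.

```python
# Sketch: turn raw image-matching scores into a per-business PDF.
# The image matcher itself (feature extraction + matching) is assumed
# to run beforehand and produce one score per retrieved web image.

def business_pdf(scores_per_business):
    """scores_per_business: dict mapping a business name to a list of
    match scores, one per web image collected for that business."""
    # Let the best-matching web image represent each business.
    best = {b: max(s) if s else 0.0 for b, s in scores_per_business.items()}
    total = sum(best.values())
    if total == 0:
        # No matches anywhere: fall back to an uninformative uniform PDF.
        n = len(best)
        return {b: 1.0 / n for b in best}
    return {b: v / total for b, v in best.items()}

# Hypothetical inlier counts for three nearby businesses.
scores = {
    "Joe's Pizza": [12, 85, 3],
    "Corner Cafe": [5, 7, 2],
    "City Books": [1, 0, 4],
}
pdf = business_pdf(scores)
best_business = max(pdf, key=pdf.get)
```

Taking the maximum over a business's web images makes the score robust to the many irrelevant results a keyword search returns: a single strongly matching positive example is enough.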
The above figure shows sample web images for five businesses. The red margin marks the positive examples. Green, yellow, and blue markers denote the results yielded by the keywords “business name”, “business name + city”, and “business name + storefront”, respectively. Note that the images were retrieved based on verbal tags only, yet the positive examples share similarities with the content of the query image; hence, they can be utilized for identifying the right business using image matching. Also, the positive examples do not have to belong to the same exact storefront; in fact, anything in common between the positive examples and the query, e.g. logos, signs, or the general appearance of the storefront, can be leveraged for identifying the right business.
To utilize the textual information, we perform text detection on the query image; then, we apply a multi-hypothesis text recognition approach assisted by the business lexicon, which yields a PDF specifying how well a detected word in the query image matches the nearby businesses. Since the business name in the image may include several words, we combine the per-word PDFs through marginalization to obtain a single PDF representing the textual information in the whole query image.
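The lexicon-assisted matching can be sketched as below. This is a simplified stand-in for the paper's multi-hypothesis recognition: the candidate readings per detected word would come from the character recognizer, but are hard-coded here, and the distance-to-probability conversion is an illustrative choice.

```python
# Sketch: score lexicon words (drawn from nearby business names)
# against multiple candidate readings of a detected word, using
# edit distance, then normalize the scores into a PDF.

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def word_pdf(candidates, lexicon):
    """For each lexicon word, keep its best (smallest) edit distance
    over all candidate readings; convert distances to a PDF."""
    dists = {w: min(edit_distance(c.lower(), w.lower()) for c in candidates)
             for w in lexicon}
    # Smaller distance -> larger weight (an illustrative mapping).
    weights = {w: 1.0 / (1 + d) for w, d in dists.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

lexicon = ["pizza", "cafe", "books"]   # words from nearby business names
candidates = ["pizzo", "p1zza"]        # hypothetical recognizer outputs
pdf = word_pdf(candidates, lexicon)
```

Keeping several candidate readings per word, rather than committing to a single OCR output, lets a noisy reading like “p1zza” still match the correct lexicon entry.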
The above figure shows the process of our multi-hypothesis character recognition. (a): the business lexicon. (b): the query word and the nominated candidates for each query patch; the correct candidates are marked with red circles. (c): the best-matching permutations for each business word and their respective edit distances.
Lastly, we combine the two PDFs acquired from text processing and image matching in a probabilistic late fusion step to compute a single PDF which incorporates both modalities.
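A minimal sketch of such a late fusion step is shown below. The product rule (which assumes the two modalities are conditionally independent) and the smoothing constant are illustrative choices, not necessarily the exact formulation in the paper; the business names and probabilities are made up.

```python
# Sketch: probabilistic late fusion of the image-matching PDF and
# the text-processing PDF over the nearby businesses.

def fuse(pdf_image, pdf_text, eps=1e-6):
    """Multiply per-business probabilities and renormalize.
    eps smooths zeros so a single missing modality (e.g., a storefront
    with no readable text) cannot veto a business outright."""
    businesses = set(pdf_image) | set(pdf_text)
    joint = {b: (pdf_image.get(b, 0.0) + eps) * (pdf_text.get(b, 0.0) + eps)
             for b in businesses}
    total = sum(joint.values())
    return {b: v / total for b, v in joint.items()}

pdf_img = {"Joe's Pizza": 0.6, "Corner Cafe": 0.3, "City Books": 0.1}
pdf_txt = {"Joe's Pizza": 0.5, "Corner Cafe": 0.2, "City Books": 0.3}
fused = fuse(pdf_img, pdf_txt)
recognized = max(fused, key=fused.get)   # the business with highest fused probability
```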
The following figure illustrates the business recognition results for 8 sample query images:
The following shows the results of our method applied to a YouTube clip:
The following figures show the PDF each modality yields, along with the fusion results:
The accuracy of our method in recognizing business words (text processing only), along with four baselines, is shown below. The precision-recall curves compare our results (text processing only) vs. Wang et al.'s:
The following compares our word recognition results on a subset of the Street View Text (SVT) data set with those of Wang et al.'s ICCV'11 paper. (Click HERE to see more results.)
Our code for text detection can be downloaded HERE. (by Oliver Nina)
Amir Roshan Zamir, Afshin Dehghan and Mubarak Shah, “Visual Business Recognition – A Multimodal Approach”, In Proceeding of ACM International Conf. on Multimedia (ACM MM), 2013 [PDF | BibTeX | Supplemental Mat.]
Take a look at a few of our other papers and projects in the area of geo-spatial analysis of images and videos:
Amir Roshan Zamir and Mubarak Shah, “Image Geo-localization Based on Multiple Nearest Neighbor Feature Matching using Generalized Graphs”, In IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2014 [PDF | Project Page]
Amir Roshan Zamir, Shervin Ardeshir and Mubarak Shah, “Robust Refinement of GPS-Tags using Random Walks with an Adaptive Damping Factor”, in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2014 [PDF | Project Page]
Gonzalo Vaca, Amir Roshan Zamir and Mubarak Shah, “City Scale Geo-spatial Trajectory Estimation of a Moving Camera”, in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2012 [PDF | BibTeX | Project Page]