Image Annotation Technology

As digital cameras have become more widespread, photos are now used in the workplace in many industries, including construction, manufacturing, and insurance. Because a large number of photos are taken and stored every day, classifying and organizing them takes time. In addition, image materials are now often used to create visually appealing documents, so users need to be able to find the images they want quickly. At Fuji Xerox, we are researching image annotation technology, which assigns "labels" to an image indicating its contents. With this technology, images can be classified automatically, and desired images can be retrieved using their labels.

Image annotation technology generally consists of two processes: learning and labeling. We begin with the learning process. The standard approach uses statistical image recognition, which requires preparing a training corpus containing a large number of images with assigned labels. This training corpus is then used to model the relationships between image features and labels. However, we consider it impractical to prepare so many training images for every customer, since each customer requires different labels. To address this, we aim to develop technology that can learn from as few images as possible. In our learning process, a small set of images is prepared as a training corpus, as shown in Fig. 1, and each image is segmented into regions (here, a 4x7 grid). An image feature is then extracted from each region. By gathering the image features extracted from the images assigned each label and statistically analyzing their distribution, a model of the image-feature distribution is created for each label. Besides the image features related to its labels, each image also contains noise, such as background. Statistically analyzing the distribution of all the extracted features makes it possible to reduce the effect of this noise, enabling efficient learning from only a small number of images.
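The learning steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Fuji Xerox's implementation: the article does not say which image feature is extracted, so the hypothetical `grid_features` uses the mean RGB color of each grid cell, and the hypothetical `fit_label_models` models each label's feature distribution as a diagonal Gaussian (also an assumption; the actual model is not described). Images are plain lists of pixel rows so the sketch needs only the standard library.

```python
import math
from statistics import fmean

def grid_features(image, rows=4, cols=7):
    """Segment an image (a list of rows of (R, G, B) pixels) into a rows x cols
    grid and return one feature per region: its mean color (assumed feature).
    Assumes the image is at least rows x cols pixels so no cell is empty."""
    h, w = len(image), len(image[0])
    feats = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(tuple(fmean(px[k] for px in cell) for k in range(3)))
    return feats

def fit_label_models(corpus, rows=4, cols=7, eps=1e-6):
    """corpus: list of (image, labels). Pools the region features of every
    image carrying a label and fits a per-dimension (diagonal) Gaussian to
    them. eps keeps every variance strictly positive."""
    pools = {}
    for image, labels in corpus:
        feats = grid_features(image, rows, cols)
        for label in labels:
            pools.setdefault(label, []).extend(feats)
    models = {}
    for label, feats in pools.items():
        means = [fmean(f[k] for f in feats) for k in range(3)]
        variances = [fmean((f[k] - means[k]) ** 2 for f in feats) + eps for k in range(3)]
        models[label] = (means, variances)
    return models

def log_likelihood(model, feature):
    """Log-density of one region feature under a label's diagonal Gaussian."""
    means, variances = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(feature, means, variances))

# Hypothetical toy corpus: two solid-color 8x14 images.
def solid(color, h=8, w=14):
    return [[color] * w for _ in range(h)]

corpus = [(solid((255, 128, 0)), {"tiger"}), (solid((0, 64, 255)), {"water"})]
models = fit_label_models(corpus)
```

Note that this simple version pools every region of a labeled image, background included, so it does not reproduce the noise-reduction behavior described above; it only shows the segment, extract, and model-per-label pipeline.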


Fig. 1: The process of learning annotation models

We are also developing original technology for the training algorithm itself. First, when only a small amount of training data is available, a problem known as "overfitting" can occur. Overfitting occurs when a model adapts too closely to the training data, so that its performance declines when predicting unseen data. To address this, we added constraints between the probability models of the labels by maximizing their cross entropy (Note 1), and were thus able to reduce overfitting (Note 2; reference 1). Second, to make customization for each customer easier, it is important to shorten the learning process. By using random forest classifiers (Note 3) to model the distribution of image features, we reduced the time required for learning to approximately 1/100 of the conventional time (reference 2).
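For two discrete distributions p and q over the same feature bins, cross entropy is H(p, q) = -Σ_i p_i log q_i, and it grows as the two distributions disagree. The sketch below computes it and shows one hypothetical way a separation term could be added to a training objective so that label models are pushed apart. The histogram representation, the symmetric sum over label pairs, and the weight `lam` are all assumptions; the article does not give the actual formulation (see reference 1).

```python
import math
from itertools import combinations

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i) for two discrete distributions over the
    same bins. eps guards against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def regularized_objective(log_likelihoods, label_hists, lam=0.1):
    """Hypothetical objective: total data log-likelihood plus a bonus for
    keeping every pair of label models apart (larger cross entropy in both
    directions). Maximizing it discourages label models from collapsing onto
    the same features, such as shared background."""
    separation = sum(
        cross_entropy(label_hists[a], label_hists[b])
        + cross_entropy(label_hists[b], label_hists[a])
        for a, b in combinations(sorted(label_hists), 2)
    )
    return sum(log_likelihoods.values()) + lam * separation

# Same fit to the data, but different degrees of overlap between label models.
ll = {"tiger": -10.0, "water": -12.0}
overlapping = regularized_objective(ll, {"tiger": [0.5, 0.5], "water": [0.5, 0.5]})
separated = regularized_objective(ll, {"tiger": [0.9, 0.1], "water": [0.1, 0.9]})
```

Here `separated` exceeds `overlapping`, so an optimizer maximizing this objective would prefer label models that do not share the same features.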

Next is the labeling process, illustrated in Fig. 2. As in the learning process, the input image is segmented into regions, and the models created during learning are used to calculate, for each region, the probability that it contains the image feature corresponding to each label. By integrating these probabilities over all regions, an annotation score is calculated for the entire image. When a model's score is equal to or above the threshold (zero), that model's label is assigned to the image (in Fig. 2, the labels for the "tiger" and "water" models are assigned). Images can then be retrieved using their assigned labels, such as "flower", as shown in Fig. 3.
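The labeling and retrieval steps can be sketched as follows. The article does not specify how region probabilities are integrated; this sketch assumes the score is the mean log-odds of the per-region probabilities, which makes zero a natural threshold. `annotation_score`, `assign_labels`, and `build_index` are hypothetical names, and in practice the per-region probabilities would come from the learned label models.

```python
import math

def annotation_score(region_probs, eps=1e-9):
    """Integrate per-region probabilities into one image-level score.
    Assumption: mean log-odds, so a score >= 0 means the regions on
    balance favour the label."""
    log_odds = [math.log((p + eps) / (1.0 - p + eps)) for p in region_probs]
    return sum(log_odds) / len(log_odds)

def assign_labels(probs_by_label, threshold=0.0):
    """Assign every label whose score is equal to or above the threshold."""
    return {label for label, probs in probs_by_label.items()
            if annotation_score(probs) >= threshold}

def build_index(labels_by_image):
    """Invert image -> labels into label -> image ids for retrieval."""
    index = {}
    for image_id, labels in labels_by_image.items():
        for label in labels:
            index.setdefault(label, set()).add(image_id)
    return index

# Hypothetical per-region probabilities for a 4-region image.
probs = {
    "tiger": [0.8, 0.7, 0.6, 0.3],
    "water": [0.9, 0.6, 0.5, 0.5],
    "flower": [0.1, 0.2, 0.3, 0.4],
}
labels = assign_labels(probs)
index = build_index({"img1": labels, "img2": {"flower"}, "img3": {"flower", "water"}})
flower_images = index.get("flower", set())
```

With these example probabilities, "tiger" and "water" score above zero and "flower" does not, mirroring the Fig. 2 example; querying the index for "flower" then returns the images labeled with it.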


Fig. 2: The process of assigning labels


Fig. 3: Example of retrieving images of flowers

The difficulty of image annotation varies greatly depending on the subject of the image. For example, annotation becomes harder as background makes up a larger share of the image. At Fuji Xerox, we will continue research and development in this area, with the aim of enabling the classification and retrieval of any image used in the workplace.