Character Recognition Technology using the Mechanism of Human Visual Information Processing

Machine recognition of characters written on whiteboards or in notebooks remains a difficult, unsolved challenge. Moreover, because character recognition technology has been developed separately for each language, recognition is also difficult in documents that mix languages such as Japanese, English, and Chinese, even when the text is neatly written.

To address these challenges, Fuji Xerox has been developing character recognition technology that can recognize handwritten characters, even when multiple languages are used in a single document. The text recognition technology that we have developed adopts mechanisms of human visual information processing that have become much better understood thanks to neuroscience.

Fig. 1a: Direction Selectivity of the V1 Cell
The V1 cell responds to a line at a particular angle, called the "optimal angle." A line at a different angle elicits no response.

Fig. 1b: Position Invariance of the V1 Cell
The V1 cell responds when the line has the optimal angle, regardless of the line's location in the receptive field. It does not respond to patterns that cover large areas of the receptive field.

Fig. 1c: Cross-orientation Suppression of the V1 Cell
When a line at an angle very different from the optimal angle is superimposed on a line at the optimal angle, the response to the optimal-angle line is suppressed.

Visual information is first transmitted from the retina to an area called the primary visual cortex (V1), which has the following three properties: (1) direction selectivity, by which V1 responds selectively to lines at a certain angle (Fig. 1a); (2) position (phase) invariance, by which V1 responds to a line at the optimal angle regardless of where it falls within the receptive field (Fig. 1b); and (3) cross-orientation suppression, by which the response of V1 is suppressed when a line at the optimal angle is overlapped by a line at a very different angle (Fig. 1c). After V1, visual information is relayed to V2, which can recognize figures consisting of two lines such as crosses and corners, then to V4, which can recognize more complex figures, and finally to the inferior temporal (IT) cortex (Fig. 2). It is believed that the human brain processes visual information in this hierarchical way in order to recognize various shapes and figures, ranging from simple lines to objects as complex as characters or faces.

Fig. 2: Overview of Visual Information Processing
The information obtained at the retina is transmitted to the
primary visual cortex (V1), V2, V4, and then finally to the
inferior temporal (IT) cortex. It is believed that this is how
both simple and complex figures are recognized.

Figure 3 shows the overall structure of the character recognition method that uses this mechanism of visual information processing. An input image containing characters is first passed through layers of convolution units and sub-sampling units, which extract its features in a manner similar to how V1 and V2 process visual information in the brain. The "character classifier" then determines each character, imitating the processing performed in the IT cortex. The filters used for convolution can evolve through learning on various types of characters. For example, the filter that extracts lines in Convolution unit 1 (which corresponds to V1 in the brain) learns so that it can handle various types of lines.

Fig. 3: Structure of the Developed Character Recognition System
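The pipeline in Fig. 3 can be sketched roughly as follows. The kernels, layer sizes, and pooling scheme below are illustrative assumptions, not the developed system's actual parameters: two convolution + sub-sampling stages are stacked, and the output is flattened into a feature vector for the character classifier.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Plain 2-D 'valid' convolution (kernel flipped), with no library dependency."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel[::-1, ::-1])
    return out

def subsample(feature_map, pool=2):
    """Average pooling: tolerates small positional shifts, analogous to the
    position invariance described in the text."""
    h, w = feature_map.shape
    h, w = h - h % pool, w - w % pool
    fm = feature_map[:h, :w]
    return fm.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))

# Hypothetical two-stage pipeline (Convolution unit 1 + Sub-sampling unit 1,
# then Convolution unit 2 + Sub-sampling unit 2), mirroring V1 -> V2.
rng = np.random.default_rng(0)
image = rng.random((28, 28))            # stand-in for a character image
k1 = rng.standard_normal((5, 5))        # stand-in for a learned line filter
k2 = rng.standard_normal((3, 3))        # stand-in for a learned corner/cross filter

stage1 = subsample(np.maximum(convolve2d_valid(image, k1), 0))   # 24x24 -> 12x12
stage2 = subsample(np.maximum(convolve2d_valid(stage1, k2), 0))  # 10x10 -> 5x5
features = stage2.ravel()               # fed to the character classifier
```

In the real system, the filters `k1` and `k2` are not random but are learned from training characters, as described above.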

In the sub-sampling unit, the "energy model" provides position invariance by tolerating small positional differences, and the "cross-orientation suppression model" (which combines excitatory input and inhibitory input) reproduces cross-orientation suppression, as shown in Fig. 4. By passing information through multiple sets of a convolution unit and a sub-sampling unit, the information is processed much as it is in V1 and V2, allowing both simple and complex figures to be extracted. Finally, the character classifier identifies characters based on the extracted features. The classifier is also capable of learning, so the character recognition rate can be improved. In this way, the character recognition system, like the brain, improves its ability to recognize characters by learning. As a result, handwritten characters and characters from different languages can now be recognized.

Fig. 4: Structure of Sub-sampling Unit
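The two models named above can be sketched in a few lines. The energy model below sums the squared outputs of a quadrature pair of filters, which makes the response insensitive to small positional (phase) shifts; cross-orientation suppression is sketched as divisive normalization, with pooled responses at other orientations dividing the excitatory response. Both are illustrative assumptions about the models' form, not the exact formulation used in the developed system.

```python
import numpy as np

def energy_response(even, odd):
    """Energy model: the squared outputs of a quadrature pair of filters
    are summed, so small positional (phase) shifts of the stimulus
    leave the response unchanged."""
    return even**2 + odd**2

def cross_orientation_suppression(excitatory, inhibitory, sigma=1.0):
    """Divisive-normalization sketch: the excitatory response at the
    optimal orientation is divided by pooled responses at other
    orientations, so an overlapping line at a very different angle
    suppresses the output."""
    return excitatory / (sigma + np.sum(inhibitory))

# Phase invariance: a quadrature pair (cos, sin) gives a constant energy
# response no matter how the stimulus phase shifts.
for phase in (0.0, 0.3, 1.2):
    assert abs(energy_response(np.cos(phase), np.sin(phase)) - 1.0) < 1e-9

# Suppression: the same optimal-angle response (hypothetical value 4.0) is
# reduced when an orthogonal line drives the inhibitory pool.
resp_alone = cross_orientation_suppression(4.0, inhibitory=np.array([0.1]))
resp_overlap = cross_orientation_suppression(4.0, inhibitory=np.array([3.0]))
# resp_overlap < resp_alone: the overlapping line suppresses the response
```

This mirrors Fig. 1b and Fig. 1c at the level of a single unit; in the sub-sampling unit the same operations are applied across the whole feature map.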

Figure 5 compares the character recognition system that uses the mechanism of visual information processing with a conventional character recognition system. Figure 5a shows the freely written characters that were evaluated; Fig. 5b shows the evaluation results. Both systems recognize neatly written characters (grade 4 or 5) at a high rate, but for characters of grades 2 and 3, our system achieves a higher recognition rate than the conventional system.

The legibility of the characters is graded according to the chart below.

Grade: Definition
5: Excellent
4: Good
3: Fair
2: Poor
1: Bad
Fig. 5a: Evaluated Characters

Fig. 5b: Recognition Rate for Each Grade of Legibility
The above shows the recognition rates for characters graded 4 and 5 and for those graded 2 and 3.

Fig. 5: Results of Comparing the Developed Character Recognition System and a Conventional Character Recognition System