Document Analysis Technology
Extracting individual characters is not enough to obtain the important information in a document. The document must also be analyzed so that the text it contains can be organized into meaningful segments and units. To address this issue, Fuji Xerox has developed document analysis technology that automatically extracts structure information from electronic documents, both those converted from paper documents and those created using various applications.
This technology consists of an image analysis system, which analyzes the images in documents, and a code analysis system, which analyzes characters that have code information (Fig. 1). The image analysis system first pre-processes the document image (e.g., skew correction, resolution conversion). It then analyzes the characteristics of the image through area analysis to identify the attributes of the areas that constitute the document, such as text, pictures, tables, and figures. Within the text and table areas, individual text areas are identified by analyzing the arrangement of character strings, and each area is then classified as vertically or horizontally written. To increase the character recognition rate, the text is separated from ruled lines and the contours of the text and lines are corrected. OCR is then applied, and the text areas and tables are correlated with character codes.
Meanwhile, the code analysis system interprets the code information of the characters that constitute the document and obtains information on individual characters. At the same time, elements in the document are divided into character elements that have code information and those that do not, and the character elements are then converted into images. Using the results of the analysis conducted by the image analysis system, text areas are identified through area analysis and the layout information is obtained.
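The step that decides whether a text area is written vertically or horizontally can be illustrated with a small sketch. This is not Fuji Xerox's algorithm, which the text does not describe; it is a simple nearest-neighbor voting heuristic, assuming that characters within a line sit closer together than the lines themselves. The names `Box` and `writing_direction` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box of one character (hypothetical structure for this sketch)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


def writing_direction(boxes: List[Box]) -> str:
    """Guess the writing direction of a text area from its character boxes.

    Each character votes according to where its nearest neighbor lies:
    mostly to the side means horizontal, mostly above or below means vertical.
    Assumption: spacing inside a line is tighter than spacing between lines.
    """
    if len(boxes) < 2:
        return "unknown"

    horizontal_votes = 0
    vertical_votes = 0
    for i, box in enumerate(boxes):
        cx, cy = box.center
        # Find the closest other character by center-to-center distance.
        nearest = min(
            (other for j, other in enumerate(boxes) if j != i),
            key=lambda o: (o.center[0] - cx) ** 2 + (o.center[1] - cy) ** 2,
        )
        dx = abs(nearest.center[0] - cx)
        dy = abs(nearest.center[1] - cy)
        if dx > dy:
            horizontal_votes += 1
        elif dy > dx:
            vertical_votes += 1

    if horizontal_votes > vertical_votes:
        return "horizontal"
    if vertical_votes > horizontal_votes:
        return "vertical"
    return "unknown"


if __name__ == "__main__":
    # Three lines of five characters: tight horizontal pitch, wide line pitch.
    area = [Box(12 * col, 30 * row, 10, 10) for row in range(3) for col in range(5)]
    print(writing_direction(area))
```

A production system would likely combine several cues (projection profiles, character aspect ratios, language hints); the voting rule here is only meant to show how character arrangement alone can carry the signal.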
For example, characters in an electronic document sometimes have no code information because they were generated as images. Even in such cases, coordinating the image analysis system and the code analysis system makes it possible to obtain character code information and generate the layout information.
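The coordination described here can be sketched as a fallback: use a character's code when the document carries one, and otherwise hand its image to a recognizer. This is a minimal illustration under stated assumptions, not the actual system. `CharElement`, `resolve_codes`, and the `recognize` callable (standing in for the image analysis system's OCR) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x, y, width, height) on the page


@dataclass
class CharElement:
    """One character element in an electronic document (hypothetical structure)."""
    bbox: BBox
    code: Optional[str] = None     # character code, if the document carries one
    image: Optional[bytes] = None  # rendered pixels, used when the code is missing


def resolve_codes(
    elements: List[CharElement],
    recognize: Callable[[bytes], str],
) -> List[Tuple[BBox, str]]:
    """Return (bbox, character) pairs for every element.

    Elements with code information are used as-is (code analysis path).
    Elements without it are passed to `recognize` (image analysis path).
    """
    resolved: List[Tuple[BBox, str]] = []
    for element in elements:
        if element.code is not None:
            resolved.append((element.bbox, element.code))
        elif element.image is not None:
            resolved.append((element.bbox, recognize(element.image)))
        else:
            # Neither a code nor an image: nothing can be recognized.
            resolved.append((element.bbox, ""))
    return resolved


if __name__ == "__main__":
    # Stand-in recognizer for the demo; a real one would run OCR on the pixels.
    def fake_ocr(image: bytes) -> str:
        return image.decode("ascii")

    page = [
        CharElement((0, 0, 10, 10), code="A"),
        CharElement((12, 0, 10, 10), image=b"B"),
    ]
    print(resolve_codes(page, fake_ocr))
```

Because every element keeps its bounding box regardless of which path produced its character, the combined result can feed the same layout analysis in both cases.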
With this document analysis technology, various types of documents can be analyzed, whether they are in paper or digital form.