Automatic Document Naming System

Scanned documents are generally assigned file names automatically, based on attributes such as date, time, and page number. Because these names reveal nothing about a document's content, users often have to open each file to check what it contains and then rename it.

To minimize the time and effort spent renaming documents, Fuji Xerox is developing a system that can automatically categorize documents by type and give them names that convey their content.

Fig. 1 shows an overview of the automatic document naming system. A scanned document is first separated into text, figures, and tables. Then, optical character recognition (OCR) is performed on the text, and the title of the document is identified from the OCR results, which include the coordinates and size of the text. Next, natural language processing is performed on the recognized text to identify words that appear to be important (keywords). Using the identified title and keywords, the system determines the document's category and assigns a document name. The document categories and the document naming rules can be freely specified by the user.

Fig. 1: Automatic document naming system
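
The title identification step can be illustrated with a minimal sketch. The article states only that the title is found from OCR results carrying text coordinates and text size; it does not say how the two are combined. The heuristic below (take the largest text in the upper half of the page) is therefore an assumption, and the names TextBlock, guess_title, and top_fraction are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextBlock:
    """One OCR text block (hypothetical structure, not the system's own)."""
    text: str
    x: float       # left edge of the block on the page
    y: float       # top edge of the block (0 = top of the page)
    height: float  # character height, used as a proxy for text size


def guess_title(blocks: List[TextBlock], page_height: float,
                top_fraction: float = 0.5) -> Optional[str]:
    """Assumed heuristic: the title is the largest text in the upper part of the page."""
    non_empty = [b for b in blocks if b.text.strip()]
    # Prefer blocks that start within the top portion of the page.
    candidates = [b for b in non_empty if b.y <= page_height * top_fraction]
    if not candidates:
        candidates = non_empty  # fall back to every block on the page
    if not candidates:
        return None
    # Largest text wins; ties go to the block nearer the top.
    best = max(candidates, key=lambda b: (b.height, -b.y))
    return best.text.strip()
```

With this heuristic, a large heading near the top of the page is preferred over equally large text in a footer, because the footer falls outside the upper portion of the page that is searched first.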

The following is an example of how a document's category is defined. First, the user defines the document types into which documents are to be sorted, such as delivery slips or bills. For example, the system could be set so that whenever the title contains the word "delivery" and the document includes a table, the document is recognized as a delivery slip. The document naming rules can be specified as well. For example, the format for naming documents categorized as delivery slips can be specified as "Delivery slip_[keywords]_[date scanned]".
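
A user-defined category and its naming rule could be represented roughly as follows. This is only a sketch under stated assumptions: the CategoryRule structure, the categorize and build_name functions, and the {keywords} and {date} placeholders (standing in for "[keywords]" and "[date scanned]") are all hypothetical and are not taken from the actual system.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CategoryRule:
    """One user-defined document category (hypothetical structure)."""
    name: str             # category label, e.g. "Delivery slip"
    title_word: str       # word that must appear in the title
    requires_table: bool  # whether the document must contain a table
    name_format: str      # naming rule with {keywords} and {date} placeholders


def categorize(title: str, has_table: bool,
               rules: List[CategoryRule]) -> Optional[CategoryRule]:
    """Return the first rule whose conditions the document satisfies."""
    for rule in rules:
        word_found = rule.title_word.lower() in title.lower()
        table_ok = has_table or not rule.requires_table
        if word_found and table_ok:
            return rule
    return None


def build_name(rule: CategoryRule, keywords: List[str], scanned_on: date) -> str:
    """Fill in the naming rule; the date is written as yyyymmdd."""
    return rule.name_format.format(
        keywords=" ".join(keywords),
        date=scanned_on.strftime("%Y%m%d"),
    )


# The delivery slip rule described above, expressed in this sketch's format.
DELIVERY_SLIP = CategoryRule(
    name="Delivery slip",
    title_word="delivery",
    requires_table=True,
    name_format="Delivery slip_{keywords}_{date}",
)
```

Under these assumptions, calling categorize with a title that contains "delivery" and has_table=True returns the delivery slip rule, and build_name then produces a name of the form "Delivery slip_[keywords]_yyyymmdd". A title containing "delivery" in a document without a table matches no rule.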

Fig. 2 shows a specific example in which a delivery slip is scanned. Through image analysis, the document is first separated into text and a table. Then, using text recognition processing and natural language processing, the document title "Delivery of Parts for X" and the keywords "Parts for X" are determined. Because the document includes a table and the title contains the word "delivery", the document is categorized as a "delivery slip" according to the document category definitions. Finally, the document file is named "Delivery slip_Parts for X_yyyymmdd".
In this way, suitable names can easily be given to scanned documents.

Fig. 2: An example of a scanned delivery slip