Sub-tree Mining Technology for Extracting Useful Information in Huge Document Sets

Kazuto Hayashi, System Technology Laboratory, Research & Technology Group
Takeshi Yoshioka, Key Technology Laboratory, Research & Technology Group

The volume of digital documents used by companies has increased dramatically in recent years and now includes not only office documents such as reports and business forms but also web and XML documents. Documents are generally stored in tree structures within a document management system, and document contents, such as XML tag sets or the logical structure of a report, are also frequently represented as trees. We are therefore developing embedded subtree mining technologies to extract useful patterns from document sets. This paper introduces two types of embedded subtree mining technology that can extract ancestor-descendant relationships. In an experiment using a test dataset, our technologies outperformed conventional ones in both processing speed and memory consumption. In an experiment using real document data such as usage logs, we were able to extract useful patterns. We now intend to develop mining applications that deliver valuable document services.
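To illustrate what extracting ancestor-descendant relationships means, the following minimal Python sketch contrasts patterns that allow intermediate nodes to be skipped (embedded) with patterns that require direct parent-child links (induced). It is not the algorithm described in this paper: it is a brute-force illustration limited to two-node patterns, and the tree representation, the function names, and the sample documents are all assumptions made for this example.

```python
from collections import Counter

# Assumed representation: a tree is a (label, [children]) tuple.

def ancestor_descendant_pairs(tree):
    """Return the set of (ancestor, descendant) label pairs in one tree."""
    pairs = set()

    def walk(node, ancestors):
        label, children = node
        for a in ancestors:
            pairs.add((a, label))  # every ancestor, not only the parent
        for child in children:
            walk(child, ancestors + [label])

    walk(tree, [])
    return pairs

def parent_child_pairs(tree):
    """Return the set of (parent, child) label pairs in one tree."""
    pairs = set()

    def walk(node):
        label, children = node
        for child in children:
            pairs.add((label, child[0]))
            walk(child)

    walk(tree)
    return pairs

def frequent_pairs(trees, extract, min_support):
    """Keep pairs that occur in at least min_support trees."""
    counts = Counter()
    for t in trees:
        counts.update(extract(t))  # a pair is counted once per tree
    return {p for p, c in counts.items() if c >= min_support}

# Hypothetical documents: the same 'figure' element under different parents.
docs = [
    ("report", [("section", [("figure", [])])]),
    ("report", [("appendix", [("figure", [])])]),
]

embedded = frequent_pairs(docs, ancestor_descendant_pairs, min_support=2)
induced = frequent_pairs(docs, parent_child_pairs, min_support=2)
print(embedded)  # {('report', 'figure')}
print(induced)   # set()
```

In this toy case the relationship between report and figure is shared by both documents only when intermediate nodes may be skipped, which is the kind of pattern that matching on direct parent-child links alone would miss.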
