Automatic Document Genre Identification For Faceted Document Browsing And Searching

Francine Chen FX Palo Alto Laboratory, Inc.
Andreas Girgensohn FX Palo Alto Laboratory, Inc.
Lynn Wilcox FX Palo Alto Laboratory, Inc.

Browsing and searching documents in large enterprise document repositories is an increasingly common problem. While users are usually satisfied with Internet search results, enterprise search has been less successful because of differences in data types and user requirements. To help users find desired information in electronic and scanned documents, we created an automatic detector for genres such as papers, slides, tables, and photos, based on features of the document images. The automatically identified genres play an important role in our faceted document browsing and search system. The system presents documents in a hierarchy, as is typical of enterprise document collections. It filters documents and directories to show only those documents that match the selected facets and contain any optional query terms, and it highlights promising directories. Thumbnail images and automatically identified keyphrases help users select the desired documents.

