Video Indexing Technology Using Text, Images, and Speech

Recording video is now an everyday activity, as devices such as video cameras, smartphones, and digital cameras have become common. In addition, thanks to the wide range of services and infrastructure available for handling video, we can easily find and watch videos online. However, videos without metadata such as date, creator, keywords, or description cannot be found using typical search engines. Metadata is usually added manually, which is very time-consuming. Furthermore, even when a video can be found by its metadata, search engines normally cannot find a specific scene of interest within it.

To make searching video easier, FX Palo Alto Laboratory, Inc., located in Silicon Valley, has developed a video indexing technology that is able to find specific scenes in videos by the text (character strings) appearing in them. With this technology, by simply entering a search term, users can search for lecture videos and scenes in which that term appears. This technology also lets users search for videos with no manually added metadata by automatically extracting and indexing the text from slides detected in a video.

Fig. 1: Video indexing analysis flowchart

Fig. 1 shows the flow of analyzing a lecture video using the video indexing technology. First, segments in which the picture remains stationary for a certain amount of time are identified as slides and extracted from the video. Then, the text (character strings) contained in the slide images is extracted using OCR. The extracted text and the slide images are then linked to the corresponding scenes in the original video and stored in a database.
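The slide detection step can be illustrated with a minimal sketch. It is not the actual FX Palo Alto Laboratory implementation, whose details are not given here. It assumes that each frame is a flattened list of grayscale pixel values and that a segment counts as stationary when consecutive frames differ by less than a threshold. The function names and the parameters diff_threshold and min_frames are hypothetical, chosen only for illustration. The OCR step is left out.

```python
from typing import List, Sequence, Tuple

# Assumption: a frame is a flattened sequence of grayscale pixel values.
Frame = Sequence[int]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equal-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_stationary_segments(
    frames: List[Frame],
    diff_threshold: float = 2.0,  # hypothetical: largest change still treated as "no change"
    min_frames: int = 3,          # hypothetical: shortest run accepted as a slide
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame index pairs for runs of nearly identical frames."""
    segments: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(frames) + 1):
        # Close the current run at the end of the video or when the picture changes.
        if i == len(frames) or mean_abs_diff(frames[i - 1], frames[i]) > diff_threshold:
            if i - start >= min_frames:
                segments.append((start, i - 1))
            start = i
    return segments
```

Each returned pair marks a candidate slide. In a full pipeline, one representative frame from each segment would be passed to an OCR engine, and the recognized text would be stored together with the segment's position in the video.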

Users can then access the database via a web interface (browser) and use text search to find specific slides in the indexed lecture videos.
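A text search over the indexed slides could look like the following sketch. It assumes a simple in-memory inverted index rather than the database and web interface described above, whose design is not specified here. SlideRecord, build_index, and search are hypothetical names used only for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class SlideRecord(NamedTuple):
    video_id: str         # hypothetical identifier of the lecture video
    start_seconds: float  # where the slide first appears in the video
    text: str             # text recognized on the slide


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records: List[SlideRecord]) -> Dict[str, Set[int]]:
    """Map each term to the positions of the records whose text contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for pos, record in enumerate(records):
        for term in tokenize(record.text):
            index[term].add(pos)
    return index


def search(query: str, records: List[SlideRecord], index: Dict[str, Set[int]]) -> List[SlideRecord]:
    """Return the slides that contain every term of the query, in indexing order."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return [records[i] for i in sorted(hits)]
```

Because every result carries a video identifier and a start time, a front end could jump straight to the scene in which the matching slide appears.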

Furthermore, when combined with the automatic image annotation technology that Fuji Xerox has been researching, this technology can automatically index images extracted from a video. We are also advancing research on video indexing that uses the audio of the lecturer's speech.