Technology to Determine the Positions of People and Detect a Speaker
Research has been conducted on interface technology that controls various devices by sensing movements of the human body. Making such technology practical requires monitoring people's movements with cameras and sensors. Fuji Xerox and FX Palo Alto Laboratory (located in California's Silicon Valley) have developed a technology that uses depth image sensors to determine the positions of people moving freely around a room and to detect which of them is speaking.
The method of determining the positions of people is described below using the example of a meeting room.
Three depth image sensors are installed approximately two meters above the floor in a meeting room (D0 - D2 in Fig. 1). The position and angle of each sensor are measured and calibrated in advance. Each sensor provides 3D point cloud data in a coordinate system whose origin is the sensor itself. Using the calibration, the data from the sensors is converted into a single depth image with the room floor as the x-y plane and height as z, as shown in Fig. 2 (a). Each pixel value indicates height (z): the higher the value, the whiter the pixel appears in the image. When blob tracking, an image processing technique that continuously tracks a given object, is applied to this depth image, areas that are whiter than their surroundings are detected as people, which makes it possible to determine the x-y coordinates of the people in the room.
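The conversion and detection steps can be illustrated with a minimal sketch. It is not the developers' implementation: the function names, the grid cell size, and the 1.0 m height threshold are all hypothetical, and a simple threshold plus connected-region search stands in for the blob tracking technology. The calibration of each sensor is assumed to be available as a 3x3 rotation matrix and a translation.

```python
from collections import deque

def to_room(point, rotation, translation):
    """Map a sensor-frame point into room coordinates (assumed calibration:
    a 3x3 rotation matrix and the sensor position in the room)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def height_map(points, width_m, depth_m, cell_m):
    """Rasterize room-frame points into a grid holding the maximum height
    (z) seen in each floor cell; empty cells stay at 0.0."""
    cols = int(round(width_m / cell_m))
    rows = int(round(depth_m / cell_m))
    grid = [[0.0] * cols for _ in range(rows)]
    for x, y, z in points:
        c, r = int(x // cell_m), int(y // cell_m)
        if 0 <= r < rows and 0 <= c < cols:
            grid[r][c] = max(grid[r][c], z)
    return grid

def detect_people(grid, cell_m, min_height_m=1.0, min_cells=2):
    """Return (x, y, height) for each connected region of tall cells.
    The thresholds are illustrative assumptions, not values from the article."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    people = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or grid[r0][c0] < min_height_m:
                continue
            # Breadth-first flood fill over 4-connected tall cells.
            queue, cells = deque([(r0, c0)]), []
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                cells.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc]
                            and grid[nr][nc] >= min_height_m):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(cells) >= min_cells:
                # Blob centre in metres, plus the tallest cell in the blob.
                x = sum(c + 0.5 for _, c in cells) / len(cells) * cell_m
                y = sum(r + 0.5 for r, _ in cells) / len(cells) * cell_m
                people.append((x, y, max(grid[r][c] for r, c in cells)))
    return people
```

In use, the points from every sensor would first be passed through to_room with that sensor's calibration, merged into one list, and then rasterized once, so that all three sensors contribute to the same image.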
The red circles in Fig. 2 (a) indicate the people detected. The six people positioned as shown in Fig. 2 (b) are all detected. A person's height can also be estimated from the maximum pixel value in the extracted area. A tracking ID is assigned to each detected person and remains valid as long as the person is being tracked. If the person is lost and then detected again, however, a new ID is assigned to that person.
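The ID behavior described above can be sketched as follows. The article does not say how detections are matched between frames, so the greedy nearest-neighbor matching, the Tracker class, and the max_jump_m parameter are assumptions made purely for illustration.

```python
import math
from itertools import count

class Tracker:
    """Assigns IDs to detected (x, y) positions frame by frame.
    Hypothetical sketch: a detection keeps an ID only if it lies within
    max_jump_m of where that ID was seen in the previous frame."""

    def __init__(self, max_jump_m=0.5):
        self.max_jump_m = max_jump_m
        self.tracks = {}          # tracking ID -> last known (x, y)
        self._ids = count(1)      # source of fresh IDs

    def update(self, detections):
        unmatched = dict(self.tracks)
        new_tracks = {}
        for pos in detections:
            # Closest previous track that has not been claimed yet.
            best = min(unmatched,
                       key=lambda i: math.dist(unmatched[i], pos),
                       default=None)
            if (best is not None
                    and math.dist(unmatched[best], pos) <= self.max_jump_m):
                new_tracks[best] = pos
                del unmatched[best]
            else:
                new_tracks[next(self._ids)] = pos
        # Tracks left in `unmatched` are lost; their IDs are never reused.
        self.tracks = new_tracks
        return new_tracks
```

Because a lost track is dropped immediately in this sketch, a person who disappears for even one frame comes back under a new ID, which mirrors the behavior described in the text.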
The detection of a speaker is explained next. The depth image sensors used for detecting the positions of people are also equipped with a feature that detects the direction of a sound source. This direction can be expressed as a vector on the x-y plane, as shown in Fig. 1. For each person, a score is calculated from the distances between that person and the direction vector of each sensor. The closer the person is to the vectors, the higher the score, so the person with the highest score is identified as the speaker.
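The article does not give the scoring formula, so the following is only one possible sketch. It assumes each sensor's sound direction is treated as a ray starting at the sensor position, and it uses 1 / (1 + distance) summed over sensors as a hypothetical score that grows as a person gets closer to the rays. All function names are illustrative.

```python
import math

def distance_to_ray(point, origin, direction):
    """Shortest distance on the x-y plane from `point` to the ray that
    starts at `origin` and points along `direction`."""
    px, py = point[0] - origin[0], point[1] - origin[1]
    norm = math.hypot(direction[0], direction[1])
    dx, dy = direction[0] / norm, direction[1] / norm
    along = px * dx + py * dy
    if along <= 0:
        # The point is not in front of the sensor: use distance to the origin.
        return math.hypot(px, py)
    return abs(px * dy - py * dx)  # perpendicular distance to the ray

def speaker_scores(people, rays):
    """people: {tracking ID: (x, y)}; rays: [(sensor (x, y), direction (dx, dy))].
    Assumed score: sum over sensors of 1 / (1 + distance to that sensor's ray)."""
    return {
        pid: sum(1.0 / (1.0 + distance_to_ray(pos, origin, direction))
                 for origin, direction in rays)
        for pid, pos in people.items()
    }

def detect_speaker(people, rays):
    """Return the tracking ID with the highest score, or None if nobody is tracked."""
    scores = speaker_scores(people, rays)
    return max(scores, key=scores.get) if scores else None
```

With two sensors whose rays cross at a person's position, that person receives the maximum possible contribution from both sensors and is chosen over anyone standing off to the side. A real system would likely also need to handle silence and sensors that report no direction, which this sketch ignores.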