Content Based Video Indexing Techniques
Di Zhong and Shih-Fu Chang
Structured organization of video data is the basis for building large video archives that support efficient retrieval, browsing, and manipulation [1,2]. Conventionally, video data have to be manually annotated with keywords. This method is time-consuming and cannot provide a complete description of video content. In this work, we show how automatic video analysis techniques, such as scene cut detection, key frame selection, visual/audio feature extraction, object recognition, and text and speech identification, can be used in the content-based video indexing process.
1. Content representation of video data
Generally, content-based structural organization of sequential video streams includes three steps: segmentation, content extraction, and indexing. First, sequential video streams have to be broken into elemental units (i.e. video shots). Scene cut detection algorithms [3] have been successful at detecting such temporal boundaries, which indicate possible content changes. Here we focus mainly on how to extract content from segmented video shots and then index it effectively so that users can retrieve and browse large video collections.
From the user's point of view, there are mainly two kinds of demands: visual query and concept query. In a visual query, users want to find video shots that look similar to a given example. In a concept query, users want to find video shots by the presence of specific objects or events. Visual query can be realized by directly comparing low-level visual features such as color, texture, shape, and temporal variance of video shots or their representative frames (i.e. key frames) [4]. Concept query, on the other hand, depends on object detection, tracking, and recognition. Since fully automatic object extraction is not yet feasible, some degree of user interaction is necessary in this process. But as we will discuss later, manual indexing labor can be greatly reduced with the help of video analysis techniques.
Figure 1 shows a conceptual architecture for content-based video indexing and search [2].

Figure 1: Parsing, representation and organization of video content
Various video and audio analysis and understanding techniques are needed in the above model to extract useful features, objects, and concepts. Mapping from low-level features to high-level concepts is achieved by using input from iterated user interaction together with effective learning algorithms. Multimedia features, including image, text, and speech features, are used to characterize the video content of interest.
2. Development of a Content Based Video Indexing System
Based on the model discussed above, we are developing a content-based video indexing system. The system provides automatic or semi-automatic facilities to help users browse, search, and annotate compressed videos (MPEG-2). Currently, organization of video data is based on visual features computed from each shot. A generalized hierarchical clustering method is applied to index these feature data. It supports efficient visual retrieval, hierarchical browsing, and annotation [2]. Concept-based querying and browsing of video shots are currently realized through annotated keywords.
Visual features include a color histogram (64 bins), color moments, spatial color moments, the MRSAR texture model, Tamura texture features, and statistical motion features. The three color features are computed in the L*u*v* space from the key frames of each video shot. They provide global visual similarity matching between shots. The two kinds of texture features are computed from nine local regions (a 3x3 even division) in each key frame. They can be used to search for key frames with similar texture patterns at different locations. Statistical motion features are designed to characterize the general motion distribution (speeds and directions). They are derived from optical flow computed on the image sequence around the key frame of a video shot.
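As an illustration of the color features described above, the following is a minimal sketch rather than the system's actual implementation. It assumes that pixels are already converted to a three-channel color space with each channel scaled to 0..255, that the 64 bins come from quantizing each channel to 4 levels, and that the color moments are the per-channel mean and standard deviation. The function names are hypothetical, and the texture and motion features are omitted.

```python
from statistics import fmean


def color_histogram_64(pixels):
    """Normalized 64-bin histogram: each of 3 channels is quantized to 4 levels.

    Assumption: channel values are integers in 0..255.
    """
    hist = [0.0] * 64
    for c1, c2, c3 in pixels:
        # 256 values / 4 levels = 64 values per level, giving 4*4*4 = 64 bins.
        idx = (c1 // 64) * 16 + (c2 // 64) * 4 + (c3 // 64)
        hist[idx] += 1.0
    total = len(pixels)
    return [h / total for h in hist]


def color_moments(pixels):
    """Per-channel mean and standard deviation, returned as a flat list."""
    moments = []
    for ch in range(3):
        values = [p[ch] for p in pixels]
        mean = fmean(values)
        var = fmean([(v - mean) ** 2 for v in values])
        moments.extend([mean, var ** 0.5])
    return moments


def split_3x3(frame):
    """Split a frame (a list of pixel rows) into nine evenly divided regions."""
    h, w = len(frame), len(frame[0])
    regions = []
    for i in range(3):
        rows = frame[i * h // 3:(i + 1) * h // 3]
        for j in range(3):
            cols = slice(j * w // 3, (j + 1) * w // 3)
            regions.append([p for row in rows for p in row[cols]])
    return regions
```

Any per-region descriptor can then be computed by applying a feature function to each of the nine lists returned by `split_3x3`.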
A generalized hierarchical clustering process, which applies partition clustering recursively at each level of the hierarchy, is used to build the index for the different feature data. This method is efficient for indexing high-dimensional feature vectors [5] and generates a flexible structure: different features, similarity metrics, and clustering methods can be applied at different levels of the index. Several partition clustering algorithms are used in our system, including K-Means, ISODATA, and the Self-Organizing Map. The first two are typical iterative clustering algorithms which, starting from a given or arbitrary initial partition, iteratively adjust the classification of data items according to an optimization criterion such as mean squared error (MSE). The last one is an unsupervised learning neural network. In our system it is more robust and can generate a better classification of a large number of feature points.
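The recursive use of partition clustering can be sketched as follows. This is an illustrative sketch under stated assumptions, not the system's code: it uses plain K-Means with squared Euclidean distance at every level, a deterministic initial partition, and hypothetical parameters `k` and `leaf_size`. The names `build_index` and `search` are introduced here for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def kmeans(points, k, iters=20):
    """Plain K-Means; returns the non-empty clusters as lists of points."""
    # Deterministic initial partition: k points spread evenly over the input.
    step = len(points) // k
    centers = [tuple(points[i * step]) for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [centroid(m) if m else centers[c]
                       for c, m in enumerate(clusters)]
        if new_centers == centers:  # assignments are stable
            break
        centers = new_centers
    return [c for c in clusters if c]


def build_index(points, k=2, leaf_size=2):
    """Recursively partition the points into a tree of clusters."""
    node = {"center": centroid(points), "items": [], "children": []}
    clusters = kmeans(points, k) if len(points) > max(leaf_size, k - 1) else []
    if len(clusters) < 2:
        # Too few points, or the data could not be split further: make a leaf.
        node["items"] = list(points)
    else:
        node["children"] = [build_index(c, k, leaf_size) for c in clusters]
    return node


def search(node, query):
    """Descend to the leaf whose cluster center is closest to the query."""
    while node["children"]:
        node = min(node["children"], key=lambda ch: sq_dist(query, ch["center"]))
    return node["items"]
```

For example, `search(build_index(features, k=2, leaf_size=3), query)` returns a small candidate set without comparing the query against every stored vector. In a fuller version, a different feature or clustering method could be plugged in at each level, which is the flexibility the text describes.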
Several news videos from CNN have been analyzed and indexed in our system. With the support of feature indexing, users can easily find similar shots using various visual features. Alternatively, if users do not know what to retrieve, class-based hierarchical browsing of video shots can give them a quick overview of the video content. Furthermore, users can annotate any group of video shots or any individual shot. These annotations can be used later for keyword query or browsing. An example is given in Figure 2, which shows the hierarchical story structure of a CNN news video. Given appropriate levels of clustering and an appropriate selection of visual features, the system automatically groups the anchorperson shots and transient shots in the same category. With the anchor or transient shots as boundaries, different news stories can be automatically identified. This automated procedure provides an effective way for users to assess the rough content (news stories in this case) of a long video sequence. A key frame of the first shot in each story is selected as the story's representative icon. Users can then click on any icon to view real-time video playback with interactive control.

Figure 2: Hierarchical browsing of a news video based on efficient feature clustering
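The story identification step described above can be sketched as follows. This is a minimal sketch that assumes each shot already carries a category label from the clustering stage and that each run of boundary shots (anchor or transient) opens a new story. The exact boundary rule is an assumption, and the label strings and the function name `segment_stories` are hypothetical.

```python
def segment_stories(shot_labels, boundary_labels=("anchor", "transient")):
    """Group shot indices into stories.

    A new story starts at the first shot of every run of boundary shots.
    Shots before the first boundary shot form a story of their own.
    """
    stories = []
    prev_is_boundary = False
    for index, label in enumerate(shot_labels):
        is_boundary = label in boundary_labels
        starts_story = is_boundary and not prev_is_boundary
        if starts_story or not stories:
            stories.append([])
        stories[-1].append(index)
        prev_is_boundary = is_boundary
    return stories


def story_icons(stories):
    """Index of the first shot of each story, whose key frame serves as the icon."""
    return [story[0] for story in stories]
```

Given labels in temporal order, `story_icons(segment_stories(labels))` yields the shots whose key frames would be shown at the story level of the browsing hierarchy.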
3. Further development of the system
We have identified several challenging issues which will be addressed in our current enhancement plan.
Although successful visual features have been developed for visual query of still images, comparing the visual similarity of video shots is still an open problem. Key frames can be used as a representative sample of a video shot. But when there is motion or large variation in the sequence, key frames may not capture the complete visual information. Extracting visual features, such as color, from the complete video shot is expected to give a better and more complete visual description of the shot. Similarly, the statistical motion features mentioned above are short-term features. They can be extended to capture long-term motion information inside a video shot.
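One simple way to move from a key-frame feature to a whole-shot feature is sketched below. It is only an illustration of the idea and not a method proposed in this work: it assumes a normalized color histogram is available for every frame, averages the histograms over the shot, and measures how far individual frames deviate from that average. The function names are hypothetical.

```python
def shot_histogram(frame_histograms):
    """Average the per-frame histograms of one shot, bin by bin."""
    n = len(frame_histograms)
    bins = len(frame_histograms[0])
    return [sum(h[b] for h in frame_histograms) / n for b in range(bins)]


def histogram_variation(frame_histograms):
    """Mean L1 distance between each frame's histogram and the shot average.

    A large value suggests that a single key frame is a poor summary of the shot.
    """
    avg = shot_histogram(frame_histograms)
    distances = [sum(abs(x - a) for x, a in zip(h, avg)) for h in frame_histograms]
    return sum(distances) / len(distances)
```

A shot with near-zero variation is well represented by any of its frames, while a shot with high variation is a candidate for multiple key frames or for whole-shot features.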
Object detection and recognition have been studied in computer vision for a long time. In the context of video indexing and retrieval, new methods should be developed to satisfy new requirements. For example, we may want to estimate the likelihood that specific objects exist in a video shot. To this end, we will adopt the following procedure. First, homogeneous regions are detected and tracked in all the video shots. Meanwhile, visual features of these regions are extracted and stored. These feature data are then used in content-based visual query and indexing to compute the likelihood that specific objects exist. The selection of suitable features and their associated weights is also a non-trivial issue. Currently, we are developing methods for identifying appropriate feature sets in various specific application domains.
The definition of video objects (or concepts) has to come from domain knowledge or user input. In a video database, this can be accomplished with the help of automatic region segmentation, tracking, and matching. As an example, users first specify some typical regions as examples of a primitive concept. The system can then automatically annotate all similar regions as possible instances of the concept, based on visual similarity. Users can further specify composed concepts, which may consist of other concepts, and then ask the system to automatically propagate these concepts to the whole database. In this way, a concept structure is set up with visual features at the bottom level. This provides strong support for concept query.
Text, audio, and speech will be integrated into the system. They can be used for concept query by themselves or combined with the above visual features, and they are also helpful in the process of concept definition and propagation, e.g. for indicating or verifying the existence of specific objects. State-of-the-art techniques will be examined and used for the detection and recognition of text and speech in digital videos.
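The propagation of primitive and composed concepts might be sketched as follows. This is an illustrative sketch only. It assumes that each region is stored with the shot it belongs to and a feature vector, that visual similarity is a Euclidean distance below a user-chosen threshold, and that a composed concept is present in a shot when all of its constituent concepts are present. All names, including `annotate_primitive` and `annotate_composed`, are hypothetical.

```python
import math


def annotate_primitive(shot_concepts, concept, examples, regions, threshold):
    """Label every shot that contains a region similar to any example region.

    regions maps region_id -> (shot_id, feature_vector).
    shot_concepts maps shot_id -> set of concept names and is updated in place.
    """
    for shot_id, features in regions.values():
        if any(math.dist(features, ex) <= threshold for ex in examples):
            shot_concepts.setdefault(shot_id, set()).add(concept)
    return shot_concepts


def annotate_composed(shot_concepts, concept, parts):
    """Label every shot that already carries all constituent concepts."""
    for concepts in shot_concepts.values():
        if set(parts) <= concepts:
            concepts.add(concept)
    return shot_concepts
```

Because composed concepts are defined over concept names rather than raw features, they can be stacked to any depth while the visual features remain at the bottom level of the structure.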
References
[1] J. Meng and S.-F. Chang, ``Tools for Compressed-Domain Video Indexing and Editing'', SPIE Conference on Storage and Retrieval for Image and Video Database, San Jose, Feb. 1996.
[2] D. Zhong, H. J. Zhang and S.-F. Chang, ``Clustering methods for video browsing and annotation'', Storage and Retrieval for Still Image and Video Databases IV, IS&T/SPIE's Electronic Imaging: Science & Technology 96 [2670-38].
[3] J. Meng, Y. Juan and S.-F. Chang, ``Scene Change Detection in a MPEG Compressed Video Sequence'', IS&T/SPIE Conf. on Digital Video Compression: Algorithms and Technologies, San Jose, Feb. 1995.
[4] J. R. Smith and S.-F. Chang, ``Local Color and Texture Extraction and Spatial Query'', IEEE International Conference on Image Processing, Switzerland, Sept. 1996.
[5] H. J. Zhang and D. Zhong, ``A scheme for visual feature based image indexing'', Proc. IS&T/SPIE Conf. on Storage and Retrieval for Image and Video Databases III, San Jose, Feb. 1995, pp. 36-46.