Fully Automatic Video Region Segmentation Fusing Multiple Visual Features

Contacts : Di Zhong and Shih-Fu Chang

In this project, we developed an automatic region segmentation and tracking system for general video sources. It uses feature fusion and motion projection to track salient video regions over long video sequences. Specifically, we proposed an innovative method to combine color, edge and optical flow directly in one tracking process. This approach is robust to image noise and can achieve accurate region boundaries. Experiments show that our system can track salient regions reliably through long video shots.

We define an image region as a contiguous area of pixels with consistent features (e.g., color) in an image frame. It may correspond to part of a physical object, like a car, a person, or a house. A video region is a sequence of instances of the tracked image region in consecutive frames. The region segmentation and tracking process is applied within a video shot to obtain video regions.

More detailed information is given in the following sections:

Objectives

System Overview

Segmentation Results

Software Description (free binary code available)

References

Objectives

 

Segmentation of moving objects using the motion field or optical flow has been the main focus of much research. As motion fields are usually noisy for real-world scenes, segmenting them directly is error-prone and unstable. One main problem with many existing approaches is that segmentation results are sensitive to noise and/or slight variations of features, especially around segmentation boundaries. In region tracking, this problem may cause different segmentations at successive frames. When a video sequence is short, boundary errors usually do not seriously hurt overall tracking performance. However, when a region needs to be tracked over a long period, accumulated boundary errors are likely to break the tracking process completely. To increase the stability of region segmentation, fusing various visual features in the segmentation process is essential. For example, edge-based methods may produce accurate object boundaries but are sensitive to noise. In contrast, color-based region growing methods are robust to noise but usually result in over-segmented regions, and may not generate accurate boundaries (e.g., due to color blur). An efficient method to combine these two features is highly desirable for achieving more consistent segmentations.

Another problem is that the mapping between regions at successive frames is not reliable when these regions are segmented independently. Because similar regions often exist even within small local windows, minor segmentation differences and/or motion estimation errors can cause region mismatches. To address this problem, an inter-frame segmentation process is needed that partitions an intermediate frame consistently with the segmentation results of its preceding frame. This approach avoids the unreliable after-the-fact mapping between successive frames.

In this project, we developed an automatic video region segmentation and tracking method based on the fusion of color, edge, motion and temporal features. This method can track video regions stably over a long period, and is especially useful for building a visual index of a large video collection.

System Overview

The segmentation and tracking of feature regions is based on the fusion of color, edge and optical flow. Color is chosen as the major segmentation feature because of its consistency under varying conditions, such as changes in orientation, shifts of view, partial occlusion or changes of shape. Compared with other features such as edge, shape and motion, colors (or more precisely, mean colors) are more stable. Edge features are complementary to color information: color captures low-frequency information (means) while edge captures high-frequency details (edges) of an image. Fusing them therefore greatly improves segmentation results, especially region boundaries. Unlike earlier merge-and-split methods, where edge information is applied after the color-based region merge, we propose a new method that fuses edge information directly in the color merging process. An affine motion model is estimated for each region based on the computed optical flow, and is used to track color regions through a video shot.

The basic region segmentation and tracking procedure is shown in Figure 1. Projection and segmentation is the major module, in which different features are fused for region segmentation and tracking. This module is described in Figure 2 and will be further discussed in detail below. The optical flow of the current frame n is derived from frames n and n+1 in the motion estimation module using a hierarchical block matching method. Unlike simple block matching methods, which estimate motion solely by minimizing the mean square error, this technique yields reliable and homogeneous displacement vector fields that are close to the true displacements. Given the color regions and the optical flow generated by these two processes, a linear regression algorithm is used to estimate the affine motion of each region. The affine transformation is a good first-order approximation for a distant object undergoing 3D translation and linear deformation.
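
The regression step can be illustrated with a small example. The following is a minimal sketch in Python, not the system's actual code. It assumes the common six-parameter affine form u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y and an ordinary least-squares fit; the parameterization and the names solve3 and fit_affine are assumptions made for illustration.

```python
def solve3(m, b):
    """Solve the 3x3 system m * x = b by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate region: too few or collinear pixels")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / a[r][r]
    return x


def fit_affine(points, flows):
    """Least-squares fit of u = a0 + a1*x + a2*y and v = a3 + a4*x + a5*y.

    points: (x, y) pixel coordinates inside one region (assumed input layout)
    flows:  (u, v) optical flow vectors at those pixels
    """
    # The normal equations for u and v share the same 3x3 matrix.
    ata = [[0.0] * 3 for _ in range(3)]
    atu = [0.0] * 3
    atv = [0.0] * 3
    for (x, y), (u, v) in zip(points, flows):
        row = (1.0, float(x), float(y))
        for i in range(3):
            atu[i] += row[i] * u
            atv[i] += row[i] * v
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    return solve3(ata, atu) + solve3(ata, atv)


# Synthetic check: flow generated from known parameters should be recovered.
true_params = [1.0, 0.1, -0.2, -0.5, 0.05, 0.3]
pts = [(x, y) for x in range(5) for y in range(5)]
flw = [(true_params[0] + true_params[1] * x + true_params[2] * y,
        true_params[3] + true_params[4] * x + true_params[5] * y)
       for x, y in pts]
est = fit_affine(pts, flw)
```

On real optical flow the fit is only approximate, since the flow vectors inside a region are noisy.
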
The affine motion parameters are further refined using a log(D)-step region matching method in the six-dimensional affine space. Through the above modules, color regions with affine motion parameters are generated for frame n. These regions are in turn tracked in the segmentation process of frame n+1.
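
The details of the log(D)-step refinement are not given here. As a rough illustration only, the sketch below shows a generic coordinate-wise search whose step size is halved after each round, so that a search range of D is covered in about log2(D) rounds. The function log_step_refine, its arguments, and the toy cost function are all hypothetical; in a real tracker the cost would be a region matching error between the projected region and the next frame.

```python
def log_step_refine(params, cost, initial_step, rounds):
    """Refine parameters with a coordinate-wise search whose step halves each round.

    params:       initial parameter vector (e.g. six affine parameters)
    cost:         function mapping a parameter vector to a matching error
    initial_step: step size used in the first round
    rounds:       number of halving rounds
    """
    best = list(params)
    best_cost = cost(best)
    step = float(initial_step)
    for _ in range(rounds):
        for i in range(len(best)):
            for delta in (-step, step):
                trial = list(best)
                trial[i] += delta
                c = cost(trial)
                if c < best_cost:  # keep a move only if it lowers the error
                    best, best_cost = trial, c
        step /= 2.0
    return best, best_cost


# Toy stand-in for a region matching error: squared distance to a known target.
target = [0.75, -0.5, 0.25, 0.0, 0.0, 0.0]


def toy_cost(p):
    return sum((a - b) ** 2 for a, b in zip(p, target))


refined, err = log_step_refine([0.0] * 6, toy_cost, initial_step=1.0, rounds=4)
```

A search of this kind only finds a local minimum, which is why it is used here as a refinement of the regression estimate and not as a replacement for it.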

 

Video Regions

 

Figure 1. The diagram of region segmentation and tracking

 

A salient region selection module is applied at the final stage to automatically associate regions with high-level semantic objects. Several criteria are adopted to group or identify the major regions of interest. Size and duration are used to eliminate noisy and unimportant regions: regions with both small size and short duration are eliminated or merged with their neighboring regions. When object motion exists, affine motion models are used to group adjacent regions with similar motion to obtain the moving object.
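
The size and duration criterion can be sketched as a simple filter. This is a minimal illustration in Python; the Region record, the function select_salient, and the threshold values are assumptions chosen for the example, not values used by the system. Merging with neighbors and motion-based grouping are not shown.

```python
from collections import namedtuple

# Hypothetical per-region summary: id, mean area in pixels, tracked length in frames.
Region = namedtuple("Region", ["region_id", "mean_size", "duration"])


def select_salient(regions, min_size=200.0, min_duration=15):
    """Split regions into kept and dropped.

    A region is dropped only if it is BOTH small and short-lived;
    the thresholds are illustrative assumptions.
    """
    kept, dropped = [], []
    for r in regions:
        if r.mean_size < min_size and r.duration < min_duration:
            dropped.append(r)
        else:
            kept.append(r)
    return kept, dropped


regions = [
    Region(1, 5000.0, 120),  # large and long-lived
    Region(2, 40.0, 3),      # small and short-lived: treated as noise
    Region(3, 60.0, 90),     # small but persistent: kept
    Region(4, 900.0, 5),     # large but brief: kept
]
kept, dropped = select_salient(regions)
```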

Figure 2. The motion projection and segmentation module

 

Segmentation Results

Some segmentation results of sports videos are shown in Figure 3. People are the main objects within these videos. These images give a general idea of what regions are automatically extracted. The results show that our algorithm can correctly identify salient regions such as body and face, while ignoring fine details such as eyes. The region boundaries are accurate, which allows us to define shape features.

Figure 3.  More region segmentation results shown in random colors

One undesired characteristic is that one semantic object is usually divided into multiple regions, because a semantic object does not have homogeneous visual features. It is therefore important to store the spatial positions or spatial structures of video regions, so that users can search for an object by specifying a set of regions with certain spatial locations.

Software Description

The system has been developed and tested under Sun Solaris, HP-UX and SGI IRIX systems. It requires the GNU C++ compiler. The binary code of this system is freely available. Please contact the authors for the software.
 
The system requires two input files: a scene cut file specifying the frame numbers of cuts, and a parameter file defining the tracking options.

The system accepts either MPEG (MPEG-1 or MPEG-2) video or raw frames as input. For each frame, the system generates two output files:
- SEG file: segmented regions drawn in PPM format with mean colors
- OIF file: segmented regions with their basic features
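
To make the SEG output more concrete, the sketch below renders a region label map as a binary PPM (P6) image in which every pixel takes the mean color of its region. It is only an illustration of the idea: the function render_mean_color_ppm and its input layout are assumptions, and the actual SEG and OIF file layouts produced by the system are not specified here.

```python
def render_mean_color_ppm(labels, image):
    """Return binary PPM (P6) bytes where each pixel shows its region's mean color.

    labels: 2D list of region ids (assumed input layout)
    image:  2D list of (r, g, b) tuples with values 0..255, same shape as labels
    """
    sums, counts = {}, {}
    for lab_row, img_row in zip(labels, image):
        for lab, (r, g, b) in zip(lab_row, img_row):
            s = sums.setdefault(lab, [0, 0, 0])
            s[0] += r
            s[1] += g
            s[2] += b
            counts[lab] = counts.get(lab, 0) + 1
    # Mean color per region, rounded to integer channel values.
    means = {lab: tuple(round(c / counts[lab]) for c in s)
             for lab, s in sums.items()}
    height, width = len(labels), len(labels[0])
    header = "P6\n{} {}\n255\n".format(width, height).encode("ascii")
    body = bytearray()
    for lab_row in labels:
        for lab in lab_row:
            body.extend(means[lab])
    return header + bytes(body)


# Tiny 2x2 example with two regions.
labels = [[0, 0],
          [1, 1]]
image = [[(10, 20, 30), (30, 40, 50)],
         [(200, 0, 0), (100, 0, 0)]]
ppm = render_mean_color_ppm(labels, image)
```

The returned bytes can be written to disk with open(path, "wb") and viewed in any PPM-capable image viewer.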

 

References

  1. "Description of MPEG-4", ISO/IEC JTC1/SC29/WG11, MPEG document N1410, Oct. 1996.

  2. D. Zhong and S.-F. Chang, "Video Object Model and Segmentation for Content-Based Video Indexing", ISCAS'97, Hong Kong, June 9-12, 1997.

  3. D. Zhong and S.-F. Chang, "Spatio-Temporal Video Search Using the Object Based Video Representation", ICIP'97, Santa Barbara, CA, October 26-29, 1997.