
Moving Object Segmentation and Tracking Using Spatio-Temporal Consistency


The success of object-based media representation and description (e.g., MPEG-4 and MPEG-7) depends largely on effective object segmentation tools. In this project, we expand our previous work on automatic video region tracking and develop a robust moving-object detection system. Our system first combines color and edge information to improve object motion estimation. It then uses long-term spatio-temporal constraints to achieve reliable object tracking over long sequences. Our extensive experiments demonstrate excellent results on challenging cases in general domains (e.g., stock footage), including depth-varying multi-layer backgrounds and fast camera motion.

Contacts: Di Zhong and Shih-Fu Chang


Figure panels: region segmentation, background layer, temporal consistency

More detailed information is given in the following sections:

Objectives

Example Videos and Results

System Overview

Iterative Motion Layer Detection

Object Extraction using Spatio-Temporal Consistency

References


Objectives

Although much work has been done on decomposing images into regions with uniform features, we still lack robust techniques for segmenting semantic video objects in general video sources. In our previous work, we developed a general interactive tool for semantic object segmentation. It can be used in offline applications where object-based compression and indexing are needed. When real-time processing is required, however, user input is usually infeasible or very limited. For example, if we want to parse and summarize video objects and events in broadcast sports or news programs in real time, automatic object extraction methods are needed.

In this project, we apply and expand our previous work on automatic video region tracking and develop an automatic moving object tracking system that groups low-level regions using domain models. Specifically, we look at the motion characteristics of objects and extract salient moving objects from complex scenes. Our main objectives are a system that extracts salient moving objects, is fully automatic, and can handle practical situations involving complex scenes. This combination of features distinguishes our system from existing work.


Example Videos and Results

Some example videos are shown below. They all have depth variance and camera motion (i.e. following the moving objects) in the scenes, resulting in multiple motion layers. 

The first sequence contains a skater moving towards the camera. The ice field has a gradual depth change from near to far.

                 Video 1       Object Tracking Result

In the second sequence, a person is walking away from the camera in an office. Cubicle walls exist at different depths.

                    Video 2       Object Tracking Result

The third sequence is a bird's-eye view of a soccer player running on the field.

                 Video 3       Object Tracking Result

Sequence 4 contains three background layers, which are the ground, wall and crowd. 

                 Video 4       Object Tracking Result

The last sequence contains the sky, the stage and the jumping skier.

                Video 5       Object Tracking Result

The gradual depth change in sequence 1 does not cause much of a problem, as the ground is merged into one large region in the first, color-based region segmentation stage. In sequence 2, the cubicle walls are tracked as separate regions. Although these regions are classified as foreground motion layers in some frames, their temporal durations are short, so they are treated as background. In the third sequence, both the player and the grass field have gradual depth variations. As in the first sequence, color segmentation proves useful in handling such situations. These three sequences show good tracking results. Some small background regions are falsely included in sequence 4. These regions mainly come from the connecting parts of two background regions and usually have inaccurate motion fields. Some foreground pixels are missed in sequence 5 because small isolated regions are removed in the final morphological operations.

Our experiments demonstrate that the long-term, region-based moving object detection approach is more robust and reliable than existing approaches that use only local motion information (e.g., frame-to-frame motion fields). The method is designed to automatically detect and track salient moving objects in scenes with multiple motion layers. By using temporal constraints, we can robustly and accurately segment moving objects over a long period. Our method can also handle objects with discontinuous motion (i.e., moving in some frames and still in others).


System Overview

Except for some special cases (e.g., surveillance videos), common TV programs and home videos usually contain camera motion. In these situations, to detect moving objects, we first need to compensate for the motion caused by camera operations. When the scene is far from the camera and/or the camera motion includes only rotation and zoom, a single affine motion model can be used to model and compensate for the camera-induced motion. However, when the scene is close to the camera and the camera is translating, multiple moving planar surfaces may be produced in the image sequence. To address this problem, we expand our existing region segmentation and tracking algorithms and develop a two-stage moving-object detection method. This method uses regions with accurate boundaries to improve motion estimation, and uses temporal constraints to achieve more reliable object tracking over long sequences.

In the first stage, we apply an iterative motion layer detection process based on the estimation and merging of affine motion models. Each iteration generates one motion layer. The difference from existing methods is that motion models are estimated from spatially segmented color regions instead of just pixels or blocks.
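As a rough illustration of estimating a motion model from a segmented region rather than from a fixed block, the sketch below fits a 6-parameter affine model to the motion vectors of one region by least squares. The function names, the input layout (pixel coordinates paired with motion vectors) and the choice of a plain least-squares fit are assumptions made for this example; the project's actual estimator is not specified here.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate region: too few or collinear points")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def fit_affine(points, flows):
    """Least-squares fit of u = a0 + a1*x + a2*y and v = b0 + b1*x + b2*y.

    points: list of (x, y) pixel coordinates inside one region (hypothetical layout)
    flows:  list of (u, v) motion vectors observed at those pixels
    Returns (a0, a1, a2, b0, b1, b2).
    """
    # Accumulate the normal equations (X^T X) p = X^T t with rows [1, x, y].
    XtX = [[0.0] * 3 for _ in range(3)]
    Xtu = [0.0] * 3
    Xtv = [0.0] * 3
    for (x, y), (u, v) in zip(points, flows):
        row = (1.0, x, y)
        for i in range(3):
            Xtu[i] += row[i] * u
            Xtv[i] += row[i] * v
            for j in range(3):
                XtX[i][j] += row[i] * row[j]
    return tuple(solve3(XtX, Xtu)) + tuple(solve3(XtX, Xtv))
```

Because every pixel of a region contributes to the fit, a region with an accurate boundary does not mix motion vectors from two different surfaces, which is the advantage claimed over pixel- or block-based estimation.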

In the second stage, temporal constraints are applied to detect moving objects in spatial and temporal space. Layers in individual frames are linked together based on characteristics of their underlying regions. One or more layers will be declared as motion objects according to specific spatio-temporal consistency rules.


Iterative Motion Layer Detection

The iterative motion layer detection process is shown in the following figure.

a. region segmentation and tracking

As described in references (1, 2), we developed an accurate region segmentation method that fuses color and edge features. Our method uses motion-based projection to track regions over successive frames. Reliable tracking of salient regions over long periods has been demonstrated in our experiments (1, 2).

b. motion based region merge

At each frame, non-background regions are merged into motion layers according to their affine motion models, e.g., the 8-parameter ego-motion model. We use the following distance measure to compare two neighboring regions Ri and Rj.

D(i,j) = min( MCErr(Ri,Mj), MCErr(Rj,Mi) )

where Mi and Mj are the affine motion models of regions Ri and Rj, respectively. MCErr(R,M) is the motion compensation error of region R under motion model M. A region Ri is merged with its closest neighbor if their distance is below a given threshold.
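The distance D(i,j) can be sketched as follows. This example assumes a 6-parameter affine model and measures the motion compensation error as the mean squared difference between a region's observed motion vectors and the vectors predicted by the model; the original system may define MCErr differently (for instance on pixel intensities), so treat `mc_err` and the data layout as illustrative.

```python
def apply_affine(model, x, y):
    """Predict the motion vector at (x, y) from a 6-parameter affine model."""
    a0, a1, a2, b0, b1, b2 = model
    return a0 + a1 * x + a2 * y, b0 + b1 * x + b2 * y


def mc_err(region, model):
    """Assumed compensation error: mean squared flow residual over the region.

    region: non-empty list of ((x, y), (u, v)) samples (hypothetical layout).
    """
    total = 0.0
    for (x, y), (u, v) in region:
        pu, pv = apply_affine(model, x, y)
        total += (u - pu) ** 2 + (v - pv) ** 2
    return total / len(region)


def merge_distance(region_i, model_i, region_j, model_j):
    """D(i,j) = min(MCErr(Ri, Mj), MCErr(Rj, Mi))."""
    return min(mc_err(region_i, model_j), mc_err(region_j, model_i))


# Two regions translating identically, and one moving faster.
slow = (1.0, 0.0, 0.0, 0.0, 0.0, 0.0)
fast = (3.0, 0.0, 0.0, 0.0, 0.0, 0.0)
region_a = [((0, 0), (1.0, 0.0)), ((1, 0), (1.0, 0.0))]
region_b = [((5, 5), (1.0, 0.0))]
region_c = [((9, 9), (3.0, 0.0))]
same_motion = merge_distance(region_a, slow, region_b, slow)
different_motion = merge_distance(region_a, slow, region_c, fast)
```

Taking the minimum of the two cross errors means that two regions are considered close as soon as either model explains the other region well, which helps when one of the two models was estimated from a small or noisy region.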

c. detect background motion layer

After regions are merged into motion layers, we try to identify one background layer in each iteration. This is based on the assumption that a foreground layer usually has a discontinuous motion field around most of its outer boundary, while a background layer usually has a continuous motion field across its outer boundary with neighboring background layers. The boundary of a layer consists of the pixels that have at least one neighboring pixel not belonging to the layer. The outer boundary is the outermost closed curve that contains the whole layer. We use the following energy function to measure the motion discontinuity at a boundary pixel p.

    G(p) = max( |p1-p8|, |p2-p7|, |p3-p6|, |p4-p5| )

where p1-p8 are the motion vectors of p's 8 neighbors (clockwise, p1 at the upper-left corner). This energy function is similar to common edge detection operators such as the Roberts operator. A layer l is detected as a potential background layer only when the sum of G(p) along its outer boundary is smaller than a threshold. If no background layer is detected, the algorithm stops and all remaining regions belong to foreground layers. When there is more than one candidate background layer, the largest one is chosen as the background.
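A minimal sketch of the boundary energy is given below. It assumes that each of the four terms in G(p) compares two opposite neighbors of p (the two diagonals, the vertical pair and the horizontal pair) and that |.| is the Euclidean length of the difference of two motion vectors. The grid layout and the function names are choices made for this example.

```python
import math

# Opposite-neighbour offsets (dy, dx): two diagonals, vertical, horizontal.
# Pairing opposite neighbours is an assumption of this sketch.
OPPOSITE_PAIRS = [
    ((-1, -1), (1, 1)),
    ((-1, 0), (1, 0)),
    ((-1, 1), (1, -1)),
    ((0, -1), (0, 1)),
]


def discontinuity(flow, y, x):
    """G(p) at pixel (y, x); flow is a 2-D grid (list of rows) of (u, v) vectors."""
    h, w = len(flow), len(flow[0])
    best = 0.0
    for (dy1, dx1), (dy2, dx2) in OPPOSITE_PAIRS:
        y1, x1, y2, x2 = y + dy1, x + dx1, y + dy2, x + dx2
        if not (0 <= y1 < h and 0 <= x1 < w and 0 <= y2 < h and 0 <= x2 < w):
            continue  # skip pairs that fall outside the image
        (u1, v1), (u2, v2) = flow[y1][x1], flow[y2][x2]
        best = max(best, math.hypot(u1 - u2, v1 - v2))
    return best


def boundary_energy(flow, boundary_pixels):
    """Sum of G(p) over the (y, x) pixels of a layer's outer boundary."""
    return sum(discontinuity(flow, y, x) for y, x in boundary_pixels)


def is_background_candidate(flow, boundary_pixels, threshold):
    """A layer is a candidate background layer if its boundary energy is small."""
    return boundary_energy(flow, boundary_pixels) < threshold
```

The threshold is left as a parameter because the source does not state its value; in practice it would likely need to scale with the length of the outer boundary.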

d. detect and exclude background regions

The affine motion model of the detected background layer is used to compensate the remaining non-background regions. Regions with small compensation errors are classified as background and excluded from the next iteration of layer merging and detection. After multiple iterations, multiple background layers may have been produced, while multiple foreground layers remain.
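Steps b to d can be tied together in a loop like the one sketched below. The three callables stand in for the merging, background detection and compensation tests described above; their names and signatures are hypothetical, and regions are represented by hashable identifiers purely for illustration.

```python
def detect_motion_layers(regions, merge_fn, find_background_fn, fits_background_fn):
    """Peel off one background layer per iteration.

    regions:             iterable of hashable region ids
    merge_fn:            regions -> list of layers (each a list of region ids)
    find_background_fn:  layers -> one layer, or None if no candidate exists
    fits_background_fn:  (region, background_layer) -> True if the region is
                         well compensated by the background motion model
    Returns (background_layers, remaining_foreground_regions).
    """
    background_layers = []
    remaining = list(regions)
    while remaining:
        layers = merge_fn(remaining)               # step b
        background = find_background_fn(layers)    # step c
        if not background:
            break  # no background layer found: the rest is foreground
        background_layers.append(background)
        in_background = set(background)
        remaining = [                              # step d
            r for r in remaining
            if r not in in_background and not fits_background_fn(r, background)
        ]
    return background_layers, remaining
```

The loop terminates because each iteration either stops or removes at least the regions of the newly detected background layer from further consideration.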


Object Extraction using Spatio-Temporal Consistency

The foreground layers detected at individual frames may not be reliable, for several reasons. First, the motion field and motion models may be inaccurate. Second, and more importantly, a moving object may have noticeable motion in some frames, where it is easily detected, but be static in other frames, where it is mistakenly treated as background. A decision over a long-term interval (e.g., a shot) is therefore necessary to remove such errors and achieve reliable results.

To apply temporal constraints, we first link (i.e., track) the foreground layers in individual frames according to their underlying regions. A foreground layer L(m,k) in frame m is linked with a layer L(n,l) in frame n if the following condition is satisfied:

    | L(m,k) ∩ L(n,l) | = max over i,j of | L(m,i) ∩ L(n,j) |        (3)

where L(m,k) and L(n,l) are the kth and lth foreground layers in frames m and n, respectively. The maximum is computed over all foreground layers in frames m and n. The intersection of two layers in Eq. (3) is defined as the number of common regions they both contain. Two regions in different frames are said to be common if one is tracked by motion projection from the other. In other words, a layer in frame m is linked to the layer in a previous frame n that shares the most common regions with it. This process is repeated for the foreground layers that remain unlinked. In addition, we define the link as a transitive relationship: if layers A and B are linked and layers B and C are linked, then A and C are also linked. This ensures that each local motion layer belongs to one and only one temporal layer.
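The linking step can be sketched with a union-find structure, which gives the transitive grouping directly. This example makes two simplifying assumptions: every tracked region carries an identifier that stays the same across frames, and each layer is linked only to the layer in the immediately preceding frame with which it shares the most regions. The function name and data layout are illustrative.

```python
def link_layers(frames):
    """Group per-frame foreground layers into temporal layers.

    frames: list (one entry per frame) of dicts mapping a layer name to the
            set of tracked-region ids it contains (hypothetical layout).
    Returns a sorted list of groups; each group is a sorted list of
    (frame_index, layer_name) pairs.
    """
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for t, layers in enumerate(frames):
        for name in layers:
            parent[(t, name)] = (t, name)

    for t in range(1, len(frames)):
        for name, regions in frames[t].items():
            best, best_overlap = None, 0
            for prev_name, prev_regions in frames[t - 1].items():
                overlap = len(regions & prev_regions)  # number of common regions
                if overlap > best_overlap:
                    best, best_overlap = prev_name, overlap
            if best is not None:
                union((t - 1, best), (t, name))

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())


frames = [
    {"A": {1, 2}, "B": {7}},
    {"A": {1, 2, 3}, "B": {9}},
    {"A": {2, 3}},
]
temporal_layers = link_layers(frames)
```

Because union-find merges whole groups, linking A to B and B to C automatically places A and C in the same temporal layer, and each per-frame layer ends up in exactly one group.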

The above linking (tracking) process results in a number of groups of foreground layers. We refer to these groups as temporal layers. We use spatio-temporal constraints to validate these temporal layers. The first constraint is the duration of a temporal layer: layers with short durations are likely to be noise or background regions and are therefore dropped. The second is the frame-to-frame change in the center coordinates and size of a temporal layer: if there are large, abrupt changes, the temporal layer is not a valid track and is not detected as a foreground object.
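The two validation rules can be sketched as a single check over a temporal layer. The thresholds below (minimum duration, maximum center displacement and maximum size ratio between consecutive frames) are hypothetical values chosen for the example; the source does not give the ones actually used.

```python
import math


def is_valid_temporal_layer(track, min_frames=10, max_center_jump=20.0,
                            max_size_ratio=2.0):
    """Check duration and frame-to-frame smoothness of one temporal layer.

    track: list of (center_x, center_y, size_in_pixels), one entry per frame
           in temporal order (hypothetical layout).
    """
    # Rule 1: short-lived layers are treated as noise or background.
    if len(track) < min_frames:
        return False
    # Rule 2: reject large, abrupt changes in position or size.
    for (cx0, cy0, s0), (cx1, cy1, s1) in zip(track, track[1:]):
        if math.hypot(cx1 - cx0, cy1 - cy0) > max_center_jump:
            return False
        big, small = max(s0, s1), min(s0, s1)
        if small <= 0 or big / small > max_size_ratio:
            return False
    return True


smooth = [(10.0 + i, 20.0, 100.0 + i) for i in range(12)]
short = smooth[:3]
jumpy = smooth[:6] + [(200.0, 20.0, 106.0)] + smooth[7:]
```

With these inputs, `smooth` passes both rules, `short` fails the duration rule, and `jumpy` fails the abrupt-change rule.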


References

  1. Di Zhong and S.-F. Chang, "Video Object Model and Segmentation for Content-Based Video Indexing", ISCAS'97, Hong Kong, June 9-12, 1997.

  2. S.-F. Chang, W. Chen, H. Meng, H. Sundaram, and D. Zhong, "VideoQ: An Automated Content-Based Video Search System Using Visual Cues", ACM 5th Multimedia Conference, Seattle, WA, Nov. 1997.

  3. Di Zhong and S.-F. Chang, "AMOS - An Active MPEG-4 Video Object Segmentation System", ICIP-98, Chicago, Oct. 1998.