Moving Object Segmentation and Tracking Using Spatio-Temporal Consistency
The success of object-based media representation and description (e.g., MPEG-4 and MPEG-7) depends largely on effective object segmentation tools. In this project, we extend our previous work on automatic video region tracking and develop a robust moving object detection system. Our system first combines color and edge information to improve object motion estimation. It then uses long-term spatio-temporal constraints to achieve reliable object tracking over long sequences. Our extensive experiments show that the system handles challenging cases in general domains (e.g., stock footage), including depth-varying multi-layer backgrounds and fast camera motion.
Contacts: Di Zhong and Shih-Fu Chang

Figure panels: region segmentation, background layer, temporal consistency.
More detailed information is given in the following sections:
Iterative Motion Layer Detection
Object Extraction using Spatio-Temporal Consistency
Some example videos are shown below. All of them have depth variation and camera motion (i.e., the camera follows the moving objects), resulting in multiple motion layers.
The first sequence contains a skater running towards the camera. The ice field has a gradual depth change from near to far.
Video 1 | Object Tracking Result
In the second sequence, a person is walking away from the camera in an office. Cubic walls exist at different depths.
Video 2 | Object Tracking Result
The third sequence is a bird's-eye view of a soccer player running in the field.
Video 3 | Object Tracking Result
Sequence 4 contains three background layers, which are the ground, wall and crowd.
Video 4 | Object Tracking Result
The last sequence contains the sky, the stage and the jumping skier.
Video 5 | Object Tracking Result
The gradual depth change in sequence 1 does not cause much of a problem because the ground is merged into one large region in the first, color-based region segmentation stage. In sequence 2, the cubic walls are tracked as separate regions. Although these regions are classified as foreground motion layers in some frames, their temporal durations are short, so they are treated as background. In the third sequence, both the player and the grass field have gradual depth variation. As in the first sequence, color segmentation proves useful in handling such situations. These three sequences show good tracking results. Some small background regions are falsely included in sequence 4. These regions come mainly from the parts connecting two background regions and usually have inaccurate motion fields. Some foreground pixels are missed in sequence 5 because small isolated regions are removed by the final morphological operations.
Our experiments demonstrated that our long-term, region-based moving object detection approach is more robust and reliable than existing approaches that use only local motion information (e.g., frame-to-frame motion fields). The method is designed to automatically detect and track salient moving objects within scenes with multiple motion layers. By using temporal constraints, we can robustly and accurately segment moving objects over a long period. Our method can also handle objects with discontinuous motion (i.e., moving in some frames and still in others).
Except for some special cases (e.g., surveillance videos), common TV programs and home videos usually contain camera motion. In these situations, to detect moving objects, we first need to compensate for the motion caused by camera operations. When the scene is far from the camera and/or the camera motion only includes rotation and zoom, a single affine motion model can be used to model and compensate for the camera-induced motion. However, when the scene is close to the camera and the camera is translating, multiple moving planar surfaces may be produced in the image sequence. To address this problem, we extend our existing region segmentation and tracking algorithms and develop a two-stage moving object detection method. This method uses regions with accurate boundaries to improve motion estimation results, and uses temporal constraints to achieve more reliable object tracking over long sequences.

In the first stage, we apply an iterative motion layer detection process based on the estimation and merging of affine motion models. Each iteration generates one motion layer. The difference from existing methods is that motion models are estimated from spatially segmented color regions instead of just pixels or blocks.
In the second stage, temporal constraints are applied to detect moving objects in spatial and temporal space. Layers in individual frames are linked together based on characteristics of their underlying regions. One or more layers will be declared as moving objects according to specific spatio-temporal consistency rules.
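
As an illustration of estimating a motion model per region rather than per pixel or block, the following is a minimal sketch of a least-squares fit of a 6-parameter affine model to the motion vectors inside one region. The sample layout `(x, y, u, v)` and the function names `fit_affine`, `solve3` and `det3` are assumptions made for this example; they are not taken from the project's implementation.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, b):
    """Solve the 3x3 linear system m * a = b with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate region: samples are collinear or too few")
    solution = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = b[r]
        solution.append(det3(mc) / d)
    return solution


def fit_affine(samples):
    """Least-squares affine motion model for one region (hypothetical helper).

    samples: list of (x, y, u, v), a pixel position and its motion vector.
    Returns (a1, ..., a6) with u = a1 + a2*x + a3*y and v = a4 + a5*x + a6*y.
    """
    n = float(len(samples))
    sx = sum(s[0] for s in samples)
    sy = sum(s[1] for s in samples)
    sxx = sum(s[0] * s[0] for s in samples)
    sxy = sum(s[0] * s[1] for s in samples)
    syy = sum(s[1] * s[1] for s in samples)
    # Normal equations are shared by the u and v components.
    normal = [[n, sx, sy], [sx, sxx, sxy], [sy, sxy, syy]]
    params = []
    for comp in (2, 3):  # index 2 is u, index 3 is v
        rhs = [sum(s[comp] for s in samples),
               sum(s[0] * s[comp] for s in samples),
               sum(s[1] * s[comp] for s in samples)]
        params.extend(solve3(normal, rhs))
    return tuple(params)
```

Because all pixels of a color-segmented region contribute to one fit, a few noisy motion vectors have less influence than they would in a per-block estimate.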
Iterative Motion Layer Detection
The iterative motion layer detection process is shown in the following figure.

a. region segmentation and tracking
As described in the references, we developed an accurate region segmentation method that fuses color and edge features. The method uses motion-based projection to track regions over successive frames. Reliable tracking of salient regions over long periods has been demonstrated in our experiments (1,2).
b. motion based region merge
At each frame, non-background regions are merged into motion layers according to their affine motion models (e.g., the 8-parameter ego-motion model). We use the following distance measure to compare two neighboring regions Ri and Rj:
D(i,j) = min( MCErr(Ri,Mj), MCErr(Rj,Mi) )
where Mi and Mj are the affine motion models of regions Ri and Rj, respectively, and MCErr(R,M) is the motion compensation error of region R under motion model M. A region Ri is merged with its closest neighbor if their distance is below a given threshold.
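
The distance measure and the merge rule can be sketched as follows. This is only an illustration under stated assumptions: the motion compensation error is approximated here by the mean squared residual between a region's observed motion vectors and the vectors predicted by a 6-parameter affine model, which may differ from the error actually used in the system. The names `mc_err`, `distance` and `closest_neighbor` and the data layout are hypothetical.

```python
def predict(model, x, y):
    """Motion vector predicted at (x, y) by an affine model (a1..a6)."""
    a1, a2, a3, a4, a5, a6 = model
    return a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y


def mc_err(region, model):
    """Assumed stand-in for MCErr(R, M): mean squared motion residual.

    region: list of (x, y, u, v) samples belonging to the region.
    """
    total = 0.0
    for x, y, u, v in region:
        pu, pv = predict(model, x, y)
        total += (u - pu) ** 2 + (v - pv) ** 2
    return total / len(region)


def distance(region_i, model_i, region_j, model_j):
    """D(i,j) = min(MCErr(Ri, Mj), MCErr(Rj, Mi))."""
    return min(mc_err(region_i, model_j), mc_err(region_j, model_i))


def closest_neighbor(i, regions, models, neighbors, threshold):
    """Return the neighbor that region i should merge with, or None.

    neighbors: dict mapping a region index to the indices adjacent to it.
    A merge happens only if the smallest distance is below the threshold.
    """
    best, best_d = None, threshold
    for j in neighbors[i]:
        d = distance(regions[i], models[i], regions[j], models[j])
        if d < best_d:
            best, best_d = j, d
    return best
```

Taking the minimum of the two cross errors means two regions are considered close when either model explains the other region well, which is lenient toward small regions whose own model is poorly estimated.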
c. detect background motion layer
After regions are merged into motion layers, we try to identify one background layer in each iteration. This is based on the assumption that a foreground layer usually has discontinuous motion fields around most of its outer boundary, while a background layer usually has a continuous outer boundary with neighboring background layers. The boundary of a layer consists of the pixels that have at least one neighboring pixel not belonging to the layer. The outer boundary is the outermost closed curve that contains the whole layer. We use the following energy function to measure its boundary discontinuity.
G(p) = max( |p1-p8|, |p2-p7|, |p3-p6|, |p4-p5| )
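
A minimal sketch of this energy function follows. The text does not define p1 to p8, so the code assumes they are the values of a scalar field (for example, motion magnitude) at the eight neighbors of pixel p, numbered in raster order so that pk and p(9-k) lie on opposite sides of p. It also reads every term as the absolute difference of an opposite pair. Both points are assumptions of this example, as is the name `boundary_energy`.

```python
# Offsets of the eight neighbors in raster order: p1..p8 (assumed numbering).
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def boundary_energy(field, row, col):
    """G(p): largest absolute difference across the four opposite pairs.

    field: 2D list of scalar values (e.g., motion magnitude per pixel).
    (row, col) must not lie on the image border.
    """
    p = [field[row + dr][col + dc] for dr, dc in NEIGHBOR_OFFSETS]
    # p[k] and p[7 - k] are opposite neighbors: (p1,p8), (p2,p7), (p3,p6), (p4,p5).
    return max(abs(p[k] - p[7 - k]) for k in range(4))
```

Under this reading, a pixel inside a smooth motion field yields a small value, while a pixel sitting on a motion discontinuity yields a large one in at least one direction.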
d. detect and exclude background regions
The affine motion model of the detected background region is used to compensate non-background regions. Those regions with small compensation errors are classified as background, and excluded from the next iteration of layer merging and detection. After multiple iterations, multiple background layers may be produced, while multiple foreground layers remain.
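
The exclusion step can be sketched as a simple threshold test. The function `split_background`, the caller-supplied `error_fn`, and the purely translational `translation_error` used to exercise it are hypothetical and chosen for brevity; the system described here uses affine models and its own compensation error.

```python
def split_background(regions, background_model, error_fn, threshold):
    """Separate the regions that the background model explains well.

    regions: dict region_id -> region data (whatever error_fn expects).
    error_fn(region, model): compensation error of a region under a model.
    Returns (background_ids, remaining_ids); only the remaining regions
    take part in the next iteration of layer merging and detection.
    """
    background_ids, remaining_ids = [], []
    for region_id, region in regions.items():
        if error_fn(region, background_model) < threshold:
            background_ids.append(region_id)
        else:
            remaining_ids.append(region_id)
    return background_ids, remaining_ids


def translation_error(region, model):
    """Illustrative error only: mean squared gap between observed (u, v)
    vectors and a purely translational model (du, dv)."""
    du, dv = model
    return sum((u - du) ** 2 + (v - dv) ** 2 for u, v in region) / len(region)
```

For example, with a background model of `(1.0, 0.0)` and a threshold of `0.5`, a region whose vectors are all close to `(1.0, 0.0)` is moved to the background list, and a region moving at `(-2.0, 1.0)` stays in the remaining list.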
Object Extraction using Spatio-Temporal Consistency
The foreground layers detected at individual frames may not be reliable, for several reasons. First, the motion field and motion models may be inaccurate. Second, and more importantly, a moving object may have noticeable motion in some frames, where it can be easily detected, but be static in other frames, where it is mistakenly treated as background. A decision made over a long-term interval (e.g., a shot) is necessary to remove such errors and achieve reliable results.
To apply temporal constraints, we first link (i.e., track) foreground layers in individual frames according to their underlying regions. The kth foreground layer in frame m is linked with the lth foreground layer in frame n if the following condition, Eq. (3), is satisfied: the intersection of the two layers is the maximum computed over all foreground layers in frames m and n. The intersection of two layers in Eq. (3) is defined as the number of common regions they both contain. Two regions in different frames are said to be common if one is tracked by motion projection from the other. In other words, a layer in frame m is linked to the layer in a previous frame n that shares the most common regions with it. This process is repeated for the foreground layers that remain unlinked. In addition, we define the link as a transitive relationship: if layers A and B are linked and layers B and C are linked, then A and C are also linked. This ensures that each local motion layer belongs to one and only one temporal layer.
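
The linking rule and the grouping of linked layers can be sketched as below. The sketch assumes that every tracked region carries a persistent identifier, so that "common regions" become a set intersection, and that links are formed between consecutive frames only. The grouping is computed with a small union-find structure. The function names and data layout are hypothetical.

```python
def link_layers(layers_m, layers_n):
    """Link each layer in frame m to the layer in frame n sharing most regions.

    layers_m, layers_n: dict layer_id -> set of tracked-region identifiers.
    Returns dict {layer_id in m: layer_id in n}; layers with no common
    region stay unlinked.
    """
    links = {}
    for k, regions_k in layers_m.items():
        best, best_common = None, 0
        for l, regions_l in layers_n.items():
            common = len(regions_k & regions_l)
            if common > best_common:
                best, best_common = l, common
        if best is not None:
            links[k] = best
    return links


def temporal_layers(frames):
    """Group linked layers across a sequence.

    frames: list (in time order) of dicts layer_id -> set of region ids.
    Returns a list of sets of (frame_index, layer_id) nodes. Merging the
    two groups of every linked pair makes A~B and B~C imply A~C.
    """
    parent = {(t, lid): (t, lid)
              for t, layers in enumerate(frames) for lid in layers}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for t in range(1, len(frames)):
        for k, l in link_layers(frames[t], frames[t - 1]).items():
            parent[find((t, k))] = find((t - 1, l))

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```

Each node ends up in exactly one group, which corresponds to the property that a local motion layer belongs to one and only one temporal layer.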
The above linking (tracking) process results in a number of groups of foreground layers. We refer to these groups as temporal layers. We use spatio-temporal constraints to validate these temporal layers. The first is the duration of a temporal layer: layers with short durations are likely to be noise or background regions and are therefore dropped. The second is the frame-to-frame change in the center coordinates and size of a temporal layer: if there are large and abrupt changes, the temporal layer is not a valid track and is not detected as a foreground object.
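
The two validation rules can be sketched as follows. The thresholds `min_duration`, `max_center_shift` and `max_size_ratio`, and the representation of a temporal layer as a per-frame list of `(center_x, center_y, size)` tuples, are assumptions of this example; the text does not give the actual criteria or values.

```python
import math


def is_valid_temporal_layer(track, min_duration, max_center_shift, max_size_ratio):
    """Check a temporal layer against duration and smoothness constraints.

    track: list of (center_x, center_y, size) tuples, one per frame.
    Returns False for short tracks and for tracks whose center or size
    changes abruptly between consecutive frames.
    """
    if len(track) < min_duration:
        return False
    for (x0, y0, s0), (x1, y1, s1) in zip(track, track[1:]):
        if math.hypot(x1 - x0, y1 - y0) > max_center_shift:
            return False
        # Ratio of the larger size to the smaller one, guarded against zero.
        if max(s0, s1) / max(min(s0, s1), 1) > max_size_ratio:
            return False
    return True
```

A track that drifts a few pixels per frame with a slowly varying size passes, while one that jumps across the image or suddenly doubles in size is rejected.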
1. D. Zhong and S.-F. Chang, "Video Object Model and Segmentation for Content-Based Video Indexing", ISCAS'97, Hong Kong, June 9-12, 1997.
2. S.-F. Chang, W. Chen, H. Meng, H. Sundaram, and D. Zhong, "VideoQ: An Automated Content-Based Video Search System Using Visual Cues", ACM 5th Multimedia Conference, Seattle, WA, Nov. 1997.
3. D. Zhong and S.-F. Chang, "AMOS - An Active MPEG-4 Video Object Segmentation System", ICIP-98, Chicago, Oct. 1998.