The Kia project deals with the problem of
automatic summarization of videos via a fully automatic analysis of
audio and video data. Since we work on film data, the project name pays
homage to three famous film directors: Krzysztof
Kieslowski (1941-1996), Ingmar
Bergman (1918-) and Akira Kurosawa
(1910-1998).
Motivation
Automatic generation of video summaries represents an important subset
of the class of general content analysis problems. Potential
applications include: (a) creating summaries for browsing archival data;
(b) dynamic real-time summaries (say one is late for an important
meeting; how to catch up quickly?); and (c) in a pay-per-view television
scenario, giving the viewer (i.e. the content buyer) a better
understanding of the utility of the content about to be purchased.
Because television programs generate massive amounts of data, manual
creation of summaries will be expensive.
Related Techniques
There are two existing methods to create summaries:
- Shot key-frames based image summary
- Video skims
An image summary is a collection of important key-frames arranged in
time order. This enables the viewer to "look" at the
entire video using a small number of key-frames. By
clicking on an image, the viewer can jump to the video corresponding to that
shot key-frame. However, there are two difficulties: (a) selecting
good key-frames through shot clustering is a challenging problem, and (b)
the audio track is not used, either in key-frame selection or in
displaying the summary.
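The structure of an image summary can be sketched as below. This is only an illustration of the idea described above, not code from the project: the `KeyFrame` fields, the `image_path` values and both function names are assumptions made for the example, and key-frame selection itself (the hard part) is taken as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyFrame:
    """One key-frame standing in for a shot (all fields are hypothetical)."""
    shot_start: float  # start time of the shot, in seconds
    image_path: str    # where the key-frame image is stored


def build_image_summary(key_frames):
    """Arrange the chosen key-frames in time order."""
    return sorted(key_frames, key=lambda kf: kf.shot_start)


def seek_target(summary, clicked_index):
    """Return the video position to jump to when a key-frame is clicked."""
    return summary[clicked_index].shot_start
```

Clicking the second image of the summary would then seek the player to `seek_target(summary, 1)`. Note that nothing in this structure carries audio, which is the second difficulty listed above.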
A video skim is a shortened version of the video, where the shots and
the corresponding audio have been selected using various pattern
recognition algorithms (e.g. a face detector to detect shots with
faces). While skims preserve the dynamism in the original video data,
they suffer from the problem of being linear, i.e. the viewer is forced
to watch the entire skim (which may be a few minutes long) to understand
the contents of the video.
Our Approach
Our summarization mechanism is a skim. However, our methodology for
constructing a skim is different from existing techniques. Since we work
with film data, our algorithms make use of film production techniques.
There are three elements to our solution. First, we perform a joint
audio-visual segmentation on the film data. Our solution accounts for
consistencies that arise out of film production techniques as well as
the psychology of audition. Then, we investigate the stochastic
structure of the film in terms of shot duration and time ordering. This
is done in conjunction with audio analysis and while preserving scene
boundaries. We've also developed algorithms to detect other structures,
such as dialogues. Finally, when incorporating shots in our skim, we
need to determine their duration. This is done using a relationship
that we determined between the descriptive complexity of a shot
and the decoding time for that shot.
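To make the last step concrete, here is a minimal sketch of how a complexity score could be turned into a shot duration. It is not the project's actual method: using the zlib compression ratio as a proxy for descriptive complexity, the linear mapping, and the 1 to 4 second bounds are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import zlib


def descriptive_complexity(frame_bytes: bytes) -> float:
    """Crude complexity proxy in [0, 1]: compressed size / raw size.

    Assumption: a frame that compresses poorly is "more complex".
    """
    if not frame_bytes:
        return 0.0
    ratio = len(zlib.compress(frame_bytes, 9)) / len(frame_bytes)
    # Incompressible input can grow slightly when compressed, so clamp.
    return min(1.0, ratio)


def shot_duration(complexity: float,
                  min_seconds: float = 1.0,
                  max_seconds: float = 4.0) -> float:
    """Map complexity to display time; the bounds are placeholder values."""
    c = min(1.0, max(0.0, complexity))
    return min_seconds + c * (max_seconds - min_seconds)
```

Under these assumptions, a visually simple shot is kept in the skim for about `min_seconds`, while a shot whose key-frame barely compresses is shown for close to `max_seconds`. A real system would fit the mapping to measured viewer decoding times rather than fix it by hand.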