The Kia Project

Summarization

The Kia project addresses the problem of summarizing videos through a fully automatic analysis of audio and video data. Since we work on film data, the project name pays homage to three famous film directors: Krzysztof Kieslowski (1941-1996), Ingmar Bergman (1918-2007) and Akira Kurosawa (1910-1998).

Motivation

Automatic generation of video summaries represents an important subset of the class of general content analysis problems. Potential applications include: (a) creating summaries for browsing archival data; (b) dynamic real-time summaries (say one is late for an important meeting; how to catch up quickly?); (c) in a pay-per-view television scenario, summaries that give the viewer (i.e., the content buyer) a better understanding of the utility of the content about to be purchased. Because television programs generate massive amounts of data, manual creation of summaries will be expensive.

Related Techniques

There are two existing methods to create summaries:
  1. Shot key-frames based image summary
  2. Video skims

An image summary is a collection of important key-frames arranged in time order. This enables the viewer to "look" at the entire video using a small number of key-frames. By clicking on an image, the viewer can jump to the video corresponding to that shot key-frame. However, there are a few difficulties: (a) selecting good key-frames through shot clustering is a challenging problem; (b) the audio track is not used, either in key-frame selection or in displaying the summary.
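The image summary described above can be pictured as a time-ordered list of key-frames, each remembering where its shot starts. The following is only a minimal sketch under stated assumptions: the names KeyFrame, build_image_summary and seek_time_for_click, and the image_path field, are hypothetical and are not taken from any existing system. Key-frame selection itself (the hard part) is assumed to have been done already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyFrame:
    shot_id: int
    start_sec: float  # where the corresponding shot begins in the video
    image_path: str   # hypothetical location of the key-frame image


def build_image_summary(key_frames):
    """Arrange the already-selected key-frames in time order."""
    return sorted(key_frames, key=lambda kf: kf.start_sec)


def seek_time_for_click(summary, index):
    """Playback position to jump to when the key-frame at `index` is clicked."""
    return summary[index].start_sec


if __name__ == "__main__":
    frames = [
        KeyFrame(2, 40.0, "shot2.jpg"),
        KeyFrame(1, 0.0, "shot1.jpg"),
        KeyFrame(3, 95.5, "shot3.jpg"),
    ]
    summary = build_image_summary(frames)
    print([kf.shot_id for kf in summary])
    print(seek_time_for_click(summary, 2))
```

Note that nothing in this structure refers to the audio track, which mirrors difficulty (b).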

A video skim is a shortened version of the video, where the shots and the corresponding audio have been selected using various pattern recognition algorithms (e.g., a face detector to detect shots with faces). While skims preserve the dynamism of the original video data, they suffer from being linear, i.e., the viewer is forced to watch the entire skim (which may be a few minutes long) to understand the contents of the video.

Our Approach

Our summarization mechanism is a skim. However, our methodology for constructing a skim differs from existing techniques. Since we work with film data, our algorithms make use of film production techniques. There are three elements to our solution. First, we perform a joint audio-visual segmentation of the film data. Our solution accounts for consistencies that arise out of film production techniques as well as the psychology of audition. Second, we investigate the stochastic structure of the film in terms of shot duration and time ordering. This is done in conjunction with audio analysis and while preserving scene boundaries. We have also developed algorithms to detect other structures, such as dialogues. Finally, when incorporating shots in our skim, we need to determine their duration. This is done by relating the descriptive complexity of a shot to the decoding time for that shot.
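To make the last step concrete, here is a rough sketch of how a complexity-to-time relationship could be used to trim shots. It is an illustration only and not the project's algorithm: the text above states only that such a relationship exists, so the linear form of viewing_time, its placeholder coefficients base_sec and per_unit_sec, the normalized complexity score, and the names Shot and condense are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    start_sec: float
    duration_sec: float
    complexity: float  # assumed to be normalized to [0, 1]


def viewing_time(complexity, base_sec=1.0, per_unit_sec=2.0):
    """Hypothetical mapping: more complex shots need more time on screen.

    The linear form and both coefficients are placeholders.
    """
    return base_sec + per_unit_sec * complexity


def condense(shots, base_sec=1.0, per_unit_sec=2.0):
    """Return (start_sec, kept_duration_sec) for each shot, in time order.

    A shot is never lengthened: it keeps at most its original duration.
    """
    skim = []
    for shot in sorted(shots, key=lambda s: s.start_sec):
        needed = viewing_time(shot.complexity, base_sec, per_unit_sec)
        skim.append((shot.start_sec, min(shot.duration_sec, needed)))
    return skim


if __name__ == "__main__":
    shots = [Shot(10.0, 8.0, 0.5), Shot(0.0, 1.5, 1.0)]
    for start, kept in condense(shots):
        print(f"keep {kept:.1f}s starting at {start:.1f}s")
```

The sketch keeps every shot and ignores audio, scene boundaries and dialogue structure, all of which the approach above takes into account when deciding what enters the skim.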