18.6 CYBERMAPs for Multimedia Data
It would clearly be useful to apply CYBERMAP not only to textual data but also to compute overview maps of multimedia objects. In the simplest case, this means that textual descriptions of the multimedia objects are used for clustering. In a first realization of this idea we built a CYBERMAP of "A City In Transition - New Orleans 1983-1986" [Dav88], a collection of movie segments stored on laser disk, produced by Professor Glorianna Davenport at the MIT Media Lab to illustrate the New Orleans world exposition of 1986. The movies were manually indexed by Professor Davenport and her students. Figure I.106 displays the CYBERMAP that was computed from the textual index shown at the bottom of the figure.
Figure I.106 CYBERMAP for the collection of movie segments "New Orleans in Transition"

Figure I.107 illustrates the use of CYBERMAP for browsing the movie database: the hyperdrawer navigation window gives direct access to all movie segments that have something to do with "Jax".
Figure I.107 Browsing the "New Orleans in Transition" stack using CYBERMAP

In textual databases, objects are easily compared as sequences of bytes. Multimedia information usually cannot be reliably compared in this manner. Rather, a model of the information encoded in the bytes must be constructed to provide a basis for comparison.
Our first multimedia CYBERMAP circumvented these issues by attaching annotative text to the audio and video components. Typically, such annotations are generated manually. There are problems associated with this method, however. For example, by providing a verbal description of a piece of audio (e.g., "explosion") one can query a database only on the keywords. That is, the search is being done on one medium (text) on behalf of another (audio). Given the perceptual nature of sound, the limitations of verbal descriptions should be apparent. Another issue is the practicality of generating verbal descriptions for multimedia objects: the sheer volume of information makes manual description prohibitive. An automatic method for generating these annotations is necessary. This implies the need for automated methods for analyzing and comparing the content of multimedia objects. These same methods could be used for conducting media-specific searches in multimedia databases.
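The limitation described above can be made concrete with a minimal sketch. The annotation texts and clip identifiers below are hypothetical, not from the actual New Orleans database; the point is that the query only ever touches the words attached to a clip, never the audio or video content itself.

```python
# Hypothetical annotation index: clip ids mapped to manual text descriptions.
annotations = {
    "clip_01": "jazz band on the Jax brewery stage",
    "clip_02": "fireworks explosion over the river",
    "clip_03": "crowd entering the Jax pavilion",
}

def search(keyword, index):
    """Return ids of clips whose text annotation mentions the keyword.

    The media content itself is never examined -- only its description.
    """
    kw = keyword.lower()
    return sorted(cid for cid, text in index.items() if kw in text.lower())

print(search("jax", annotations))        # finds clips via text only
print(search("explosion", annotations))  # a sound described in words
```

A sound that the annotator did not think to name (or named differently) is simply unfindable, which is exactly why a content-based method is needed.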
Timbral CYBERMAP

In the newest web version of CYBERMAP we have experimented with maps of audio data where the similarities are computed directly from the multimedia data. More precisely, we build CYBERMAPs of audio databases using an automated method based on timbre recognition for analyzing and comparing the contents of the audio objects.
Audio-only databases can be found in a number of fields. In the entertainment industry for example, sound designers regularly use databases of thousands of sounds in the creation of soundtracks for movies and commercials. Invariably, designers use their ears to finalize the search, perhaps listening to dozens or hundreds of sounds. A browser for audio databases would improve this situation.
Timbre is the term used to describe a sound's identifying characteristics. In human audition, timbre perception is fundamental to tasks like speech recognition and speaker identification. While this skill is unconscious and taken for granted, it is very difficult to build computers that can simulate aspects of hearing. A reasonable approach to building timbre recognition systems is to model the human auditory system: constraints imposed by the biological system can provide insight into the problem.
The method of audio analysis presented here attempts to deal with a wide variety of sounds by using a general model of timbre perception. The timbral representation being used was first presented in [Lan95]. The key points and justifications are summarized here.
The auditory system is a complex multi-staged processor. One of the most important steps occurs in the cochlea where the acoustic waveform is decomposed into its time-varying spectral components [Pic88]. In the present work, the cochlea is simulated by an algorithm developed by [Fit92] based on an algorithm by [Qua85]. Figure I.109 shows the output of the cochlear model for a trumpet tone.
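To give an intuition for what the cochlear front end produces, the sketch below is a crude stand-in (not the [Fit92]/[Qua85] model used in the actual system): the waveform is cut into short frames and the spectrum of each frame is pooled into a few coarse frequency bands, yielding a time-varying spectral representation. All parameter values here are illustrative assumptions.

```python
import numpy as np

def spectral_frames(signal, frame=256, hop=128, n_bands=8):
    """Crude time-varying spectral decomposition (stand-in for a cochlear model).

    Splits the signal into overlapping frames, takes each frame's magnitude
    spectrum, and pools the FFT bins into n_bands coarse channels.
    Returns an array of shape (n_frames, n_bands).
    """
    frames = []
    for start in range(0, len(signal) - frame + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame]))
        bands = np.array_split(spectrum, n_bands)  # coarse "channels"
        frames.append([float(b.mean()) for b in bands])
    return np.array(frames)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)  # synthetic stand-in for a trumpet tone
S = spectral_frames(tone)
print(S.shape)  # one row of band energies per analysis frame
```

For the 440 Hz test tone, the energy concentrates in the lowest band and stays there over time; a real instrument tone would show richer, time-varying structure across bands, as in figure I.109.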
Figure I.109 Analysis of a Trumpet Tone

Studies indicate that features like spectral envelopes (the shape of the spectral information) and amplitude envelopes (the shape of the amplitude information) are important in timbre perception [Bre90]. The current technique extracts 10 such features from the output of the cochlear model, which are stored as coarse-coded vectors. This shape-based representation allows sound similarity to be calculated easily by comparing corresponding sets of curves. A number of similarity measurements could be used, such as the correlation coefficient. However, since we are potentially interested in distances within the timbre space, we chose a metric similarity measurement, the Ordered Linear Direction Metric (OLD) [Pol87]. This metric compares the similarity in adjacent linear direction between two curves.

Given a database of sounds we can now extract the timbre categories within the database. Clustering is one method of finding these categories: by calculating the pairwise similarities between all files we can form clusters of highly similar sounds. Based on these similarities it is then straightforward to compute a CYBERMAP.
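The exact formulation of the OLD metric is given in [Pol87]; the sketch below is only one plausible reading of "similarity in adjacent linear direction": each curve is reduced to the signs of its adjacent differences (rising, flat, falling), and the distance is the fraction of positions where the two direction sequences disagree.

```python
def direction_signs(curve):
    """Direction of each adjacent segment: +1 rising, -1 falling, 0 flat."""
    return [(b > a) - (b < a) for a, b in zip(curve, curve[1:])]

def old_distance(curve_a, curve_b):
    """Fraction of adjacent segments whose directions disagree.

    A hedged sketch in the spirit of the Ordered Linear Direction metric,
    not the exact [Pol87] definition.
    """
    da, db = direction_signs(curve_a), direction_signs(curve_b)
    assert len(da) == len(db), "curves must be sampled at the same points"
    return sum(x != y for x, y in zip(da, db)) / len(da)

rising = [0, 1, 2, 3, 4]
falling = [4, 3, 2, 1, 0]
print(old_distance(rising, rising))   # identical shape: 0.0
print(old_distance(rising, falling))  # opposite shape: 1.0
```

Because the measure depends only on the shape of the curves, two envelopes that rise and fall together are close regardless of their absolute levels, which matches the shape-based feature representation described above.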
The database used in this project consisted of 171 sounds representing 19 different timbre categories (Bluejay, Boat, Camel, Crow, Chime, Clarinet, "Clink", Cymbal, Flute, Glockenspiel, Guitar, Harp, Organ, Parakeet, Piano, Rooster, Saxophone, Trumpet, Whistle). Each sound was approximately 1-2 seconds in length. In the initial clustering operation three thresholds were tested (0.2, 0.24, 0.26). The lower the threshold, the more timbre categories the clustering algorithm found: at the lowest threshold (0.2) the clustering algorithm found 31 categories, at 0.24 it found 26 categories, and at 0.26 it found 20 categories.
The Timbral CYBERMAP has been implemented in Java as an applet (figure I.110). The user is presented with a partial tree representing the database. Each node is a timbral category. At each node the user can choose to play any of the sounds in that category. The user can also expand a sub-tree by pushing a "Grow" button.
Figure I.110 "Timbral" Audio CYBERMAP on the web

The need for multimedia browsing tools seems clear. Toward that end, content-based methods of media analysis are needed to extract structure from the database. The project presented here demonstrates one such method for audio. The ultimate goal is the creation of a system that can automatically classify sounds in a manner consistent with human perception.
The next chapter introduces a way to visualize large information structures consisting of many nodes by permitting hierarchical nestings of CYBERMAPs.