Alexander Graham Bell stands at a turning point in the history of speech synthesis. When he was young, his father took him and his younger brother to an exhibition where they saw a replica of one of Von Kempelen's speech machines. Back home, the boys proceeded to build a similar speaking machine themselves. When, several years later, Bell invented the telephone, he introduced the technique that would determine the future of sound processing: the representation of sounds by means of electric signals. Bell also produced a detailed design that never got implemented, for a device that would have been a mechanical Vocoder.
But his most curious contribution to artificial speech synthesis was another early feat. Bell's youthful interest in speech production also led him to experiment with his pet Skye terrier. He taught the dog to sit up on his hind legs and growl continuously. At the same time, Bell manipulated the dog's vocal tract by hand. The dog's repertoire of sounds finally consisted of the vowels /a/ and /u/, the diphthong /ou/ and the syllables /ma/ and /ga/. His greatest linguistic accomplishment consisted of the sentence, How are you Grandmamma? The dog apparently started taking a 'bread and butter' interest in the project and would try to talk by himself. But on his own, he could never do better than the usual growl.
7 James L. Flanagan Speech Analysis Synthesis and Perception, second edition, Berlin 1972, pp. 206/207.
A related technology, with a cyberpunk slant, is due to Johannes Müller, the father of modern physiology. His working method is clearly characterized by his orientation toward experiments on living or dead objects. Continuing the efforts of Liskovius, who in 1814 was the first to generate chest- and head-voice from the larynx of a corpse, he cut off the head of a corpse in such a way that the entire vocal apparatus and part of the trachea were preserved. By blowing air into the larynx of the corpse, Müller produced vocalic sounds which closely resembled human speech. By moving the lips, he even managed to generate some consonants.
8 Köster, op.cit., p. 149
To Fake
Hermann Helmholtz was a pupil of that same Müller. But his work in the field of speech synthesis was less physiologically and more acoustically oriented. In the second half of the nineteenth century, research into the phenomenon of sound had reached the stage where one could attempt to analyse the sounds of human speech into elementary components.
To synthesize vowels, Helmholtz did not imitate the human body, but built up the sounds from elementary, sinusoidal components.
His synthesis machine consists of a battery of tuning forks equipped with resonance chambers, with frequencies in harmonic proportions. Driven by electromagnets, the tuning forks vibrate with perfect regularity at their fundamental frequencies. The volumes of the contributions from the different tuning forks can be varied by partly opening or closing their resonance chambers. Thus, sounds with different spectra can be composed, which resemble various vowels: Oo, Ee, Ah, Oh, Uh, Ih...
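In present-day terms this procedure is additive synthesis. What follows is a minimal sketch in Python that assumes nothing beyond the description above: each tuning fork becomes a sine partial at an integer multiple of the fundamental, and each resonance chamber becomes a gain. The function name `additive_tone`, the sample rate and the gain values are hypothetical illustrations, not measurements of any vowel or of Helmholtz's apparatus.

```python
import math


def additive_tone(fundamental_hz, harmonic_gains, sample_rate=8000, duration_s=0.01):
    """Sum sine partials at integer multiples of the fundamental.

    harmonic_gains[k] is the volume of partial k + 1, playing the role of
    how far one resonance chamber is opened (an assumption of this sketch).
    """
    n_samples = round(sample_rate * duration_s)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        value = sum(
            gain * math.sin(2 * math.pi * fundamental_hz * (k + 1) * t)
            for k, gain in enumerate(harmonic_gains)
        )
        samples.append(value)
    return samples


# Illustrative gains only: a strong fundamental with weaker upper partials.
tone = additive_tone(100.0, [1.0, 0.5, 0.25])
```

Changing the list of gains changes the spectrum, and with it the vowel-like colour of the tone, while the pitch stays fixed by the fundamental.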
The same method of synthesis can be applied even more easily with modern electronic technology - a technology which was developed for the reproduction and transmission of sound. The crucial invention, which made electronic sound generation possible, was the loudspeaker: the general-purpose sound producer, which can replicate the sound of an arbitrary event, without having to mimic its material structure.
The loudspeaker transforms arbitrary electric signals into material sound waves. This creates the possibility of treating electric signals as models of sound waves. In electronic technology, this is done by means of resistors, induction coils, radio tubes, transistors. Objects with a specific electronic behaviour are combined into circuits which generate the desired output patterns.
The two kinds of approach mentioned above in connection with mechanical sound synthesis can be applied in electronics as well. The structure and the components of a mechanical system that imitates the human larynx can systematically be transposed to the electronic domain; this will indeed result in a circuit with an output signal that corresponds to the vocal sound produced by the mechanical model.
Translating Helmholtz' approach to the electronic realm is even simpler: replace his tuning forks with sine wave generators, and his adjustable resonance chambers with potentiometers. Electronic simulation has a material form: a circuit consisting of identifiable components and connections. But on the outside, nothing seems to be happening. The clockwork stands still. It thinks.
The structure of the circuit corresponds to the mathematical analysis of a physical sound-generating process. The circuit is a materialized diagram. A printed circuit board actually looks like that.
9 Cf. Dick Raaijmakers 'De kunst van het machine lezen' (The art of machine reading), Raster 6, 1978, pp. 6-53
The computer is the next step in the development towards an increasingly abstract simulation. The hardware no longer has anything in common with the physics conjured up for the listener. The hardware even has a structure which is essentially incompatible with the origins of music. A computer really 'computes': it manipulates discrete symbols. Music, on the other hand, is generated by the resonance of continuous systems.
Digital sound simulation is two steps away from real sound: the electric signal driving the loudspeaker is represented in the computer as a sequence of discrete symbols that represent the amplitude variation in time, split up into small discrete steps. Thus, even the continuity of the electric signal is faked.
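The two kinds of discreteness, in time and in amplitude, can be made concrete with a short sketch. It is an illustration under stated assumptions, not a description of any particular system: the function name `sample_and_quantize`, the 8000 Hz sample rate and the 8-bit resolution are all hypothetical choices.

```python
import math


def sample_and_quantize(signal, sample_rate, duration_s, bits):
    """Turn a continuous signal (a function of time with values in -1..1)
    into a list of integer codes: discrete in time and in amplitude."""
    levels = 2 ** (bits - 1) - 1  # largest code, e.g. 127 for 8 bits
    n_samples = round(sample_rate * duration_s)
    return [round(signal(i / sample_rate) * levels) for i in range(n_samples)]


def sine_440(t):
    """A continuous 440 Hz sine wave, standing in for the electric signal."""
    return math.sin(2 * math.pi * 440 * t)


# Five milliseconds of the wave, reduced to a sequence of whole numbers.
codes = sample_and_quantize(sine_440, sample_rate=8000, duration_s=0.005, bits=8)
```

Everything the computer subsequently does to the sound is arithmetic on a list of this kind.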
The operations on the symbolically represented signals largely correspond to the functioning of the components from electronic circuits – but because these operations are now symbolically represented as well (installed as software in the computer), they can be applied with infinite flexibility, in every imaginable combination and sequence. Cordemoy's impossibility has come true: lifeless matter has escaped the rigidity of the clockwork.
The flexible machine which can do anything is at the same time the enigmatic machine which shows nothing. The machine is motionless, so that we do not see anything happening. But neither does the wiring structure of the components reveal anything about the functions performed. This structure only says: calculations in progress.
The flexibility of the software medium is virtually complete. All operations which can be described mathematically can be implemented. Even the fact that the execution of each operation takes a short, but not infinitely short, moment of time, and that very complex combinations of operations can therefore take a long time, is hardly a limitation anymore. This practical problem is solved by VLSI (custom-built chip) technology. It is often possible to develop special chips for sub-processes which take too much time: large-scale integrated electronic hardware, which is less flexible than software, but extremely fast.
Everything you can imagine you can do with software. That is what's interesting about AI and other experimental branches of computer science: we discover the limits of what we can imagine. Sound synthesis is a typical example of this: modern synthesizers can produce a tremendous richness of sound, but imitations of existing instruments still sound stylized.
Where they do sound natural, this is because they are not synthesized on the basis of structural analysis, but on the basis of samples. In that case there is no imitation, but reproduction of a previously recorded sound. The best sounding synthesizers have a great deal in common with tape recorders. They are digital mellotrons.
Digital sound recording is now the recording technology with the highest accuracy. The basic methods of digital sound representation are thus completely adequate. The limitations of digital sound synthesis are solely due to the limitations of our understanding of the psychological structure of sound.
Platonic People
Because their speech was barely intelligible, there was not much use for the first electronic speech-synthesis systems. For example, you could not make them speak a complex text with unpredictable contents if you wanted the text to be understood by an audience.
These systems also sounded distinctly inhuman. The voice appears to be generated by an alien body which is not flesh and blood – by the angular movements of the metal components of the prototypical robot. What you hear is a machine which, in its awkward mechanical way, tries to use the human means of communication. This behaviour evokes disturbing questions about the possibilities and the dangers of technology, about mind and matter, and the nature of human identity.
But current state-of-the-art software is different. A typical example is DECtalk.
10 DECtalk was developed by Digital Equipment on the basis of MITalk. See: Jonathan Allen, M. Sharon Hunnicutt and Dennis Klatt, From Text to Speech: The MITalk System, Cambridge University Press, 1987
This program is the realization of Abbé Mical's wildest dreams. //Têtes parlantes:// not one, not two, but nine different ones; and all of them can moreover be modified and interpolated.
The DECtalk manual presents their portraits and gives them names: //Rough Rita, Frail Frank, Whispering Wendy, Huge Harry, Kit the Kid, Perfect Paul, Beautiful Betty, Uppity Ursula, and Doctor Dennis.//
Protagonists of a comic strip version of Peyton Place.
The input for programs such as DECtalk consists of discrete symbols. The program processes files that consist of sequences of phonemes. So there is no human control of timing and dynamics, as there was with the eighteenth-century machines which were operated by means of a keyboard. In spite of this, and even partly because of it, the output has greater continuity.
The software contains not only models of the signals that correspond to the individual phonemes, but also procedures for merging the successive signals seamlessly together.
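What such a merging procedure has to achieve can be suggested with a deliberately simple stand-in. The sketch below is not the method used by DECtalk or MITalk, whose actual procedures are not described here; it swaps in a plain linear crossfade between two sampled segments, and the name `crossfade_join` is hypothetical.

```python
def crossfade_join(first, second, overlap):
    """Join two sample lists, blending the last `overlap` samples of `first`
    with the first `overlap` samples of `second` by a linear crossfade."""
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both segments")
    start = len(first) - overlap
    blended = []
    for i in range(overlap):
        # The weight rises from just above 0 to just below 1 across the overlap.
        weight = (i + 1) / (overlap + 1)
        blended.append((1 - weight) * first[start + i] + weight * second[i])
    return first[:start] + blended + second[overlap:]


# Two constant segments: the join glides from one level to the other.
joined = crossfade_join([1.0] * 4, [0.0] * 4, overlap=2)
```

With an overlap of zero the function reduces to plain concatenation, which is where the audible seam between successive segments would come from.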
Modern synthetic voices are perfectly intelligible. And because of a more accurate control of the spectrum of vowels, the distinctively metallic quality of the sound has disappeared. But nevertheless, no one would confuse their output with human speech. The synthetic voice is still inhuman, if only because of its uniformity.
11 Speech technologists are doing their best to imitate human limitations and imperfections. Allen et al. (op.cit.), for example: Some additional pauses are introduced in longer phrases and slow speaking rate so that the talker does not seem to have an inhuman supply of breath.
DECtalk's standard voice, Perfect Paul, is an abstract-sounding voice, that of a newsreader. Neither machine, nor human being. This marks the birth of a new medium. Up until now, you could not listen to a text without listening to someone's body. The independent text, independent of the human body, was always the printed text. For the first time, language now has a sound independent of the body – a sound that directly emanates from the linguistic system, from syntax and phonemes. The next step in this development is foreshadowed by other DECtalk voices, such as Whispering Wendy and Huge Harry. These are more personal, but just as equable and imperturbable, smooth and continuous. Airbrush pinups. Platonic bodies.
Whispering Wendy's voice has a pure, clear sound, with very little substance – like Marilyn Monroe's singing voice or Brigitte Bardot's. The suggestion of a soft, supple, weightless body.
Huge Harry is Wendy's macho counterpart. His voice is heavy and lustful. Not Elvis Presley yet, but not bad for a beginner.
The synthetic body has already become an erotic ideal. Look, for instance, at the use of classical statues in thirties' fashion photography: ''The forms of high fashion assume the look of the statuesque, the hallowed, the classical. Living flesh has the smoothness, the soft luster of ancient marble. Stone, it almost seems, is as supple as flesh. Hoyningen-Huene makes an equation between living and not living bodies, and the equation enchants, for in his photographs the bodies that do not live are not dead. They are statues. His imagery argues that in the realm of fashion there is no death. To enter the fashionable instant is to live forever.''
12 Carter Ratcliff, 'Out of Time', Artforum International 30, September 1991, pp. 112-117
The future of digital image- and sound-simulation: the smooth coolness of the statue in a naturally moving body, in a sensually modulating voice. Technology is heading slowly but surely toward increasingly perfect robot-porn. Live performers like Prince and Michael Jackson are already beginning to dissolve into their computer-animated images.
When Andy Warhol invented commercial telephone sex, he suggested in the same breath that it could best be done by robots: //A robot-computer to answer the phone, that would be great. It would do the job without emotion.//
13 Ultra Violet, Famous for 15 Minutes: My Years with Andy Warhol, New York 1990, p. 163
Epilogue by Ultra Violet
''I think back to one of Andy's earliest paintings, compelling in its simplicity - a starkly black-and-white six-foot-high Coca-Cola bottle, painted in oil on canvas in 1960. I think of the paintings of clean, shiny Campbell soup cans, the young, unlined, fresh-scrubbed faces of Marilyn Monroe, Jackie Onassis, Ingrid Bergman, so many others.
Then gradually I begin to grasp what Andy was trying to say with all his babble about machines and sex. Where sex has turned repulsive and inhuman, machine sex beckons alluringly. Only in telephone sex, robot sex, computer sex is there escape from ugliness and cruelty. Machine sex is the only kind left that is uncontaminated, antiseptic, clean, even a little mysterious (...).
Yes, here is still another of the endless paradoxes Andy strews along our paths. In sex, as in art, (...) he reinvents shining, pristine, early morning purity.
His kind, of course: on the surface, no deeper.''
14 Ultra Violet, Op. cit., pp. 165/166
translation olivier/wylie/scha