
Communication, whether among humans or between humans and machines, can proceed along multiple channels – hence the importance of multimodal interfaces. Still, language will remain the principal communication channel. While linguistic communication can be text-based or voice-based, our concern here is vocal communication. We will sketch the state of the art in artificial voice-based communication as it relates to MPAI's wider goal: to foster the creation and increasing use of standard AI-based modules (AIMs) that facilitate implementation of varied multi-module use cases, called AI Workflows (AIWs).

With this approach in mind, the state of the art in automatic speech recognition (ASR) and text-to-speech (TTS) is considered with an eye toward current and future workflows and their benefits and dangers.

5.1 Automatic Speech Recognition
5.2 ASR issues and directions
5.3 Text-to-Speech (TTS)
5.4 Some final considerations

5.1       Automatic Speech Recognition

Classical ASR

ASR has made particularly dramatic progress in the last two decades. Throughout the 2000s, speaker-dependent ASR remained dominant: to achieve acceptable accuracy using commercially available ASR, each speaker had to provide speech samples, initially twenty minutes or more. In most systems, the speech signal to be converted into text was sliced into short segments, so that the system could estimate the probability of certain text sequences given a sequence of sound slices, generally using Hidden Markov Models (HMMs). These estimates yielded possible words or word fragments and their probability rankings; and one could go on to estimate which word sequences were most likely, using a compilation of word-sequence probabilities called a language model. The search through the associated set of possibilities – the space of possible words and word sequences – was usually managed through some variant of the Viterbi search technique.
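
To make the mechanics concrete, here is a minimal sketch of Viterbi decoding over a toy HMM; the states, probabilities, and frame count are invented for illustration and stand in for the acoustic and language models of a real recognizer.

```python
# Minimal Viterbi decoding sketch for an HMM-based recognizer.
# All model values (states, probabilities, frames) are illustrative only.
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """log_init: (S,), log_trans: (S, S), log_emit: (T, S) -> best state path."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]            # best log-probability ending in each state
    back = np.zeros((T, S), dtype=int)        # back-pointers for path recovery
    for t in range(1, T):
        cand = score[:, None] + log_trans     # extend every path by every transition
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):             # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())

# Toy run: 2 hidden states, 4 acoustic frames.
rng = np.random.default_rng(0)
log_emit = np.log(rng.dirichlet([1.0, 1.0], size=4))
log_init = np.log(np.array([0.6, 0.4]))
log_trans = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
print(viterbi(log_init, log_trans, log_emit))
```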

By means of these techniques, and with sufficient speaker-specific and domain-specific recordings and accurate transcripts as training material, accuracies well above 90% became feasible. Necessary recording time dropped within a few years from twenty-plus minutes to less than a minute as processing power steadily increased according to Moore’s Law and as usable recording databases became much larger. As a result, speaker-independent recognition had finally arrived by the early 2010s: that is, training time per new speaker had dropped to zero!

Neural ASR

Then neural speech recognition appeared on the scene: by the late 2010s, Deep Neural Networks (DNNs) had essentially replaced HMM-based systems. Fundamentally, NN models learn input-to-output relationships: when given certain patterns as input, they learn to yield certain patterns as output. For ASR, they can learn to deliver the most probable text transcripts when given suitably pre-processed speech signals. However, since speech recognition involves mediating between sequential patterns for both input (sequences of sounds) and output (sequences of graphemes – that is, letters or characters – and words), neural architectures specialized for sequences are essential. Until recently, Recurrent NNs and Convolutional NNs were preferred: the first, when computing sound-to-text probabilities for the next step of a sequence in progress, accumulates the output of all prior steps and includes it as input; the second exploits a window moving across the sequence. These have now largely given way to transformer-based architectures, which can efficiently shift attention throughout an entire sequence, thus providing superior consideration of audio and textual context.
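
For readers who want to experiment, the sketch below shows one way to run a pretrained neural recognizer; it assumes the open-source Hugging Face transformers library and its publicly released wav2vec2 checkpoint, and "meeting.wav" is a placeholder path.

```python
# Sketch: transcribing an audio file with a pretrained transformer-era ASR model.
# Assumes the Hugging Face "transformers" package (plus an audio backend) is
# installed and the public wav2vec2 checkpoint can be downloaded.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")
result = asr("meeting.wav")       # placeholder path to a mono speech recording
print(result["text"])
```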

5.2       ASR issues and directions

ASR Issues

Numerous problems remain. Much speech, whether collected in real time or from recordings, is spontaneous rather than from written materials, and consequently contains hesitations, stutters, repetitions, fragments, and other features unfriendly to recognition. Speech often occurs in noisy environments. It often involves multiparty conversations, with several voices that often overlap. The voices may be speaking different dialects and may even mix languages.

To address these and other issues, continued ASR development beyond neural network techniques themselves is under way. Numerous possible architectural variations and component interactions can be tried according to the use case. Noise reduction modules can deliver cleaner audio input. Language, dialect, and/or domain recognition modules can pre-select optimally trained variant ASR modules.
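
Such a modular setup can be sketched as a simple workflow in which each function stands in for an AI module (an AIM, in MPAI terms); the names and behaviours below are placeholders, not calls into any existing library.

```python
# Hypothetical AIW-style composition; every module below is a placeholder AIM.
def denoise(audio):
    # Noise-reduction AIM: here a no-op stand-in.
    return audio

def identify_language(audio):
    # Language/dialect-identification AIM: stand-in returning a fixed tag.
    return "en-US"

ASR_VARIANTS = {
    # ASR AIMs pre-trained for specific language/dialect variants (stand-ins).
    "en-US": lambda audio: "placeholder US-English transcript",
    "en-GB": lambda audio: "placeholder British-English transcript",
}

def speech_to_text_workflow(audio):
    clean = denoise(audio)                  # cleaner audio input
    variant = identify_language(clean)      # pre-select the best-trained ASR variant
    return ASR_VARIANTS[variant](clean)

print(speech_to_text_workflow(b"raw-audio-bytes"))
```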

Integration of knowledge sources will also be a fruitful ongoing research direction. Presently, ASR still usually operates with little knowledge of the language structure other than sequence relations. Also usually lacking is any attempt to understand the objects and relationships in the speech situation.

ASR Directions

Considerations of understanding raise the question of future use cases for ASR. As one example for now – we’ll see several more below – consider self-driving cars: the car will “know” about its dynamic environment, having acquired from “experience” (multiple instances) visual “concepts” like CAR, TRUCK, STREET, and their spatial and causative relations. And so, when recognizing user questions or commands concerning cars, trucks, streets, etc., the car will be able to use knowledge about the referents – and not only the audio and the prior text – to raise or lower probabilities of currently recognized text. But a car’s concepts could include not only visual percepts but also a wide range of sensor data, such as sounds, vibrations, lidar or radar. In coming years, this incorporation of perceptually grounded knowledge is likely to transform all areas of AI, speech recognition not least. The results will affect speech translation; transcription of all audio and video (real-time and otherwise); and in fact, every use case demanding ASR – roughly, every use case involving speech.
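
A toy sketch of how such grounded knowledge might be used: re-rank the recognizer's candidate transcripts, boosting those that mention objects the car currently perceives. The hypothesis scores, boost value, and object set are invented for illustration.

```python
# Illustrative re-ranking of ASR hypotheses using knowledge of the perceived scene.
def rerank(hypotheses, scene_objects, boost=2.0):
    """hypotheses: list of (text, combined acoustic/language-model log score)."""
    def adjusted(item):
        text, score = item
        matches = sum(word in scene_objects for word in text.lower().split())
        return score + boost * matches    # raise hypotheses naming visible referents
    return sorted(hypotheses, key=adjusted, reverse=True)

scene = {"truck", "street"}               # e.g. reported by the car's vision subsystem
hyps = [("turn left after the track", -11.2),
        ("turn left after the truck", -11.9)]
print(rerank(hyps, scene)[0][0])          # -> "turn left after the truck"
```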

Speech Analysis

While considering speech recognition, we should not overlook speech analysis to extract extra-textual information, such as sentiment or other social factors: what are the speaker’s emotions, styles, backgrounds, or attitudes? Such vocal analysis can complement textual analysis of the language. If carried out by ML, it depends heavily on the amount and quality of available data – for instance, on collections of recordings reliably labelled, or otherwise identified, for the emotions or other relevant factors.
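
A minimal sketch of such ML-based analysis follows, assuming the librosa and scikit-learn libraries; the file names and labels are placeholders for a real, reliably labelled corpus.

```python
# Sketch of speech-emotion analysis trained on labelled recordings.
# Paths and labels are placeholders; a real corpus would hold many examples.
import numpy as np
import librosa
from sklearn.svm import SVC

def features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)              # one fixed-length vector per recording

paths = ["angry_01.wav", "sad_01.wav"]    # placeholder recordings
labels = ["anger", "sadness"]             # human-assigned emotion labels

X = np.stack([features(p) for p in paths])
classifier = SVC().fit(X, labels)
print(classifier.predict([features("new_clip.wav")]))
```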

5.3       Text-to-Speech (TTS)

Synthetic speech reached an acceptable quality level – understandable if colourless and unmistakably artificial – in the 1990s. The problem was considered largely solved; and, partly for that reason, remained relatively static while ASR was rapidly and noticeably improving. We’ll look at “classical” text-to-speech first, then move on to the current neural era.

Classical TTS

Concatenative TTS

The most widely used classic technology – still in use for some purposes – was concatenative: that is, short, recorded audio segments associated with speech sounds (phonemes and their sub-parts or groupings) were stitched together (concatenated) to compose words and larger units.

The segments in question were collected from large databases of recorded speech. Utterances were segmented into individual phones, syllables, words, etc., usually using a specially modified speech recognition system yielding an alignment between sound elements and those linguistic units. An index of the units was compiled, based on the segmentation and on acoustic parameters including pitch, duration, and position among other units. And then, to build a target utterance given a text, one selected the best chain of candidate units, typically using a decision tree while extending the chain. Good results could be achieved, but maximum naturalness required large recording databases, up to dozens of hours. (Alternatives to such concatenative text-to-speech could synthesize utterances from scratch, by artificially generating waveforms. The resulting speech was less natural, but these waveform-generation methods had advantages, e.g., in size, so that they lent themselves to implementation in small devices, even toys.)
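
The selection step can be sketched as a dynamic-programming search balancing how well each candidate unit matches its target against how smoothly it joins its neighbour; the units and cost functions below are invented for illustration (real systems used far richer acoustic costs).

```python
# Toy unit-selection sketch: choose one recorded candidate per target diphone
# so that target cost plus join cost along the chain is minimal.
def select_units(targets, candidates, target_cost, join_cost):
    """targets: required units in order; candidates[i]: recorded options for targets[i]."""
    best = [{c: (target_cost(targets[0], c), [c]) for c in candidates[0]}]
    for t in range(1, len(targets)):
        layer = {}
        for c in candidates[t]:
            prev, (cost, path) = min(
                best[-1].items(),
                key=lambda item: item[1][0] + join_cost(item[0], c))
            layer[c] = (cost + join_cost(prev, c) + target_cost(targets[t], c),
                        path + [c])
        best.append(layer)
    return min(best[-1].values(), key=lambda v: v[0])[1]

targets = ["d-o", "o-n", "n-t"]                                  # target diphones
cands = [["d-o#1", "d-o#2"], ["o-n#1"], ["n-t#1", "n-t#2"]]      # database units
tc = lambda tgt, c: 0.0 if c.startswith(tgt) else 1.0            # match to target
jc = lambda a, b: 0.2 if a[-2:] == b[-2:] else 0.5               # smoothness at the join
print(select_units(targets, cands, tc, jc))
```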

General TTS Issues

Concatenative or otherwise, any speech synthesis system confronts several issues.

Allophones and Co-articulation. Phonemes are generally pronounced differently (as allophones, or phoneme variants) according to their place in words or phrases. For instance, in US English, phoneme /t/ may be pronounced with or without a puff of air (called aspiration, present in top but absent in pot). Moreover, even those variants – and all other speech sounds – will vary further in context according to the neighbouring sounds (i.e., to co-articulation effects): for instance, the puffed /t/ sounds different before different vowels. For this reason, diphones, or pairs of phonemes, are frequently used as speech sound groupings. Co-articulation changes arising from some sound sequences can be dramatic in given styles or registers, as when /t/+/y/ in don’t you becomes the /ch/ of doncha. If classical TTS systems handled such cases – they usually didn’t – it was through dedicated spellings (“doncha”) or through programs implementing hand-written combinatory rules.

Disambiguation. Then there’s the problem posed by text sequences that can be pronounced entirely differently according to their use in a sentence, like “record” in “For the record, …” vs. “We need to record this meeting.” Some analysis of sentences is needed to select the appropriate variant and resolve the ambiguity – that is, to perform disambiguation. In classical text-to-speech, this need was often met by symbolic (hand-written) parsing programs.
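
Today the analysis can be as light as part-of-speech tagging. The sketch below assumes the open-source NLTK library with its tokenizer and tagger data already downloaded; the tag-to-pronunciation mapping is deliberately simplified.

```python
# Sketch: deciding between noun and verb pronunciations of "record" by POS tag.
# Assumes NLTK with the "punkt" and "averaged_perceptron_tagger" data installed.
import nltk

for sentence in ["For the record, I disagree.",
                 "We need to record this meeting."]:
    tags = dict(nltk.pos_tag(nltk.word_tokenize(sentence)))
    pos = tags["record"]
    print(pos, "->", "RE-cord (noun)" if pos.startswith("NN") else "re-CORD (verb)")
```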

Normalization. Yet another challenge is presented by text elements whose pronunciation isn’t specified in the text at all but is instead left to the knowledge of the reader-out-loud. Numbers and dates are typical examples: 7/2/21 might be pronounced as “July second, twenty twenty-one” in the US – though variants are many, even leaving aside the matter of European writing conventions. Some way must be found to convert such symbols into pronounceable text – that is, to normalize the text.
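
A minimal normalizer for the date example might look like the sketch below; the lookup tables cover only what this single example needs and would have to be extended (and localized) for real use.

```python
# Toy text normalizer: expand US-style dates like "7/2/21" into speakable words.
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
ORDINALS = {1: "first", 2: "second", 3: "third"}       # extend as needed
YEARS = {21: "twenty-one", 22: "twenty-two"}           # extend as needed

def normalize_dates(text):
    def spell(match):
        month, day, year = (int(g) for g in match.groups())
        return f"{MONTHS[month - 1]} {ORDINALS[day]}, twenty {YEARS[year]}"
    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b", spell, text)

print(normalize_dates("The meeting moved to 7/2/21."))
# -> "The meeting moved to July second, twenty twenty-one."
```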

Pronunciation problems. Foreign or unfamiliar words (“Just hang a uey on El Camino.”) pose obvious difficulties for text-to-speech. They’re normally addressed either through compilation of specialized or custom dictionaries or through use of a guesser – a program that uses rules (then) or AI (now) to guess the most likely pronunciation.

Prosody. Some treatment is needed of prosody – movement of pitch (melody), duration (rhythm), and volume (loudness). In the classical era, the prosody of a sentence was superimposed on speech units via various digital signal processing techniques. For instance, via the Pitch Synchronous Overlap and Add (PSOLA) technique, the speech waveform was divided into small overlapping segments that could be moved further apart to decrease the pitch, or closer together to increase it. Segments could be repeated multiple times to increase the duration of a section or eliminated to decrease it. The final segments were combined by overlapping them and smoothing the overlap. The means of predicting the appropriate prosody were relatively simple – e.g., by reference to punctuation – so the results were often repetitive and lacking in expression.
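
The overlap-and-add mechanics can be illustrated with the much-simplified sketch below. It is not true PSOLA (which places segments pitch-synchronously at detected pitch marks); it only shows how re-spacing windowed segments stretches or compresses a section of speech.

```python
# Simplified overlap-add (OLA) time modification, in the spirit of PSOLA but
# without pitch-synchronous segmentation.
import numpy as np

def ola_stretch(signal, frame=400, hop_in=200, stretch=1.25):
    hop_out = int(hop_in * stretch)           # wider output spacing -> longer output
    window = np.hanning(frame)
    n_frames = 1 + (len(signal) - frame) // hop_in
    out = np.zeros(hop_out * n_frames + frame)
    for i in range(n_frames):
        seg = signal[i * hop_in:i * hop_in + frame] * window
        out[i * hop_out:i * hop_out + frame] += seg    # overlap and add, smoothing joins
    return out

tone = np.sin(2 * np.pi * 120 * np.arange(16000) / 16000)   # 1-second 120 Hz test tone
print(len(ola_stretch(tone)) / 16000, "seconds after stretching")
```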

Extra-prosodic speech features. Extra-prosodic speech features like breathiness, vocal tension, creakiness, etc. were only occasionally treated in research, e.g., by simulating the physics of the vocal tract. Using models of vocal frequency jitter and tremor, airflow noise and laryngeal asymmetries, one system was used to mimic the timbre of vocally challenged speakers, giving controlled levels of roughness, breathiness, and strain.

Neural TTS

As mentioned, neural technology learns input-to-output functions – usually from corpora of input-output examples. For neural speech synthesis, the job is now usually divided into two input-to-output problems: (1) given text, what should be the corresponding acoustic features (numbers indicating factors like segment pitch, duration, etc.) – call this acoustic feature generation; and (2) given acoustic features, what actual waveforms should be generated – call this waveform generation, the function of a vocoder.

For (1), the acoustic features are represented as spectrograms, which show frequency changes over time: in an X/Y graph, the vertical (Y) axis shows frequency, and the horizontal (X) axis shows time. (A modified frequency scale is often substituted for raw frequency: the mel frequency scale – mel for “melody” – which takes account of human perception.)
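
Computing such a mel spectrogram is routine with open-source tools. The sketch below assumes the librosa library; the file name and parameter values are placeholders, though the settings shown (80 mel bands, short hops) are typical of TTS front ends.

```python
# Sketch: the acoustic-feature target of stage (1) as an 80-band mel spectrogram.
# Assumes librosa is installed; "sample.wav" is a placeholder path.
import librosa

y, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel)     # log compression, as most TTS models expect
print(mel_db.shape)                   # (80 mel bands, number of time frames)
```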

Neural text-to-speech began as recently as 2016, when DeepMind demonstrated networks able to model raw waveforms and thus to generate speech from acoustic features. In 2017, the technology was used by others to produce such raw waveforms directly from text – and neural text-to-speech was born. At the same time, Google and Facebook offered Tacotron and VoiceLoop, which could generate acoustic features, as opposed to waveforms, from input text. Then Google proposed Tacotron2, combining a revised acoustic feature generator with the WaveNet vocoder. The entire sequence – text to waveform – is termed end-to-end speech synthesis. Now that current end-to-end systems can generate speech whose quality approaches that of humans, this methodology has been widely adopted.

End-to-end speech synthesis models are indeed attractive. Good models for given speakers or languages, or for new data, can be created with little engineering. They’re robust, since there is no pipeline of separately engineered components whose errors can accumulate. Unlike classical concatenative models, they require no large databases at run time.

Neural TTS Issues

But of course, challenges remain.

  1. Learning of models takes much time and computation. Resolution efforts have emphasized architectural variation for handling NN-based prediction of acoustic sequences. Transformer-based architectures (which, as mentioned, can scan back and forth throughout an entire sequence) are substituted for auto-regressive models, which make predictions about future sequences by reference to a limited number of past elements, or for Recurrent NNs, which refer to an accumulation of all past elements. Transformer-based sequence prediction is enhanced by also modelling the duration of speech sounds.
  2. If training data is insufficient or low in quality, speech quality suffers. The problem turns out to be strongly related to text alignment failures; so, focus has been on improving alignment by leveraging the known relations between text and speech sounds: their respective sequences march forward in tandem, and nearby text and sounds are more helpful for prediction than distant ones.
  3. Control points are absent: what you hear is what you get. Research has stressed variational auto-encoders – methods of learning representations of certain speech features as embeddings, or points in multi-dimensional (vector) space. For example, the points can represent emotions (like anger or sadness) as expressed through speech features like pitch or rhythm. That representation remains separate from, e.g., the pronunciation, and thus can be combined with it. Moreover, the emotions themselves can be blended or combined (a toy illustration of such blending follows this list). Another control tactic is to break up the speech synthesis problem into several stages or aspects, so that each aspect can be separately programmed or trained, and thus controlled. For instance, a separate pre-processing stage can handle co-articulation combinations like don’t + you to yield the pronunciation of doncha. Any such combinatory or divide-and-control methods can become elements of automated or semi-automated workflows.
  4. Prosody and pronunciation tend to be flat, since they’re averaged over large collections of training data. Intervention is possible at or after synthesis time: users can interactively post-tune preliminary flat (emotionless, bland, boring) renderings, either through demonstrations (via microphones or recordings) or via manual user interfaces. In addition, a single text-to-speech model can be made to generate speech with various speaker styles and characteristics. The trick is to create embeddings representing speakers and/or speaking styles, as opposed to emotions in our previous example.
  5. And more. The challenges surveyed above in relation to classical speech synthesis are still with us in the neural era: normalization (“Call 521-4553 after 6pm for a good time.”); disambiguation (“Chuck Berry wanted to record a new record.”); pronunciation of foreign or unfamiliar words (“Just hang a uey on El Camino.”); and so on. However, each such problem also provides an opportunity to propose a dedicated AI module (for MPAI, an AIM) as a solution.
  6. Neural Vocoders. We mentioned that neural speech synthesis can be handled or conceptualized in two stages, where the second is sound generation (acoustic-features-to-waveforms), as performed by a vocoder. That vocoder can exploit neural networks, as do the popular WaveNet and HiFi-GAN vocoders.
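
Returning to the control issue in item 3: once emotions or styles are represented as embeddings, blending them can be as simple as vector interpolation. In the sketch below the vectors are random stand-ins for what a trained variational auto-encoder would produce.

```python
# Toy illustration of blending two learned style/emotion embeddings.
import numpy as np

rng = np.random.default_rng(0)
anger, sadness = rng.normal(size=(2, 16))    # stand-ins for 16-dim learned embeddings

def blend(a, b, alpha):
    """Linear interpolation: alpha=0 -> pure a, alpha=1 -> pure b."""
    return (1 - alpha) * a + alpha * b

mixed = blend(anger, sadness, 0.5)           # would condition the synthesizer's output
print(mixed[:4])
```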

5.4       Some final considerations

TTS Evaluation

How can we judge the quality or adequacy of a speech synthesis system?

  1. Human judgements are unavoidable at the state of the art; but, once elicited, these judgments could also become input for ML, leading in time to automatic judgments approaching human ones.
  2. Establishment of common test sets will become increasingly important.
  3. Also significant will be development of automatic assessment of styles, emotions, attitudes, etc. In discussing ASR, we mentioned use of speech analysis to extract such extra-textual information. When these techniques become reliable, they can be applied to speech synthesis evaluation.

Speech Technology: Dangers

Since language is so central to human experience, linguistic technology can only be hugely influential, the more so as it grows more powerful. Like medical technology; like energy technology; like computational technology; like communications technology – linguistic technology promises to be hugely beneficial – but, inevitably, also dangerous. For example, speech recognition magnifies the danger of ubiquitous surveillance. And speech synthesis, as an element of technology’s growing capacity to simulate every aspect of perception, threatens a world of deep fakes, in which we can never be sure who said what – bad enough when the victims are celebrities or personal enemies, but worse when they are the powerful or those entrusted with authority. We can hope, however, that laws and norms will ultimately combine with technological fixes to ward off the most dystopian dangers.

Speech Technology: Benefits

But as for the potential benefits: MPAI’s aim is to promote the creation of standard modules that can be assembled in endless configurations, so that myriad beneficial systems can be created without endless reinvention of the wheel.
