Download e-book Data-Driven Techniques in Speech Synthesis

Data-Driven Techniques in Speech Synthesis gives a first review of this new field. All areas of speech synthesis from text are covered, including text analysis.

Shrikanth Narayanan, Abeer Alwan. Text-to-speech synthesis (TTS) is a critical research and application area in the field of multimedia interfaces. Recent advances in TTS will impact a wide range of disciplines, from education, business, and entertainment applications to medical aids. Until recently, speech synthesis relied on models and rule-based approaches. While this yielded intelligible-sounding speech, the voice quality was unacceptable for widespread adoption.

Fortunately, there has recently been a major technological paradigm shift in how speech synthesis is done: a move from rule-based to explicitly data-driven methods. Advances in computing and corpus-driven methodologies have opened up exciting possibilities for research and development in this domain, yielding highly natural-sounding speech.

While many different parameterizations of the spectrum have been developed for synthesis [ 3 ] [ 4 ] [ 5 ] [ 6 ], few have managed to survive in the long run. The most obvious indication of this is the set of systems submitted to the annual Blizzard Challenge [ 7 ]. Very few of the statistical parametric systems submitted to the challenge since its inception use vocoders that do not rely on Mel Cepstral coefficients. These are usually converted into Mel Cepstral coefficients (MCEPs) before being used by statistical parametric systems.

This lack of new parameterizations that perform better than MCEPs is especially intriguing considering the amount of research effort that has gone into finding a replacement. An ideal parameterization for statistical parametric synthesis will have to fulfill all of the following requirements. It must be invertible. It must be robust to corruption by noise. It must be of sufficiently low dimension.


It must be in an interpolable space. Even if a parameterization technique were invented that could comply with three of the above four requirements, the technique would be useless if it did not at least partially satisfy the remaining one. Therein lies the difficulty of inventing a new parameterization. Mel Cepstral coefficients satisfy all of these requirements to a reasonable extent.

However, this representation is not perfect and places a major bottleneck on the naturalness of modern parametric speech synthesizers. Techniques such as [ 9 ] and [ 10 ] rectify some of the problems that occur with this representation but the Mel Cepstral representation still leaves plenty of room for improvement. In this age of big data and deep learning, it behooves us to try to construct a parameterization purely from data which might be more adept at dealing with all these constraints.

Neural networks themselves have existed for many years but the training algorithms that had been used were incapable of effectively training networks that had a large number of hidden layers [ 11 ]. This is because the standard technique used for training a neural network is the backpropagation algorithm [ 12 ].


The algorithm works by propagating the errors made by the neural network at the output layer back to hidden layers and then adjusting the weights of the hidden layers using gradient descent or other techniques to minimize this error. When the network is very deep, the propagated error to the first few hidden layers becomes very small. As a result, the parameters of the first few layers change very little in training. One strategy that was developed in recent years was to start off by training the neural network one pair of layers at a time and then building the next pair on top of previous ones [ 13 ] [ 14 ].
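The shrinking error signal described above can be illustrated with a minimal sketch. It assumes a chain of scalar sigmoid units with hypothetical weights of 1.0, which is far simpler than a real network, and it only shows how the chain-rule product decays with depth.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_through_chain(x, weights):
    """Return d(output)/d(x) for a chain of scalar sigmoid units.

    Each unit computes sigmoid(w * previous_activation). By the chain rule,
    the gradient is the product of w * sigmoid'(z) over all units.
    """
    activation = x
    grad = 1.0
    for w in weights:
        activation = sigmoid(w * activation)
        # sigmoid'(z) = s * (1 - s), which never exceeds 0.25.
        grad *= w * activation * (1.0 - activation)
    return grad


# Hypothetical weights of 1.0 in every layer (an assumption for illustration).
shallow = gradient_through_chain(0.5, [1.0] * 2)
deep = gradient_through_chain(0.5, [1.0] * 10)
print(f"2 layers: {shallow:.3e}, 10 layers: {deep:.3e}")
```

Because each sigmoid derivative is at most 0.25, the gradient that reaches the input side of the 10-unit chain is several orders of magnitude smaller than in the 2-unit chain.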

This step is called pretraining because the weights obtained through this process are used as the initialization for the backpropagation algorithm. Pretraining techniques are believed to provide an initialization much closer to the global optimum than the random initializations that were originally used. Our search for a technique to create a purely data-driven parameterization led us to the Stacked Denoising Autoencoder (SDA), which was developed for pretraining deep neural networks [ 15 ]. The SDA is trained in a manner more or less identical to the layer-wise pretraining procedure described in [ 16 ] and [ 13 ].

As the name suggests, the Stacked Denoising Autoencoder is constructed by stacking several Denoising Autoencoders together to form a deep neural network.


Each Denoising Autoencoder is a neural network that is trained such that it reconstructs the correct input sequence from an artificially corrupted version of the input provided to it. This process is shown in Figure 1. The network is fully connected between each layer but in the interest of clarity, the figure will only show a limited number of connections.
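The training objective of a single Denoising Autoencoder can be sketched as follows. This is an illustration under stated assumptions, not the configuration used in the paper: the masking-noise corruption, the sigmoid hidden layer, the layer sizes, the learning rate, and the toy data are all hypothetical choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def corrupt(x, drop_prob, rng):
    """Masking noise (an assumed corruption): zero each value with drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


class DenoisingAutoencoder:
    def __init__(self, n_in, n_hidden, rng):
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden
        self.V = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
        self.c = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.W, self.b)]

    def decode(self, h):
        # Linear output layer.
        return [sum(w * v for w, v in zip(row, h)) + c
                for row, c in zip(self.V, self.c)]

    def train_step(self, clean, noisy, lr):
        """One SGD step: encode the noisy input, compare against the clean one."""
        h = self.encode(noisy)
        y = self.decode(h)
        dy = [yi - xi for yi, xi in zip(y, clean)]
        # Backpropagate through the decoder before its weights are updated.
        dh = [sum(self.V[i][j] * dy[i] for i in range(len(dy))) for j in range(len(h))]
        dz = [dh[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
        for i in range(len(dy)):
            for j in range(len(h)):
                self.V[i][j] -= lr * dy[i] * h[j]
            self.c[i] -= lr * dy[i]
        for j in range(len(h)):
            for k in range(len(noisy)):
                self.W[j][k] -= lr * dz[j] * noisy[k]
            self.b[j] -= lr * dz[j]


def reconstruction_error(model, data):
    """Mean squared reconstruction error on uncorrupted inputs."""
    total = 0.0
    for x in data:
        y = model.decode(model.encode(x))
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return total / len(data)


rng = random.Random(0)
# Toy 4-dimensional data with 2 underlying degrees of freedom (made up).
data = [[a, a, b, b] for a in (0.2, 0.8) for b in (0.1, 0.9)]
model = DenoisingAutoencoder(n_in=4, n_hidden=2, rng=rng)
error_before = reconstruction_error(model, data)
for _ in range(500):
    for x in data:
        model.train_step(x, corrupt(x, 0.25, rng), lr=0.1)
error_after = reconstruction_error(model, data)
print(error_before, error_after)
```

The key point is in `train_step`: the network sees the corrupted vector but is penalized against the clean one, so it has to learn structure in the data rather than copy its input.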

The SDA is of particular interest to parametric speech synthesis because this network learns to reconstruct the input from a noisy version of it, by way of a lower-dimensional set of features. It is therefore apparent that the very definition of this network fits the first three of the four requirements of an ideal parameterization.

We will discuss the fourth requirement in a later section. The SDA is actually rarely used in a task where the input needs to be reconstructed from the representation that the SDA transforms the input into. It is nearly always used to provide a lower-dimensional representation on top of which a classifier, such as logistic regression or a Support Vector Machine, is used.


An example of this is the Deep Bottleneck Features that are used in Speech Recognition [ 17 ] [ 18 ]. However, such approaches are less relevant to parametric synthesis since it is not a classification problem. We build an SDA on our features by stacking multiple Denoising Autoencoders that were built by learning to reconstruct corrupted versions of the input.

Backpropagation is used to fine-tune the MLP such that the output layer can reconstruct the input provided to the first layer through the bottleneck in the middle. Once this fine-tuning has been completed, the network is split down the middle into two parts. The section from the input layer to the bottleneck region is the encoding network, while the section from the bottleneck region to the output layer is the decoding network. The encoding network codes the speech signal into a representation which is, by design, invertible, robust to noise, and low dimensional.

This representation is the encoding that the synthesizer uses as the parameterization of the speech signal i. At synthesis time, the synthesizer predicts values of this encoding based on the input text. The decoding network converts this code back into a representation of the speech signal.
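The split into an encoding and a decoding network can be sketched as below. The tiny 3 x 2 x 3 network and its weights are made up for illustration, and the `dense` and `run` helpers are hypothetical names rather than anything from the paper.

```python
import math


def dense(weights, bias, activation):
    """Build one fully connected layer as a function on lists of floats."""
    def layer(x):
        return [activation(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(weights, bias)]
    return layer


def run(layers, x):
    """Apply a list of layers in order."""
    for layer in layers:
        x = layer(x)
    return x


def identity(z):
    return z


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fine-tuned network: input (3) -> bottleneck (2) -> output (3).
network = [
    dense([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], sigmoid),
    dense([[1.0, -1.0], [0.5, 0.5], [-0.3, 0.9]], [0.0, 0.0, 0.2], identity),
]

# Split at the bottleneck: everything before it encodes, the rest decodes.
bottleneck_index = 1
encoder = network[:bottleneck_index]
decoder = network[bottleneck_index:]

frame = [0.2, -0.4, 1.0]             # stands in for one spectral frame
code = run(encoder, frame)           # what the synthesizer would model and predict
reconstruction = run(decoder, code)  # mapped back to the spectral domain
print(code, reconstruction)
```

At synthesis time only the decoder is needed: the synthesizer predicts `code` from text, and the decoder turns it back into a spectral representation.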


This approach is similar to the one proposed for efficient speech coding in [ 19 ]. Apart from the fact that [ 19 ] proposes the use of the code for other applications, it is also different in that it specifically looks for a binary encoding. Such binary encodings are not very useful in a statistical synthesis framework because binary representations are not interpolable while synthesis is an inherently generative task.

In previous sections, we have discussed how a deep neural network will build a low-dimensional noise-robust representation of the speech signal, but what should our deep neural network actually encode? To put it more explicitly, what should be the input to our deep neural network that it can learn to reconstruct? Should it be the actual speech signal itself, the magnitude spectrum, the complex spectrum, or any of the other representations that signal processing research has provided us? In theory, the input representation should not matter since it has been proven that multilayer feedforward networks are universal approximators [ 20 ].

However, this proof places no constraints on the size or structure of the network. Nor does it provide a training algorithm that reaches the global optimum. Therefore, it is sensible to train the network on a representation that is known to be highly correlated with speech perception.


Human hearing is known to be logarithmic both in amplitude [ 21 ] , and frequency [ 22 ]. So, we propose that the Mel Spectrum and Mel Log spectrum are the most suitable representations that the network can be trained on. Despite using input and output layers that were linear, the network had difficulty working with the wide range of values in the Mel spectrum.
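The two logarithmic scales mentioned above can be sketched as follows. The source does not say which Mel formula was used, so the sketch assumes the common 2595 * log10(1 + f / 700) variant. The 8 kHz upper edge, the band count, and the band energies are made-up values.

```python
import math


def hz_to_mel(freq_hz):
    """One common Mel-scale formula (an assumption; variants exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz, high_hz, n_bands):
    """Edges, in Hz, of n_bands bands equally spaced on the Mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / n_bands
    return [mel_to_hz(low + i * step) for i in range(n_bands + 1)]


def dynamic_range(values):
    return max(values) - min(values)


# Logarithmic in frequency: bands get wider in Hz as frequency increases.
edges = mel_band_edges(0.0, 8000.0, 8)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

# Logarithmic in amplitude: the log compresses a wide range of band energies.
energies = [1e-4, 3e-2, 5.0, 2e3]  # made-up Mel band energies
log_energies = [math.log(e) for e in energies]

print([round(w) for w in widths])
print(dynamic_range(energies), dynamic_range(log_energies))
```

The compressed range of `log_energies` compared with `energies` is consistent with the difficulty reported above when training on the raw Mel spectrum.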

Therefore, for the rest of this paper, we will only describe our attempts at using a deep neural network to get an invertible, low-dimensional, noise-robust representation of the Mel Log Spectrum. The input to the neural network was a dimensional Mel Log Spectral vector which was obtained from a point FFT of a 25ms speech frame. The encoding obtained using the network is dimensional.


This encoding size was chosen to make it easier for us to compare the quality with the dimensional MCEP representation used in our baseline system. The Stacked Denoising Autoencoder was built in a x x 75 x 50 configuration i. This results in an MLP with a x x 75 x 50 x 75 x x configuration for fine-tuning. The encoding network will therefore have a configuration of x x 75 x 50 and the decoding network, 50 x 75 x x. In all of these networks, the layer that is in contact with the Mel Log Spectra is a linear layer with no non-linear function involved.

This is so that the layer can deal with the range of values that the Mel Log Spectra can take.
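The mirrored layout described above can be sketched in a few lines. The first two layer sizes are not legible in this text, so `INPUT_DIM` and `FIRST_HIDDEN` below are placeholder values. The use of sigmoid units in the inner layers is likewise an assumption, since the text only states that the layers touching the Mel Log Spectra are linear.

```python
def unroll(encoder_sizes):
    """Mirror the encoder layer sizes around the bottleneck."""
    return encoder_sizes + encoder_sizes[-2::-1]


def activations(layer_sizes):
    """Linear for the first and last weight layers, sigmoid (assumed) elsewhere."""
    n_weight_layers = len(layer_sizes) - 1
    return ["linear" if i in (0, n_weight_layers - 1) else "sigmoid"
            for i in range(n_weight_layers)]


# Placeholder sizes: NOT taken from the source, which elides them.
INPUT_DIM, FIRST_HIDDEN = 120, 100

encoder_sizes = [INPUT_DIM, FIRST_HIDDEN, 75, 50]
mlp_sizes = unroll(encoder_sizes)                    # network used for fine-tuning
decoder_sizes = mlp_sizes[len(encoder_sizes) - 1:]   # bottleneck to output

print(" x ".join(map(str, mlp_sizes)))
print(activations(mlp_sizes))
```

With the real input and first hidden sizes substituted for the placeholders, `mlp_sizes` would follow the mirrored pattern given for the fine-tuning MLP, and `decoder_sizes` the pattern given for the decoding network.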