Text to Speech Synthesis in Celebrity’s Voice

Main Article Content

Ajinkya P. Gaddime
Dhananjay P. Mane
Ruchita K. Vehale
Vaishnavi S. Khawale
D. G. Bhalke

Abstract

This paper proposes a text-to-speech synthesis system that uses a neural network architecture to generate speech directly from text in a celebrity’s voice. The system consists of a recurrent sequence-to-sequence prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model that acts as a vocoder to create time-domain waveforms from those spectrograms. We evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic features, duration, and F0. We further show that this compact acoustic intermediate representation allows a significant reduction in the size of the WaveNet architecture.
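As a rough illustration of the mel scale underlying these spectrograms, the sketch below converts between Hz and mels and places band centers evenly on the mel axis. It assumes the common HTK-style conversion formula, and the band count and frequency range (80 bands, 125 to 7600 Hz) are illustrative values chosen here, not settings reported in this paper.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mels using the HTK-style formula (an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    """Center frequencies in Hz of n_mels bands spaced evenly in mels."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]


# Illustrative settings only; not taken from the paper.
centers = mel_center_frequencies(n_mels=80, f_min=125.0, f_max=7600.0)
```

Because the bands are evenly spaced in mels, they are narrow at low frequencies and wide at high frequencies, which is what makes the representation compact compared with a linear-frequency spectrogram.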
Using this technique, we modulate the output of the vocoder according to the frequency and pitch of a specific celebrity. A database of prerecorded voice is collected for the unit selection method of concatenative synthesis. The work includes creating a database of an Indian celebrity’s voice, then clustering, indexing, and synthesizing it to produce voice output for the input text. We also worked on text normalization, which covers abbreviations, acronyms, and linguistic analysis. The system outputs phonemic features such as vowel length, vowel height, frontness, consonant voicing, consonant poi, and position in the syllable and word.
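The text normalization step can be pictured with a minimal sketch like the one below, which expands abbreviations from a lookup table and spells out all-capital acronyms letter by letter. The table `ABBREVIATIONS` and the function `normalize` are hypothetical names introduced for illustration; the rules actually used in this work are not described here.

```python
import re

# Hypothetical lookup table; a practical system would need a much larger one.
ABBREVIATIONS = {
    "dr.": "doctor",
    "mr.": "mister",
    "st.": "street",
    "etc.": "et cetera",
}


def normalize(text):
    """Expand known abbreviations and spell out all-capital acronyms."""
    words = []
    for token in text.split():
        key = token.lower()
        if key in ABBREVIATIONS:
            # Known abbreviation: replace with its spoken form.
            words.append(ABBREVIATIONS[key])
        elif re.fullmatch(r"[A-Z]{2,}[.,!?]?", token):
            # Acronym: read letter by letter, dropping trailing punctuation.
            words.append(" ".join(token.rstrip(".,!?")))
        else:
            # Ordinary word: lowercase and strip surrounding punctuation.
            words.append(token.lower().strip(".,!?"))
    return " ".join(w for w in words if w)


result = normalize("Dr. Rao tested the TTS system.")
# result == "doctor rao tested the T T S system"
```

A table-driven design keeps the expansion rules separate from the code, so new abbreviations can be added without touching the logic.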

Article Details

How to Cite
Gaddime, A., Mane, D., Vehale, R., Khawale, V., & Bhalke, D. (2020). Text to Speech Synthesis in Celebrity’s Voice. SAMRIDDHI : A Journal of Physical Sciences, Engineering and Technology, 12(SUP 2), 27-30. https://doi.org/10.18090/samriddhi.v12iS2.6
Section
Research Article
