After watching Google I/O 2018, one thing we've realized is that voice, alongside text, is going to be a significant interface between artificially intelligent systems and humans. Research on speech-to-text has been going on for quite a few years, and it has taken a big leap with the deep learning approach. In this talk, I'm going to focus on Mozilla's DeepSpeech, an open-source project for converting speech to text in Python.
Now, the new problem at hand is how an artificially intelligent system can give a human-like voice to written text. When a human speaks, there are many intricacies in our speech that are obvious to the human brain: expression, where to pause, accent, and so on are a few of the factors that play a big role in how humans talk to each other. To address this, I'm going to introduce WaveNet.
The talk will be divided into the following four segments:
- 0-5 minutes: The talk will begin with an overview of existing speech-to-text libraries and the machine learning models they use, comparing libraries such as Google's Cloud Speech-to-Text, IBM Watson, and DeepSpeech.
- 5-25 minutes: DeepSpeech is based on Baidu's Deep Speech research paper. The model translates raw audio data directly into text, without any domain-specific code in between. I'll give a quick overview of the underlying deep learning architecture used in DeepSpeech, followed by a short live demo. The code, written in Python, will be explained along with tips on hyperparameter tuning to get the best possible results.
- 25-45 minutes: The talk will then switch to the latest research in text-to-speech and how products like Alexa, Siri, and Google Assistant leverage it to sound more human. The deep learning architecture of WaveNet, open-sourced by Google's DeepMind, will be discussed, followed by a live demo and a walkthrough of the code written in Python.
- 45-50 minutes: Q&A session.
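To give a feel for the "raw audio in, text out" workflow described above, here is a minimal sketch of loading a WAV file and handing its samples to the `deepspeech` Python package. The model file name is a placeholder, and the exact `Model`/`stt` signatures have varied across DeepSpeech releases, so treat this as an illustration rather than the demo code itself.

```python
# Sketch: transcribing a WAV file with Mozilla's DeepSpeech
# (pip install deepspeech). Model file name below is a placeholder.
import wave
from array import array


def read_wav_int16(path):
    """Read a 16-bit mono WAV file and return its samples as signed 16-bit ints."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "DeepSpeech expects 16-bit samples"
        assert wav.getnchannels() == 1, "DeepSpeech expects mono audio"
        samples = array("h")  # 'h' = signed 16-bit integers
        samples.frombytes(wav.readframes(wav.getnframes()))
        return samples


def transcribe(wav_path, model_path="deepspeech-model.pbmm"):
    """Feed raw samples straight to the acoustic model -- no
    domain-specific preprocessing in between."""
    import deepspeech  # imported lazily; requires the deepspeech package
    model = deepspeech.Model(model_path)
    return model.stt(read_wav_int16(wav_path))
```

The audio-loading half uses only the standard library, so it runs anywhere; the `transcribe` call additionally needs the `deepspeech` package and a downloaded model file.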
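WaveNet's full architecture is beyond the scope of a proposal, but its key idea, stacking dilated causal convolutions whose dilation doubles at every layer, can be illustrated with a back-of-the-envelope receptive-field calculation. The layer counts below are illustrative assumptions, not DeepMind's exact configuration.

```python
# Receptive field of a WaveNet-style stack of dilated causal convolutions,
# where the dilation doubles at every layer (1, 2, 4, ..., 512).

def receptive_field(dilations, kernel_size=2):
    """Number of input samples a single output sample can 'see'."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# One stack of 10 layers with dilations 1, 2, 4, ..., 512:
one_stack = [2 ** i for i in range(10)]
print(receptive_field(one_stack))      # -> 1024 samples
# Repeating the stack 3 times widens the view roughly linearly:
print(receptive_field(one_stack * 3))  # -> 3070 samples
```

Doubling the dilation makes the receptive field grow exponentially with depth, which is how WaveNet covers long stretches of audio with only a few dozen layers.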