Google’s Translatotron converts one spoken language to another, no text involved

Every day we creep a little closer to Douglas Adams’ famous and prescient babel fish. A new research project from Google takes spoken sentences in one language and outputs spoken words in another — but unlike most translation techniques, it uses no intermediate text, working solely with the audio. This makes it quick, but more importantly lets it more easily reflect the cadence and tone of the speaker’s voice.
Translatotron, as the project is called, is the culmination of several years of related work, though it’s still very much an experiment. Google’s researchers, and others, have been looking into the possibility of direct speech-to-speech translation for years, but only recently have those efforts borne fruit worth harvesting.
Translating speech is usually done by breaking the problem into smaller sequential ones: turning the source speech into text (speech-to-text, or STT), turning text in one language into text in another (machine translation), and then turning the resulting text back into speech (text-to-speech, or TTS). This works quite well, really, but it isn’t perfect; each step is prone to its own types of errors, and these can compound one another.
Furthermore, it’s not really how multilingual people translate in their own heads, judging by their own accounts of their thought processes. How exactly it works is impossible to say with certainty, but few would say that they break the speech down into text, visualize that text changing into a new language, and then read the new text. Human cognition is frequently a guide for how to advance machine learning algorithms.
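For readers who want to see why a three-step cascade can compound errors, here is a minimal sketch in Python. It is not Google's code or any real speech library: the stage functions (`toy_stt`, `toy_mt`, `toy_tts`), the "audio:" string convention, and the 90 percent accuracy figures are all hypothetical, and the arithmetic assumes each stage fails independently of the others, which real systems may not.

```python
from typing import Callable, List

# A stage takes the previous stage's output and returns its own.
Stage = Callable[[str], str]


def cascade(stages: List[Stage], source: str) -> str:
    """Run the stages in order, feeding each output into the next stage."""
    out = source
    for stage in stages:
        out = stage(out)
    return out


def end_to_end_accuracy(stage_accuracies: List[float]) -> float:
    """Chance that every stage is right, ASSUMING independent failures."""
    acc = 1.0
    for a in stage_accuracies:
        acc *= a
    return acc


# Hypothetical stand-ins for speech-to-text, machine translation and
# text-to-speech. Strings prefixed with "audio:" play the role of sound.
def toy_stt(audio: str) -> str:
    return audio.removeprefix("audio:")


def toy_mt(text: str) -> str:
    return {"hola": "hello"}.get(text, text)


def toy_tts(text: str) -> str:
    return "audio:" + text


translated = cascade([toy_stt, toy_mt, toy_tts], "audio:hola")

# Illustrative numbers only: three stages that are each right 90% of the time.
overall = end_to_end_accuracy([0.9, 0.9, 0.9])
```

Under those made-up numbers the whole chain comes out right only about 73 percent of the time, even though no single step is worse than 90 percent. A direct speech-to-speech model has a single step in which to go wrong, though that alone does not guarantee better results.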
Spectrograms of source and translated speech. The translation, let us admit, is not the best. But it sounds better!