
Unsupervised Singing Voice Conversion: Facebook's AI Can Convert Your Voice Into Any Other Singer's Voice

Photo by Jason Rosewell on Unsplash

You might have heard of the miracles artificial intelligence is performing these days. From movie recommendations on Netflix to Sophia the robot, from intelligent humanoids in video games to the super-intelligent Google Assistant, artificial intelligence is dominating most aspects of our lives and shaping both our present and our future.
Continuing the pursuit of a technology-intensive future, Facebook, in collaboration with Tel Aviv University, published new research last week on arxiv.org (Unsupervised Singing Voice Conversion). Researchers at Facebook have trained an intelligent machine that can instantly convert audio of one singer into the voice of another singer. You can check the sample results here. All that is needed is to record a song in your own "beautiful" voice and feed the file to the machine. AI will never disappoint you. The results will astound you.


Summary of the work

According to the authors of the work,
Our approach could lead, for example, to the ability to free oneself from some of the limitations of one’s own voice. The proposed network is not conditioned on the text or on the notes [and doesn’t] require parallel training data between the various singers, nor [does it] employ a transcript of the audio to either text … or to musical notes … While existing pitch correction methods … correct local pitch shifts, our work offers flexibility along the other voice characteristics.
The manuscript clearly states that no transcription technique is used, i.e., no song-to-text or text-to-song conversion.


Datasets used:

Two datasets are used and both are publicly available.
  1. Stanford’s Digital Archive of Mobile Performances (DAMP) corpus
  2. National University of Singapore’s Sung and Spoken Corpus (NUS-48E)

Data Augmentation:

Four data augmentation techniques are used:
  1. The audio file played in the forward direction
  2. The audio file played in reverse
  3. The raw signal multiplied by -1 and played forward
  4. The raw signal multiplied by -1 and played in reverse
The authors explain that whether the signal is played forward or backward, its energy spectrum remains the same.
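The four augmentations above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the toy `signal` array stands in for a real waveform, which in practice would be loaded from an audio file. The final check confirms the property the authors rely on: reversal and sign flipping leave the magnitude (energy) spectrum unchanged.

```python
import numpy as np

# Toy stand-in for a mono waveform loaded from an audio file.
signal = np.array([0.1, -0.4, 0.3, 0.8], dtype=np.float32)

# The four augmentations described above:
augmented = [
    signal,            # 1. played forward
    signal[::-1],      # 2. played in reverse
    -signal,           # 3. sign-flipped, played forward
    -signal[::-1],     # 4. sign-flipped, played in reverse
]

# Time reversal and sign flipping only change the phase of the spectrum,
# so the magnitude spectrum (energy content) is identical for all four.
for aug in augmented:
    assert np.allclose(np.abs(np.fft.rfft(aug)), np.abs(np.fft.rfft(signal)))
```

Because all four versions share one energy spectrum, the network sees four times as much training data without the augmentation distorting the spectral characteristics it must learn.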
The network comprises a CNN encoder, Google's WaveNet decoder, and a classifier. It is trained in an unsupervised manner using two techniques:
  1. Back translation
  2. Mixup technique

1. Back translation:

Back translation is a technique borrowed from Natural Language Processing, where it is used for translating text from one language to another. When no parallel data exists for an input in the target language, a model translating in the reverse direction generates a synthetic counterpart for it. The resulting training pair, the input together with its synthetic counterpart, is then used to learn the translation back from the target language to the source language.
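The idea carries over to voice conversion like this (a minimal sketch; `convert` is a hypothetical stand-in for the trained encoder/decoder, which in the real system would encode the audio and decode it conditioned on the target singer's identity):

```python
def convert(audio, target_singer):
    # Placeholder for the real network: here we just tag the sample
    # with the target identity so the flow of data is visible.
    return f"{audio}=>as_{target_singer}"

def backtranslation_pair(audio, source_singer, target_singer):
    """Generate a synthetic training pair for the reverse direction."""
    # 1. Translate the source sample into the target singer's voice.
    synthetic = convert(audio, target_singer)
    # 2. The (synthetic, original) pair supervises translating the
    #    synthetic sample back to the source singer.
    return synthetic, audio, source_singer

# A sample of singer A converted toward singer B yields a synthetic
# pair for training the B -> A direction.
synthetic, original, source = backtranslation_pair("sample_A", "A", "B")
```

No parallel recordings of the two singers are ever needed: the network manufactures its own supervision from the conversions it produces.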


2. Mixup technique:

As the name suggests, the mixup technique creates virtual input pairs by interpolating between existing samples. For learning a function y = f(x), an additional virtual sample (x', y') is created from two samples (x1, y1) and (x2, y2) as x' = λx1 + (1 − λ)x2 and y' = λy1 + (1 − λ)y2, where λ ∈ [0, 1]. It is effectively a data augmentation technique.
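The interpolation above is easy to sketch. This follows the general mixup formulation, with λ drawn from a Beta(α, α) distribution; the toy one-hot samples and the α value are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Sample the mixing coefficient lambda from Beta(alpha, alpha);
    # small alpha pushes lambda toward 0 or 1.
    lam = rng.beta(alpha, alpha)
    # Convex combination of both inputs and both targets.
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Two toy samples with one-hot labels for a 2-class problem.
x1, y1 = np.array([1.0, 0.0]), np.array([1.0, 0.0])
x2, y2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])

x_virtual, y_virtual = mixup(x1, y1, x2, y2)
```

Each virtual sample lies on the line segment between the two originals, which encourages the model to behave linearly between training points.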

According to the authors,
Our contributions are: (i) the first method, as far as we know, to perform an unsupervised singing voice conversion, where the target singer is modeled from a different song, (ii) demonstrating the effectiveness of a single encoder and a conditional decoder trained in an unsupervised way, (iii) introducing a two-phase training approach in unsupervised audio translation, in which backtranslation is used in the second phase, (iv) introducing backtranslation of mixup (virtual) identities, (v) suggesting a new augmentation scheme for training in a data efficient way.


Results:

The generated results were evaluated both by a classifier and by human reviewers, who judged on a scale of 1-5 how similar the generated voices were to the target singing voice. Reviewers gave an average score of 4, while the classifier identified the reconstructed samples and the generated samples with almost the same accuracy.

 Don't forget to share your thoughts with us in the comments below.
