Unsupervised Singing Voice Conversion: Facebook's AI Can Convert Your Voice Into Any Other Singer's Voice

Photo by Jason Rosewell on Unsplash

You might have heard of the miracles artificial intelligence is doing these days. From movie recommendations on Netflix to Sophia the robot, from intelligent humanoids in video games to the Google Assistant, artificial intelligence is shaping most aspects of our lives. Either way, it affects both our present and our future.
Continuing the pursuit of a technology-intensive future, Facebook, in collaboration with Tel Aviv University, published new research last week on arxiv.org (Unsupervised Singing Voice Conversion). Researchers at Facebook have trained a model that can convert audio of one singer into the voice of another singer. You can check the sample results here. All that is needed is to record a song in your own "beautiful" voice and feed the file to the machine. AI will never disappoint you. The results will astound you.


Summary of the work

According to the authors of the work,
Our approach could lead, for example, to the ability to free oneself from some of the limitations of one’s own voice. The proposed network is not conditioned on the text or on the notes [and doesn’t] require parallel training data between the various singers, nor [does it] employ a transcript of the audio to either text … or to musical notes … While existing pitch correction methods … correct local pitch shifts, our work offers flexibility along the other voice characteristics.
The manuscript clearly states that no transcription technique is used, i.e. there is no song -> text or text -> song conversion.


Datasets used:

Two datasets are used and both are publicly available.
  1. Stanford’s Digital Archive of Mobile Performances (DAMP) corpus
  2. National University of Singapore’s Sung and Spoken Corpus (NUS-48E)

Data Augmentation:

Four data augmentation techniques are used.
  1. The audio file is played in the forward direction
  2. The audio file is played in the reverse direction
  3. The raw signal is multiplied by -1 and played forward
  4. The raw signal is multiplied by -1 and played in reverse
The authors explain that the energy spectrum remains the same whether the signal is played forward or backward.
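
The four variants listed above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the function names `augment` and `energy_spectrum` are made up for this post, and a random array stands in for a real audio clip.

```python
import numpy as np

def augment(signal):
    """Return the four variants: forward, reversed, negated, negated and reversed."""
    signal = np.asarray(signal, dtype=float)
    return [signal, signal[::-1], -signal, -signal[::-1]]

def energy_spectrum(signal):
    """Squared magnitude of the real FFT, used here as the energy spectrum."""
    return np.abs(np.fft.rfft(signal)) ** 2

# A random array stands in for a real recording (assumption for illustration).
rng = np.random.default_rng(0)
clip = rng.standard_normal(1024)

variants = augment(clip)
reference = energy_spectrum(clip)

# Every variant should have the same energy spectrum as the original clip.
same_energy = [bool(np.allclose(energy_spectrum(v), reference)) for v in variants]
print(same_energy)
```

Reversing a real signal or flipping its sign changes only the phase of its Fourier transform, not the magnitude, which is why the check passes for all four variants.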
The network consists of a CNN encoder, Google's WaveNet decoder and a classifier. The network is trained in an unsupervised manner. Two techniques are used for training.
  1. Back translation
  2. Mixup technique

1. Back translation:

Back translation is a technique from Natural Language Processing, where it is used to train systems that translate text from one language to another. If there is no counterpart for an input in the target language, the current machine translation model generates a synthetic translation for it during training. The training pair, the input with its synthetic translation, is then used to train translation back from the target language to the source language.
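
The round trip behind back translation, carried over from languages to singers, can be sketched with a toy model. Everything here is an assumption made for illustration: singer identity is reduced to a constant offset, `encode` and `decode` are hypothetical stand-ins for the CNN encoder and WaveNet decoder, and the squared error is a placeholder for whatever loss the real system uses.

```python
import numpy as np

def encode(audio):
    # Toy singer-agnostic encoder: remove the mean, which stands in for singer identity.
    return audio - audio.mean()

def decode(latent, singer_offset):
    # Toy conditional decoder: re-apply the chosen singer's offset.
    return latent + singer_offset

def back_translation_loss(audio, source_offset, target_offset):
    # Step 1: convert the source sample to the target singer (a synthetic sample).
    converted = decode(encode(audio), target_offset)
    # Step 2: convert the synthetic sample back to the source singer.
    reconstructed = decode(encode(converted), source_offset)
    # Step 3: measure how far the round trip landed from the original sample.
    return float(np.mean((reconstructed - audio) ** 2))

rng = np.random.default_rng(0)
content = rng.standard_normal(1000)
content -= content.mean()              # zero-mean "content" shared by all singers
source_offset, target_offset = 0.5, -0.3
audio = content + source_offset        # a sample "sung" by the source singer

loss = back_translation_loss(audio, source_offset, target_offset)
print(loss)
```

In this toy setting the round trip is essentially lossless, so the loss is close to zero. In a real network the same loss would be non-zero and would be minimised during training.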


2. Mixup technique:

As the name suggests, the mixup technique creates virtual training samples by blending pairs of existing samples. For example, when learning a function y = f(x), an additional virtual sample (x', y') is created from two samples (x1, y1) and (x2, y2) as x' = λ·x1 + (1 − λ)·x2 and y' = λ·y1 + (1 − λ)·y2, where λ lies between 0 and 1. It is, in effect, a data augmentation technique.
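
A minimal sketch of generic mixup, assuming one-hot labels and a fixed mixing weight. The function name `mixup` and the toy vectors are made up for illustration. The paper applies the idea to singer identities, so this generic version is not the authors' exact procedure.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Blend two samples and their labels with weight lam in [0, 1]."""
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix

# Two toy samples with one-hot labels (assumption for illustration).
x1, y1 = np.array([1.0, 0.0]), np.array([1.0, 0.0])
x2, y2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])

# A fixed weight keeps the example deterministic; in practice lam is
# usually sampled at random, for example from a Beta distribution.
x_mix, y_mix = mixup(x1, y1, x2, y2, lam=0.25)
print(x_mix, y_mix)
```

The virtual sample sits a quarter of the way from the second sample to the first, and its label is blended in the same proportion, so the label weights still sum to 1.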

According to the authors,
Our contributions are: (i) the first method, as far as we know, to perform an unsupervised singing voice conversion, where the target singer is modeled from a different song, (ii) demonstrating the effectiveness of a single encoder and a conditional decoder trained in an unsupervised way, (iii) introducing a two-phase training approach in unsupervised audio translation, in which backtranslation is used in the second phase, (iv) introducing backtranslation of mixup (virtual) identities, (v) suggesting a new augmentation scheme for training in a data efficient way.



The generated results were evaluated by a classifier and by human reviewers, who rated the similarity of the generated voices to the target singing voice on a scale of 1-5. Reviewers gave a score of 4 on average, while the classifier identified the reconstructed samples and the generated samples with almost the same accuracy.

 Don't forget to share your thoughts with us in the comments below.



