
Unsupervised Singing Voice Conversion: Facebook's AI Can Convert Your Voice Into Any Other Singer's Voice


You might have heard of the miracles artificial intelligence is performing these days. From movie recommendations on Netflix to Sophia the robot, from intelligent humanoids in video games to the Google Assistant, artificial intelligence is shaping most aspects of our lives. Either way, it has an impact on our present and our future.
Continuing the pursuit of a technology-intensive future, Facebook, in collaboration with Tel Aviv University, published new research last week on arxiv.org (Unsupervised Singing Voice Conversion). Researchers at Facebook have trained a network that can convert a recording of one singer into the voice of another singer. You can check the sample results here. All that is needed is to record a song in your own "beautiful" voice and feed the file to the machine. The AI will not disappoint you. The results will astound you.

 

Summary of the work

According to the authors of the work,
Our approach could lead, for example, to the ability to free oneself from some of the limitations of one’s own voice. The proposed network is not conditioned on the text or on the notes [and doesn’t] require parallel training data between the various singers, nor [does it] employ a transcript of the audio to either text … or to musical notes … While existing pitch correction methods … correct local pitch shifts, our work offers flexibility along the other voice characteristics.
The manuscript clearly states that no transcription technique is used, i.e. there is no song -> text or text -> song conversion.

  

Datasets used:

Two datasets are used and both are publicly available.
  1. Stanford’s Digital Archive of Mobile Performances (DAMP) corpus
  2. National University of Singapore’s Sung and Spoken Corpus (NUS-48E)

Data Augmentation:

Each audio file is used in four variants for data augmentation.
  1. The audio file played in the forward direction
  2. The audio file played in the reverse direction
  3. The raw signal multiplied by -1 and played forward
  4. The raw signal multiplied by -1 and played in reverse
The authors explain that the energy spectrum remains the same whether the signal is played forward or backward.
The network comprises a CNN encoder, Google's WaveNet decoder and a classifier. The network is trained in an unsupervised manner. Two techniques are used for training.
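The four variants listed above can be sketched in a few lines of Python. This is only an illustration on a toy list of samples, not the authors' code; the function names `augment` and `magnitude_spectrum` and the toy signal are made up for this example. The check at the end compares the magnitude spectrum of every variant with that of the original signal.

```python
import cmath

def augment(signal):
    """Return the four variants described above for a list of raw samples."""
    forward = list(signal)
    backward = forward[::-1]
    return [
        forward,                  # 1. original, played forward
        backward,                 # 2. played in reverse
        [-s for s in forward],    # 3. multiplied by -1, played forward
        [-s for s in backward],   # 4. multiplied by -1, played in reverse
    ]

def magnitude_spectrum(signal):
    """Naive O(n^2) DFT magnitudes; fine for a tiny illustrative signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]

# Hypothetical toy signal standing in for raw audio samples.
toy = [0.0, 0.5, 1.0, 0.25, -0.75, -0.5, 0.1, 0.3]
variants = augment(toy)
reference = magnitude_spectrum(toy)

# Every variant should have the same magnitude spectrum as the original.
same_spectrum = all(
    all(abs(a - b) < 1e-9 for a, b in zip(reference, magnitude_spectrum(v)))
    for v in variants
)
print(same_spectrum)
```

Reversing a real-valued signal or flipping its sign changes only the phase of each frequency component, which is why the magnitudes match.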
  1. Back translation
  2. Mixup technique

1. Back translation:

Back translation is a technique borrowed from natural language processing, where it is used in automatic machine translation (AMT). If there is no matching sample in the target language for a given input in the source language, the current AMT system translates the input automatically into the target language. The resulting pair, the input together with its machine translation, is then used as a training sample for translating in the reverse direction, from the target language back to the source language.
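To make the idea concrete, here is a minimal text-based sketch of how back translation builds training pairs. It is a toy illustration of the general technique, not the network from the paper; `TOY_LEXICON`, `toy_translate` and `build_backtranslation_pairs` are hypothetical names, and the word-by-word "translator" merely stands in for whatever model is currently available.

```python
# Hypothetical stand-in for the current source -> target model.
TOY_LEXICON = {"hello": "bonjour", "world": "monde"}

def toy_translate(sentence):
    """Translate word by word; unknown words are copied unchanged."""
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.split())

def build_backtranslation_pairs(source_sentences, translate_to_target):
    """Create (synthetic target, real source) pairs for the reverse direction."""
    pairs = []
    for source in source_sentences:
        synthetic_target = translate_to_target(source)
        # The model input is machine-generated; the label is real data.
        pairs.append((synthetic_target, source))
    return pairs

pairs = build_backtranslation_pairs(["hello world"], toy_translate)
print(pairs)  # [('bonjour monde', 'hello world')]
```

The point of the design is that the label side of each pair is always genuine data, so the reverse model learns to produce real outputs even though its inputs are synthetic.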

 

2. Mixup technique:

As the name suggests, the mixup technique creates virtual training samples by blending pairs of real ones. For example, when learning a function y = f(x), two samples (x1, y1) and (x2, y2) are combined into a new virtual sample (x', y'), typically as a weighted average of the two. It is essentially a data augmentation technique.
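A small sketch of mixup on plain Python lists is shown here. It uses the common formulation of mixup, in which both inputs and labels are blended with the same weight drawn from a Beta distribution; the function names, the `alpha` value and the toy samples are assumptions for illustration, and how exactly the paper applies mixup to singer identities is not shown.

```python
import random

def mixup(sample_a, sample_b, lam):
    """Blend two (x, y) samples with weight lam in [0, 1]."""
    (x1, y1), (x2, y2) = sample_a, sample_b
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x_mix, y_mix

def random_mixup(sample_a, sample_b, alpha=0.2):
    """Draw the weight from Beta(alpha, alpha); alpha here is an assumed value."""
    lam = random.betavariate(alpha, alpha)
    return mixup(sample_a, sample_b, lam)

# Two toy samples with one-hot labels.
first = ([2.0, 4.0], [1.0, 0.0])
second = ([6.0, 8.0], [0.0, 1.0])

x_virtual, y_virtual = mixup(first, second, lam=0.25)
print(x_virtual, y_virtual)  # [5.0, 7.0] [0.25, 0.75]
```

With lam = 0.25 the virtual sample sits three quarters of the way towards the second sample, and its label reflects the same proportions.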

According to the authors,
Our contributions are: (i) the first method, as far as we know, to perform an unsupervised singing voice conversion, where the target singer is modeled from a different song, (ii) demonstrating the effectiveness of a single encoder and a conditional decoder trained in an unsupervised way, (iii) introducing a two-phase training approach in unsupervised audio translation, in which backtranslation is used in the second phase, (iv) introducing backtranslation of mixup (virtual) identities, (v) suggesting a new augmentation scheme for training in a data efficient way.

 

Results:

The generated results are evaluated both by a classifier and by human reviewers, who rate the similarity of the generated voices to the target singing voice on a scale of 1-5. Reviewers gave a score of 4 on average, while the classifier identified the reconstructed samples and the generated samples with almost the same accuracy.

 Don't forget to share your thoughts with us in the comments below.
