Vocaloid Wiki
🛠 This subject is work in progress.
Please bear with us while improvements are being made and assume good faith until the edits are complete.
For information on how to help, see the guidelines. More subjects categorized here.
💻 Technology article work in progress.

📰 This subject requires intervention.

VOCALOID:AI (ボーカロイド:エーアイ) is a vocal synthesis technology that was first demonstrated in a live performance recreating the voice of the late singer Hibari Misora, shown as part of an NHK documentary broadcast. The singer's samples were provided by Nippon Columbia.

This is the third VOCALOID project to use a deceased singer as its basis, with Ueki-loid and hide being the previous two.

It was first announced with VOCALOID3.


VOCALOID:AI is a "live" performance technology that allows VOCALOID to be used as a "live" singer. While "VOCALOID" covers all of Yamaha's singing synthesis technology, "VOCALOID:AI" refers specifically to the branch of development that applies AI to voice synthesis.

Creation process[]

Alongside the voice, a 3D image was used to represent Hibari Misora.

As its name suggests, the software uses artificial intelligence to achieve its results. The technology is designed to run alongside 4K 3D images of the singer. Through a process known as "deep learning", it can learn the traits of a singer over time.

GobouP noticed that the editor version used was VOCALOID3 or VOCALOID4 and that the voice being adapted by VOCALOID:AI was VY1.[1] An ITmedia article noted that no new samples could be recorded for Hibari Misora,[2] due to her death in 1989. Since new samples could not be recorded, a different approach was adopted: a deep neural network (DNN, a part of AI deep learning processing) was used instead. This meant there was no need to cut samples and feed them into VOCALOID.[2]

Note: for comparison, to make Ueki-loid, Ueki's son Kouichi provided any missing information when the data taken from his father's vocal performances was incomplete. This included adapting his son's vocal samples to sound more like his father.[3][4]

The AI learns the traits of the vocalist and mimics them to give a one-off performance, as a real singer would in a live performance. To use this technology, data must be collected in advance, with bad data having to be filtered out. The voice is created by the AI learning the singer's timbre and singing style; it even picks up on the singer's nuances.
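The filtering step described above can be sketched in a simple way. This is an illustrative example only, not Yamaha's actual pipeline: the sample records, the `quality` field, and the threshold are all invented for illustration.

```python
# Hypothetical sketch of pre-training data screening: archival samples
# are scored, and anything below a quality threshold is filtered out
# before the model ever sees it. All names and values are invented.

def filter_training_data(samples, min_quality=0.6):
    """Keep only samples whose quality score clears the threshold."""
    return [s for s in samples if s["quality"] >= min_quality]

archive = [
    {"title": "early analogue take", "quality": 0.4},
    {"title": "studio master",       "quality": 0.9},
    {"title": "live broadcast",      "quality": 0.7},
]

usable = filter_training_data(archive)
print([s["title"] for s in usable])  # ['studio master', 'live broadcast']
```

In practice such screening would be based on acoustic measures of the recordings rather than a hand-assigned score, but the principle of excluding bad data before learning is the same.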

The deep learning process took place over time, but it was noted that the system could learn the basics of the voice within a few hours, even without the GPU. Also, though the technology is labelled "VOCALOID:AI", Yamaha was not sure whether it would become its own product; it carried the "VOCALOID" name while still in an early prototype stage.[2]

Many of the results are based on feedback from those familiar with the vocalist, and on trial and error with different models.[5]

DNN is a technology that has gained popularity since 2013. It attracted attention with the introduction of "Sinsy" in 2016, and Microsoft Japan released its own DNN-based "VoiceText". DNN is currently the most significant development in speech technology.[5]

Note that the voice is made up of various examples from across the singer's career, spanning decades' worth of data samples, with each era and style broken into more than one voicebank. The result is processed in an automated method similar to VocaListener and XSY, but with both functions active at the same time. The main voicebank used is VY1, and the AI changes the voice to match the singer's own, with fairly accurate results. The AI uses the XSY-like result to switch between voicebanks, choosing whichever result it judges best via its VocaListener-like process. It is unconfirmed exactly which tools are used in this process, so it is unknown whether VocaListener and XSY are the actual functions involved, or whether Yamaha uses tools that merely work in a similar way; the information sources only described how the functions worked, not which tools VOCALOID:AI used.
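The XSY-like switching between era and style voicebanks could be imagined along these lines. This is a hypothetical sketch, not the actual mechanism: the voicebank records, style scores, and selection rule are invented, and the real system presumably works on learned acoustic features rather than hand-written scores.

```python
# Hypothetical sketch: each voicebank covers one era/style of the
# singer's career, and the system picks the bank whose learned style
# best matches the current phrase. All data here is invented.

def pick_voicebank(phrase_style, voicebanks):
    """Return the voicebank with the highest score for this style."""
    return max(voicebanks, key=lambda vb: vb["styles"].get(phrase_style, 0.0))

voicebanks = [
    {"name": "early-era", "styles": {"jazz": 0.8, "enka": 0.3}},
    {"name": "late-era",  "styles": {"jazz": 0.4, "enka": 0.9}},
]

print(pick_voicebank("enka", voicebanks)["name"])  # late-era
print(pick_voicebank("jazz", voicebanks)["name"])  # early-era
```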

Problems with the project[]

In their interview with ITmedia, Yamaha's engineers described the decades-old recordings themselves as a form of "bottleneck" restricting the technology going forward.

One of the drawbacks of the original process was that Hibari Misora's vocals survived only on analogue tape recordings. This made things difficult, as vocal effects originally had to be applied at the time of recording, limiting how sound engineering could work, unlike modern vocal processing where effects can be added later. In short, recordings of Hibari Misora had vocal effects baked into the tapes. However, digital technology later introduced into the project allowed these vocal effects to be removed entirely, and her later recordings were much clearer.[2]

Another issue is that her voice at her debut was different from how it sounded later in her career. She was known for her "seven colors of voice", meaning she could adapt her voice in different ways to fit a song, which also led to inconsistency within her recordings. She sounded different singing jazz compared to her performances of enka.[2]

These variations caused problems: if all the results were simply pasted together, the vocal became difficult to hear. The machine had to keep the vocal performances separate from each other, and the AI had to be programmed to use one set of data over another when certain conditions were met. This included separating the earlier analogue recordings from the later digital ones. In this way it can produce enka that sounds like a "70s" recording, but it can equally make her vocal sound like her later-year recordings.[5]
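The conditional selection described above, keeping the eras apart and drawing on only one of them per request, can be sketched like this. The era labels, records, and selection rule are hypothetical illustrations, not the project's real code.

```python
# Hypothetical sketch: recordings are partitioned by era/medium, and a
# request for a "70s"-style result only ever draws on that partition,
# so mismatched takes are never mixed. All records are invented.

def select_recordings(recordings, era):
    """Return only the recordings from the requested era."""
    return [r for r in recordings if r["era"] == era]

catalogue = [
    {"song": "enka take A", "era": "1970s analogue"},
    {"song": "enka take B", "era": "1980s digital"},
    {"song": "jazz take C", "era": "1970s analogue"},
]

seventies = select_recordings(catalogue, "1970s analogue")
print([r["song"] for r in seventies])  # ['enka take A', 'jazz take C']
```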

The next problem is that the machine itself can make mistakes and use the wrong recordings due to its own limitations, which is uncomfortable for the listener. DNN, however, can sort this out with a higher degree of accuracy and is less likely to cause such problems. In the music-making process, a change in tempo causes a ripple effect that affects how the lyrics of the entire performance act and sound; mastering this gives the vocal a more human-like quality. The DNN itself has bolstered the compositional qualities of sound and word formation.[5]

The resulting performance[]

The technology allows a virtual singer to perform live without the results being pre-made; the AI is capable of learning as it goes along. The AI bases its results only on the highest-quality data produced. The process was done not to fool listeners, but to draw the same reaction from the performance that Hibari Misora herself once drew.[5]

The technology is currently evolving quickly according to Yamaha.[6]

One of the questions the show itself posed was whether the vocal performance could have the same effect the singer herself once had: could AI shake an audience?[7] The performance received a mixed reaction from its audience.[8] The technology is very realistic and produces an effect close to having a singer inside a machine. There is very little robotic quality in the singing results, aside from normal VOCALOID engine restrictions, including VOCALOID's issues with weak consonants and machine-like qualities.[2] As commentators since the showing have noted, it is almost as if the singer herself were in the machine. However, what the performance lacked was the emotion of an actual human being.

As a result of the DNN applied in this demonstration, the sound engineers noted that they have much to discuss with Yamaha about applying it elsewhere in voice synthesis processing. It currently produces the most impressive results: not merely speech synthesis, but remarkably convincing speech synthesis.[5]