🛠 This subject is a work in progress.
Please bear with us while improvements are being made and assume good faith until the edits are complete.
For information on how to help, see the guidelines. More subjects categorized here.

VOCALOID:AI (ボーカロイド:エーアイ) is a vocal synthesis technology that was first demonstrated in a live performance recreating the voice of the late singer Hibari Misora. The demonstration was first shown in a documentary broadcast by NHK. The singer's samples were provided by Nippon Columbia.

This is the third VOCALOID project to use a deceased singer as its basis, with Ueki-loid and hide being the previous ones.

It was first announced with VOCALOID3.


VOCALOID:AI is a "live" performance technology that allows VOCALOID to be used as a "live" singer. While "VOCALOID" covers all of Yamaha's singing-synthesis technology, "VOCALOID:AI" refers specifically to the branch of that technology's development that involves AI.

Creation process

Alongside the voice, a 3D image was used to represent Hibari Misora

As its name suggests, the software uses artificial intelligence to achieve its results. The technology was developed to run alongside 4K 3D images of the singer. The process is known as "deep learning", whereby the system learns the traits of a singer over time.

GobouP noticed that the editor version used was VOCALOID3 or VOCALOID4 and that the voice being adapted by VOCALOID:AI was VY1.[1] An ITmedia article noted that no new samples could be recorded for Hibari Misora[2] due to her death in 1989. Since new samples could not be recorded, a different approach was adopted: an AI based on "DNN" (deep neural networks, a part of AI deep-learning processing) was used instead. This meant there was no need to cut samples and feed them into VOCALOID.[2]
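To make the sample-free idea concrete, here is a deliberately tiny sketch of what a DNN-style mapping looks like: instead of cutting and concatenating recorded samples, a network learns a function from score features (pitch, phoneme identity, etc.) to acoustic features. Everything here is illustrative; the layer sizes, features, and weights are invented and bear no relation to Yamaha's actual (unpublished) model.

```python
import math
import random

random.seed(0)

def init_layer(n_in, n_out):
    """Random weights and zero biases for one dense layer (hypothetical sizes)."""
    w = [[random.gauss(0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    b = [0.0] * n_out
    return w, b

def dense(x, w, b, activation=None):
    """One fully connected layer, optionally followed by tanh."""
    out = []
    for j in range(len(b)):
        s = b[j] + sum(x[i] * w[i][j] for i in range(len(x)))
        out.append(math.tanh(s) if activation == "tanh" else s)
    return out

# 8 assumed score features in, 4 assumed acoustic features out.
W1, b1 = init_layer(8, 16)
W2, b2 = init_layer(16, 4)

def predict(score_features):
    """Map one frame of score features to acoustic features."""
    h = dense(score_features, W1, b1, activation="tanh")
    return dense(h, W2, b2)

acoustic = predict([0.0] * 8)
print(len(acoustic))  # 4
```

In a real system the weights would be trained on the archived recordings; the point of the sketch is only that the model, not a library of cut samples, carries the voice.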

NOTE: For comparison, when Ueki-loid was made, Ueki's son Kouichi provided missing information wherever the data taken from his father's vocal performances was incomplete. This included adapting the son's vocal samples to sound more like his father.[3][4]

The AI learns the traits of the vocalist and mimics them to give a one-off performance, as a real singer would in a live performance. To use this technology, data must be collected in advance, and bad data has to be filtered out. The voice created is the result of the AI learning the timbre and singing style; it even picks up on the singer's nuances.
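The data-preparation step described above can be pictured as a simple filter pass before training. The quality metric (signal-to-noise ratio) and threshold below are assumptions for illustration; the article does not say what criteria Yamaha actually used.

```python
# Hypothetical archive entries: each recording tagged with a quality score.
recordings = [
    {"id": "take_01", "snr_db": 28.0},
    {"id": "take_02", "snr_db": 9.5},   # e.g. a noisy analog transfer
    {"id": "take_03", "snr_db": 31.2},
]

MIN_SNR_DB = 15.0  # assumed quality threshold

# Keep only recordings good enough to train on; discard the rest.
training_set = [r for r in recordings if r["snr_db"] >= MIN_SNR_DB]
print([r["id"] for r in training_set])  # ['take_01', 'take_03']
```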

Deep learning took place over time, but it was noted that the system could learn the basics of the voice within a few hours, even without a GPU. Also, though the technology is listed as "VOCALOID:AI", Yamaha was not sure if it would become its own product; it carried the "VOCALOID" name because it was at an early prototype stage.[2]

Many of the results are based on feedback from those familiar with the vocalist, and on trial and error with different models.[5]

DNN is a technology that has gained popularity since 2013. It gained attention with the introduction of "Sinsy" in 2016, and Microsoft Japan released its own DNN, "VoiceText". DNN is currently the most significant development in speech technology.[5]

Note that the voice is made up of various examples from the singer's career, spanning decades' worth of data samples, with each era and style broken into more than one voicebank. The result is processed by an automated method similar to VocaListener and XSY, but with both functions active at the same time. The main voicebank used is VY1; the AI changes the voice to match the singer's own, with fairly accurate results. The AI uses the XSY-like function to switch between voicebanks, choosing whichever result its VocaListener-like process judges best. It is unconfirmed exactly which tools are used in this process: the information sources described only how it works, so it is unknown whether VocaListener and XSY are the actual functions or whether Yamaha is using tools that merely work in a similar way.
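The per-era switching described above can be sketched as a selection problem: several era/style "voicebanks" are kept separate, and for each note the system picks whichever bank a fitness score judges closest to the target performance. The bank names, the traits, and the scoring rule below are all invented for illustration; as the article notes, the actual tools Yamaha used are unconfirmed.

```python
# Hypothetical era/style voicebanks, each described by a couple of traits.
ERA_BANKS = {
    "1950s_jazz": {"brightness": 0.8, "vibrato": 0.3},
    "1970s_enka": {"brightness": 0.4, "vibrato": 0.9},
    "1980s_late": {"brightness": 0.5, "vibrato": 0.6},
}

def fitness(bank, target):
    """Lower is better: squared distance between bank traits and the target style."""
    return sum((bank[k] - target[k]) ** 2 for k in target)

def select_bank(target_style):
    """Choose the era voicebank whose traits best match the target note."""
    return min(ERA_BANKS, key=lambda name: fitness(ERA_BANKS[name], target_style))

print(select_bank({"brightness": 0.45, "vibrato": 0.85}))  # -> 1970s_enka
```

A real system would presumably blend banks (XSY-style) rather than hard-switch, and would score against learned features rather than two hand-picked traits, but the shape of the decision is the same.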

Problems with the project

In their interview with ITmedia, Yamaha's engineers described the recordings themselves, made years ago, as a form of "bottleneck" restricting the technology going forward.

One drawback of the original process concerned Hibari Misora's vocals: they existed only as analog recordings on magnetic tape. Vocal effects originally had to be applied at the time of recording, which limited how the sound engineering could work, unlike modern vocal processing where effects can be added later. In short, the recordings of Hibari Misora carried baked-in vocal effects. However, digital technology introduced later in the project allowed these effects to be removed entirely, and her later recordings were much clearer.[2]

Another issue is that her voice at her debut was different from how it sounded later in her career. She was known to have "seven colors of voice", meaning she could adapt her voice in different ways to fit a song, which also led to inconsistency within her recordings. She sounded different singing jazz than she did performing enka.[2]

These variations caused problems: if all the results were simply pasted together, the vocal became difficult to hear. The machine had to keep the vocal performances separate from each other, and the AI had to be programmed to use one set of data over another when certain conditions were met. This included separating the earlier analog recordings from the later digital ones. In this way it could produce enka that sounds like a 1970s recording, or make her vocal sound more like her later-year recordings.[5]

The next problem is that the machine itself can make mistakes and use the wrong recordings due to its own limitations, which is uncomfortable for the listener. DNN, however, can sort this with a higher degree of accuracy and is less likely to cause such problems. In the music-making process, a change in tempo causes a ripple effect that affects how the entire performance's lyrics act and sound; mastering this gives the vocal a more human-like quality. The DNN itself has bolstered the compositional quality of sound and word formation.[5]
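The tempo "ripple effect" mentioned above is easy to see if note timings are stored in beats: a single tempo change re-times every subsequent syllable, so the whole performance shifts, not just one note. The song data below is invented for illustration.

```python
def onsets_seconds(durations_beats, bpm):
    """Cumulative onset time (in seconds) of each note at a given tempo."""
    sec_per_beat = 60.0 / bpm
    t, out = 0.0, []
    for d in durations_beats:
        out.append(t)
        t += d * sec_per_beat
    return out

durations = [1, 1, 2, 1]               # note lengths in beats (hypothetical phrase)
print(onsets_seconds(durations, 120))  # [0.0, 0.5, 1.0, 2.0]
print(onsets_seconds(durations, 60))   # [0.0, 1.0, 2.0, 4.0] -- every later onset moves
```

Halving the tempo doubles every later onset, which is why a synthesis system must recompute phrasing, breaths, and consonant timing across the whole performance rather than per note.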

The resulting performance

The technology allows a virtual singer to perform live without the results being pre-made; the AI is capable of learning as it goes along, and bases its results only on the highest-quality data produced. The process was done not to fool listeners, but to draw the same reaction from the performance that Hibari Misora herself once drew.[5]

The technology is currently evolving quickly according to Yamaha.[6]

One of the show's own proposals was that if the vocal performance could give the same effect the singer herself once had, AI could move an audience.[7] The performance drew a mixed reaction from its audience.[8] The technology is very realistic, producing an effect close to having a singer inside a machine. There is very little robotic quality in the singing results, aside from normal VOCALOID engine restrictions, such as VOCALOID's issues with weak consonants and machine-like qualities.[2] As commentators since the showing have noted, it is almost as if the singer herself were in the machine. However, the performance lacked the emotions of an actual human being.

As a result of the DNN applied in this demonstration, the sound engineers noted that they have much to discuss with Yamaha about applying it elsewhere in voice synthesis processing. It currently produces the most impressive results: not merely speech synthesis, but remarkable speech synthesis.[5]