
VOCALOID:AI (ボーカロイド:エーアイ) is a vocal synthesis technology that incorporates artificial intelligence (AI), making it possible to create realistic sounds along with the nuances of performance, so that the result sounds as if a human were really performing.[1]

It is currently an unreleased concept as a standalone product, with demonstrations of its capabilities given over the years. However, VOCALOID:AI technology was implemented into the newest generation of VOCALOID voice synthesizers, VOCALOID6, allowing more natural and expressive singing voice synthesis.

About

[Image: VOCALOID:AI waveform synthesis.]

VOCALOID:AI allows users to enter notes and lyrics; the software not only traces that data but also determines the nuances with which the notes should be performed, for example selecting timbres, connecting notes, and applying vibrato. The result is a lively voice that sounds like a real singer. Yamaha notes that what makes VOCALOID:AI different from its predecessors is that the synthesizer "makes its own expression, sometimes in a way that the creator would never have thought of".[2]
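
As an illustration of this workflow, the sketch below (Python, with entirely hypothetical names and rules) models the note-and-lyric input a user provides and the kind of expression decisions the article says the synthesizer adds on its own; it is not Yamaha's software or API.

```python
# Hypothetical sketch of the score data VOCALOID:AI is described as consuming
# (notes + lyrics) and of the expression decisions it adds on its own.
# All names, fields, and rules are illustrative assumptions, not Yamaha's API.
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int        # MIDI note number entered by the user
    start: float      # start time in seconds
    duration: float   # length in seconds
    lyric: str        # syllable sung on this note


@dataclass
class Expression:
    timbre: str = "neutral"       # timbre selected by the AI
    vibrato_depth: float = 0.0    # vibrato the AI decides to apply
    legato_to_next: bool = False  # how this note connects to the next one


def decide_expression(notes: list[Note]) -> list[Expression]:
    """Toy stand-in for the AI's job: pick nuances the user never typed in."""
    decisions = []
    for i, note in enumerate(notes):
        decisions.append(Expression(
            timbre="soft" if note.duration > 0.5 else "neutral",
            vibrato_depth=0.3 if note.duration > 0.8 else 0.0,
            legato_to_next=i + 1 < len(notes) and note.pitch != notes[i + 1].pitch,
        ))
    return decisions


if __name__ == "__main__":
    score = [Note(67, 0.0, 0.9, "sa"), Note(69, 0.9, 0.4, "ku"), Note(71, 1.3, 1.0, "ra")]
    for note, expr in zip(score, decide_expression(score)):
        print(note.lyric, expr)
```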

Development Process

According to Yamaha, VOCALOID:AI's development can be divided into two phases: the training phase and the synthesis phase.

Basic Concept of Training the AI

During the training phase, VOCALOID:AI™ learns the timbre, singing style, and other characteristics of a target human singer’s voice with the technique of deep learning. VOCALOID:AI™ can then create a singing voice that includes the singing expressions and nuances of the original singer for any arbitrary melody and lyrics.

[Image: VOCALOID:AI Training Phase flowchart.]

Yamaha states that the first step of training is to extract the acoustic features. Timbre and pitch information are extracted from the singing voice, and timing deviation information is extracted from the paired data of the singing voice and the score. The aim of the training phase is to make the AI learn the correspondence between the acoustic features and the score information (notes and lyrics). However, some elements of the human singing voice cannot be determined from the sequence of notes and lyrics alone: even if the exact same melody and lyrics are given, a variety of timbres could be applied, depending on the singer’s singing style, the genre of the song, and the dynamics used by the singer. To address this, two auxiliary features that represent such singing styles are explicitly extracted from the training data and fed into the AI: the dynamics parameter is calculated from the singing voice itself, and the song ID information is used to represent a singing style associated with each song.

It was noted that the training phase may take a few hours or even a few days for the AI to learn the original singer’s performance, iterating its computation millions of times.
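
The training flow described above (score information plus the auxiliary dynamics and song-ID features, mapped to extracted acoustic features through many iterations) can be sketched at toy scale as follows. The data, the simple linear model, and every name are stand-ins chosen for illustration; Yamaha's actual deep-learning model is not public.

```python
# Toy sketch of the training phase: pair score information with acoustic
# features extracted from recordings, add the auxiliary features (dynamics,
# song ID), and iterate a model fit many times. An illustrative linear
# stand-in, not Yamaha's model.
import numpy as np

rng = np.random.default_rng(0)

# Pretend per-frame features extracted from the singer's recordings and score.
n_frames, n_score_feats, n_acoustic_feats, n_songs = 2000, 8, 4, 5
score_feats = rng.normal(size=(n_frames, n_score_feats))   # notes/lyrics context
dynamics = rng.uniform(size=(n_frames, 1))                  # from the voice itself
song_ids = rng.integers(0, n_songs, size=n_frames)          # which training song
song_onehot = np.eye(n_songs)[song_ids]                     # song ID as auxiliary input
acoustic = rng.normal(size=(n_frames, n_acoustic_feats))    # timbre/pitch/timing targets

X = np.hstack([score_feats, dynamics, song_onehot])
W = np.zeros((X.shape[1], n_acoustic_feats))

# The article says training iterates its computation millions of times over
# hours or days; this loop is the same idea at toy scale.
lr = 1e-3
for step in range(20_000):
    pred = X @ W
    grad = X.T @ (pred - acoustic) / n_frames
    W -= lr * grad

print("final training error:", float(np.mean((X @ W - acoustic) ** 2)))
```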

VOCALOID:AI has two remarkable features in the following synthesis phase.

Feature 1: Requesting Singing Expression to an AI Singer

While VOCALOID:AI™ automatically applies singing nuances to the synthesized sounds, they do not always match the expression that the human creators really want. Yamaha believes that it is very important for the AI synthesizer to be able to receive the creator’s musical intention, expecting the AI to SUPPORT humans as they express themselves, not TAKE AWAY human creativity. VOCALOID:AI™ therefore allows users to make requests regarding musical expression in its singing. The song ID information and the dynamics parameter introduced in the training phase are used here. Users can make requests such as “with the atmosphere of a certain song” by specifying the ID of a song used for training, or “with a slightly strong nuance” by explicitly giving the dynamics parameter to the AI. VOCALOID:AI™ responds to the request by changing the singing voice with respect to nuances such as phrasing, vibrato, deepness, and breathing, estimating how the original singer would have sung the song if asked to do so. As a result, users are able to create their own vocal tracks in a very intuitive way, just as a musical director instructs a singer in the real world.
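
A minimal sketch of what such a request might look like in code, assuming hypothetical function and parameter names (a song ID for "atmosphere", a dynamics value for nuance strength); it only illustrates the idea and is not Yamaha's interface.

```python
# Hypothetical request-to-nuance mapping: the user names a training song and a
# dynamics strength, and the synthesizer adjusts phrasing, vibrato, deepness
# and breathing accordingly. Values and names are made up for illustration.
def request_expression(song_id: int, dynamics: float) -> dict:
    strength = max(0.0, min(1.0, dynamics))
    return {
        "song_style": f"song_{song_id}",           # "with the atmosphere of song N"
        "vibrato_depth": 0.2 + 0.5 * strength,     # stronger nuance, deeper vibrato
        "breathiness": 0.6 - 0.4 * strength,       # softer requests breathe more
        "phrase_attack": "firm" if strength > 0.6 else "gentle",
    }


# "With the atmosphere of training song 3, with a slightly strong nuance."
print(request_expression(song_id=3, dynamics=0.7))
```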

Feature 2: Real-time Interaction with an AI Singer during a Performance

VOCALOID:AI™ receives requests for singing expression in real-time while the synthesis process is going on. This feature, which allows users to create music as if they were actually interacting with a virtual singer on a stage, is an especially important part of Yamaha’s AI sound synthesis technology.

The synthesis phase can be roughly divided into two steps. In the first step, the information of the entire score is input into the system. This allows the AI to understand information such as “this song consists of such structures,” “each note connects to such kinds of notes,” etc. The second step is processed frame-by-frame (e.g., 100 times per second) as the AI decides what sound to generate at that moment, given a song ID and the dynamics parameter. Each of these steps can be compared to the following steps in human singing: the first step corresponds to reading, interpreting, and understanding the music, and the second step to singing a song aloud.
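
Those two steps can be sketched as below, again with hypothetical names: the whole score is interpreted first, then output is generated frame by frame (here at 100 frames per second) while the user's current request is polled on every frame, which is what allows it to change in real time.

```python
# Illustrative two-step synthesis loop: step 1 reads the full score for global
# context, step 2 generates one frame at a time using whatever song ID and
# dynamics value the user is requesting at that moment. Purely a sketch.
import math

FRAME_RATE = 100  # frames per second, as in the example above


def read_score(score):
    """Step 1: interpret the full score (structure, how notes connect)."""
    return {"n_notes": len(score), "total_beats": sum(beats for _, beats in score)}


def synthesize_frames(score, get_request, seconds=2.0):
    """Step 2: frame-by-frame generation, polling the user's request each frame."""
    context = read_score(score)
    samples = []
    for frame in range(int(seconds * FRAME_RATE)):
        song_id, dynamics = get_request(frame)  # may change while synthesis runs
        # Toy "sound": amplitude simply follows the requested dynamics;
        # the real system would also use song_id and the score context here.
        samples.append(dynamics * math.sin(2 * math.pi * frame / FRAME_RATE))
    return context, samples


score = [(60, 1.0), (62, 1.0), (64, 2.0)]  # (pitch, beats)
context, out = synthesize_frames(score, lambda f: (3, 0.5 + 0.4 * (f > 100)))
print(context, len(out), "frames")
```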

AI Musical Instrument Sound Synthesis Technology

[Image: VOCALOID:AI Synthesis Phase flowchart.]

YAMAHA has developed an AI instrument sound synthesis technology by adapting VOCALOID:AI™ technology to a wide range of instrument sounds. Currently, the AI can produce lively performances of wind instruments such as saxophones, trumpets, flutes, clarinets, and oboes in a similar way to singing voices. This is a completely new type of musical instrument sound synthesis technology, noted for being able to express itself. Given a score, the synthesizer determines the nuances of expression by referring to the sequence of notes, then synthesizes the sound (waveform) of the instrument, including vibrato, crescendo, decrescendo, and so on, as if it were really played by a human. This makes it much simpler for users to create a full-fledged wind instrument performance track than by giving detailed musical expression parameters. The AI musical instrument sound synthesis technology also has the same two features as VOCALOID:AI.
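
As a rough illustration of that idea, the sketch below takes only a note sequence and decides per-note wind-instrument expression (vibrato, crescendo or decrescendo, articulation). The rules are simplistic placeholders for what the article attributes to the AI, not the actual algorithm.

```python
# Toy stand-in for the instrument adaptation: given only (pitch, duration)
# pairs, decide expression nuances for a wind instrument such as a saxophone.
def instrument_expression(notes):
    """notes: list of (pitch, duration_beats). Returns per-note expression."""
    plan = []
    for i, (pitch, dur) in enumerate(notes):
        rising_phrase = i + 1 < len(notes) and notes[i + 1][0] > pitch
        plan.append({
            "vibrato": dur >= 2.0,                  # long tones get vibrato
            "dynamics": "crescendo" if rising_phrase else "decrescendo",
            "tonguing": "soft" if dur < 0.5 else "normal",
        })
    return plan


sax_line = [(62, 0.5), (65, 0.5), (69, 2.0), (67, 1.0)]
for note, expr in zip(sax_line, instrument_expression(sax_line)):
    print(note, expr)
```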

Feature 1: Requesting Performance Expression to an AI Performer

Given requests such as “with the atmosphere of a certain song” or “with a slightly powerful nuance,” the AI synthesizes a waveform by estimating how the original performer would play if such requests were actually given. Depending on how the AI is trained, it could be possible to request a genre, such as “bossa nova style” or “funk style,” instead of specifying the name of a song. (This is also the case with VOCALOID:AI™.)

Feature 2: Real-time Interaction with an AI Performer during a Performance

After having the score of the entire piece read in advance, it is possible to input requests for performance expression in real-time while the synthesis process is going on.

The AI sound synthesis technology is built around the sense of a virtual AI performer playing an instrument. Therefore, rather than being used like a conventional synthesizer where users play the keys in real time, its strength lies in the way the virtual AI performer and the user interact with each other to create music. For example, it could be particularly useful in situations where human instrumentalists and AI performers play in session, or in DTM (desktop music) production.

AI Artist Stage

[Image: VOCALOID:AI AI Artist Stage flowchart.]

On Tuesday June 22, 2021, a new exhibit titled "AI Artist Stage — Creating Music Together with AI" was unveiled at the Innovation Laboratory.

Featuring proprietary AI singing synthesis technology (VOCALOID:AI™) created by Yamaha, together with a new AI wind instrument synthesis technology developed based on VOCALOID:AI, this exhibit allows visitors to enjoy performing with an AI singer or saxophonist. First, users operate the performance expression sensor to tell the AI artist how to give life to a song. The AI artist responds to user requests by making natural changes in real time to aspects of musical expression such as vibrato and the connections between notes.
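
A hedged sketch of that interaction loop, with every name invented for illustration: a sensor reading is mapped to a dynamics-style request, and the AI artist's expression is updated in real time.

```python
# Hypothetical exhibit loop: read the performance expression sensor, turn the
# value into a request, and update the AI artist's nuances each cycle.
import random
import time


def read_expression_sensor() -> float:
    """Stand-in for the exhibit's sensor; returns an expression value in 0.0-1.0."""
    return random.random()


def update_ai_artist(dynamics: float) -> dict:
    """Stand-in for sending the request to the AI singer or saxophonist."""
    return {"vibrato": round(0.2 + 0.6 * dynamics, 2),
            "note_connection": "legato" if dynamics < 0.5 else "accented"}


for _ in range(5):                # a few cycles of the interaction loop
    value = read_expression_sensor()
    print(f"sensor={value:.2f} ->", update_ai_artist(value))
    time.sleep(0.01)              # the real exhibit runs continuously
```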

Further developments of this technology may allow AI singers to sing songs users create in the manner users want, to fill in for missing members of an ensemble, or even to create backing that gives the impression of having invited someone to perform with the user at home.[3]

Hibari Misora

The technology was first demonstrated using the voice of the deceased vocalist Hibari Misora in a live performance, shown in a documentary broadcast by NHK in 2019. The singer's samples were provided by Nippon Columbia. This was the third VOCALOID project to use a deceased singer as a basis, with Ueki-loid and hide being the previous ones.

This demonstration used VOCALOID3 technology.

Creation process

[Image: Alongside the voice, a 3D image was used to represent Hibari Misora.]

As its name suggests, the software uses artificial intelligence to achieve its results. The technology was developed to run alongside 4K 3D imagery of the singer. The underlying process is known as "deep learning", in which the system learns the traits of a singer over time.

GobouP noticed that the editor version used was VOCALOID3 or VOCALOID4 and that the voice being adapted by VOCALOID:AI was VY1.[4] In an ITmedia article, it was noted that new samples could not be recorded for Hibari Misora due to her death in 1989.[5] As there were no new recordings to draw from, a different approach was adopted: a deep neural network ("DNN", part of AI deep-learning processing) was used instead. This meant there was no need to cut samples and feed them into VOCALOID.[5]

NOTE: For comparison, when Ueki-loid was made, Ueki's son Kouichi provided any missing information where the data taken from his father's vocal performances was incomplete. This included adapting the son's vocal samples to sound more like his father's.[6][7]

The AI learns the traits of the vocalist and mimics them to give a one-off performance, as a real singer would in a live performance. To use this technology, data must be collected in advance, and bad data has to be filtered out. The voice created is the result of the AI learning the timbre and singing style; it even picks up on the singer's nuances.

The process of deep learning took place over time, but it was noted that the AI could learn the basics of the voice within a few hours, even without a GPU. Also, though the technology is listed as "VOCALOID:AI", Yamaha was not sure whether it would become its own product; it carried the "VOCALOID" name because it was still in an early prototype stage.[5]

Many of the results are based on feedback from those familiar with the vocalist and on trial and error with different models.[8]

DNNs have gained popularity since 2013 and drew attention with the introduction of “Sinsy” in 2016. Microsoft Japan released its own DNN-based "VoiceText". DNNs are currently the biggest development in speech technology.[8]

Note that the voice is built from examples spanning the singer's career, covering decades' worth of data samples, with each era and style broken into more than one voicebank. The singer's result is processed in an automated method similar to VocaListener and XSY, but with both functions active at the same time. The main voicebank used is VY1, and the AI changes the voice to match the singer's own voice, with fairly accurate results. The AI uses the XSY-like process to switch between each voicebank, selecting whichever result it judges best through its VocaListener-like process. It is unconfirmed exactly which tools are used in this process, so it is unconfirmed whether VocaListener and XSY are the actual functions involved or whether Yamaha uses tools that merely work similarly, as the information sources only described how these functions worked and gave no details on the tools VOCALOID:AI used.
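
Since the paragraph above stresses that the exact tools are unconfirmed, the sketch below only illustrates the general idea it describes: several era/style voicebanks derived from the singer's career, blended frame by frame with AI-chosen weights in an XSY-like manner on top of a VY1-derived rendering. Everything here is hypothetical.

```python
# Hypothetical per-frame blend of era/style voicebanks. The banks, features,
# and weighting rule are invented; only the blending idea comes from the text.
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_feats = 400, 6

# Pretend timbre features for three era/style voicebanks of the same singer.
voicebanks = {
    "1950s_analog":  rng.normal(0.0, 1.0, size=(n_frames, n_feats)),
    "1970s_enka":    rng.normal(0.5, 1.0, size=(n_frames, n_feats)),
    "1980s_digital": rng.normal(1.0, 1.0, size=(n_frames, n_feats)),
}


def choose_weights(frame: int) -> np.ndarray:
    """Stand-in for the AI picking, per frame, which era/style fits best."""
    logits = np.array([np.sin(frame / 50), 0.2, np.cos(frame / 50)])
    w = np.exp(logits)
    return w / w.sum()


banks = np.stack(list(voicebanks.values()))           # (3, n_frames, n_feats)
blended = np.empty((n_frames, n_feats))
for f in range(n_frames):
    blended[f] = choose_weights(f) @ banks[:, f, :]   # cross-synthesis-like mix

print("blended timbre features:", blended.shape)
```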

Reception

In their interview with ITmedia, Yamaha's engineers described the old recordings themselves as a "bottleneck" for taking the technology forward.

One of the drawbacks of the original process was that Hibari Misora's vocals existed only on analog recordings on magnetic tape. Vocal effects had originally been applied when the tapes were recorded, which limited how sound engineering could work, unlike modern vocal processing where effects can be added later. In short, the recordings of Hibari Misora had vocal effects already applied to the tapes. However, digital technology developed during the project later allowed these vocal effects to be removed entirely, and later recordings of her were much clearer.[5]

Another issue is that her voice at her debut was different from how it was later in her career. She was known to have "seven colors of voice", meaning she could adapt her voice in different ways to fit a song, which led to inconsistency within her recordings as well; she sounded different singing jazz than she did singing enka.[5]

These variations caused issues: if all the results were simply pasted together, it became difficult to hear the vocal. The machine had to keep the vocal performances separate from each other, and the AI had to be programmed to use one set of data over another when certain conditions were met. This included separating the earlier analog recordings from the later digital ones. Users can make enka in this way that sounds like a "70s" recording, but equally the system can make her vocal sound more like her later recordings.[8]

The next problem is that the machine itself can make mistakes and use the wrong recordings due to its own limitations, which is uncomfortable for the listener. A DNN, however, can handle this with a higher degree of accuracy and is less likely to cause such problems. In the music-making process, a change in tempo causes a ripple effect that affects how the entire performance's lyrics act and sound; mastering this gives the vocal a more human-like quality. The DNN itself has bolstered the quality of sound composition and word formation.[8]

The Resulting performance

The technology allows a virtual singer to perform live without the results being pre-made; the AI is capable of learning as it goes along, and it bases its results only on the highest-quality data produced. The process was carried out not to fool listeners, but to draw the same reaction from the performance that Hibari Misora herself once did.[8]

The technology is currently evolving quickly according to Yamaha.[9][10]

One of the show's own propositions was that, if the vocal performance could have the same effect that the singer herself once had, AI could move an audience.[11] The performance received a mixed reaction from its audience.[12] The technology was noted to be very realistic, producing an effect close to having the singer inside a machine. There were very few robotic-sounding results, aside from normal VOCALOID engine restrictions such as weak consonants and machine-like qualities.[5] As commentators since the showing have noted, it is almost as if the actual singer herself were in the machine. However, the performance lacked the emotions of an actual human being.

As a result of the DNN applied in this demonstration, the sound engineers noted that they have much to discuss with Yamaha about applying it elsewhere in voice synthesis processing. They considered it to produce the most impressive results currently available, making it not merely speech synthesis but an exceptional form of it.[8]

VOCALOID6

On October 13, 2022, VOCALOID6 was announced and released on the same day by the Yamaha Corporation, following the announcement of server maintenance also scheduled for October 13, 2022.[13] VOCALOID6 is equipped with VOCALOID:AI, which utilizes new AI technology to achieve more natural and expressive singing voice synthesis. The software made its debut with four new vocalists, SARAH, ALLEN, HARUKA, and AKITO, as well as the software's first third-party vocalist, AI Megpoid. All five voicebanks were noted to have access to VOCALOID:AI technology and can sing in three languages, English, Japanese, and Mandarin Chinese, interchangeably using the "Multilingual" function.[14][15]

Releases

[Image: VOCALOID6 logo.]

Vocal libraries released for the VOCALOID6 engine.

  • Note: Currently, all VOCALOID6 voicebanks utilize VOCALOID:AI technology.

References
