Vocaloid Wiki

Below is information on how the VOCALOID voicebanks are developed, based on what the studios have revealed about the software.

Note that this is not a guide on how to create an actual VOCALOID. Please consider alternatives such as UTAU.

For a list of all vocals released and in development, see status.

First developments

The VOCALOID project was an international effort, and is considered the brainchild of Kenmochi Hideki, also known as the "father" of VOCALOID. In Japan, in 2000, he proposed the initial ideas that founded VOCALOID. Much of the research into the software came from the Pompeu Fabra University in Spain, in a project led by Mr. Kenmochi. It was purely collaborative research; selling a product based on it was not being considered at the time. At first, VOCALOID could only say vowels like ai (love). Four months later, VOCALOID's first real word was "asa (morning)". The original aim of VOCALOID was to act as a replacement singer for a real vocalist. Many reviewers at the time of LEON and LOLA's release noted that VOCALOID was a bold effort, as human speech is complex to recreate. VOCALOID was regarded as the first of its kind to tackle singing vocals.

Both an English and a Japanese version were developed alongside each other. The first studio on board was Crypton Future Media, which was hired to find English studios to support an English version. Their efforts met with mostly negative responses, and the only studio to enter development was Zero-G.

For more details on this, see VOCALOID.

VOCALOID software

VOCALOID's singing synthesis technology is categorized as concatenative synthesis, which splices and processes vocal fragments extracted from human singing voices in the frequency domain. In singing synthesis, the system produces realistic voices by adding information on vocal expressions like vibrato to the score information. VOCALOID's synthesis technology was initially called "Frequency-domain Singing Articulation Splicing and Shaping" (周波数ドメイン歌唱アーティキュレーション接続法 Shūhasū-domain Kashō Articulation Setsuzoku-hō), although YAMAHA no longer uses this name on its websites. "Singing Articulation" is explained as "vocal expressions" such as vibrato, and vocal fragments necessary for singing. Most versions of the VOCALOID synthesis engine are designed for singing, not for speech. It also cannot replicate singing expressions like hoarse voices or shouts.
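
As a rough illustration of what adding a vocal expression such as vibrato to score information can mean, the sketch below modulates a note's pitch with a sine wave. The function name, the sinusoidal model, and the rate and depth defaults are assumptions made for this example; they are not VOCALOID parameters and do not describe how the engine itself works.

```python
import math

def vibrato_pitch_hz(base_hz, t, rate_hz=5.5, depth_cents=50.0):
    """Pitch in Hz at time t (seconds) for a note with sinusoidal vibrato.

    Assumed model: the pitch swings +/- depth_cents around base_hz,
    rate_hz times per second. 1200 cents make up one octave.
    """
    cents = depth_cents * math.sin(2.0 * math.pi * rate_hz * t)
    return base_hz * 2.0 ** (cents / 1200.0)

# Sample the pitch curve of A4 (440 Hz) every 10 ms for the first 50 ms.
curve = [round(vibrato_pitch_hz(440.0, i * 0.01), 2) for i in range(5)]
print(curve)
```

A real system would apply such a curve to the spliced vocal fragments; here only the pitch values are computed.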

System architecture

The main parts of the VOCALOID2 system are the Score Editor (VOCALOID2 Editor), the Singer Library, and the Synthesis Engine. The Synthesis Engine receives score information from the Score Editor, selects appropriate samples from the Singer Library, and concatenates them to output synthesized voices. There is minimal difference in the Score Editor and the Synthesis Engine provided by YAMAHA among different VOCALOID2 products. If a VOCALOID2 product is already installed, the user can enable another VOCALOID2 product by adding its library. The system originally supported two languages, Japanese and English; upon the release of VOCALOID3, support for Korean, Spanish, and Chinese was added. Other languages may become available in the future. It works standalone (playback and export to WAV) and as a ReWire application or Virtual Studio Technology instrument (VSTi) accessible from a Digital Audio Workstation (DAW).
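
The relationship between the three parts can be pictured with a toy model. Everything in the sketch below (the class names, the `add_library` and `render` methods, the sample placeholders) is hypothetical and only mirrors the description above: the engine receives score data, looks up fragments in the chosen library, and concatenates the results.

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int        # MIDI note number coming from the score editor
    fragments: list   # names of the vocal fragments needed for this note

class SingerLibrary:
    """Hypothetical stand-in for a Singer Library: fragment name -> sample."""
    def __init__(self, name, samples):
        self.name = name
        self.samples = samples

class SynthesisEngine:
    """Selects samples from a library and concatenates them in score order."""
    def __init__(self):
        self.libraries = {}

    def add_library(self, library):
        # Enabling another product amounts to adding its library.
        self.libraries[library.name] = library

    def render(self, singer, score):
        library = self.libraries[singer]
        output = []
        for note in score:
            for fragment in note.fragments:
                # A real engine would pitch-shift and splice audio here.
                output.append((note.pitch, library.samples[fragment]))
        return output

engine = SynthesisEngine()
engine.add_library(SingerLibrary("demo_singer", {"s-I": "sample_017", "I-N": "sample_042"}))
score = [Note(pitch=60, fragments=["s-I", "I-N"])]
print(engine.render("demo_singer", score))
```

Because the engine is shared, a second singer is enabled simply by calling `add_library` again with another library.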

Score Editor

The Score Editor uses a piano roll style editor to input notes, lyrics, and some expressions. For a Japanese Singer Library, the user can input gojūon lyrics in hiragana, katakana or romaji writing. For an English library, the editor automatically converts the lyrics into X-SAMPA phonetic symbols using the built-in pronunciation dictionary. The user can directly edit the phonetic symbols of unregistered words. A Japanese library and an English library differ in the lyric input method, but share the same platform. Therefore, the Japanese editor can load an English library and vice versa. As mentioned above, the lyric input method is library-dependent, and so the Japanese and English editors differ only in the menus. The Score Editor offers various parameters to add expressions to singing voices. The user is expected to adjust these parameters to best fit the synthesized tune when creating voices. This editor supports ReWire and can be synchronized with a DAW. Real-time "playback" of songs with predefined lyrics using a MIDI keyboard is also supported.
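
The dictionary lookup for English lyrics can be sketched as follows. The two-word dictionary, the function name, and the `user_overrides` parameter are assumptions for illustration; the actual built-in dictionary and its symbol set are not reproduced here.

```python
# Toy pronunciation dictionary with illustrative X-SAMPA symbols.
PRONUNCIATION = {
    "sing": ["s", "I", "N"],
    "love": ["l", "V", "v"],
}

def lyrics_to_phonemes(lyrics, user_overrides=None):
    """Map each word to phonetic symbols.

    Words missing from the dictionary get None, standing in for the
    case where the user has to enter the symbols by hand.
    """
    user_overrides = user_overrides or {}
    result = []
    for word in lyrics.lower().split():
        if word in user_overrides:
            result.append((word, user_overrides[word]))
        elif word in PRONUNCIATION:
            result.append((word, PRONUNCIATION[word]))
        else:
            result.append((word, None))  # unregistered word
    return result

print(lyrics_to_phonemes("Sing miku"))
print(lyrics_to_phonemes("Sing miku", {"miku": ["m", "i:", "k", "u:"]}))
```

The second call shows a manual edit taking precedence for an unregistered word.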

Beginning Process

A VOCALOID studio will first have to approach YAMAHA and acquire a license to produce a VOCALOID.

The price of licensing varies per circumstance. Overseas studios such as Zero-G, PowerFX and Voctro Labs pay more for their VOCALOIDs because of export rates outside Japan, while Japanese and other Asian studios pay less. It was confirmed by Joffrey of Voxwave that many groups have approached YAMAHA to create a VOCALOID but were rejected; he even suggested that the majority of attempted VOCALOIDs do not make it past this stage.[1]

The average cost of producing a VOCALOID is unknown, though a few figures have been disclosed:

  • ¥5,000,000 (approx. $50,000+ USD) was raised to develop Tohoku Zunko's vocal bank.
  • €7,000 ($9,300+ USD) was originally estimated as the cost to record the samples for ALYS, including hiring the voice provider (Poucet) and obtaining a studio.[2] This does not include other costs required to produce the product.
  • $10,000 to $12,000 was given by PowerFX as an estimate of the cost for creating a new voicebank, once a voice provider, artist and plan were in place.[3]

Once the company is in agreement to release the VOCALOID, the VOCALOID becomes theirs. They hold the license, they pay the fees, and they distribute and sell the product through their website etc.[4] Each studio is sent a construction kit which guides the studio in the production of each VOCALOID. After they have set up all the necessary means to begin work, the process moves on to selecting the singer.

Extra vocals and updates of old vocals incur additional licensing fees to release. This is why several studios focus on the sales of new VOCALOIDs rather than updating older ones.[5] However, if a VOCALOID sells well, some studios may consider updating their vocals even if they have never updated any of their vocals in the past.[6]

If the vocal is an update to an old vocal, then the older vocal will have to be analysed for issues to be found and fixed.[7] Studios often rely on user feedback to find any issues.

At any point, YAMAHA can pull licensing for VOCALOIDs, as was the case with the VOCALOID CHINA PROJECT.

For the full story, see Controversy_Concerns/Character_issues#YANHE_and_Zhiyu_Moke.

The Singer

Each VOCALOID licensee develops the Singer Library, a database of vocal fragments sampled from real people. The database must have all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary. For example, the phonemes corresponding to the word "sing" ([sIN]) can be synthesized by concatenating the sequence of diphones "#-s, s-I, I-N, N-#" (# indicating a voiceless phoneme) with the sustained vowel ī. The VOCALOID system changes the pitch of these fragments so that it fits the melody.
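
The "sing" example can be reproduced with a few lines of code. This is a minimal sketch: the function name and the list representation are assumptions, and the sustained vowel and any longer polyphones are left out.

```python
def to_diphones(phonemes, boundary="#"):
    """Turn a phoneme sequence into its chain of diphones.

    The boundary symbol is padded onto both ends, matching the "#"
    used in the example above.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["s", "I", "N"]))  # ['#-s', 's-I', 'I-N', 'N-#']
```

A word with n phonemes therefore needs n + 1 diphones from the library, before any sustained vowels are added.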

Hiring/Selecting the Vocalist

Zero-G's singer selection process during the VOCALOID1 era began by looking at what was missing in the VOCALOID range. It was decided that there was a gap for a classical soprano voice. This voice type was decided during Miriam's release, along with the type of voice suited for choir music.[8] Prima's voice provider was a singer who answered an ad put up on a music academy website. After some test samples in the VOCALOID software, they decided to go ahead and record her voice for the VOCALOID2 software.[9]

Internet Co. wanted to utilize the voice of a singer for the creation of VOCALOID, but felt it would be difficult to get a singer to agree. They consulted Dwango Co.,Ltd. who managed Nico Nico Douga, and Dwango suggested Gackt (神威 楽斗 Camui Gackt), a singer and actor, as he had previously provided his voice for Dwango's cell phone services.[10] He lent his voice and named the VOCALOID, Gackpoid.

Not all VOCALOIDs have professional singers behind their vocals like Prima and Miriam did. According to Crypton, professional female singers refused to provide voice samples, for fear that the software might create clones of their singing voices. In response, Crypton changed their focus from imitating certain singers to creating characteristic vocals. This change of focus led to sampling the voices of voice actors. The Japanese voice actor agency Arts Vision supported their development. Similar concerns regarding vocal clones have been expressed throughout the other studios using VOCALOID, with Zero-G refusing to release the names of their providers. Miriam Stockley (who provided the voice for Miriam) was the only known Zero-G voice provider until AVANNA, DEX, and DAINA had their providers revealed in later years. PowerFX only hinted at Sweet Ann's voice provider; only Big AL's and YOHIOloid's are known. AH-Software named the voice providers for Miki, Kiyoteru, Yukari, and Zunko, but for legal reasons cannot name Kaai Yuki's, as minors were the subject of the recordings.

For Aoki Lapis, a voice recording competition was held to find the voice provider, where entrants uploaded the song they thought best suited her. In early August 2012, i-Style Project started an open recruitment for the voice provider of Merli, the follow-up project to Aoki Lapis. The recruitment deadline was set to September 10, 2012.

The Recording process

All VOCALOIDs have a similar recording process. First, the recording sessions begin with the voice provider singing out the phonetics needed for the vocal library. This has been nicknamed a "spell" by those working on this part of the vocal construction process for its almost "chanting" sound. Originally, the "spell" consisted of nonsense words, but it has been adjusted over time to make getting voice samples easier.[11] During the production of Kizuna Akari, her provider Yonezawa noted that the script used to record VOCALOIDs is quite strange. It isn't just basic sounds such as "A" (あ), "I" (い), "U" (う), "Ka" (か), "Ki" (き), "Ku" (く), but contains pitch variations that can be confusing for a provider to read.[12] In addition, as noted by Hatsune Miku's provider, the script for one language can be very different from the script for another. She reported that she had become adept at reading the Japanese script, but the English one required her to almost relearn everything she had learnt about reading scripts.[13]

The length of the recording sessions varies. For Japanese VOCALOIDs, a voicebank may be recorded within four hours, as was the case with Gackpoid's voicebank, while English voicebanks can take from one week up to a month to record all their samples, due to the size of the vocal library required. Additional recording sessions may take place to give a better result.

In order to get a more natural sounding vocal, three or four different pitch ranges are required to be stored in the library. Japanese requires 500 diphones per pitch, whereas English requires 2,500. Japanese has fewer diphones because it has fewer phonemes, and most syllabic sounds are open syllables ending in a vowel. In Japanese, there are three patterns of diphones containing a consonant: voiceless-consonant, vowel-consonant, and consonant-vowel. On the other hand, English has many closed syllables ending in a consonant, and consonant-consonant and consonant-voiceless diphones as well. Thus, more diphones need to be recorded into an English library than into a Japanese one. Due to this linguistic difference, a Japanese library is not suitable for singing in English. Every other language has to go through a process similar to that of English and Japanese, with the number and type of sounds to be recorded differing. Some languages require more vowel variations, others more consonants, and others still focus on certain tone-based sounds.
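
The figures above imply a rough size for each library. The sketch below simply multiplies the quoted diphones-per-pitch counts by three or four pitch ranges; it counts diphones only and ignores sustained vowels and polyphones, so the totals are a lower-bound illustration rather than actual library sizes. The names used are hypothetical.

```python
# Diphones per pitch range, as quoted in the text.
DIPHONES_PER_PITCH = {"Japanese": 500, "English": 2500}

def library_size(language, pitch_ranges):
    """Diphone samples needed when every diphone is stored at each pitch range."""
    return DIPHONES_PER_PITCH[language] * pitch_ranges

for language in DIPHONES_PER_PITCH:
    low = library_size(language, 3)
    high = library_size(language, 4)
    print(f"{language}: {low} to {high} diphone samples")
```

Under these assumptions an English library needs five times as many recordings as a Japanese one at every pitch count, which is consistent with the longer recording times reported for English voicebanks.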

Voicebank construction

The construction of a voicebank occurs mostly through the trial, error, and experience of those who work on it. Each sample needs to be assembled into the singer library piece by piece. There is a risk of a particular sample not being correct, so editing may be needed to get the best the production team can from the samples provided. Other sounds cannot be obtained easily, as they are only formed within a cluster of other sounds and have to be cut out from among them. This is most common in European languages like Spanish and English.

Many studios such as Zero-G and PowerFX do not have an in-house production team, whereas others like Crypton Future Media and Internet Co., Ltd. do. Therefore, people like Anders Sodergren often work with more than one studio. The same workers may also be present on more than one studio's construction team. Another example of this is Luo Tianyi, who was worked on by members with experience from making VY1 and/or VY2.

YAMAHA's Final Approval

Once a VOCALOID is complete, it must be sent to YAMAHA for inspection before it can be released. At this stage, both the boxart[14] and vocal are subject to change and rejection.[15]

YAMAHA inspects the vocals and checks for issues that may require repair, which is often the reason a vocal is delayed or suspended in production. If an issue is found, the vocal is sent back to the development team for repairs. Even if there is only one minor fix to be made, the entire vocal must be resubmitted for YAMAHA to inspect from scratch. The process is the same regardless of language or studio involved. Once the testing is finished, the voice is compressed ready for release.[16] At this point, the vocal may be released.

A VOCALOID cannot be released without YAMAHA's permission. As seen with OLIVER, the process may take over a month to complete. Other VOCALOIDs confirmed to have been held up by this process include YOHIOloid, Ruby, DEX and DAINA. DEX and DAINA were observed to take approximately two months to reach the release stage, spending roughly their final two weeks waiting for compression to take place.

Besides YAMAHA, distributors that the studios partner with to sell their vocals can also reject certain illustrations, as seen with DEX and DAINA.

Post Release

Successful Release

After the release of a VOCALOID voicebank, planning begins for the release of the next VOCALOID. Some studios like Zero-G are only given enough funding to cover the costs of making a single vocal, and are therefore limited in what they can do within a year. The sales of one VOCALOID can easily affect the production of the next, and studios will also be watching to see how well the VOCALOID fares.

A VOCALOID originally needed to ship 1,000 units in order to be declared successful, as quoted by Crypton Future Media in regards to the success of Hatsune Miku, who sold 40,000+ in her first year and went on to sell 60,000+ over her lifetime.

VOCALOIDs that continue to receive updates, and studios that continue to release new vocals, can be considered to be doing at least well enough to justify new developments. In short, their VOCALOID releases have been successful enough to warrant more of them.

Unsuccessful Release

It is easy to look at the most successful VOCALOID developments and think that VOCALOID is always a profitable venture with no negative outcomes. VOCALOID is profit-based software, and sales figures are important to a product's continuation. A studio is less inclined to continue producing VOCALOIDs if few sell. Each VOCALOID requires a licensing fee from YAMAHA to produce, and each VOCALOID costs money to record for hiring the singer alone, and more still to hire those who develop it. These costs need to be recovered before the next VOCALOID can even be considered. Most VOCALOIDs do not sell this well; as reported, SeeU failed to meet sales expectations, and LEON and LOLA failed to make an impact in America.

PowerFX Systems AB is an example of a studio that had not seen much profit from its VOCALOIDs, and since 2013 it had been focusing on Soundation, its own online DAW, which was bringing in a decent profit. After both YOHIOloid and Ruby failed to pick up sales despite OLIVER's improvements, it simply pulled out of VOCALOID to focus on Soundation. A similar case is likely behind what Voctro Labs, S.L. did: they appear to have pulled out of VOCALOID to produce their own private synthesizer software technology, Voiceful.

The fact that VOCALOID is profit-based software was also noted when Macne Nana was first made into a VOCALOID. Haruna Ikezawa wanted to produce a VOCALOID but had no budget for it, and talks were held about this problem. In the end no money was exchanged and Nana entered production, but Ikezawa was warned that Nana was not being allowed for charity-related reasons.

By the time VOCALOID5 arrived, there were reports of a sense that the VOCALOID craze was over, as by 2019 the number of new characters and voicebanks in development had been greatly reduced. This led to the conclusion that VOCALOID was no longer generating the success it had in the period of 2007-2009. One sign cited was the lack of third-party vocals soon after VOCALOID5 was announced.

For the full feedback on VOCALOID5, see VOCALOID5#Criticism.

The impact of KAITO's initial failure as a VOCALOID led to an overall smaller demand for Japanese male VOCALOIDs, and is considered part of the reason why less production is put into masculine vocals.[17] However, VOCALOIDs are subject to producer trends and interests, and this may be turned around by a sudden rise in the popularity of a particular VOCALOID, as was the case with KAITO himself.[18]


  1. [1]
  2. [2]
  3. [3]
  4. link
  5. link
  6. link
  7. link
  8. New York Times - Could I Get That Song in Elvis, Please?
  9. VocaloidOtaku - Questions Targeted at Zero-G/PowerFX
  10. cnet Japan - プロがなぜ、二次創作を願うのか--Gacktが歌い、三浦建太郎が描く「がくっぽいど」 (Professional singer to provide voice for character, drawn by Kentaro Miura)
  11. link
  12. link
  13. https://www.youtube.com/watch?v=eD9H34jvh0c
  14. link
  15. link
  16. link
  17. INTERNET Watch - クリプトン・フューチャー・メディア社長 伊藤博之氏(前編) (Crypton Future Media president Hiroyuki Itō (part one))
  18. Vocaloid Blog - 出た!「2008年ニコニコ市場年間売上ランキング」でKAITOが2位! (2008 Nico Nico Market annual sales rankings, In the second place KAITO!)