VOCALOID development

This page describes how VOCALOID voicebanks are developed, based on what the studios have revealed about the software. ''Note that this is not a guide to creating an actual VOCALOID. Please consider alternatives such as UTAU.''

First developments
The VOCALOID project was an international effort and is considered the brainchild of Kenmochi Hideki, also known as the "father" of VOCALOID. In 2000, in Japan, he proposed the initial ideas on which VOCALOID was founded. Much of the research into the software came from the Pompeu Fabra University in Spain, in a project led by Mr. Kenmochi. It was purely collaborative research; selling a product based on it was not being considered at the time. At first, VOCALOID could only say vowels like "ai" (love). Four months later, VOCALOID's first real word was "asa" (morning). The original aim of VOCALOID was to act as a replacement singer for a real vocalist. Many reviewers at the time of LEON and LOLA's release noted that VOCALOID was a bold effort, as human speech is a complex thing to recreate. VOCALOID was regarded as the first of its kind to tackle singing vocals.

English and Japanese versions were developed alongside each other. The first studio on board was Crypton Future Media, which was hired to find English studios to support an English version. Sadly, the responses were mostly negative, and the only studio to enter development was Zero-G.

VOCALOID software
The VOCALOID singing synthesizer technology is categorized as concatenative synthesis, which splices and processes vocal fragments extracted from human singing voices in the frequency domain. In singing synthesis, the system produces realistic voices by adding information on vocal expressions, such as vibrato, to the score information. The VOCALOID synthesis technology was initially called "Frequency-domain Singing Articulation Splicing and Shaping" (周波数ドメイン歌唱アーティキュレーション接続法 Shūhasū-domain Kashō Articulation Setsuzoku-hō), although YAMAHA no longer uses this name on its websites. "Singing Articulation" is explained as "vocal expressions" such as vibrato, together with the vocal fragments necessary for singing. The VOCALOID and VOCALOID2 synthesis engines are designed for singing, not for reading text aloud. They also cannot naturally replicate singing expressions like hoarse voices or shouts.

System architecture
The main parts of the VOCALOID2 system are the Score Editor (VOCALOID2 Editor), the Singer Library, and the Synthesis Engine. The Synthesis Engine receives score information from the Score Editor, selects appropriate samples from the Singer Library, and concatenates them to output synthesized voices. The Score Editor and the Synthesis Engine provided by YAMAHA differ very little among VOCALOID2 products. If a VOCALOID2 product is already installed, the user can enable another VOCALOID2 product by adding its library. The system originally supported two languages, Japanese and English; with the release of VOCALOID3, support for Korean, Spanish, and Chinese was added. Other languages may become available in the future. The software works standalone (playback and export to WAV) and as a ReWire application or Virtual Studio Technology instrument (VSTi) accessible from a Digital Audio Workstation (DAW).
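The relationship between the three parts can be illustrated with a small Python sketch. This is only a toy model of the data flow described above, not YAMAHA's implementation: the class names, the `render` method, and the use of strings in place of audio samples are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int  # MIDI-style note number (assumed representation)
    unit: str   # name of the vocal fragment to sing on this note


class SingerLibrary:
    """Toy stand-in for a Singer Library: maps fragment names to samples."""

    def __init__(self, name, samples):
        self.name = name
        self.samples = samples  # strings stand in for recorded audio

    def lookup(self, unit):
        return self.samples[unit]


class SynthesisEngine:
    """Shared engine: picks samples from a library and concatenates them."""

    def __init__(self):
        self.libraries = {}

    def add_library(self, library):
        # Adding a library enables another singer without a new engine.
        self.libraries[library.name] = library

    def render(self, score, singer):
        library = self.libraries[singer]
        # One (pitch, sample) pair per note, in score order.
        return [(note.pitch, library.lookup(note.unit)) for note in score]


engine = SynthesisEngine()
engine.add_library(SingerLibrary("SingerA", {"a": "A:a", "sa": "A:sa"}))
engine.add_library(SingerLibrary("SingerB", {"a": "B:a", "sa": "B:sa"}))

score = [Note(60, "a"), Note(62, "sa")]  # what a score editor would pass on
print(engine.render(score, "SingerA"))
print(engine.render(score, "SingerB"))
```

The same score is rendered with two different libraries through one engine, which mirrors how adding a library enables another product on an existing installation.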

Score Editor
The Score Editor uses a piano roll style editor to input notes, lyrics, and some expressions. For a Japanese Singer Library, the user can input gojūon lyrics in hiragana, katakana or romaji. For an English library, the editor automatically converts the lyrics into IPA phonetic symbols using the built-in pronunciation dictionary. For unregistered words, the user can edit the phonetic symbols directly. Japanese and English libraries differ in the lyrics input method but share the same platform, so the Japanese editor can load an English library and vice versa. Because the lyric input method is library-dependent, the Japanese and English editors differ only in their menus. The Score Editor offers various parameters for adding expression to singing voices, and the user is expected to adjust these parameters to best fit the tune being synthesized. The editor supports ReWire and can be synchronized with a DAW. Real-time "playback" of songs with predefined lyrics using a MIDI keyboard is also supported.
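The dictionary-based conversion for English lyrics can be pictured roughly as follows. This is a hedged sketch, not the editor's actual logic: the dictionary contents, the function name `lyric_to_phonemes`, and the override mechanism are hypothetical, and the symbols for "sing" simply reuse the [sIN] notation that appears elsewhere on this page.

```python
# Hypothetical miniature pronunciation dictionary (the real one is built in
# to the editor and far larger).
PRONUNCIATION_DICT = {
    "sing": ["s", "I", "N"],
}


def lyric_to_phonemes(word, user_overrides=None):
    """Return phonetic symbols for a lyric word, or None if unregistered.

    user_overrides models the user typing symbols by hand for a word
    the dictionary does not contain.
    """
    overrides = user_overrides or {}
    key = word.lower()
    if key in overrides:
        return overrides[key]
    return PRONUNCIATION_DICT.get(key)


print(lyric_to_phonemes("Sing"))
print(lyric_to_phonemes("zzyzx"))                         # unregistered word
print(lyric_to_phonemes("zzyzx", {"zzyzx": ["z", "I"]}))  # made-up symbols
```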

Beginning process
A VOCALOID studio will first have to approach YAMAHA and acquire a license to produce a VOCALOID.

The price of licensing varies by circumstance. Overseas studios such as Zero-G, PowerFX and Voctro Labs pay more for their VOCALOIDs because of the rates for export outside Japan, while Japanese and other Asian studios pay less.

The cost of producing a VOCALOID is unknown, though a few figures have been disclosed:
 * ¥5,000,000 (approx. $50,000+ USD) was raised to develop Tohoku Zunko's voicebank.
 * €7,000 ($9,300+ USD) was given as the estimated cost of hiring the singer (Poucet) for ALYS alone.

Each studio is sent a construction kit that guides it through the production of each VOCALOID. Once the studio has everything in place to begin work, the process moves on to selecting the singer.

The Singer
Each VOCALOID licensee develops the Singer Library, a database of vocal fragments sampled from real people. The database must contain all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary. For example, the voice corresponding to the word "sing" ([sIN]) can be synthesized by concatenating the sequence of diphones "#-s, s-I, I-N, N-#" (# indicating a voiceless phoneme) with the sustained vowel ī. The VOCALOID system changes the pitch of these fragments so that they fit the melody.
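The diphone sequence for "sing" can be reproduced with a few lines of Python. This is a minimal sketch of the splitting step only: the function name `to_diphones` is an assumption, and the sketch leaves out the sustained vowel and the pitch shifting that the real system applies.

```python
def to_diphones(phonemes, boundary="#"):
    """Split a phoneme sequence into overlapping two-phoneme units.

    The boundary symbol is added at both ends, matching the '#'
    used in the "sing" example.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{first}-{second}" for first, second in zip(padded, padded[1:])]


print(to_diphones(["s", "I", "N"]))  # ['#-s', 's-I', 'I-N', 'N-#']
```

Each unit named in the output would need a matching recorded fragment in the Singer Library, which is why the database has to cover every phoneme combination of the target language.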

Hiring/Selecting the Vocalist
Zero-G's singer selection process during the VOCALOID1 era began by looking at what was missing from the VOCALOID range. It was decided that there was a gap for a classical soprano voice. This voice type was decided on around the time of Miriam's release, along with a type of voice suited to choir music. Prima's voice provider was a singer who answered an ad put up on a music academy website. After testing some samples in the VOCALOID software, they decided to go ahead and record her voice for the VOCALOID2 software.

Internet Co. wanted to use the voice of a singer to create a VOCALOID, but felt it would be difficult to get a singer to agree. They consulted Dwango Co., Ltd., which managed Nico Nico Douga, and Dwango suggested Gackt (神威 楽斗 Camui Gackt), a singer and actor, as he had previously provided his voice for Dwango's cell phone services. He lent his voice and named the VOCALOID Gackpoid.

Not all VOCALOIDs have professional singers behind their vocals as Prima and Miriam did. According to Crypton, professional female singers refused to provide voice samples for fear that the software might create clones of their singing voices. In response, Crypton changed its focus from imitating particular singers to creating characteristic vocals. This change of focus led to sampling the voices of voice actors, and the Japanese voice actor agency Arts Vision supported their development. Similar concerns about vocal clones have been expressed at the other studios using VOCALOID, with Zero-G refusing to release the names of its providers. Miriam Stockley (who provided the voice for Miriam) remains the only known Zero-G voice provider. PowerFX only hinted at Sweet Ann's voice provider; only the providers for Big AL and YOHIOloid are known. AH-Software named the voice providers for Miki, Kiyoteru, Yukari, and Zunko, but for legal reasons cannot name those for Kaai Yuki and Nekomura Iroha, as minors were the subject of the recordings.

For Aoki Lapis, a voice recording competition was held to find the voice provider, in which entrants uploaded the song they thought best suited her. In early August 2012, i-Style Project opened recruitment for the voice provider of Merli, the follow-up project to Aoki Lapis. The recruitment deadline was set for September 10, 2012.

The Recording process
All VOCALOIDs have a similar recording process. Recording sessions begin with the voice provider singing the phonetic sequences needed for the vocal library. Those who work on this part of the process have nicknamed this a "spell" because it sounds almost like chanting. Originally, the "spell" consisted of nonsense words, but it has been adjusted over time to make collecting voice samples easier. The length of the recording sessions varies. A Japanese voicebank may be recorded within four hours, as was the case with Gackpoid, while an English voicebank can take from one week to a month to record, due to the size of the vocal library required. A second recording may take place to give a better result.

In order to get more natural sounds, three or four different pitch ranges are required to be stored in the library. Japanese requires 500 diphones per pitch, whereas English requires 2,500. Japanese has fewer diphones because it has fewer phonemes, and most syllabic sounds are open syllables ending in a vowel. In Japanese, there are three patterns of diphones containing a consonant: voiceless-consonant, vowel-consonant, and consonant-vowel. On the other hand, English has many closed syllables ending in a consonant, and consonant-consonant and consonant-voiceless diphones as well. Thus, more diphones need to be recorded into an English library than into a Japanese one. Due to this linguistic difference, a Japanese library is not suitable for singing in English.
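The figures above imply the following rough library sizes. The short Python calculation below only multiplies the quoted per-pitch diphone counts by three and four pitch ranges; the names `DIPHONES_PER_PITCH` and `library_size` are illustrative, and real libraries may also contain sustained vowels and polyphones that are not counted here.

```python
# Per-pitch diphone counts quoted in the text above.
DIPHONES_PER_PITCH = {"Japanese": 500, "English": 2500}


def library_size(language, pitch_ranges):
    """Diphone samples needed when each diphone is recorded at every pitch range."""
    return DIPHONES_PER_PITCH[language] * pitch_ranges


for language in DIPHONES_PER_PITCH:
    low = library_size(language, 3)
    high = library_size(language, 4)
    print(f"{language}: {low} to {high} diphone samples")
# Japanese: 1500 to 2000 diphone samples
# English: 7500 to 10000 diphone samples
```

On these numbers an English library needs five times as many diphone samples as a Japanese one, which is consistent with the much longer recording times reported for English voicebanks.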

Voicebank construction
The construction of a voicebank relies mostly on the trial, error, and experience of those who work on it. Each sample needs to be assembled into the singer library piece by piece. There is also a risk that a particular sample is not correct, so editing may be needed for the production team to get the best result from the samples provided.

Many studios such as Zero-G and PowerFX do not have an in-house production team, whereas others like Crypton Future Media do. As a result, people like Anders Sodergren often work with more than one studio, and the same workers may be present on the construction teams of several studios. Another example is Luo Tianyi, who was worked on by members with experience from making VY1 and/or VY2.

Post Release
After a VOCALOID voicebank is released, planning begins for the next VOCALOID. Some studios like Zero-G are only given enough funding to cover the cost of making a single vocal, and are therefore limited in what they can do within a year. The sales of one VOCALOID can easily affect the production of the next, and studios also watch to see how well the VOCALOID fares. A VOCALOID needs to ship 1,000 units to be declared successful, according to Crypton Future Media when discussing the success of Hatsune Miku, who sold 40,000+ units in her first year and went on to sell 60,000+ over her lifetime. However, most VOCALOIDs do not sell this well: SeeU reportedly failed to meet sales expectations, and LEON and LOLA failed to make an impact in America.

KAITO's failure as a VOCALOID led to an overall smaller demand for Japanese male VOCALOIDs, and is considered part of the reason why less production goes into masculine vocals. However, VOCALOIDs are subject to producer trends and interests, and this may be turned around by a sudden rise in the popularity of a particular VOCALOID, as was the case with KAITO himself.