Phoneme List

The phonetic system forms the basis of speech playback in the Vocaloid software. Symbols used in the phoneme system are based on X-SAMPA.

Using the Phonetic System
Note: The following applies to the Vocaloid 2 system onwards. While both programs work in a similar fashion, some things may not apply to the original Vocaloid or may work differently than in Vocaloid 2.

The Recording Process
The samples are gathered by having the provider read out a script in various keys while being recorded. The recording is then transferred into a library from which the Vocaloids pull their results. The libraries consist of various sounds recorded and separated for use with the software.

For Japanese, the script is much simpler, with each phonetic sample successfully divided across the notes with little trouble. This makes each note fairly precise.

However, for English Vocaloids, the phonetic data has to be separated by cutting sections out of the recorded samples, because some sounds simply cannot be gathered unless they are spoken as part of a word. This makes separating sounds for the English Vocaloids much harder to do. As such, Japanese Vocaloids are often more precise than English ones in their diphone sounds.

Constructing Words
Vocaloid uses a method called Frequency-domain Singing Articulation Splicing and Shaping. It takes a series of diphone and triphone samples, specified by the phonetic system, from a sample library and reassembles them according to how the word is phonetically pronounced. For example, the word "sing" (IPA: sɪŋ, written as [s I N] in the Vocaloid Phonetic System) can be synthesized by concatenating the sequence of diphones "#-s, s-ɪ, ɪ-ŋ, ŋ-#". Using the phonetic system, the user inputs the phonemes that make up the word, allowing the synthesizer to pick the correct sequence of diphones to reconstruct it. Because the vowel [ɪ] (Vocaloid: [I]) sounds different in the diphones s-ɪ (s-I) and ɪ-ŋ (I-N), the software needs to apply a "smoothing" process in the frequency domain, which blends both diphone samples into a coherent syllable fragment (otherwise the results would be unnatural and heavily artificial).
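As a rough illustration of that splicing step, the sketch below (plain Python; the function name and the "#" silence marker are illustrative conventions, not part of the Vocaloid engine) turns the phoneme sequence [s I N] into the diphone pairs the engine would look up in its library:

```python
# Minimal sketch, assuming "#" marks silence at the word boundaries.
def to_diphones(phonemes, silence="#"):
    """Turn a phoneme sequence into the diphone pairs the engine would splice together."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["s", "I", "N"]))   # ['#-s', 's-I', 'I-N', 'N-#']
```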

This way of reconstructing words is the same for all the languages in which Vocaloid is available, and the phonetic library is arranged by the same method. The fundamental difference between them is the number of samples required to reconstruct each language, which is determined by its complexity. For example, English, a language with numerous consonant clusters, numerous vowels (including diphthongs) and a complex syllable structure, requires more diphone and triphone samples than Japanese, which has a simple syllable structure, practically no consonant clusters and a five-vowel system.

The Vocaloid's dictionary will attempt to match the correct phonemes to the word the user enters, avoiding the need to input them manually. If a user allows the program to auto-find phonemes and a particular word cannot be identified or is not registered in the dictionary, it will automatically be written as a default phoneme ([u:] for English or [a] for Japanese). In that case the user will need to input the phonemes manually or add the word to the dictionary, which in both cases requires knowing how the word is written phonetically.
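A minimal sketch of that fallback behaviour, assuming a plain Python dictionary (the names and data format here are hypothetical; the real dictionary is internal to the editor):

```python
# Hypothetical look-up: return the stored phonemes, or the language's default if the word is unknown.
DEFAULT_PHONEME = {"english": ["u:"], "japanese": ["a"]}

def lookup(word, dictionary, language):
    return dictionary.get(word.lower(), DEFAULT_PHONEME[language])

user_dict = {"sing": ["s", "I", "N"]}
print(lookup("sing", user_dict, "english"))    # ['s', 'I', 'N']
print(lookup("blorft", user_dict, "english"))  # ['u:'] -> enter the phonemes manually or add the word
```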

If a user knows how words are articulated, they can infer how to write a word that isn't in the dictionary (e.g. knowing that "bung" is represented as [bh V N] and "bangle" is written as [bh { N g V l], you can infer that "bungle" has to be written as [bh V N g V l]).
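A toy illustration of that inference, reusing the two entries from the example above (the splicing index is chosen by hand here; nothing about it comes from the software):

```python
# Hypothetical known entries, taken from the example above.
known = {
    "bung":   ["bh", "V", "N"],
    "bangle": ["bh", "{", "N", "g", "V", "l"],
}

# "bungle" = all of "bung" plus the "-gle" tail of "bangle".
bungle = known["bung"] + known["bangle"][3:]
print(" ".join(bungle))   # bh V N g V l
```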

In addition, the user cannot utilize phonemes that do not exist in the voicebank being used. If the user tries to enter a phoneme manually that the Vocaloid simply does not have in its voicebank, there will be no sound at all when the Vocaloid is played back.

Due to the way the sound is articulated by the synthesizer, phonetic phenomena like coarticulation and assimilation, where phoneme sounds are affected by the adjacent phonemes, are present in the synthesized words. For that reason, the phoneme sounds do not always produce the same results; they may sound different, or weaker or stronger, according to the preceding or following phoneme sound. To make a consonant sound stronger than the following vowel, raising the consonant's Brightness, Breathiness or Dynamics will often work on some level. Another alternative is to switch the phonemes (the affected one or the one adjacent to it) with an allophone, an approximant or just a similar-sounding phoneme.

Editing the phonemes
To create and edit phonemes, a user must right-click on a note and press "Note Properties". Here they can edit a phoneme and add additional effects through the "Note Expression Property" and the "Vibrato Property" windows. As a shortcut, the user can double-click a note to edit its lyric, then press the Alt key and the down arrow key at the same time (Alt + Down Arrow) to edit the phonetic data directly. This also allows the user to use the Tab key to skip to the next note and skip back to the previous one using Shift and Tab. In Vocaloid3, from v3.030 onwards, it is possible to swap the phoneme input with the lyric input, allowing the user to edit it directly with a simple double-click.

Because some phonemes are written with more than one character, such as [u:] (for English) or [ts] (for Japanese), phonemes need to be written separated by a space. If the user does not take care of this, the synthesizer will interpret all the characters as just one symbol, which will not be recognized and will produce no sound. Capitalization also affects phonemes, because some symbols are differentiated only by case (for example, [Z] and [z] are different phonemes, so they don't produce the same result).
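The sketch below shows why the spaces and the capitalization matter, using a small, purely illustrative subset of symbols rather than any voicebank's real phoneme list:

```python
# Illustrative subset only; a real voicebank defines its own symbol inventory.
KNOWN_SYMBOLS = {"u:", "ts", "s", "z", "Z", "I", "N"}

def parse_phonemes(field):
    symbols = field.split()                                    # phonemes are space-separated
    unknown = [s for s in symbols if s not in KNOWN_SYMBOLS]   # comparison is case-sensitive
    return symbols, unknown

print(parse_phonemes("ts u:"))  # (['ts', 'u:'], [])       -> both recognized
print(parse_phonemes("tsu:"))   # (['tsu:'], ['tsu:'])     -> read as one unknown symbol: no sound
print(parse_phonemes("Z z"))    # (['Z', 'z'], [])         -> different phonemes, different sounds
```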

Additional notes
Due to the software's musical nature, monophone and polyphone samples may also need to be considered where closer vocal pitching and pronunciation are needed. The user, however, only has access to pronunciation at the phonetic level; the finer levels of vocal speech adjustment cannot currently be accessed.

Please note that not all Vocaloids have the same phonemes, such as the breathing phonemes [br1] to [br5]. There are also some phonemes that are found only in one language, so not all of the Japanese and English Vocaloids will share the same phonemes. Also, while a Vocaloid's help guide will list the alphabet of the language, it may not include additional notes.

Using One Language To Create Another
A user can use the phoneme system to create languages from scratch, so long as it is within the Vocaloid's capabilities. Due to the differences between the phonetic systems and between the individual voicebanks, there are considerations that the user must be aware of when attempting to make a Vocaloid sing a language it isn't intended for, and it is a difficult task that may take hours of trial and error.

Regardless of this, if the user is aware of the phonology of both languages, the original one for that voice and the target language, the task can be easier. Moreover, a user can be creative, even going so far as to invent languages of their own if they desire. Essentially, the more time a user spends getting familiar with the phoneme system, the more they can get out of the Vocaloid program.

However, some voices are easier to work with than others, or present some sort of advantage. A clear example is Sonika, who is regarded as one of the Vocaloids with the most potential to "sing in any language" due to her unique set-up, or Luka, who allows switching between her English and Japanese voicebanks according to the needs of the user. Users' techniques often produce surprising results; however, success is greatly influenced by how much a Vocaloid's phonetic system has phonologically in common with that of the target language, without the aid of other music/audio software. For example, due to the phonetic similarities, Japanese Vocaloids can achieve a good level of Spanish. In the introduction of SeeU it was confirmed that the Korean language is capable of mimicking a decent amount of English due to the phonetic similarities between the two.

Differences and Considerations
The user must consider several factors when attempting to make a Vocaloid sing in a language different from the one it was intended for. The first thing is that the user must be sure there are enough similarities between both languages, the original language of the Vocaloid and the target language; if that isn't the case, it is pointless to try. The more similar the two languages are, the easier it is going to be.

Besides the previous condition, there are factors the user must consider when using a Vocaloid to recreate another language. These are:
 * The Vocaloid or voicebank itself: Each Vocaloid has its own characteristics, advantages and flaws, requiring its own tricks and considerations when working with it.
 * The way the Vocaloid pronounces: Some voicebanks have a more marked pronunciation of the consonants, or sometimes they pronounce consonant clusters in a different way. The user must keep this in mind when combining phonemes to achieve a pronunciation closer to the intended language.
 * The tempo used in the song: This is important when short notes are used for some tricks or techniques. The tempo can affect them, requiring the length or duration of those notes to be readjusted.
 * The pitch range in which you are working: The voicebanks are recorded using at least two registers, one for the higher pitches and one for the lower ones. The software then creates the transition between them, generating the whole scale of notes. Because they are different recordings, the pronunciation or quality of some phonemes may have differed at the moment they were recorded. This causes the pronunciation to differ at different ranges or pitches.
 * The influence of adjacent phonemes (assimilation and coarticulation): As explained previously, assimilation and coarticulation are present in the synthesizer, so a phoneme can affect its neighbours.

Due to the individual differences between the voicebanks, different approaches obtain better results than others. Sometimes, phonemes that are not equivalent work better than the equivalent ones in the target language; for example, when Miriam sings in Japanese, [v V] /vʌ/ sounds closer to the actual pronunciation of [w a] /wa/ as the Japanese particle は than [w V] /wʌ/ does.


 * For more explanations of the differences and comparisons between English and Japanese Vocaloids, see the conversion list: English - Japanese

Techniques
Due to the way the sound is articulated by the synthesizer, simulating human speech, some of speech's phonological phenomena also appear in the software (like coarticulation). This allows the user to apply them in the software to increase the capabilities of the voicebanks.

Auxiliary Phonemes
An array of auxiliary phonemes exists in the Vocaloid software. These phonemes are used to get certain effects (like breaths) or to alter the default pronunciation (like [Sil], which is used to break the diphone transition between two phonemes). It is important to consider that different auxiliary phonemes are present in the different versions of the software, not all are available for every voicebank, and their effect or function may differ between voicebanks and versions of Vocaloid.

Coarticulation, Assimilation and Phoneme Combinations
One application of coarticulation is combining phonemes to achieve new articulations, closer to the desired ones.


Examples:
 * Inducing palatalization in an English Vocaloid singing in another language such as Japanese or Korean (in the case of the palatalized consonants) or a Romance language (in the case of the palatal nasal).
 * Generating an approximation of the TH sound (the voiceless dental fricative) by combining phonemes.

Glides or Semivowels
Glides or semivowels are sounds that share traits of a vowel (they are produced with little or no obstruction of the airstream) but are non-syllabic (in other words, they aren't the nucleus or main element of a syllable). If the user is aware of a glide and its respective vowel counterpart, they can use it in place of, or alongside, that vowel, producing interesting results.

Some possible uses of the glides are:
 * Fix choppy vowel combinations.
 * Facilitate some diphthongs or diphones.
 * Replace vowels when required.

Use of short notes
An additional technique is the use of short notes (around 1/64 or 1/32 in length). When the note is too short, the articulation will be incomplete and the sound will blend with the next note, producing interesting results. It is important to emphasize that this technique is strongly affected by the tempo; if you work at a very high or very low tempo you will probably need to readjust the length of the short notes.
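Because the technique depends on the real-time length of the note rather than its notated value, it can help to work out how long a 1/32 or 1/64 note actually lasts at a given tempo. A minimal sketch (plain Python, assuming a quarter-note beat; the function name is illustrative):

```python
# Rough guide to how tempo changes the real-time length of short notes.
def note_length_ms(bpm, note_value):
    """Duration in milliseconds of a 1/note_value note, assuming a quarter-note beat."""
    beat_ms = 60000.0 / bpm            # one quarter note
    return beat_ms * (4.0 / note_value)

for bpm in (90, 120, 180):
    print(bpm, "BPM:",
          round(note_length_ms(bpm, 32), 1), "ms (1/32),",
          round(note_length_ms(bpm, 64), 1), "ms (1/64)")
# A 1/32 note shrinks from about 83 ms at 90 BPM to about 42 ms at 180 BPM,
# which is why the note lengths usually need readjusting when the tempo changes.
```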

This technique can be utilized for:
 * Improve the pronunciation of some consonant clusters.
 * Generate colored consonants.
 * Blend some phonemes.
 * Achieve new articulations.

Second Voice Support
It is also possible to use a Vocaloid with a similar voice type to hide the flaws in the phonetic mispronunciations of another by having the two Vocaloids sing in a duet; a classic example that has become acknowledged as a good pairing amongst fans is Sonika and Luka.

Another use of this is when a Vocaloid sings in another language. If you put it in a duet or chorus with a Vocaloid of the intended language, that one will complement the pronunciation of the first.

You can also apply both techniques to improve Vocaloid languages by having a real singer perform the second vocal. However, this can potentially defeat the point of using the Vocaloid in the first place, as a real singer can more easily adapt to singing words not native to their own language than a Vocaloid can.

Post-Editing and Phoneme Slicing
Besides all the tricks available in the editor, it is possible to improve the pronunciation further during post-editing. After rendering and exporting the WAV file, the user can edit it in any DAW or sound editor. If the pronunciation of a consonant is too soft or too strong, the user can correct its volume.

Another technique that can be used on Vocaloids is phoneme slicing. This can be used on Japanese phonemes for Japanese Vocaloids, either in the Vocaloid software itself or in the user's DAW. The length of the note is decreased or cut down until only half the pronunciation needed for the spoken Japanese is heard (for example, "su" becomes "s"). However, this will affect the singing capabilities of the Vocaloid, and the notes being cut have to be much longer than normal. Although this technique may be hard for new users and results in a lack of singing smoothness, it increases the chances of getting a closer match to the intended sound. This can also be applied to English-capable Vocaloids. Additionally, vocoder software can be used to artificially create or transform Japanese or English phonetics into another language.

Flaws in the Phonetic System
Vocaloids must have the correct diphone combinations to avoid sounding choppy. However, the Vocaloid system will attempt to sound out all diphone data assigned to the phonemes used, even if a particular sound is not needed. The hidden phonetic [Sil] will prevent this occurring and can be used with any Vocaloid language. Still, this does not resolve all issues or scenarios.

Vocaloid itself has a habit of trying to sound out all data assigned to it. Yet a natural speaker may not sound out all the diphone sounds when they sing, for various reasons such as naturally slurred vocals, their localized accent, vocal disorders like stuttering, or speech impediments such as a lisp. This restriction may limit the ability of a Vocaloid to mimic even the language it is intended for. For example, American English accents often involve dropping the schwa vowel sound from words where it is featured, even though this sound is normally a prominent feature of the English language itself and is present in British English accents.

Languages themselves have their own sets of rules that are difficult to break.

For example, in English and Japanese Vocaloids:


 * Japanese: Since Japanese Vocaloids do not have to blend their words like English ones and have just 500 diphone sounds to use, they can produce choppier results than English Vocaloids when used for non-Japanese words, especially very different vocal languages such as English. Often when slicing, a small fragment of the missing phonetic sound remains ("su" becoming "s" may leave a trace of the missing "u" sound), leaving behind awkward vocal sounds that lower the quality of a Vocaloid's results. As of Vocaloid 3, voiceless sounds make this much easier to attempt, but it is still not a perfect solution to the problem. N' followed by a vowel may produce odd results; however, due to its use within the Japanese language there is no actual call for this phonetic to be followed by a vowel sound anyway, so Vocaloid possesses very limited data related to it. Japanese Vocaloids also have a very limited set of vowels and in many cases lack the vowel sound needed for many non-Japanese words.
 * English: English Vocaloids, which always attempt to blend their letters and have 2,500+ diphone sounds, will produce sounds closer to or more distant from the intended target language depending on where the stress accent falls. This can often make constructing non-English words complex. The resulting reliance on [Sil] to prevent unwanted combinations can leave behind choppy, robotic results, mixing smooth passages with sudden stops. Vocaloid 3 has the capability to make this easier to resolve and will soften such hard pronunciations anyway, but the sounds remain even though they are less apparent. Even with their large selection of diphone data, English Vocaloids cannot be relied on to select the right data when needed, and basic control of the diphones may result in random errors.

In both cases, the construction of the language is the reason for the issue, and when used outside their own languages the results will not sound as natural and will break down much more easily. As noted in this section, due to the sheer number of things to take into account, English-capable Vocaloids can often be far more complex to work with, owing to the problems presented by the English language, than Japanese Vocaloids. Liberally interpreted, English Vocaloids have a greater language capacity than their Japanese cousins because they have more vowel and clearly separated consonant sounds, and are therefore easier to make sing in other languages, although both will only be using the equivalent or quasi-equivalent phonemes according to the set-up of the phonetic system of either language. Japanese Vocaloids, despite the more limited array of phonemes, can often be far simpler to use.

There are also a number of known words used by English-capable Vocaloids that have more than one pronunciation due to stress accents. The user cannot rely on the software to choose between them, since Vocaloid can currently only store one pronunciation of a word in its dictionary. Without knowing how to sound out the alternative pronunciation, these words can be a problem for non-native English speakers:
 * Wind
   * The wind blew (IPA: [wɪnd])
   * You wind me up (IPA: [waɪnd])
 * Read
   * I will read the book (IPA: [riːd])
   * I read the book (IPA: [rɛd])
 * Tear
   * You have a tear in your eye (IPA: [tɪə]; Vocaloid: [t I@])
   * The paper has a tear in it (IPA: [tɛə]; Vocaloid: [t E@])
 * Bow
   * You must bow before royalty (IPA: [baʊ])
   * I tie a bow in my hair (IPA: [bəʊ] or [boʊ])
 * Live
   * The show was broadcast on TV live (IPA: [laɪv])
   * I know where you live (IPA: [lɪv])

Spanish Vocaloids also use stressing for some of their data, so this feature is not unique to English Vocaloids, but it is absent from many Asian languages, including Japanese.

Also, Vocaloids sometimes have difficulty pronouncing words. For example, Prima and Tonio struggle with the middle section of the word "together" if the middle section is too short when the word is spread out over several notes ("to-geth-er" becomes "to-g'-er" if "geth" has no room). Some Vocaloids' singing results may be impacted if a user does not consider this. When a Vocaloid fails to pronounce a phonetic it should be able to, there are ways around this: you can move the phonetic data onto another track, increase the "accent" (attack) in Note Properties, or change the length of the note to give the vocal room to pronounce the words. VY2 also has a weakness like this: the phonetics あ a followed by れ re become a げ ge sound, but this is fixed by dividing the tracks, breaking the transition with [Sil], or modifying the tone of the voice.

Many Vocaloids also come with an optimum range. The recommendations are there to help direct producers to the best range for the Vocaloid, as well as to describe what vocal range the Vocaloid has (Soprano, Mezzo-Soprano, Tenor, Alto, etc.). When hitting high notes above the Vocaloid's capabilities, the voice may become muffled and lack clarity, while many low notes can be soft and quiet. Singing within the optimum range increases the chance of clearer and more stable language skills from the Vocaloid.

Likewise, the optimum tempo helps the producer know what range will leave the Vocaloid sounding most natural. Too fast may not give the Vocaloid time to sound out the sounds correctly, resulting in digital noise in place of natural, smooth pronunciations. In the opposite direction, too slow can make any digital defects more apparent by allowing them to be heard much more clearly. The engine version will also affect the results in different ways, with the original Vocaloid engine being particularly notorious for its heavy digital sounds, much more so than Vocaloid 3.

User related concerns
One of the issues related specifically to the user is that they may not be able to properly use a Vocaloid for a language they don't know particularly well. What may sound flawless and realistic to a person who has little knowledge of a language may actually be full of bugs and glitches. A speaker of that language can hear the Vocaloid's flaws much better than someone who knows little of it. This issue can easily occur in even the most well-tuned Vocaloid songs and can often add a kink to an otherwise perfect example of a Vocaloid's best singing results.

Even if one were to take a VSQ or VSQX file that had been tweaked by another user, even a native speaker, not all Vocaloids have the same strengths and flaws. Therefore, it is vital that users take time to study even the basics of the language structure they are working with, and furthermore spend time comparing results for every song they produce, even if there is already pre-tweaking on the VSQ or VSQX file.

Additional Help
Note that both Zero-G and PowerFX also have tutorials of their own.


 * How To Make a Vocaloid Breathe Using VOCALOID: Explanation on how some of the Japanese Vocaloids sound when you use the breathing effects
 * Comparative Table of the English and Japanese Phonetic Systems of Japanese and English Vocaloids, including notes on whether each Vocaloid has a given phoneme. The list also includes information on how to effectively transform the quasi-equivalent phonemes in Japanese and English into the opposite language.
 * Vocaphonetic: A Japanese community site for creating and distributing Japanese dictionary data for English Vocaloids to sing better in Japanese. The dictionary data for Vocaloid and Vocaloid2 are respectively available.
 * Vocaloid Phonetic Library - a quick look up guide for Phonetics of all Vocaloids.
 * From English to Japanese - Using Tonio, these are instructions on how Japanese users can make Tonio sing in Japanese. Also shown is how closely and how much of the Japanese language Tonio can reproduce.
 * Tutorial - a tutorial showing a user making Miku sing in "English" using Japanese phonemes.
 * Making Big-Al sing Japanese

Trivia

 * One of the reasons for the long gaps between releases of English Vocaloids is the length of time consumed in recording the phonetic samples (estimate: 2,500 samples needed for English vs 500 for Japanese, per pitch). It took 25 hours (4 hours a day) to record all the Kagamine "Appends". In contrast, according to Anders, it takes anything from 1-3 weeks or more to record a single English voicebank.
 * The more samples involved in making a synthesized voice, the harder it is to maintain quality, and the lack of smoothness of older synthesizing software voicebanks can often reflect the difficulty this presents.
 * More complex languages such as English struggle much more to maintain quality while singing due to the sheer number of samples involved.
 * This is also why older voicebanks, such as the original Vocaloid voicebanks, may be harder to use. For instance, "now" is often pronounced as "no-ow" by the English Vocaloid voicebanks. In contrast, Vocaloid 2 voicebanks have no problems with this word.
 * Some fans struggle to understand how synthesized vocals have developed over a single decade and do not understand why Vocaloid results are as they are. Here are Microsoft Mike, Mary, Sam and Ann speaking (mature content), showing the various stages and progression of the vocals for the Microsoft text-to-speech software. Vocaloid was released soon after this software was being developed and is a much more advanced software package, but there are common problems shared between all synthesizing software packages.
 * Studies of the brain show that if spoken words are close enough to the intended words, the mind is capable of working out, or attempting to work out, what they actually are, even if the actual words spoken are gibberish. This plays a role in the matching of phonetics from one language to another, and can make the mind believe that a word sounds closer to the intended word than it really is.