Vocaloid Wiki
! The following is a tutorial made for VOCALOID fans by fellow VOCALOID fans. !

English Phonetics in use on the VOCALOID piano roll system

English VOCALOIDs are VOCALOIDs that are capable of mimicking the English language much more easily than VOCALOIDs of other languages. The following is a list of the phonemes needed to make an English VOCALOID sing in English.


The English language has one of the greatest variations in dialects in the world. Thus, there is much more variety of pronunciation for English VOCALOIDs than there is for those that sing in other languages.

English originated in England as the native language of the English people. The language's complexity is owed mostly to its large stock of loanwords, the result of migration by various ethnic populations over hundreds of years. It carries major influences of Celtic, Roman, Saxon, Norman, Viking, and Dutch origin. Minor influences from India, due to the occupation of British India, have also contributed to the language, along with a number of other, much smaller sources. These later global additions are owed to the British Empire's widespread influence, which spread the language to Africa and the Americas, as well as to some countries in Asia. While the British Isles are also home to Gaelic (the original language of Scotland), Gaeilge (the original language of Ireland), and Welsh (the original language of Wales), over time English became the lingua franca of these countries. In modern times English is the primary language of the United Kingdom and the Republic of Ireland, leaving these mostly as secondary languages. In addition, the use of English as a lingua franca is quite common globally.

The English language itself is made up of about 20 vowel sounds and 24 consonant sounds, give or take depending on the dialect. English doesn't have a precise orthography, so there is no one-to-one or near one-to-one match between letters and sounds as there is in languages like Spanish or Japanese.

Example: "W" can sound as /w/ in "what" and /u:/ in "few"; "Y" can sound as /j/ in "yes" and /i/ in "play".[1] There are also differences in the spelling of words, such as the British and American spellings "colour/color".

VOCALOID and the English Language

VOCALOID and VOCALOID2 only support American spelling for the lyrics by default. From VOCALOID3 onward the VOCALOID engine was confirmed to be capable of localisation, but it is unknown if this will ever open up the ability to have both American and British spelling.

However, the phonetic notation doesn't follow this; instead it uses Received Pronunciation written in X-SAMPA, with some minor modifications where required, as is the case with the allophones.
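As a rough illustration of how the notation relates to IPA, the sketch below converts a few VOCALOID vowel symbols using the monophthong rows of the phonetic list later on this page. The function name `to_ipa` and the space-separated input format are assumptions made for this example only; they are not part of the VOCALOID editor.

```python
# Monophthong symbols and their IPA equivalents, as given in the
# phonetic list on this page. Only the pure vowels are covered here.
VOCALOID_TO_IPA = {
    "@": "ə",
    "V": "ʌ",
    "e": "ɛ",
    "I": "ɪ",
    "i:": "iː",
    "{": "æ",
    "O:": "ɔː",
    "Q": "ɒ",
    "U": "ʊ",
    "u:": "uː",
}


def to_ipa(phonemes: str) -> str:
    """Convert a space-separated phoneme string to IPA (hypothetical helper).

    Symbols that are not in the table are passed through unchanged.
    """
    return " ".join(VOCALOID_TO_IPA.get(p, p) for p in phonemes.split())


print(to_ipa("k { t"))
```

Note that [e] maps to /ɛ/ here, following the list on this page, which is one of the places where the notation departs from plain X-SAMPA.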

English is one of the most difficult languages to recreate for VOCALOID, with voicebanks several times larger than those of simpler languages such as Japanese and with much more random and unpredictable results. This is attributed to the sheer size of the voicebank and the many factors that may have to be taken into account when using, and even developing, the vocals. Even when recording sounds, a slight change in the way the language is spoken can create oddities in the overall way a VOCALOID sings, changing the tone or pronunciation, at times seemingly at random, even compared to another vocalist with the same tone or accent of voice.

It is important to remember that the engine itself is not affected by language, and singing results are broadly the same for any language. However, due to the greater number of variations in overall sound, English can be much more unpredictable than some of the other languages VOCALOID supports.

Like Chinese, English has many native variations; unlike Chinese, however, the contrasts between accents are much larger, and several major accents have appeared over time. For example, there are significant differences between British, American, and Australian English accents alone, so much so that each of the three would need its own English script to be recorded effectively. With Chinese, by contrast, the Beijing pronunciation is considered the "standard", so only a script for this variant needs to be taken into account. Even when producing an English voicebank for a non-native speaker, the script may have to be catered to each accent variation. The overall result is that the adjustments, editing, and effects applied to one English VOCALOID voicebank can be counterproductive for another.

During the VOCALOID2 era it was also confirmed that English voicebanks needed many of their samples cut at a length of more than 0.5 seconds, longer than the Japanese sample length. Otherwise, the English vocals had a habit of cutting out when used for short notes.[2]

English Scripts

The recording scripts used for English VOCALOIDs have also been confirmed to have an impact on the way an English VOCALOID sounds. The scripts are the list of sounds a voice provider has to record in order to obtain all the sounds essential for successful English replication. A bad script results in more errors being present in the final voicebank.

For those not familiar with the English script at the time of recording, it can be a challenge regardless of the version used. This was noted when Saki Fujita, a veteran of Japanese VOCALOID script reading, faced it for the first time and thought she would have to relearn script reading from scratch. The reason is that the script is very different from those used for other languages, such as Japanese.[3]

According to the developer's notes on CYBER DIVA, the VOCALOID engine itself uses a combination of both British and American phonetic sounds. The result is that certain sounds may sometimes sound off, because that particular combination would not typically be used together by either a British or an American accented speaker.

Original YAMAHA Script

The VOCALOID English script used prior to VOCALOID4 was confirmed to contain errors, and thus VOCALOIDs that were recorded using it, such as YOHIOloid, have incorrect pronunciations. It is important to note that, for pre-VOCALOID4 voicebanks, unexpected results commonly occur when certain combinations of phonemes are entered into the editor.

Due to their differences, the majority of the pre-VOCALOID4 voicebanks will not produce the same results as post-VOCALOID4 voicebanks because of the issues with this script.

Many also lack the schwa sound despite the symbol for it being registered by the engine, and there was also an "aspiration problem," where aspirated and non-aspirated consonants were not differentiated.

Cyber Diva Script

During the development of the CYBER DIVA vocal, a number of existing issues were noted and finally addressed, resulting in an improved base YAMAHA script.

A subtle difference between the old YAMAHA English Dev Kit script and the CYBER DIVA script is that the newer script produces less expressive tones than the older script, as it focuses on obtaining more clarity per sound.

The CYBER DIVA script also fixes the "aspiration problem" and includes the schwa sound recording.

Ruby Script

Ruby also uses a new script that was created by Syo. Once again, the creation of the new script was due to the errors contained within the previous YAMAHA script. Because of it, Ruby shows various pronunciation improvements over older VOCALOIDs like YOHIOloid. This script is written focusing on the American accent.

Part of the reason for Ruby having a different script than CYBER DIVA is that the improved script used for CYBER DIVA hadn't been shared at that time.[4] The aspiration issue is also fixed in this script.

DEX and DAINA Script

In June 2015, Syo also revealed he had created another English script for Zero-G Limited's two American VOCALOIDs DEX and DAINA, which was similar to Ruby's but had different lyrics.[5] This script also focuses on the American accent.

Cyber Songman Script

CYBER SONGMAN was recorded with a brand new phonetic script developed personally by the lead developer of the project, Michael Wilson. This new script is an update of CYBER DIVA's, and according to Wilson, tests proved that it was easier to read and pronounce, which increases the clarity in the pronunciation while maintaining a natural, expressive sound.

Notes on Accents

Despite the general belief that singers completely lose their accents when they sing, this is not always the case, and an accent can be heard even in singing vocals.

However, the reason many are led to believe this is that there are several methods of training singers to disguise or otherwise hide their natural accents; they may even adopt an accent that isn't their own for singing. Examples include genres such as western or country, and black music such as Jazz or Soul. Singing also uses different muscles from speech, resulting in differences in air pressure and in the way the throat moves. Genres such as Opera are the most likely to make an accent appear almost entirely absent, thanks to the impact of the opera vibrato.[6][7]

VOCALOID will capture any form of accent quite easily at times. This depends on the recording method used with the voice provider, the type of sound being recorded per sample (accent impact varies per sample and language), and the overall number of samples that make up the voicebank (the more samples, the greater the chance of an accent slipping in).

VOCALOID commonly captures the accents of English voice providers, which then become present in the voicebank's results. In fact, every VOCALOID made in English thus far has captured some form of accent. Some sounds are accent-exclusive, and these sounds are present in the phonetic list itself. VOCALOID isn't alone in this problem; any similar synthesizing method can have the same issues with accent.

Though English is not alone in this, as other languages may suffer from the same problem, issues with accents have proven difficult to avoid for English VOCALOIDs. Even the first two VOCALOIDs in English, LEON and LOLA, were noted for their distinctly "British" accent. Accents have been known to either aid or complicate the use of synthesizing software, and VOCALOID is no stranger to this effect. English VOCALOIDs have ended up with the most variation in how they sound out of all the languages offered for the VOCALOID software so far.

The impact of the dialect/accent on English VOCALOIDs can result in noticeable variation of certain sounds, particularly the diphthongs and rhotic vowels. Users who are not aware of the potential difficulty of accents may overlook odd pronunciations that need to be adjusted for better results. This is even more true for voicebanks with non-native accents, since the voice provider may have pronunciation issues with a non-native language.

In some instances, producers have adjusted VSQ, VSQx, and VPR files so heavily to make them work for one particular English VOCALOID that the files become "VOCALOID specific" and do not work particularly well on other English VOCALOIDs without further adjustments. Cases like this are rarer in languages such as Japanese, though not unknown, and many VSQ, VSQx, and VPR files will work without too much adjusting.

Native Accented

British-English Accented

British-English accented VOCALOIDs are VOCALOIDs whose provider is known to have been of "British" nationality. As Great Britain is the main origin of English, British-English VOCALOIDs sing in a native English accent. Originally, this was the standard English accent type used to develop the English engine. British-accented VOCALOIDs mostly came from Zero-G, who worked solely with British artists to collect their vocal samples.

An issue with this accent that has commonly been reported is weakness in certain consonants, an issue that other accents such as American accents simply do not have, with some of the older VOCALOID and VOCALOID2 vocals having issues with sounds such as "G", "T" or "R" being particularly weak. On the other hand, vowels are often distinct and soft, with only issues related to incorrectly placed sounds ever being reported as a major issue (see English Scripts).

Those not used to the accent, such as American speakers, have generally reported a lack of clarity, with "soft" type vocals in particular considered the worst in this respect.

Note: The term 'British' applies to anyone from England, Scotland, Wales and Northern Ireland and therefore the variation of the accent can differ greatly overall. The British Isles have the greatest variation of accents for English in the world per sq. mile of land. (For more information see Wikipedia.)

Though LOLA is regarded as having a "British" accent, this is non-native. LOLA's accent reverts to her provider's natural Caribbean accent when not singing in ideal Soul music conditions.

Though Japanese accented as well, Fukase English has influence from this accent and is the only Japanese-English vocal to base itself on the British accent instead of the American one.

American-English Accented

American accented VOCALOIDs have providers that came from the United States of America, and are therefore native speakers of the English language. The most notable difference from the British accented voicebanks is in the rhotic vowels.[8] This is because British dialects are usually non-rhotic, while rhotic dialects of English are predominant in North America.[9] (For more information see Wikipedia.)

Overall, in regards to VOCALOID, the result is that consonants are usually stronger than British ones. VOCALOIDs with this accent are generally considered to have more clarity overall, with slightly less distinct, though often bolder or harder, vowel sounds. Far fewer clarity issues are reported as a result, even from those not used to the accent.

Due to the user base's preference for this accent, PowerFX have since confirmed that YOHIOloid's vocal was made to have an American-sounding ring to it. Hatsune Miku English was also made to match the American way of speaking by Crypton Future Media. The American accent has become the preferred accent overall among VOCALOID users, since many of them are American.

Australian Accented

Australian accents are the normal English accent for individuals from Australia. This particular accent is normally very distinct compared to all other English accents, with features not found in other English dialects. (For more information see Wikipedia.)

  • Sweet ANN - Her provider "Jody" supposedly came from Australia.

South-African Accented

South African accents are accents belonging to individuals from South Africa. English was not a native language of Africa and was introduced during the colonisation of African countries by the English, resulting in the English language becoming widely used in South Africa itself as the general lingua franca between regions. Variation in the impact of native languages on English results in a large variation in the strength and tone of the accent, though in general most South African accents closely resemble South England accents in nature. (For more information see Wikipedia.)

Irish-English Accented

Irish accents belong to those hailing from the island nation of Ireland. English is its most widely-spoken language, and one of its two official languages, along with the Irish language. (For more information, see Wikipedia.)

Non-native Accented


Japanese-accented English VOCALOIDs are produced by those who came from Japan. Their voice providers have Japanese as their native language, but were used to produce English voicebanks. The Japanese-English accent is therefore a non-native English accent, showing significant and noticeable differences in comparison to the native English accents. As more such voicebanks have been released by studios, common traits have become clearly identifiable among these vocals.

The major issue seen with Japanese accents is that they often struggle to distinguish some sounds. This usually happens because the providers and the producing studios/companies aren't familiar with these foreign sounds. Among the most common issues are:

  • Lack of distinction and stress in vowel sounds. The vowels are usually either too tense or too lax, as the speaker tends to approximate each vowel sound to the Japanese 5-vowel system.
  • Lack of distinction in the liquid consonants (R & L). Luka's use of English to pronounce the words "Road Roller", which risks coming out sounding like "roe rorora", is the most famous case.
  • Distortion of some sounds toward similar Japanese sounds. For example, the [f] phoneme may be pronounced as a voiceless bilabial fricative instead of a voiceless labiodental fricative, as it should be.

These traits depend on the provider's proficiency in English and the experience of the studio/company with the language. Despite this, Japanese-accented English VOCALOIDs are still a better option for mimicking the English language than a purely Japanese voicebank, thanks to the wide array of phonemes and work-arounds available from the English phonetic system.


Korean-accented English VOCALOIDs are produced by those who come from South Korea. As there is only one unreleased VOCALOID voicebank with this accent, details cannot be released.

SeeU's Korean voicebank is a special case, as it was given English phonemes to mimic the language to a certain degree. However, this feature was left largely incomplete due to deadline issues, and again it does not produce results of sufficient quality to comment on.

  • SeeU - An English Voicebank was set for production but is currently on hiatus as of Feb 2013.
  • UNI - An English Voicebank was set for production but had been stalled since 2016. As of 2020, there were remarks from ST MEDiA Co., Ltd. that YAMAHA no longer supported the VOCALOID4 engine, which now limits the possibility of UNI receiving more voicebanks.


  • YOHIOloid - He is voiced by the Swedish singer and songwriter, YOHIO, who provided this English and Japanese voicebank.


  • Prima - Accent unconfirmed
  • Tonio - Accent unconfirmed
  • Amy - Accent unconfirmed
  • Chris - Accent unconfirmed
  • SARAH - Accent unconfirmed
  • ALLEN - Accent unconfirmed

Custom Dictionaries

More information on dictionaries can be found on Phoneme List.

English VOCALOIDs rely greatly on the VOCALOID editor dictionary due to the language's lack of a systematic orthography. Custom dictionaries can take advantage of the large array of English sounds found within VOCALOID to improve the way they sound, by using different combinations of sounds or by making an accent/dialect appear by default. This is not isolated to English vocals, but has been known to impact them greatly at times.

Be aware that the language is full of examples of homonyms that take the form of homographs (a word that has the same spelling as another word but has a different sound and a different meaning, such as "bow", "minute" and "tear") or homophones (a word that has the same sound as another word but is spelled differently and has a different meaning, such as "pair"/"pear" or "bare"/"bear") or both. VOCALOID's dictionary has limitations that make such words difficult to record within it; at times users may simply have little choice but to write the word phonetically rather than lyrically.

Note that if a user creates lyrics via phonetic entry rather than written text, they will not have to consider dictionaries at all.
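The homograph limitation can be pictured with a minimal sketch, assuming a dictionary that stores exactly one pronunciation per spelling. The `DICTIONARY` contents, the phoneme strings, and the `phonemes_for` helper are all hypothetical and chosen only for illustration; they are not taken from any real VOCALOID dictionary.

```python
from typing import Optional

# Hypothetical dictionary: one pronunciation per spelling, so only one
# reading of each homograph can be stored.
DICTIONARY = {
    "bow": "bh aU",   # the reading that rhymes with "cow"
    "tear": "th I@",  # the reading that rhymes with "ear"
}


def phonemes_for(word: str, phonetic_entry: Optional[str] = None) -> str:
    """Return the phonemes for a note (illustrative helper).

    A phonetic entry typed by the user bypasses the dictionary entirely;
    otherwise the single stored reading is used.
    """
    if phonetic_entry is not None:
        return phonetic_entry
    return DICTIONARY[word.lower()]


# Dictionary lookup gives the stored reading only.
print(phonemes_for("bow"))
# The other reading (rhyming with "show") has to be typed phonetically.
print(phonemes_for("bow", phonetic_entry="bh @U"))
```

The point of the sketch is only that a spelling-keyed lookup cannot pick between two readings on its own, which is why phonetic entry is sometimes the only option.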

Megurine Luka

With the initial release of Megurine Luka, Crypton released a custom dictionary for Luka which could be downloaded from their site. This dictionary included support of Japanese characters and the names of other Crypton VOCALOIDs.[10]


VOCALOID3 English vocals were given a new dictionary. This was said to "improve" the way English Vocaloids sounded.[11]

Megpoid English

Internet Co., Ltd. provided a custom dictionary for GUMI's Megpoid English vocal. This was done to avoid certain problematic combinations that were known to the vocal. Without this script, GUMI naturally has errors that will be encountered, such as skipping of sounds or incorrect sound combinations.[12]


CYBER DIVA was created with a new script for VOCALOIDs. With this script, YAMAHA created a new custom dictionary for the voicebank with new words that weren't available before and more natural pronunciations.


Including the 300 most common words, Syo confirmed that Ruby knew over 5,900 words.[13] 100 of these words were randomly chosen.[14] Ruby was also set up to pronounce some words such as "fire" and "hour" in one syllable.[15][16][17]

Syo's Twitter account lists many of Ruby's dictionary word adaptations and added words.


CYBER SONGMAN's dictionary was an update of his counterpart's. It also makes use of his extra phonemes [4] and [@l]. While [4] was given to various third-party VOCALOIDs, the latter was initially exclusive to SONGMAN, but was later given to Amy and Chris.

Phonetic System's Characteristics

There are 52 phonetic pronunciations which make up the English VOCALOID library; these phonetic inputs will use any set of the estimated 2500 samples per pitch.[18] According to development notes on Megpoid English, there were over 4,000 phonetic connections for that particular vocal alone;[19] a similar number is therefore likely for all English VOCALOIDs.


The English phonetic system includes 3 types of vowels: monophthongs, diphthongs and R-colored vowels. Being the nucleus of the syllable, the vowels can be encoded alone.

The English phonetic system includes 10 of the 11 monophthongs or pure vowels of the English language; it is missing the phoneme /ɑː/ or open back unrounded vowel.

The pronunciation of some vowels may change slightly, depending on the dialect or the way the VOCALOID was recorded.

  • Example: OLIVER's [{] phoneme has been reported to sound more like an /a/ than an /æ/.

The target or optimal musical genre of the VOCALOID can also affect the pronunciation of the vowels.

  • Example: Tonio & Prima have been reported to have an "opera"-like pronunciation of the vowels, more fitting for Romance languages than standard English. This is probably because they are Opera-specialized voicebanks.

The English phonetic system also includes an array of 5 diphthongs or gliding vowels: 3 y-colored diphthongs and 2 w-colored diphthongs. The diphthongs behave as a single vowel, despite the glide at the end of them.

It is important to consider that the diphthongs, like the monophthongs and the rhotic vowels, can vary in pronunciation depending on the dialect, the recording, and the stress of the word.

  • Example: The diphthong [eI] can be pronounced with different degrees of stress, being realized either as [eː] (unstressed monophthong), [eɪ] (diphthongized [e], lax glide), [ei] (diphthongized [e], tense glide) or [ej] (diphthongized [e], short tense glide). BIG AL is known to noticeably vary the pronunciation of this phoneme according to context.[20]

The English phonetic system also includes 6 r-colored or rhotacized vowels. These are mainly used for vowel + R combinations. The vowels are modified by the R that follows them, which merges with them to form a single unit, as is the case with the diphthongs.

Like the diphthongs, these tend to vary in their pronunciation, especially depending on whether or not the voice provider has a rhotic accent.

  • Example: Depending on the speaker's dialect and the context of the sound, the VOCALOID phoneme [I@] may be realized as [ɪː] (non-rhotic, long vowel), [ɪə] (non-rhotic, schwa diphthong), [ɪɚ] (rhotic; r-colored schwa diphthong), [ɪɹ] (rhotic; vowel-consonant), etc.
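The three vowel groups described above can be summarised in a short sketch. The symbol sets are taken from the phonetic list on this page (10 monophthongs, 5 diphthongs, 6 r-colored vowels); the `vowel_type` helper is a hypothetical name used only for this illustration.

```python
from typing import Optional

# Vowel symbols grouped as in the phonetic list on this page.
MONOPHTHONGS = {"@", "V", "e", "I", "i:", "{", "O:", "Q", "U", "u:"}
DIPHTHONGS = {"eI", "aI", "OI", "@U", "aU"}
RHOTIC_VOWELS = {"@r", "I@", "e@", "U@", "O@", "Q@"}


def vowel_type(symbol: str) -> Optional[str]:
    """Classify a phoneme symbol, written with or without [brackets].

    Returns None for anything that is not one of the listed vowels.
    """
    # Remove one surrounding pair of brackets only, so that the "["-free
    # symbols containing braces, such as "{", are left intact.
    bare = symbol[1:-1] if symbol.startswith("[") and symbol.endswith("]") else symbol
    if bare in MONOPHTHONGS:
        return "monophthong"
    if bare in DIPHTHONGS:
        return "diphthong"
    if bare in RHOTIC_VOWELS:
        return "rhotic vowel"
    return None


print(vowel_type("[eI]"))
print(vowel_type("[I@]"))
```

A lookup like this can help flag which notes in a lyric contain diphthongs or rhotic vowels, which are the ones most likely to need attention when extended across notes.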

The diphthongs and rhotic vowels tend to cause problems when they need to be extended across 2 or more notes and the user attempts to do so manually.[21]

To work around this, the English voicebanks allow words to be split into syllables across notes using the hyphen symbol "-" within the lyrics.

  • Example:
    Remember split

Extending a syllable across several notes requires a combination of hyphen '-' and slash '/' within the lyrics to state how many notes it will last.

  • Example:
    Sound extend V2

In VOCALOID2's case, the hyphen/slash is required to divide words across notes effectively, unless the user prefers to take the risk of working around this manually using phoneme replacement.

In the case of VOCALOID3 and VOCALOID4, the task is easier, as the [-] phoneme extends whatever vowel it follows. The hyphen/slash still works; however, it simply adds the [-] phoneme where required.

  • Example:
    Sound extend V3

In VOCALOID5's case, there is no longer a need to use the slash, as only the hyphen is necessary. It will add the [-] phoneme, and when applicable move the final consonant phonemes to the end of the last note.

  • Example:
    Sound extend V5


The phonetic system also includes 31 consonant phonemes. Of the English consonants, only the plosives and the liquids have their allophones as separate phonemes; these are required to achieve correct stress and pronunciation of words.


Plosives and aspirated allophones

Because aspiration is an important element of consonant stress within the language, the English phonetic system distinguishes between normal plosives and their aspirated allophones.

Aspiration is the strong burst of air that accompanies the release of some obstruents.

In the English language, the plosives [b], [d], [g], [p], [t], [k] become aspirated at the beginning of a word or at the beginning of a stressed syllable.

  • Example: The word 'potato' is aspirated in two consonants: the initial P, because it's the beginning of the word; and the middle T, because it begins the stressed syllable.

In the International Phonetic Alphabet, aspirated phonemes are indicated by a small superscript ‹ʰ›, as with /kʰ/ for an aspirated /k/. In VOCALOID's English phonetic system, aspirated phonemes are distinguished from their standard versions by the addition of an h, which stands in for the IPA's superscript ‹ʰ›.
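The aspiration rule described above can be sketched as follows. This is only an illustration of the rule as stated on this page, not how the editor itself works; the `aspirate` function, the syllable-list input format, and the phoneme spelling chosen for 'potato' are assumptions made for the example.

```python
from typing import List

# The plosives that have aspirated counterparts, as listed above.
PLOSIVES = {"b", "d", "g", "p", "t", "k"}


def aspirate(syllables: List[List[str]], stressed: int) -> List[List[str]]:
    """Mark aspirated plosives in one word (illustrative helper).

    syllables: the word as a list of syllables, each a list of phonemes.
    stressed:  index of the stressed syllable.
    A plosive that opens the word or opens the stressed syllable gets
    an 'h' appended, for example p -> ph.
    """
    result = []
    for index, syllable in enumerate(syllables):
        phonemes = list(syllable)  # copy, so the input is left untouched
        starts_word = index == 0
        starts_stressed = index == stressed
        if phonemes and phonemes[0] in PLOSIVES and (starts_word or starts_stressed):
            phonemes[0] += "h"
        result.append(phonemes)
    return result


# 'potato', stressed on the second syllable (assumed transcription).
potato = [["p", "@"], ["t", "eI"], ["t", "@U"]]
print(aspirate(potato, stressed=1))
```

For 'potato' this marks the initial P and the middle T while leaving the final T plain, matching the example given above.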

The English phonetic system includes an array of 3 to 4 liquid consonants. These include both of English's allophones of L. The English R is usually used at the beginning of syllables, as an R after a vowel is covered by the R-colored vowels.

Additionally, it can include a non-native phoneme, the Rolling R. This is mainly used for loanwords, for singing in other languages, or for particular genres such as opera.

Dark L and Clear L

The system includes both allophones of the English L: the [l0] or alveolar lateral approximant, also known as Light L or Clear L (used at the beginning of syllables); and the [l] phoneme or velarized alveolar lateral approximant, also known as Dark L (used at the end of syllables).

These phonemes aren't designed to be encoded alone; however, [l0] seems to cope better with being reproduced without a vowel than [l] does. The former results in an audio loop, while the latter generates electronic buzzing or produces no sound at all without a vowel. The only exception to this was Megurine Luka, whose [l] phoneme behaves as a syllabic consonant, so it can be used alone and extended without suffering distortion.[22] The lack of a proper syllabic Dark L was a minor issue that was finally addressed with the release of CYBER SONGMAN, which included its own phonetic symbol [@l] for said allophone, allowing a more colloquial pronunciation if the user requires it.
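The choice between the two L phonemes can be sketched as a position check against the syllable's vowel. The placeholder "L" for an undecided L, the `choose_l` helper, and the example transcriptions are assumptions made for this illustration; the vowel symbols come from the phonetic list on this page.

```python
from typing import List

# All vowel symbols from the phonetic list on this page.
VOWELS = {
    "@", "V", "e", "I", "i:", "{", "O:", "Q", "U", "u:",  # monophthongs
    "eI", "aI", "OI", "@U", "aU",                          # diphthongs
    "@r", "I@", "e@", "U@", "O@", "Q@",                    # r-colored vowels
}


def choose_l(syllable: List[str]) -> List[str]:
    """Replace the placeholder "L" with [l0] or [l] (illustrative helper).

    An L before the syllable's first vowel becomes the Clear L "l0";
    an L after it becomes the Dark L "l". A syllable with no vowel is
    returned unchanged, since neither phoneme is meant to stand alone.
    """
    nucleus = next((i for i, p in enumerate(syllable) if p in VOWELS), None)
    if nucleus is None:
        return list(syllable)
    return [
        ("l0" if i < nucleus else "l") if p == "L" else p
        for i, p in enumerate(syllable)
    ]


print(choose_l(["L", "I", "s", "t"]))  # 'list': L at the start of the syllable
print(choose_l(["f", "i:", "L"]))      # 'feel': L at the end of the syllable
```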

Rolling R

Although it isn't a native phoneme of the English language, the alveolar trill or rolling R was included in the English phonetic system to increase the Opera singing capabilities of Prima. After this, it became a common phoneme in the VOCALOID2 English voicebanks released after Prima (with the exception of Luka).[23] However, its inclusion in VOCALOID3 English voicebanks appears to be deprecated.

Nonetheless, the performance of this phoneme may vary between different English VOCALOIDs. For example, it is known that BIG AL is capable of using it only at the end of words and requires some techniques and further editing to use it at the beginning or in the middle of a word.

This "R" sound can also be used to imitate other languages.

The symbol which represents it in the English Phonetic System is the phoneme [R].

Phonetic List

Special note: This list is based on BIG AL's help file, complemented with the chart from VOCALOID-User.Net[24] and expanded to include the IPA symbols and names. However, there were some incorrect entries within the released list. Entering some of the words provided here as examples of phoneme usage will not result in the expected phonemes that were used for the list. In addition, the list did not indicate which particular letters the phoneme applied to; this section has underlined the relevant letters for the benefit of readers.

Symbol Classification IPA's Symbol / Name Sample Notes Related Phonemes
[@] vowel ə schwa aware, synthesis, harmony, the In the VOCALOID program, it is not actually used by itself but rather with other phonetics. However, Luka can use this phoneme to make a the "a" sound in aline

[V] (stressed)

[@r] (r-colored)


[V] vowel ʌ open-mid back unrounded vowel strut, unclean, cut,
Actually it's an /ɐ/ in various most of the dialects. Despite this, the notation /ʌ/ still is used for tradition and because some dialects still retains the old pronunciation.

[@] (unstressed)

[{] (fronted)

[Q@] (r-colored)

[e] vowel ɛopen-mid front unrounded vowel them, egg Usually transcribed as /e/ by the AHD

[e@] (r-colored)

[eI] (diphthongized)

[I] vowel ɪnear-close near-front unrounded vowel kit, it, synthesis

[i:] (tense)

[I@] (r-colored)

[i:] vowel close front unrounded vowel beef, eat, harmony

[I] (lax)

[I@] (r-colored)

[{] vowel ænear-open front unrounded vowel trap, axe In some dialects, it may be diphthongized into /eə/ or similar due Æ-tensing}.

[aI] (diphthongized)

[aU] (diphthongized)

[O:] vowel

ɔːopen-mid back rounded vowel

taught, ought, ball This vowel has a lot of variations depending on the dialect. In US dialects it varies between /ɑ/ for the cot–caught mergers and /ɒ~ɔ/ for the rest.

[Q] (lax)

[O@] (r-colored)

[Q] vowel ɒopen back rounded vowel lot, off


[OI] (diphthongized)

[U] vowel ʊnear-close near-back rounded vowel put, look

[u:] (tense)

[U@] (r-colored)

[u:] vowel close back rounded vowel boot, view

[w] (semivowel)

[U] (lax)

[U@] (r-colored)

[@r] rhotic vowel

əɹ, ɚ or ɝ (US)

ɜː (UK)

urge, bird, marker r-colored schwa

[@] (non-rhotic)


[eI] diphthong eɪ̯ pay, age, date j-colored /e/ [e] (monothong)
[aI] diphthong aɪ̯ buy, eye, died j-colored /a/




[OI] diphthong ɔɪ̯ boy, oil, choice j-colored /ɔ/




[@U] diphthong

oʊ̯ (UK)

oʊ̯~o (US)

oat, soak, show w-colored /o/. Usually transcribed as /əʊ̯/ or /oː/ [@]
[aU] diphthong aʊ̯ loud, out, cow w-colored /a/



[I@] rhotic vowel

ɪə (UK)

i(ə)ɹ (US)

beer, ear r-colored /ɪ/

[I] (uppercase i)


[e@] rhotic vowel

ɛə~ɛː (UK)

ɛɹ (US)

bear, air, aware r-colored /ɛ/ [e] (non-rhotic)
[U@] rhotic vowel

ʊə (UK)

ʊɹ (US)

cure, surely r-colored /ʊ/

[U] (non-rhotic)

[u:] (non-rhotic)

[O:] (non-rhotic)


[O@] rhotic vowel

ɔː(ɹ) (UK)

ɔɹ~oɹ (US)

pour, sort r-colored /ɔ/

[O:] (non-rhotic)

[Q] (non-rhotic)

[Q@] rhotic vowel

ɑː(ɹ) (UK)

ɑɹ (US)

star, are, harmony r-colored /ɑ/



[w] consonant wlabio-velar approximant way

[u:] (syllabant)


[j] consonant jpalatal approximant yellow

[i:] (syllabic)

[I] (uppercase i)

[b] consonant b voiced bilabial plosive cab

[p] (voiceless)

[bh] (aspirated)

[bh] consonant aspirated voiced bilabial plosive big At the beginning of syllable, /b/ with aspiration

[ph] (voiceless)

[b] (deaspirated)

[d] consonant d voiced alveolar plosive bad

[t] (voiceless)

[dh] (aspirated)

[D] (lenited, lowered)

[dh] consonant aspirated voiced alveolar plosive dog At the beginning of syllable, /d/ with aspiration

[th] (voiceless)

[d] (deaspirated)

[D] (lenited, lowered)

[g] consonant g voiced velar plosive bag

[k] (voiceless)

[gh] (aspirated)

[N] (nasalized)

[gh] consonant aspirated voiced velar plosive god At the beginning of syllable, /g/ with aspiration

[kh] (voiceless)

[g] (deaspirated)

[dZ] consonant ʤ voiced postalveolar affricate jeans

[tS] (voiceless)

[Z] (spirantized)

[d] (deaffricated)

[v] consonant v voiced labiodental fricative vote [f] (voiceless)
[D] consonant ð voiced dental fricative their

[T] (voiceless)

[d] (fortition)

[dh] (aspirated)

[v] (Th-fronting)

[z] consonant z voiced alveolar fricative resort

[s] (voiceless)

[Z] (palatalized)

[Z] consonant ʒ voiced postalveolar fricative Asia

[S] (voiceless)

[z] (depalatalized)

[dZ] (affricated)

[m] consonant m bilabial nasal mind

[n] (alveolarized)

[n] consonant n alveolar nasal night

[N] (velarized)

[m] (labialized)

[N] consonant ŋ velar nasal long [n] (develarized)
[r] consonant ɹ alveolar approximant red In the IPA and X-SAMPA, /r/ is the symbol for the alveolar trill (rolled R); the symbol used here seems to be based on the AHD.

[R] (rolled)

[w] (gliding)

[l] consonant ɫ velarized alveolar lateral approximant feel Dark L, at the syllable coda position

[l0] (develarized)

[w] (L-vocalized)

[u] (L-vocalized)

[U] (L-vocalized)

[l0] consonant l alveolar lateral approximant list Clear L, at the beginning of syllable

[l] (velarized)

[p] consonant p voiceless bilabial plosive dip

[b] (voiced)

[ph] (aspirated)

[ph] consonant aspirated voiceless bilabial plosive peace At the beginning of syllable, /p/ with aspiration

[bh] (voiced)

[p] (deaspirated)

[t] consonant t voiceless alveolar plosive sit

[d] (voiced)

[th] (aspirated)

[th] consonant aspirated voiceless alveolar plosive top At the beginning of syllable, /t/ with aspiration

[dh] (voiced)

[t] (deaspirated)

[k] consonant k voiceless velar plosive rock

[g] (voiced)

[kh] (aspirated)

[kh] consonant aspirated voiceless velar plosive kiss At the beginning of syllable, /k/ with aspiration

[gh] (voiced)

[k] (deaspirated)

[tS] consonant ʧ voiceless postalveolar affricate touch

[dZ] (voiced)

[S] (spirantized)

[t] (deaffricated)

[f] consonant f voiceless labiodental fricative feel [v] (voiced)
[T] consonant θ voiceless dental fricative think

[D] (voiced)

[s] (Th-alveolarization)

[f] (Th-fronting)

[s] consonant s voiceless alveolar fricative sea

[z] (voiced)

[S] (palatalized)

[S] consonant ʃ voiceless postalveolar fricative share

[Z] (voiced)

[tS] (affricated)

[s] (depalatalized)

[h] consonant h voiceless glottal fricative hat

Additional phonetics

The following is a list of additional complementary phonemes available within some of the English VOCALOIDs. Most of them are allophones, and it is possible to use a voicebank without ever touching this set of data. However, using them within a song can improve the pronunciation and help the VOCALOID sound more colloquial. In most cases, the data has to be entered manually through the note properties selection.

Symbol Classification IPA's Symbol / Name Sample Notes Related Phonemes Applies to
[e@0] diphthong [ɛə~eə~æ] man, land Tense allophone of /æ/, often diphthongized (/æ/-tensing)

[{] (allophone)


[4] consonant ɾ alveolar flap better Unstressed allophone of the /t/ or /d/ phonemes (alveolar flapping)

[t], [d] (allophone)

[R] (trill)

[r] (approximant)

[R] consonant r alveolar trill tierra (earth) Rolling R. Generally used in non-English words

[4] (tap)

[r] (approximant)

Prima, SONiKA, Big AL, Tonio, RUBY
[@l] consonant ɫ̩ syllabic alveolar approximant apple, awful Syllabic allophone of the Dark L [l] (non-syllabic) CYBER SONGMAN, Amy, Chris
[h\] consonant ɦ voiced glottal fricative behind Possible allophone of /h/ between voiced sounds [h] (voiceless) RUBY, DEX, DAINA


Phoneme Replacement

Due to the large array of allophones and similar-sounding phonemes available in the English language, there is great flexibility for phoneme replacement. This has many applications, such as altering the emphasis or stress of a word, correcting a strange pronunciation found in a voicebank,[25] or altering the accent or general pronunciation of a particular VOCALOID.[26]

Together with some auxiliary phonemes, this allows a great diversity of combinations and possibilities to experiment with. However, the user must bear in mind that the results may vary between voicebanks due to individual differences such as the accent, pronunciation, and sample quality of each voicebank. It is best to take these tips as a guide and experiment for yourself.

For the consonants, it is possible to:

  • Replace the plosives with their respective aspirated allophones. If a consonant sounds too strident or too weak, it can be replaced with the corresponding allophone. However, note that this may affect the stress, as aspiration is related to it.
  • Swap a consonant for its respective voiced or voiceless counterpart.
    • This works especially well at the end of syllables, where the coda consonant is prone to assimilate the voicing of the neighboring phonemes.
    • This is harder for an onset consonant (beginning of a syllable), as the voicing can alter the meaning of the word; however, it is still possible if the user is careful with the consonant length. As the consonant length becomes shorter, it is harder to distinguish its voicing.
  • Replace the alveolar plosives [t] and [d] with their respective postalveolar affricates, [tS] and [dZ].
    • This often occurs when the alveolar plosive is palatalized by a nearby phoneme.
      Example: 'Don't you' /doʊnt.juː/ → 'Don't ya' /doʊnt.jə/ → 'Don't cha' /doʊn.tʃə/
    • Similar to the voicing swap, this replacement is also possible when the consonant length and stress somewhat neutralize the differences between both phonemes.
  • The Dark L is prone to a series of phonological processes and sound changes. Taking these into account, it is possible to replace the [l] accordingly.
    • In the L-vocalization process, where the Dark L tends to be warped into a close back vocoid, the [l] phoneme can be replaced with the [O:], [@U], [U], [u:] or [w] phonemes.
    • Similarly, the vowels before the Dark L can be coloured by the velarized lateral consonant. Put simply, front vowels tend to become more centralized, while central and back vowels tend to shift toward a close back vowel. Knowing this sound change, it is possible to replace or insert another vowel before the Dark L in case the phoneme combination sounds awkward.
      Example: [i: l] → [I l] → [@ l]; [V l] → [@ l] → [U l]
    • Also, an unstressed vowel before the Dark L can be completely omitted, leaving a bare syllabic L.
      Example: The word 'Twinkle' is actually pronounced as /ˈtwɪŋkl̩/ rather than /ˈtwɪŋkəl/.
      Previously there was no way to imitate this in the synthesizer, as few voicebanks had a syllabic [l] phoneme that could be produced on its own, without a vowel. In general, this was worked around by adding a vowel like [V] or [U] before the Dark L (Example: Twinkle [th w I N k][U l]); however, in some cases this sounded overpronounced. This was fixed with the release of CYBER SONGMAN, which included its own [@l] phoneme for the syllabic L, allowing a more colloquial pronunciation if the user requires it. Example: Twinkle → [th w I N k][@l l].
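
The consonant replacements listed above can be collected into a small lookup table. The following Python sketch is only an illustration: the dictionary and function names are hypothetical and not part of VOCALOID, and the pairs are taken from the related phonemes given in the consonant table. Whether a given swap actually sounds good still depends on the voicebank.

```python
# Hypothetical helper, not part of any VOCALOID tool.
# Plosive -> aspirated allophone, as listed in the consonant table.
ASPIRATED = {"p": "ph", "t": "th", "k": "kh", "b": "bh", "d": "dh", "g": "gh"}
DEASPIRATED = {asp: plain for plain, asp in ASPIRATED.items()}

# (voiceless, voiced) pairs, as listed in the consonant table.
VOICING_PAIRS = [
    ("p", "b"), ("ph", "bh"), ("t", "d"), ("th", "dh"),
    ("k", "g"), ("kh", "gh"), ("tS", "dZ"), ("f", "v"),
    ("T", "D"), ("s", "z"), ("S", "Z"),
]
VOICING = {}
for voiceless, voiced in VOICING_PAIRS:
    VOICING[voiceless] = voiced
    VOICING[voiced] = voiceless

# Alveolar plosive -> postalveolar affricate (palatalization).
AFFRICATED = {"t": "tS", "d": "dZ"}

# Candidates for the Dark L under L-vocalization.
DARK_L_VOCALIZED = ["O:", "@U", "U", "u:", "w"]


def consonant_candidates(symbol):
    """Return candidate replacements for one consonant, keyed by process."""
    out = {}
    if symbol in ASPIRATED:
        out["aspirated"] = [ASPIRATED[symbol]]
    if symbol in DEASPIRATED:
        out["deaspirated"] = [DEASPIRATED[symbol]]
    if symbol in VOICING:
        out["voicing swap"] = [VOICING[symbol]]
    if symbol in AFFRICATED:
        out["affricated"] = [AFFRICATED[symbol]]
    if symbol == "l":
        out["L-vocalized"] = list(DARK_L_VOCALIZED)
    return out


print(consonant_candidates("t"))
# {'aspirated': ['th'], 'voicing swap': ['d'], 'affricated': ['tS']}
```

Symbols with no listed counterpart, such as [h], simply return an empty result.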

Monophthong Replacement

The English phonetic system has one of the largest numbers of available vowels among the 5 languages currently available for VOCALOID (including monophthongs, diphthongs and rhotic vowels).

To replace a vowel, you need to have an idea of which vowels are closest in terms of sound quality. For this reason, it's a good idea to know which is

Ipa vowel chart for Vocaloid

Vowel chart for English, showing the rough position of the monophthongs along with the respective IPA symbols (black) and VOCALOID symbols (gray). The relative position/pronunciation of the vowels may vary according to the regional accent/dialect.

  • Open Vowels:
    ←unrounded [{] [V] [@] [Q@] [Q] rounded→
    • Also known as low vowels, they are pronounced with the mouth open and with the tongue in a low position relative to the roof of the mouth. They are characterized by their 'ah' to 'uh' sound quality and are positioned at the bottom of the IPA vowel chart.
  • Front unrounded vowels:
    ←open (lax) [{] [e] [e@] [eI] [I] [i:] close (tense)→
    • Also known as bright vowels, they are placed at the left side of the chart and are pronounced with the tongue positioned as far forward as possible and with the lips unrounded. Their sound tends to vary from a lax 'eh'-like sound toward a tenser 'ee'-like or y-like sound as the mouth progressively closes.
  • Back rounded vowels:
    ←open (lax) [Q@] [Q] [O:] [O@] [@U] [U] [u:] close (tense)→
    • Also known as dark vowels, they are placed at the right side of the chart and are pronounced with the tongue positioned as far back as possible and with the lips rounded. Their sound tends to vary from a lax 'oh'-like sound toward a tenser 'oo'-like or w-like sound as the mouth progressively closes.
  • Central vowels:
    ←unstressed [@] [@r] [V] stressed→
    • Located at the center of the chart, these vowels tend to have an undefined 'uh'-like sound. When a vowel is reduced, it may tend to shift toward a central vowel.

Knowing this, it is relatively easy to work out how to replace a phoneme.

Example: The vowel [e] may be replaced by [{] if a more open pronunciation is needed, or by [I] if a more closed one is needed.

In some instances, diphthongs and rhotic vowels may be used as replacements for monophthongs if their pronunciation is close to a pure vowel. In the case of diphthongs, this is possible for the mid vowels [eI] and [@U] if their diphthongization isn't too marked:

Example: The phoneme [eI] tends to sound like a tense [e] in some dialects.

In the case of the R-colored vowels, if the pronunciation is non rhotic, these ones

Example: The phoneme [Q@] in non-rhotic pronunciation is /ɑː/, which allows it to be used as a replacement for other open vowels like [V], [Q] and [@].
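
The open-to-close orderings given above for the front and back vowels can be treated as ordered lists, so that the nearest replacement for a vowel is simply its neighbour in the list. The Python sketch below illustrates this idea; the names `SERIES`, `MONOPHTHONGS` and `open_close_neighbors` are hypothetical, and the lists only restate the orderings from this section. It is a rough guide rather than a rule, since actual vowel positions vary by accent and voicebank.

```python
# Hypothetical helper, not part of any VOCALOID tool.
# Orderings from open (lax) to close (tense), as listed above.
SERIES = {
    "front": ["{", "e", "e@", "eI", "I", "i:"],
    "back": ["Q@", "Q", "O:", "O@", "@U", "U", "u:"],
}

# Entries of the two series that the phoneme table classifies as plain vowels.
MONOPHTHONGS = {"{", "e", "I", "i:", "Q", "O:", "U", "u:"}


def open_close_neighbors(symbol, monophthongs_only=True):
    """Return (more_open, more_close) neighbours of a vowel, or None at an end."""
    for series in SERIES.values():
        if monophthongs_only:
            seq = [s for s in series if s in MONOPHTHONGS]
        else:
            seq = list(series)
        if symbol in seq:
            i = seq.index(symbol)
            more_open = seq[i - 1] if i > 0 else None
            more_close = seq[i + 1] if i + 1 < len(seq) else None
            return more_open, more_close
    raise KeyError(symbol)


print(open_close_neighbors("e"))
# ('{', 'I')
```

With `monophthongs_only=False`, the diphthongs and rhotic vowels in each series are also offered as neighbours, which matches the suggestion that [eI] or [e@] can sometimes stand in for a pure vowel.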


Diphone Replacement/Splitting


Original Diphone Type IPA's notation Replacement for First Phoneme Replacement for Second Phoneme
[aI] Diphthong aɪ̯ [V], [{] or [Q] [e], [I], [i:] or [j]
[eI] Diphthong eɪ̯ [e] [I], [i:] or [j]
[OI] Diphthong ɔɪ̯ [Q] or [O:] [I], [i:] or [j]
[aU] Diphthong aʊ̯ [V], [{] or [Q] [O:], [U], [u:] or [w]
[@U] Diphthong oʊ̯ [Q] or [O:] [O:], [U], [u:] or [w]
[@r] Rhotic Vowel əɹ or ɚ


[Q@] Rhotic Vowel ɑː(ɹ) (UK); ɑɹ (US) [V], [{] or [Q] [@r] or [r]

[e@] Rhotic Vowel ɛə~ɛː (UK); ɛɹ (US) (none listed) [@r] or [r]

[I@] Rhotic Vowel ɪə (UK); i(ə)ɹ (US) [I] or [i:] [@r] or [r]

[O@] Rhotic Vowel ɔː(ɹ) (UK); ɔɹ~oɹ (US) [Q] or [O:] [@r] or [r]

[U@] Rhotic Vowel ʊə (UK); ʊɹ (US) [U] or [u:] [@r] or [r]
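
The table above can be read as a mapping from each diphone to its candidate first and second phonemes. As a minimal sketch, assuming the user wants to list every possible split for trial and error, the Python below enumerates the combinations. The names `DIPHONE_SPLIT` and `split_candidates` are hypothetical, and the [@r] and [e@] rows are left out because the table does not give a complete pair of replacements for them.

```python
from itertools import product

# Hypothetical helper, not part of any VOCALOID tool.
# diphone -> (replacements for first phoneme, replacements for second phoneme)
DIPHONE_SPLIT = {
    "aI": (["V", "{", "Q"], ["e", "I", "i:", "j"]),
    "eI": (["e"], ["I", "i:", "j"]),
    "OI": (["Q", "O:"], ["I", "i:", "j"]),
    "aU": (["V", "{", "Q"], ["O:", "U", "u:", "w"]),
    "@U": (["Q", "O:"], ["O:", "U", "u:", "w"]),
    "Q@": (["V", "{", "Q"], ["@r", "r"]),
    "I@": (["I", "i:"], ["@r", "r"]),
    "O@": (["Q", "O:"], ["@r", "r"]),
    "U@": (["U", "u:"], ["@r", "r"]),
}


def split_candidates(symbol):
    """List every two-phoneme split suggested by the table for one diphone."""
    first, second = DIPHONE_SPLIT[symbol]
    return [f"[{a}] [{b}]" for a, b in product(first, second)]


print(split_candidates("eI"))
# ['[e] [I]', '[e] [i:]', '[e] [j]']
```

Each candidate still has to be auditioned in the editor, as the best split depends on the voicebank and on the lengths given to the two notes.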


Continued development

English VOCALOID currently offers the second-largest selection of voicebanks overall, though this is mostly due to the language having been present since VOCALOID's earliest days. It is not the second most popular language: as of 2018, that is considered to be Chinese, although for a long time English held that position. However, this was because at one stage it was the only language besides Japanese available for sale, and Chinese had yet to take off fully and gain momentum. English has received the second-highest number of refinements over time, with the most being seen from VOCALOID4 onward.

The language was mostly held back in the pre-2012 period by production issues. As previously mentioned, English voicebanks are harder to produce, more expensive than even Japanese-made voicebanks, and can bring in less profit per sale. They take longer to make and are slower to develop, requiring much more work overall. After their sluggish start from 2004 to 2012, English vocals were held back for quite some years, with studios acknowledging that there is a demand but not being in a position to meet it.[30]

The majority of the post-VOCALOID2-era voicebanks have been from non-native sources, with Japanese companies producing more English voicebanks than native ones. In VOCALOID3 the number of non-native English voicebanks (Miku, GUMI, KAITO, MEIKO, Macne Nana) exceeded that of natives (OLIVER, AVANNA, YOHIOloid) for the first time. In VOCALOID4 the number of non-natives (Miku, Luka, Fukase, Macne Nana, and the Kagamines) also exceeded the number of natives (Ruby, DEX, DAINA, CYBER DIVA, and CYBER SONGMAN). Every non-native English voicebank released so far has also had a Japanese accent.

One of the issues is that the number of studios producing native voicebanks has shrunk: VOCALOID5 saw the loss of PowerFX Systems AB., which has greatly reduced the potential for future VOCALOID releases. This was followed by Zero-G Limited, who responded to a fan's e-mail in 2020 that they had no VOCALOID voicebanks planned. This has left YAMAHA Corporation, who released their 3rd and 4th voicebanks Amy and Chris, as the only company currently confirmed to be releasing English vocals. The lack of native involvement stems from VOCALOID's early days, before it was even released, when every studio approached by Crypton Future Media, Inc., who were left in charge of finding English studios for support, was unimpressed by the technology, with one company having infamously called it a "toy". The long-term effects of the situation have only served to cause problems for the continued development of English VOCALOID.

See also

Conversion Lists
Interwiki articles


External links