While there are many theoretical papers online about how to do this, I couldn’t find anything that addressed this specific problem beyond the following resource. The image on the right is from the bookmarked site.
There are 14 different mouth images (including silence), which roughly correspond to the Disney 13. English has forty-some phonemes, and SAPI reduces those to 22 viseme events. I first need to reduce the 22 visemes down to 13.
Here is a list of the visemes:

```
0,  // silence
1,  // ae, ax, ah
2,  // aa
3,  // ao
4,  // ey, eh, uh
5,  // er
6,  // y, iy, ih, ix
7,  // w, uw
8,  // ow
9,  // aw
10, // oy
11, // ay
12, // h
13, // r
14, // l
15, // s, z
16, // sh, ch, jh, zh
17, // th, dh
18, // f, v
19, // d, t, n
20, // k, g, ng
21, // p, b, m
```
Reducing this down to 13 goes as follows (based on looking at the list and the chart):
In the following, subgroups of visemes that share a mouth shape are grouped together, so visemes 1, 2, 3, 5, and 11 would all use the same mouth animation. The order corresponds to the order of the images in the diagram.
- 15, 19
- 6, 9, 12, 20
- 1, 2, 3, 5, 11
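The merges above could be captured in a small lookup table. This is just a sketch: only the three groups listed above are included, and any viseme not listed is (hypothetically) assumed to map to itself.

```python
# Collapse raw SAPI visemes toward the reduced image set.
# Only the three merges from the list above are shown here;
# unlisted visemes fall through unchanged.
VISEME_MERGES = {
    19: 15,                    # d/t/n shares an image with s/z
    9: 6, 12: 6, 20: 6,        # aw, h, k/g/ng fold into the 6 group
    2: 1, 3: 1, 5: 1, 11: 1,   # aa, ao, er, ay fold into the 1 group
}

def reduce_viseme(v):
    """Map a raw SAPI viseme number to its reduced group."""
    return VISEME_MERGES.get(v, v)
```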
Now, our animated face only has two visemes: mouth open and mouth closed. The nice thing about a mechanical system is that it automatically “morphs” between the two states in a natural way. Reducing the above states down to two mouth states might look something like the following:
Mouth open: 6, 7, 10, 11, 12, 13, 14
Mouth closed: 1, 2, 3, 4, 5, 8, 9
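The two-state reduction is simple enough to express as a set-membership test. A minimal sketch (treating 0, silence, as closed):

```python
# Reduced visemes that open the mouth; everything else closes it.
MOUTH_OPEN = {6, 7, 10, 11, 12, 13, 14}

def mouth_state(viseme):
    """Return 'open' or 'closed' for a reduced viseme number."""
    return "open" if viseme in MOUTH_OPEN else "closed"
```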
Now to test this with the animated mouth.
Here is a video of the mouth animated using the following phoneme list:
triggerphonemes = [17,19,24,25,26,29,30,31,32,33,37,38,39,40,41,42,45,46,47,48]
Now here is a video of the mouth animated using the following viseme list:
triggervisemes = [6,7,10,11,12,13,14]

or, in terms of the actual SAPI visemes:
triggervisemes = [17,14,6,9,12,20,4,1,2,3,5,11,8,10]
The visemes seem better, but a few things are missing: the K in “OK” doesn’t come out.
The problem is that when two mouth-open visemes occur back to back, the mouth doesn’t close in between. For example, in “OK” your mouth says “O” (open, close) and then “K” (open, close).
I updated the code so that when a mouth-open viseme arrives, it looks ahead to the next viseme. If that one is also a mouth-open viseme, it holds the mouth open for half the viseme’s duration and then closes it, giving a much more realistic animation.
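The look-ahead fix could be sketched like this. This is not the actual code from the project: `events` is an assumed list of `(viseme, duration)` pairs, and `set_mouth` and `sleep` are hypothetical hooks standing in for the servo driver and the timer.

```python
# When two mouth-open visemes occur back to back, hold the mouth
# open for half the viseme's duration, then close it early so the
# next open reads as a distinct motion.
MOUTH_OPEN = {6, 7, 10, 11, 12, 13, 14}

def animate(events, set_mouth, sleep):
    for i, (viseme, duration) in enumerate(events):
        is_open = viseme in MOUTH_OPEN
        next_open = i + 1 < len(events) and events[i + 1][0] in MOUTH_OPEN
        if is_open and next_open:
            set_mouth("open")
            sleep(duration / 2)
            set_mouth("closed")  # close early before the next open viseme
            sleep(duration / 2)
        else:
            set_mouth("open" if is_open else "closed")
            sleep(duration)
```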
Here is an animation of the updated viseme based animation:
The next problem is as follows: the mouth does not open and close properly when saying “my microprocessor.” I have the animation set to trigger on each word event, so that the mouth opens at the beginning of a word (reasonable). But if a word starts with a mouth-closed sound and ends with a mouth-open sound (like “my”), this doesn’t work out! I will turn off the word trigger and see how things look.
That’s it! I have slowed the speech rate way down so you can see all the details. The syncing is just about perfect, and I think it is very believable. Take a look at the following movie to see the final product: