Read My Lips: Facial Animation Techniques
Anyone who has ever been in a professional production situation
realizes that real-world coding these days requires a broad range of
expertise. When this expertise is lacking, developers need to be humble
enough to look things up and turn to people around them who are more
experienced in that particular area.
As I continue to explore areas of graphics technology,
I have attempted to document the research and resources I have used in
creating projects for my company. My research demands change from month to
month depending on what is needed at the time. This month, I have the need
to develop some facial animation techniques, particularly lip sync. This
means I need to shelve my physics research for a bit and get some other
work done. I hope to get back to moments of inertia, and such, real soon.
And Now for Something Completely Different
My problem right now is facial animation. In particular, I need to
know enough to create a production pathway and technology to
display real-time lip sync. My first step when trying to develop new
technology is to take a historical look at the problem and examine previous
solutions. The first people I could think of who had explored facial
animation in depth were the animators who created cartoons and feature
animation in the early days of Disney and Max Fleischer.
Facial animation in games has built on this
tradition. Chiefly, this has been achieved through cut-scene movies
animated using many of the same methods. Games like Full Throttle
and The Curse of Monkey Island used facial animation for their
2D cartoon characters in the same way that the Disney animators would
have. More recently, games have begun to include some facial animation in
real-time 3D projects. Tomb Raider has had scenes in which the 3D
characters pantomime the dialog, but the face is not actually animated.
Grim Fandango uses texture animation and mesh animation for a basic
level of facial animation. Even console titles like Banjo Kazooie
are experimenting with real-time “lip-flap” without having a dialog
track. How do I leverage this tradition in my own project?
Phonemes and Visemes
No discussion of facial animation is possible without
discussing phonemes. Jake Rodgers’s article “Animating Facial Expressions”
(Game Developer, November 1998) defined a phoneme as an abstract
unit of the phonetic system of a language that corresponds to a set of
similar speech sounds. More simply, phonemes are the individual sounds
that make up speech. A naive facial animation system may attempt to create
a separate facial position for each phoneme. However, in English (at least
where I speak it) there are about 35 phonemes. Other regional dialects may
add more.
Now, that’s a lot of facial positions to create and keep
organized. Luckily, the Disney animators realized a long time ago that
using all phonemes was overkill. When creating animation, an artist is not
concerned with individual sounds, just how the mouth looks while making
them. Fewer facial positions are necessary to visually represent speech
since several sounds can be made with the same mouth position. These
visual references to groups of phonemes are called visemes. How do I know
which phonemes to combine into one viseme? Disney animators relied on a
chart of 12 archetypal mouth positions to represent speech, as you can see
in Figure 1.
Figure 1. The 12 classic Disney mouth positions.
Each mouth position or viseme represented one or more
phonemes. This reference chart became a standard method of creating
animation. As a game developer, however, I am concerned with the number of
positions I need to support. What if my game only has room for eight
visemes? What if I could support 15 visemes? Would it look
better?
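In code, the phoneme-to-viseme grouping is really just a lookup table. The C++ sketch below is one hedged way to express it, assuming a hypothetical eight-position viseme set and ARPABET-style phoneme symbols; the grouping shown is purely illustrative, not a settled chart.

// A minimal sketch of a phoneme-to-viseme lookup (illustrative grouping only).
#include <cstdio>
#include <map>
#include <string>

// Hypothetical indices for an eight-position viseme set.
enum Viseme {
    VISEME_REST = 0,  // silence: relaxed, closed mouth
    VISEME_AI,        // open mouth: "a", "i"
    VISEME_E,         // spread lips: "e"
    VISEME_O,         // rounded lips: "o"
    VISEME_U,         // tight rounding: "u", "w"
    VISEME_MBP,       // pressed lips: "m", "b", "p"
    VISEME_FV,        // lower lip against upper teeth: "f", "v"
    VISEME_LTD        // tongue against teeth: "l", "t", "d"
};

// Several phonemes (ARPABET-style symbols) collapse onto one mouth position.
static const std::map<std::string, Viseme> kPhonemeToViseme = {
    {"AA", VISEME_AI}, {"AE", VISEME_AI}, {"IY", VISEME_E}, {"EH", VISEME_E},
    {"OW", VISEME_O},  {"UW", VISEME_U},  {"W",  VISEME_U}, {"M",  VISEME_MBP},
    {"B",  VISEME_MBP},{"P",  VISEME_MBP},{"F",  VISEME_FV},{"V",  VISEME_FV},
    {"L",  VISEME_LTD},{"T",  VISEME_LTD},{"D",  VISEME_LTD}
};

// Unknown phonemes fall back to the rest position.
Viseme VisemeForPhoneme(const std::string& phoneme)
{
    auto it = kPhonemeToViseme.find(phoneme);
    return (it != kPhonemeToViseme.end()) ? it->second : VISEME_REST;
}

int main()
{
    // "M", "AA", "P" ("map") -> pressed lips, open mouth, pressed lips.
    const char* word[] = { "M", "AA", "P" };
    for (int i = 0; i < 3; ++i)
        printf("%s -> viseme %d\n", word[i], VisemeForPhoneme(word[i]));
    return 0;
}

Swapping in a 12- or 15-viseme set only changes the enum and the table, which is exactly the knob those questions are asking about.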
Throughout my career, I have seen many facial animation
guidelines with different numbers of visemes and different organizations
of phonemes. They all seem to be similar to the Disney 12, but they also
seem to have involved animators talking into a mirror and doing some
guessing.
I wanted to establish a method that would be optimal
for whatever number of visemes I wanted to support. Along with the
animator’s eye for mouth positions, there are more scientific models
that reduce sounds into visual components. For the deaf community, which
does not hear phonemes, spoken language recognition relies entirely on lip
reading. Lip reading bases speech recognition on 18 speech
postures. Some of these mouth postures show very subtle differences that a
hearing individual may not see.
So, the Disney 12 and the lip reading 18 are a good
place to start. However, making sense of the organization of these lists
requires a look at what is physically going on when we speak. I am
fortunate to have a linguist right in the office. It’s times like this
when it helps to know people in all sorts of fields, no matter how
obscure.
_______________________________________________________________
Science Break