Squeak Class Documentation
 
PhonemeRecognizerMorph
  category: Speech-Phoneme Recognizer
  superclass: AlignmentMorph
  subclasses:

I am an experimental phoneme recognizer. My approach to phoneme recognition is fairly crude, because the goal is not full speech recognition but merely an approximation close enough to drive the mouth of an animated character from speech input.

How It Works

The phoneme recognizer has a collection of phoneme examples that were recorded in advance. Each of these phonemes has a "feature vector" that describes that phoneme. Currently, feature vectors are basically a simplified version of the frequency spectrum, measuring the sound energy in about two dozen frequency bands up to around 4000 Hz. The exact parameters are described in the class initialization method of PhonemeRecord and can be tweaked.
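As a rough illustration of what such a feature vector could look like, the sketch below sums spectral energy into frequency bands. It is not the PhonemeRecord code: the band count of 24, the 4000 Hz cutoff, the equal-width bands, the naive DFT, and the names `magnitude_spectrum` and `band_energies` are all assumptions made for this example.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for bins 0..N/2 (slow, but fine for short windows)."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def band_energies(samples, sampling_rate, band_count=24, max_hz=4000.0):
    """Sum squared magnitudes into equal-width bands below max_hz (assumed layout)."""
    mags = magnitude_spectrum(samples)
    bin_hz = sampling_rate / len(samples)  # frequency spacing between DFT bins
    features = [0.0] * band_count
    for k, mag in enumerate(mags):
        freq = k * bin_hz
        if freq >= max_hz:
            break  # ignore everything at or above the cutoff
        band = int(freq * band_count / max_hz)
        features[band] += mag * mag
    return features
```

With these assumed parameters, a pure 1000 Hz tone sampled at 8000 Hz puts nearly all of its energy into a single band, which is the kind of compact signature the recognizer compares between sounds.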

To do phoneme recognition, a short window of sound is analyzed via FFT, and its feature vector is extracted. This feature vector is then compared to the feature vectors of all phonemes in the example set. The phoneme that matches most closely is considered the currently sounding phoneme. This phoneme's name is presented in the PhonemeRecognizer display. On a reasonably fast machine, the current phoneme can be watched by a program and used to drive real-time mouth animation. This phoneme matching approach is similar to some of the earliest speech recognition work. However, current speech recognition software is generally driven by features derived from a linear predictive or vocal tract model of speech, rather than the raw spectrum data.
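The matching step can be sketched as a nearest-neighbor search over the example set. The text above does not say which distance measure is used, so the Euclidean distance here is an assumption, as are the names `distance` and `find_closest` and the representation of examples as (name, feature vector) pairs.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_closest(features, examples):
    """Return the name of the example whose features are nearest, or None.

    examples is a list of (name, feature_vector) pairs.
    """
    best_name = None
    best_distance = float("inf")
    for name, example_features in examples:
        d = distance(features, example_features)
        if d < best_distance:
            best_name = name
            best_distance = d
    return best_name
```

For example, `find_closest([0.9, 0.2, 0.0], [("ah", [1, 0, 0]), ("ee", [0, 1, 0])])` picks "ah", because that example lies closer to the input vector.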

How to Use It

The first step is to plan how many different mouth positions will be used by the animation, and which phonemes map to which mouth positions. Traditional animators might draw four mouth positions for vowels and four to six for consonants.

The person whose speech is to be recognized then records phoneme examples for the phonemes to be recognized. For animation, these phonemes might consist of the vowel sounds "eh", "ee", "ah", "o", "u" and the consonants "n", "r", "s", "sh", "th", "z", "l", "w", and "m". In some cases "f" and "v" might also be included. The consonants "b" and "p" are also significant in animation, since these sounds, like "m", bring the lips together. Unfortunately, "b" and "p" are tricky to recognize with the scheme used here because they are actually two things in quick succession: a momentary silence followed by the sound of the released breath. However, if the mouth position for silence is drawn with the mouth closed, then the animation of "b" and "p" will probably look okay.

Each phoneme example is recorded by clicking the "add" button and speaking the phoneme into the microphone. Leading and trailing silence is automatically removed. The user is prompted for the phoneme name and a mouth position index. The name is just a mnemonic for the user. The index can be used to select a costume from a holder during animation. It is handy to list and number the mouth positions before recording the phoneme example set.
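Removing leading and trailing silence could work along the following lines. This is only a sketch: the amplitude threshold of 0.02, the assumption that samples are floats in the range -1.0 to 1.0, and the name `trim_silence` are illustrative and not taken from the recognizer itself.

```python
def trim_silence(samples, threshold=0.02):
    """Drop samples quieter than threshold from both ends of the recording."""
    start = 0
    end = len(samples)
    # Advance past the quiet samples at the beginning.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Back up past the quiet samples at the end.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```

Quiet samples in the middle of the recording are kept; only the two ends are trimmed, so a recording that is silent throughout comes back empty.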

A phoneme can be reviewed with the "play phoneme" menu command. If it contains noise, includes slides between several different sounds, or doesn't sound like a representative example of the phoneme, delete it and record it again. English contains a number of "diphthongs"--vowel sounds that are actually slides between two different vowel sounds, as in the words "boy" or "boat". It is best to record each component of a diphthong individually. You can also set the name and mouth position index for the "silence" phoneme, the phoneme that is reported whenever the input sound falls below a certain threshold. A graphical view of the feature vector for a given phoneme can be generated by selecting "show phoneme features" from the menu. A phoneme set can be saved to a file and restored later.

Once you have recorded your phoneme examples, you can try them by clicking the "run" button and speaking into the microphone. You should see the phoneme display update to report the current match. The "match sound file" menu command can be used to analyze an entire AIFF or WAV sound file at once. The resulting phoneme stream is currently reported by opening an inspector on the phoneme list. There is one phoneme in this list for each 1/24th-of-a-second window of sound in the sound file.
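The 1/24th-of-a-second windowing can be illustrated with a little arithmetic. In the sketch below, the integer rounding of the window size and the dropping of a partial window at the end of the file are assumptions, since the text does not say how the recognizer handles either; `split_into_windows` is a hypothetical name.

```python
def split_into_windows(samples, sampling_rate, windows_per_second=24):
    """Cut samples into consecutive full windows of 1/24 second each."""
    size = sampling_rate // windows_per_second  # e.g. 22050 // 24 == 918
    # Stop early enough that every window is complete (assumed behavior).
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]
```

Under these assumptions, two seconds of sound at 22050 samples per second yields 48 windows of 918 samples, so the resulting phoneme list would have 48 entries.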

To allow use of the phoneme recognizer in tile scripts, the "mouth position tile" menu command creates a tile that reports the current phoneme's mouth position index. This can be used to set the cursor of a holder containing the set of mouth position drawings. A two-line tile script can thus drive the mouth of an animated character.
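The logic of such a tile script can be modeled outside Squeak as follows. The `Holder` class is a hypothetical stand-in for a holder of mouth drawings, the 1-based cursor is an assumption about how the holder is indexed, and the clamping is added here only to keep the example safe; none of this is the actual tile scripting API.

```python
class Holder:
    """Hypothetical stand-in for a holder of mouth position drawings."""

    def __init__(self, drawings):
        self.drawings = list(drawings)
        self.cursor = 1  # assumed to be 1-based

    def set_cursor(self, index):
        # Clamp so an unexpected index never selects a missing drawing.
        self.cursor = max(1, min(len(self.drawings), index))

    def current_drawing(self):
        return self.drawings[self.cursor - 1]


def update_mouth(holder, mouth_position):
    """Point the holder at the drawing for the reported mouth position."""
    holder.set_cursor(mouth_position)
    return holder.current_drawing()
```

Calling `update_mouth` once per animation step with the recognizer's current mouth position index mirrors what the two-line tile script does: read the index, then set the holder's cursor.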

instance methods
  accessing
    currentPhonemeMouthPosition
    currentPhonemeName
    getMouthPosition

  analysis
    findMatchFor:samplingRate:

  button and menu commands
    addPhoneme
    changePhonemeDetails
    deletePhoneme
    invokeMenu
    makeTile
    matchSoundFile
    playPhoneme
    readPhonemes
    savePhonemes
    setSilentPhoneme
    showPhonemeFeatures
    startRecognizing
    stopRecognizing

  initialization
    initialize

  private
    addButtonRows
    addLevelSlider
    addPhonemeDisplay
    addTitle
    buttonName:action:
    makeLevelMeter
    makeStatusLight
    promptForDetailsOfPhoneme:
    selectPhonemeFromMenu
    selectPhonemeFromMenu:

  stepping
    step
    stepTime
    stopStepping

class methods
  no messages
 

instance methods
  accessing
 

currentPhonemeMouthPosition

Answer the mouth position index (a positive integer) of the currently matching phoneme.


 

currentPhonemeName

Answer the name of the currently matching phoneme.


 

getMouthPosition

Answer the mouth position index (a positive integer) of the currently matching phoneme. Sent by tile scripts.


  analysis
 

findMatchFor:samplingRate:

Find the phoneme whose features most closely match those of the given sound buffer.


  button and menu commands
 

addPhoneme

Record and add a new phoneme example to my phoneme set. Prompt the user for its name and mouth position.


 

changePhonemeDetails

Change the name and mouth position index of a phoneme specified by the user.


 

deletePhoneme

Delete a phoneme specified by the user.


 

invokeMenu

Invoke the settings menu.


 

makeTile

Make a scripting tile to fetch the current phoneme's mouth position. Attach it to the hand, allowing the user to drop it directly into a tile script.


 

matchSoundFile

Process an AIFF or WAV sound file and generate a sequence of phoneme matches for that file in the Transcript. When done, open an inspector on the resulting collection of phonemes.


 

playPhoneme

Play a phoneme specified by the user.


 

readPhonemes

Read a previously saved phoneme set from a file.


 

savePhonemes

Save the current phoneme set in a file.


 

setSilentPhoneme

Prompt the user for the name and mouth position associated with silence.


 

showPhonemeFeatures

Show a graph of the features array for the phoneme selected by the user.


 

startRecognizing

Start recognizing phonemes from the sound input.


 

stopRecognizing

Stop listening.


  initialization
 

initialize


  private
 

addButtonRows

Create and add my button row.


 

addLevelSlider

Create and add a slider to set the sound input level. This level is used both when recognizing and adding phonemes.


 

addPhonemeDisplay

Add a display to show the currently matching phoneme.


 

addTitle

Add a title.


 

buttonName:action:

Create a button of the given name to send myself the given unary message.


 

makeLevelMeter

Create a recording level meter.


 

makeStatusLight

Create a status light to show when the recognizer is running.


 

promptForDetailsOfPhoneme:

Prompt the user for the name and mouth position of the given phoneme.


 

selectPhonemeFromMenu

Answer the phoneme selected by the user from a menu of the current phoneme records. Answer nil if the user does not select any phoneme.


 

selectPhonemeFromMenu:

Answer the phoneme selected by the user from a menu of the current phoneme records. Answer nil if the user does not select any phoneme.


  stepping
 

step

Update the record light, level meter, and display.


 

stepTime

Answer the desired time between steps in milliseconds. This default implementation requests that the 'step' method be called once every second.


 

stopStepping

Turn off recording.


class methods
  no messages