Audio Video

 

Audio Speech Recognition

  • Open Source
  1. CMU Sphinx – Earlier versions were written entirely in C/C++, whereas the current version, Sphinx4, is written entirely in Java.
  2. Cambridge HTK – C/C++ based. It is not fully open source because of its restrictive license.
  3. Kyoto Julius – C/C++ based. Recompiling Julius is fairly difficult, however.
  4. Mississippi ISIP – C/C++ based. I have never tried it.
  5. IBM ViaVoice – IBM announced that it would release the ViaVoice source code to the public, but the release does not seem to have materialized.
  6. slurred – Java based. A very handy, compact module. Compared to Sphinx4, slurred is smaller and easier to work with.
  7. VoxForge – “VoxForge collects user-submitted speech audio files for the creation of Acoustic Models for Free and Open Source Speech Recognition Engines such as HTK, Julius, ISIP and Sphinx.” (Cited from Sourceforge)
  • Key Technologies
  1. MFCC – “In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.” (cited from Wikipedia)
  2. DTW – “Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed.” (cited from Wikipedia)
  3. HMM – “A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved state. An HMM can be considered as the simplest dynamic Bayesian network.” (cited from Wikipedia)
  4. GMM – “In statistics, a mixture model is a probabilistic model for density estimation using a mixture distribution. A mixture model can be regarded as a type of unsupervised learning or clustering.” (cited from Wikipedia)
  5. VAD – “Voice activity detection (also known as speech activity detection or, more simply, speech detection) is a technique used in speech processing wherein the presence or absence of human speech is detected in regions of audio (which may also contain music, noise, or other sound).” (cited from Wikipedia)
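
To make the DTW entry above more concrete, here is a minimal sketch of the textbook dynamic-programming formulation. It assumes one-dimensional numeric sequences and an absolute-difference local cost. The function name `dtw_distance` and the sample data are illustrative only and do not come from any of the toolkits listed above. A real recognizer would compare sequences of feature vectors (for example MFCC frames) instead of scalars.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two 1-D sequences (illustrative sketch)."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance measure
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# The same shape spoken "slowly" and "quickly" aligns with zero cost.
slow = [1, 1, 2, 2, 3, 3]
fast = [1, 2, 3]
print(dtw_distance(slow, fast))       # 0.0
print(dtw_distance(fast, [1, 2, 4]))  # 1.0
```

The table is filled in O(n·m) time and memory. Practical implementations usually add a warping-window constraint to limit how far the alignment may drift from the diagonal, which this sketch omits.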

A brief summary of Automatic Speech Recognition is given at ASR.

 

Audio Visual Speech Recognition

  • Open Source
  1. Intel AVCSR – Not fully open source: only part of the code is provided, for MS Windows, and the key components are hidden in several binary “.dll” files.
  2. IBM AVSTG – IBM’s audio-visual speech technology.
  3. UIUC AVICAR – Audio-visual speech recognition at UIUC for in-car control.
  • Key Technologies
  1. CHMM – The coupled hidden Markov model was first proposed by Dr. Matthew Brand in the paper “Coupled hidden Markov models for modeling interacting processes” in 1997, when he was at MIT.
  2. FHMM – The fused hidden Markov model was first proposed by Dr. Hao Pan of UIUC in the paper “A Fused Hidden Markov Model with Application to Bimodal Speech Processing”.
  3. MSHMM – The multistream hidden Markov model was first proposed by Dr. Dupont and Dr. Luettin in the paper “Audio-visual Speech Modeling for Continuous Speech Recognition” in 2000.
  4. AHMM – The asynchronous hidden Markov model was first proposed by Dr. Samy Bengio of IDIAP in the paper “Multimodal Speech Processing Using Asynchronous Hidden Markov Models”.
