Audio Video
Audio Speech Recognition
- Open Source
- CMU Sphinx – Unlike earlier versions, which were written entirely in C/C++, the current version, Sphinx4, is written entirely in Java.
- Cambridge HTK – C/C++ based. It is not fully open source because of its restrictive license.
- Kyoto Julius – C/C++ based. However, recompiling Julius is quite difficult.
- Mississippi ISIP – C/C++ based. I have not tried it.
- IBM ViaVoice – IBM announced that it would release the ViaVoice source code to the public, but it appears that IBM never followed through.
- slurred – Java based. A very handy and compact module. Compared to Sphinx4, slurred is smaller and easier to work with.
- VoxForge – “VoxForge collects user-submitted speech audio files for the creation of Acoustic Models for Free and Open Source Speech Recognition Engines such as HTK, Julius, ISIP and Sphinx.” (Cited from Sourceforge)
- Key Technologies
- MFCC – “In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.” (cited from Wikipedia)
- DTW – “Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed.” (cited from Wikipedia)
- HMM – “A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved state. An HMM can be considered as the simplest dynamic Bayesian network.” (cited from Wikipedia)
- GMM – “In statistics, a mixture model is a probabilistic model for density estimation using a mixture distribution. A mixture model can be regarded as a type of unsupervised learning or clustering.” (cited from Wikipedia)
- VAD – “Voice activity detection (also known as speech activity detection or, more simply, speech detection) is a technique used in speech processing wherein the presence or absence of human speech is detected in regions of audio (which may also contain music, noise, or other sound).” (cited from Wikipedia)
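To make the DTW entry above more concrete, here is a minimal sketch of the classic dynamic-programming formulation. It assumes one-dimensional numeric sequences and an absolute-difference local cost; a real recognizer would more likely compare sequences of feature vectors (for example MFCC frames) with a vector distance. The function name `dtw_distance` is purely illustrative and does not come from any of the toolkits listed above.

```python
def dtw_distance(a, b):
    """Return the cumulative DTW alignment cost between two numeric sequences.

    Assumption: the local cost is the absolute difference between samples.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because one sequence may repeat a sample while the other advances, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` evaluates to zero even though the two sequences differ in length, which is the "vary in time or speed" property mentioned in the quotation.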
A brief summary of Automatic Speech Recognition is given at ASR.
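As a companion to the HMM entry in the list above, the following sketch computes the likelihood of an observation sequence with the forward algorithm. It assumes a discrete-observation HMM described by plain Python lists and a non-empty observation sequence. Speech recognizers typically use continuous densities such as GMMs for the emissions and work in the log domain to avoid underflow, so this is only an illustration. All names (`forward_likelihood`, `init`, `trans`, `emit`) are hypothetical.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return P(observations) under a discrete HMM using the forward algorithm.

    init[s]     : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of emitting symbol o in state s
    Assumption: observations is a non-empty list of symbol indices.
    """
    n_states = len(init)
    # alpha[s] = P(first t observations, state at time t is s)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs]
            for s in range(n_states)
        ]
    # Summing over the final hidden state marginalizes out the unobserved path.
    return sum(alpha)
```

The hidden state sequence never appears in the result; it is summed out step by step, which is what "unobserved state" means in practice.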
Audio Visual Speech Recognition
- Open Source
- Intel AVCSR – It is not fully open source: only part of the code is provided, for MS Windows, with the key parts hidden in several binary “.dll” files.
- IBM AVSTG – IBM’s audio-visual speech technology.
- UIUC AVICAR – Audio-visual speech recognition at UIUC for in-car control.
- Key Technologies
- CHMM – The coupled hidden Markov model was first proposed by Dr. Matthew Brand in the paper “Coupled hidden Markov models for modeling interacting processes” in 1999, when he was at MIT.
- FHMM – The fused hidden Markov model was first proposed by Dr. Hao Pan of UIUC in the paper “A Fused Hidden Markov Model with Application to Bimodal Speech Processing”.
- MSHMM – The multistream hidden Markov model was first proposed by Dr. Dupont and Dr. Luettin in the paper “Audio-visual Speech Modeling for Continuous Speech Recognition” in 2000.
- AHMM – The asynchronous hidden Markov model was first proposed by Dr. Samy Bengio of IDIAP in the paper “Multimodal Speech Processing Using Asynchronous Hidden Markov Models”.
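The models above differ mainly in how they tie the audio and visual streams together. As a rough illustration of the multistream idea only, the sketch below combines per-stream log-likelihoods with stream weights, a form commonly described for multistream models: the weighted sum of log-likelihoods corresponds to a product of stream likelihoods raised to their weights. Whether the cited papers use exactly this form is an assumption here, and the function names, the default weight of 0.7, and the example scores are all hypothetical.

```python
def combine_stream_scores(log_audio, log_video, audio_weight=0.7):
    """Weighted sum of audio and video log-likelihoods for one hypothesis.

    Assumption: the two stream weights are non-negative and sum to 1.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_audio + (1.0 - audio_weight) * log_video


def best_hypothesis(audio_scores, video_scores, audio_weight=0.7):
    """Pick the label with the highest combined score.

    audio_scores and video_scores map each label to a log-likelihood.
    """
    return max(
        audio_scores,
        key=lambda label: combine_stream_scores(
            audio_scores[label], video_scores[label], audio_weight
        ),
    )
```

With made-up scores such as `{"yes": -10.0, "no": -9.0}` for audio and `{"yes": -3.0, "no": -8.0}` for video, the audio stream alone slightly prefers "no", while the weighted combination prefers "yes". Lowering `audio_weight` is one way to lean on the visual stream when the audio is noisy.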




