A Study of Features and Processes Towards Real-time Speech Word Recognition
Abstract
Word recognition techniques are reviewed. An exhaustive comparative study of many of the factors that affect recognition accuracy is presented. Experiments centred on four major areas of word recognition are described: pre-processing techniques, recognition features, recognition algorithms and distance measures. Recognition accuracy, in the context of each of these four areas, is investigated using the digit vocabulary spoken by 10 New Zealand (6 male and 4 female) and 38 American (20 male and 18 female) speakers. Pre-processing techniques examined are the type of window, the length of the data frame, data frame overlap, and pre-emphasis. Acoustic features tested include temporal features such as energy and zero-crossing rate, as well as frequency-based acoustic representations such as linear prediction coefficients, cepstral coefficients, dynamic (transitional) cepstral coefficients, and perceptual linear prediction coefficients. Three types of distance measures are also reported on: the Euclidean, the weighted Euclidean, and the projection. Two methods of training, random template selection and clustering, are investigated. Accuracy improvement by combining different features is also examined. Implementation of a real-time word recognition system, designed on the basis of the comparative study and experiments, is described. The system is based on a TMS320C30 and takes around 0.03 seconds per recognition. The real-time system achieves speaker-dependent accuracies greater than 95% and speaker-independent accuracies greater than 70% for the digit vocabulary. An examination is also made of two methods of continuous recognition using sub-word representations. Both these methods take advantage of isolated word recognition techniques such as dynamic programming. A segmentation method and a non-segmentation method were investigated. Accuracy of the segmentation recognition method is found to depend linearly on the accuracy of the segmenter.
With a segmentation error of 22%, an average recognition accuracy of 90.7% was obtained for 10 vowels and 2 consonants. For the non-segmentation recognition method, an average accuracy of 75% was obtained. Although the segmentation method produced higher accuracies than the non-segmentation method, it is argued that the removal of the segmentation is an advantage that greatly simplifies the recognition strategy.
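
As a rough illustration of how the pieces named in the abstract fit together, the sketch below chains pre-emphasis, overlapping windowed frames, two temporal features (energy and zero-crossing rate), a Euclidean distance, and template matching by dynamic programming. It is not the thesis implementation: the function names, the Hamming window, the pre-emphasis coefficient of 0.95, and the unconstrained warping path are all assumptions chosen for brevity, and the weighted Euclidean and projection measures are not shown.

```python
import math


def pre_emphasize(signal, alpha=0.95):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def frames(signal, length, overlap):
    """Cut the signal into Hamming-windowed frames of `length` samples
    that share `overlap` samples with the previous frame (length >= 2)."""
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the frame length")
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
              for n in range(length)]
    return [[s * w for s, w in zip(signal[start:start + length], window)]
            for start in range(0, len(signal) - length + 1, step)]


def temporal_features(frame):
    """Return [energy, zero-crossing rate] for one frame."""
    energy = sum(s * s for s in frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(frame) - 1)]


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b, local=euclidean):
    """Dynamic-programming (time-warping) distance between two sequences
    of feature vectors, with no slope or band constraints."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = local(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[rows][cols]


def recognise(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates,
               key=lambda label: dtw_distance(utterance, templates[label]))
```

In this sketch, `templates` is a hypothetical dictionary mapping each vocabulary word to one stored feature sequence; a speaker-independent setup of the kind the abstract describes would hold several templates per word, chosen at random or by clustering.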