Student Theses

SPEECH ENHANCEMENT USING A LAPLACIAN-BASED MMSE ESTIMATOR OF THE MAGNITUDE SPECTRUM

Chen Bin, PhD

December 2005

A number of speech enhancement algorithms based on MMSE spectrum estimators have been proposed over the years. Although some of these algorithms were developed based on Laplacian and Gamma distributions, no optimal spectral magnitude estimators were derived. This dissertation focuses on optimal estimators of the magnitude spectrum for speech enhancement. We present an analytical solution for estimating in the MMSE sense the magnitude spectrum when the clean speech DFT coefficients are modeled by a Laplacian distribution and the noise DFT coefficients are modeled by a Gaussian distribution. Furthermore, we derive the MMSE estimator under speech presence uncertainty and a Laplacian statistical model. Results indicated that the Laplacian-based MMSE estimator yielded less residual noise in the enhanced speech than the traditional Gaussian-based MMSE estimator. Overall, the present study demonstrates that the assumed distribution of the DFT coefficients can have a significant effect on the quality of the enhanced speech.

Download thesis:  [PDF - 1.7M]

 


NOISE ESTIMATION ALGORITHMS FOR HIGHLY NON-STATIONARY ENVIRONMENTS

Sundarrajan Rangachari, M.S.E.E.

August 2004

The quality and intelligibility of speech in the presence of background noise can be improved by speech enhancement algorithms. This thesis addresses the issue of estimating the noise spectrum for speech enhancement applications. Two noise estimation algorithms are proposed for highly non-stationary noise environments. In method 1, a voice activity detector first classifies each frame as speech-present or speech-absent, and the noise spectrum estimate is updated using a constant smoothing factor for speech-absent frames and a frequency-dependent smoothing factor for speech-present frames. In method 2, the noise spectrum estimate is updated using a frequency-dependent smoothing factor regardless of whether speech is present in the frame. In both methods, the frequency-dependent smoothing factor is calculated from estimated speech presence probabilities in subbands. Speech presence is determined by computing the ratio of the noisy speech power spectrum to its local minimum, which is computed by averaging past values of the noisy speech power spectra with a look-ahead factor. The local minimum estimation algorithm adapts very quickly to highly non-stationary noise environments. This was confirmed with formal listening tests, which indicated that the proposed noise estimation algorithms, when integrated into speech enhancement, were preferred over other noise estimation algorithms.
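The method 2 style update described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function name, the threshold, and both smoothing constants are hypothetical values chosen for illustration, and the local minimum is assumed to be supplied by a separate tracker.

```python
def update_noise_estimate(noisy_power, noise_est, local_min, presence_prob,
                          ratio_threshold=5.0, prob_smoothing=0.2,
                          base_smoothing=0.85):
    """Update a per-bin noise power estimate for one frame.

    All constants are illustrative assumptions, not values from the thesis.
    `local_min` is assumed to come from a separate local-minimum tracker.
    """
    new_noise, new_prob = [], []
    for p_noisy, d_prev, p_min, prob_prev in zip(
            noisy_power, noise_est, local_min, presence_prob):
        # Ratio of the noisy power to its local minimum indicates speech presence.
        ratio = p_noisy / max(p_min, 1e-12)
        present = 1.0 if ratio > ratio_threshold else 0.0
        # Smooth the hard decision into a speech presence probability.
        prob = prob_smoothing * prob_prev + (1.0 - prob_smoothing) * present
        # Frequency-dependent smoothing factor: close to 1 when speech is
        # likely, so the noise estimate is updated slowly in those bins.
        alpha = base_smoothing + (1.0 - base_smoothing) * prob
        new_noise.append(alpha * d_prev + (1.0 - alpha) * p_noisy)
        new_prob.append(prob)
    return new_noise, new_prob
```

Bins where speech is likely receive a smoothing factor near one, so the strong speech energy in those bins leaks into the noise estimate only slowly, while speech-absent bins track the noisy spectrum more quickly.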

Download thesis:  [PDF - 556 kB]


DICHOTIC SPEECH RECOGNITION: ACOUSTIC AND ELECTRIC HEARING
ArunVijay Mani, M.S.E.E.
May 2004

It is generally accepted that the fusion of two speech signals presented dichotically is affected by the relative onset time. This study investigated the hypothesis that spectral resolution might be an additional factor influencing spectral fusion when the spectral information is split and presented dichotically to the two ears. Two different methods of splitting the spectral information were investigated. In the first method, the odd-index channels were presented to one ear and the even-index channels to the other ear. In the second method the lower frequency channels were presented to one ear and the high frequency channels to the other ear. The experiments were conducted with both normal hearing listeners and bilateral cochlear implant listeners. Results with normal hearing listeners indicated that spectral resolution did affect spectral fusion. Results with bilateral cochlear implant users indicated that subjects were able to fuse information presented to the two ears accurately in quiet but not in noise.

Download thesis:  [PDF - 520 kB]

 

SUBSPACE AND MULTITAPER METHODS FOR SPEECH ENHANCEMENT
Yi Hu, Ph.D.
December 2003

Several speech enhancement algorithms have been proposed over the years. Although most algorithms improve the quality of speech, they introduce speech distortion and suffer from the “musical noise” artifact. To minimize speech distortion, we propose subspace methods which can be generally applied for colored noise environments. To make the residual noise perceptually inaudible, we propose two methods for incorporating psychoacoustical models. In the first method, we use a well-known perceptual weighting technique from speech coding to shape the residual noise spectrum. In the second method, we constrain the noise spectrum to be less than the masking threshold of the speech signal. To eliminate musical noise, we propose the use of multitaper spectrum estimators which have low variance. We further wavelet threshold the multitaper spectrum to reduce the estimation variance. For subspace methods, we propose the use of multiwindow covariance matrix estimation.

   Results, based on formal listening tests and objective measures, indicated significant improvements in speech quality with the proposed algorithms. Furthermore, the proposed subspace methods yielded improved speech intelligibility when tested with cochlear implant listeners.

Download thesis:  [pdf - 635 kB]


A MULTI-BAND SPECTRAL SUBTRACTION METHOD FOR SPEECH ENHANCEMENT
Sunil Devdas Kamath, M.S.E.E.
May 2001

The corruption of speech due to the presence of additive background noise causes severe difficulties in various communication environments. This thesis addresses the problem of reducing additive background noise in speech. The proposed approach is a frequency-dependent speech enhancement method based on the proven spectral subtraction method. Most implementations and variations of the basic spectral subtraction technique advocate subtraction of the noise spectrum estimate over the entire speech spectrum. However, real-world noise is mostly colored and does not affect the speech signal uniformly over the entire spectrum. This thesis explores a Multi-Band Spectral Subtraction (MBSS) approach with suitable pre-processing of the speech data. Speech is processed into frequency bands and spectral subtraction is performed independently on each band using band-specific over-subtraction factors. This method provides a greater degree of flexibility and control over the noise subtraction levels, which reduces artifacts in the enhanced speech and results in improved speech quality. The effect of the number of frequency bands and the type of filter spacing (linear, logarithmic or mel) was investigated. Results showed that the proposed MBSS method with four linearly spaced frequency bands outperformed the conventional spectral subtraction method with respect to speech quality and reduced musical noise.
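The band-wise subtraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name, the band edges, the over-subtraction factors, and the spectral floor (a common safeguard in spectral subtraction that the abstract does not mention) are all hypothetical.

```python
def multiband_spectral_subtraction(noisy_power, noise_power, band_edges,
                                   over_subtraction, floor=0.002):
    """Subtract a noise power estimate band by band.

    band_edges: bin indices delimiting the bands, e.g. [0, 2, 4] gives the
        two bands [0, 2) and [2, 4).
    over_subtraction: one factor per band (illustrative values only).
    floor: assumed spectral floor, as a fraction of the noisy power, that
        keeps the result positive after subtraction.
    """
    enhanced = list(noisy_power)
    bands = zip(band_edges[:-1], band_edges[1:])
    for (start, stop), alpha in zip(bands, over_subtraction):
        for k in range(start, stop):
            # Band-specific over-subtraction of the noise estimate.
            diff = noisy_power[k] - alpha * noise_power[k]
            # Clamp to the floor instead of allowing negative power.
            enhanced[k] = max(diff, floor * noisy_power[k])
    return enhanced
```

For example, with `band_edges=[0, 2, 4]` and `over_subtraction=[1.0, 6.0]`, the second band is subtracted six times more aggressively than the first, which is the kind of per-band control the abstract refers to.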

Publications
Kamath, S. and Loizou, P. (2002). “A multi-band spectral subtraction method for enhancing speech corrupted by colored noise,” Proceedings of ICASSP-2002, Orlando, FL, May 2002.

Download thesis:  [pdf - 906 kB]


A MODIFIED SPECTRAL SUBTRACTION METHOD COMBINED WITH PERCEPTUAL WEIGHTING FOR SPEECH ENHANCEMENT
Mukul Bhatnagar, M.S.E.E.
August 2002

Reducing noise in corrupted speech remains an important problem and has a broad range of applications, most of which are driven by the explosive growth of mobile communications. Numerous approaches have been proposed for speech enhancement, with the spectral subtraction method being one of the most popular due to its relatively simple implementation and computational efficiency. The spectral subtraction method, however, has some inherent limitations and drawbacks. This thesis proposes a modification to the conventional spectral subtraction approach in order to address the problems of musical noise and speech distortion that are inherent to it. Further enhancements in speech quality were obtained by applying a perceptual weighting function (estimated using a psychoacoustic model) that was designed to minimize noise distortion. Objective measures and informal listening tests showed that the proposed modified spectral subtraction method combined with perceptual weighting outperformed the conventional power spectral subtraction method, resulting in better speech quality and reduced levels of musical noise.

Download thesis: [pdf - 1.1 Mb]


THE EFFECT OF NOISE ON THE SPECTRUM OF SPEECH
Gaurang Kishor Parikh, M.S.T.E.
August 2002

Real-world noise is mostly colored and does not affect the speech signal uniformly over the entire spectrum. Little is known about the effect of noise on the spectrum of speech. Such knowledge could potentially help us develop better speech enhancement algorithms. This thesis investigates the effect of colored noise, viz., multi-talker babble and speech-shaped noise, on the spectrum of vowels and consonants. Multi-talker babble and speech-shaped noise were added to vowels and stop consonants at -5 to 15 dB SNR and the spectral effect of noise was quantified in terms of various acoustic measures: (a) spectral contrast of the noisy vowel spectra, (b) spectral distance between the noisy and clean vowel and consonant spectra for three frequency bands, (c) detection and estimation of the first two formant frequencies in noise, (d) frequency deviation of the first two formant frequencies in noise, (e) spectral tilt of stop consonants, and (f) burst frequency for the stop consonants. Results showed that for vowels and stop consonants, the effect of colored noise on the frequency spectrum was non-uniform.

Download thesis: [pdf  - 436 Kb]


INTELLIGIBILITY OF FILTERED SPEECH AND ESTIMATION OF FREQUENCY-IMPORTANCE FUNCTIONS
Kalyan S. Kasturi, M.S.E.E.
August 2002

An understanding of how information about the speech signal is spread among the various frequency bands of the spectrum is essential in numerous communications, audio and hearing related applications. Although many studies investigated the intelligibility of high-pass, low-pass and band-pass filtered speech, not many studies investigated the perception of band-stop filtered speech (i.e., speech with holes in the spectrum) or speech composed of disjoint frequency bands. The most recent studies examined speech recognition either for a single hole varying in frequency location and size or for a single hole in the middle of the spectrum. The scope of these studies is limited in the sense that they did not consider perception of speech composed of multiple disjoint bands involving low, middle and/or high frequency information. The present study addresses this question in a systematic fashion, considering all possible combinations of missing disjoint bands from the spectrum. In this work, we also derive frequency-importance functions for consonant and vowel recognition using (a) a least squares approach that utilizes the results of intelligibility tests for speech with holes in the spectrum and (b) an information theoretic approach based on the calculation of mutual information between frequency bands and phonetic labels.

Download thesis: [pdf  - 320 Kb]


SUBBAND FEEDBACK ACTIVE NOISE CANCELLATION
Bharath M Siravara,  M.S.E.E.
August 2002

This thesis presents a new technique for subband feedback active noise control. The problem of controlling the noise level in the environment has been the focus of a tremendous amount of research over the years. Active Noise Cancellation (ANC) is one such approach that has been proposed for reduction of steady state noise. ANC refers to an electromechanical or electroacoustic technique of canceling acoustic disturbance to yield a quieter environment. The basic principle of ANC is to introduce a canceling “antinoise” signal that has the same amplitude but the exact opposite phase, thus resulting in an attenuated residual noise signal. Wideband active noise control systems often involve adaptive filter lengths with hundreds of taps. Using subband processing can considerably reduce the length of the adaptive filter. Conventional subband algorithms are generally based in the frequency domain and use at least 2 sensors. This thesis presents a time domain algorithm for single sensor subband feedback ANC targeted for use in headsets and hearing protectors. The subband processing is done using relatively short fixed FIR filters. The algorithm also adopts the weight constrained NLMS algorithm for feedback ANC. Results showed that the proposed subband feedback ANC algorithm outperformed the traditional single band ANC system.
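The adaptive filtering at the core of such a system can be illustrated with a plain normalized LMS (NLMS) weight update. This sketch deliberately shows only the standard NLMS step; it does not include the weight constraint, the subband decomposition, or the feedback ANC structure used in the thesis, and the function name and default parameters are assumptions.

```python
def nlms_step(weights, x_buf, desired, step=0.5, eps=1e-8):
    """Perform one standard NLMS update.

    weights: current filter taps.
    x_buf: most recent input samples, newest first, same length as weights.
    desired: the sample the filter output should match.
    step, eps: illustrative step size and regularization constant.
    """
    # Filter output for the current input buffer.
    output = sum(w * x for w, x in zip(weights, x_buf))
    error = desired - output
    # Normalize the step by the input energy so that convergence does not
    # depend on the input level.
    energy = sum(x * x for x in x_buf) + eps
    new_weights = [w + step * error * x / energy
                   for w, x in zip(weights, x_buf)]
    return new_weights, error
```

Because the cost of each update grows with the number of taps, the shorter per-band filters mentioned in the abstract reduce the work per sample compared with a single long wideband filter.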

Download thesis: [pdf -  1.4 Mb]


BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR TELEPHONE-ASSISTIVE APPLICATIONS
Haifeng Qian, M.S.E.E.
May 2002

This thesis addresses the problem of helping hearing-impaired people use telephones. The work has two parts: a Bluetooth-based wireless phone adapter and a bandwidth-extension algorithm. Built upon Bluetooth technology, the proposed phone adapter routes the telephone audio signal wirelessly to the hearing aid or the cochlear implant (CI) processor, and hence avoids environmental noise and interference. The proposed bandwidth-extension algorithm has the potential to increase speech intelligibility for hearing-impaired people by estimating a wide-band signal from the narrow-band telephone signal. This is done by piecewise linear estimation based on line spectral frequencies, combined with a statistical speech-frame classification technique based on Hidden Markov Models that is integrated to overcome the drawback of conventional bandwidth-extension algorithms. The phone adapter was tested by CI users, and the proposed algorithm was evaluated by objective measures. Both evaluations showed good performance.

Download thesis: [pdf - 572 Kb]


ANALYSES OF SPEECH PROCESSING STRATEGIES FOR COCHLEAR IMPLANTS AND THE EFFECTS OF ELECTRODE INTERACTION
Ginger Stickney
PhD Dissertation
May 2001

 Multichannel cochlear implants electrically stimulate the auditory nerve to restore partial hearing to the profoundly deaf patient.  The multichannel implant was designed to selectively stimulate discrete populations of spiral ganglion cells along the length of the cochlea.  However, selective stimulation is often not achieved, or is achieved only imperfectly, even with the most modern cochlear implant designs and speech processing strategies.  When multiple electrodes are stimulated simultaneously, electrical fields generated around each electrode can interact with the electrical fields of neighboring electrodes, thereby reducing selectivity.  Several studies have suggested that electrical-field interactions can disrupt the acoustic properties of the signal and severely degrade speech intelligibility; however, this relationship has not been directly tested.
 Electrical-field interactions can be reduced by decreasing the current levels delivered to each electrode through improved electrode positioning and design, or by using speech processing strategies that maximize the separation between simultaneously stimulated electrodes or stimulate the electrodes sequentially.  The proximity of the cochlear implant electrode array to the modiolus has been shown to reduce the amount of current required to reach threshold (Rebscher et al., 1994).  When less current is required, current spread and electrical-field overlap are reduced.  Recently, cochlear implant manufacturers have taken an interest in designing “positioners”, which place the electrode array in close proximity to the spiral ganglion cells, and new electrode arrays, which attempt to direct their current toward spiral ganglion cell bodies.
 The following experiments examine electrical-field interactions and speech recognition performance for three electrode designs: patients implanted with the Enhanced Bipolar Clarion electrode array without a “positioner”, patients implanted with the Clarion Electrode Positioning System™ (EPS) and the Enhanced Bipolar electrode array, and patients with the EPS and the Clarion Hi-Focus™ electrode array.  A simultaneous masking task was used to measure electrical-field interactions as a function of electrode separation for monopolar and bipolar configurations.  The relationship between electrical-field interaction and speech recognition was also examined for several speech strategies varying in the number of electrodes stimulated simultaneously.  Subjects identified consonants, vowels, and sentences with each of the following speech strategies, listed in order from sequential stimulation to fully simultaneous stimulation: Continuous Interleaved Sampler (CIS), Paired Pulsatile Sampler (PPS), Quadruple Pulsatile Sampler (QPS), Hybrid Analog Pulsatile (HAPs), and Simultaneous Analog Stimulation (SAS).  Based on previous research, susceptibility to electrical-field interactions is expected to vary as a function of electrode design, the speech processing strategy used in the device, and factors specific to each patient.  The contribution from each of these variables was investigated.
 The results showed a moderate to strong negative correlation between electrical-field interaction and speech recognition performance, which indicates that patients with lower levels of electrical-field interaction have higher speech recognition scores than patients with high levels of electrical-field interaction.  In addition, patients with strong susceptibility to electrical-field interactions produced higher speech recognition scores for sequential than for simultaneous speech strategies.  An information analysis revealed that vowel recognition and consonant place-of-articulation were most affected by electrical-field interactions, demonstrating that electrode interactions severely disrupt spectral cues.  The pattern of results also suggests that, with acute listening trials, patients achieve the highest speech recognition scores with the speech processing strategy most similar to their own.  Future studies are needed to determine whether patients with minimal levels of electrical-field interaction can benefit from the partially simultaneous QPS or HAPs strategies with more listening exposure.
 

Download dissertation: [pdf - 1.2 Mb]


ANALYSIS OF SPEECH PROCESSING STRATEGIES FOR THE CLARION IMPLANT PROCESSOR
Lakshmi Narayan Mishra, M.S.E.E.
December 2000

The variability in patient performance observed among cochlear implant users demands the development of new and improved speech processing strategies that will help improve speech recognition for poor users of the device. The Clarion cochlear implant has various parameters that can be manipulated, and Clarion patients can be fitted with several speech processing strategies. In this thesis, the Clarion research interface was used to evaluate the performance of commercially available as well as new speech processing strategies. Six different strategies were implemented and tested with 12 Clarion implant patients (10 CIS users and 2 SAS users). The six strategies included three commercially available strategies (CIS, PPS and SAS) and three new (not commercially available in the Clarion device) strategies: the hybrid, the quadruple pulsatile sampler (QPS) and the 6-of-8 strategy. These strategies differed in the degree of simultaneity and rate of stimulation. Speech recognition results showed that the performance obtained with the CIS strategy was not statistically different from the performance obtained with the PPS, QPS and hybrid strategies in quiet, or from that obtained with the 6-of-8 strategy in noise. There was large variability in performance among subjects. In noise, some subjects benefited from the 6-of-8 strategy. In quiet, some subjects obtained higher performance with the PPS, QPS and hybrid strategies compared to the CIS strategy. We believe that this variability was due to the amount of channel interaction. Subjects with small channel interaction are most likely to benefit from the high rates of stimulation provided by the PPS and QPS strategies.  Further research is needed to identify the various factors that affect implant users' performance.

Download thesis: [pdf  - 653 Kb]

