
Imaging Temporal Pitch Processing in the Auditory Pathway


Näätänen R, Picton TW (1987) The N1 wave of the human electric and magnetic response to sound: a review and an analysis of the component structure. Psychophysiology 24:375–425

Palmer AR, Winter IM (1992) Cochlear nerve and cochlear nucleus responses to the fundamental frequency of voiced speech sounds and harmonic complex tones. In: Cazals Y, Demany L, Horner K (eds) Auditory physiology and perception. Pergamon, Oxford, pp 231–239

Palmer AR, Bullock DC, Chambers JD (1998) A high-output, high-quality sound system for use in auditory fMRI. NeuroImage 7:S359

Patterson RD, Allerhand M, Giguère C (1995) Time-domain modelling of peripheral auditory processing: a modular architecture and a software platform. J Acoust Soc Am 98:1890–1894

Patterson RD, Uppenkamp S, Johnsrude I, Griffiths TD (2002) The processing of temporal pitch and melody information in auditory cortex. Neuron 36:767–776

Penagos H, Melcher JR, Oxenham AJ (2004) A neural representation of pitch salience in nonprimary human auditory cortex revealed with functional magnetic resonance imaging. J Neurosci 24:6810–6815

Pressnitzer D, Patterson RD, Krumbholz K (2001) The lower limit of melodic pitch. J Acoust Soc Am 109:2074–2084

Rupp A, Uppenkamp S, Bailes J, Gutschalk A, Patterson RD (2005) Time constants in temporal pitch extraction: a comparison of psychophysical and neuromagnetic data. In: Pressnitzer D, de Cheveigné A, McAdams S, Collet L (eds) Auditory signal processing: physiology, psychoacoustics and models. Proceedings of the 13th International Symposium on Hearing, Dourdan, France, pp 119–125

Seither-Preisler A, Krumbholz K, Patterson RD, Seither A, Lütkenhöner B (2004) Interaction between the neuromagnetic responses to sound energy onset and pitch onset suggests common generators. Eur J Neurosci 19:3073–3080

Seither-Preisler A, Patterson RD, Krumbholz K, Seither S, Lütkenhöner B (2006a) Evidence of pitch processing in the N100m component of the auditory evoked field. Hear Res 213:88–98

Seither-Preisler A, Patterson RD, Krumbholz K, Seither S, Lütkenhöner B (2006b) From noise to pitch: transient and sustained responses of the auditory evoked field. Hear Res (in press)

Uppenkamp S, Bailes J, Patterson RD (2004) How long does a sound have to be to produce a temporal pitch? Proc. 18th International Congress on Acoustics, Kyoto, vol I, pp 869–870

Warren JD, Uppenkamp S, Patterson RD, Griffiths TD (2003) Separating pitch chroma and pitch height in the human brain. Proc Natl Acad Sci USA 100:10038–10042

Yost WA, Patterson RD, Sheft S (1996) A time-domain description for the pitch strength of iterated rippled noise. J Acoust Soc Am 99:1066–1078

Comment by Chait

In your MEG experiments you interpret responses to transitions between irregular and regular click-tone sequences, or between white noise and IRN, as reflecting the activation of a pitch-related area. However, these transitions can also be interpreted simply as transitions between “irregular” and “regular” stimuli. As we show in our talk on Monday, similar responses and asymmetries are obtained for transitions between irregular and regular signals that do not evoke a pitch percept.

Measurable magnetic fields originate from EPSCs in the apical dendrites of tens of thousands of simultaneously active cells. Arguably, the processing of a feature such as pitch should not require the simultaneous and synchronized activation of tens of thousands of cells. Computations that are more likely to evoke the observed MEG responses might be related to object analysis, notification of change, attention switching, etc. (Gutschalk et al. 2004).


R.D. Patterson et al.

Conceivably, such global processes may involve the synchronous activation of many cells as a method of notification across mechanisms and brain areas that something new and potentially behaviorally relevant has occurred in the environment.

Reply

For us, the question is basically ‘How and where does the auditory system process the temporal regularity in a sound to produce the perceptions associated with temporal regularity?’ An extended example concerning the perception of click-trains is presented in Patterson et al. (1992). The example is used to motivate the strobed temporal integration mechanism in the Auditory Image Model (AIM) of perception. Briefly, for rates less than 10 clicks per second (cps), we hear individual events, independent of whether the train is regular or irregular. For rates greater than 40 cps, we hear a continuous sound, which has a strong pitch if the train is regular, and no pitch if it is not. The problem for auditory models is to explain the perceptual transition from individual clicks to click-tones, and the perception of flutter in the region of 16 Hz. Similar questions concerning regularity arise with many stimuli, such as those of Chait, Poeppel and Simon in this volume, as the rate of transitions rises from 1 to 64 per second.

The question for brain imaging is whether the processing of slow click trains, which leads to the perception of a stream of separate events, occurs in the same neural structure as the processing of fast click trains, which leads to the perception of a continuous tone. MEG source waves indicate that isolated clicks produce transient N1m responses in Planum Temporale (PT), and that their strength decreases as the click rate increases above 1 cps. Irregular click trains with rates greater than 40 cps produce a single N1m in PT at stimulus onset, and a sustained response in PT thereafter (Gutschalk et al. 2002, 2004). Regular click trains with rates greater than 40 cps produce what we have referred to as a Pitch Onset Response (POR) and a Sustained Pitch Response (SPR), in a region of Heschl’s gyrus a little anterior and lateral to primary auditory cortex (PAC). It would be interesting to know the location of the sources associated with the responses reported by Chait, Poeppel and Simon in this volume.

When MEG techniques are sufficiently developed, it would be interesting to track the response to regular and irregular CTs as the click rate decreases from 64 to 1 cps, and the pitch fades out of the percept. In this regard, it would also be interesting to extend these experiments concerning the lower limit of pitch to include three stimuli that produce continuous stimulation and repress the N1m. The stimuli are RIS (Krumbholz et al. 2003; Seither-Preisler et al. 2004), repeated frozen noise (Limbert and Patterson 1982) and AABB noise (Wiegrebe et al. 1998), all of which produce a strong pitch when the repetition rate is above about 40 Hz, but which produce the perception of noise with an ambiguous repeating feature when the repetition rate is less than 32 Hz. These stimuli should excite the pitch centre when the rate is over 40 Hz, but they are unlikely to produce a repeating N1m at lower rates, even when the rate is as low as 1 cps, because they produce continuous stimulation. The question is how the neural centres in lateral HG and PT interact as the pitch fades from the sound.

We agree with the postulate in the second paragraph that the responses we are measuring with MEG represent processes that might better be thought of in terms of their role in the definition and segregation of streams of auditory events with coherent features over time (Cooke 2006), although we would be more inclined to think of these processes as identifying and segregating sound sources rather than auditory objects.

References

Cooke M (2006) A glimpsing model of speech perception in noise. J Acoust Soc Am 119:1562–1573

Gutschalk A, Patterson RD, Rupp A, Uppenkamp S, Scherg M (2002) Sustained magnetic fields reveal separate sites for sound level and temporal regularity in human auditory cortex. NeuroImage 15:207–216

Gutschalk A, Patterson RD, Scherg M, Uppenkamp S, Rupp A (2004) Temporal dynamics of pitch in human auditory cortex. NeuroImage 22:755–766

Krumbholz K, Patterson RD, Seither-Preisler A, Lammertmann C, Lütkenhöner B (2003) Neuromagnetic evidence for a pitch processing centre in Heschl’s gyrus. Cerebral Cortex 13:765–772

Limbert C, Patterson RD (1982) Tapping to repeated noise. J Acoust Soc Am 71:S38

Patterson RD, Robinson K, Holdsworth J, McKeown D, Zhang C, Allerhand M (1992) Complex sounds and auditory images. In: Cazals Y, Demany L, Horner K (eds) Auditory physiology and perception. Proceedings of the 9th International Symposium on Hearing. Pergamon, Oxford, pp 429–446

Seither-Preisler A, Krumbholz K, Patterson RD, Seither A, Lütkenhöner B (2004) Interaction between the neuromagnetic responses to sound energy onset and pitch onset suggests common generators. Eur J Neurosci 19:3073–3080

Wiegrebe L, Patterson RD, Demany L, Carlyon RC (1998) Temporal dynamics of pitch strength in regular interval noises. J Acoust Soc Am 104:2307–2313

Comment by Hall

In your talk, you suggest that the pitch-related computations performed by lateral Heschl’s gyrus might relate to the across-frequency channel averaging implemented in the autocorrelation model of temporal pitch. If this is the case, then one might expect the same region to be engaged by other analyses of fine temporal structure that also require the computation of a summary correlogram, such as the analysis of interaural correlation. Some of our recent data suggest that this is unlikely because lateral Heschl’s gyrus responded little to the degree of interaural correlation in the noise, while it did respond strongly to the degree of monaural temporal regularity. Would you like to comment?


Reply

As a matter of fact, the Auditory Image Model (Patterson et al. 1995) does not use autocorrelation to compute pitch (Patterson and Irino 1998), nor does it use cross-correlation to compute laterality (Patterson et al. 2006), and I doubt that the auditory system does either, because correlation processes (auto- and cross-) are expansive in magnitude, symmetric in time, and extremely inefficient. That said, the main issue here is the cross-channel computations that might be used to summarise laterality or pitch information, and whether AIM would predict that the cross-channel computations for laterality and pitch both occur in auditory cortex, in the same neural structure. Although it is an intriguing hypothesis, the answer would appear to be no. The mechanism recently proposed for binaural processing (Patterson et al. 2006) involves a coincidence gate that really should be in the brainstem, to minimize temporal distortion of the ITD information, which is in the tens-of-microseconds range. The coincidence gate mechanism is assumed to precede the strobed temporal integration (STI) mechanism (Patterson 1994) used to construct the time-interval histograms that constitute the auditory image (Patterson et al. 1995). The pitch calculation is reviewed in Krumbholz et al. (2003), which also addresses the differences between STI and autocorrelation at the end of the Discussion. The important point for the present discussion is that the cross-channel pitch computation is applied to the auditory image after it is constructed (see, for example, Krumbholz et al. 2005), which probably means that it is performed farther along the auditory pathway.

Nevertheless, the original hypothesis of Hall et al. (2005), that the two cross-channel mechanisms might reside in the same region of auditory cortex, seems reasonable and very much worth testing because, if true, it might have required drastic restructuring of time-interval models like AIM. Moreover, it led to the discovery of an interaction between pitch salience and sound-source location in Heschl’s sulcus. I agree that the interaction suggests that this region of auditory cortex may be involved in the integration of acoustic features as a prelude to source identification, and this is very intriguing.

References

Hall DA, Barrett DJK, Akeroyd MA, Summerfield AQ (2005) Cortical representations of temporal structure in sound. J Neurophysiol 94:3181–3191

Krumbholz K, Patterson RD, Nobbe A, Fastl H (2003) Microsecond temporal resolution in monaural hearing without spectral cues? J Acoust Soc Am 113:2790–2800

Krumbholz K, Bleeck S, Patterson RD, Senokozlieva M, Seither-Preisler A, Lütkenhöner B (2005) The effect of cross-channel synchrony on the perception of temporal regularity. J Acoust Soc Am 118:946–954

Patterson RD (1994) The sound of a sinusoid: time-interval models. J Acoust Soc Am 96:1419–1428


Patterson RD, Irino T (1998) Auditory temporal asymmetry and autocorrelation. In: Palmer A, Rees A, Summerfield Q, Meddis R (eds) Psychophysical and physiological advances in hearing. Proceedings of the 11th International Symposium on Hearing. Whurr, London, pp 554–562

Patterson RD, Allerhand M, Giguère C (1995) Time-domain modelling of peripheral auditory processing: a modular architecture and a software platform. J Acoust Soc Am 98:1890–1894

Patterson RD, Anderson TR, Francis K (2006) Binaural auditory images for noise-resistant speech recognition. In: Ainsworth W, Greenberg S (eds) Listening to speech. LEA, pp 257–269

12 Spatiotemporal Encoding of Vowels in Noise Studied with the Responses of Individual Auditory-Nerve Fibers

MICHAEL G. HEINZ

1 Introduction

The neural basis for robust speech perception exhibited by human listeners (e.g., across sound levels or background noises) remains unknown. The encoding of spectral shape based on auditory-nerve (AN) discharge rate degrades significantly at high sound levels, particularly in high-spontaneous-rate (SR) fibers (Sachs and Young 1979). However, continued support for rate coding has come from the observations that robust spectral coding occurs in some low-SR fibers for vowels in quiet and that rate-difference profiles provide enough information to account for behavioral discrimination of vowels (Conley and Keilson 1995; May, Huang, Le Prell, and Hienz 1996). Despite this support, it is clear that temporal codes are more robust than rate (Young and Sachs 1979), especially in noise (Delgutte and Kiang 1984; Sachs, Voigt, and Young 1983). Sachs et al. (1983) showed that rate coding in low-SR fibers was significantly degraded at a moderate signal-to-noise ratio for which human perception is robust. In contrast, temporal coding based on the average-localized-synchronized-rate (ALSR) remained robust.

Although temporal coding based on ALSR is often shown to be robust, evidence for neural mechanisms to decode these cues is limited. Spatiotemporal mechanisms have been proposed for decoding these types of cues (e.g., Carney, Heinz, Evilsizer, Gilkey, and Colburn 2002; Deng and Geisler 1987; Shamma 1985). However, the detailed evaluation of spatiotemporal mechanisms has been limited primarily to modeling studies due to difficulties associated with the large population responses that are required to study spatiotemporal coding (e.g., see Palmer 1990). For example, Deng and Geisler (1987) used a transmission-line-based AN model to suggest that spectral coding based on the peak cross-correlation between adjacent best-frequency (BF) channels was robust in the presence of background noise. In the present study, spectral coding of vowels in noise based on rate, ALSR, and a simple cross-BF coincidence detection scheme is evaluated from the responses of single AN fibers. By using data from a single AN fiber, many of the difficulties associated with large-population studies are eliminated.

Department of Speech, Language, and Hearing Sciences and Weldon School of Biomedical Engineering, Purdue University, mheinz@purdue.edu

Hearing – From Sensory Processing to Perception

B. Kollmeier, G. Klump, V. Hohmann, U. Langemann, M. Mauermann, S. Uppenkamp, and J. Verhey (Eds.) © Springer-Verlag Berlin Heidelberg 2007


2 Methods

AN recordings were made from pentobarbital-anesthetized cats using standard methods (see Heinz and Young 2004). Spike times were measured with 10-µs resolution. Each fiber was characterized using an automated tuning-curve algorithm to determine BF, Q10, and SR.
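Q10 is the standard sharpness-of-tuning measure derived from such a tuning curve. As a minimal sketch (the function name is ours; the two edge frequencies are read off the measured tuning curve 10 dB above the threshold at BF):

```python
def q10(bf_hz, f_lo_hz, f_hi_hz):
    """Tuning sharpness: BF divided by the tuning-curve bandwidth
    measured 10 dB above the fiber's threshold at BF."""
    return bf_hz / (f_hi_hz - f_lo_hz)
```

For example, a fiber with BF = 960 Hz whose tuning curve spans 860–1060 Hz at 10 dB above threshold has Q10 = 4.8.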

All vowels were created using a cascade synthesizer and were scaled versions of the vowel /eh/, which has its first two formants at F1=0.5 kHz and F2=1.7 kHz, with the intermediate trough at T1=1.2 kHz. To maintain F0 within the voice-pitch range for each BF, a baseline steady-state vowel was resynthesized for each AN fiber. The baseline vowel had F0=75 Hz and was created with F2 at BF. The other formant frequencies and all bandwidths were scaled based on the frequency shift from the nominal F2 value for /eh/. The baseline vowel and a baseline broadband noise token were both 400 ms in duration and were sampled at 33,000 Hz. The vowel-in-noise conditions with F1 and T1 near BF were produced via changes in sampling rate for the vowel and noise. Signal-to-noise ratio in dB was defined as the difference between the overall vowel level and the noise level within the frequency range from 0 Hz up to the trough between the third and fourth formants of the baseline vowel.
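The feature scaling described above can be sketched as follows; the function name and the returned structure are illustrative, and only the three features named in the text are included (F0 is held at 75 Hz rather than scaled):

```python
def scaled_vowel_features(bf_khz, f2_nominal_khz=1.7):
    """Scale the /eh/ spectral features so that F2 falls at the fiber's BF.
    All formant and trough frequencies shift by the same factor, as they
    would under a change of playback sampling rate."""
    k = bf_khz / f2_nominal_khz
    return {"F1_khz": 0.5 * k, "T1_khz": 1.2 * k, "F2_khz": 1.7 * k}
```

For a fiber with BF = 1.7 kHz the features are unchanged; for a fiber with BF = 0.85 kHz every feature frequency is halved.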

Spectral coding was evaluated based on individual-neuron responses in a manner similar to the spectrum manipulation procedure (SMP), which was developed to study rate-based spectral coding (e.g., May et al. 1996). In the SMP, spectral coding is evaluated by comparing responses to vowels with formants and troughs placed at BF via changes in sampling rate. The slope of the discharge rate as a function of vowel-feature level is used to quantify spectral coding, with robust coding indicated by a slope that remains constant across vowel level (or SNR). Although the SMP is useful for evaluating rate coding, the changes in the temporal waveform that accompany changes in sampling rate do not allow spatiotemporal coding to be evaluated.
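As a sketch of that slope quantification (assuming a simple least-squares linear fit; the helper name is ours, not from the original SMP papers):

```python
import numpy as np

def smp_slope(feature_levels_db, rates_sp_s):
    """Slope of driven discharge rate vs. vowel-feature level
    (spikes/s per dB). In the SMP, robust rate coding shows as a
    slope that stays constant as overall vowel level (or SNR) changes."""
    slope, _intercept = np.polyfit(feature_levels_db, rates_sp_s, 1)
    return slope
```

Comparing this slope across vowel levels, or across SNRs, gives a single-number summary of how well the rate profile preserves the formant-vs-trough contrast.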

The spectro-temporal manipulation procedure (STMP) was developed to study spatiotemporal coding by estimating the responses of several neurons with nearby BFs to the same stimulus waveform, using the responses of a single neuron to different stimuli with a spectral feature shifted to frequencies near BF (Heinz 2005). Based on a neuron with BF0 and a vowel with F1=BF0, the response of a neuron with BF<BF0 can be predicted by playing the vowel at a higher sampling frequency, thus increasing the frequency of all vowel features. The response of the below-F1 neuron to the baseline vowel waveform (F1 at BF0) is estimated by scaling up the measured spike times in response to the shifted vowel by the same factor used to increase the sampling rate, thus reducing the effective vowel-feature frequencies to baseline values. To account for the fixed neural delay, an offset of 1 ms was subtracted from all spike times prior to time scaling and then added back afterwards. Temporal scaling changes the overall discharge rate of the estimated fiber response, which is accounted for by scaling the resulting period histograms by the temporal scaling factor. A computational AN model has been used to test the STMP approach (Heinz 2005; Zhang, Heinz, Bruce, and Carney 2001).
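The spike-time rescaling step can be sketched as follows; `stmp_rescale` is a hypothetical helper name, and the 1-ms neural delay is the value quoted in the text:

```python
import numpy as np

def stmp_rescale(spike_times_s, scale_factor, neural_delay_s=0.001):
    """Estimate spike times of a neighbouring-BF neuron responding to the
    baseline vowel, from spikes recorded to the sampling-rate-shifted vowel.

    The fixed neural delay is subtracted before time scaling and added
    back afterwards, so that only the stimulus-driven timing is rescaled."""
    t = np.asarray(spike_times_s, dtype=float)
    return (t - neural_delay_s) * scale_factor + neural_delay_s
```

Because the rescaled response is stretched (or compressed) in time by `scale_factor`, its overall discharge rate changes by the same factor, which is why the period histograms are subsequently scaled by that factor.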


A physiologically based spatiotemporal mechanism was evaluated by computing shuffled cross-correlograms (SCCs) between responses at different effective BFs (Joris 2003; Joris, Van de Sande, Louage, and van der Heijden 2006). The SCCs were used as a model of a cross-BF coincidence detector with two AN-fiber inputs responding to the same vowel waveform. The SCC value at each delay represents the discharge rate (spikes/sec) of a coincidence detector with the corresponding delay between inputs. SCCs were computed using a 50-µs binwidth. The SCC provides an efficient method to predict responses of simple monaural cross-BF coincidence detectors based on AN responses.
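A minimal sketch of such an SCC is shown below. The function name is ours, and the normalisation (coincidences per train pair, per second of stimulus) is one simple way to express the result as a coincidence-detector discharge rate; the published SCC normalisations differ in detail.

```python
import numpy as np

def shuffled_cross_correlogram(trains_a, trains_b, binwidth, max_lag, duration):
    """Discharge rate (spikes/s) of a model coincidence detector as a
    function of the delay between two fibers' inputs.

    trains_a, trains_b: lists of spike-time arrays (s), one per repetition.
    Every train in trains_a is paired with every train in trains_b, so no
    within-train (same-repetition) intervals are counted."""
    edges = np.arange(-max_lag, max_lag + binwidth / 2, binwidth)
    counts = np.zeros(len(edges) - 1)
    n_pairs = 0
    for ta in trains_a:
        for tb in trains_b:
            # all pairwise spike-time differences between the two trains
            diffs = (np.asarray(tb)[None, :] - np.asarray(ta)[:, None]).ravel()
            counts += np.histogram(diffs, bins=edges)[0]
            n_pairs += 1
    # coincidences per pair of trains, per second of stimulus
    return counts / (n_pairs * duration), edges
```

With responses at two effective BFs generated by the STMP, this function gives the coincidence rate as a function of the internal delay between the two inputs.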

3 Results

3.1 Choice of Vowel and Noise Levels

For each AN fiber, vowel and noise levels were chosen to focus the STMP analysis on conditions where rate coding degraded as noise level was increased. A vowel level was chosen for which there was good rate coding in quiet (rate to F1 > rate to T1; Fig. 1A). This was generally chosen near the middle of the T1 dynamic range. Rate as a function of noise level (or decreasing SNR) was then measured for F1 and T1 at that vowel level (panel B). Three SNRs (across a 20-dB range) were chosen to cover the range over which rate coding degraded, with the middle SNR typically chosen at the level where the rate to F1 and T1 became close (Fig. 1B).

Figure 1 shows an example of a fiber with a low SR that had robust spectral coding in quiet. However, the addition of noise degraded the spectral coding in this fiber to the point where the rate to F1 and T1 were equal. The complete degradation of rate coding as noise increased was true of every fiber studied.

Fig. 1 A Rate-level functions for F1 and T1 placed at BF. Dotted vertical line indicates the vowel level chosen. B Rate-level functions for the 50-dB SPL vowel in noise as a function of decreasing SNR. Dotted vertical lines indicate the SNRs used for the STMP. Fiber BF = 0.96 kHz; Thresh. = 8 dB SPL; SR = 1.4 sp/sec


3.2 Predicted Spatiotemporal Response Patterns

Based on the vowel and noise levels chosen (Fig. 1), the STMP was used to predict the spatiotemporal responses of 10 effective BFs near the AN fiber’s BF to F1 and T1 in 4 noise conditions (3 SNRs and in quiet). The data shown in Fig. 2 represent 20 repetitions of the 80 conditions studied (10 effective BFs × 2 features × 4 noise levels), which were presented in an interleaved manner.

The spatiotemporal responses to F1 (top panel) show synchrony capture by F1 in both conditions across all but the highest BFs. The responses of BFs near T1 (bottom panel) show a significant response to F0 in the quiet condition, which disappears in noise (Delgutte and Kiang 1984). The effective BF near 1.4 kHz is near F2 and shows the expected response to F2. These predicted patterns are consistent with many of the properties reported in previous population studies (e.g., Delgutte and Kiang 1984; Sachs et al. 1983; Young and Sachs 1979).

Fig. 2 Spatiotemporal patterns in response to F1 (top) and T1 (bottom) at 0.96 kHz (thickest line), predicted from an AN fiber with BF0=0.96 kHz. Left panels show period histograms for each of 10 effective BFs for the in-quiet and middle-SNR (4 dB) conditions. Vowel level = 50 dB SPL. Labels and short vertical lines at the bottom of the period histograms indicate the temporal periods (from time 0) associated with various vowel features. Middle panels show rate as a function of BF. Horizontal dotted lines and labels indicate the locations of vowel features. Right panels show synchrony coefficients to F1 (top) and T1 (bottom); only significant values shown. Octave shifts re BF0: −0.4, −0.25, −0.15, −0.05, 0, 0.05, 0.15, 0.25, 0.5, 0.75

3.3 Cross-BF Coincidence Functions

Figure 3 shows SCC functions computed from the predicted spatiotemporal responses (Fig. 2), which represent the discharge rate of a model coincidence detecting neuron with two inputs from AN fibers with effective BFs 0.05 octaves above and below each vowel feature. The periodic nature of the responses to F1 can be seen both in quiet and in noise. In contrast, a strong temporal representation of F0 in the fibers near T1 can be seen in quiet, but not in noise.

These SCC functions are an example of the precise cross-correlation analyses that are possible with the STMP approach. Because the effective BFs are created by changes in sampling rate, the BF difference is known exactly and is controllable by the experimenter. In contrast, population studies are limited by BF sampling issues and by any inaccuracies in estimating the two BFs.

Fig. 3 SCC functions between effective BFs 0.05 octaves above and below F1 (top) and T1 (bottom). In quiet: left panels; in noise (SNR = 4 dB): right panels