Cerebral Cortex, Vol. 10, No. 5, 512-528,
May 2000
© 2000 Oxford University Press
Human Temporal Lobe Activation by Speech and Nonspeech Sounds
Department of Neurology and Department of Cellular Biology, Neurobiology, and Anatomy, Medical College of Wisconsin, Milwaukee, WI, USA
Abstract
Functional organization of the lateral temporal cortex in humans is not well understood. We recorded blood oxygenation signals from the temporal lobes of normal volunteers using functional magnetic resonance imaging during stimulation with unstructured noise, frequency-modulated (FM) tones, reversed speech, pseudowords and words. For all conditions, subjects performed a material-nonspecific detection response when a train of stimuli began or ceased. Dorsal areas surrounding Heschl's gyrus bilaterally, particularly the planum temporale and dorsolateral superior temporal gyrus, were more strongly activated by FM tones than by noise, suggesting a role in processing simple temporally encoded auditory information. Distinct from these dorsolateral areas, regions centered in the superior temporal sulcus bilaterally were more activated by speech stimuli than by FM tones. Identical results were obtained in this region using words, pseudowords and reversed speech, suggesting that the speech–tones activation difference is due to acoustic rather than linguistic factors. In contrast, previous comparisons between word and nonword speech sounds showed left-lateralized activation differences in more ventral temporal and temporoparietal regions that are likely involved in processing lexical–semantic or syntactic information associated with words. The results indicate functional subdivision of the human lateral temporal cortex and provide a preliminary framework for understanding the cortical processing of speech sounds.
Introduction
Several lines of evidence suggest that the superior temporal cortex in humans plays a vital role in speech sound processing. This region contains the cortical representation of the auditory sensory system (Wernicke, 1874
What has not emerged from these observations is a more detailed picture of the functional organization of the human superior temporal region. From an information processing perspective, the perception of speech involves several stages of auditory sensory analysis and pattern extraction as well as interactions between sensory and stored linguistic information (McClelland and Elman, 1986; Klatt, 1989). These analysis stages imply a degree of hierarchical functional organization in the speech perceptual system analogous to the levels of processing observed in ventral visual recognition pathways. From an anatomical and physiological standpoint, there is ample evidence in nonhuman primates for subdivision of the superior temporal region into a large number of distinct areas (Galaburda and Pandya, 1983; Morel et al., 1993; Rauschecker et al., 1995; Hackett et al., 1998). Although available anatomical and physiological data are consistent with some degree of functional subdivision of the human superior temporal cortex (Celesia, 1976; Galaburda and Sanides, 1980; Liegeois-Chauvel et al., 1991; Rademacher et al., 1993; Howard et al., 1996), the precise correspondence between human and nonhuman auditory cortical areas remains unclear. The human equivalent of primary auditory cortex (area AI) is largely confined to the caudal one-third of the transverse temporal or Heschl's gyrus (HG) (Celesia, 1976; Galaburda and Sanides, 1980; Liegeois-Chauvel et al., 1991; Rademacher et al., 1993), suggesting a role for this region in early sensory analysis of speech sounds. Little else is currently known regarding the relationships between putative subdivisions of the human temporal lobe and the multiple processing stages proposed to underlie speech perception.
A first step would be to identify cortical regions that are selectively responsive to speech sounds in contrast to simpler sounds. Just as certain visual regions are activated by specific stimulus attributes like motion or color, it is possible that certain auditory areas become active only in the presence of speech sounds, which contain a variety of characteristic physical attributes (Liberman et al., 1967). Four functional imaging studies comparing processing of speech and simpler nonspeech sounds support this concept. In these studies, speech sounds elicited significantly stronger responses than amplitude-modulated noise (Zatorre et al., 1992) or tone sequences (Démonet et al., 1992; Binder et al., 1996, 1997) in a region of the superior temporal gyrus (STG) and adjacent superior temporal sulcus (STS) located lateral and ventral to HG and clearly distinct from human AI. While consistent with a functional subdivision of the superior temporal region, these findings raise other, as yet unresolved, issues.
One issue concerns the nature of the activation response elicited in these studies. Because the speech sounds differed from the nonspeech sounds in terms of both auditory complexity and phonetic content, it is not clear whether the differential activation by speech sounds is due to complex auditory processing, phoneme processing or both. Moreover, the speech stimuli in most of these studies included lexical items, raising the additional possibility that the speech activation could reflect processes pertaining to lexical access or semantic retrieval. To address this problem, several investigators contrasted listening to words with listening to meaningless speech sounds (pseudowords, syllables or reversed speech) that were closely matched to the words in terms of physical attributes. By matching on auditory processing demands, these comparisons should identify brain regions involved in processing post-sensory, linguistic representations (i.e. lexical and semantic knowledge) associated with the word stimuli. Unfortunately, no clear result has emerged from these experiments, some of which showed no superior temporal lobe differences between conditions (Wise et al., 1991; Démonet et al., 1992, 1994; Hirano et al., 1997; Binder et al., 1999), and others of which identified superior temporal foci but in inconsistent locations (Howard et al., 1992; Mazoyer et al., 1993; Perani et al., 1996; Price et al., 1996b).
A second issue concerns the relationship of this speech-specific region in lateral STG and STS to conventional neuroanatomical models of auditory word processing. The conventional view, based on lesion-deficit correlation and anatomical studies in humans, emphasizes the importance of the posterior left STG (Wernicke's area) and posterior temporal plane, or planum temporale (PT), in auditory word perception (Wernicke, 1874; Geschwind and Levitsky, 1968; Geschwind, 1971; Bogen and Bogen, 1976; Braak, 1978; Galaburda et al., 1978; Mesulam, 1990; Foundas et al., 1994). Relative to these classical areas, the speech-specific region identified by functional imaging is located somewhat anteriorly and ventrally, suggesting the possibility of an auditory processing stream that projects ventrolaterally from AI rather than posteriorly (Binder et al., 1996). The functional imaging data suggest, furthermore, that speech-responsive cortex exists in both left and right superior temporal regions (Démonet et al., 1992; Zatorre et al., 1992; Binder et al., 1997), in contrast to the conventional emphasis on functional lateralization to the language-dominant hemisphere. Taken as a whole, the functional imaging data are reminiscent of observations in nonhuman primate auditory cortex, which suggest a hierarchical, ventrolaterally directed flow of information from AI to STS in the superior temporal region bilaterally (Galaburda and Pandya, 1983; Hackett et al., 1998; Rauschecker, 1998).
We addressed these issues in the following experiments by using blood oxygenation level dependent (BOLD) fMRI to measure superior temporal responses to speech and nonspeech sounds. Our first aim was to identify cortical regions responsive to simple acoustic temporal structure in comparison to incoherent noise. BOLD responses to frequency-modulated (FM) tones were compared with responses elicited by unmodulated noise in 28 normal, right-handed subjects. The second aim was to identify regions that are selectively activated by speech sounds relative to these less complex nonspeech sounds. Accordingly, BOLD responses to speech sounds were measured concurrently in the same group of subjects and compared with responses elicited by the FM tones. Our third aim was to determine the sensitivity of these speech-specific regions to post-sensory linguistic variables. BOLD responses to words were compared with responses elicited by phonetically familiar nonwords (pseudowords) and to responses elicited by phonetically unfamiliar nonwords created by temporal reversal of speech sounds (reversed speech). Our rationale was that the word–pseudoword contrast would test whether the regions characterized as speech-specific are sensitive to lexical semantic attributes of input stimuli, and that the speech-reversed speech contrast would test sensitivity of this region to phonetic attributes of the input stimuli. As noted above, previous word–nonword comparisons have not demonstrated consistent results with respect to the superior temporal region. Many of these studies incorporated explicit language processing tasks that may have caused activation differences due to differences in attentional state (Howard et al., 1992; Perani et al., 1996; Price et al., 1996b). To avoid this potential confound, our subjects performed a simple, non-linguistic, detection response for all stimuli, such that explicit attentional requirements were uniform across conditions. Quantitative comparisons were made between activated regions in the left and right temporal lobe. Through meta-analysis of standard stereotaxic coordinate data, activation foci resulting from each contrast were compared with those identified in previous PET and fMRI studies.
Materials and Methods
Pilot Study: Reversed Speech Transcription
Reversed speech (i.e. backward playback of recorded speech waveforms) has been used frequently in functional imaging research (Howard et al., 1992; Perani et al., 1996; Price et al., 1996b; Dehaene et al., 1997; Hirano et al., 1997). Reversal produces nonword stimuli that match the originals in terms of physical complexity and acoustic characteristics and are therefore ideal as controls for acoustic input. While it seems unlikely that such stimuli evoke specific lexical or semantic associations, the amount of phonetic information conveyed by reversed speech sounds is unclear. Many phonemes are temporally symmetrical or relatively symmetrical (fricatives and long-duration vowel sounds), or show approximate mirror reversal of formant transition structure before and after a vowel, and so might convey considerable phonetic information even in reversed form. On the other hand, reversal produces other sound sequences that do not occur in naturally uttered speech and so strike the listener as unfamiliar. The purpose of this pilot study was to explore the degree to which listeners perceive and reliably report familiar phonemes on hearing reversed speech sounds.
Participants were six men and four women (age range 24–48 years, mean 31 years) who were monolingual English speakers with no training in phonetics or audiology and no history of neurologic or audiologic symptoms. Testing was performed in a quiet room. Digitally recorded stimuli were presented at clearly audible levels through standard headphones. The duration of all stimuli was edited to 666 ms using cutting and duplication techniques (for fricative and pause segments) together with digital time compression and expansion without altering pitch (SoundDesigner II software, Digidesign Inc., Menlo Park, CA). The stimuli included:
- Words: monosyllabic, medium frequency, concrete English nouns (e.g. desk, fork, stream) spoken by a male.
- Pseudowords: monosyllabic, pronounceable nonwords, derived by rearranging the phoneme order of the word stimuli (e.g. /sked/, /korf/, /rimst/). These were recorded concurrently with the word stimuli by the same talker and were matched to the words in terms of duration, overall spectral content, overall intensity, and phoneme content.
- Reversed speech (Reversed): temporally reversed versions of the word stimuli.
There were 48 stimuli of each type, and each stimulus was presented once, for a total of 144 trials. Stimuli of a given type were presented in blocks of six trials (thus eight blocks per condition), and these blocks were presented in a randomized condition order.
After each stimulus, the subject attempted to transcribe what had been presented by typing into a computer keyboard. No time constraints were used. The typed string was visible at all times, but the subject was instructed not to change a letter that had been typed except in the event of an accidental key press. Subjects were instructed to make the most accurate and complete transcription possible, and to guess or approximate when necessary. When satisfied with a response, the subject advanced to the next trial using the enter key.
The aim of data analysis was to estimate the degree of consistency of transcription responses across subjects for each condition. Our assumption was that if the sounds were readily assignable to phoneme categories (i.e. phonetically familiar), then both the number of letters and the choice of letters used in transcription should be relatively consistent across subjects. For each stimulus, an automated program determined the most common number of letters used for transcription, as well as the most commonly selected letter for each letter position. Using these canonical responses, the proportion of subjects in agreement with the canonical response was then computed. Most English phonemes have several acceptable alternative spellings, which may vary in length (e.g. c, k or ck for /k/). We therefore computed the percent agreement based on criteria that did not require a perfect match with the canonical response. For the letter count measurement (Length), any response with a letter count of ±1 relative to the canonical was accepted as a match. For the letter choice measurement (Letter), a letter response could vary by ±1 letter position relative to the canonical. Although only rough approximations, these measures were assumed to reflect the degree of transcription consistency across subjects, and thus the degree of consistency of phonetic categorization of the presented sounds. It was anticipated that performance on the words, which have known standard spellings, would be virtually perfect. Of interest was any difference that would arise between pseudowords and reversed speech, which would be most readily attributable to differences in phonetic familiarity.
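The two consistency measures described above can be sketched in a few lines. This is only an illustration of the stated rules (canonical length, canonical letter per position, ±1 tolerance). The function names, the tie-breaking rule (first value encountered wins) and the way hits are pooled over subjects and letter positions are assumptions, since the original automated program is not described in that detail.

```python
from collections import Counter

def canonical_response(responses):
    """Most common transcription length and most common letter per position."""
    length = Counter(len(r) for r in responses).most_common(1)[0][0]
    letters = []
    for i in range(length):
        # Only responses long enough to have a letter at position i contribute.
        column = [r[i] for r in responses if len(r) > i]
        letters.append(Counter(column).most_common(1)[0][0])
    return length, "".join(letters)

def length_agreement(responses, tol=1):
    """Proportion of responses whose letter count is within tol of canonical."""
    length, _ = canonical_response(responses)
    hits = sum(abs(len(r) - length) <= tol for r in responses)
    return hits / len(responses)

def letter_agreement(responses, tol=1):
    """Proportion of canonical letters found within tol positions in a response."""
    _, canon = canonical_response(responses)
    hits = 0
    for r in responses:
        for i, ch in enumerate(canon):
            window = r[max(0, i - tol): i + tol + 1]
            hits += ch in window
    return hits / (len(responses) * len(canon))

# Hypothetical transcriptions of one stimulus by four subjects.
demo = ["desk", "desk", "desc", "desck"]
print(canonical_response(demo), length_agreement(demo), letter_agreement(demo))
```

In this toy input the spelling "desck" still counts as a match on both measures, which is the kind of alternative spelling the ±1 criteria were meant to tolerate.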
fMRI Subjects
Subjects in the fMRI study were 13 men and 15 women, ranging in age from 19 to 41 years (mean 26.0 years), with no history of neurologic or audiologic symptoms. All subjects were monolingual English speakers, and all indicated right hand dominance (laterality quotient > 33) on the Edinburgh Handedness Inventory (Oldfield, 1971). After a full explanation of the nature and risks of the research, subjects gave written informed consent for all studies according to a protocol approved by the institutional research review board.
Image Acquisition
Imaging was performed on a 1.5 T General Electric Signa scanner (GE Medical Systems, Milwaukee, WI) using three-axis local gradient and insertable transmit-receive radiofrequency coils (Medical Advances, Milwaukee, WI) designed for echo-planar imaging (EPI) (Wong et al., 1992). Sagittally oriented, gradient-echo EPI images were acquired for functional studies using the following conditions: echo time = 40 ms, repetition time = 4000 ms, field of view = 240 mm and in-plane voxel dimensions = 3.75 × 3.75 mm. Sequential images were collected concurrently at contiguous sagittal locations in the lateral portion of each hemisphere using spatially interleaved acquisitions. Slice thickness (7–8 mm) and number of slices (3–6 per hemisphere) varied across subjects, although for a given subject the number of slices was the same in the left and right hemispheres. Care was taken to position the slices at the same location relative to the lateral brain edge in both hemispheres. To minimize overall scanner noise levels, the minimal number of slices needed to cover the temporal lobes was acquired, using a relatively long interscan interval (i.e. repetition time/number of slices). While slice coverage varied by a few millimeters from subject to subject, brain tissue located between stereotaxic x coordinates 42 and 62 was imaged in both hemispheres of all subjects. This volume includes HG, superior temporal plane, STS and lateral temporal lobe, excluding part of the lateral temporal surface lying midway along the length of the STS, and the medial tip of HG (Talairach and Tournoux, 1988). Also included in this slice selection are the middle and inferior temporal gyri, the lateral half of the fusiform gyrus, inferior frontal gyrus, supramarginal and angular gyri, and lateral occipital cortex. Two EPI series with 168 repetitions each were acquired and subsequently concatenated after removing initial signal equilibration images. The two final images in each series were obtained using slightly modified echo time to provide phase data for the generation of B-field maps, which were used for automated image unwarping. High-resolution anatomical images of the entire brain were obtained during the same session using either a fast-spin-echo sequence yielding contiguous 4 mm sagittal slices (six subjects) or a 3-D spoiled-gradient-echo sequence (SPGR, GE Medical Systems, Milwaukee, WI) yielding contiguous 1.2 mm sagittal slices (22 subjects).
Stimuli and Task
Room lights were dimmed during scanning, and subjects were instructed to keep their eyes closed. Stimuli were 16-bit digital sound files sampled at 44.1 kHz and low-pass filtered at 11 kHz. These sounds were presented binaurally using a computer playback system, a magnetically shielded transducer system and plastic sound conduction tubes (Binder et al., 1995). The conducting tubes were threaded through tightly occlusive ear inserts that attenuated the average sound pressure level (SPL) of the continuous scanner noise by ~20 dB. Average intensity of the experimental stimuli was 100 dB SPL and was matched across all conditions. Average intensity of the scanner noise was ~75 dB SPL after attenuation by the ear inserts.
Five types of auditory stimulus were compared (Fig. 1). Stimuli of a single type were presented during activation periods lasting 12 s. Activation periods were always preceded and followed by baseline periods each lasting 12 s, during which only the background scanner noise was heard. There were eight activation periods for each of the five conditions (40 activation periods in all). The order of these conditions was randomized, and all conditions were presented before a condition was repeated. The duration of all stimuli was 666 ms. All stimuli had 10 ms linear on and off intensity envelopes, and the inter-stimulus interval was 0 ms. There were thus 18 individual stimuli presented during each 12 s activation period and a total of 144 stimuli in each condition over the course of the experiment.
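The stimulus counts quoted above follow directly from the block timing. A minimal check of that arithmetic, with constant names chosen here purely for illustration:

```python
# Timing values are taken from the text; the names are illustrative.
STIM_MS = 666              # duration of one stimulus, no inter-stimulus gap
BLOCK_MS = 12_000          # duration of one activation period
BLOCKS_PER_CONDITION = 8
N_CONDITIONS = 5

# Whole stimuli that fit into one activation period.
stimuli_per_block = BLOCK_MS // STIM_MS
# Stimuli presented per condition over the whole experiment.
stimuli_per_condition = stimuli_per_block * BLOCKS_PER_CONDITION
# Activation periods across all conditions.
activation_periods = BLOCKS_PER_CONDITION * N_CONDITIONS

print(stimuli_per_block, stimuli_per_condition, activation_periods)
```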
The stimuli included:
- Noise: a white noise burst containing all frequencies from 20 to 11 000 Hz. Except for the on and off envelopes every 666 ms, this repeating stimulus contained no coherent frequency or amplitude modulation.
- Tones: pure sine wave tones having a constant frequency within the range 50–2400 Hz. Forty-eight different, harmonically unrelated, frequencies were used within this range; thus each frequency was presented three times during the course of the experiment. Because a new tone was presented every 666 ms, the frequency changed randomly and by at least 10 Hz in a stepwise manner every 666 ms.
- Words: monosyllabic, medium frequency, concrete English nouns (e.g. desk, fork, stream) spoken by a male. These were the same word stimuli used in the pilot study. All of the acoustic energy of these sounds was below 11 000 Hz, with the majority between 50 and 3000 Hz. All stimuli were edited to a duration of 666 ms. Forty-eight different words were used; thus each was presented three times during the course of the experiment.
- Pseudowords: monosyllabic, pronounceable nonwords, derived by rearranging the phoneme order of the word stimuli (e.g. /sked/, /korf/, /rimst/). These were the same pseudoword stimuli used in the pilot study. They were recorded concurrently with the word stimuli by the same talker. These stimuli were matched to the words in terms of duration, overall spectral content, overall intensity and phoneme content. Forty-eight different stimuli were used; thus each was presented three times during the course of the experiment.
- Reversed speech (Reversed): temporally reversed versions of the word stimuli. These were the same reversed stimuli used in the pilot study. Temporal reversal results in sounds that are, on average, less readily categorized phonetically (see Pilot Data, below). These stimuli were thus matched to the word and pseudoword stimuli on overall spectral content, intensity, duration and acoustic complexity, but conveyed less phonetic information and presumably very little lexical or semantic information. Forty-eight different stimuli were used; thus each was presented three times during the course of the experiment.
In summary, the stimulus types differed along several dimensions, as shown in Table 1.
Subjects were instructed to listen to the sounds being played through the eartubes and to briefly press a button with the left hand whenever a 12 s block of sounds began or ceased.
Within-subject Data Analysis
All image analysis was performed with the AFNI software package (Cox, 1996a). Automated image unwarping was performed using field maps obtained from the final two images in each EPI series (Weisskoff and Davis, 1992). All images were then spatially co-registered (to the first steady-state EPI image acquired after the anatomical scan) using an iterative procedure that minimizes variance in voxel intensity differences between images (Cox, 1996b).
Activation–baseline difference (Diff) maps were created for each of the 40 activation periods as follows. First, the final two images from each activation period were averaged to produce an image of average signal intensity values during the last 8 s of each period. Next, the final two images from the baseline periods immediately preceding and following each activation period were averaged, and a Diff image was created for each activation period by subtracting the average baseline image from the corresponding average activation image (Binder et al., 1994a).
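The Diff computation for a single activation period can be sketched as follows. Images are represented here as flat lists of voxel intensities, and the function names are illustrative rather than taken from the analysis software. With a 4 s repetition time each 12 s period yields three images, of which only the last two are used.

```python
def mean_image(images):
    """Voxelwise mean of equally sized images (flat lists of floats)."""
    n = len(images)
    return [sum(vals) / n for vals in zip(*images)]

def diff_map(pre_baseline, activation, post_baseline):
    """Diff image for one activation period.

    Each argument is the list of images acquired during that period.
    The final two activation images are averaged, the final two images
    of the preceding and following baselines are averaged together,
    and the baseline average is subtracted from the activation average.
    """
    act = mean_image(activation[-2:])
    base = mean_image(pre_baseline[-2:] + post_baseline[-2:])
    return [a - b for a, b in zip(act, base)]

# Hypothetical two-voxel images: voxel 0 responds, voxel 1 does not.
pre = [[100.0, 50.0], [100.0, 50.0], [102.0, 50.0]]
act = [[101.0, 50.0], [104.0, 50.0], [106.0, 50.0]]
post = [[103.0, 50.0], [100.0, 50.0], [98.0, 50.0]]
print(diff_map(pre, act, post))
```

Discarding the first image of each period presumably allows for the lag of the hemodynamic response, although that rationale is an interpretation and is not stated in the text.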
Pooled-variance Student's t-tests of the condition-specific Diff maps were then used to create three-dimensional statistical parametric maps (SPMs) reflecting differences between conditions at each voxel location for each subject. SPMs were created for each of the five conditions compared with baseline by testing the Diff map values at each voxel against a hypothetical mean of zero. SPMs were also created for seven paired activation condition contrasts: Tones–Noise, Reversed–Tones, Pseudowords–Tones, Words–Tones, Pseudowords–Reversed, Words–Reversed and Words–Pseudowords. No assumptions were made about the direction of effects in the condition contrasts, thus all contrasts were tested and are reported for both directions.
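For a single voxel, the two kinds of test described above reduce to the textbook one-sample and pooled-variance two-sample t statistics. The sketch below uses those standard formulas with illustrative function names. It assumes the inputs are the eight Diff values per condition at one voxel, and it is not the AFNI implementation.

```python
import math
from statistics import mean, variance

def one_sample_t(values, mu0=0.0):
    """t statistic for testing a set of Diff values against a mean of mu0."""
    n = len(values)
    return (mean(values) - mu0) / math.sqrt(variance(values) / n)

def pooled_t(a, b):
    """Pooled-variance two-sample t statistic for condition a minus condition b."""
    na, nb = len(a), len(b)
    # Pooled estimate of the common variance, weighted by degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical Diff values at one voxel for two conditions.
words = [1.0, 2.0, 3.0, 4.0, 5.0]
tones = [0.0, 1.0, 2.0, 3.0, 4.0]
print(one_sample_t(words), pooled_t(words, tones))
```

Swapping the arguments of `pooled_t` flips the sign of the statistic, which is all that is needed to report a contrast in both directions.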
The volume of significantly activated tissue in left and right superior temporal regions was determined for each condition contrast in each subject as follows. First, each SPM was thresholded at a t value of 3.5 (uncorrected P < 0.01), and clusters of activation containing fewer than four raw data voxels were removed. These thresholds reduce the voxelwise probability of false positives to approximately P < 0.0001, as determined by Monte Carlo simulation using the AlphaSim module of AFNI. Next, these thresholded SPMs were superimposed on anatomical images from the same subject, and the activated voxels were classified anatomically by reference to sulcal landmarks. Voxels were considered to lie within the superior temporal region if they fell within the STG (including HG and PT), STS (including lower bank of STS) or planum parietale (the lower bank of the ascending ramus of the sylvian fissure). The number of such voxels was determined for each hemisphere and multiplied by the voxel volume to estimate the total volume of activated superior temporal tissue in each hemisphere.
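The thresholding, cluster-size filtering and volume estimate can be illustrated with a small flood-fill sketch. Several details here are assumptions and are not stated in the text: voxels are treated as connected when they share a face, the map is tested in one direction only (negate the map for the opposite direction), the voxel dimensions are example values, and the anatomical classification step is omitted.

```python
def suprathreshold_clusters(tmap, t_thresh=3.5, min_voxels=4):
    """Return clusters (sets of (x, y, z) indices) that pass both thresholds.

    tmap maps voxel indices to t values. Face adjacency (6-connectivity)
    is an assumption of this sketch.
    """
    active = {v for v, t in tmap.items() if t >= t_thresh}
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    clusters = []
    while active:
        seed = active.pop()
        cluster, stack = {seed}, [seed]
        while stack:
            x, y, z = stack.pop()
            for dx, dy, dz in steps:
                nb = (x + dx, y + dy, z + dz)
                if nb in active:
                    active.remove(nb)
                    cluster.add(nb)
                    stack.append(nb)
        # Discard clusters smaller than the minimum size.
        if len(cluster) >= min_voxels:
            clusters.append(cluster)
    return clusters

def activated_volume_ul(clusters, voxel_mm=(3.75, 3.75, 7.0)):
    """Total volume of the surviving voxels; 1 cubic mm equals 1 microliter."""
    n = sum(len(c) for c in clusters)
    return n * voxel_mm[0] * voxel_mm[1] * voxel_mm[2]

# Hypothetical map: a row of four strong voxels plus one isolated strong voxel.
tmap = {(x, 0, 0): 4.0 for x in range(4)}
tmap[(10, 10, 1)] = 5.0   # isolated, removed by the cluster-size criterion
tmap[(4, 0, 0)] = 1.0     # below threshold
kept = suprathreshold_clusters(tmap)
print(len(kept), activated_volume_ul(kept))
```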
Stereotaxic Averaging
Individual anatomical scans and unthresholded SPMs were also projected into the standard stereotaxic space of Talairach and Tournoux (Talairach and Tournoux, 1988). To compensate for normal variation in anatomy across subjects (Toga et al., 1993), the unthresholded, stereotaxically resampled SPMs were smoothed slightly with a Gaussian filter measuring 2 mm full-width-half-maximum. These datasets were then merged across subjects by averaging the t-deviates at each voxel (Binder et al., 1997). The procedure of averaging statistics was chosen to guard against heteroscedasticity of MR signal variance among subjects that could arise, for example, from differing degrees of subject motion or tissue pulsatility, variability in global blood flow or reactivity, or scanner variability between sessions.
The maps of averaged t-statistics were thresholded to identify voxels in which the mean change in MR signal between comparison conditions was unlikely to be zero. Because the average of a set of t-deviates is not a tabulated distribution, the Cornish–Fisher expansion of the inverse distribution of a sum of random deviates was used to select a threshold for rejection of the null hypothesis (Fisher and Cornish, 1960). Only average t-scores of 0.909 or greater were considered significant (voxelwise P < 0.0001). To further guard against false-positive findings, activation foci had to attain a contiguous volume of at least 15 µl. These thresholds reduced the voxelwise probability of false positives to ~P < 10⁻⁸ and the overall probability of false-positive voxels in each 3-D image to P < 0.001.
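To give a sense of how a threshold for an average of t-deviates can be derived, the sketch below applies a Cornish–Fisher correction truncated after the leading kurtosis term. It is not the authors' exact procedure. The degrees of freedom per subject (7, as for eight Diff maps tested against zero) and the two-tailed reading of P are assumptions made here for illustration.

```python
import math
from statistics import NormalDist

def averaged_t_threshold(n_subjects, df, p_two_tailed):
    """Approximate critical value for the mean of n independent t(df) deviates.

    Uses the normal quantile with a Cornish-Fisher correction for excess
    kurtosis only (Student's t is symmetric, so the skewness term is zero).
    Requires df > 4 so that the kurtosis is finite.
    """
    z = NormalDist().inv_cdf(1 - p_two_tailed / 2)
    var_t = df / (df - 2)             # variance of Student's t
    kurt_t = 6 / (df - 4)             # excess kurtosis of Student's t
    kurt_mean = kurt_t / n_subjects   # excess kurtosis of the mean of n deviates
    w = z + (z ** 3 - 3 * z) * kurt_mean / 24
    return w * math.sqrt(var_t / n_subjects)

# 28 subjects, assumed 7 degrees of freedom per subject, P < 0.0001.
print(averaged_t_threshold(28, 7, 0.0001))
```

Under these assumptions the result comes out at roughly 0.90, the same order as the reported cutoff. The small difference is expected given the truncated expansion and the assumed degrees of freedom.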
Results
Pilot Study: Speech Sound Transcription
Subjects in the transcription study uniformly described a sense of uncertainty and unfamiliarity regarding the reversed speech sounds. Transcriptions of these stimuli varied considerably across subjects. For example, responses to the reversed version of desk included excit, hisech and schah, and responses to the reversed version of milk included heckloineb, lem, koim and loycoym. Time taken to transcribe the sounds varied significantly as a function of stimulus condition (ANOVA with df 2,27; F = 22.67, P < 0.0001), with Reversed transcription taking significantly longer (mean for all subjects = 8344 ms) than Pseudoword transcription (mean = 6048 ms; P = 0.0028) and also significantly longer than Word transcription (mean = 3653 ms; P < 0.0001).
Response consistency analyses are shown in Figure 2. As expected, both Length and Letter consistency were close to 100% for Word stimuli. Consistency on the Length measure (number of letters ± 1) varied significantly across conditions (ANOVA with df 2,141; F = 28.684, P < 0.0001). Length consistency for Reversed stimuli was significantly less than for both Pseudoword and Word conditions (P < 0.0001 for both comparisons). Pseudoword and Word conditions did not differ on this measure. Consistency on the Letter measure (position-specific letter choice ± 1 position) also varied significantly across conditions (ANOVA with df 2,141; F = 149.508, P < 0.0001). Letter consistency for Reversed stimuli was significantly less than for both Pseudoword and Word conditions, and Pseudoword consistency was significantly less than Words consistency (P < 0.0001 for all comparisons).
Despite these differences in transcription consistency, it is notable that the percent agreement on the Reversed stimuli was clearly better than chance for the Letter measure (72.7%). Thus, considerable phonetic information was conveyed by the stimuli even if subjects, on average, categorized these sounds less reliably.
fMRI Task Performance
The listening task used during fMRI, which required a key press after the beginning and after the end of each block of sounds, was performed well by all subjects. A response was correctly produced on 99.4% of trials (range 94% to 100%). The number of missed responses did not differ across the five stimulus conditions (ANOVA with df 4,135; F = 1.066, n.s.).
Averaged Activation Maps: Stimulus versus Baseline
Averaged stimulus–baseline SPMs (Fig. 3) represent the activation induced by presentation of an experimental stimulus relative to ongoing activation from the background scanner sound. All stimuli produced extensive activation of the superior temporal region bilaterally. The dorsomedial boundary of this activated region was clear: while typically involving most of the dorsal temporal plane, activation did not spread medially into insular cortex. Posterior and ventral boundaries were less definite and were more dependent on stimulus type. Activated regions in the right temporal lobe were shifted somewhat anteriorly and laterally relative to left temporal activations. For example, in all conditions the activation along HG and PT extended farther medially in the left hemisphere, while activation of lateral STG and planum polare was more prominent in the right hemisphere. The stimuli also produced variable degrees of frontal lobe activation bilaterally.
Noise
Activated areas primarily included HG and surrounding dorsal plane bilaterally, with some spread to lateral STG. There were small right frontal activation foci near the ventral precentral sulcus (BA 6/44) and in pars triangularis of the inferior frontal gyrus (BA 45).
Tones
Dorsal plane and HG activation was stronger and more extensive than with Noise, particularly along the anterior dorsal plane (planum polare) in the right hemisphere. Activation also extended ventrolaterally into lateral STG and STS along most of the length of the STG. Activation was observed extending, finger-like, into the posterior MTG bilaterally; this was more prominent in the right hemisphere. Frontal foci were observed in the precentral gyrus (BA 6), ventral precentral sulcus (BA 6/44) and pars opercularis (BA 44) of the left hemisphere, and in the precentral gyrus, ventral precentral sulcus, pars opercularis and pars triangularis of the right hemisphere.
Reversed
Activation extended more ventrally than with Tones, more clearly involving the middle portion of STS bilaterally. Caudal projections into posterior left MTG were more prominent. Frontal lobe foci were similar to those observed with Tones.
Pseudowords
The activation pattern was approximately identical to that of Reversed.
Words
The activation pattern was approximately identical to Reversed and Pseudowords, except that activation in the ventral precentral sulcus and pars opercularis (BA 44 and 6) appeared somewhat more extensive bilaterally.
Simple Temporal Structure: Tones versus Noise
Direct comparisons between stimulus conditions were performed to more clearly delineate areas responding differently to different stimuli. Representative SPMs are displayed in Figures 4–6, and the locations of activation peaks for each comparison are given in Table 2.
The Noise and Tones stimuli differ in two important respects (see Table 1
No areas showed significantly stronger responses to Noise than Tones. Areas responding more strongly to Tones included much of the dorsal plane and lateral surface of STG bilaterally and a portion of the right STS (Fig. 4). Peak foci for this contrast surrounded the medial HG (primary auditory area) posteriorly, laterally and anteriorly (Table 2), but the primary auditory area itself was not greatly involved. The preference for Tones was strongest posterior to HG (in planum temporale) in the left hemisphere and strongest anterolateral to HG (in anterolateral STG and planum polare) in the right hemisphere. No frontal areas showed significant differences between Tones and Noise.
Speech-sensitive Areas: Words versus Tones
The Words-Tones contrast was intended to reveal brain regions that are more responsive to speech than to nonspeech sounds with simple temporal structure. This comparison provides a replication test of previously reported findings using similar stimulus contrasts (Démonet et al., 1992; Zatorre et al., 1992; Binder et al., 1997), and allows the location of these regions to be compared with those associated with simple temporal processing (Tones-Noise) in the same subjects.
No areas showed significantly stronger responses to Tones than Words. Areas responding more strongly to Words included the mid-portion of the lateral STG (lateral to the anterior aspect of HG) and adjacent STS bilaterally (Fig. 5A). There were at least two neighboring but distinct peak foci in each hemisphere. These were located, on average, 7-8 mm ventral to the Tones-Noise foci (Table 2). No frontal areas showed significant differences between Words and Tones.
A direct comparison between speech-sensitive regions, identified in the Words-Tones contrast, and simple temporal processing regions, identified in the Tones-Noise contrast, is shown in Figure 5B. A dorsal-to-ventral organization is apparent, with simple temporal processing regions located dorsally on the STG and speech-sensitive regions located more ventrolaterally, within or near the STS. An intermediate zone responded more to Tones than Noise and also more to Words than Tones.
Linguistic Factors: Words versus Nonword Speech
Contrasts between word and nonword speech sounds were conducted to test whether the regions selectively activated by word stimuli are sensitive to phonetic or lexical-semantic attributes of the stimuli. As discussed earlier, the Pseudoword stimuli differ from the Word stimuli primarily in terms of eliciting fewer explicit lexical-semantic associations. The Reversed stimuli provide a stronger contrast, differing from the Word stimuli in terms of both phonetic intelligibility and lexical-semantic content. Pseudoword and Reversed conditions were each contrasted with Tones in order to compare these results with the Words-Tones data, and both of these nonword conditions were contrasted directly with the Words condition.
The Pseudowords-Tones and Reversed-Tones contrasts yielded results indistinguishable from Words-Tones (Fig. 6). In each case, bilateral regions at the mid-lateral STG and adjacent STS demonstrated stronger responses to speech sounds than Tones. The locations of peak foci were strikingly similar (Table 2) and differed across the three contrasts by an average of only 3.0 mm (SD = 1.8). No areas showed stronger responses to Tones in either contrast.
The direct contrasts Words-Pseudowords and Words-Reversed revealed no areas of significant activation difference in either direction in either hemisphere, nor did a direct contrast between Pseudowords and Reversed conditions. To investigate the possibility that small differences in activation between word and nonword conditions might have been missed by overly conservative thresholding, a second set of SPMs was produced using a much more lenient voxelwise threshold for significance (uncorrected P < 0.05) and no cluster size threshold. Of particular interest was whether the set of voxels originally identified in the Words-Tones contrast (Fig. 5A) would show any activation differences between the three speech conditions (Words, Pseudowords, Reversed). Several small left hemisphere foci showed stronger activation to Words than to Pseudowords using this threshold. These were located in the posterior inferior temporal gyrus (BA 20/37, peak coordinates = 46, 40, 17), rostral middle frontal gyrus (BA 46/10, peak coordinates = 40, 40, 0), angular gyrus (BA 39, peak coordinates = 44, 54, 39), and the border between posterior middle and inferior temporal gyri (BA 21/37, peak coordinates = 53, 46, 5). None of these clusters overlapped with the clusters identified in the speech-nonspeech contrasts (Figs 5A, 6). No activation differences favoring Words were observed in the Words-Reversed contrast even using this more lenient threshold. Of note, however, several small regions in the right anterior STG (peak coordinates = 58, 9, 10) and STS (peak coordinates = 63, 13, 2) showed stronger activation to Reversed than to Words at these thresholds. In the contrast between Pseudowords and Reversed, the only differences observed were several very small areas in the right anterior STG that favored Reversed stimuli. These regions were smaller but otherwise identical to those observed in the Words-Reversed contrast.
Hemispheric Asymmetry of Superior Temporal Activation
The volume of activated superior temporal lobe tissue was estimated for each condition contrast and each hemisphere by multiplying the volume of each voxel by the number of voxels within the superior temporal region that survived a statistical threshold of P < 0.0001 (Fig. 7). Activation volumes in individual subjects ranged from 0 to 27.76 ml. No voxels survived the statistical threshold in any subject in any of the word-nonword contrasts (i.e. contrasts between Words, Pseudowords, and Reversed), so these contrasts were not analyzed further. As expected, the largest activation volumes were obtained in the speech-baseline contrasts (mean for Words, Pseudowords and Reversed averaged across left and right hemispheres = 10.902 ml). Tones-baseline also resulted in extensive activation (mean averaged across hemispheres = 8.134 ml). Smaller activation volumes were obtained for Noise-baseline (mean averaged across hemispheres = 2.589 ml), speech-Tones (mean for Words-Tones, Pseudowords-Tones and Reversed-Tones averaged across hemispheres = 0.860 ml) and Tones-Noise (mean averaged across hemispheres = 0.524 ml). All of these means were significantly greater than zero at P < 0.0001.
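The volume estimate described above amounts to counting suprathreshold voxels and scaling by the volume of a single voxel. The following is a minimal sketch of that arithmetic, not the authors' analysis code; the function name, the example P values and the voxel dimensions are hypothetical and chosen only for illustration.

```python
def activation_volume_ml(p_values, voxel_dims_mm, p_threshold=1e-4):
    """Estimate activated volume in ml for one region of interest.

    p_values: per-voxel P values within the region (hypothetical input).
    voxel_dims_mm: (dx, dy, dz) voxel dimensions in mm (assumed values).
    p_threshold: voxels count as activated when P < p_threshold.
    """
    dx, dy, dz = voxel_dims_mm
    voxel_volume_ml = (dx * dy * dz) / 1000.0  # 1 ml = 1000 mm^3
    n_surviving = sum(1 for p in p_values if p < p_threshold)
    return n_surviving * voxel_volume_ml


# Illustrative data only: four voxels, two of which fall below P < 0.0001.
example_p = [1e-5, 0.5, 1e-6, 1e-4]
example_volume = activation_volume_ml(example_p, (2.0, 2.0, 5.0))
print(example_volume)  # 2 voxels x 0.02 ml = 0.04 ml
```

Because the threshold is applied as a strict inequality, a voxel with P exactly equal to 0.0001 is not counted; whether the original analysis treated that boundary the same way is not stated in the text.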
Collapsing across all nine condition contrasts, activation volume was significantly greater in the left hemisphere than in the right hemisphere (mean left-right difference = 0.691 ml; t = 4.669, P < 0.0001). However, this left-right difference varied significantly by condition contrast (ANOVA with df 8,243; F = 3.290, P = 0.0014). Of the nine condition contrasts, only Words-baseline showed a significant left-right volume difference after Bonferroni correction for the number of comparisons (mean left-right difference = 2.315 ml; t = 4.312, P = 0.0002), although Pseudowords-baseline (P = 0.027), Words-Tones (P = 0.01), Pseudowords-Tones (P = 0.038) and Reversed-Tones (P = 0.089) all showed trends in the same direction of greater volume on the left side. When the three speech-Tones condition contrasts were collapsed together, activation volume in the left hemisphere was significantly greater than the right (t = 3.831, P = 0.0002), although the absolute mean left-right difference was comparatively small (0.493 ml). There was no significant lateralization for Tones (mean left-right difference = 0.057 ml), Noise (mean left-right difference = 0.353 ml) or Tones-Noise (mean left-right difference = 0.070 ml).
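The effect of the Bonferroni correction on the reported P values can be illustrated with a short sketch. It assumes a family-wise alpha of 0.05 divided across the nine condition contrasts; the text does not state the alpha level, so that value and the function name are assumptions, and only the five P values quoted above are included.

```python
def bonferroni_significant(p_values, n_comparisons, family_alpha=0.05):
    """Flag which contrasts survive a Bonferroni-corrected threshold.

    family_alpha is an assumed family-wise error rate (not given in the text).
    """
    threshold = family_alpha / n_comparisons
    return {name: p < threshold for name, p in p_values.items()}


# P values for the left-right volume difference as quoted in the text.
reported_p = {
    "Words-baseline": 0.0002,
    "Pseudowords-baseline": 0.027,
    "Words-Tones": 0.01,
    "Pseudowords-Tones": 0.038,
    "Reversed-Tones": 0.089,
}
result = bonferroni_significant(reported_p, n_comparisons=9)
print(result)
```

Under this assumption the corrected threshold is 0.05 / 9, or about 0.0056, so only Words-baseline survives, which matches the pattern described above.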
Thus, left-right differences in activation volume were comparatively small except for Words-baseline. It is apparent from Figure 7 that the hemispheric asymmetry resulting from the speech-baseline contrasts was greatest for Words, less for Pseudowords and least for Reversed speech. However, this difference is not due to differential left hemisphere activation but rather to differential right hemisphere activation. That is, Reversed-baseline and Words-baseline produced almost identical activation volumes in the left hemisphere (mean = 11.625 and 11.636 ml, respectively), whereas Reversed-baseline tended to activate a larger volume of the right hemisphere (mean = 11.182 ml) than did Words-baseline (mean = 9.320 ml).
| Discussion |
|---|
|
|
|---|
BOLD fMRI responses were obtained from superior temporal cortex in the human brain using a wide range of auditory stimuli, including white noise, FM tones and speech. HG and PT, on the dorsal temporal surface, were activated by all stimuli, whereas activation of the lateral surface of the STG and the STS was more variable across stimuli. The spatial extent of activation was greatest for speech sounds, less for FM tones and least for noise. Although primary auditory cortex on the medial half of HG responded equivalently to all stimuli, much of the surrounding dorsal temporal surface and lateral STG showed stronger activation to FM tones than to noise, despite the greater bandwidth of the noise stimuli. More ventrolateral regions, located primarily in the STS, showed stronger activation to speech sounds than to FM tones. Activation of these areas was equivalent whether the speech sounds were words, pseudowords or reversed speech.
Superior temporal activation occurred bilaterally for all stimuli, as in previous auditory studies (Petersen et al., 1988; Wise et al., 1991; Howard et al., 1992; Mazoyer et al., 1993; Binder et al., 1994b; Fiez et al., 1995, 1996; O'Leary et al., 1996; Price et al., 1996b; Warburton et al., 1996; Dhankhar et al., 1997; Grady et al., 1997; Hirano et al., 1997; Jäncke et al., 1998). Right temporal areas of activation were shifted anterolaterally relative to homologous activations in the left hemisphere. This asymmetry is consistent with previous morphometric studies showing a relative anterolateral shift of HG in the right hemisphere (Penhune et al., 1996). Right auditory cortex activation included more of the anterolateral STG and planum polare than did left auditory cortex activation. This arrangement may be compensation for the fact that the horizontal portion of the PT is generally larger in the left hemisphere (Geschwind and Levitsky, 1968; Steinmetz et al., 1991; Witelson and Kigar, 1992; Leonard et al., 1993; Foundas et al., 1994). Quantitative measures indicated relative symmetry of activation volume for the superior temporal region as a whole for most stimulus conditions. A notable exception was the Words condition, which produced significantly less activation of the right temporal lobe than the left. However, because Words produced no more left hemisphere activation than the Reversed stimuli, this asymmetry is not likely due to additional (language specific) left hemisphere recruitment by the Words stimuli. Rather, the asymmetry during Words results from the fact that these stimuli were relatively less effective (compared with nonwords) at activating the right temporal lobe (Fig. 7). Using relaxed statistical criteria, regions in the anterior right STG were more strongly activated by Reversed stimuli than by Words or Pseudowords in direct voxelwise comparisons. One possible explanation for this pattern is that the nonwords (particularly the Reversed stimuli) were relatively novel and may have engaged subjects' attention more than the Words stimuli (Wood and Cowan, 1995). Several prior studies have demonstrated modulatory effects of attention on auditory cortex activation (O'Leary et al., 1996; Pugh et al., 1996; Grady et al., 1997). If this account is correct, the data imply a greater susceptibility of the right auditory cortex than the left to such attentional modulation.
In the following discussion, we focus on the temporal lobe activation observed when speech sounds are contrasted with simpler nonspeech sounds. At issue are the following: (i) How reliably is this activation observed? (ii) Where is the activation located, and how variable is the location across different studies? (iii) Does the activation reflect auditory, phonetic, lexical or semantic processing? (iv) Why were no differences in activation observed using words and nonwords? Previously published results from studies contrasting speech to nonspeech and words to nonwords are reviewed. Finally, the available data are interpreted within a proposed hierarchical model of auditory word processing in the temporal lobe.
Speech Activation Relative to Nonspeech
The results confirm previous reports describing areas within the superior temporal region that respond more strongly to speech sounds than to nonspeech sounds. Relative to areas identified in the Tones-Noise contrast, these speech processing areas are significantly more ventral (Fig. 5B). Figure 8 shows speech > nonspeech peak activation foci identified in the present study and in three previous studies (Démonet et al., 1992; Zatorre et al., 1992; Binder et al., 1997). The findings are remarkably consistent, particularly in the left hemisphere, despite differences across studies with regard to the speech stimuli used (words, pseudowords, syllables, reversed speech), type of nonspeech stimuli (amplitude modulated noise, frequency modulated tones), task performed by the subject and imaging methodology. The activation peaks cluster closely near the mid-portion of the STS and lateral STG. The mean of these peaks is at x = 55.5 (SD 2.3), y = 20.2 (SD 10.9), z = 0.3 (SD 4.1) in the left hemisphere and x = 57.0 (SD 4.2), y = 15.0 (SD 13.8), z = 2.2 (SD 6.1) in the right hemisphere. According to the atlas of Talairach and Tournoux, these mean coordinates are in a portion of the STS located ventral to the lateral aspect of the transverse temporal sulcus (Talairach and Tournoux, 1988).
Activation of the right temporal lobe was less consistent across the four studies. In two studies there was a strong left-right asymmetry of the areas identified (Démonet et al., 1992
Effects of Linguistic and Acoustic Factors
Despite what appear to be significant linguistic differences between the Word, Pseudoword and Reversed stimuli used in the present study, we detected no significant differences in brain activation state when subjects listened to these stimuli and performed a nonspecific observing response. The same voxels identified in the Words-Tones contrast were also identified by contrasting Pseudoword or Reversed stimuli with Tones, and there were no differences in activation level of these voxels across the Words, Pseudowords and Reversed conditions even using very lenient significance thresholds. Processing in these regions thus appears to be unrelated to activation of semantic associations of the stimuli, given that the nonwords (particularly the Reversed stimuli) are unlikely to have resulted in retrieval of specific semantic information. What remains at issue is whether the data suggest lexical, phonetic or auditory processes as the explanation for activation in the speech-nonspeech contrasts.
These findings do not exclude the possibility that some or all of the speech > nonspeech activation could be due to processing of lexical or phonetic representations. Some theorists have proposed computational models of word recognition that accomplish pseudoword recognition through activation of multiple lexical representations (Glushko, 1979; McClelland and Rumelhart, 1981), so it is possible that lexical information was activated during all three conditions (Wise et al., 1991; Price et al., 1996a). Similarly, the pilot data presented earlier demonstrate that subjects are capable of extracting phonetic information from pseudowords and reversed speech sounds, suggesting that phonetic categorization processes were probably activated to some degree during all three conditions. On the other hand, previous investigators have generally assumed that reversed speech sounds do not elicit lexical processing (Howard et al., 1992; Perani et al., 1996; Price et al., 1996b; Dehaene et al., 1997; Hirano et al., 1997). Although transcription reliability was better than chance for all stimuli in the pilot study, subjects had greater difficulty transcribing reversed speech than pseudoword speech, suggesting that reversed speech sounds are, in general, less easily classified within familiar phonetic categories. Phonetic processing of the reversed sounds may have been further suppressed during the imaging study by the competing scanner noise. Despite these important differences in the way subjects interpret reversed speech sounds, there were no corresponding differences in activation of STG and STS regions identified in the speech-nonspeech contrasts. The data thus provide no definite evidence that the speech > nonspeech activation difference in these regions is related to either lexical or phonetic processing.
A simpler explanation is that the activation observed during these speech-nonspeech contrasts is due largely to physical differences between the speech and nonspeech stimuli (Table 1, Fig. 1). Compared with the Tones stimuli, which contain only simple, stepwise frequency modulations at predictable, infrequent intervals, speech sounds are characterized by nearly continuous frequency and amplitude modulations with complex and highly variable temporal structure. Speech sounds typically include both noise and periodic elements, and these components may occur simultaneously or in rapid succession. Unlike the Tones, the periodic components in speech sounds are harmonically complex, with constantly changing, inhomogeneous distributions of energy across the harmonic series caused by vocal tract resonances (speech formants). Acoustic features distinguishing speech phonemes can include any of these phenomena alone or in combination with other elements. For example, information about the voicing of stop consonants (i.e. the distinction between /b/ and /p/ or between /d/ and /t/) is represented in the time latency between consonant-initial noise burst and onset of periodicity, the fundamental frequency of the periodicity, the onset frequency of the first formant resulting from periodicity, the relative amplitude of the aspiration noise preceding periodicity and the duration of the vowel (if any) preceding the stop (Lisker and Abramson, 1964; Summerfield and Haggard, 1977; Repp, 1979, 1982; Kingston and Diehl, 1994). All of this acoustic information is typically presented within a few tens of milliseconds, necessitating continuous, rapid analysis not only of individual acoustic events but also of sequential or simultaneous combinations of events. While differing from the Tones condition, the three speech conditions were matched to each other in terms of this acoustic complexity. Thus the data, which demonstrate brain regions that respond more strongly to speech than to tones, but respond identically to word and nonword speech, are explainable on the basis of differences in acoustic content alone. This interpretation is also consistent with the fact that the speech > nonspeech activations were bilateral. If these activation differences are due primarily to differences in auditory sensory processing demands rather than to linguistic processing, they would be expected to occur in auditory cortex bilaterally.
Word versus Nonword Processing
Even if this interpretation of the speech > nonspeech activation is correct, there remains the question of why no differences were observed between the Words, Pseudowords and Reversed conditions in other brain areas. For example, left hemisphere ventral temporal regions, angular gyrus and prefrontal cortex have all been implicated in semantic processing (Petersen et al., 1989; Démonet et al., 1992; Kapur et al., 1994; Demb et al., 1995; Vandenberghe et al., 1996; Binder et al., 1997, 1999; Price et al., 1997b). Given the differences in semantic content across conditions outlined in Table 1, it is surprising that no significant activation differences were observed in these or other regions in the word-nonword contrasts. This lack of difference in brain activation by word and nonword speech has been reported by other investigators as well (Wise et al., 1991; Hirano et al., 1997). One possible explanation is that subjects simply did not process the semantic associations of the word stimuli because there was no explicit requirement to do so. Although straightforward, this account seems inadequate given the extensive body of evidence demonstrating implicit semantic processing when subjects are presented with meaningful linguistic stimuli (Carr et al., 1982; Marcel, 1983; Van Orden, 1987; Macleod, 1991; Price et al., 1996a). A familiar example is the Stroop effect, in which semantic processing of printed words (i.e. retrieval of word meaning) occurs even when subjects are instructed to attend to the color of the print and even when this semantic processing interferes with task performance (Glaser and Glaser, 1989; Macleod, 1991). Price et al. demonstrated that such implicit processing alters brain activity sufficiently to be detected by functional imaging (Price et al., 1996a). They showed that visual words and pseudowords produce activation in widespread, left-lateralized cortical areas relative to activation produced by nonsense characters and consonant strings, even when the explicit task (visual feature detection) is identical across all conditions.
Given the possibility of implicit lexical-semantic retrieval, any explanation for the observed lack of word-nonword differences necessarily becomes more complex. One variable that may be critical is presentation rate. In the present study, words were presented at a relatively rapid rate (1.5 per s) with no inter-stimulus interval. Although this is not fast compared with a typical spoken sentence, the words in this case were all content nouns, which typically occur at much slower rates in natural speech. It may be that activation of lexical-semantic word associations requires considerably more time than activation of phonological representations (Collins and Loftus, 1975; Carr et al., 1982; Van Orden et al., 1988). Moreover, the presentation of a second word immediately after the first may act as a semantic mask that disrupts retrieval processes related to the first word. In this way, rapid presentation of successive content words might result in relatively little activation of semantic information.
Another possibility is that pro







