
The Efficient Coding of Speech: Cross-Linguistic Differences

Abstract

Neural coding in the auditory system has been shown to obey the principle of efficient neural coding. The statistical properties of speech appear to be particularly well matched to the auditory neural code. However, only English has so far been analyzed from an efficient coding perspective. It thus remains unknown whether such an approach is able to capture differences between the sound patterns of different languages. Here, we use independent component analysis to derive information theoretically optimal, non-redundant codes (filter populations) for seven typologically distinct languages (Dutch, English, Japanese, Marathi, Polish, Spanish and Turkish) and relate the statistical properties of these filter populations to documented differences in the speech rhythms (Analysis 1) and consonant inventories (Analysis 2) of these languages. We show that consonant class membership plays a particularly important role in shaping the statistical structure of speech in different languages, suggesting that acoustic transience, a property that discriminates consonant classes from one another, is highly relevant for efficient coding.

Introduction

Increasing evidence suggests that neural representations in the auditory system follow the principles of efficient neural coding [1], an information theoretical principle known to underlie neural coding in several perceptual systems [2]. The theory of efficient neural coding holds that the sensory systems have evolved to encode environmental signals in an information theoretically optimal way, representing the greatest amount of information at the lowest possible cost [3,4]. To achieve this information theoretical optimum, the sensory systems need to capture the underlying statistical structure of environmental signals [5,6]. Indeed, it has been shown that mathematically derived efficient codes for different natural stimuli closely resemble neural response functions measured in the visual [7,8] and, more recently, in the auditory systems [1,9,10].

A better understanding of the statistical structure of natural stimuli is crucial for the assessment of the efficient coding theory and for a more fine-grained description of neural coding in general. Recent work [1,10,11] suggests that the statistical structure of speech is particularly similar to the auditory neural code. However, only one language, English, has so far been investigated. It thus remains unknown how much variation there is in the statistical structures of different languages. Languages of the world exhibit considerable differences in their sound patterns, e.g. in their speech rhythm, phoneme repertoire, syllable structure etc. To what extent and how these linguistic differences are reflected in the overall statistical structure of a language has not yet been explored. The first objective of the current study is to address this question. Specifically, we will use independent component analysis (ICA) to derive information theoretically optimal, non-redundant codes (filter populations) for seven typologically distinct languages (Dutch, English, Japanese, Marathi, Polish, Spanish and Turkish) and relate the statistical properties of these filter populations to documented differences in the speech rhythms (Analysis 1) and phoneme repertoires (Analysis 2) of these languages.

The efficient coding of natural auditory stimuli

Paralleling previous work in vision research [2,7], an increasing number of studies has recently investigated whether, and if so, how the statistical structure of natural sound stimuli might be reflected in the auditory neural code [1,10,12] and how it might be related to sound percepts in humans [11,13–15].

Of relevance for the current study is the finding [1,10] that physiologically measured auditory nerve responses in mammals [16–18] appear to best match the statistical properties of two specific sound classes. The first is a mixture of transient environmental sounds (e.g. breaking branches, cracking ice, dripping water etc.) and animal vocalizations. Transient environmental sounds, which are short, non-harmonic, broad-band signals, can be statistically characterized by a population of wavelet-like filters, localized both in frequency and time, whereas mammalian vocalizations, which are long, harmonic, narrow-band sounds, are best captured by Fourier-like filters, localized in frequency, but not in time. The mathematically derived efficient filters for a mixture of these two types of sounds match remarkably well the reverse correlation filters obtained from electrophysiologically measured auditory nerve responses. The second sound class for which this match is particularly strong is speech. Indeed, speech is a mix of harmonic and transient sounds and its theoretically derived efficient filter population is in between the wavelet-like filters for environmental sounds and the Fourier-like filters for animal vocalizations.

One way to characterize and numerically compare the filter populations for different sounds is to calculate the regression between the center frequency and the sharpness (center frequency divided by the bandwidth) of each filter within a population. The slopes of the resulting regression lines can then be compared across different filter populations. Consistent with the observation that the average power spectrum of speech, music and many other sounds is approximately 1/f [19], following a power law distribution, sharpness increased with center frequency for all three sound classes [10]. However, the slopes of the regression lines differed. The filter population for animal vocalizations had the steepest regression slope, as in this Fourier-like filter population the bandwidth of filters was almost constant and did not scale strongly with center frequency. By contrast, filters for environmental sounds had bandwidths that increased with increasing center frequency. The slope of the regression line was therefore less steep for this filter population. The slope for speech fell in between the other two slopes.

These results are consistent with the idea [10] that speech is a mix of transient and harmonic sounds, possibly because speech has evolved to recruit the two already existing neural codes, i.e. wavelet-like filters for environmental sounds and Fourier-like filters for vocalizations. Indeed, speech sounds greatly differ in harmonicity and transience. Stop consonants, for instance, are typically transient and non-harmonic, resembling environmental sounds, whereas vowels are longer and harmonic, like vocalizations. One prediction of this proposal, therefore, is that languages differing in the relative proportion of vowels and consonants should have efficient filter populations with different slopes.

Following up on this hypothesis, the efficient filter populations for different vowel and consonant subclasses have recently been investigated [20]. It has been found that efficient filter populations for different sound classes do indeed exhibit different slopes. Importantly, however, the main difference is found not between vowels and consonants, but between different consonant sub-classes. Specifically, different vowel classes (e.g. back vs. front; high vs. low) all have similar slopes, which closely match the slope of the filter population for speech in general [10]. By contrast, efficient filters for consonant classes show important variations in their slopes, with stop consonants (e.g. /b/, /t/, /p/ etc.) having the lowest slopes, close to that of transient environmental sounds, affricates (e.g. /tʃ/ etc.) having medium slopes, close to that of vowels and of speech in general, and fricatives (e.g. /s/, /f/ etc.) as well as nasals having steep slopes, similar to that of animal vocalizations. These results thus diverge from the original proposal [10], which suggested that harmonicity and transience both play a role in determining the statistical structure of speech sounds, rendering vowels similar to vocalizations, and consonants to environmental sounds. This more recent study [20] suggests instead that since consonants are not harmonic and consonant sub-classes differ greatly in their acoustic transience, it is transience alone that accounts for differences between sound classes, while harmonicity and bandwidth play a less important role.

The second objective of the current study is, therefore, to explore further which acoustic properties of speech underlie differences in the efficient codes for different languages, if such cross-linguistic differences can indeed be found. The respective roles of transience, harmonicity and bandwidth in determining the statistical structure of different languages and hence their efficient codes can be tested based on their phonological and perceptual correlates. Harmonicity correlates with the vowel/consonant distinction. The relative proportion of vowels and consonants in the speech signal is in turn related to the notion of speech rhythm. In fact, the relative proportion of vowels and consonants and the variability in vocalic and consonantal intervals in the speech signal are well-established operational measures of speech rhythm [21]. Therefore, deriving efficient codes for languages with different speech rhythms allows us to investigate the role of harmonicity. If harmonicity plays a role, as initially suggested [10], then we expect the efficient codes for languages to vary as a function of their speech rhythm, a well-documented factor of cross-linguistic variability.

If, by contrast, only transience is decisive [20], then vowels are less relevant for determining the slope of the sharpness regression line, and differences in the types of consonants found in a language are expected to underlie cross-linguistic differences in derived efficient codes. Languages of the world differ considerably in their consonant inventories. Therefore, by testing whether cross-linguistic differences between the efficient filters of different languages align with differences in speech rhythm, in consonant inventory or in both, we can determine what acoustic cues contribute most to cross-linguistic differences in the statistical structure of speech.

Cross-linguistic variations in speech rhythm and phoneme repertoire

The languages of the world show systematic variation in their speech rhythms and phoneme repertoires. Linguists have traditionally categorized languages into three rhythmic classes: syllable-timed languages, such as Spanish or Italian; stress-timed languages, like English or Dutch; and mora-timed languages, like Japanese and Tamil. This classification was operationalized [21] using three measures for describing and quantifying language rhythm: %V, i.e. the relative length of vocalic space in the speech signal, ΔV, i.e. the variability in the length of vocalic spaces, and ΔC, i.e. the variability in the length of consonant clusters. The authors measured these three properties in eight languages, English, Dutch, Polish, Catalan, Italian, Spanish, French and Japanese, and found that they cluster into three groups similar to the original classification when plotted in the two-dimensional spaces defined by any two of these measures. The measure %V appeared to correlate particularly well with previous classifications. This operational definition, which provides a continuous measure, has the advantage of yielding a quantitative basis for the original classification, while allowing languages with mixed or ambiguous rhythm to be accommodated.
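To make these measures concrete, a toy sketch is given below. It assumes a hand-annotated sequence of vocalic and consonantal interval durations; this input format is our illustration, not the annotation scheme of [21].

```python
# Illustrative computation of the rhythm metrics of [21] from a
# hypothetical annotation: a list of (duration_s, kind) intervals,
# with kind being "V" (vocalic) or "C" (consonantal).
import numpy as np

def rhythm_metrics(intervals):
    """Return (%V, deltaV, deltaC) for one utterance annotation."""
    v = np.array([d for d, kind in intervals if kind == "V"])
    c = np.array([d for d, kind in intervals if kind == "C"])
    percent_v = 100 * v.sum() / (v.sum() + c.sum())
    # deltaV / deltaC: standard deviations of interval durations
    return percent_v, v.std(), c.std()

# A toy CVCVC annotation (durations in seconds):
toy = [(0.08, "C"), (0.12, "V"), (0.15, "C"), (0.10, "V"), (0.06, "C")]
print(rhythm_metrics(toy))
```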

The seven languages used in the current study have the following rhythmic properties. Dutch, English and Polish are traditionally described as stress-timed languages and have relatively low %V values [21]; Spanish, Turkish and Marathi are syllable-timed languages, with medium to high %V values [21–24]; and Japanese is a mora-timed language, with very high %V values [21]. (Note that quantitative definitions of speech rhythm other than %V, ΔV and ΔC have been proposed in the literature [25–30]. Here we will be using the original %V, ΔV and ΔC measures, because the seven languages we analyze all have published %V, ΔV and ΔC values, whereas some lack the other metrics. Furthermore, our analyses are based on the same sound files as the published %V, ΔV and ΔC measures, ensuring full comparability between our study and the previous rhythm measures.)

Speech rhythm is critical in every aspect of speech perception, from the earliest prenatal experience through language acquisition to adult language comprehension. Indeed, speech is first experienced in the womb as a low-pass filtered signal, transmitting only global speech prosody and rhythm, but not individual sounds. Accordingly, newborns are able to recognize their native language on the basis of its rhythm [31] and can discriminate languages they have never heard before if those are rhythmically different [32–34]. Adults also use speech perception mechanisms optimized for the rhythmic characteristics of their native language [35,36], and are better able to maintain intelligibility under a wide set of circumstances (speech in noise, accelerated speech etc.) for non-native languages that resemble the rhythm of their native language than for those that are rhythmically different [37].

Importantly for the purposes of the current study, the above definition of rhythm relies on the vowel/consonant distinction, but not on the precise identity of specific vowels and consonants. This predicts that languages are perceived as having different rhythms as long as the relative timing of vocalic and consonantal spaces is preserved, even if the identities of the individual phonemes are suppressed. These predictions have been confirmed [38] by showing that adults were able to discriminate rhythmically different languages even if the signal was resynthesized replacing all vowels by /a/ and all consonants within a consonantal class by a representative of that class (fricatives by /s/, liquids by /l/, occlusives by /t/, nasals by /n/, and glides by /j/; the “saltanaj” transformation) or all vowels by /a/ and all consonants by /s/ (the “sasasa” transformation). Thus language rhythm is not sensitive to vowel and consonant identity or consonant class. However, language discrimination failed if all segments were replaced by /a/, i.e. if the consonant/vowel distinction was abolished.

Nevertheless, languages also differ in their speech sounds. The sizes of phoneme inventories in languages vary from a dozen sounds to well over a hundred. Consonant inventories also show large variations, from as few as six consonants up to more than a hundred [39]. The size principle argues that languages with smaller consonant inventories mainly have consonants that are phonetically, phonologically or articulatorily simpler, e.g. oral stop consonants, while languages with larger inventories have more complex consonant classes [40]. For instance, Rotokas, a language spoken in Papua New Guinea, has only six consonants, /p/, /t/, /k/, /b/, /d/ and /g/, which are exactly the oral stop consonants. Unlike in the case of rhythm, consonant class inventory is preserved under a “saltanaj”-like transformation, but not under a “sasasa”-like transformation. The languages of the current study have the following consonant class inventories [39,40]. English has an average-sized consonant inventory, in which all major consonant classes, i.e. stops, laterals, glides, affricates, fricatives and nasals, are represented. Stop consonants constitute the most frequent class, fricatives are the second, laterals and glides together come next, followed by nasals. Affricates are the least frequent. In a speech corpus of adult-directed conversational English, stops made up about 20% of the corpus, fricatives accounted for approx. 16%, laterals and glides 11%, nasals 10%, and affricates 1% [41]. Dutch and Turkish also have average-sized consonant inventories, with consonants in all major consonant classes [42,43]. The consonant inventory of Spanish is also average-sized, although somewhat smaller than that of English, Dutch and Turkish, as affricates are lacking and fricatives are fewer in number [44]. Polish has a large phoneme inventory, with dental stops, fricatives, nasals and affricates, which are not present in most of the other languages. The consonant inventory of Marathi is also large, with dental and retroflex stops and nasals not present in the other languages, although it has fewer fricatives and no affricates [45]. Japanese, by contrast, has a relatively small consonant inventory, with no affricates and fewer fricatives than in the other languages.

In sum, the current study seeks to test whether the statistical structure of speech in different languages shows cross-linguistic differences, and if yes, to which phonological and acoustic properties of speech these might be related. Specifically, we test whether cross-linguistic differences may be related to the different proportions of vocalic and consonantal intervals in the speech signals of different languages, i.e. a correlate of speech rhythm, and/or whether they may be related to differences in phoneme, in particular consonant, repertoires across the world’s languages. In the former case, the relevant physical properties of the acoustic signal are harmonicity, bandwidth and transience; in the latter case, only transience plays a role. Note that the two hypotheses are not mutually exclusive: speech rhythm and phoneme inventories show correlations across the world’s languages [46,47]. It is, therefore, possible that the properties of efficient filter populations are related to both of these linguistic features. However, our analyses will be able to shed light on their respective contributions to the statistical structure of the speech signal.

Analysis 1

In Analysis 1, we tested whether theoretically derived efficient filters for speech in seven rhythmically different languages correlated with existing acoustic measures of speech rhythm. We used a generalized independent component analysis (ICA) algorithm [10,48] to achieve an information theoretical optimum, i.e. the encoding of the speech signal with a set of independent filters that capture the statistical structure of the input at the lowest cost, i.e. with no redundancy. In this framework, it is assumed that the linear response of the auditory system, ri(t) = (wi ∗ x)(t), is given by the convolution of a set of filters wi(t) with the signal x(t) of length N (in matrix form, r = Wx, where the rows of the matrix W contain the filters). Efficient encoding is achieved if the set of filters is such that the statistical dependence of the responses is minimized. As the input to this analysis (Fig 1), we used speech samples from Dutch, English, Japanese, Marathi, Polish, Spanish and Turkish. These languages were chosen because they are geographically and historically unrelated, represent typologically different phonological, morphological and syntactic structures and, most importantly, belong to different rhythmic classes, as previously described [21–23]. To maximize comparability between earlier prosodic measures and the current analysis, we used the same sound files as the previous prosodic studies [21–23].

Fig 1. Algorithm.

The ICA algorithm used in the current study.

https://doi.org/10.1371/journal.pone.0148861.g001

Stimuli

The speech samples consisted of sentences recorded by female native speakers in Dutch, English, Japanese, Marathi, Polish, Spanish and Turkish, all obtained from previous studies on the prosodic and rhythmic properties of these languages [Dutch, English, Japanese, Polish [21]; Marathi, Turkish [22]; Spanish [23]]. For each language, the dataset consisted of simple declarative, news-like sentences ranging between 15 and 21 syllables in length, each recorded by four different female native speakers at a sampling rate of 16 kHz. The sentences in the different languages were roughly matched for meaning. The input dataset to the ICA algorithm was then constructed by randomly selecting 10,000 different 128-datapoint-long segments (corresponding roughly to 8 msec chunks at the current sampling rate) from the speech samples for each language.
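For illustration, a minimal sketch of this segment-sampling step is given below. It is a reconstruction under stated assumptions, not the authors' code: the file name is hypothetical, and a mono 16 kHz WAV file and the NumPy/SciPy libraries are assumed.

```python
# Sketch of the input construction described above: randomly draw
# 10,000 segments of 128 samples (~8 msec at 16 kHz) from a recording.
import numpy as np
from scipy.io import wavfile

SEG_LEN = 128         # 128 samples ~ 8 msec at 16 kHz
N_SEGMENTS = 10_000   # segments drawn per language

def sample_segments(path, n_segments=N_SEGMENTS, seg_len=SEG_LEN, seed=0):
    """Randomly draw fixed-length segments from a speech recording."""
    rate, signal = wavfile.read(path)   # assumes a mono, 16 kHz file
    signal = signal.astype(np.float64)
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(signal) - seg_len, size=n_segments)
    return np.stack([signal[s:s + seg_len] for s in starts])

# segments = sample_segments("dutch_sentences.wav")  # shape (10000, 128)
```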

As a basis for comparison with rhythmic measures, we used %V (as reported in [21] for English, Japanese, Dutch and Polish, in [22] for Marathi and Turkish, and in [23,24] for Spanish). We chose to use %V rather than other existing measures [25,26,29], because we had access to the sound files used to obtain the above cited %V measures. Furthermore, measures of %V have been linked to language discrimination and speech perception abilities in newborns, infants and adults [21,24,34,49,50], constituting solid evidence in favor of the psycholinguistic validity of this measure.

Algorithm

According to the efficient coding hypothesis, redundancy between information processing channels is reduced in sensory systems so as to maximize capacity for each channel (in the nervous system, maximizing channel capacity is a key design constraint, given the high metabolic cost of spiking neurons). In other words, efficient coding is achieved if processing channels are as independent as possible from each other, i.e. if redundancy is minimized. In this view, the transformation from input to output of a sensory system is such that the output channels are maximally independent. From a signal processing point of view, early stages in auditory processing can be modelled as a transformation from incoming auditory signals (input) to neural activity encoding auditory information (output). The mapping from input to output in the auditory system has been successfully modelled as a linear transformation [16,17] effected by appropriate filters acting on auditory signals. According to the efficient coding hypothesis, early stages in the auditory system can be modelled as a linear transformation that minimizes mutual information between output channels. This is of course an approximation valid within a limited dynamic range, where the non-linear effects of the early auditory system can be neglected. Since we are interested in the filters that act on the auditory signal, we need to solve an inverse problem (signal deconvolution): what is the linear transformation (whose matrix is formed by those filters) that, when applied to a given set of auditory signals, yields statistically independent output channels?

This problem is typically solved in signal analysis by applying independent component analysis (ICA), an algorithm for blind signal separation and signal deconvolution. ICA assumes that the registered signals result from a mixture of independent sources. The task of the algorithm is to find the sources and the mixing matrix. More formally, given a linear combination of sources s (x = As) and assuming that the sources are statistically independent, ICA reconstructs the sources as ŝ = Wx, where the unmixing matrix W approximates the inverse of the mixing matrix A. ICA has been successfully used in the context of the efficient coding hypothesis, for both natural images [51] and auditory signals [10].

We therefore used ICA for blind signal deconvolution of the speech time series, following the specifications of the ICA algorithm in [10]. A sample of 10,000 speech time-series segments per language was introduced as vector columns in x, each vector consisting of 128 samples. The number of samples was chosen to be 128 for three reasons. First, given the sampling rate of the sound files, 128 samples roughly correspond to time series of about 8 msec in duration. This window size is appropriate to capture (sub)phonemic information, i.e. the type of information relevant for language rhythm and the discrimination of consonants, vowels and their different sub-classes. Second, this window size was used in previous work [10,20], allowing for easier comparison across studies. Third, this window size also ensured easy and rapid computation in our ICA algorithm. The filters were the rows of the unmixing matrix W. The specific ICA algorithm used was RUNICA, which uses the logistic infomax ICA algorithm [51] with the natural gradient feature [52].
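A minimal sketch of this filter derivation, under stated assumptions, is shown below. It is illustrative only: scikit-learn's FastICA is substituted for the infomax/natural-gradient RUNICA algorithm used in the study, and random noise stands in for the real speech segments.

```python
# Sketch of the filter derivation: fit ICA to the segment matrix and
# read the candidate filters off the rows of the unmixing matrix.
import numpy as np
from sklearn.decomposition import FastICA

# Stand-in for the (10,000 x 128) matrix of real speech segments.
segments = np.random.default_rng(0).normal(size=(10_000, 128))

# FastICA used here as a substitute for the infomax RUNICA algorithm.
ica = FastICA(n_components=128, whiten="unit-variance", max_iter=1000)
ica.fit(segments)           # x = As; ICA estimates the unmixing matrix W
filters = ica.components_   # each of the 128 rows is one derived filter
```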

Results

The ICA algorithm generated a set of filters (unmixing matrix), each defined by 128 independently varying points, with no a priori constraints on filter shape. Examples of the filters obtained and their spectra are shown in Fig 2. The filters obtained have a gammatone shape, with an amplitude envelope relatively well localized in time. These filter shapes are similar to those obtained in other studies [10,48].

Fig 2. Filter populations.

Representative examples of filters (insets) and their spectra obtained for English.

https://doi.org/10.1371/journal.pone.0148861.g002

To compare the filter populations obtained for the different languages to one another and to the existing rhythmic measures, we quantified the time-frequency properties of the filter populations, as in [10,20]. For each filter in a population, we obtained its spectrum using a fast Fourier transform (Fig 2). We then calculated its center frequency, bandwidth and sharpness (Q10, center frequency fc divided by the bandwidth Δf) at a drop of 10 dB on either side of the spectral peak. (Q10 cannot be defined if a filter lacks enough depth in its central peak. In our dataset, Q10 could be calculated for most filters in all languages. Alternatively, to compensate for the lack of depth, we could have calculated sharpness at drops smaller than 10 dB, but in pilot calculations this resulted in bandwidth values too small for proper spectral analysis.) Sharpness was then plotted against center frequency for all filters in the population for a given language (Fig 3A), and a linear regression fit was obtained. As the slopes (Fig 3A) indicate, filter bandwidth and sharpness increase with center frequency in all languages (for filter bandwidth we obtained Δf = A·fc^(1−k), with A a constant and k the slope of the regression lines in Fig 3A). In a linear regression analysis, center frequency fc significantly predicted sharpness Q10 for all languages [Dutch: β = 0.889, t(122) = 21.465, p < 0.001, R2 = 0.789; English: β = 0.875, t(118) = 19.588, p < 0.001, R2 = 0.763; Japanese: β = 0.790, t(112) = 13.656, p < 0.001, R2 = 0.621; Marathi: β = 0.563, t(104) = 6.951, p < 0.001, R2 = 0.311; Polish: β = 0.813, t(119) = 15.235, p < 0.001, R2 = 0.658; Spanish: β = 0.892, t(79) = 17.501, p < 0.001, R2 = 0.792; Turkish: β = 0.815, t(109) = 14.659, p < 0.001, R2 = 0.660; degrees of freedom vary across languages, as the number of filters whose spectra were not deep enough to have a well-defined Q10 differed across languages]. This linear relationship between fc and Q10 implies that, as expected, all languages show a 1/f average power spectrum.

Importantly, however, the slopes of the linear regression were different across languages. To test for this difference, we ran an analysis of covariance on Q10 as the dependent variable with Language (Dutch / English / Japanese / Polish / Marathi / Turkish / Spanish) as the independent variable and fc as a covariate. As the previous correlations suggest, fc had an effect on Q10 (F(1,769) = 837.452, p < 0.001, ηp2 = 0.649). Even more importantly, there was a significant effect of Language on Q10 even after controlling for the effect of fc (F(6,769) = 29.470, p < 0.001, ηp2 = 0.213). Post hoc pairwise comparisons revealed that Spanish, with the steepest slope, and Marathi, with the lowest slope, differed from all other languages (p = 0.001 for all pairwise post hocs for both languages). In addition, English differed from Dutch (p = 0.004), Japanese (p = 0.029) and Turkish (p = 0.004). The other pairwise comparisons were not significant.
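For illustration, a minimal sketch of this quantification is given below. It is our reconstruction, not the authors' code; the crude above-threshold bandwidth estimate and the regression over raw (rather than possibly log-transformed) values are simplifying assumptions.

```python
# Sketch of the Q10 quantification: FFT each 128-point filter, locate
# the spectral peak (center frequency fc), estimate the -10 dB
# bandwidth, and regress sharpness Q10 = fc / bandwidth on fc.
import numpy as np
from scipy.stats import linregress

RATE = 16_000  # Hz, sampling rate of the sound files

def q10(filter_taps, rate=RATE):
    """Return (fc, Q10) for one filter, or None if Q10 is undefined."""
    spectrum = np.abs(np.fft.rfft(filter_taps))
    freqs = np.fft.rfftfreq(len(filter_taps), d=1.0 / rate)
    peak = int(spectrum.argmax())
    fc = freqs[peak]
    threshold = spectrum[peak] / np.sqrt(10)      # amplitude 10 dB below peak
    above = np.nonzero(spectrum >= threshold)[0]  # crude: ignores side lobes
    lo, hi = above.min(), above.max()
    if lo == 0 or hi == len(freqs) - 1:
        return None  # central peak not deep enough: -10 dB points not reached
    return fc, fc / (freqs[hi] - freqs[lo])

def sharpness_slope(filters):
    """Slope of the Q10-vs-fc regression over one filter population."""
    pairs = [p for p in (q10(f) for f in filters) if p is not None]
    fc, q = map(np.array, zip(*pairs))
    return linregress(fc, q).slope  # the value compared across languages
```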

Fig 3. Results of Analysis 1.

A. The sharpness of the derived filter populations as a function of center frequency for the seven languages. B. Comparison with %V.

https://doi.org/10.1371/journal.pone.0148861.g003

To test whether these differences were related to rhythmic differences between the languages, the slopes were compared to previous rhythm analyses [21,27,28]. Interestingly, the slopes show a strong negative correlation (r = -0.84, p = 0.011) with %V (Fig 3B).

Discussion

We have derived efficient filter populations for seven different languages. We have obtained filters that are localized in time and in frequency for all of the languages, falling in between wavelet-like and Fourier-like filters. Furthermore, all filter populations show a scaling relationship between the center frequencies of filters and their bandwidth and sharpness, related to the fact that the average power spectrum for each language is approximately 1/f [19]. Importantly, however, we have found that the regression line between center frequency and sharpness has different slopes for different languages, indicating that efficient filters are able to capture cross-linguistic variation in sound patterns.

Furthermore, we have found that this cross-linguistic difference is negatively correlated to a measure of linguistic rhythm, %V, i.e. the proportion of vocalic intervals in the speech signal. The shorter the vocalic intervals in a language, the steeper the slope of the regression line. Since %V and %C add up to 100% [21], the slope of the regression line positively correlates with %C, i.e. the longer the consonantal intervals in a language, the steeper the regression line. In [10], steeper regression lines were associated with animal vocalizations. The negative, rather than positive correlation with %V is therefore unexpected, if vowels resemble animal vocalizations, i.e. if harmonicity plays an important role. Rather, this negative correlation with %V and the concomitant positive correlation with %C imply that acoustic properties of consonants are more relevant, along the lines of [20]. In other words, the properties of the efficient filter populations correlate with speech rhythm, but this correlation may be mediated by acoustic properties other than the simple relative proportion of vowels and consonants, as could have been initially expected on the basis of previous work [10]. Rather, the efficient filter code properties and language rhythm both seem to be correlated with a third factor related to the acoustic properties of consonants.

What are these acoustic properties? If acoustic transience is at play, as suggested by [20], then efficient codes should vary as a function of the types of consonant classes a language has and the relative proportion of these different classes in the speech signal. Work in language typology [39,40,46] shows that these properties are related to the complexity of the different syllable structures a language allows, which in turn is related to the %V and %C values for a given language. Japanese, for instance, has almost exclusively CV syllables (e.g. Kurisumasu, the Japanese adaptation of the English word Christmas); thus the relative proportion of vowels and consonants is balanced, resulting in relatively high values of %V (above 50%). By contrast, Dutch and English allow complex consonant clusters in syllable onsets and codas, i.e. CCCVCCC (e.g. springs).

The current results thus imply that transience might underlie the observed correlation with speech rhythm and might play an important role in determining the statistical structure, and hence the properties of the efficient filters, of different languages. To test this hypothesis, we need to determine whether the properties of the efficient codes depend on the consonant classes found in the languages tested. The “sasasa” and “saltanaj” transformations [38] provide an ideal testing ground, as the “saltanaj” transformation preserves consonant class identity, i.e. the degree of transience of a consonant, while the “sasasa” transformation suppresses it. Furthermore, these transformations have been applied to the same Dutch, Japanese, English and Polish sound files that were used in analyses of speech rhythm in previous studies [34,38] and of efficient coding above, in Analysis 1. Therefore, they constitute an optimal ground for comparison with the previous results.

Analysis 2

In the current analysis, we derived efficient population filters for “sasasa” and “saltanaj” versions of the Dutch, Japanese, English and Polish material from Analysis 1. Since the “saltanaj” resynthesis preserves consonant class identity, and thus transience, we expected this transformation to provide efficient filters with sharpness regression slopes similar to those of the original languages, whereas we predicted the “sasasa” transformation, suppressing the differences between consonant classes, to yield efficient filters with different slopes. Even more specifically, if the claim [20] about the importance of transience alone is correct, the slopes for the “sasasa” versions are expected to be steeper than those of the original or “saltanaj” versions, as “s” is a fricative consonant, and in [20], fricatives have been found to have efficient codes with steep regression slopes. Depending on the nature and frequency of consonants from different consonant classes in a language, the transformations might impact languages to different extents, but the direction of the change (a steeper slope than in the original) should be the same across languages.

Stimuli

The speech samples consisted of “sasasa” and “saltanaj” versions of the Dutch, English, Japanese and Polish samples used in Analysis 1 and taken originally from [21,34,38]. For Dutch and Japanese, both the “sasasa” and the “saltanaj” resyntheses were available; for English and Polish, only the critical “sasasa” version could be obtained. The “saltanaj” transformation consists of replacing each vowel by an /a/, each fricative by /s/, each liquid by /l/, each occlusive by /t/, each nasal by /n/, and each glide by /j/. The “sasasa” transformation replaces each vowel by an /a/ and each consonant by an /s/, independently of consonant class. The details of the resynthesis are described in [21,34,38]. The sound files had a sampling rate of 16 kHz.
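The mappings themselves can be summarized symbolically, as in the sketch below. This is a purely illustrative, assumed phoneme-level coding; the actual stimuli were produced by diphone resynthesis, as described in [21,34,38].

```python
# Symbolic illustration of the two transformations: each phoneme is
# replaced according to its manner class ("saltanaj") or collapsed to
# the vowel/consonant distinction alone ("sasasa").
SALTANAJ = {"vowel": "a", "fricative": "s", "liquid": "l",
            "occlusive": "t", "nasal": "n", "glide": "j"}

def saltanaj(segments):
    """Map (phoneme, manner class) pairs to their class representative."""
    return "".join(SALTANAJ[cls] for _, cls in segments)

def sasasa(segments):
    """Keep only the vowel/consonant distinction."""
    return "".join("a" if cls == "vowel" else "s" for _, cls in segments)

# e.g. the word "springs", /s p r i ŋ z/, coded by manner class:
word = [("s", "fricative"), ("p", "occlusive"), ("r", "liquid"),
        ("i", "vowel"), ("ŋ", "nasal"), ("z", "fricative")]
print(saltanaj(word))  # -> "stlans"
print(sasasa(word))    # -> "sssass"
```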

Algorithm

The ICA algorithm was identical to the one used in Analysis 1.

Results

The ICA algorithm generated a set of efficient filters for each resynthesized version. Overall, the filters have similar shapes to those obtained in Analysis 1, localized in time and frequency. The regression between center frequency and Q10 sharpness was calculated in the same way as in Analysis 1 (Fig 4A). Center frequency fc significantly predicted sharpness Q10 for all transformed languages [English “sasasa”: β = 0.900, t(86) = 19.180, p < 0.001, R2 = 0.808, Dutch “sasasa”: β = 0.871, t(88) = 16.669, p < 0.001, R2 = 0.757, Dutch “saltanaj”: β = 0.751, t(105) = 11.637, p < 0.001, R2 = 0.559, Polish “sasasa”: β = 0.877, t(86) = 16.959, p < 0.001, R2 = 0.767, Japanese “sasasa”: β = 0.845, t(84) = 14.463, p < 0.001, R2 = 0.710, Japanese “saltanaj”: β = 0.659, t(108) = 9.107, p < 0.001, R2 = 0.429]. We thus observe again a scaling relationship between center frequency and bandwidth for all the resynthesized versions, due to the 1/f average power spectrum of the stimuli.

Fig 4. The Results of Analysis 2.

A. The sharpness of the derived filter populations as a function of center frequency for Dutch, Japanese, English and Polish natural, saltanaj and sasasa stimuli. B. The values of the slopes as a function of language and transformation.

https://doi.org/10.1371/journal.pone.0148861.g004

There are, however, marked differences in the regression slopes of the different resynthesized versions (Fig 4). As we lack the “saltanaj” versions for English and Polish, we ran two ANCOVAs: one comparing the original and the “sasasa” versions in all four languages, and one comparing the original, the “sasasa” and the “saltanaj” versions for Japanese and Dutch. Importantly, for the “sasasa” transformation, the crucial manipulation for which we expect differences between the slopes of the original and the resynthesized versions if consonant sub-class identity is relevant, we have data from all four languages tested. While we lack two languages for the “saltanaj” version, the two we were able to test, Japanese and Dutch, differ greatly in their speech rhythm, syllable structure and phoneme inventory. Thus, they still constitute a strong test case.

The ANCOVA with factors Language (Japanese / Dutch / English / Polish) and Version (original / “sasasa”) and covariate fc over Q10 as the dependent variable yielded a significant effect for the covariate fc (F(1,822) = 2041.512, p < 0.001, ηp2 = 0.713), due to the already observed scaling relationship between fc and Q10. It also yielded a significant main effect of Version (F(1,822) = 20.214, p < 0.001, ηp2 = 0.024), as the regression slopes were higher for the “sasasa” versions than for the original ones, as well as a significant Version x Language interaction (F(3,822) = 7.036, p < 0.001, ηp2 = 0.025), since the slope differences between the original and the “sasasa” versions were not the same in all the languages (the difference was greater for Dutch and Japanese than for English and Polish). The main effect of Language was not significant (F(3,822) = 0.980, n.s.).

The ANCOVA with factors Language (Japanese / Dutch) and Version (original / “sasasa” / “saltanaj”) and covariate fc over Q10 as the dependent variable yielded a significant effect for the covariate fc (F(1,624) = 1109.107, p < 0.001, ηp2 = 0.640) due to the already observed scaling relationship between fc and Q10. The main effect of Version was also significant (F(2,624) = 37.343, p < 0.001, ηp2 = 0.107), because the regression slopes differed across the three versions (for all pairwise post hocs p < 0.05). The main effect of Language was marginally significant (F(1,624) = 3.414, p = 0.065, ηp2 = 0.005), as the slopes for Japanese and Dutch tended to differ overall. The Version x Language interaction (F(2,624) = 9.467, p < 0.001, ηp2 = 0.029) was also significant, since the slope differences between the three versions were not the same in Japanese as in Dutch (the slopes for the “sasasa” transforms were steeper in both languages than for the other two versions, while the “saltanaj” version was very close to the original in Dutch, but somewhat steeper in Japanese).
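For readers who wish to reproduce this type of analysis, a minimal sketch using statsmodels is shown below. The DataFrame contents are synthetic stand-ins, not the study's data, and statsmodels is our illustrative choice, not necessarily the software used for the analyses above.

```python
# Sketch of an ANCOVA of the type reported above: Q10 as the dependent
# variable, fc as a covariate, Language and Version as factors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Synthetic stand-in data: one row per filter.
rng = np.random.default_rng(0)
n = 800
df = pd.DataFrame({
    "fc": rng.uniform(100, 8000, n),
    "Language": rng.choice(["Dutch", "Japanese", "English", "Polish"], n),
    "Version": rng.choice(["original", "sasasa"], n),
})
df["Q10"] = 0.001 * df["fc"] + rng.normal(scale=0.5, size=n)

model = ols("Q10 ~ fc + C(Language) * C(Version)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F tests: covariate, main effects, interaction
```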

As predicted, the “saltanaj” versions have slopes closer to those of the original speech stimuli both for Dutch and Japanese. By contrast, the slopes of the “sasasa” versions are much higher in all four languages than those of the original recordings or those of the “saltanaj” versions. Furthermore, while the “sasasa” resynthesis impacts different languages to a different extent, its effect goes in the same direction for every language, and it is sufficient to suppress the overall cross-linguistic differences that were previously observed for the original versions.

Discussion

The results of Analysis 2 show that the “saltanaj” and “sasasa” manipulations modify the properties of the efficient filters in the predicted way: the Q10 regression slope changes only slightly and non-systematically when consonant class identity (defined by manner of articulation) is maintained (the “saltanaj” version), but becomes steeper when consonant class differences are suppressed and all consonants are replaced by a fricative, i.e. a steep-slope consonant (the “sasasa” version). This effect has been observed for all four languages. These results thus confirm that consonant class identity is crucial for defining the statistical properties of the speech signal and of its efficient filter population. Consequently, they converge with the proposal regarding the importance of acoustic transience in determining the efficient codes for sound stimuli [20]. Furthermore, while the “sasasa” transformation increased the slopes of the regression lines in all of the languages as predicted, its impact was modulated by the specific properties of the languages. This is not unexpected, as languages differ in the number of vowel and consonant sub-classes they have, in the number of phonemes belonging to each class, as well as in the actual acoustic realization of phonemes (e.g. voice onset time for the same consonant tends to be longer in English than, say, in French [53]). It will be interesting in the future to explore exactly how and why the transformations used here affect the signal of each language the way they do. However, this cross-linguistic modulation of the resynthesis is relatively small compared to the overall change in slope observable in all languages, and it does not relate directly to the questions we are asking here. We, therefore, leave the investigation of the specific acoustic details for future research.

Admittedly, the scope of Analysis 2 is limited, mainly due to the limited amount of data available to us from the previously published rhythm studies. Only a subset of the languages from Analysis 1 could be included. Also, the sound files were sampled at 16 kHz, which limits the analyzable frequencies to below 8 kHz; for certain fricatives as produced by female speakers, this may not cover the full range of relevant frequencies. These limitations notwithstanding, the results of Analysis 2 are consistent with our general hypotheses.

General Discussion

We have performed two analyses, deriving information theoretically optimal filters for seven rhythmically different languages and their consonant-class preserving and suppressing resyntheses. We have found that all languages exhibit a scaling relationship between the center frequency and the bandwidth of filters within a population, in line with previous results [10] and the general finding that speech, music and some other auditory stimuli have a 1/f average power spectrum [19]. In addition to this general similarity, however, we have observed a difference in the sharpness regression slopes of the efficient filter populations for the different languages. This difference correlated negatively with the proportion of vocalic space and positively with the proportion of consonantal space in the speech signals of these languages. Furthermore, the slope values were maintained through a transformation that suppressed consonant identity, but maintained consonant class membership. By contrast, the slope values changed when even consonant class membership was neutralized. The direction of this change corresponded to our predictions: with all consonants replaced by an /s/, a fricative consonant with a steep slope, the slopes increased for all languages after the class-suppressing transformation.

These results suggest that consonants, in particular the distribution of consonants belonging to different consonant classes, play the most important role in determining the statistical properties of the efficient filter populations for different speech stimuli. This, in turn, entails that acoustic transience, the acoustic property that most clearly discriminates consonant classes, underlies important differences in the statistical structures of speech signals from rhythmically different languages [20].

Several points regarding our results deserve further discussion. First, the two phonological properties we considered in our analyses, speech rhythm and consonant inventories, are not independent [22,46,47]. Languages with higher %V and thus lower %C values tend to have simpler syllables, allowing no or only simple consonant clusters in the onset and coda positions of syllables, whereas languages with lower %V and thus higher %C values have complex syllables, with heavy consonant clusters. This link between rhythm and consonant inventory explains why we found a negative correlation, rather than no correlation at all, between properties of the efficient codes and speech rhythm. Importantly, however, the negative direction of this correlation was contrary to what would have been expected if the vowel/consonant distinction, and hence harmonicity, played a role. In that case, we would have expected a positive correlation between %V and Q10 regression slopes. Interestingly, and in confirmation of our conclusion that consonant inventories correlate with rhythmic properties and underlie the statistical structure of speech, a previous study proposing an alternative metric for speech rhythm arrived at a similar conclusion, starting out from a different approach [30]. In this study, the authors developed an automated way of measuring speech rhythm using a rough estimate of sonority defined directly from the spectrogram of the speech signal. This algorithm successfully reproduced the rhythmic classes, as defined by %V, ΔV and ΔC. Further, by analyzing the distribution of sonorant vs. obstruent segments in the signal across different languages, this study also suggested that rhythmic differences were mostly carried by the less sonorant parts of the signal.

Second, all filter populations showed a scaling relationship between the center frequencies of filters and their bandwidth and sharpness. This is attributable to the fact that the average power spectrum for each language is approximately 1/f [19]. From a more general computational perspective, 1/f power spectra are an indication of a complex temporal random process [54]. Self-similarity in the spectrum implies an auto-correlation function that decays slowly in time: a long-memory random process whose present behavior is strongly affected by the entire history of the system. This is because, according to the Wiener–Khinchin theorem, the autocorrelation function of a random process is given by the Fourier transform of its power spectrum (spectral power density). Using this theorem, it is possible to derive the autocorrelation function of the process (in the time domain) from its power spectrum (in the frequency domain). For pink (1/f) noise this results in an autocorrelation function that decays very slowly (logarithmically) with time, i.e. the system has a very long memory: the present state depends strongly on past states. Speech, in which subsequent linguistic units often predict one another at multiple levels with relatively high probabilities, is exactly this type of process.
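To make this reasoning concrete, the following is a minimal sketch of the calculation, assuming band-limited pink noise with lower and upper cutoff frequencies f1 and f2 (a pure 1/f spectrum is not integrable); the constants are schematic.

```latex
% Wiener--Khinchin theorem: the autocorrelation R(\tau) is the Fourier
% transform of the power spectral density S(f).
R(\tau) = \int_{-\infty}^{\infty} S(f)\, e^{i 2\pi f \tau}\, df
% For band-limited pink noise, S(f) \propto 1/f on f_1 \le |f| \le f_2,
% this gives, in the range 1/f_2 \ll \tau \ll 1/f_1, approximately
R(\tau) \approx C - \ln(2\pi f_1 \tau),
% with C a constant: a logarithmically slow decay in \tau, i.e. a
% process with very long memory.
```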

Third, the current study investigated the acoustic and statistical properties of the speech signal from a computational point of view. It, therefore, leaves open the question of what the psychophysical, psycholinguistic and neural correlates of these information theoretical mechanisms might be. It has been proposed [20] that the different center frequency–sharpness slopes observed for different phoneme sub-classes might have some neural plausibility, as these mathematically calculated filter properties show similarities to response profiles of different cochlear nuclei in the mammalian auditory system (e.g. multipolar cells in the posteroventral cochlear nucleus have high temporal resolution and a shallow sharpness slope, similarly to the filters derived for stop consonants in [20]). As for speech perception, one study suggests that noise-vocoded speech sounds are better perceived and discriminated if the filters used for the synthesis follow the principles of efficient coding as compared to simple linear filters. While these questions will need to be addressed in future empirical work, here we speculate that the perceptual attunement to the native language which takes place during the first year(s) of life and which is known to involve an attunement to the rhythmic properties [22,35–37] might be paralleled, at the neural level, by an adjustment of the weights of different cochlear filters to best fit the statistical properties of native speech. We are currently testing this hypothesis in our laboratory in brain imaging studies with newborns and young infants.

In this study, we used a time window of approx. 8 msec as input samples to the ICA algorithm. Our analysis thus captures auto-correlations between (sub)phonemic units (consonant classes). Languages also differ in their sound patterns at the level of larger linguistic units, e.g. in their word- or phrase-level prosodic patterns, in their utterance-level intonation etc. Further research using similar analyses, but with longer samples, will need to investigate whether these suprasegmental differences across languages can also be captured by the principles of efficient coding.

Conclusion

Our results reveal a significant new link between theoretically derived efficient neural codes and acoustic properties of the speech signal known to be crucial for speech perception and language acquisition. Being a native listener thus involves, among other important abilities, the fine-tuning of the auditory code to the statistics of the native language.

Acknowledgments

We wish to thank Franck Ramus, Jacques Mehler and Marina Nespor for providing us with the speech samples.

Author Contributions

Conceived and designed the experiments: RGE JG. Performed the experiments: RGE. Analyzed the data: RGE JG. Contributed reagents/materials/analysis tools: RGE JG. Wrote the paper: RGE JG.

References

  1. Smith EC, Lewicki MS. Efficient auditory coding. Nature. 2006;439(7079):978–82. pmid:16495999
  2. Simoncelli EP, Olshausen BA. Natural image statistics and neural representation. Annu Rev Neurosci. 2001;24(1):1193–216.
  3. Attneave F. Some informational aspects of visual perception. Psychol Rev. 1954;61(3):183. pmid:13167245
  4. Barlow HB. Possible principles underlying the transformation of sensory messages. Sens Commun. 1961;217–34.
  5. Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27:379–423 and 623–56.
  6. Laughlin SB, Sejnowski TJ. Communication in neuronal networks. Science. 2003;301(5641):1870–4. pmid:14512617
  7. Olshausen BA, Field DJ. Sparse coding of sensory inputs. Curr Opin Neurobiol. 2004;14(4):481–7. pmid:15321069
  8. Field DJ. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A. 1987;4(12):2379–94.
  9. Rieke F, Bodnar DA, Bialek W. Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proc R Soc Lond B Biol Sci. 1995;262(1365):259–65.
  10. Lewicki MS. Efficient coding of natural sounds. Nat Neurosci. 2002;5(4):356–63. pmid:11896400
  11. Ming VL, Holt LL. Efficient coding in human auditory perception. J Acoust Soc Am. 2009;126(3):1312–20. pmid:19739745
  12. McDermott JH, Simoncelli EP. Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron. 2011;71(5):926–40. pmid:21903084
  13. Geffen MN, Gervain J, Werker JF, Magnasco MO. Auditory perception of self-similarity in water sounds. Front Integr Neurosci. 2011;5:15. pmid:21617734
  14. Gervain J, Werker JF, Geffen MN. Category-specific processing of scale-invariant sounds in infancy. PLoS ONE. 2014;9(5):e96278. pmid:24809801
  15. Stilp CE, Kluender KR. Cochlea-scaled entropy, not consonants, vowels, or time, best predicts speech intelligibility. Proc Natl Acad Sci. 2010;107(27):12387–92. pmid:20566842
  16. De Boer E, De Jongh HR. On cochlear encoding: Potentialities and limitations of the reverse-correlation technique. J Acoust Soc Am. 1978;63(1):115–35. pmid:632404
  17. Evans EF. Cochlear nerve and cochlear nucleus. In: Auditory system. Springer; 1975. p. 1–108.
  18. Ruggero MA. Physiology and coding of sound in the auditory nerve. In: The mammalian auditory pathway: Neurophysiology. Springer; 1992. p. 34–93.
  19. Voss RF, Clarke J. 1/f noise in music and speech. Nature. 1975;258:317–8.
  20. Stilp CE, Lewicki MS. Statistical structure of speech sound classes is congruent with cochlear nucleus response properties. In: Proceedings of Meetings on Acoustics. Acoustical Society of America; 2014. p. 050001.
  21. Ramus F, Nespor M, Mehler J. Correlates of linguistic rhythm in the speech signal. Cognition. 1999;73(3):265–92. pmid:10585517
  22. Mehler J, Sebastian-Galles N, Nespor M. Biological foundations of language: language acquisition, cues for parameter setting and the bilingual infant. In: The New Cognitive Neuroscience. Cambridge, Mass.: MIT Press; 2004. p. 825–36.
  23. Molnar M, Carreiras M, Gervain J. Nonlinguistic rhythmic grouping depends on language dominance in bilinguals. Under review.
  24. Molnar M, Gervain J, Carreiras M. Within-rhythm class native language discrimination abilities of Basque-Spanish monolingual and bilingual infants at 3.5 months of age. Infancy. 2013.
  25. Dellwo V. Rhythm and speech rate: A variation coefficient for ΔC. Lang Lang-Process. 2006;231–41.
  26. Grabe E, Low EL. Durational variability in speech and the rhythm class hypothesis. In: Papers in laboratory phonology. Berlin: Mouton de Gruyter; 2002. p. 515–46.
  27. Loukina A, Kochanski G, Rosner B, Keane E, Shih C. Rhythm measures and dimensions of durational variation in speech. J Acoust Soc Am. 2011;129(5):3258–70. pmid:21568427
  28. Ordin M, Polyanskaya L. Perception of speech rhythm in second language: the case of rhythmically similar L1 and L2. Front Psychol. 2015;6:316.
  29. Wiget L, White L, Schuppler B, Grenon I, Rauch O, Mattys SL. How stable are acoustic metrics of contrastive speech rhythm? J Acoust Soc Am. 2010;127(3):1559–69. pmid:20329856
  30. Galves A, Garcia J, Duarte D, Galves C. Sonority as a basis for rhythmic class discrimination. In: Speech Prosody 2002, International Conference. 2002.
  31. Moon C, Cooper RP, Fifer WP. Two-day-olds prefer their native language. Infant Behav Dev. 1993;16(4):495–500.
  32. Mehler J, Jusczyk PW, Lambertz G, Halsted N, Bertoncini J, Amiel-Tison C. A precursor of language acquisition in young infants. Cognition. 1988;29:143–78. pmid:3168420
  33. Nazzi T, Bertoncini J, Mehler J. Language discrimination by newborns: Toward an understanding of the role of rhythm. J Exp Psychol Hum Percept Perform. 1998;24(3):756–66. pmid:9627414
  34. Ramus F, Hauser MD, Miller C, Morris D, Mehler J. Language discrimination by human newborns and by cotton-top tamarin monkeys. Science. 2000;288(5464):349–51. pmid:10764650
  35. Cutler A, Mehler J, Norris D, Segui J. A language-specific comprehension strategy. Nature. 1983;304(5922):159–60. pmid:6866110
  36. Pallier C, Christophe A, Mehler J. Language-specific listening. Trends Cogn Sci. 1997;1(4):129–32. pmid:21223885
  37. Pallier C, Sebastian-Galles N, Dupoux E, Christophe A, Mehler J. Perceptual adjustment to time-compressed speech: A cross-linguistic study. Mem Cognit. 1998;26(4):844–51. pmid:9701975
  38. Ramus F, Mehler J. Language identification with suprasegmental cues: a study based on speech resynthesis. J Acoust Soc Am. 1999;105(1):512–21. pmid:9921674
  39. Maddieson I. Consonant inventories. In: Dryer MS, Haspelmath M, editors. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology; 2013. Available from: http://wals.info/chapter/1
  40. Lindblom B, Maddieson I. Phonetic universals in consonant systems. In: Language, speech, and mind. 1988. p. 62–78.
  41. Mines MA, Hanson BF, Shoup JE. Frequency of occurrence of phonemes in conversational English. Lang Speech. 1978;21(3):221–41. pmid:732398
  42. Baayen RH, Piepenbrock R, van Rijn H. The CELEX lexical database on CD-ROM. 1993.
  43. Smits R, Warner N, McQueen JM, Cutler A. Unfolding of phonetic information over time: A database of Dutch diphone perception. J Acoust Soc Am. 2003;113(1):563–74. pmid:12558292
  44. Eddington D. Spanish phonology and morphology: Experimental and quantitative perspectives. John Benjamins Publishing; 2004.
  45. Wali K. Marathi. John Benjamins Publishing; 2009.
  46. Maddieson I. Syllable structure. In: Dryer MS, Haspelmath M, editors. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology; 2013. Available from: http://wals.info/chapter/12
  47. Fenk-Oczlon G, Fenk A. Crosslinguistic correlations between size of syllables, number of cases, and adposition order. In: Sprache und Natürlichkeit, Gedenkband für Willi Mayerthaler. Tübingen: Narr; 2005.
  48. Lee J-H, Lee T-W, Jung H-Y, Lee S-Y. On the efficient speech feature extraction based on independent component analysis. Neural Process Lett. 2002;15(3):235–45.
  49. Bosch L, Sebastian-Galles N. Native-language recognition abilities in 4-month-old infants from monolingual and bilingual environments. Cognition. 1997;65(1):33–69. pmid:9455170
  50. Byers-Heinlein K, Burns TC, Werker JF. The roots of bilingualism in newborns. Psychol Sci. 2010.
  51. Bell AJ, Sejnowski TJ. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 1995;7(6):1129–59. pmid:7584893
  52. Amari S, Cichocki A, Yang HH, et al. A new learning algorithm for blind signal separation. Adv Neural Inf Process Syst. 1996;757–63.
  53. Catford JC. A practical introduction to phonetics. Oxford: Clarendon Press; 1988.
  54. Keshner MS. 1/f noise. Proc IEEE. 1982;70(3):212–8.