
To integrate or not to integrate: Temporal dynamics of hierarchical Bayesian causal inference

  • Máté Aller ,

    Roles Data curation, Formal analysis, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

    M.Aller@bham.ac.uk

    Affiliation Computational Neuroscience and Cognitive Robotics Centre, University of Birmingham, Birmingham, United Kingdom

  • Uta Noppeney

    Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Computational Neuroscience and Cognitive Robotics Centre, University of Birmingham, Birmingham, United Kingdom


Abstract

To form a percept of the environment, the brain needs to solve the binding problem—inferring whether signals come from a common cause and are integrated or come from independent causes and are segregated. Behaviourally, humans solve this problem near-optimally as predicted by Bayesian causal inference; but the neural mechanisms remain unclear. Combining Bayesian modelling, electroencephalography (EEG), and multivariate decoding in an audiovisual spatial localisation task, we show that the brain accomplishes Bayesian causal inference by dynamically encoding multiple spatial estimates. Initially, auditory and visual signal locations are estimated independently; next, an estimate is formed that combines information from vision and audition. Yet, it is only from 200 ms onwards that the brain integrates audiovisual signals weighted by their bottom-up sensory reliabilities and top-down task relevance into spatial priority maps that guide behavioural responses. As predicted by Bayesian causal inference, these spatial priority maps take into account the brain’s uncertainty about the world’s causal structure and flexibly arbitrate between sensory integration and segregation. The dynamic evolution of perceptual estimates thus reflects the hierarchical nature of Bayesian causal inference, a statistical computation, which is crucial for effective interactions with the environment.

Author summary

The ability to tell whether various sensory signals come from the same or different sources is essential for forming a coherent percept of the environment. For example, when crossing a busy road at dusk, seeing and hearing an approaching car helps us estimate its location better, but only if its visual image is associated—correctly—with its sound and not with the sound of a different car far away. This is the so-called binding problem, and numerous studies have demonstrated that humans solve this near-optimally as predicted by Bayesian causal inference; however, the underlying neural mechanisms remain unclear. We combined Bayesian modelling, electroencephalography (EEG), and multivariate decoding in an audiovisual spatial localisation task to show that the brain dynamically encodes multiple spatial estimates while accomplishing Bayesian causal inference. First, auditory and visual signal locations are estimated independently; next, information from vision and audition is combined. Finally, from 200 ms onwards, the brain weights audiovisual signals by their sensory reliabilities and task relevance to guide behavioural responses as predicted by Bayesian causal inference.

Introduction

In our natural environment, our senses are exposed to a barrage of sensory signals: the sight of a rapidly approaching truck, its looming motor noise, the smell of traffic fumes. How the brain effortlessly merges these signals into a seamless percept of the environment remains unclear. The brain faces two fundamental computational challenges: First, we need to solve the ‘binding’ or ‘causal inference’ problem—deciding whether signals come from a common cause and thus should be integrated or instead be treated independently [1,2]. Second, when there is a common cause, the brain should integrate signals taking into account their uncertainties [3,4].

Hierarchical Bayesian causal inference provides a rational strategy to arbitrate between sensory integration and segregation in perception [2]. Bayesian causal inference explicitly models the potential causal structures that could have generated the sensory signals—i.e., whether signals come from common or independent sources. In line with Helmholtz’s notion of ‘unconscious inference’, the brain is then thought to invert this generative model during perception [5]. In case of a common signal source, signals are integrated weighted in proportion to their relative sensory reliabilities (i.e., forced fusion [3,4,6–10]). In case of independent sources, they are processed independently (i.e., full segregation [11,12]). Yet in any particular instance, the brain does not know the world’s causal structure that gave rise to the sensory signals. To account for this causal uncertainty, a final estimate (e.g., of an object’s location) is obtained by averaging the estimates under the two causal structures (i.e., common versus independent source models) weighted by each causal structure’s posterior probability—a strategy referred to as model averaging (for other decisional strategies, see [13]).
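
For readers who prefer a concrete formulation, the sketch below illustrates the model-averaging computation in Python. It follows the standard Bayesian causal inference equations for two Gaussian sensory likelihoods and a Gaussian spatial prior; the parameter values (p_common, sigma_a, sigma_v, sigma_p) are illustrative placeholders, not the values fitted in this study.

```python
import numpy as np

def bci_estimate(x_a, x_v, p_common=0.5, sigma_a=6.0, sigma_v=2.0,
                 sigma_p=10.0, mu_p=0.0):
    """Model-averaged auditory/visual location estimates under Bayesian
    causal inference (all parameter values here are illustrative)."""
    va, vv, vp = sigma_a ** 2, sigma_v ** 2, sigma_p ** 2

    # Likelihood of the audiovisual signal pair under a common cause (C = 1)
    var_c1 = va * vv + va * vp + vv * vp
    like_c1 = np.exp(-0.5 * ((x_a - x_v) ** 2 * vp + (x_a - mu_p) ** 2 * vv +
                             (x_v - mu_p) ** 2 * va) / var_c1) / \
              (2 * np.pi * np.sqrt(var_c1))

    # Likelihood under independent causes (C = 2)
    var_a, var_v = va + vp, vv + vp
    like_c2 = np.exp(-0.5 * ((x_a - mu_p) ** 2 / var_a +
                             (x_v - mu_p) ** 2 / var_v)) / \
              (2 * np.pi * np.sqrt(var_a * var_v))

    # Posterior probability of a common cause
    post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1 - p_common))

    # Reliability-weighted (forced-fusion) and full-segregation estimates
    s_fused = (x_a / va + x_v / vv + mu_p / vp) / (1 / va + 1 / vv + 1 / vp)
    s_a_seg = (x_a / va + mu_p / vp) / (1 / va + 1 / vp)
    s_v_seg = (x_v / vv + mu_p / vp) / (1 / vv + 1 / vp)

    # Model averaging: combine the two causal structures weighted by their posterior
    s_a = post_c1 * s_fused + (1 - post_c1) * s_a_seg
    s_v = post_c1 * s_fused + (1 - post_c1) * s_v_seg
    return s_a, s_v, post_c1

# Example: a large (13.3 deg) audiovisual disparity lowers the posterior
# probability of a common cause, attenuating the visual bias on the auditory estimate.
print(bci_estimate(x_a=3.3, x_v=-10.0))
```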

A large body of psychophysics research has demonstrated that human observers combine sensory signals near-optimally as predicted by Bayesian causal inference [2,13–16]. Most prominently, when locating events in the environment, observers gracefully transition between sensory integration and segregation as a function of audiovisual spatial disparity [12]. For small spatial disparities, they integrate signals weighted by their reliabilities, leading to cross-modal spatial biases [17]; for larger spatial disparities, audiovisual interactions are attenuated. Recent functional MRI (fMRI) studies showed how Bayesian causal inference is accomplished within the cortical hierarchy [14,16]: While early auditory and visual areas represented the signals on the basis that they were generated by independent sources (i.e., full segregation), the posterior parietal cortex integrated sensory signals into one unified percept (i.e., forced fusion). Only at the top of the cortical hierarchy, in anterior parietal cortex, was the uncertainty about the world’s causal structure taken into account, with signals integrated into a spatial estimate consistent with Bayesian causal inference.

The organisation of Bayesian causal inference across the cortical hierarchy raises the critical question of how these neural computations unfold dynamically over time within a trial. How does the brain merge spatial information that is initially coded in different reference frames and representational formats? Whereas the brain is likely to recurrently update all spatial estimates by passing messages forwards and backwards across the cortical hierarchy [18–20], the unisensory estimates may to some extent precede the computation of the Bayesian causal inference estimate.

To characterise the neural dynamics of Bayesian causal inference, we presented human observers with auditory, visual, and audiovisual signals that varied in their spatial disparity in an auditory and visual spatial localisation task while recording their neural activity with electroencephalography (EEG). First, we employed cross-sensory decoding and temporal generalisation matrices [21] of the unisensory auditory and visual signal trials to characterise the emergence and the temporal stability of spatial representations across the senses. Second, combining psychophysics, EEG, and Bayesian modelling, we temporally resolved the evolution of unisensory segregation, forced fusion, and Bayesian causal inference in multisensory perception.

Results

To determine the computational principles that govern multisensory perception, we presented 13 participants with synchronous audiovisual spatial signals (i.e., white noise burst and Gaussian cloud of dots) that varied in their audiovisual spatial disparity and visual reliability (Fig 1A and 1B). On each trial, participants reported their perceived location of either the auditory or the visual signal. In addition, we included unisensory auditory and visual signal trials under auditory or visual report, respectively.

Fig 1. Experimental design, example trial, and behavioural and predicted AV weights (wAV).

(A) Experimental design. In a 4 × 4 × 2 × 2 factorial design, the experiment manipulated (1) the location of the visual (‘V’) signal (−10°, −3.3°, 3.3°, and 10°), (2) the location of the auditory (‘A’) signal (−10°, −3.3°, 3.3°, and 10°), (3) the reliability of the visual signal (high VR+ versus low VR−, as defined by the spread of the visual cloud), and (4) task relevance (auditory versus visual report). In addition, we included unisensory auditory and visual VR+ and VR− trials. The greyscale codes the spatial disparity between the auditory and visual locations for each AV condition (i.e., darker greyscale = larger spatial disparity). (B) Time course of an example trial. (C) Behavioural AV weight index wAV computed from behavioural responses (left) and from the predictions of the Bayesian causal inference model (right; across-participants circular mean ± 68% CI and individual wAV represented by filled/empty circles, n = 13). The AV weight index wAV is shown as a function of (1) visual reliability: high [VR+] versus low [VR−]; (2) task relevance: auditory versus visual report; and (3) AV spatial disparity: small (≤6.6°; D−) versus large (>6.6°; D+). The data used to make this figure are available in file S1 Data. AV, audiovisual; D+, high disparity; D−, low disparity; VR+, high visual reliability; VR−, low visual reliability.

https://doi.org/10.1371/journal.pbio.3000210.g001

Combining psychophysics, EEG, and computational modelling, we addressed two questions: First, we investigated when and how human observers form spatial representations from unisensory visual or auditory inputs, which generalise across the two sensory modalities. Second, we studied the computational principles and neural dynamics that mediate the integration of audiovisual signals into spatial representations that take into account the observer’s uncertainty about the world’s causal structure consistent with Bayesian causal inference.

Shared and distinct neural representations of space across vision and audition—Unisensory auditory and visual conditions

Behavioural results.

Participants were able to locate unisensory auditory and visual signals reliably as indicated by a significant Pearson correlation between participants’ location responses and the true signal source location for the unisensory auditory (across subjects mean ± SEM: 0.88 ± 0.05), visual high reliability (VR+; across subjects mean ± SEM: 0.998 ± 0.19), and visual low reliability (VR−; across subjects mean ± SEM: 0.91 ± 0.05) conditions. As expected, observers were significantly less accurate when locating the sound than when locating the visual stimuli for both levels of visual reliability (VR+ versus A: t[12] = 8.83, p < 0.0001; VR− versus A: t[12] = 1.47, p = 0.005; see S1 Fig for response distributions across all conditions).

EEG results.

Multivariate decoding of EEG activity patterns revealed how the brain dynamically encodes the location of unisensory auditory or visual signals. The decoding accuracy was expressed as the Pearson correlation coefficient between the true and the decoded stimulus locations and entered into so-called temporal generalisation matrices that illustrate the stability of EEG activity patterns encoding spatial location across time [21]. If a support vector regression (SVR) model trained on EEG activity patterns at time t can correctly decode the stimulus location not only at time t but also at other time points, then the stimulus location is encoded in EEG activity patterns that are relatively stable across time (for further details about this temporal generalisation approach, see [21]). If an SVR model cannot successfully generalise to EEG activity patterns at other time points, spatial locations are encoded in transient EEG activity patterns that differ across time.
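
The logic of this temporal generalisation analysis can be sketched as follows. The original analysis used LIBSVM in MATLAB; the code below is a schematic Python/scikit-learn version on simulated data, with the array shapes, the split-half cross-validation (instead of the 12-fold scheme used in the paper), and the strength of the location signal chosen purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR

# Simulated pseudo-trial ERPs: n_erps x n_channels x n_times, plus the true
# stimulus locations in degrees; all shapes and the weak embedded location
# signal are illustrative only.
rng = np.random.default_rng(0)
n_erps, n_channels, n_times = 144, 64, 40
locations = rng.choice([-10.0, -3.3, 3.3, 10.0], size=n_erps)
erps = rng.standard_normal((n_erps, n_channels, n_times)) + \
       0.1 * locations[:, None, None]

# Simple split-half for brevity
half = n_erps // 2
train, test = slice(0, half), slice(half, None)

# Temporal generalisation: train an SVR at each training time point and test
# it at every time point; store the Pearson correlation between true and
# decoded locations in a training-time x testing-time matrix.
tg_matrix = np.zeros((n_times, n_times))
for t_train in range(n_times):
    model = SVR(kernel='linear', C=1.0).fit(erps[train, :, t_train], locations[train])
    for t_test in range(n_times):
        decoded = model.predict(erps[test, :, t_test])
        tg_matrix[t_train, t_test] = pearsonr(locations[test], decoded)[0]
```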

For visual stimuli, spatial locations were successfully (i.e., significantly better than chance) decoded from EEG activity patterns from 60 ms onwards (Fig 2, upper right quadrant and S2A Fig). Moreover, the temporal generalisation matrices suggest that the visual spatial representations were initially transient (i.e., 60–150 ms, significant decoding accuracy only near the diagonal), reflecting the early visual evoked EEG responses (e.g., N1, see S2 Fig). Later (i.e., from about 250 ms), the location of the visual stimulus was encoded in a more sustained activity pattern (see S2 Fig), leading to successful cross-temporal generalisation from 300 ms to 700 ms post stimulus (i.e., significantly better than chance decoding accuracy is present far off the diagonal).

Fig 2. Temporal generalisation matrices within and across auditory and visual senses.

Each temporal generalisation matrix shows the decoding accuracy for each training (y-axis) and testing (x-axis) time point. We factorially manipulated the training data (auditory versus visual stimulation) and testing data (auditory versus visual stimulation). Decoding accuracy is quantified by the Pearson correlation between the true and the decoded locations of the auditory (or visual) stimulus. The grey line along the diagonal indicates where the training time is equal to the testing time (i.e., the time-resolved decoding accuracies). Horizontal and vertical grey lines indicate the stimulus onset. The thin grey lines encircle clusters with decoding accuracies that were significantly better than chance at p < 0.05 corrected for multiple comparisons. The thick grey lines encircle the clusters with decoding accuracies that were significantly better than chance jointly for both (1) auditory-to-visual and (2) visual-to-auditory cross-temporal generalisation at p < 0.05 corrected for multiple comparisons. The data used to make this figure are available in file S1 Data.

https://doi.org/10.1371/journal.pbio.3000210.g002

By contrast, auditory spatial representations could be decoded significantly better than chance from about 95 ms onwards (see Fig 2, lower left quadrant), which corresponds to the auditory N1 component (S3 Fig). Particularly from 200 ms onwards, the SVR decoder trained on EEG activity patterns can decode auditory spatial location significantly better than chance also from EEG activity patterns across other time points, even as late as 700 ms post stimulus (significant cluster encircled by thin grey lines in Fig 2). This temporal generalisation profile indicates that auditory spatial locations were encoded in EEG activity patterns that were relatively stable across time from 200 ms onwards. Visual inspection of the EEG topographies shows that auditory spatial location is encoded at these later processing stages in sustained activity patterns that correspond to the long latency auditory P2 component (see S3 Fig) [22–24].

In addition to temporal generalisation within each sensory modality, we also investigated the extent to which the SVR decoding model generalised across sensory modalities throughout poststimulus time. Whereas earlier neural representations were more specific to each particular sensory modality, the SVR model was able to generalise significantly better than chance from audition to vision and vice versa from 160 to 360 ms (Fig 2, upper left and lower right quadrant, areas encircled by thick grey line indicate significant generalisation across sensory modalities). This cross-sensory generalisation across visual- and auditory-evoked EEG activity patterns suggests that at those stages (i.e., 160 ms to 360 ms), the brain forms spatial representations that are relatively stable and rely on neural generators that may be partly shared across sensory modalities. By contrast, the spatial representations encoded in very early (<160 ms) EEG activity patterns did not enable successful cross-sensory generalisation, suggesting that they are modality-specific. These statistically significant cross-sensory generalisation results are also illustrated by the EEG topographies evoked by unisensory auditory and visual signals (see S2B and S3B Figs). From 200 to 400 ms post stimulus, auditory and visual stimuli elicit centro-posterior dominant topographies that depend on the stimulus location to some extent similarly in vision and audition. Although these results may point towards partly overlapping neural generators and representations potentially in parietal cortices that encode location both in audition and vision, it is important to emphasise that different configurations of neural generators can in principle elicit similar EEG scalp topographies.

Computational principles of audiovisual integration: GLM-based wAV and Bayesian modelling analysis—Audiovisual conditions

Combining psychophysics, multivariate EEG pattern decoding, and computational modelling, we next investigated the computational principles and neural dynamics underlying audiovisual integration of spatial representations using a general linear model (GLM)-based wAV and a Bayesian modelling analysis. As shown in Fig 3, both analyses were applied to the spatial estimates that were either reported by participants (i.e., behaviour, Fig 3B left) or decoded from EEG activity patterns independently for each poststimulus time point (i.e., neural, Fig 3B right, for further details, see the Methods section and the Fig 3 legend).

Fig 3. GLM-based wAV and Bayesian modelling analysis overview.

(A) The GLM-based wAV and Bayesian modelling analysis were performed on auditory (‘A’) and visual (‘V’) spatial estimates that were indicated by participants as behavioural localisation responses (left, ‘Behaviour’) or decoded from participants’ EEG activity patterns (right, ‘Neural’). The neural spatial estimates were obtained by training an SVR model on ERP activity patterns at each time point of the AV congruent trials to learn the mapping from EEG pattern to external spatial locations (black diagonal line). This learnt mapping was then used to decode the spatial location from the ERP activity patterns of the spatially congruent and incongruent AV conditions (coloured arrows). (B) Distributions of spatial localisation responses (left, Behaviour: SResp) and decoded spatial estimates (right, Neural: SDec) were computed for each of the 64 conditions of the 4 (visual stimulus location) × 4 (auditory stimulus location) × 2 (visual reliability) × 2 (task relevance) factorial design. (C) Left: In the GLM-based wAV analysis, the perceived (or decoded at each time point) spatial estimates were predicted by the true visual and auditory spatial locations (SV1..8, SA1..8) for each of the eight conditions in the 2 (visual reliability: high versus low) × 2 (task relevance: auditory versus visual report) × 2 (spatial disparity: ≤6.6° versus >6.6°) factorial design. As a summary index, we defined the relative audiovisual weight (wAV) as the four-quadrant inverse tangent of the visual (βV1..8) and auditory (βA1..8) parameter estimates for each of the eight conditions in each regression model. Right: In the Bayesian modelling analysis, we fitted the following models to observers’ behavioural and neural spatial estimates: SegA (green, for EEG only), SegV (red, for EEG only), SegV,A (light blue), ‘forced fusion’ (‘Fusion’, yellow), and BCI model (with model averaging, dark blue). We performed Bayesian model selection at the group level and computed the protected exceedance probability that one model is better than any of the other candidate models above and beyond chance [25]. (D) Left: Based on previous studies [14,16], we hypothesised that the wAV profile with an interaction between task relevance (i.e., visual versus auditory report) and spatial disparity that is characteristic for BCI would emerge relatively late. Right: Likewise, we expected the different models to dominate the EEG activity patterns to some extent sequentially: first the unisensory segregation model (SegV, SegA), followed by the forced-fusion model (‘Fusion’), and finally the BCI estimate. The fading of colours indicates that we did not have specific hypotheses for those times. AV, audiovisual; BCI, Bayesian causal inference; D+, high disparity; D−, low disparity; EEG, electroencephalography; ERP, event-related potential; GLM, general linear model; SDec, Spatial estimate decoded; SegA, unisensory auditory segregation; SegV, unisensory visual segregation; SegV,A, audiovisual full-segregation; SResp, spatial estimate responded; stim, stimulus; SVR, support vector regression; VR+, high visual reliability; VR−, low visual reliability.

https://doi.org/10.1371/journal.pbio.3000210.g003

The GLM-based wAV analysis quantifies the influence of the true auditory and true visual location on (1) the reported or (2) EEG decoded auditory and visual spatial estimates in terms of an audiovisual weight index wAV.
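
As a minimal illustration of this index, the sketch below regresses the decoded (or reported) spatial estimates of one condition on the true auditory and visual locations and converts the two parameter estimates into wAV via the four-quadrant inverse tangent. The toy data and helper name are hypothetical, and the published analysis additionally treats wAV as a circular quantity when averaging across participants.

```python
import numpy as np

def av_weight_index(decoded, true_a, true_v):
    """Relative audiovisual weight wAV (in degrees) for one condition:
    regress the decoded (or reported) locations on the true auditory and
    visual locations and take the four-quadrant inverse tangent of the
    visual and auditory parameter estimates."""
    X = np.column_stack([true_a, true_v, np.ones_like(true_a)])  # A, V, intercept
    beta_a, beta_v, _ = np.linalg.lstsq(X, decoded, rcond=None)[0]
    return np.degrees(np.arctan2(beta_v, beta_a))  # 90 deg = purely visual, 0 deg = purely auditory

# Hypothetical toy data: estimates dominated by the visual location yield wAV near 90 deg
true_a = np.array([-10.0, -3.3, 3.3, 10.0, -10.0, 3.3])
true_v = np.array([10.0, 3.3, -3.3, -10.0, 10.0, 10.0])
decoded = 0.2 * true_a + 0.8 * true_v
print(av_weight_index(decoded, true_a, true_v))  # ~76 deg
```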

The Bayesian modelling analysis formally assessed the extent to which (1) the full-segregation model(s) (Fig 3C, encircled in light blue, red or green), (2) the forced-fusion model (Fig 3C, yellow), and (3) the Bayesian causal inference model (i.e., using model averaging as decision function, encircled in dark blue; see supporting material S1 Table for other decision functions) can account for the spatial estimates reported by observers (i.e., behaviour) or decoded from EEG activity patterns (i.e., neural).

Behavioural results.

In a GLM-based wAV analysis, the behavioural audiovisual weight index wAV shows that observers integrated audiovisual signals weighted by their sensory reliabilities and task relevance (see Fig 1C and S1 Fig for histograms of reported signal locations across all conditions).

The audiovisual weight index wAV was close to 90° (i.e., pure visual influence) when the visual signal needed to be reported (Fig 1C, dark lines) but shifted towards 0° when the auditory signal was task relevant (Fig 1C, grey lines). In other words, we observed a significant main effect of task relevance on behavioural wAV (p = 0.0002). Observers flexibly adjusted the weights they assigned to auditory and visual signals in the integration process as a function of task relevance, giving more emphasis to the sensory modality that needed to be reported. The main effect of task relevance on wAV is inconsistent with classical forced-fusion models, in which audiovisual signals are integrated into one single unified percept irrespective of the task relevance of the sensory modalities. Under forced fusion, even for spatially disparate audiovisual signals, the observer would perceive the auditory and visual signals at the same location. Instead, the main effect indicates that observers maintain separate auditory and visual spatial estimates for an audiovisual spatially disparate stimulus.

Consistent with Bayesian causal inference, the difference in wAV between auditory and visual report significantly increased for large (>6.6°) relative to small (≤6.6°) spatial disparities (i.e., significant interaction between task relevance and spatial disparity: p = 0.0002). In other words, audiovisual integration and cross-modal spatial biasing broke down when auditory and visual signals were far apart and likely to be caused by independent sources. This attenuation of audiovisual interactions for large relative to small spatial disparities (i.e., interaction between task relevance and disparity) is the characteristic profile of Bayesian causal inference (see model predictions for wAV in Fig 1C right).

Moreover, we observed significant two-way interactions between visual reliability and spatial disparity (p = 0.0014) and between visual reliability and task relevance (p = 0.0002). The effect of high versus low visual reliability was stronger when the two signals were close in space and the auditory (i.e., less reliable) signal needed to be reported. For auditory report conditions, the influence of the visual signal on the audiovisual spatial representation is stronger for high visual reliability and small disparity trials (Fig 1C, difference between dashed and solid grey line for the small spatial disparity condition). Again, this interaction is expected for Bayesian causal inference, because the spatial estimate furnished by the forced-fusion model receives a stronger weight in Bayesian causal inference for low-spatial-disparity trials, when it is likely that the two signals come from a common source.

Consistent with the profile of the audiovisual weight index wAV, formal Bayesian model comparison showed that the Bayesian causal inference model outperformed the full-segregation and forced-fusion models (85.6% ± 0.3% variance explained, protected exceedance probability > 0.99; Table 1). Fig 1C (right) shows the profile of the audiovisual weight index wAV that is predicted by the Bayesian causal inference model fitted to the observer’s behavioural localisation responses. It illustrates that Bayesian causal inference inherently accounts for effects of task relevance (or reported modality) and the interaction between task relevance and spatial disparity by combining the forced-fusion estimate with the task-relevant full-segregation estimate weighted by the posterior probability of common and independent sources. Conversely, the interaction between reliability and spatial disparity arises because the forced-fusion model component, which integrates signals weighted by their reliabilities, is more dominant for small spatial disparities.

Table 1. Model parameters (across-subjects’ mean ± SEM) of the computational models fit to observers’ behavioural localisation reports.

https://doi.org/10.1371/journal.pbio.3000210.t001

In summary, our audiovisual weight index wAV and Bayesian modelling analysis of observers’ perceived/reported locations provided convergent evidence that human observers integrate audiovisual spatial signals weighted by their relative reliabilities at small spatial disparities. Yet, they mostly segregate audiovisual signals at large spatial disparities, when it is unlikely that signals come from a common source.

EEG results—Temporal dynamics of audiovisual integration.

To characterise the neural dynamics underlying integration of audiovisual signals into spatial representations, we applied the GLM-based wAV and the Bayesian modelling analysis to the ‘spatial estimates’ that were decoded from EEG activity patterns at each time point (see Fig 3B right). Because both the GLM-based wAV and the Bayesian modelling analysis require reliable spatial estimates, we report and interpret results limited to the time window from 55 ms to 700 ms post stimulus (Fig 4, S4 Fig), during which the location of congruent audiovisual stimuli could be decoded better than chance from EEG activity patterns (p < 0.001).

Fig 4. EEG results for GLM-based wAV and Bayesian modelling analysis.

The neural audiovisual weight index wAV (across-participants’ circular mean ± 68% CI; n = 13). Neural wAV as a function of time is shown for (A) visual reliability: VR+ versus VR−; (B) task relevance: auditory (‘A’) versus visual (‘V’) report; (C) audiovisual spatial disparity: small (≤6.6°; D−) versus large (>6.6°; D+); (D) the interaction between task relevance and disparity. Shaded grey areas indicate the time windows during which the main effect of (A) visual reliability, (B) task relevance, (C) audiovisual spatial disparity, or (D) the interaction between task relevance and disparity on wAV was statistically significant at p < 0.05 corrected for multiple comparisons across time. (E) Time course of the circular–circular correlation (across-participants’ mean after Fisher z-transformation ± 68% CI; n = 13) between the neural and the behavioural audiovisual weight index wAV. Shaded grey areas indicate significant correlation at p < 0.05 corrected for multiple comparisons across time. (F) Time course of the protected exceedance probabilities [25] of the five models of the Bayesian modelling analysis: SegA (green), SegV (red), SegV,A (light blue), ‘forced fusion’ (‘Fusion’, yellow), and BCI model (with model averaging, dark blue). The early time window until 55 ms (delimited by black vertical line on all plots) is shaded in white, because the decoding accuracy was not greater than chance for audiovisual congruent trials; hence, the neural weight index wAV and Bayesian model fits are not interpretable in this window. The data used to make this figure are available in file S1 Data. BCI, Bayesian causal inference; D+, high disparity; D−, low disparity; EEG, electroencephalography; GLM, general linear model; SegA, unisensory auditory segregation; SegV, unisensory visual segregation; SegV,A, audiovisual full-segregation; VR+, high visual reliability; VR−, low visual reliability.

https://doi.org/10.1371/journal.pbio.3000210.g004

The GLM-based analysis of audiovisual weight index wAV investigated the effects of visual reliability, task relevance, and spatial disparity on the audiovisual neural weight index wAV that quantifies the influence of auditory and visual signals on the spatial representations decoded from EEG activity patterns. Our results show that sensory reliability significantly influenced the neural wAV from 65 to 510 ms. As expected, the spatial representations were more strongly influenced by the true visual signal location when the visual signal was more reliable (i.e., significant main effect of visual reliability, Fig 4A, Table 2). Moreover, consistent with our behavioural findings, we also observed a significant main effect of task relevance between 190 and 700 ms (Fig 4B, Table 2). As expected, the decoded location was more strongly influenced by the visual signal when the visual modality was task relevant. We also observed a significant interaction between task relevance and spatial disparity from 310 to 440 ms and 510 to 590 ms. As discussed in the context of the behavioural results, this interaction is the profile that is characteristic for Bayesian causal inference: the brain integrates sensory signals at low spatial disparity (i.e., small difference for auditory versus visual report) but computes different spatial estimates for auditory and visual signals at large spatial disparities (see Fig 4D, Table 2).

Table 2. Statistical significance of main, interaction, and simple main effects for the behavioural and neural audiovisual weight indices (wAV) (‘model-free’ approach).

https://doi.org/10.1371/journal.pbio.3000210.t002

In addition to these key findings, we also observed a brief but pronounced significant main effect of spatial disparity on wAV at about 55–130 ms. Whereas a sound attracted the decoded spatial location at small spatial disparity (i.e., wAV is shifted below 90°, Fig 4C solid line), the decoded location was shifted away from the sound location (i.e., a repulsive effect) at large spatial disparity (i.e., wAV values above 90°, Fig 4C, dashed line). Moreover, in this early time window, which coincides with the visual-evoked N100 response, the decoded spatial estimate was overall dominated by the visual stimulus location (i.e., wAV was close to 90° for both small and large disparity). The effect of disparity may indicate that early multisensory processing is already influenced by a spatial window of integration (Fig 4C, Table 2). Auditory stimuli affected the decoded spatial representations mainly when they were close in space to the visual signal. However, because spatial disparity was inherently correlated with the eccentricity of the audiovisual signals by virtue of our factorial and spatially balanced design, these two effects cannot be fully dissociated. While signals were presented parafoveally or peripherally for small-disparity trials, they were always presented in the periphery for large-disparity trials.

For completeness, we also observed a significant interaction between spatial disparity and visual reliability between 55 and 135 ms and between 170 and 235 ms (Table 2). This interaction results from a larger spatial window of integration for stimuli with low versus high visual reliability. In other words, it is easier to determine whether two signals come from different sources when the visual input is reliable, leading to a smaller window of integration.

Finally, we asked whether the neural audiovisual weights were related to the audiovisual weights that observers applied at the behavioural level. Hence, we computed the correlation between the values of the behavioural and neural weight indices wAV separately for each time point. The Fisher z-transformed correlation coefficient fluctuated around chance level until about 100 ms. From 100 ms onwards, it progressively increased over time until it peaked and reached a plateau at about 350 ms (R = 0.72). As expected, this coincides with the time window during which we observed a significant interaction between task relevance and spatial disparity—i.e., the profile characteristic for Bayesian causal inference. After 500 ms, it slowly decreased towards the end of the trial. A cluster-based permutation test confirmed that the correlation between neural and behavioural weight indices wAV was significantly better than chance, revealing two significant clusters between 175 and 550 ms (p = 0.0012) and 575 and 665 ms (p = 0.013). These results indicate that the neural representations expressed in EEG activity patterns are critical for guiding observers’ responses.
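
Because wAV is an angle, the neural–behavioural relation is naturally quantified with a circular–circular correlation at each time point. The snippet below sketches one common formulation (Jammalamadaka–SenGupta) in plain numpy; the condition-wise wAV values in the example are invented for illustration and do not reproduce the reported correlations.

```python
import numpy as np

def circ_corrcc(alpha, beta):
    """Circular-circular correlation coefficient (Jammalamadaka & SenGupta)
    between two samples of angles given in radians."""
    alpha_bar = np.angle(np.mean(np.exp(1j * alpha)))  # circular means
    beta_bar = np.angle(np.mean(np.exp(1j * beta)))
    num = np.sum(np.sin(alpha - alpha_bar) * np.sin(beta - beta_bar))
    den = np.sqrt(np.sum(np.sin(alpha - alpha_bar) ** 2) *
                  np.sum(np.sin(beta - beta_bar) ** 2))
    return num / den

# Hypothetical condition-wise wAV values (in degrees) for one participant at
# one time point: eight neural and eight behavioural values, one per condition.
neural_wav = np.radians([85.0, 60.0, 80.0, 40.0, 88.0, 70.0, 83.0, 55.0])
behav_wav = np.radians([90.0, 55.0, 85.0, 35.0, 89.0, 65.0, 86.0, 50.0])
print(circ_corrcc(neural_wav, behav_wav))
```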

In the EEG Bayesian modelling analysis, we fitted five models to the spatial estimates decoded from EEG activity patterns separately for each time point: (1) ‘full-segregation audiovisual’, (2) ‘forced-fusion’, (3) the ‘Bayesian causal inference’, (4) the ‘segregation auditory’, and (5) the ‘segregation visual’ models (Fig 3C). The segregation visual and segregation auditory models incorporate the hypothesis that neural generators may represent only the visual (or only the auditory) location irrespective of whether the visual (or auditory) location needs to be reported. In other words, they model a purely unisensory visual (or auditory) source. By contrast, the full-segregation audiovisual model embodies the hypothesis that a neural source represents the task-relevant location—i.e., the auditory location for auditory report and the visual location for visual report.

At the random-effects group level, Bayesian model comparison revealed a sequential pattern of protected exceedance probabilities across time (Fig 4F): initially, the ‘segregation visual’ model dominated until about 100 ms post stimulus. This converges with our wAV analysis showing that spatial representations decoded from early EEG activity patterns are dominated by the location of the visual signal (i.e., wAV is close to 90°). From 100 to about 200 ms, the forced-fusion model outperformed the other models, indicating that spatial estimates are now influenced jointly by the locations of auditory and visual signals irrespective of their spatial disparity or task relevance. Again, this mirrors our wAV results in which we observed a significant effect of reliability on wAV early (i.e., as expected for forced fusion), whereas the effect of task relevance arose later and became prominent from 250 ms onwards.

Hence, both wAV and Bayesian modelling analyses suggest that in this early time window, audiovisual signals are predominantly integrated weighted by their reliability into a unified spatial representation irrespective of task relevance, as predicted by forced-fusion models. From about 200 ms onwards, the protected exceedance probability of the Bayesian causal inference model progressively increased, peaking with an exceedance probability of >0.85 at about 350 ms followed by a plateau until 500 ms. Thus, consistent with the wAV results, audiovisual interactions consistent with Bayesian causal inference emerge relatively late at about 350 ms post stimulus.

Discussion

Integrating information from vision and audition into a coherent representation of the space around us is critical for effective interactions with the environment. This EEG study temporally resolved the neural dynamics that enable the brain to flexibly integrate auditory and visual signals into spatial representations in line with the predictions of Bayesian causal inference.

Auditory and visual senses code spatial location in different reference frames and representational formats [26]. Vision provides spatial information in eye-centred and audition in head-centred reference frames [27,28]. Furthermore, spatial location is directly coded in the retinotopic organisation in primary visual cortex [29], whereas spatial location in audition is computed from sound latency and amplitude differences between the ears, starting in the brainstem [27]. In auditory cortices of primates, spatial location is thought to be represented by neuronal populations with broad tuning functions [30,31]. In order to merge spatial information from vision and audition, the brain thus needs to establish coordinate mappings and/or transform spatial information into partially shared ‘hybrid’ reference frames, as previously suggested by neurophysiological recordings in nonhuman primates [30,32]. In the first step, we therefore investigated the neural dynamics of spatial representations encoded in EEG activity patterns separately for unisensory auditory and visual signals using the method of temporal generalisation matrices [21]. In vision, spatial location was encoded initially at 60 ms in transient neural activity associated with the early P1 and N1 components and then turned into temporally more stable representations from 200 ms and particularly from 350 ms (Fig 2, upper right quadrant, S2 Fig). In audition, spatial location was encoded by relatively stable EEG activity from 95 ms and particularly from 250 ms, which is associated with the auditory long latency P2 component [22–24] (S3 Fig).

Activity patterns encoding spatial location generalised not only across time but also across sensory modalities between 160 and 360 ms. As indicated in Fig 2, SVR models trained on visual-evoked responses generalised to auditory-evoked responses and vice versa (upper left and lower right quadrant, significant cross-sensory generalisation encircled by thick grey line). These results suggest that unisensory auditory and visual spatial locations are initially represented by transient and modality-specific activity patterns. Later, at about 200 ms, they are transformed into temporally more stable representations that may rely on neural sources in frontoparietal cortices that are at least to some extent shared between auditory and visual modalities [22,33,34].

Next, we asked when and how the human brain combines spatial information from vision and audition into a coherent representation of space. The brain should integrate sensory signals only when they come from a common event but should segregate signals from independent events [1,2,12]. To investigate how the brain arbitrates between sensory integration and segregation, we presented observers with synchronous audiovisual signals that varied in their spatial disparity across trials. On each trial, observers reported either the auditory or the visual location. Our results show that a concurrent yet spatially disparate visual signal biased observers’ perceived sound location towards the visual location—a phenomenon coined spatial ventriloquist illusion [17,35]. Consistent with reliability-weighted integration, this audiovisual spatial bias was significantly stronger when the visual signal was more reliable (Fig 1C left, grey solid versus dashed lines). Furthermore, observers reported different locations for auditory and visual signals, and this difference was even greater for large- relative to small-spatial-disparity trials. This significant interaction between spatial disparity and task relevance indicates that human observers arbitrate between sensory integration and segregation depending on the probabilities of different causal structures of the world that can be inferred from audiovisual spatial disparity.

Using EEG, we then investigated how the brain forms neural spatial representations dynamically post stimulus. Our analysis of the neural audiovisual weight index wAV shows that the spatial estimates decoded from EEG activity patterns are initially dominated by visual inputs (i.e., wAV close to 90°). This visual dominance is most likely explained by the retinotopic representation of visual space, which facilitates EEG decoding of space (for further discussion, see the Methods section). From about 65 ms onwards, visual reliability significantly influenced wAV (Fig 4A): as expected, the location of the visual signal exerted a stronger influence on the spatial estimate decoded from EEG activity patterns when it was more reliable. By contrast, the signal’s task relevance influenced the audiovisual weight index only later, from about 190 ms (Fig 4B). Thus, visual reliability as a bottom-up stimulus-bound factor impacted the sensory weighting in audiovisual integration prior to top-down effects of task relevance. We observed a significant interaction between task relevance and spatial disparity as the characteristic profile for Bayesian causal inference from about 310 ms: the difference in wAV between auditory and visual report was significantly greater for large- than for small-disparity trials (Fig 4D, Table 2). Thus, spatial disparity determined the influence of task-irrelevant signals on the spatial representations encoded in EEG activity from about 310 ms onwards. A task-irrelevant signal influenced the spatial representations mainly when auditory and visual signals were close in space and hence likely to come from a common event, but it had minimal influence when they were far apart in space. Collectively, our statistical analysis of the audiovisual weight index revealed a sequential emergence of visual dominance, reliability weighting (from about 100 ms), effects of task relevance (from about 200 ms), and finally the interaction between task relevance and spatial disparity (from about 310 ms, Fig 4A–4D).

This multistage process was also mirrored in the time course of exceedance probabilities furnished by our formal Bayesian model comparison: The unisensory visual segregation (SegV) model was the winning model for the first 100 ms, thereby modelling the early visual dominance. The audiovisual forced-fusion model embodying reliability-weighted integration dominated the time interval of 100–250 ms. Finally, the Bayesian causal inference model that enables the arbitration between sensory integration and segregation depending on spatial disparity outperformed all other models from 350 ms onwards. Hence, both our Bayesian modelling analysis and our wAV analysis showed that the hierarchical structure of Bayesian causal inference is reflected in the neural dynamics of spatial representations decoded from EEG. The Bayesian causal inference model also outperformed the audiovisual full-segregation (SegV,A) model that enables the representation of the location of the task-relevant stimulus unaffected by the location of the task-irrelevant stimulus. Instead, our Bayesian modelling analysis confirmed that from 350 ms onwards, the brain integrates audiovisual signals weighted by their bottom-up reliability and top-down task relevance into spatial priority maps [36,37] that take into account the probabilities of the different causal structures consistent with Bayesian causal inference. The spatial priority maps were behaviourally relevant for guiding spatial orienting and actions, as indicated by the correlation between the neural and behavioural audiovisual weight indices, which progressively increased from 100 ms and culminated at about 300–400 ms. Two recent studies have also demonstrated such a temporal evolution of Bayesian causal inference in an audiovisual temporal numerosity judgement task [38] and an audiovisual rate categorisation task [39].

The timing and the parietal-dominant topographies of the AV potentials (see S2 and S3 Figs) that form the basis for our spatial decoding (and hence for wAV and Bayesian modelling analyses) closely match the P3b component (i.e., a subcomponent of the classical P300). Although it is thought that the P3b relies on neural generators located mainly in parietal cortices [40,41], its specific functional role remains controversial [42]. Given its sensitivity to stimulus probability [43–45] and discriminability [46] as well as task context [42,47,48], it was proposed to reflect neural processes involved in transforming sensory evidence into decisions and actions [49]. Most recent research has suggested that the P3b may sustain processes of evidence accumulation [50] that are influenced by observers’ prior [51], incoming evidence (i.e., likelihood [52]), and observers’ belief updating [53]. Likewise, our supplementary time-frequency analyses revealed that alpha/beta power, which has previously been associated with the generation of the P3b component [54], depended on bottom-up visual reliability between 200 and 400 ms and top-down task relevance between 350 and 550 ms post stimulus (see S5 Fig, S2 Table and S1 Text), thereby mimicking the temporal evolution of bottom-up and top-down influences observed in our main wAV and Bayesian modelling analysis.

Yet, our main analysis took a different approach. Rather than focusing on the effects of visual reliability, task relevance/attention, and spatial disparity directly on event-related potentials (ERPs) or time-frequency power, the wAV analysis investigated how these manipulations affect the spatial representations encoded in EEG activity patterns, and the Bayesian modelling analysis accommodated those effects directly in the computations of Bayesian causal inference. Along similar lines, two recent fMRI studies characterised the computations involved in integrating audiovisual spatial inputs across the cortical hierarchy [14,16]: whereas low-level auditory and visual areas predominantly encoded the unisensory auditory or visual locations (i.e., full-segregation model) [55–64], higher-order visual areas and posterior parietal cortices combined audiovisual signals weighted by their sensory reliabilities (i.e., forced-fusion model) [65–68]. Only at the top of the hierarchy, in anterior parietal cortices, did the brain integrate sensory signals consistent with Bayesian causal inference. Thus, the temporal evolution of Bayesian causal inference observed in our current EEG study mirrored its organisation across the cortical hierarchy observed in fMRI.

Fusing the results from EEG and fMRI studies (see caveats in the Methods section) thus suggests that Bayesian causal inference in multisensory perception relies on dynamic encoding of multiple spatial estimates across the cortical hierarchy. During early processing, multisensory perception is dominated by full-segregation models associated with activity in low-level sensory areas. Later audiovisual interactions that are governed by forced-fusion principles rely on posterior parietal areas. Finally, Bayesian causal inference estimates are formed in anterior parietal areas. Yet, although our results suggest that full segregation, forced fusion, and Bayesian causal inference dominate EEG activity patterns at different latencies, they do not imply a strictly feed-forward architecture. Instead, we propose that the brain concurrently accumulates evidence about the different spatial estimates and the underlying causal structure (i.e., common versus independent sources) most likely via multiple feedback loops across the cortical hierarchy [18,19]. Only after 350 ms is a final perceptual estimate formed in anterior parietal cortices that takes into account the uncertainty about the world’s causal structure and combines audiovisual signals into spatial priority maps as predicted by Bayesian causal inference.

Methods

Participants

Sixteen right-handed participants took part in the experiment; three of them did not complete the entire experiment: two were excluded based on eye tracking results from the first day (the inclusion criterion was less than 10% of trials rejected because of eye blinks or saccades; see the Eye movement recording and analysis section for details), and one withdrew from the experiment. The remaining 13 participants (7 females, mean age = 22.1 years; SD = 3.0) completed the 3-day experiment and were thus included in the analysis. None of the participants had a history of neurological or psychiatric illness; all had normal or corrected-to-normal vision and normal hearing.

Ethics statement

All participants gave informed written consent to participate in the experiment. The study was approved by the research ethics committee of the University of Birmingham (approval number: ERN_11_0470AP4) and was conducted in accordance with the principles outlined in the Declaration of Helsinki.

Stimuli

The visual (‘V’) stimulus was a cloud of 20 white dots (diameter = 0.43° visual angle, stimulus duration: 50 ms) sampled from a bivariate Gaussian distribution with a vertical standard deviation of 2° and a horizontal standard deviation of 2° or 14° visual angle, presented on a dark grey background (67% contrast). Participants were told that the 20 dots were generated by one underlying source in the centre of the cloud. The visual cloud of dots was presented at one of four possible locations along the azimuth (i.e., −10°, −3.3°, 3.3°, or 10°).

The auditory (‘A’) stimulus was a 50-ms-long burst of white noise with a 5-ms on/off ramp. Each auditory stimulus was delivered at a 75-dB sound pressure level through one of four pairs of two vertically aligned loudspeakers placed above and below the monitor at four positions along the azimuth (i.e., −10°, −3.3°, 3.3°, or 10°). The volumes of the 2 × 4 speakers were carefully calibrated across and within each pair to ensure that participants perceived the sounds as emanating from the horizontal midline of the monitor.
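
For illustration, the following numpy sketch shows how such a visual cloud could be sampled around the true source location; it is not the Psychtoolbox code used in the experiment, and the function name and return format are hypothetical.

```python
import numpy as np

def sample_visual_cloud(centre_azimuth_deg, high_reliability=True, n_dots=20, rng=None):
    """Sample dot positions (azimuth, elevation in degrees) for one visual
    stimulus: a bivariate Gaussian cloud centred on the true source location."""
    rng = np.random.default_rng() if rng is None else rng
    sd_horizontal = 2.0 if high_reliability else 14.0   # VR+ versus VR-
    sd_vertical = 2.0
    cov = np.diag([sd_horizontal ** 2, sd_vertical ** 2])
    return rng.multivariate_normal([centre_azimuth_deg, 0.0], cov, size=n_dots)

dots = sample_visual_cloud(-3.3, high_reliability=False)  # low-reliability cloud at -3.3 deg
```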

Experimental design and procedure

In a spatial ventriloquist paradigm, participants were presented with synchronous, spatially congruent or disparate visual and auditory signals (Fig 1A and 1B). On each trial, visual and auditory locations were independently sampled from four possible locations along the azimuth (i.e., −10°, −3.3°, 3.3°, or 10°), leading to four levels of spatial disparity (i.e., 0°, 6.6°, 13.3°, or 20°; i.e., as indicated by the greyscale in Fig 1A). In addition, we manipulated the reliability of the visual signal by setting the horizontal standard deviation of the Gaussian cloud to a 2° (high reliability) or 14° (low reliability) visual angle. In an intersensory selective-attention paradigm, participants reported either their auditory or visual perceived signal location and ignored signals in the other modality. For the visual modality, they were asked to determine the location of the centre of the visual cloud of dots. Hence, the 4 × 4 × 2 × 2 factorial design manipulated (1) the location of the visual stimulus (−10°, −3.3°, 3.3°, 10°; i.e., the mean of the Gaussian), (2) the location of the auditory stimulus (−10°, −3.3°, 3.3°, 10°), (3) the reliability of the visual signal (2°, 14°; SD of the Gaussian), and (4) task relevance (auditory-/visual-selective report), resulting in 64 conditions (Fig 1A). To characterise the computational principles of multisensory integration, we reorganised these conditions into a 2 (visual reliability: high versus low) × 2 (task relevance: auditory versus visual report) × 2 (spatial disparity: ≤6.6° versus >6.6°) factorial design for the statistical analysis of the behavioural and EEG data. In addition, we included 4 (locations: −10°, −3.3°, 3.3°, or 10°) × 2 (visual reliability: high, low) unisensory visual conditions and 4 (locations: −10°, −3.3°, 3.3°, or 10°) unisensory auditory conditions. We did not manipulate auditory reliability, because the reliability of auditory spatial information is inherently limited. Furthermore, the manipulation of visual reliability is sufficient to determine reliability-weighted integration as a computational principle and arbitrate between the different multisensory integration models (see Bayesian modelling analysis section).

On each trial, synchronous audiovisual, unisensory visual, or unisensory auditory signals were presented for 50 ms, followed by a response cue 1,000 ms after stimulus onset (Fig 1B). The response was cued by a central pure tone (1,000 Hz) and a blue colour change of the fixation cross presented in synchrony for 100 ms. Participants were instructed to withhold their response and avoid blinking until the presentation of the cue. They fixated on a central cross throughout the entire experiment. The next stimulus was presented after a variable response interval of 2.6–3.1 s.

Stimuli and conditions were presented in a pseudo-randomised fashion. The stimulus type (bisensory versus unisensory) and task relevance (auditory versus visual) were held constant within a run of 128 trials. This yielded four run types: audiovisual with auditory report, audiovisual with visual report, auditory with auditory report, and visual with visual report. The task relevance of the sensory modality in a given run was displayed to the participant at the beginning of the run. Furthermore, across runs we counterbalanced the response hand (i.e., left versus right hand) to partly dissociate spatial processing from motor responses. The order of the runs was counterbalanced across participants. All conditions within a run were presented an equal number of times. Each participant completed 60 runs, leading to 7,680 trials in total (3,840 auditory and 3,840 visual localisation tasks—i.e., 96 trials for each of the 76 conditions, apart from the four unisensory auditory conditions, which included 192 trials each). The runs were performed across 3 days with 20 runs per day. Each day started with a brief practice run.

Experimental setup

Stimuli were presented using Psychtoolbox version 3.0.11 [69] (http://psychtoolbox.org/) under MATLAB R2014a (MathWorks) on a desktop PC running Windows 7. Visual stimuli were presented via a gamma-corrected 30” LCD monitor with a resolution of 2,560 × 1,600 pixels at a frame rate of 60 Hz. Auditory stimuli were presented at a sampling rate of 44.1 kHz via eight external speakers (Multimedia) and an ASUS Xonar DSX sound card. Exact audiovisual onset timing was confirmed by recording visual and auditory signals concurrently with a photodiode and a microphone. Participants rested their head on a chin rest at a distance of 475 mm from the monitor and at a height that matched participants’ ears to the horizontal midline of the monitor. Participants responded by pressing one of four response buttons on a USB keypad with their index, middle, ring, and little finger, respectively.

Eye movement recording and analysis

To address potential concerns that results were confounded by eye movements, we recorded participants’ eye movements. Eye recordings were calibrated in the recommended field of view (32° horizontally and 24° vertically) for the EyeLink 1000 Plus system with the desktop mount at a sampling rate of 2,000 Hz. Eye position data were parsed on-line into events (saccade, fixation, eye blink) using the EyeLink 1000 Plus software. The ‘cognitive configuration’ was used for saccade detection (velocity threshold = 30°/sec, acceleration threshold = 8,000°/sec², motion threshold = 0.15°) with an additional criterion of radial amplitude larger than 1°. Individual trials were rejected if saccades or eye blinks were detected from −100 to 700 ms post stimulus.

Behavioural data analysis

Participants’ stimulus localisation accuracy was assessed as the Pearson correlation between their location responses and the true signal source location separately for unisensory auditory, visual high reliability, and visual low reliability conditions. To test whether localisation accuracy in vision exceeded that in audition at both levels of visual reliability, we performed Monte-Carlo permutation tests. Specifically, we entered the subject-specific Fisher z-transformed Pearson correlation differences between vision and audition (i.e., visual–auditory) separately for the two visual reliability levels into a Monte-Carlo permutation test at the group level based on the one-sample t statistic with 5,000 permutations [70].
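
A minimal sketch of such a sign-flip (one-sample) Monte-Carlo permutation test is shown below, assuming one Fisher z-transformed correlation difference per participant; the example numbers are invented, and a two-sided p-value is used here for simplicity.

```python
import numpy as np

def signflip_permutation_test(diffs, n_perm=5000, seed=1):
    """One-sample Monte-Carlo permutation test based on the t statistic:
    randomly flip the sign of each participant's difference score and compare
    the observed t value against the permutation distribution (two-sided)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    n = len(diffs)

    def t_stat(x):
        return np.mean(x) / (np.std(x, ddof=1) / np.sqrt(n))

    t_obs = t_stat(diffs)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, n))
    t_perm = np.array([t_stat(s * diffs) for s in signs])
    p_value = np.mean(np.abs(t_perm) >= np.abs(t_obs))
    return t_obs, p_value

# Hypothetical Fisher z-transformed (visual minus auditory) correlation
# differences, one value per participant.
z_diffs = np.arctanh([0.95, 0.97, 0.93, 0.96, 0.90]) - np.arctanh([0.85, 0.90, 0.88, 0.92, 0.80])
print(signflip_permutation_test(z_diffs))
```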

EEG data acquisition

Continuous EEG signals were recorded from 64 channels using Ag/AgCl active electrodes arranged in a 10–20 layout (ActiCap, Brain Products GmbH, Gilching, Germany) at a sampling rate of 1,000 Hz, referenced at FCz. Channel impedances were kept below 10 kΩ.

EEG preprocessing

Preprocessing was performed with the FieldTrip toolbox [71] (http://www.fieldtriptoolbox.org/). For the decoding analysis, raw data were high-pass filtered at 0.1 Hz, re-referenced to the average reference, and low-pass filtered at 120 Hz. Trials were extracted with a 100-ms prestimulus and 700-ms poststimulus period and baseline corrected by subtracting the average value of the −100 to 0 ms interval from the time course. Trials were then temporally smoothed with a 20-ms moving window and downsampled to 200 Hz (note that a 20-ms moving average is comparable to a finite impulse response [FIR] filter with a cutoff frequency of 50 Hz). Trials containing artefacts were rejected based on visual inspection. Furthermore, trials were rejected if (1) they included eye blinks, (2) they included saccades, (3) the distance between eye fixation and the central fixation cross exceeded 2°, (4) participants responded prior to the response cue, or (5) there was no response. For ERPs (S2 and S3 Figs), preprocessing was identical to that for the decoding analysis, except that a 45-Hz low-pass filter was applied and no additional temporal smoothing with a moving window was used. Grand average ERPs were computed by averaging all trials for each condition first within each participant and then across participants.
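
The following FieldTrip sketch illustrates a preprocessing pipeline of this kind; the file name, trial-definition settings, and moving-average implementation are illustrative assumptions rather than the authors’ exact configuration.

```matlab
% Minimal FieldTrip sketch of the preprocessing steps described above.
cfg                     = [];
cfg.dataset             = 'subject01.eeg';       % hypothetical raw file
cfg.trialdef.eventtype  = 'Stimulus';            % assumed trigger type
cfg.trialdef.prestim    = 0.1;                   % 100-ms prestimulus period
cfg.trialdef.poststim   = 0.7;                   % 700-ms poststimulus period
cfg                     = ft_definetrial(cfg);

cfg.hpfilter  = 'yes';  cfg.hpfreq         = 0.1;       % high-pass at 0.1 Hz
cfg.lpfilter  = 'yes';  cfg.lpfreq         = 120;       % low-pass at 120 Hz
cfg.reref     = 'yes';  cfg.refchannel     = 'all';     % average reference
cfg.demean    = 'yes';  cfg.baselinewindow = [-0.1 0];  % baseline correction
data          = ft_preprocessing(cfg);

% 20-ms moving-average smoothing, then downsampling to 200 Hz
for k = 1:numel(data.trial)
    data.trial{k} = movmean(data.trial{k}, round(0.02 * data.fsample), 2);
end
cfg            = [];
cfg.resamplefs = 200;
data           = ft_resampledata(cfg, data);
```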

EEG multivariate pattern analysis

For the multivariate pattern analyses, we computed ERPs by averaging over sets of eight randomly assigned individual trials from the same condition. To characterise the temporal dynamics of the spatial representations, we trained linear SVR models (LIBSVM [72], https://www.csie.ntu.edu.tw/~cjlin/libsvm/) to learn the mapping from ERP activity patterns of the (1) unisensory auditory (for auditory decoding), (2) unisensory visual (for visual decoding), or (3) audiovisual congruent conditions (for audiovisual decoding) to external spatial locations, separately for each time point (every 5 ms) over the course of the trial (S2, S3 and S4 Figs). All SVR models were trained and evaluated in a 12-fold stratified cross-validation procedure (12 ERPs per fold) with default hyperparameters (C = 1, ε = 0.001). The specific training and generalisation procedures were adjusted to the scientific questions (see the Shared and distinct neural representations of space across vision and audition section and the GLM analysis of audiovisual weight index wAV section for details).
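
As an illustration of this decoding scheme, the sketch below trains a linear epsilon-SVR with LIBSVM’s MATLAB interface at each time point and evaluates it with 12-fold cross-validation; the variables erps and loc are assumed placeholders for the pseudo-trial ERPs and true stimulus locations, and the fold assignment is simplified (plain rather than stratified k-fold).

```matlab
% Assumed inputs: erps is an nPseudoTrials x nChannels x nTimePoints array of
% pseudo-trial ERPs; loc is the nPseudoTrials x 1 vector of true locations (deg).
nTime   = size(erps, 3);
svrOpts = '-s 3 -t 0 -c 1 -p 0.001 -q';       % linear epsilon-SVR, C = 1, eps = 0.001
rTime   = zeros(nTime, 1);

cv = cvpartition(numel(loc), 'KFold', 12);    % 12-fold cross-validation
for t = 1:nTime
    X    = squeeze(erps(:, :, t));
    pred = zeros(numel(loc), 1);
    for f = 1:cv.NumTestSets
        model            = svmtrain(loc(cv.training(f)), X(cv.training(f), :), svrOpts);
        pred(cv.test(f)) = svmpredict(loc(cv.test(f)), X(cv.test(f), :), model);
    end
    rTime(t) = corr(pred, loc);               % decoding accuracy: Pearson correlation
end
```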

Overview of behavioural and EEG analysis

Combining psychophysics, computational modelling, and EEG, we addressed two questions: First, focusing selectively on the unisensory auditory and unisensory visual conditions, we investigated when spatial representations are formed that generalise across auditory and visual modalities. Second, focusing on the audiovisual conditions, we investigated when and how human observers integrate audiovisual signals into spatial representations that take into account the observer’s uncertainty about the world’s causal structure consistent with Bayesian causal inference. In the following sections, we will describe the analysis approaches to address these two questions in turn.

Shared and distinct neural representations of space across vision and audition

First, we investigated how the brain forms spatial representations in either audition or vision using the so-called temporal generalisation method [21]. Here, the SVR model is trained at time point t to learn the mapping from, e.g., unisensory visual (or auditory) ERP pattern to external stimulus location. This learnt mapping is then used to predict spatial locations from unisensory visual (or auditory) ERP activity patterns across all other time points. Training and generalisation were applied separately to unisensory auditory and visual ERPs. To match the number of trials for auditory and visual conditions, we applied this analysis to the visual ERPs pooled over the two levels of visual reliability. The decoding accuracy as quantified by the Pearson correlation coefficient between the true and decoded stimulus locations is entered into a training time × generalisation time matrix. The generalisation ability across time illustrates the similarity of EEG activity patterns relevant for encoding features (i.e., here: spatial location) and has been proposed to assess the stability of neural representations [21]. In other words, if stimulus location is encoded in EEG activity patterns that are stable (or shared) across time, then an SVR model trained at time point t will be able to correctly decode stimulus location from EEG activity patterns at other time points. By contrast, if stimulus location is represented by transient or distinct EEG activity patterns across time, then an SVR model trained at time point t will not be able to decode stimulus location from EEG activity patterns at other time points. Hence, entering Pearson correlation coefficients as a measure for decoding accuracy for all combinations of training and test time into a temporal generalisation matrix has been argued to provide insights into the stability of neural representations whereby the spread of significant decoding accuracy to off-diagonal elements of the matrix indicates temporal generalisability or stability [21].

Second, to examine whether and when neural representations are formed that are shared across vision and audition, we generalised to ERP activity patterns across time not only from the same sensory modality but also from the other sensory modality (i.e., from vision to audition and vice versa). This cross-sensory generalisation reveals neural representations that are shared across sensory modalities.
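
A minimal sketch of the temporal generalisation analysis is given below, reusing the assumed erps and loc variables from the sketch above; for brevity it omits the cross-validation used in the actual analysis and only indicates in a comment how cross-sensory generalisation would proceed.

```matlab
% Train an SVR at each training time point and test it at every other time point.
nTime   = size(erps, 3);
svrOpts = '-s 3 -t 0 -c 1 -p 0.001 -q';
genMat  = zeros(nTime, nTime);                % training time x generalisation time

for tTrain = 1:nTime
    model = svmtrain(loc, squeeze(erps(:, :, tTrain)), svrOpts);
    for tTest = 1:nTime
        pred = svmpredict(loc, squeeze(erps(:, :, tTest)), model);
        genMat(tTrain, tTest) = corr(pred, loc);   % Pearson correlation
    end
end
% For cross-sensory generalisation, a model trained on, e.g., visual ERPs would
% instead be tested on the auditory ERPs at each time point (and vice versa).
```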

To assess whether decoding accuracies were better than chance, we entered the subject-specific matrices of Fisher z-transformed Pearson correlation coefficients into a Monte-Carlo permutation test at the group (i.e., between-subjects) level using the one-sample t statistic with 5,000 permutations ([70], as implemented in the FieldTrip toolbox). To correct for multiple comparisons within the two-dimensional (i.e., time × time) data, cluster-level inference was used based on the maximum of the summed t values within each cluster (‘maxsum’) with a cluster-defining threshold of p < 0.05, and a two-tailed p-value was computed.

Computational principles of audiovisual integration: GLM-based analysis of audiovisual weight index wAV and Bayesian modelling analysis

To characterise how human observers integrate auditory and visual signals into spatial representations at the behavioural and neural levels, we developed a GLM-based analysis of an audiovisual weight index wAV and a Bayesian modelling analysis that we applied to both (1) the reported auditory and visual spatial estimates (i.e., participants’ behavioural localisation responses) and (2) the neural spatial estimates decoded from EEG activity pattern evoked by audiovisual stimuli (see Fig 3 and [14,16]).

GLM analysis of audiovisual weight index wAV

SVR to decode spatial estimates from audiovisual EEG activity pattern.

The neural spatial estimates were obtained by training an SVR model on the audiovisual congruent trials to learn the mapping from the audiovisual ERP activity patterns at time t to the external stimulus locations. This learnt mapping at time t was then used to decode the stimulus location from the ERP activity patterns of the spatially congruent and incongruent audiovisual conditions at time t (see Fig 3A and 3B right). These training and generalisation steps were repeated across all times t to obtain distributions of neural (i.e., decoded) spatial estimates for all 64 conditions at every time point t.

Regression model and computation of behavioural and neural audiovisual weight index wAV.

In the ‘GLM-based’ analysis approach, we quantified the influence of the location of the auditory and visual stimuli on the reported (behavioural) or decoded (neural) spatial estimates using a linear regression model (see Fig 3C left). In this regression model, the reported (or decoded) spatial locations were predicted by the true auditory and visual stimulus locations for each of the eight conditions in the 2 (visual reliability: high versus low) × 2 (task relevance: auditory versus visual report) × 2 (spatial disparity: ≤6.6° versus >6.6°) factorial design (Fig 1A).

\[
\hat{S} = \beta_{A} S_{A} + \beta_{V} S_{V} + \varepsilon
\tag{1}
\]

where \(\hat{S}\) denotes the reported (behavioural) or decoded (neural) spatial estimate, and \(S_{A}\) and \(S_{V}\) denote the true auditory and visual stimulus locations.

Hence, the regression model included 16 regressors in total—i.e., 8 (conditions) × 2 (true auditory or visual spatial locations). We computed one behavioural regression model for participants’ reported locations. Further, we computed 161 neural regression models for the spatial locations decoded from EEG activity patterns across time—i.e., one neural regression model for every 5-ms interval, leading to time courses of auditory (βA) and visual (βV) parameter estimates.

In each regression model, the auditory (βA) and visual (βV) parameter estimates quantified the influence of the auditory and visual stimulus locations on the reported (or decoded) stimulus location for a particular condition. A positive βV (or βA) indicates that the true visual (or auditory) location has a positive weight and hence an attractive effect on the reported or decoded location (e.g., it is shifted towards the true visual location; see Fig 3C left for an example). A negative βV (or βA) indicates that the true visual (or auditory) location has a negative weight and hence a repulsive effect on the reported or decoded location (e.g., it is shifted away from the true visual location). The auditory and visual parameter estimates need to be interpreted together. To obtain a summary index, we computed the relative audiovisual weight (wAV) as the four-quadrant inverse tangent of the visual (βV) and auditory (βA) parameter estimates for each of the eight conditions in each regression model (see Fig 3C left). The angles in radians are then converted to degrees:

\[
w_{AV} = \operatorname{atan2}\!\left(\beta_{V}, \beta_{A}\right) \cdot \frac{180}{\pi}
\tag{2}
\]

The four-quadrant inverse tangent maps each combination of (positive or negative) visual (βV) and auditory (βA) parameter estimates uniquely to a value in the interval (−π, π], which is then transformed into degrees. If the reported/decoded estimate is dominated purely and positively by the visual signal (i.e., βA = 0, βV > 0), then wAV is 90°. For pure (and positive) auditory dominance (i.e., βA > 0, βV = 0), wAV is 0°. Furthermore, if the visual signal has an attractive influence (i.e., it attracts the perceived location towards the visual location) but the auditory signal has a repulsive influence (i.e., it shifts the perceived location away from the auditory location) on the perceived/decoded location (i.e., βA < 0, βV > 0), then wAV is >90° (e.g., Fig 4C, high-disparity condition).

We obtained one wAV for each of the eight conditions at the behavioural level and one wAV for each of the eight conditions and time point (every 5 ms) at the neural level. The neural wAV time courses were temporally smoothed using a 20-ms moving average filter.
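
As an illustration, the following sketch computes wAV for a single condition from assumed vectors of reported (or decoded) locations and true auditory and visual locations; it is a sketch, not the authors’ code.

```matlab
% Assumed inputs: sHat (nTrials x 1 reported or decoded locations), sA and sV
% (nTrials x 1 true auditory and visual locations in degrees).
X    = [sA, sV];                 % design matrix: true auditory and visual locations
beta = X \ sHat;                 % least-squares estimates [betaA; betaV]
wAV  = atan2d(beta(2), beta(1)); % relative audiovisual weight in degrees

% Examples: betaA = 0, betaV > 0  ->  wAV = 90 deg (pure visual dominance)
%           betaA > 0, betaV = 0  ->  wAV =  0 deg (pure auditory dominance)
```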

Statistical analysis of circular indices wAV for behavioural and neural data.

We performed the statistics on the behavioural and neural audiovisual weight indices using a 2 (auditory versus visual report) × 2 (high versus low visual reliability) × 2 (large versus small spatial disparity) factorial design based on the likelihood ratio statistics for circular measures (LRTS) [73]. Similar to an analysis of variance for linear data, LRTS computes the difference in log-likelihood functions for the full model that allows differences in the mean locations of circular measures between conditions (i.e., main and interaction effects) and the reduced null model that does not model any mean differences between conditions. LRTS were computed separately for the main effects (i.e., reliability, task relevance, spatial disparity) and interactions.

To avoid parametric assumptions, we evaluated the main effects of visual reliability, task relevance, and spatial disparity and their interactions in the factorial design using randomisation tests (5,000 randomisations). To account for the within-subject repeated-measures design at the second (random-effects) level, randomisations were performed within each participant. For the main effects of visual reliability, task relevance, and spatial disparity, wAV values were permuted within the levels of the nontested factors. For tests of the two-way interactions (e.g., spatial disparity × task relevance), we permuted the simple main effects of the two factors of interest within the levels of the third factor [74]. For tests of the three-way interaction, values were freely permuted across all conditions [75]. These statistical tests were performed once for the behavioural wAV and independently for each time point between 55 and 700 ms (i.e., 130 tests) for the neural wAV (see below for the correction for multiple comparisons across time points).

To assess the similarity between behavioural and neural audiovisual weight (wAV) indices, we computed the circular correlation coefficient (as implemented in the CircStat toolbox [76]) between the eight behavioural (i.e., constant across time) and eight neural (i.e., variable across time) wAV from our 2 (high versus low visual reliability) × 2 (auditory versus visual report) × 2 (large versus small spatial disparity) factorial design separately for each time point.
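
A minimal sketch of this similarity analysis, assuming hypothetical arrays wavBehav and wavNeural of behavioural and neural wAV values in degrees, is shown below (circ_corrcc expects angles in radians).

```matlab
% Assumed inputs: wavBehav (8 x 1, behavioural wAV in degrees) and
% wavNeural (8 x nTime, neural wAV in degrees across time points).
nTime = size(wavNeural, 2);
rho   = zeros(nTime, 1);
for t = 1:nTime
    % Circular correlation between the 8 behavioural and 8 neural wAV values
    rho(t) = circ_corrcc(deg2rad(wavBehav), deg2rad(wavNeural(:, t)));
end
```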

Unless otherwise stated, results are reported at p < 0.05. To correct for multiple comparisons across time, cluster-level inference was used based on the maximum of the summed LRTS values within each cluster (‘maxsum’) with an uncorrected cluster-defining threshold of p < 0.05 (as implemented in the FieldTrip toolbox).

For plotting circular means of wAV (Fig 1C for behavioural wAV, Fig 4A–4D for neural wAV), we computed the means’ confidence intervals (as implemented in the CircStat toolbox [76]).

Bayesian modelling analysis

Description of Bayesian models and decision strategies.

Next, we fitted the full-segregation model(s), the forced-fusion model, and the Bayesian causal inference model to the spatial estimates that were reported by observers (i.e., behavioural response distribution, Fig 3B left) or decoded from ERP activity patterns at time t (i.e., neural spatial estimate distribution, Fig 3B right). Using Bayesian model comparison, we then assessed which of these models is the best explanation for the behavioural or neural spatial estimates.

In the following, we will first describe the Bayesian causal inference model, from which we will then derive the forced-fusion and full-segregation models as special cases (details can be found in [2,13–15]).

Briefly, the generative model of Bayesian causal inference (see Fig 3C right) assumes that common (C = 1) or independent (C = 2) causes are sampled from a binomial distribution defined by the common-cause prior pcommon. For a common source, the ‘true’ location SAV is drawn from the spatial prior distribution N(μAV, σP). For two independent causes, the ‘true’ auditory (SA) and visual (SV) locations are drawn independently from this spatial prior distribution. For the spatial prior distribution, we assumed a central bias (i.e., μAV = 0). We introduced sensory noise by drawing xA and xV independently from normal distributions centred on the true auditory (or visual) locations with variances σA2 (or σV2). Thus, the generative model included the following free parameters: the common-source prior pcommon, the spatial prior variance σP2, the auditory variance σA2, and the two visual variances σV2 corresponding to the two visual reliability levels.

The posterior probability of the underlying causal structure can be inferred by combining the common-source prior with the sensory evidence according to Bayes rule:

\[
P(C = 1 \mid x_{A}, x_{V}) = \frac{P(x_{A}, x_{V} \mid C = 1)\, p_{common}}{P(x_{A}, x_{V} \mid C = 1)\, p_{common} + P(x_{A}, x_{V} \mid C = 2)\,(1 - p_{common})}
\tag{3}
\]

In the case of a common source (C = 1), the optimal estimate of the audiovisual location is a reliability-weighted average of the auditory and visual percepts and the spatial prior (i.e., referred to as forced-fusion spatial estimate).

\[
\hat{S}_{AV,C=1} = \frac{x_{A}/\sigma_{A}^{2} + x_{V}/\sigma_{V}^{2} + \mu_{AV}/\sigma_{P}^{2}}{1/\sigma_{A}^{2} + 1/\sigma_{V}^{2} + 1/\sigma_{P}^{2}}
\tag{4}
\]

In the case of independent sources (C = 2), the auditory and visual stimulus locations (for the auditory and visual location report, respectively) are estimated independently (i.e., referred to as unisensory auditory or visual segregation estimates):

\[
\hat{S}_{A,C=2} = \frac{x_{A}/\sigma_{A}^{2} + \mu_{AV}/\sigma_{P}^{2}}{1/\sigma_{A}^{2} + 1/\sigma_{P}^{2}}, \qquad
\hat{S}_{V,C=2} = \frac{x_{V}/\sigma_{V}^{2} + \mu_{AV}/\sigma_{P}^{2}}{1/\sigma_{V}^{2} + 1/\sigma_{P}^{2}}
\tag{5}
\]

To provide a final estimate of the auditory and visual locations, the brain can combine the estimates from the two causal structures using various decision functions such as ‘model averaging’, ‘model selection’, and ‘probability matching’ [13].

According to the ‘model averaging’ strategy, the brain combines the integrated forced-fusion spatial estimate with the segregated, task-relevant unisensory (i.e., either auditory or visual) spatial estimates weighted in proportion to the posterior probability of the underlying causal structures.

\[
\hat{S}_{A} = P(C = 1 \mid x_{A}, x_{V})\, \hat{S}_{AV,C=1} + P(C = 2 \mid x_{A}, x_{V})\, \hat{S}_{A,C=2}
\tag{6}
\]

\[
\hat{S}_{V} = P(C = 1 \mid x_{A}, x_{V})\, \hat{S}_{AV,C=1} + P(C = 2 \mid x_{A}, x_{V})\, \hat{S}_{V,C=2}
\tag{7}
\]

According to the ‘model selection’ strategy, the brain reports the spatial estimate selectively from the more likely causal structure (Eq 8 only shown for \(\hat{S}_{A}\)):

\[
\hat{S}_{A} =
\begin{cases}
\hat{S}_{AV,C=1} & \text{if } P(C = 1 \mid x_{A}, x_{V}) > 0.5 \\
\hat{S}_{A,C=2} & \text{otherwise}
\end{cases}
\tag{8}
\]

According to ‘probability matching’, the brain reports the spatial estimate of one causal structure stochastically selected in proportion to the posterior probability of this causal structure (Eq 9 only shown for \(\hat{S}_{A}\)):

\[
\hat{S}_{A} =
\begin{cases}
\hat{S}_{AV,C=1} & \text{if } P(C = 1 \mid x_{A}, x_{V}) > \xi \\
\hat{S}_{A,C=2} & \text{otherwise}
\end{cases}
\quad \text{with } \xi \sim \mathcal{U}(0, 1)
\tag{9}
\]

Thus, Bayesian causal inference formally requires three spatial estimates (\(\hat{S}_{AV,C=1}\), \(\hat{S}_{A,C=2}\), and \(\hat{S}_{V,C=2}\)), which are combined into a final Bayesian causal inference estimate (\(\hat{S}_{A}\) or \(\hat{S}_{V}\), depending on which sensory modality is task relevant) according to one of the three decision functions.
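
To make these computations concrete, the sketch below simulates a single trial of Bayesian causal inference with model averaging, following the standard formulation of Körding and colleagues [2]; all parameter values and variable names are illustrative assumptions rather than fitted estimates.

```matlab
% Illustrative parameters (degrees)
pCommon = 0.5;  muP = 0;  sigP = 10;  sigA = 8;  sigV = 2;
sA = 10;  sV = -10;                                % true source locations

% Noisy internal signals
xA = sA + sigA * randn;    xV = sV + sigV * randn;

% Likelihoods of the signals under common (C = 1) and independent (C = 2) causes
varSum = sigA^2*sigV^2 + sigA^2*sigP^2 + sigV^2*sigP^2;
likeC1 = 1/(2*pi*sqrt(varSum)) * exp(-0.5 * ((xA-xV)^2*sigP^2 + ...
         (xA-muP)^2*sigV^2 + (xV-muP)^2*sigA^2) / varSum);
likeC2 = 1/(2*pi*sqrt((sigA^2+sigP^2)*(sigV^2+sigP^2))) * ...
         exp(-0.5 * ((xA-muP)^2/(sigA^2+sigP^2) + (xV-muP)^2/(sigV^2+sigP^2)));

% Posterior probability of a common cause (Eq 3)
postC1 = likeC1 * pCommon / (likeC1 * pCommon + likeC2 * (1 - pCommon));

% Forced-fusion and segregation estimates (Eqs 4 and 5)
sFus  = (xA/sigA^2 + xV/sigV^2 + muP/sigP^2) / (1/sigA^2 + 1/sigV^2 + 1/sigP^2);
sSegA = (xA/sigA^2 + muP/sigP^2) / (1/sigA^2 + 1/sigP^2);
sSegV = (xV/sigV^2 + muP/sigP^2) / (1/sigV^2 + 1/sigP^2);

% Final estimates under model averaging (Eqs 6 and 7)
sHatA = postC1 * sFus + (1 - postC1) * sSegA;
sHatV = postC1 * sFus + (1 - postC1) * sSegV;
```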

In the main paper, we present behavioural results using ‘model averaging’ as the decision function, which was associated with the highest model evidence and exceedance probability at the group level. S1 Table shows the model evidence, exceedance probabilities, and parameters for Bayesian causal inference across the three decision strategies for the behavioural data.

At the behavioural level, we evaluated whether and how participants integrate auditory and visual stimuli by comparing (1) the Bayesian causal inference model (i.e., with model averaging; Table 1), (2) the forced-fusion model that integrates auditory and visual signals in a mandatory fashion (i.e., formally, the BCI model with a fixed pcommon = 1; Fig 3C, encircled in yellow), and (3) the full-segregation model that estimates stimulus location independently for vision and audition (i.e., formally, the BCI model with a fixed pcommon = 0; Fig 3C, SegV,A encircled in light blue). This SegV,A model assumes that observers report \(\hat{S}_{A,C=2}\) when they are asked to report the auditory location and \(\hat{S}_{V,C=2}\) when they are asked to report the visual location. In short, the SegV,A model reads out the spatial estimate from the task-relevant unisensory segregation model.

At the neural level, we may also conceive of a neural source (or brain region) that represents \(\hat{S}_{V,C=2}\), irrespective of which sensory modality needs to be reported (i.e., Fig 3C, SegV model, encircled in red). For instance, primary visual cortices may be considered predominantly unisensory with selective representations of the visual location even if the observer needs to report the auditory stimulus location. Likewise, we included a model that selectively represents the auditory location \(\hat{S}_{A,C=2}\) (i.e., Fig 3C, unisensory auditory segregation [SegA] model, encircled in green). By contrast, the full-segregation audiovisual model (i.e., SegV,A, encircled in light blue) can be thought of as a neural source (or brain area) that encodes the task-relevant estimate computed in a full-segregation model. It differs from the Bayesian causal inference model by not allowing for any audiovisual interactions or biases irrespective of the probabilities of the world’s causal structure (i.e., operationally manipulated by spatial disparity in the current experiment).

At the behavioural level, the unisensory SegV and SegA models are not useful, because we would expect observers to follow instructions and report their auditory estimate for the auditory report conditions and their visual estimate for the visual report conditions. In other words, it does not seem reasonable to fit the unisensory SegV and SegA models jointly to visual and auditory localisation responses at the behavioural level. By contrast, at the neural level, spatial estimates decoded from EEG activity patterns may potentially reflect neural representations that are formed by ‘predominantly unisensory’ neural generators (e.g., primary visual cortex), particularly in early processing phases. Hence, we estimated and compared three models for the behavioural localisation reports and five models for the spatial estimates decoded from EEG activity patterns.

Model fitting to behavioural and neural spatial estimates and Bayesian model comparison.

We fitted each model individually to participants’ behavioural localisation responses (or spatial estimates decoded from EEG activity patterns at time t) based on the predicted distributions of the spatial estimates (i.e., \(P(\hat{S} \mid S_{A}, S_{V})\); we use \(\hat{S}\) as a variable to refer generically to any spatial estimate) for each combination of auditory (SA) and visual (SV) source location. These predicted distributions marginalise over the internal sensory inputs (i.e., xA, xV) that are unknown to the experimenter (see [2] for further explanation). More specifically, we fit (1) the Bayesian causal inference model based on \(\hat{S}_{A}\) for auditory report conditions and \(\hat{S}_{V}\) for visual report conditions, (2) the forced-fusion model based on \(\hat{S}_{AV,C=1}\), and (3) the SegV,A model based on \(\hat{S}_{A,C=2}\) for auditory report conditions and \(\hat{S}_{V,C=2}\) for visual report conditions. At the neural level, we also fit the SegV model based on \(\hat{S}_{V,C=2}\) and the SegA model based on \(\hat{S}_{A,C=2}\) to the spatial estimates decoded from EEG activity patterns across both visual and auditory report conditions.

To marginalise over the internal variables xA and xV that are not accessible to the experimenter, the predicted distributions were generated by simulating xA and xV 10,000 times for each of the 64 conditions and inferring the different sorts of spatial estimates from Eqs 3–9. To link any of these estimates to participants’ discrete auditory and visual localisation responses at the behavioural level, we assumed that participants selected the button closest to \(\hat{S}\) and binned \(\hat{S}\) accordingly into a histogram (with four bins corresponding to the four buttons). Thus, we obtained a histogram of predicted localisation responses for each of those five models, separately for each condition and individually for each participant. Based on these histograms, we computed the probability of a participant’s counts of localisation responses using the multinomial distribution (see [2]). This gives the likelihood of the model given the participant’s response data. Assuming independence of conditions, we summed the log likelihoods across conditions.
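
The sketch below illustrates this step for a single condition, using assumed variables respCounts (observed button counts) and simEstimates (simulated spatial estimates); the constant multinomial coefficient, which does not depend on the model parameters, is omitted.

```matlab
% Assumed inputs: respCounts (1 x 4 counts of button presses) and simEstimates
% (10,000 x 1 simulated spatial estimates from the fitted model).
buttons  = [-10, -3.3, 3.3, 10];                                 % response locations (deg)
[~, bin] = min(abs(bsxfun(@minus, simEstimates, buttons)), [], 2);  % nearest button per estimate
predProb = histcounts(bin, 0.5:1:4.5) / numel(bin);              % predicted response probabilities
predProb = max(predProb, eps);                                   % avoid log(0)
logLik   = sum(respCounts .* log(predProb));                     % multinomial log likelihood
                                                                 % (constant term omitted)
```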

At the neural level, we first binned the spatial estimates decoded from each ERP activity pattern at each time point based on their distance from the four true locations (i.e., −10°, −3.3°, 3.3°, or 10°) into four spatial bins before fitting the models to those discretised spatial estimates.

To obtain maximum-likelihood estimates for the parameters of the models (pcommon, σP, σA, and σV1 and σV2 for the two levels of visual reliability; formally, the forced-fusion and full-segregation models assume pcommon = 1 or pcommon = 0, respectively), we used a nonlinear simplex optimisation algorithm as implemented in MATLAB’s fminsearch function (MATLAB R2016a). This optimisation algorithm was initialised with the parameter setting that obtained the highest log likelihood in a prior grid search.

The model fit for behavioural and neural data (i.e., at each time point) was assessed by the coefficient of determination R2 [77], defined as

\[
R^{2} = 1 - \exp\!\left(-\frac{2}{n}\left(l(\hat{\theta}) - l(0)\right)\right)
\tag{10}
\]

where \(l(\hat{\theta})\) and \(l(0)\) denote the log likelihoods of the fitted and the null model, respectively, and n is the number of data points. For the null model, we assumed that an observer randomly chooses one of the four response options; i.e., we assumed a discrete uniform distribution with a probability of 0.25. Because, in our case, the Bayesian causal inference model’s responses were discretised to relate them to the four discrete response options, the coefficient of determination was scaled (i.e., divided) by the maximum coefficient (see [77]) defined as

\[
R^{2}_{max} = 1 - \exp\!\left(\frac{2}{n}\, l(0)\right)
\tag{11}
\]
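
A minimal sketch of this computation, assuming a variable logLikModel for the summed log likelihood of the fitted model and n for the number of data points, is:

```matlab
% Scaled coefficient of determination (Nagelkerke, 1991) for discrete responses.
logLikNull = n * log(0.25);                               % random guessing over 4 buttons
R2         = 1 - exp(-2/n * (logLikModel - logLikNull));  % Eq 10
R2max      = 1 - exp( 2/n * logLikNull);                  % Eq 11: maximum attainable R2
R2scaled   = R2 / R2max;                                  % scaled coefficient of determination
```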

To identify the optimal model for explaining participants’ data (i.e., localisation responses at the behavioural level or spatial estimates decoded from EEG activity patterns at the neural level), we compared the candidate models using the Bayesian information criterion (BIC) as an approximation to the model evidence [78]:

\[
BIC = -2 \ln \hat{L} + k \ln n
\tag{12}
\]

where \(\hat{L}\) denotes the likelihood, n the number of data points (i.e., EEG activity patterns summed over conditions at a time point t), and k the number of parameters. The BIC depends on both model complexity and model fit. We performed Bayesian model selection [25] at the group (i.e., random-effects) level as implemented in SPM8 [79] to obtain the protected exceedance probability that one model is better than any of the other candidate models above and beyond chance.
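
For illustration, the BIC and the approximate log model evidence passed to group-level Bayesian model selection could be computed as in the following sketch (variable names are assumptions):

```matlab
% Assumed inputs: logLikModel (summed log likelihood of a candidate model),
% k (number of free parameters), n (number of data points).
BIC     = -2 * logLikModel + k * log(n);   % Bayesian information criterion (Eq 12)
logEvid = -BIC / 2;                        % approximate log model evidence per model,
                                           % e.g., for group-level BMS with SPM's spm_BMS
```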

Assumptions and caveats of EEG decoding analyses.

The EEG activity patterns measured across 64 scalp electrodes represent a superposition of activity generated by potentially multiple neural sources located, for instance, in auditory, visual, and higher-order association areas. The extent to which auditory or visual information can be decoded from EEG activity patterns depends therefore inherently on how information is neurally encoded by the ‘neural generators’ in source space and on how these neural activities are expressed and superposed in sensor space (i.e., as measured by scalp electrodes). For example, visual space is retinotopically encoded, whereas auditory space is represented by broadly tuned neuronal populations (i.e., opponent channel coding model [31,80]), rate-based code [30,81], or spike latency and pattern [82,83]. These differences in encoding of auditory and visual space may contribute to the visual bias we observed for the audiovisual weight index wAV in early processing (Fig 4A–4D) and the dominance of the SegV model in the time course of exceedance probabilities (Fig 4F). Furthermore, particularly at later stages, scalp EEG patterns likely rely on superposition of activity of multiple neural generators so that ‘decodability’ will also depend on how source activities combine and project to the scalp (e.g., source orientation etc.). Given the inverse problem involved in inferring sources from EEG topographies, recent studies suggested combining information from fMRI and EEG activity pattern via representational similarity analyses [84,85]. Although we informally also pursue this approach in the Discussion section of the current paper, when merging information from a previous fMRI study that used the same ventriloquist paradigm and analyses with our current EEG results, we recognise the limitations of such an fMRI and EEG fusion approach. For instance, different features encoded in neural activity may be expressed in BOLD-response and EEG scalp topographies [86].

Finally, we trained the SVR model on the audiovisual congruent conditions pooled over task relevance and visual reliability to ensure that the decoder was based on activity patterns generated by sources related to auditory, visual, and audiovisual integration processes and that the effects of task relevance or reliability on the audiovisual weight index wAV cannot be attributed to differences in the decoding model (see [65] for a related discussion).

Supporting information

S1 Data. Zip file containing datasets underlying Figs 1C, 2, and 4.

The data are stored in MATLAB structures.

https://doi.org/10.1371/journal.pbio.3000210.s001

(ZIP)

S1 Text. Supporting information: Time-frequency analysis.

https://doi.org/10.1371/journal.pbio.3000210.s002

(DOCX)

S1 Fig. Distributions of spatial estimates.

The distribution (across participants’ mean) of spatial estimates given by observers’ behavioural localisation responses (solid lines) or predicted by the Bayesian causal inference model fitted to observers’ behavioural responses (dashed lines, for model averaging) are shown across all conditions in our 2 task relevance (auditory: red versus visual: blue) × visual reliability (high: row 1–4 versus low: row 5–8) × 4 auditory location (columns as indicated) × 4 visual location (rows as indicated) design.

https://doi.org/10.1371/journal.pbio.3000210.s003

(TIF)

S2 Fig. Time-resolved decoding of visual location for unisensory visual stimuli.

(A) Time course of decoding accuracy (i.e., Pearson correlation between true and predicted visual stimulus locations pooled over both visual reliabilities, black line) and the EEG evoked potentials (across participants’ mean) for the unisensory visual (high reliability only) signals at −10°, −3.3°, 3.3°, and 10°, averaged over occipital channels. Shaded grey area indicates the time window at which the decoding accuracy is significantly better than chance. EEG signals were averaged across the electrodes shown in the inset. (B) EEG topographies (across participants’ mean) for the unisensory visual signals (high reliability only) at −10°, −3.3°, 3.3°, and 10° shown at the given time points.

https://doi.org/10.1371/journal.pbio.3000210.s004

(TIF)

S3 Fig. Time-resolved decoding of auditory location for unisensory auditory stimuli.

(A) Time course of decoding accuracy (i.e., Pearson correlation between true and predicted stimulus locations, black line) and the EEG evoked potentials (across participants’ mean) for the unisensory auditory signals at −10°, −3.3°, 3.3°, and 10°, averaged over central channels. Shaded grey area indicates decoding accuracy significantly better than chance. EEG signals were averaged across the electrodes shown in the inset. (B) EEG topographies (across participants’ mean) for the unisensory auditory stimuli at −10°, −3.3°, 3.3°, and 10° shown at the given time points.

https://doi.org/10.1371/journal.pbio.3000210.s005

(TIF)

S4 Fig. Temporal generalisation matrix for audiovisual congruent trials.

The temporal generalisation matrix shows the decoding accuracy for audiovisual congruent trials across each combination of training (y-axis) and testing (x-axis) time point. The grey line along the diagonal indicates where the training time is equal to the testing time. Horizontal and vertical grey lines indicate the stimulus onset. The thin black lines encircle the cluster with decoding accuracies that were significantly better than chance at p < 0.05 corrected for multiple comparisons.

https://doi.org/10.1371/journal.pbio.3000210.s006

(TIF)

S5 Fig. Time-frequency results for oscillatory power in the alpha/beta band.

(A) Time courses of total power averaged over the alpha/beta (8–30 Hz) frequency bands (baseline corrected using prestimulus window [−400 ms to −200 ms]) are shown for the main effects of visual reliability (row 1), task relevance (row 2), spatial disparity (row 3), and the visual reliability × task relevance interaction (row 4) at three selected electrodes (i.e., Fz = left; Pz = middle; Oz = right columns). For each effect, we show the power for the difference (or interaction) and the individual conditions coded in different colours as indicated for each row. Grey shaded areas indicate the time windows where at least one electrode was part of the significant cluster after correcting for multiple comparisons across time (i.e., −200 ms to 700 ms), frequency (i.e., 4–30 Hz), and topography. (B) Topographies of the t values averaged across the significant time windows of the corresponding effects. Electrodes marked with black stars were part of the significant cluster (corrected across topography × time × frequency).

https://doi.org/10.1371/journal.pbio.3000210.s007

(TIF)

S1 Table. Model parameters (across-subjects mean ± SEM) and fit indices of the BCI models with different decision functions.

Model averaging (BCIavg), model selection (BCIsel), and probability matching (BCImatch). BCI, Bayesian causal inference; PEP, protected exceedance probability; R2, coefficient of determination; relBICgroup, group-level relative Bayesian information criterion [25].

https://doi.org/10.1371/journal.pbio.3000210.s008

(DOCX)

S2 Table. Time-frequency results.

Significant effects are shown across rows: overall power relative to baseline, the main effect of VR, the main effect of task relevance (‘Task’), and the interaction between VR and task relevance. Columns of the table indicate the approximate time windows that the significant cluster spanned. All p-values are reported at the cluster level, corrected for multiple comparisons over time × topography × frequency. VR, visual reliability.

https://doi.org/10.1371/journal.pbio.3000210.s009

(DOCX)

Acknowledgments

We thank Ulrik Beierholm and Tim Rohe for kindly providing source code and Ágoston Mihalik and Johanna Zumer for very helpful discussions.

References

  1. 1. Shams L, Beierholm UR. Causal inference in perception. Trends Cogn Sci. 2010;14: 425–432. pmid:20705502
  2. 2. Körding KP, Beierholm U, Ma WJ, Quartz S, Tenenbaum JB, Shams L. Causal inference in multisensory perception. PLoS ONE. 2007;2: e943. pmid:17895984
  3. 3. Alais D, Burr D. Ventriloquist Effect Results from Near-Optimal Bimodal Integration. Curr Biol. 2004;14: 257–262. pmid:14761661
  4. 4. Ernst MO, Banks MS. Humans integrate visual and haptic information in a statistically optimal fashion. Nature. 2002;415: 429–433. pmid:11807554
  5. 5. von Helmholtz H. Handbuch der physiologischen Optik [Internet]. Monatshefte für Mathematik und Physik. Leipzig, Germany: Leopold Voss; 1896. https://doi.org/10.1007/BF01708548
  6. 6. Fetsch CR, Turner AH, DeAngelis GC, Angelaki DE. Dynamic Reweighting of Visual and Vestibular Cues during Self-Motion Perception. J Neurosci. 2009;29: 15601–15612. pmid:20007484
  7. 7. Butler JS, Smith ST, Campos JL, Bülthoff HH. Bayesian integration of visual and vestibular signals for heading. J Vis. 2010;10: 23. pmid:20884518
  8. 8. Van Beers RJ, Wolpert DM, Haggard P. When feeling is more important than seeing in sensorimotor adaptation. Curr Biol. 2002;12: 834–837. pmid:12015120
  9. 9. Rohe T, Noppeney U. Reliability-Weighted Integration of Audiovisual Signals Can Be Modulated by Top-down Attention. eNeuro. 2018;5: 315–317. pmid:29527567
  10. 10. Helbig HB, Ernst MO, Ricciardi E, Pietrini P, Thielscher A, Mayer KM, et al. The neural mechanisms of reliability weighted integration of shape information from vision and touch. Neuroimage. 2012;60: 1063–72. pmid:22001262
  11. 11. Roach NW, Heron J, McGraw P V. Resolving multisensory conflict: A strategy for balancing the costs and benefits of audio-visual integration. Proc R Soc B Biol Sci. 2006;273: 2159–2168. pmid:16901835
  12. 12. Wallace MT, Roberson GE, Hairston WD, Stein BE, Vaughan JW, Schirillo JA. Unifying multisensory signals across time and space. Exp Brain Res. 2004;158: 252–258. pmid:15112119
  13. 13. Wozny DR, Beierholm UR, Shams L. Probability matching as a computational strategy used in perception. PLoS Comput Biol. 2010;6. pmid:20700493
  14. 14. Rohe T, Noppeney U. Cortical Hierarchies Perform Bayesian Causal Inference in Multisensory Perception. PLoS Biol. 2015;13: e1002073. pmid:25710328
  15. 15. Rohe T, Noppeney U. Sensory reliability shapes perceptual inference via two mechanisms. J Vis. 2015;15: 22. pmid:26067540
  16. 16. Rohe T, Noppeney U. Distinct computational principles govern multisensory integration in primary sensory and association cortices. Curr Biol. 2016;26: 509–514. pmid:26853368
  17. 17. Bonath B, Noesselt T, Martinez A, Mishra J, Schwiecker K, Heinze HJ, et al. Neural Basis of the Ventriloquist Illusion. Curr Biol. 2007;17: 1697–1703. pmid:17884498
  18. 18. Friston K. A theory of cortical responses. Philos Trans R Soc B Biol Sci. 2005;360: 815–836. pmid:15937014
  19. 19. Rao RPN, Ballard DH. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat Neurosci. 1999;2: 79–87. pmid:10195184
  20. 20. Talsma D. Predictive coding and multisensory integration: an attentional account of the multisensory mind. Front Integr Neurosci. 2015;09: 19. pmid:25859192
  21. 21. King JR, Dehaene S. Characterizing the dynamics of mental representations: The temporal generalization method. Trends Cogn Sci. 2014;18: 203–210. pmid:24593982
  22. 22. Picton TW. Late auditory evoked potentials: changing the things which are. Human auditory evoked potentials. San Diego: Plural Publishing; 2011. pp. 335–399.
  23. 23. Bishop CW, London S, Miller LM. Neural time course of visually enhanced echo suppression. J Neurophysiol. 2012;108: 1869–83. pmid:22786953
  24. 24. Shrem T, Murray MM, Deouell LY. Auditory-visual integration modulates location-specific repetition suppression of auditory responses. Psychophysiology. 2017;54: 1663–1675. pmid:28752567
  25. 25. Rigoux L, Stephan KE, Friston KJ, Daunizeau J. Bayesian model selection for group studies—Revisited. Neuroimage. 2014;84: 971–985. pmid:24018303
  26. 26. Maier JX, Groh JM. Multisensory guidance of orienting behavior. Hear Res. 2009;258: 106–112. pmid:19520151
  27. 27. Grothe B, Pecka M, McAlpine D. Mechanisms of Sound Localization in Mammals. Physiol Rev. 2010;90: 983–1012. pmid:20664077
  28. 28. Gardner JL, Merriam EP, Movshon JA, Heeger DJ. Maps of Visual Space in Human Occipital Cortex Are Retinotopic, Not Spatiotopic. J Neurosci. 2008;28: 3988–3999. pmid:18400898
  29. 29. Wandell BA, Dumoulin SO, Brewer AA. Visual field maps in human cortex. Neuron. 2007;56(2): 366–383. pmid:17964252
  30. 30. Werner-Reiss U, Groh JM. A Rate Code for Sound Azimuth in Monkey Auditory Cortex: Implications for Human Neuroimaging Studies. J Neurosci. 2008;28: 3747–3758. pmid:18385333
  31. 31. Ortiz-Rios M, Azevedo FAC, Kuśmierek P, Balla DZ, Munk MH, Keliris GA, et al. Widespread and Opponent fMRI Signals Represent Sound Location in Macaque Auditory Cortex. Neuron. 2017;93: 971–983.e4. pmid:28190642
  32. 32. Mullette-Gillman OA, Cohen YE, Groh JM. Eye-Centered, Head-Centered, and Complex Coding of Visual and Auditory Targets in the Intraparietal Sulcus. J Neurophysiol. 2005;94: 2331–2352. pmid:15843485
  33. 33. Godey B, Schwartz D, de Graaf J., Chauvel P, Liégeois-Chauvel C. Neuromagnetic source localization of auditory evoked fields and intracerebral evoked potentials: a comparison of data in the same patients. Clin Neurophysiol. 2001;112: 1850–1859. pmid:11595143
  34. 34. Yvert B, Fischer C, Bertrand O, Pernier J. Localization of human supratemporal auditory areas from intracerebral auditory evoked potentials using distributed source models. Neuroimage. 2005;28: 140–153. pmid:16039144
  35. 35. Bertelson P, Radeau M. Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys. 1981;29: 578–584. pmid:7279586
  36. 36. Bisley JW, Goldberg ME. Attention, Intention, and Priority in the Parietal Lobe. Annu Rev Neurosci. 2010;33: 1–21. pmid:20192813
  37. 37. Gottlieb J, Snyder LH. Spatial and non-spatial functions of the parietal cortex. Curr Opin Neurobiol. 2010;20: 731–40. pmid:21050743
  38. 38. Rohe T, Ehlis A-C, Noppeney U. The neural dynamics of hierarchical Bayesian inference in multisensory perception. Nat Commun. 2019. In Press.
  39. 39. Cao Y, Summerfield C, Park H, Giordano BL, Kayser C. Causal inference in the multisensory brain. bioRxiv 500413 [Preprint]. 2018 [cited 2019 Feb 28]. Available from: https://www.biorxiv.org/content/10.1101/500413v1
  40. 40. Bledowski C. Localizing P300 Generators in Visual Target and Distractor Processing: A Combined Event-Related Potential and Functional Magnetic Resonance Imaging Study. J Neurosci. 2004;24: 9353–9360. pmid:15496671
  41. 41. Linden DEJ. The P300: Where in the Brain Is It Produced and What Does It Tell Us? Neurosci. 2005;11: 563–576. pmid:16282597
  42. 42. Polich J. Updating P300: An integrative theory of P3a and P3b. Clin Neurophysiol. 2007;118: 2128–2148. pmid:17573239
  43. 43. Duncan-Johnson CC, Donchin E. On quantifying surprise: the variation of event-related potentials with subjective probability. Psychophysiology. 1977;14: 456–467. pmid:905483
  44. 44. Duncan-Johnson CC, Donchin E. The P300 component of the event-related brain potential as an index of information processing. Biol Psychol. 1982;14: 1–52. pmid:6809064
  45. 45. Tueting P, Sutton S, Zubin J. Quantitative evoked potential correlates of the probability of events. Psychophysiology. 1970;7: 385–394. pmid:5510812
  46. 46. Sutton S, Tueting P, Zubin J, John ER. Information Delivery and the Sensory Evoked Potential. Science. 1967;155: 1436–1439.
  47. 47. Donchin E, Coles MGH. Is the P300 component a manifestation of context updating? Behav Brain Sci. 1988;11: 357.
  48. 48. Hillyard SA, Hink RF, Schwent VL, Picton TW. Electrical Signs of Selective Attention in the Human Brain. Science. 1973;182: 177–180.
  49. 49. Johnson R. A triarchic model of P300 amplitude. Psychophysiology. 1986;23: 367–84. pmid:3774922
  50. 50. O’Connell RG, Dockree PM, Kelly SP. A supramodal accumulation-to-bound signal that determines perceptual decisions in humans. Nat Neurosci. 2012;15: 1729–1735. pmid:23103963
  51. 51. Kopp B. The P300 Component of the Event-Related Brain Potential And Bayes’ Theorem. In: Sun M-K, editor. Cognitive Sciences at the Leading Edge. New York: Nova Science; 2007. https://doi.org/10.13140/2.1.4049.4402
  52. 52. Kopp B, Seer C, Lange F, Kluytmans A, Kolossa A, Fingscheidt T, et al. P300 amplitude variations, prior probabilities, and likelihoods: A Bayesian ERP study. Cogn Affect Behav Neurosci. 2016;16: 911–928. pmid:27406085
  53. 53. Mars RB, Debener S, Gladwin TE, Harrison LM, Haggard P, Rothwell JC, et al. Trial-by-trial fluctuations in the event-related electroencephalogram reflect dynamic changes in the degree of surprise. J Neurosci. 2008;28: 12539–12545. pmid:19020046
  54. 54. Yordanova J, Kolev V, Polich J. P300 and alpha event-related desynchronization (ERD). Psychophysiology. 2001;38: 143–152. pmid:11321615
  55. 55. Lakatos P, Chen CM, O’Connell MN, Mills A, Schroeder CE. Neuronal Oscillations and Multisensory Interaction in Primary Auditory Cortex. Neuron. 2007;53: 279–292. pmid:17224408
  56. 56. Kayser C, Petkov CI, Augath M, Logothetis NK. Functional Imaging Reveals Visual Modulation of Specific Fields in Auditory Cortex. J Neurosci. 2007;27: 1824–1835. pmid:17314280
  57. 57. Molholm S, Ritter W, Murray MM, Javitt DC, Schroeder CE, Foxe JJ. Multisensory auditory-visual interactions during early sensory processing in humans: A high-density electrical mapping study. Cogn Brain Res. 2002;14: 115–128.
  58. 58. Noesselt T, Rieger JW, Schoenfeld MA, Kanowski M, Hinrichs H, Heinze H-J, et al. Audiovisual Temporal Correspondence Modulates Human Multisensory Superior Temporal Sulcus Plus Primary Sensory Cortices. J Neurosci. 2007;27: 11431–11441. pmid:17942738
  59. 59. Lewis R, Noppeney U. Audiovisual Synchrony Improves Motion Discrimination via Enhanced Connectivity between Early Visual and Auditory Areas. J Neurosci. 2010;30: 12329–12339. pmid:20844129
  60. 60. Werner S, Noppeney U. Distinct Functional Contributions of Primary Sensory and Association Areas to Audiovisual Integration in Object Categorization. J Neurosci. 2010;30: 2662–2675. pmid:20164350
  61. 61. Lee H, Noppeney U. Temporal prediction errors in visual and auditory cortices. Curr Biol. Elsevier; 2014;24: R309–R310. pmid:24735850
  62. 62. Schroeder CE, Foxe J. Multisensory contributions to low-level, “unisensory” processing. Current Opinion in Neurobiology. 2005. pp. 454–458. pmid:16019202
  63. 63. Atilgan H, Town SM, Wood KC, Jones GP, Maddox RK, Lee AKC, et al. Integration of Visual Information in Auditory Cortex Promotes Auditory Scene Analysis through Multisensory Binding. Neuron. 2018;97: 640–655.e4. pmid:29395914
  64. 64. Butler JS, Foxe JJ, Fiebelkorn IC, Mercier MR, Molholm S. Multisensory Representation of Frequency across Audition and Touch: High Density Electrical Mapping Reveals Early Sensory-Perceptual Coupling. J Neurosci. 2012;32: 15338–15344. pmid:23115172
  65. 65. Fetsch CR, Pouget A, Deangelis GC, Angelaki DE. Neural correlates of reliability-based cue weighting during multisensory integration. Nat Neurosci. 2012;15: 146–154. pmid:22101645
  66. 66. Fetsch CR, Deangelis GC, Angelaki DE. Bridging the gap between theories of sensory cue integration and the physiology of multisensory neurons. Nat Rev Neurosci. Nature Publishing Group; 2013;14: 429–442. pmid:23686172
  67. 67. Morgan ML, DeAngelis GC, Angelaki DE. Multisensory Integration in Macaque Visual Cortex Depends on Cue Reliability. Neuron. 2008;59: 662–673. pmid:18760701
  68. 68. Nardo D, Santangelo V, Macaluso E. Spatial orienting in complex audiovisual environments. Hum Brain Mapp. 2014;35: 1597–1614. pmid:23616340
  69. 69. Brainard DH. The Psychophysics Toolbox. Spat Vis. 1997;10: 433–436. pmid:9176952
  70. 70. Nichols T, Holmes A. Nonparametric Permutation Tests for Functional Neuroimaging. Human Brain Function: Second Edition. 2003. pp. 887–910.
  71. 71. Oostenveld R, Fries P, Maris E, Schoffelen JM. FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci. 2011;2011: 156869. pmid:21253357
  72. 72. Chang C-C, Lin C-J. Libsvm. ACM Trans Intell Syst Technol. 2011;2: 1–27.
  73. 73. Anderson CM, Wu CFJ. Measuring Location Effects from Factorial Experiments with a Directional Response. Int Stat Rev. 1995;63: 345–363. Available from: https://www.jstor.org/stable/1403484
  74. 74. Edgington ES, Onghena P. Randomization Tests [Internet]. 4th ed. SpringerReference. Boca Raton: Chapman & Hall/CRC; 2007. https://doi.org/10.1007/978-3-642-04898-2
  75. 75. Gonzalez L, Manly BFJ. Analysis of variance by randomization with small data sets. Environmetrics. 1998;9: 53–65.
  76. 76. Berens P. CircStat: A MATLAB Toolbox for Circular Statistics. J Stat Softw. 2009;31: 361–8.
  77. 77. Nagelkerke NJD. A note on a general definition of the coefficient of determination. Biometrika. 1991;78: 691–692.
  78. 78. Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc. 1995;90: 773–795.
  79. 79. Friston KJ, Holmes AP, Worsley KJ, Poline J-P, Frith CD, Frackowiak RSJ. Statistical parametric maps in functional imaging: A general linear approach. Hum Brain Mapp. 1994;2: 189–210.
  80. 80. Harper NS, McAlpine D. Optimal neural population coding of an auditory spatial cue. Nature. 2004;430: 682–686. pmid:15295602
  81. 81. Salminen NH, May PJC, Alku P, Tiitinen H. A population rate code of auditory space in the human cortex. PLoS ONE. 2009;4: e7600. pmid:19855836
  82. 82. Middlebrooks JC, Clock AE, Xu L, Green DM. A panoramic code for sound location by cortical neurons. Science. 1994;264: 842–4. pmid:8171339
  83. 83. Brugge JF, Reale RA, Hind JE. Spatial receptive fields of primary auditory cortical neurons in quiet and in the presence of continuous background noise. J Neurophysiol. 1998;80: 2417–32. pmid:9819253
  84. 84. Cichy RM, Teng S. Resolving the neural dynamics of visual and auditory scene processing in the human brain: a methodological approach. Philos Trans R Soc Lond B Biol Sci. 2017;372. pmid:28044019
  85. 85. Kriegeskorte N. Representational similarity analysis–connecting the branches of systems neuroscience. Front Syst Neurosci. 2008;2: 4. pmid:19104670
  86. 86. Rosa MJ, Daunizeau J, Friston KJ. EEG-fMRI integration: a critical review of biophysical modeling and data analysis approaches. J Integr Neurosci. 2010;9: 453–476. pmid:21213414