Buildup and bistability in auditory streaming as an evidence accumulation process with saturation

Quynh-Anh Nguyen; John Rinzel; Rodica Curtu

doi:10.1371/journal.pcbi.1008152

Abstract

A repeating triplet-sequence ABA₋ of non-overlapping brief tones, A and B, is a valued paradigm for studying auditory stream formation and the cocktail party problem. The stimulus is “heard” either as a galloping pattern (integration) or as two interleaved streams (segregation); the initial percept is typically integration then followed by spontaneous alternations between segregation and integration, each being dominant for a few seconds. The probability of segregation grows over seconds, from near-zero to a steady value, defining the buildup function, BUF. Its stationary level increases with the difference in tone frequencies, DF, and the BUF rises faster. Percept durations have DF-dependent means and are gamma-like distributed. Behavioral and computational studies usually characterize triplet streaming either during alternations or during buildup. Here, our experimental design and modeling encompass both. We propose a pseudo-neuromechanistic model that incorporates spiking activity in primary auditory cortex, A1, as input and resolves perception along two network-layers downstream of A1. Our model is straightforward and intuitive. It describes the noisy accumulation of evidence against the current percept which generates switches when reaching a threshold. Accumulation can saturate either above or below threshold; if below, the switching dynamics resemble noise-induced transitions from an attractor state. Our model accounts quantitatively for three key features of data: the BUFs, mean durations, and normalized dominance duration distributions, at various DF values. It describes perceptual alternations without competition per se, and underscores that treating triplets in the sequence independently and averaging across trials, as implemented in earlier widely cited studies, is inadequate.

Author summary

Segregation of auditory objects (auditory streaming) is widely studied using ambiguous stimuli. A sequence of repeating triplets ABA₋ of non-overlapping brief pure tones, A and B, frequency-separated, is a valued stimulus. Studies typically focus on one of two behavioral phases: the early (say, ten seconds) buildup of segregation from the default integration or later spontaneous alternations (bistability) between seconds-long integration and segregation percepts. Our experiments and modeling encompass both. Our novel, data-driven, evidence-accumulation model accounts for key features of the observations, taking as input recorded spiking activity from primary auditory cortex (as opposed to most existing, more abstract, models). Our results underscore that assessing individual triplets independently and averaging across trials, as in some earlier studies, is inadequate (lacking neuronal-accountability for percept duration statistics, the underlying basis of buildup). Further, we identify fresh parallels between evidence accumulation and competition as potential dynamic processes for choice in the brain.

Citation: Nguyen Q-A, Rinzel J, Curtu R (2020) Buildup and bistability in auditory streaming as an evidence accumulation process with saturation. PLoS Comput Biol 16(8): e1008152. https://doi.org/10.1371/journal.pcbi.1008152

Editor: Wolfgang Einhäuser, Technische Universitat Chemnitz, GERMANY

Received: January 3, 2020; Accepted: July 15, 2020; Published: August 27, 2020

Copyright: © 2020 Nguyen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All code, data, and analysis files for this report can be found at: https://osf.io/b32uf/.

Funding: RC was supported by the National Science Foundation, CRCNS grant #1515678; https://www.nsf.gov/ and by the Simons Foundation, grant #267196, https://www.simonsfoundation.org/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Stimulus sequences of interleaved A and B pure tones have been widely used in studying segregation of distinct objects in an auditory scene (auditory streaming), in human psychophysics [1–6], invasive neurophysiology [1, 3, 7, 8], or in experiments implementing both [9]. A valued stimulus is triplet-streaming ABA₋ with the tone frequency difference, DF, as a tunable parameter [10]; Fig 1. For small DF human listeners most likely perceive integration (one galloping rhythm); for DF large, segregation dominates (two simultaneously heard parallel streams). The initial percept is typically integration but within seconds the probability of segregation increases (“the buildup phase”) and perceptual switching eventually occurs (“perceptual bistability”). Alternating percepts have variable durations, described by either gamma or lognormal distributions [2]. Time courses of spiking activity (macaque, primary auditory cortex, A1, [1]) show dynamical features (adaptation over 1-2 seconds) that were interpreted as neural correlates of buildup, although the behavioral and physiological experiments were not conducted together [1, 3].

Download:

Fig 1. Stimulus, buildup, and distribution of first percept durations in auditory streaming of triplets.

A: Stimulus paradigm (left) used for behavioral experiments and corresponding percept types (right). Stimulus consists of sequences of high (A) and low (B) pure tones presented as repeated triplets ABA₋ where ‘₋’ denotes silent gap. Depending on DF between tones A and B, there are two fundamental percepts: integration (I; blue), one connected stream with galloping rhythm, and segregation (S; red), two parallel streams of high tone A₋ A₋ A₋ A₋ and low tone ₋B₋₋₋ B₋ occurring simultaneously. B: Computation of the buildup function (time course of probability of S) obtained by determining the frequency of occurrence of S over all trials at each time point τ up to 30 s (45 trials for each DF and subject; 15 subjects). Non-S includes both latency (gray) and I (blue) states. Due to latency, the buildup function always starts at 0 even though the first percept is not necessarily I. For example, in Trial 2, the first percept is S. C: Experimental-based psychometric buildup function (upper panel) and distribution of first percept durations (middle and lower panels). Buildup functions are computed for DF = 3 (green), 5 (red), and 7 (cyan). The error bars indicate 95% CI around the mean using statistical bootstrapping. Durations are normalized by dividing by mean duration. Likelihood ratio test confirms that normalized first percept durations are gamma distributed—shown here at DF = 5 for I (N = 533, p = 0.49) and S (N = 114, p = 0.47). The shape parameters α, obtained by Maximum Likelihood Estimation (MLE), and the mean durations μ are indicated in the graphic. D: Model-based simulated buildup function (upper panel) and distribution of normalized first percept durations (middle and lower panels). Buildup functions from the evidence accumulation model (EVA; solid) closely resemble those from the behavioral experiment (dashed, also in C). Normalized first percept durations are gamma distributed (shown at DF = 5). Similar results are obtained for other DF values; see S1 Fig.

https://doi.org/10.1371/journal.pcbi.1008152.g001

Dynamics of buildup and/or perceptual alternation for ambiguous auditory stimuli were described by computational models based on signal processing [1, 3, 11–13], competition dynamics [6, 14], coupled-oscillator patterning [15, 16], evidence accumulation [5], and statistical descriptions [17]; also reviews by [18] and [19]. However, with few exceptions (e.g. [1, 6]) these models did not incorporate neurophysiological data. Furthermore, experimental and modeling studies primarily focused on either buildup, describing the probability of segregation during short, tens of seconds, trials [1, 3, 4, 20, 21], or on the stationary phase of alternations, characterizing the statistics of percept durations over long, several minutes, trials [2, 5, 6].

Here we designed the experiment (30 s trials with many trials per condition/subject) so that we could characterize these features simultaneously. Then we proposed a model that takes spike-recordings from A1 as input, and accounts for both the behavioral time course of buildup and the observed duration statistics during alternations, over a range of DF values: 3, 5, 7 semitones. Our model is neuromechanistic-like, transforming the neuronal input for processing in two evidence-accumulators downstream of A1. From the input-sensory level, sampling of spike counts across A1-units provides a measure for the contribution of each triplet to the evidence-accumulation stage; if evidence against the current percept exceeds a threshold then a perceptual switch occurs and accumulation resets. This approach parallels in spirit Barniv and Nelken’s model [5] although that was implemented from a Bayesian-viewpoint. Our model is data-driven: input is neural-based; initial parameters are estimated from our behavior data (mean probability of segregation) then fine-tuned to match the gamma-distributed percept durations.

We propose that although the model is not competition-based it shares some features of such approaches: Adaptation is key in competition dynamics; evidence accumulation might be viewed as recovery from adaptation. Matching duration statistics with competition requires some balancing of noise and adaptation [22]; its analogue is the interplay between accumulation and noise. Adaptation strength, when set near the boundary between noise-free oscillatory and noise-driven attractor dynamics, constrains dominance durations [23]; comparably, our accumulators have a novel feature of saturation which if set below but near the switching threshold, produces observed statistics only if adequate noise is present.

Importantly, our modeling highlights that accounting for the duration statistics of behavioral data is key when studying auditory bistable perception. With quantitative matches to these data the buildup phase is then naturally reproducible by an alternating renewal process [17]. We show that a widely cited signal-detection approach [1, 3, 12, 21], based on treating each triplet independently without accumulation, that overlooked this crucial feature does not account for the single-trial percept duration statistics. We argue for caution when applying it to test neural-inspired behavioral hypotheses.

Results

We first outline the rationale for our study and presentation. In behavioral experiments human participants continuously reported their ongoing perception, integration or segregation, which after analysis yielded distributions for percept durations (section Auditory triplet-streaming). We introduce the essence of our EVidence Accumulation (EVA) model in a basic form (section A basic state-dependent model). With each triplet we suppose there is an incremental urge r, to switch from the current percept/interpretation to the alternate one; r is the “drift” rate for the event sequence that, with zero-mean noise, drives fluctuating accumulation in the EVA model that eventually surpasses threshold. We illustrate that this basic model captures the duration statistics for a chosen case, near-equidominance. We next elaborate the model by formulating a neuronal basis for evaluating r (section Linking neural data with behavioral data). We utilize the single-unit spike counts for A-tone selective A1 neurons recorded over a range of experimental conditions [1] and applied them to our case of DF = 3, 5, 7. The relative responses to B-tones are viewed, according to the population separation hypothesis [7], as evidence for segregation (against integration) when spike counts are generally smaller, or against segregation when larger. A challenge arises. If N_in A1 neurons are recorded the spike count deviates from the mean like . Thus, if N_in is large and one supposes a fixed threshold for signal detection, the classification based on spike counts becomes binary and problematic for resolving a perceptual response that is graded over conditions. Our full EVA model attempts to meet the challenge by having a two-layer pre-processing stage that includes N_sl units, each of them sampling a few A1 neurons (N_in not large). The proportion of N_sl units which respond to thresholded activity over neuronal ensembles in A1 provides the incremental evidence, the value of r, for the accumulator that favors integration. The complementary proportion of N_sl units that do not respond to thresholded A1 activity provides the incremental evidence to the accumulator against integration.

Our approach overcomes two shortcomings of a well-known signal detection model for auditory streaming [1]. The Micheyl et al treatment [1] does not account for single-trial data, the duration distributions which form the basis for computing the buildup function (BUF); it averages across trials, without accumulating evidence event-to-event. The signal detection scheme of Micheyl does not resolve, with N_in large, a family of BUFs that show gradation across conditions.

In short, we combined neural data from [1] with behavioral data from our experiments to investigate if the signal detection model when applied on a single-trial basis could yield percept durations in a self-consistent fashion. We found it did not; and moreover that it was unable to fit buildup functions that, for different stimulus conditions, were graded, not widely separated (section on page 16). We then developed a neural-based evidence accumulation-like explanation of the observed data, as alternative to explicit competition, and with the advantage of being intuitive (section on page 6).

Auditory triplet-streaming

Experimental protocol.

Fifteen human subjects with normal hearing listened to sequences of repeating ABA₋ triplets and were instructed to continuously report their ongoing percept by selectively pressing one of two different buttons on a keypad. Subjects began reporting their percept typically 2 s after stimulus onset as integration (I; a single, coherent stream ABA₋ ABA₋) or segregation (S; two distinct streams A₋ A₋ A₋ A₋ and ₋B₋₋₋ B₋). Stimuli were sequences of triplets ABA₋ that consisted of alternating high (A) and low (B) pure tones followed by a 125 ms silent pause “₋” (Fig 1A and 1B). In total, triplets were 500-ms in duration and were repeated 60 times per trial. Tones were separated in frequency by DF semitones chosen from three conditions (DF = 3, 5, 7) with each condition being presented five times per experimental block (nine blocks total). This resulted in group data (from 15 subjects and 45 trials per subject) with 675 30-s trials for each of three DF values.

Behavioral task performance.

For each DF condition the buildup function was constructed by computing the probability of segregation from trial-averaging (Fig 1B and 1C). The buildup functions started at zero and increased over time before stabilizing to certain DF-dependent asymptotic values, similar to reports by [1, 3, 5, 12]. They started at zero due to the latency period (when no percept was identified) and not because the initial percept was I; see Methods, also [4]. While I first percepts were indeed more likely, S first percepts were reported too. The proportion of segregation as initial percept increased with DF from 103 out of 675 trials at DF = 3 to 137 at DF = 5 and 220 at DF = 7. The probability of segregation increased faster and reached higher levels at larger DF, with transient times of approximately 16, 10, 5 s after stimulus onset and with asymptotic values 0.45, 0.6, and 0.65 at DF = 3, 5, 7 respectively.

We computed distributions of normalized phase durations for subsequent durations, separately for each DF, and found them to be gamma-like, consistent with previous results on subsequent percepts [2, 5, 6]. Herein we report that duration distributions of the first percept are also gamma-like (Fig 1C; see also S1 Fig). We used statistical bootstrapping to compute the shape parameter α of each gamma distribution (see Methods), and determined that α ≈ 2 for normalized first durations and α ≈ 2.6 for subsequent durations. The distributions satisfied the scaling property γ₁ ≈ 2CV with skewness γ₁ and coefficient of variation CV ≈ 0.7 and CV ≈ 0.6 respectively, similar to reports by [24]. For integration, first percept durations were found to be longer in the mean than subsequent percept durations (with statistical significance near equidominance; p-value of 0.0003 at DF = 3 and 0.0184 at DF = 5, right-sided Wilcoxon rank-sum test at significance level 5%). Mean durations of first I-percept were 10.9, 5.3 and 3.1 s, decreasing with DF = 3, 5, 7 (p-value of 0.0002 when comparing DF = 3, 5 and 0.0014 for DF = 5, 7). Mean durations of first S-percept were 3.5, 6.6, 8.1 s (comparisons did not produce statistically significant differences, possibly due to fewer instances of first S percepts). For subsequent percepts the means were the following: 5.4, 3.4, 3.1 s for I and 4.9, 5.2, 5.6 s for S at DF = 3, 5, 7, showing a decreasing trend for integration between DF = 3 and DF = 5 or 7.

Auditory streaming as an evidence accumulation process

Herein we propose an evidence accumulation model that accounts for the observed dynamical features of buildup and alternations: gamma-like distributions for first and subsequent durations, DF-dependent mean durations, and psychometric buildup functions. Data-based [1] estimates of spike counts of neurons in the primary auditory cortex (area A1) are sampled by a population of units and their summed responses lead to a population vote and to an increment of evidence “for” and “against” the current percept. When enough evidence has built up against the current percept, there is a switch to the opposite percept. Current increments can be positive or negative but only when the accumulated evidence is adequate, does a switch occur.

A basic state-dependent model for evidence accumulation.

Our EVA model describes activity that accumulates and saturates at a target-level, T, just-subthreshold. The activity X_n is updated at the nth triplet according to: (1) where T < 1 (assuming a unitary threshold) and where are independent random variables (Gaussian noise of zero mean and standard deviation σ). The activity increments are state dependent and proportional to the difference T − X, with constant rate r. Accordingly, the activity X drifts towards T stochastically if 0 < r < 1. Accumulation slows with X_n near T and the activity can cross the threshold only due to noise. At each threshold-crossing X_n is reset to a value X_R taken as the initial condition for the subsequent dynamics. The time D between successive threshold crossings represents a percept duration; it equals N_D, the number of triplets between threshold crossings multiplied by the onset time from one ABA₋ to the next (500 ms).

Phenomenologically, Eq (1) accounts for the features of the behavioral data described in section Auditory triplet-streaming: the observed DF-dependent mean durations, the gamma-like shape of the distributions for first and subsequent durations, and the time course of the psychometric buildup. As an example, consider DF = 5 and take r = 0.6, which is the asymptotic, approximate value of the behavioral buildup, near equidominance (Fig 1C; red curve). With initial and resetting conditions X₀ = 0.7 and X_R = 0.6, and with parameter settings T = 0.9, σ = 0.085, we simulated Eq (1) 675 times. The computed distribution of first integration normalized durations and the corresponding trial-averaged buildup function (Fig 2; right panels) are in agreement with Fig 1C and 1D.

Download:

Fig 2. A basic state-dependent model for evidence accumulation yields percept durations that are gamma-like distributed and with mean values similar to those observed in behavioral data.

To demonstrate the robustness of the model results and dependence on parameter values we simulated Eq (1) with various values for target T and noise level σ. Shown for DF = 5 with r = 0.6: (Left) Two-parameter response diagram of the first I-percept with respect to T and σ. There is no switch for very small noise levels (na; gray area). Threshold-crossing activity appears with increased noise and leads to percept durations that are distributed according to normal distributions (region N), gamma-like distributions (region G between the black dashed and black solid curves), or exponential distributions (region E). Parameter values that lie on the sheets of black and gray dots yield numerically generated first integration mean durations within one and two standard deviation(s) of the experimental mean. (Right) Insets are shown for T = 0.9, σ = 0.085 (black diamond in the diagram): computed distribution of first integration normalized durations and the early phase of numerical buildup obtained during one simulation run of Eq (1) are in agreement with behavioral data. For simplicity, the drift rate r was kept constant to 0.6 between all threshold crossings.

https://doi.org/10.1371/journal.pcbi.1008152.g002

To demonstrate the robustness of the model results and dependence on parameter values, we simulated Eq (1) with various values for target T and noise level σ. The simulations took into account the latency period during each trial, and the proportion of first percepts reported as integration and segregation, during the behavioral experiment at DF = 5 (as in Switches and resetting conditions, in Methods). In this way we could assign a “percept”-type identity label to each event between consecutive resets. For any fixed T we found no threshold-crossings when σ was small. Alternations between “percepts” occurred only as σ increased, with dominant durations distributed as follows (e.g. for first I-percept; see Fig 2): normal distributions (region labeled N), gamma-like distributions with shape close to that found experimentally (region G) and exponential distributions (region E), respectively. The values of σ and T for which numerically generated mean percept durations were within one and two standard deviation(s) of the experimental mean were also calculated (Fig 2; sheets of black dots and gray dots). There is, therefore, a region in the parameter space where critical statistical properties of the behavioral data can be reproduced.

Linking neural data with behavioral data in the EVA framework.

As in Micheyl et al. [1] we seek to relate perceptual buildup and bistability reported by human subjects during triplet-streaming to animal neural data. Spiking activity evoked by the B-tone was recorded from tone-A-selective neurons in macaque primary auditory cortex, A1 [1]. At any triplet position in the ABA₋ sequence the mean spike counts, m_DF, decreased with increased DF. The time course of m_DF exhibited fast adaptation and stabilized by the third triplet [1, (Fig.3 in Ref)].

For modeling we assume that individual spike counts are Poisson distributed with means m_DF, and that m₃ > m₅ > m₇ for DF = 3, 5, 7. The responses of A1 neurons will be processed by downstream neurons whose responses at each triplet then feed into the EVA accumulator. Suppose that N_in A1-neurons activate a neuronal unit downstream if the mean input exceeds a threshold C_th (2) where H(⋅) is a Heaviside function. Such a sampler neuron is binary, taking a value u of 0 or 1 with probabilities p and 1 − p, for triplets after a brief transient phase of adaptation in A1. In line with the population separation hypothesis [7], when spike counts are large (so u = 1) we tag the sampler as evidence for integration; likewise when spike counts are small (u = 0), we tag the sampler as evidence against integration. As N_in increases, the averaged spike count variability around the mean m_DF decreases inversely with (Fig 3). As such, probabilities p_DF obtained from m_DF at different DF = 3, 5, 7 vary from values being graded (when N_in is small) to values spread apart (when N_in is large), approaching extreme values of zero or one (when N_in is very large); Fig 3. A suitable variability in the A1-neuronal population can be chosen (more about this later; sections on pages 9 and 16) to ensure that probabilities p_DF at DF = 3, 5, 7 achieve the graded asymptotic values of the behavioral buildup functions (e.g. 0.45, 0.6, and 0.65 as in Fig 1C). This is an important feature of the EVA model; indeed, without adequate variability in the readout of A1 responses we cannot account for graded BUF levels.

Download:

Fig 3. Linking neural data with behavioral data in the EVA framework.

Individual spike counts of A1 neurons are assumed to be Poisson with means m_DF such that m₃ > m₅ > m₇ for DF = 3, 5, 7. Averaged spike counts over N_in A1-neurons, , are normal-like distributed with means m_DF and standard deviation decreasing inversely with ; shown in gray (DF = 3), black (DF = 5) and light-gray (DF = 7) for N_in = 10 (upper panel) and N_in = 100 (lower panel). At each triplet, < Spk > activates a sampler unit downstream if it exceeds a threshold C_th (solid black, vertical line). The area under the probability distribution to the left of C_th (white-dots pattern; DF = 5) determines the probability p of the sampler neuron to be inactive (0); the complementary probability 1 − p is for the sampler to be active (1). For each N_in, the threshold C_th was chosen such that p = 0.6 at DF = 5, which is the asymptotic, approximate value of the corresponding behavioral buildup near equidominance (see Fig 1C; red curve). Probabilities p obtained at different DF vary from values being graded when N_in is small (N_in = 10), to values spread apart approaching zero or one when N_in is large (N_in = 100). A suitable variability in the A1-neuronal population is key if aiming to account for graded BUF levels observed in behavioral data (Fig 1C).

https://doi.org/10.1371/journal.pcbi.1008152.g003

A question remains: How can we link the probability of a sampler unit becoming active stimulated by A1-neurons to the neuronal drive of the accumulator, r in Eq (1)? We resolve this problem by including an entire layer of binary units u as above, say N_sl total, and use the percentages p_I, p_S, as cluster sizes, of active and inactive samplers as input-drive to two accumulators: for and against integration, respectively (Fig 4A). In particular, the output p_S of the sampler layer (not binary anymore) is a stochastic process with mean p and variance p(1 − p)/N_sl (for justification, see Statistical properties of SL-activation, in Methods). Noteworthy, under this construction, the output p_S of the sampler layer (the input to the accumulator “against integration”) takes indeed values very close to r, defined as p, in Eq (1) if N_sl is large enough.

Download:

Fig 4. Accumulation model as feed-forward auditory network of 3 layers.

A: State of neurons at triplet t in the input layer and sampler layer of the evidence accumulation model. Input layer comprises A1 units with (triplet- and DF-dependent) mean spike counts presented in panel B. Sampler layer has N_sl = 20 binary neuronal units, either in state 1 (blue; favoring I percept) or state 0 (red; favoring S percept). Each unit samples a small number of input units (N_in = 5) and the averaged spike count across the units is compared to C_th (see panel B) to determine the unit’s appropriate perceptual state. B: Mean spike counts (scatter plot) for tone B of tone-A-selective neurons, and exponential fit (solid) of mean spike counts. These values are interpolated for our specific DF = 3,5,7 using data from cortical area A1 of awake macaque extracted from [1]. A Poisson spike count is generated using the mean value at each triplet. Asymptotic values of mean spike count (printed in parenthesis next to corresponding DF values) are used to generate spike counts after the 20-th triplet. Poisson spike counts are averaged across sets of N_in = 5 neuronal units, and the resulting values are subject to a binary neural threshold C_th (black horizontal line). The error bars indicate the standard errors of the mean spike counts. C: Accumulation layer has 2 accumulators drifting over successive triplets towards their own target values T_a and T_f where T_a > T_f. Their activities are governed by input factors from the sampler layer and stochastic factors. The noise level depends on the target (σ_a > σ_f). During a cycle, the suppressed unit accumulates evidence against the current percept. A switch to the other percept occurs when the accumulator of the suppressed unit reaches the switching threshold of 1. A new cycle starts, with accumulators reset to appropriate values, and targets values switched to corresponding perceptual states. Shown for DF = 5. For other DF values, see S2 Fig. For the complete list of parameter values, see Methods—section Parameter values used in model simulations.

https://doi.org/10.1371/journal.pcbi.1008152.g004

The EVA model.

Our proposed EVA model is structured as a three-layer network (Fig 4; see details in Methods). It takes Poisson spike counts from tone-A-selective A1-neurons (the Input Layer, IL) [1, 7, 8] and passes them through binary units in the Sampler Layer, SL (Fig 4A). Only spike counts recorded during tone B are included (Fig 4B). Each SL-unit compares the averaged spike count across a small number N_in of input units to a fixed threshold C_th and places the outcome into either state 0 (for S) or 1 (for I). High activation in IL (above C_th) is assumed to support percept I while low activation facilitates percept S (Fig 4A and 4B). The proportions p_I(t), p_S(t) (p_S = 1 − p_I) of SL-units in states 1 and 0, together with stochastic noise terms ξ_I(t), ξ_S(t), modulate the activity of the Accumulation Layer, ACC (Fig 4C). Two accumulators representing evidence for the percepts drift towards two targets. Their activities x_I, x_S are updated at discrete time steps determined by the position t of each triplet in the ABA₋ sequence. One unit accumulates evidence for the current percept (e.g. x_I during integration) in the presence of additive “neural” noise defined by a Gaussian process of strength σ_I = σ_f, and approaches target T_I = T_f. The other accumulator works against the current percept (x_S during integration). It experiences stronger noise level σ_a, and approaches another target, T_a. Differential noise levels enable the accumulator “against” to be the first to reach the threshold and initiate the switch; meanwhile, the accumulator “for” remains confined to a neighborhood of its target. In the deterministic (noise free) case, alternations between percepts are not possible given that both T_a and T_f are subthreshold targets (T_f < T_a < 1), a distinctive feature of our accumulation model. Instead, the ACC system is bistable with accumulators x_I, x_S reaching either steady state (T_a, T_f) or (T_f, T_a) depending on the initial conditions (Fig 4C, dotted lines in blue and red). In the presence of noise, however, the accumulator against the current percept reaches the decision threshold; a switch to the other percept occurs, the accumulators are reset, the targets are swapped (T_S = T_f, σ_S = σ_f and T_I = T_a, σ_I = σ_a), then another accumulation cycle begins (Fig 4C, traces for x_I, solid blue, and x_S, solid red; the percept’s type is identified by the background color, blue for I, red for S. See also S2 Fig). It is essential that the accumulators are subjected to noise in order for the distribution of threshold crossing events idealizing the percept durations to exist.

EVA model captures DF-dependence of mean durations.

Numerical simulations of the EVA model followed the experimental setup with N_tr = 675 repetitions (trials) per DF. The model-generated mean durations were computed separately for each percept type (I, S; first, subsequent) and DF. They approximated well their counterparts from behavioral data (Figs 5B and 6B). They also captured two important DF-related trends reported by other studies. First, near equidominance (DF = 3, 5) mean durations of the first I percept were found to be longer than those of subsequent I percepts [2, 6]. Secondly, mean durations for I and S showed a “cross-diagram” like behavior [6] with equidominance near DF = 5. Mean durations for I were greater than mean durations for S when DF low (DF <5), and smaller than mean durations for S when DF large (DF >5), results similar to [6]. The model was robust to noise as demonstrated by 100 Monte Carlo runs of each DF simulation that yielded consistent results in terms of average values and 95% CI (Figs 5B and 6B; error bars).

Download:

Fig 5. EVA model yields realistic first percept durations.

A: Mean percept durations (top) and fitted α value (bottom) from gamma distribution of first I (blue) and first S (red) percepts from behavioral experiment for DF = 3,5,7. The error bars indicate 95% CI around the mean and are obtained from statistical bootstrapping (see Methods). The mean durations of I decrease with DF while those of S increase with DF. The shape parameters α from gamma-fit using MLE for DF = 3,5,7 are also presented here. There is no observed trend for α values. B: Mean percept durations (top) and fitted shape parameter α (bottom) from gamma distribution of first I (blue) and first S (red) states from EVA model. The error bars are 95% CI obtained from 100 Monte Carlo runs to show the robustness of the model. The mean values of duration follow the similar trend as those from experiment. Also, the shape parameters show a close resemblance to those from the experiment. Related results are included in S5 Fig.

https://doi.org/10.1371/journal.pcbi.1008152.g005

Download:

Fig 6. EVA model yields realistic subsequent percept durations and distributions.

A: Distribution of normalized subsequent percept durations (top) and other properties (bottom) from behavioral experiment. (Top) Likelihood ratio test confirms that both subsequent I (blue; N = 1642, p = 0.49) and S (red; N = 1785, p = 0.49) percepts follow a gamma distribution, shown here for DF = 5. The shape parameters α computed using MLE, and the mean durations μ are shown in the graphic. (Bottom) Mean subsequent durations for I (blue) and S (red) for DF = 3,5,7. The error bars indicate 95% CI around the mean, computed using statistical bootstrapping (see Methods). Similar to the first percept, mean durations of subsequent I and S show a “cross-diagram” like behavior [6, (Fig.9B in Ref)] with equidominance near DF = 5; the ratio between mean durations for I and S percepts changes from larger than 1 to smaller than 1 when crossing DF = 5, near equidominance. The shape parameters α from MLE for DF = 3,5,7 are also presented, and no trend for α values is found. B: Distribution of normalized subsequent percept durations (top) and properties (bottom) from EVA model. (Top) Normalized subsequent percepts are gamma distributed for DF = 5 with mean durations μ and shape parameters α shown in the figure; similar results are obtained for other DF values, see S1 Fig. (Bottom) Mean percept durations and fitted shape parameters α for DF = 3,5,7 from EVA model. Mean subsequent durations follow the same trend and the shape parameters have similar values as compared to those from the experiment. The error bars are 95% CI obtained from 100 Monte Carlo runs of the model. The result shows the robustness and consistency of the model. Related results are shown in S5 Fig.

https://doi.org/10.1371/journal.pcbi.1008152.g006

In EVA model, the switch to a new accumulation cycle occurred when the ACC-unit that accumulated evidence against the current percept reached the decision threshold. Target-against T_a, noise level σ_a, and increment rate p_S(t) determined the trajectory of the suppressed unit x_S and the length of the corresponding dominant percept I. Similarly, T_a, σ_a and p_I(t) determined the duration of percept S. We studied the effect of T_a and σ_a on the model-generated mean durations at each DF by varying their values while keeping all other parameters fixed (see Parameter values used in model simulations, in Methods). At a given T_a, EVA model exhibited no alternations if σ_a was small (Fig 7; region in gray). For moderate σ_a values, perceptual switches occurred but yielded percepts of mean durations much longer than those found experimentally (in warm colors); then, for larger σ_a, simulated durations became comparable to (in green) and then much shorter than (in cool colors) the behavioral mean durations. Similar results were obtained for σ_a fixed when varying T_a. As a general rule, the smaller the target-against, the stronger the noise level had to be in order for the accumulator’s trajectory to be pushed above the threshold and to generate acceptable statistical approximations of the data (Fig 7, in green; also black dots; within one standard error to the experimental mean, SEM).

Download:

Fig 7. Dependence of mean percept durations and shape of distributions on target T_a and stochastic term σ_a, in the EVA model.

Diagrams show the difference between the mean duration μ derived from numerical simulations of the model and mean μ_exp from the behavioral data, represented as ratio μ/μ_exp; see the color scheme. Results are shown for A: first and B: subsequent integration and segregation percepts at conditions DF = 3,5,7. Given a fixed value for T_a, the dynamics changes from no alternations between percepts at small σ_a (na; in gray); to alternations of mean durations much longer than the experimental mean (region in warm colors); to mean durations that approximate well the corresponding experimental values (within one standard error to μ_exp; in green; black dots depict a discrete selection of values in the green region); then, to mean durations much shorter than μ_exp (region in cool colors). In each diagram, σ_a, T_a that were used to generate model-based results are identified by a red diamond (see Methods, σ_a = 0.085, T_a varies). Besides mean durations, the shapes of the gamma-like distributions that fit normalized percept durations depend on T_a and σ_a as well (α is the shape-parameter in the gamma-fit; see Eq (3) in Methods). There are three main regions that characterize α and they are delineated by the dashed-white and solid-white curves. Low-level of noise σ_a yields normal distributions (region N, to the left of dashed-white line; α ≫ 3) while high-level of noise yields exponential distributions (region E, to the right of solid-white line; α near 1). For intermediate level of noise, the distributions are gamma-like with shape close to that found experimentally (region G, between white contours; α ≈ 2 for first percepts and α ≈ 2.6 for subsequent percepts; α differs from α_exp by relative error up to 20% except for integration at DF = 7 where it is up to 30%). The parameter range where both model-generated mean duration and shape of distribution are good approximations of their corresponding experimental observations is found at the intersection of region G with the sheet of black dots. Related results are shown in S3 Fig.

https://doi.org/10.1371/journal.pcbi.1008152.g007

The decrease in mean durations of I percepts with increasing DF (Figs 5B and 6B, blue) stemmed from the increase in probability of a sampler to support segregation (Fig 8B) which led to an increase of increment rate p_S of accumulator x_S (see Statistical properties of SL-activation, in Methods). The increasing trend of mean durations of S percept, with DF, could also be associated with the decrease of increment rate p_I of accumulator x_I. These DF-dependent properties of p_S, p_I, inherited from A1 spike counts (Fig 8A) enabled the EVA model to capture the correct qualitative trend of the experimental means across all percept-types and DF conditions. Suitable quantitative agreements were then obtained by fine-tuning the value of target-against T_a (Fig 7, red diamond; error between numerical and behavioral results was restricted to 0.1 SEM. See also S3 Fig).

Download:

Fig 8. Parameter fitting for input and sampler layers in the EVA model.

The signal detection algorithm for constructing a neurometric function (the probability of a sampler to support the segregation percept) utilizes spike count time courses as shown in panel A (data extracted from [1] and interpolated for the cases DF = 3, 5, 7); see below for more detail. The behavioral buildup functions (dashed, in panel B) occupy intermediate ranges of probability of S, and show slow initial rise for DF = 3, 5. The simulated functions (solid, in panel B) do not capture the slow-rising phase of behavior buildup, and the spread between the neurometric curves increases unacceptably at larger N_in. For an optimal choice of parameters N_in, C_th, the algorithm yields well-fit asymptotic values of behavioral data. A: Mean spike counts m_t;DF are interpolated at DF = 3, 5, 7 st from data in [1], and then extended for triplets t ≤ 60; see Methods. (Note: In [1, (Fig.3 in Ref)] mean spike count data were shown for A-tone selective neurons in A1 during triplet tones at DF = 1, 3, 6, 9. They decreased exponentially and stabilized within a few seconds. Mean spike counts changed with DF only during B-tone.) Herein, spike counts during B-tone are generated using Poisson processes of means m_t;DF (DF = 3, 5, 7) and then averaged over N_in neuronal units of the input layer IL (e.g. N_in = 1, 5, 30, 100). The average values of the mean spike counts and the standard error to the mean (SEM) are computed over 675 trials. Averages, including asymptotic values (written in parenthesis, at each DF), do not change with N_in but SEM decreases with a factor of . B: The signal detection algorithm [1] generates neurometric functions using numerical data from IL-pools of N_in neuronal units; parameter C_th is chosen to yield the least-squares error of the experimental buildups and the computer-simulated neurometric functions for DF = 3,5,7. If N_in is small the neurometric curves tend to bunch together due to overlapping and large SEM regions across conditions. As N_in gets bigger, the neurometric curves are pushed apart. The best approximation to the set of psychophysical buildups is obtained for N_in = 5, C_th = 4.21.

https://doi.org/10.1371/journal.pcbi.1008152.g008

EVA-modeled first and subsequent percept durations match observations.

The model reproduces an important statistical feature of the behavioral data, the distributions of normalized durations for all first and subsequent I, S percepts at DF = 3, 5, 7. Histograms were drawn and fitted by gamma probability density functions of shape parameters α (see Eq (3) in Methods) whose values agreed with those from the behavioral experiment. The shape of distributions was tested and confirmed statistically by 100 Monte Carlo runs of the model for each DF condition separately (Figs 5B and 6B; error bars indicate 95% CI around α-mean values). Exemplar distributions for first and subsequent durations are shown in Figs 1D and 6B at DF = 5. For other DF values, see S1 Fig.

Since alternations were caused in the model by the accumulator that gathered evidence against the current percept, the distribution of threshold crossing event times depended on T_a and σ_a. In particular, for a given T_a, EVA model generated percept durations that were normally distributed for σ_a small (Fig 7, region N; α was much bigger than 3) and exponentially distributed for σ_a large (Fig 7, region E; α was close to 1). For intermediate σ_a, the distributions were gamma-like matching those fit to the observed data (Fig 7, region G; model-generated α values were similar to those determined from experiments, α_exp, at relative error up to 20-30%). The closer T_a was to the decision threshold 1, the easier it was to find σ_a that yielded gamma-like distributions. With decreasing T_a, the transition to a narrower region G was either sharp-edged (e.g. DF = 7, first I) or rather smooth (DF = 7, first S). Percept durations that approximated well both the distribution shape and the mean duration of the experimental data were obtained by using parameters from region G that overlapped with the black dotted sheet.

EVA model captures DF-dependence of stream segregation buildup.

The model-generated buildup functions captured both the rising and the asymptotic phases of the behavioral buildup for each DF = 3, 5, 7 (Fig 1D). These trends were a consequence of already having simulated percept durations and percept means well-fit to behavioral data, in accordance with previous works describing the buildup of stream segregation as a byproduct of an alternating renewal process [17].

Computational advantages of the EVA model.

The model is pseudo-neuromechanistic; it takes A1 responses as input, it allows for attractor-states, and it includes accumulators that are saturating akin to synaptic currents. The spike counts are in accordance with neurophysiological data from A1 [1] and provide input to the computation of perception dominance downstream, as in the conceptual population-separation model of [7] and in competition-based model of [6]. The model incorporates fast habituation (after one triplet or so, Fig 4B) as in [1, (Fig.3 in Ref)] and it accounts for the decrease in response amplitudes and in spatial activity patterns evoked by tone B at tone A tonotopic locations observed as DF increases [1, 7]. Indeed, if most of tone-A-selective A1-neurons are active, the model predicts a large proportion of samplers in SL to be active (p_I large) and thus favors I percept. If the opposite happens and A1 is mostly inactive, a large proportion of samplers are inactive (p_S is large) and the model favors S percept. Activation in IL decreases with larger DF (fewer IL-units have mean spike counts above threshold C_th) and so does p_I(t); Fig 4A and 4B, compare DF = 3, 5, 7; see also see also [1, (Fig.3 in Ref)] and [7, (Fig.11 in Ref)]. This affects the dynamics of the accumulators since x_I(t), x_S(t) gather evidence about percepts with increment rates proportional to p_I(t), p_S(t) respectively, while also being modulated by a certain level of noise (Fig 4C, Eqs).

The accumulators resemble discrete time versions of the leaky integrate-and-fire neuron model with conductance-based synaptic input [25], . The voltage-like variable (here, normalized, by the threshold value for switching) has a maximum amplitude of one. The reversal potential V_R is set to either target T_a or T_f depending on the type of the dominant percept and the type of the accumulator. The synaptic drive consists of feedforward input from SL and is analogous to the reciprocal of the time constant. Finally, Gaussian white noise represents input from other brain sources or internal to ACC. Its strength σ needs to reach an appropriate level for the statistics of percept durations generated by the EVA model to match the behavioral data (see Methods).

Our EVA model is data-driven. Initial conditions are set using the latency periods and the proportion of first S-percepts from the experimental data. Poisson spike counts of IL neuronal units at each triplet t and semitone difference DF are generated using mean values m_t;DF derived from macaque A1 multi-unit spiking neural data [1] (Fig 8A). Parameters N_in and C_th are obtained by least-squares fit between the probability of a sampler to support an S percept and the behavioral buildup at all three DF values (Fig 8B; see also Methods). Neuronal granularity as a suitable substrate for perceptual representations [24] is implemented through SL. The number of samplers is chosen flexibly from a wide range of values (N_sl ≥ 1; here N_sl = 20). There are few other free parameters, T_a, σ_a, T_f, σ_f, b (baseline), but only the former two are major players in fitting the model to data (as shown in sections on pages 9 and 11).

Signal detection algorithm yields fast buildup and unrealistic percept durations

Modeling the buildup with Micheyl’s model for auditory streaming.

The signal detection algorithm of Micheyl et al. [1] (see also [26, 27]) has been extensively cited in the auditory streaming research [3–6, 12, 21] in regard to computing time-varying probabilities of stream segregation from neuronal responses in A1. The model was based on choosing a threshold number, C_th, for mean spike count (first, trial-averaged; then averaged across sampled neurons) to classify each triplet as I or S; doing this for each ABA₋ in the sound sequence generated a time course, a “neurometric” function. Briefly, for a given triplet position, a probability distribution was constructed for the B-tone responses measured at and convolved over A1 neurons whose best frequency was that of the A-tones. The area under the probability distribution to the right of C_th determined the probability that tone B was detected and, consequently, that the triplet belonged to I percept; the complementary probability was associated with S percept. The algorithm classified each triplet independently and assumed no memory among nearby triplets.

A simplified view of this procedure is to consider the distribution of trial-averaged counts for the neurons as straddling the mean for each triplet in the time course of an ABA₋ sequence [1, (Fig.3 in Ref)]. Conceptually, one chooses a level C_th that will cut across the distributions, say for DF = 3, and correspond to low probability of S for early time and correspond to the asymptotic level (from behavior) for late time (e.g. C_th, thin horizontal line, in S4 Fig). This classification could likely provide a decent fit for DF = 3 but for DF = 6 the spike counts will fall below the threshold, leading to an overestimate of probability of S. The remaining cases will yield extreme classifications: for DF = 1, spike counts for each triplet in the sequence will be above C_th and probability of S will be estimated as near zero; for DF = 9, all spike counts will be below C_th (except maybe for the initial triplet) and therefore probability of S will be near one for the time course. The spread of the behavioral time courses in [1, (Fig.4 in Ref)], one lying intermediate for DF = 3 and the others at very low or quite high levels, provided an opportunity for reasonable fitting with a single C_th level [1]. For details on numerical fitting with such signal detection algorithm, see S4 Fig.

In the case of our data, the conditions were DF = 3, 5, 7 and the behavioral buildup functions lay in more intermediate levels and clustered around the asymptotic value of 0.5 (Fig 1C; also Fig 8B, dotted lines). It thus became challenging to fit the buildup functions (especially the early, slower rising portions for multiple DF values, 3 and 5 semitones) using Micheyl’s model with a unique C_th.

We attempted to meet this challenge by applying the signal detection algorithm to our behavioral data while using interpolated and Poisson distributed spike counts based on the neural data from [1]. The approach was equivalent to the computation of the probability of a sampler to support segregation from the EVA model, observing only the input layer, IL, and passing its output through one single sampler, N_sl = 1. While the mean spike counts over a pool of N_in A1-neurons did not change significantly with N_in, the standard error to the mean did (Fig 8A). The decrease in the spike count error to the mean made the horizontal line C_th intersect fewer local distributions and biased the data-fitting towards the behavioral curve for a certain DF, at the expense of others. Choosing more A1 neuronal units (larger values of N_in; Fig 8A) led to larger spread in the simulated neurometric functions and poorer fitting (Fig 8B; at DF = 5, 7 the neurometric functions (solid lines) plateaued at probability approximately 1 after triplet-sequence onset, as N_in increased; e.g. case N_in = 100). The best approximation of the asymptotic levels of the behavioral buildup for all DF conditions was found at a relatively low N_in however the rising transients of the neurometric functions were still much faster than in the experiment.

Modeling percept durations with Micheyl’s model.

We extended the work from [1] by using the signal detection algorithm to generate “percepts” and characterize their distributions. For each DF, adjacent triplets of the same type (I or S) were grouped together to create percept phase durations and construct frequency graphs (Fig 9). Theoretical calculations showed that subsequent percept durations generated by Micheyl’s model were exponentially distributed, as opposed to gamma-like. During subsequent durations we could assume that the buildup functions of stream segregation had reached an asymptotic level p (Fig 9A and 9B, upper panel) and that the activity in the A1 pool was independent at each triplet. Then the probability that percept S consisted of n-triplets could be calculated as Prob(D_S) = pⁿ(1 − p) = (1 − p)e^{n ln p}, depending exponentially on n. Likewise the probability that I consisted of n-triplets was Prob(D_I) = p(1 − p)ⁿ = pe^{n ln(1 − p)}. Similar results were obtained from numerical simulations of Micheyl’s model. The probability density functions were found to be discrete versions of exponential curves and the mean durations were small at about 1 s (Fig 9A and 9B, middle and lower panels), suggesting that the signal detection algorithm is not appropriate to describe perceptual alternations and percept durations, key aspects of bistable stream segregation. (Compare to Fig 1C; also S1 Fig and [2, 5, 6]).

Download:

Fig 9. Signal detection algorithm adapted from [1] yields exponential distributions and unrealistic mean durations of percepts.

(Top) Binary threshold C_th is chosen to yield the least-squares error between neurometric buildups (solid) and behavioral buildups (dashed) at DF = 3,5,7. Poisson spike counts are averaged across a sample of N_in = 5 neuronal units and compared to C_th to classify a triplet either as I or S. Trial-averaging the S-tagged responses produces the neurometric functions. The threshold value is determined by least-squares fit for A: the first 15 seconds of the stimulus to match the transients, or for B: the last 15 seconds of the stimulus to match the asymptotic level of the behavioral buildup; See also Fig 8. (Bottom) Trial-by-trial applications of the signal detection algorithm from [1] with A: C_th = 4.01 and B: C_th = 4.21 yield exponentially distributed subsequent percept durations for I (blue) and S (red). Their mean values μ are significantly smaller than those reported in the experiment. Note: Same parameter values N_in = 5, C_th = 4.21 were used in the EVA model for activation of the sampler layer SL (Fig 4A and 4B) and obtain gamma-like distributions of percepts (Figs 1D and 6B).

https://doi.org/10.1371/journal.pcbi.1008152.g009

Discussion

We developed a new evidence accumulation model for auditory streaming of triplet sequences ABA₋ ABA₋ … that takes as input neuronal responses of primary auditory cortex, A1 (macaque, [1]). Our neural-like model accounts for the (human) behavior we observed under three conditions (tone frequency difference, DF). During trials, subjects reported spontaneous alternations (bistability) between integration, I, and segregation, S. The first percept was usually I; the probability of S built up over time rising from near zero and plateaued within a few seconds to a level that increased with DF. In the model, switching between I and S occurred when noisy accumulation of evidence against the current percept exceeded threshold. Our simulations matched both buildup time-courses and percept-duration distributions.

Our model draws inspiration from the population separation hypothesis of [7] and focuses primarily on the B-tone responses of A-tone selective neurons. Micheyl et al [1] used similar principles to compute “neurometric” functions for segregation buildup. Their signal-detection model was applied to A1 and to sub-cortical neuronal spike count data to conclude that perceptual organization of auditory streams was present in early stages of the auditory pathway [3, 21]. It treated each triplet as independent of the previous ones, without an accumulation process from triplet to triplet. The only time dependent mechanism was adaptation of A1 neurons that was nearly complete after 2-3 triplets—too fast to account for buildup. Herein we show that Micheyl’s model behaves as if classification is like coin-tossing with possible bias. Simulated durations are therefore like run-lengths in coin-tossing, exponential-like and very brief, contrary to the observed data (gamma or lognormal-like).

Our approach underscores the essential significance of duration distributions as characterizations of streaming and switching, a constraint overlooked by previous analysis [1]. It emphasizes that neuronal-based modeling of behavioral data that goes beyond trial-averaged behavior may need to involve an evidence accumulation process in order to account for the statistics of single trials.

Novel features

Our model is intuitively straightforward. It describes the accumulation of evidence, incremental from each triplet, for or against the current percept. The estimated A1-spike counts are passed through a sampler layer, SL, each of whose units sample a few A1-neurons. SL-units vote 1 or 0 if the summed spike counts for the current triplet are above or below threshold. The fraction p_I (p_S) of sampler-votes 1 (0) represents the net output which favors integration (segregation), transmitted to the accumulators. After multiplicative weighting, p_I, p_S are used together with additive noise to update the accumulators. Of significance, the weighting factor is state dependent, proportional to the difference, T−x, between the current accumulator value x and a target T. Accumulation slows when x is closer to T and, importantly, we can choose T<1 in which case our model mimics noise-driven attractor competition dynamics [23]. Further, if T is close to one (i.e. accumulation saturates below, near threshold), gamma-like distributed threshold-crossing times are more robustly obtained with modest noise levels [22, 28].

State-dependent dynamics of stochastic accumulators in the framework of bistable perception were highlighted in other previous works [24, 29]. Our approach implements several distinctive features: a link to spike count neural data, an intuitive equation for the accumulator (see Eq 1, basic model for behavior), and a theoretical framework that goes beyond equidominance by looking at graded responses across multiple stimuli conditions.

Our model is a hybrid: it incorporates some neuro-based phenomenology (A1 neuronal responses as input, saturating driving force, escape dynamics) but it is non-committal to specific neuronal mechanisms of inhibition and adaptation. Moreover, key parameters are not directly linked to neuromechanistic processes but rather determined by fitting model dynamics (simulated threshold-passage times) to observed duration distributions.

Duration distributions underlie buildup

Buildup functions (BUFs) for behavioral data are based on trial-averaging of ongoing reporting of percepts; the buildup functions can be well-reproduced by an alternating renewal process applied to the percept duration distributions [17], in spite of disregarding the small inter-duration correlations. Our EVA model, using neural data as input, as well reflects a choice process, a neuronal computation, based on single-trials. From the EVA-simulated switch times we computed the “percept” duration distributions and generated BUFs that compared well with the behavioral data. The single-trial percept durations are the critical observations for a model to match in order to characterize streaming dynamics for stimuli with constant parameter values such as DF. We conclude that trial-averaging of the spike counts, especially from too early in the cortical pathway, and a triplet-based signal detection scheme [1], washes out the dynamical aspects of accumulating neuronal computations that underlie perceptual multi-stability. Model-based analyses of trial-averaged neuronal responses that show ramping behavior in decision-making tasks have recently come under scrutiny by consideration of single-trial data [30]. Arguments were made, admittedly still under debate [31, 32], that trial-averaged smooth time courses of evidence accumulation during decision-making might arise from temporally “discrete steps” rather than from continuous ramping dynamics. We suggest that care be exercised when making interpretations from trial-averaged neuronal responses, neuronal ramping or neuronal BUFs, to consider that such averaging may overlook the discrete event nature of perceptual switching and/or decision-making that involve evidence-accumulation/competition.

Fitting of model to data

We assigned different values of noise and targets to “against” and “for” accumulators to ensure switching was caused by the against-unit crossing the decision threshold. Intuitively, as the accumulator-against saturates around target-against T_a (subthreshold), enough noise σ_a guarantees threshold-crossing. The closer T_a is to one, the less noise is required to produce alternations. Within the switching domain different combinations of T_a and σ_a yield different distributions and means of percept durations. Our model reproduces the experimental data when T_a, σ_a are taken from a restricted parameter region. With T_a constant across conditions we captured the observed trend of mean durations although some values were off. With fine-tuning of T_a across conditions (but σ_a constant) we match the observed duration distribution shapes and means. This approach is analogous to obtaining the proper balance between noise and adaptation necessary for alternations in other models for bistable perception [22, 33]. Noteworthy, our model shows switching behavior when tuned in other parameter regimes, including with T_a>1. However, in such a drift-dominated regime although noise is not needed for alternations, we found that matching the statistical features and behavioral trends required a substantially higher (unacceptable) noise level.

Comparison with other models

Barniv and Nelken (2015) and Cao et al (2016) also modeled auditory bistable perception as evidence accumulation based. The former’s model used Bayesian assignments of B-tones to either the same class as A-tones (integration) or to a different class (segregation). Its noise-free version shows periodic alternations, as does our system for T_a>1, but the dynamics do not reset. Instead, our accumulators undergo discontinuous resetting after each switch. Most importantly, in contrast to [5], the parameters in our model are interpretable, functionally if not physiologically. Cao et al. formulated a stochastic accumulator that reproduced (like ours) several scaling properties of bistable behavior but without a description of switching and of, possibly asymmetric, alternations. Our work differs from both models by incorporating directly A1-spiking data as input. In this sense it is more akin to [6], a literal competition model. Notably, our approach predicts that neuronal computation for percept representation and evidence accumulation takes place beyond A1; its input (activity from A1 devoid of switch-dynamics) implicitly includes inhibition, adaptation and noise that occur within A1 and preceding stages.

Dynamic competition models commonly include two or more units representing response patterns associated with different percepts, and share mechanistic features of mutual inhibition, adaptation, and noise [6, 14]. In our two-process model only one percept is currently dominant thereby realizing mutual exclusivity. However, inhibition is not explicitly invoked; rather, our model performs as if a firm choice is made at the switch time, further accumulation of evidence in favor of the fresh percept is prevented; the in-favor accumulator is reset and targeted to a low value, T_f, despite continued incoming evidence from SL.

In oscillator-based alternations, switching may be determined by strongly rising adaptation in the dominant unit, leading to “release” from inhibition, or by stronger recovery from adaptation in the suppressed unit, leading to “escape” from inhibition [34–37]. Our model has no explicit adaptation variable as negative feedback. However, the accumulation of evidence against the current percept may be viewed as recovery of salience of the non-dominant percept. The rise and eventual take-over of dominance is therefore analogous to the escape from suppression in competition models.

In such models if adaptation is weak, changes in dominance may be represented by noise-driven switches between stable states in attractor state dynamics [22, 23]. These insights motivated our choice of an evidence-against accumulator that saturates just-subthreshold. Dominance durations are longer with reduced noise, and no switches occur in the noise-free idealization. Further, gamma-like duration distributions are more robustly obtainable with this mechanism: rise to saturation and wait for switch-favoring fluctuations to induce a switch. Satisfactory results are also obtainable with T_a>1, but if T_a exceeds threshold by too much, acceptable duration statistics seemed to require strong noise, and accumulator time courses were noise-dominated.

Bistable perception for ambiguous visual displays was modeled by [38] as a continuous time accumulation of binary (bistable) units becoming active with state-dependent transition rates between the active and inactive states. Our modeling shares a key feature: saturation to a level that strongly affects the percept durations; saturation near/below threshold underlies escape-like dynamics with gamma-like duration distributions. Distinguishing from [38], our model is event-based (discrete-time) with stimuli-induced positive increments and additive zero-mean noise that allow positive/negative increments, not a Markov model. It is applied directly to the neural data and includes saturation with noise-driven attractor dynamics as in competition models.

Limitations, extensions, predictions

Reports on triplet-streaming are conflicted about correlations between successive I, S durations, showing either statistical independence [2] or small positive correlations [5]. In our model both accumulators are reset after a switch approximately to target T_f so correlation between successive percepts is weak. However, we could likely match the reported correlations [5] by changing the resetting to generate continuous dynamics of accumulators.

Alternations between percepts are generated by the evidence that accumulates against the current percept. Its dynamics depends primarily on the distance to target T_a and on input p from the sampler layer, with T_a assumed relatively constant across conditions. Alternatively, one might choose T_a as a DF-dependent parameter and keep p unchanged. Such an approach suggests an interpretation of the target, with nearness to threshold, reflecting a combination of condition-dependent input and inhibition, and possibly excitation (in-line with a population separation hypothesis [7]). Then p, as constant and independent of DF, can be viewed as rate of recovery from adaptation. However, to establish a derivable connection between T_a and the experimental condition and A1 spiking activity presents challenges.

In our model the number of A1 neurons that are sampled by each SL-unit is much lower than the number of recorded units used in the signal detection approach in [1] (N_in = 5 vs 91 cortical neurons). We found that the granularity of sampling A1 by a unit in the sampler layer is important in order to preserve sufficient variability in the averaged spike count over trials and thereby obtain graded BUFs across different DF conditions. Perhaps the constraint on N_in derives from our assumption of statistically independent A1 neurons. As shown by [27], trial-to-trial variability in spike counts for N_in small, if spikes are statistically independent, is equivalent to the variability over a much higher number of A1 neurons if correlations exist within the pool. We did not have access to the original spike times from [1] to verify this hypothesis; we only extracted mean spike counts from the published data. However, this observation is supported by a subsequent study by Micheyl et al [12] and could be explored in future simulations; when spike counts from a subset of 30 cortical neurons (or even just one neuron) out of 91 were analyzed with the signal-detection model, the resulting neuronal-based BUFs were less widely spread across conditions, matching the graded behavioral BUFs from a different subject pool (see [12, (Fig.5 in Ref)], compare with Figs 3 and 8 for our model).

Our model could be extended to mimic the transient behavior of buildup by relaxing the constraints on initial conditions and treating the baseline as DF-dependent. Two hypotheses may be tested: that integration emerges with first percept probability as in the behavioral data and that early adaptation of A1-responses accounts for longer first, than subsequent, I-durations [39].

With minimal modifications to our model we could test for behavior at other DF values or for dependence on presentation rate. Assuming lower target-against levels T_a for faster presentations, we predict at constant DF similar mean I-durations but longer S-durations, and higher probability of segregation [6]. With increased presentation rate mean spike counts for B-tones will decrease [8] and lead to lower vote counts p_I and lower effective accumulation rate, T_a p_I. Although p_S (= 1-p_I) would increase, the increase would be compensated by the decreased T_a leaving T_a p_S relatively unchanged.

To conclude, we propose an evidence accumulation model for auditory bistable perception with neurally-plausible mechanisms that accounts for statistics of behavioral data. In principle, it could be extended to study dynamics induced by transient perturbations (deviants/distractors; [39]) or associated with multiple percepts [14]; implementations of such generalizations remain as open topics for future research.

Methods

Experimental design and statistical analyses

Participants.

Fifteen human subjects with normal hearing (8 female and 7 male; ages 18-45 yrs.; median 22 yrs.) were included in the behavioral study. They listened to sequences of repeating ABA₋ triplets and were instructed to continuously report their ongoing percept by selectively pressing one of two different buttons on a keypad. Subjects began reporting their percept typically 2 s after stimulus onset as integration (I; a single, coherent stream, the galloping pattern ABA₋ ABA₋) or segregation (S; two simultaneous distinct streams A₋ A₋ A₋ A₋ and ₋B₋₋₋ B₋); Fig 1A and 1B.

Stimuli.

Stimuli were 30-s long sequences of triplets ABA₋ that consisted of alternating high (A) and low (B) pure tones gated with 10 ms raised cosine ramps and followed by a 125 ms silent pause “₋”; Fig 1A. In total, triplets were 500-ms in duration and were repeated 60 times per trial. Tones were separated in frequency by DF semitones chosen from three conditions (DF = 3, 5, 7) with each condition being presented five times per experimental block. To prevent habituation to a certain frequency, for each DF the tones were generated by roving through variants of frequencies taken 0, ±1 or ±2 semitones apart from their geometric mean (middle pair in the list below); see also [4]. Frequencies (f_A, f_B), in Hz, were chosen as: (494, 415), (523, 440), (554, 466), (587, 494) or (622, 523) Hz at DF = 3; (523, 392), (554, 415), (587, 440), (622, 466), (659, 494) at DF = 5; and (554, 370), (587, 392), (622, 415), (659, 440), (698, 466) at DF = 7. Stimuli were digitally generated in Matlab at 48 kHz sampling rate and were delivered through earphones in a soundproof isolated room. Subjects had the sound volume adjusted to their comfortable hearing level.

Experimental protocol.

Each subject performed 9 experimental blocks. Each DF condition was randomly presented 5 times per experimental block, using different combinations of frequencies for tones A and B, without repetition (see Stimuli). A Latin square design was used to determine the order of presentation of each condition in each block. This resulted in group data with 675 30-s trials for each of three DF values. The frequency separation values (DF = 3, 5, 7) were chosen to fall within the range of ambiguity of the van Noorden diagram in which listeners can perceive both integration and segregation [10, 40]. All subjects underwent a training session in which they were given verbal explanations and auditory illustrations of the two possible percepts, and they practiced distinguishing between them. Then, during the recording session, listeners were instructed to press and hold one key on a keypad when they perceived stimuli as I, and to release it while pressing another key when they perceived stimuli as S, and so on. They were encouraged to respond as soon as they heard the change in percept. The key-response data were converted to binary vectors with value 0 assigned to I (and to the latency period defined as the time before identification of either percept) and 1 to S, for further analysis. Experiments were performed in a dedicated soundproof booth in the Human Brain Research Laboratory, Neurosurgery Department at The University of Iowa. Written informed consent was obtained from all subjects. Research protocols were approved by the University of Iowa Institutional Review Board.

Build-up functions and the latency period.

The time course of S percept after stimulus onset was computed from key-pressed data, 0 for I and 1 for S. Those were sampled at 1 ms to create vectors of binary values corresponding to appropriate percept type at a particular time instance. At each DF condition, binary vectors were averaged across 675 trials to obtain the build-up function of S; bootstrapping was also used to compute the 95% confidence interval (CI) around the mean (Fig 1B and 1C). The time course of I was computed with the same procedure but over key-pressed data labeled as 1 for I and 0 for S. At a particular time t, the proportion of trials classified as I, S or neither (during the latency period/the first few seconds after stimulus onset) were p_int(t), p_seg(t) and p_Lat(t), and summed up to p_int(t) + p_seg(t) + p_Lat(t) = 1. The first percepts were typically I. However, for larger values of tone frequency separation, subjects tended to report S as first percept more often which led to an increase in p_seg(t).

First durations and subsequent durations.

All complete percept durations across trials and conditions were included in the behavioral analysis. Unfinished percepts and button presses recorded after the end of stimulus presentation were discarded. For each DF = 3, 5, 7, the statistics was evaluated over four subsets of data, separately: first I, first S, subsequent I and subsequent S. For each DF and each of these four percept types, the mean dominance duration was computed in two steps: first, it was computed per subject, say μ_i for subject i = 1, …, 15; then the mean duration μ of the group data was defined (and reported) as the unweighted average across all subjects, μ = (μ₁+ μ₂ + ⋯ + μ₁₅)/15 (e.g. Fig 1C; mean μ is shown for first duration distributions at DF = 5). By this approach, any potential bias of the calculation towards fast switchers who might contribute more durations to the pool and concurrently spend less time in a particular percept, was mitigated. Error bars at 95% CI of the mean were also determined; error bars corresponded to 1.96 SE; standard error SE was computed from the standard deviation std_exp over the group of means μ_i. For analysis of grouped data we used subject-specific normalized (by individual subject mean) percept durations as follows: at each DF condition and each percept type (first/subsequent, I/S), any raw percept duration D of subject i was normalized by the corresponding mean μ_i to . Histograms of normalized phase durations for each DF condition and percept type were computed and fit by gamma distributions with density functions (3) Mean was well-fit to 1 due to normalization. Then the coefficient of variation and the skewness depended exclusively on the shape parameter α. If α was large then Eq (3) was equivalent to a normal distribution. If α ≈ 1 then Eq (3) was equivalent to an exponential distribution. On the other hand, for α ≈ 2 (as observed for first durations in behavioral data) and α ≈ 2.6 (as observed for subsequent durations), the distributions satisfied the scaling property γ₁ = 2CV with CV ≈ 0.7 and CV ≈ 0.6 respectively. The latter case was similar to the findings of [24] that described the statistics of percept durations for other examples of perceptual bistability.

Distribution testing of phase durations.

The fitting of the experimental (and numerical) data was obtained by calculating the values of α and with the Maximum Likelihood Estimation (MLE) algorithm. The goal was to determine α and that yielded the maximum product ∏_k y_k of all y_k gamma-likelihood values of the normalized percepts counted by index k for each run of the experiment. This was equivalent to maximizing the log-likelihood based on Eq (3). The optimization of LL was implemented numerically with MATLAB function fminsearch. Distribution testing on normalized durations was done by statistical bootstrapping. We generated 10000 bootstrapping sets of gamma distributions with fitted parameters α and and constructed the distribution of maximum log likelihood values for those sets. The test statistics LL was compared to this distribution to obtain the probability of log likelihood to be less than LL (the p-value). The normalized durations were well fit by a gamma distribution with the optimal values α and (as null hypothesis) at significance level 0.05 if p-value ≥ 0.05.

Statistical analysis of model-generated data.

The histograms of first and subsequent durations I and S in trials generated by the model (see below), and their fitting by gamma distributions, were computed in a similar manner as for the experimental data. Likewise, build-up functions for the model were constructed as those for the behavioral data.

The evidence accumulation model

Our proposed EVA model is a feedforward network of 3 layers: the input layer of spiking units, the sampler layer of binary response units, and the accumulation layer of two accumulators. The time-unit of the model is discrete and defined as the position of the triplet in the auditory sequence. Every DF-dependent numerical simulation of the EVA model consisted of N_tr = 675 repetitions (trials) to mimic the setup from the behavioral experiment. The trials were then used to generate the statistics of the percept durations in terms of mean values, shape of distributions, and buildup functions. Finally, in order to test for the model’s robustness in the presence of noise (not for sensitivity to model parameter values), this numerical procedure was run 100 times for each DF = 3, 5, 7 condition separately, and the results were characterized by averaged values and their 95% CI.

Input layer (IL).

The IL-units were assumed to be tone-A selective neurons from primary auditory cortex (A1) as described in [1]. The averaged (over trials) spike counts m_t,DF of the IL-units were derived from data (see section Data-driven parameters for EVA model) and depended on the position t of the triplet in the ABA₋ sequence (t = 1, …, 60 for a 60-triplet long stimulus; 30 s in duration) and on the semitone difference DF. The model was simplified by focusing only on the spike counts during the B-tone presentation at A-tone selective neurons in A1. As reported by multi-unit recordings in monkeys, such A1-neurons adapted strongly and rapidly during presentation of triplet-repeating auditory sequences [1, total of 91 neurons]. Temporal correlations between the means of an A1-neuron from triplet to triplet were captured in the model by the trend of m_t,DF that decreased exponentially with t (Fig 4B). Trial-to-trial variability of the dynamics of IL units as well as unit variability in IL during a single trial were implemented using Poisson point processes. (We used this approach because we could extract mean spike counts from published data of [1] but did not have access to the original spike times.) For an A1-neuron with mean spike count λ = m_t,DF we supposed that its instantaneous spike count k (k = 0, 1, 2, …) at triplet t and condition DF, was randomly generated from a Poisson distribution with probability P(X = k) = λ^k e^−λ/k!. The Poisson spike counts are generated independently for each neuronal unit in IL, each triplet and each semitone difference condition. Note that the model could be generalized by assuming neuronal heterogeneity, with averaged spike counts at neuron j chosen from a normal distribution of mean m_t,DF and standard deviation s_t,DF derived from [1]. However, the impact of heterogeneity on the model’s outcome would be negligible given that EVA model was formulated to use mean spike counts over IL neuronal pools as input rather than mean spike counts of individual neurons (see below).

Sampler layer (SL).

The SL-units were tasked with summating and classifying spike counts from subsets of N_in IL units (A1-neurons). Consider that a trial of length N_t (N_t = 60 triplets) during a certain DF condition was simulated by EVA model: each sampler summed the input of a pool of N_in neuronal units from IL; weighted by N_in, this gave the mean spike count for B-tones of the corresponding pool of A-tone selective A1-neurons, for triplet t; then was compared to a DF-independent, fixed, neuronal threshold C_th. If the averaged spike count was large () then the subset activity was high and the pool was assumed to support, for this triplet, the integration percept I. The sampler’s response at triplet t was classified as “1”. If the averaged spike count was small (), the subset activity was low and the IL-pool was said to support the segregation percept S. The sampler’s response was classified as “0” (Fig 4A). Therefore, at each triplet t, each sampler behaved like a biased coin being flipped with probabilities p_0;t,DF and p_1;t,DF over the binary probability space of 0 and 1. Since neuronal units in each IL-pool followed independent Poisson distributions of parameters m_t,DF, the pool itself was also a Poisson process defined by the product N_in m_t,DF. Then, the samplers were binary signal detectors with probabilities (4) and p_1;t,DF = 1 − p_0;t,DF calculated over 675 repetitions of the model in order to maintain similarities to the behavioral experiment, and with expected value and variance and .

Three DF-independent parameters were associated with SL: N_in, the number of A1 inputs to a sampler unit; C_th, the neuronal counting threshold that categorizes ensemble activity in A1 as high (class 1) or low (class 0); and N_sl, the number of neuronal units in SL. The values of N_in and C_th were obtained by least-squares fit between probabilities p_0;t,DF and the “asymptotic” levels 0.45, 0.6, 0.65 of the psychometric buildup functions (last 15 seconds of trial duration) for all DF = 3, 5, 7 (Fig 8B; also section Data-driven parameters for EVA model). The psychometric buildup represented the fraction (over the trials) of the segregation percept S reported by all subjects at each time point and DF during the 30-s long trial. Through this optimization procedure (Fig 8B; optimal values obtained for N_in = 5, C_th = 4.21) probabilities p_0;t,DF of a sampler to be in a state that supported percept S were estimated—at least for triplets several seconds from the stimulus onset—as (5) The inclusion of the sampler layer in the model (N_sl > 1) ensured neuronal granularity that was found by other studies to be a suitable substrate for perceptual representations [24]. In particular, at each triplet t, some of the N_sl samplers were in class 0 showing low A1 spiking and thereby associated with segregation [1, 7] while others were in class 1 supporting integration. The percentages p_S(t) and p_I(t) = 1 − p_S(t) of such samplers were taken herein as stochastic (over trials) output of SL (Fig 4A).

Accumulation layer (ACC).

The accumulation layer consisted of two units whose dynamic states x_I and x_S changed from triplet to triplet according to Eqs (6) The accumulator for x_I gathered evidence that favored integration (through input p_I(t) from SL) while the accumulator for x_S gathered evidence that favored segregation (through input p_S(t) from SL). Importantly, their states were influenced by the perceptual context as well (Fig 4C, solid lines in blue and red illustrated traces for x_I and x_S; the background color showed the percept’s type, blue for I and red for S). In particular, if segregation was the current dominant percept then x_I accumulated evidence against segregation and aimed to reach target T_I = T_a; simultaneously, x_S accumulated evidence for the current percept and it drifted instead towards target T_S = T_f. Discrete-time Gaussian white noise processes σ_Iξ_I(t), σ_Sξ_S(t) with zero mean and standard deviations σ_I = σ_a and σ_S = σ_f, interacted with the stochastic inputs from SL to produce certain levels of fluctuations. The additive stochastic terms in ACC were target-dependent with σ_a > σ_f for T_a > T_f (Fig 4C). On the contrary, if the current percept was integration then x_I accumulated evidence for I and approached T_I = T_f while x_S accumulated evidence against I and approached T_S = T_a. The level of local noise was adjusted accordingly to values σ_I = σ_f and σ_S = σ_a.

Switches and resetting conditions.

In the experiment, subjects identified the dominant percept by pressing a certain button on the keypad. Equivalently, in EVA model, the switch from one dominant percept to the next occurred when either x_I(t) or x_S(t) crossed the ACC threshold set to 1. Herein, the alternations were initiated by the accumulator that observed how many samplers in SL opposed the current percept at each triplet t. Its trace was attracted to target T_a that lay near the threshold, then was pushed across the threshold by the noise of strength σ_a. In the meantime, the accumulator in favor of the current percept hovered around T_f with fluctuations set by σ_f. For example during percept I, accumulator x_S was the first to reach threshold 1 producing a switch to percept S; then, during S, x_I reached threshold 1 leading to another switch to subsequent percept I, and so on (Fig 4C). At every change in percept, the accumulators were reset to new levels , . These were defined as where was the value that the evidence-for accumulator x_* (* = I, S during current percept I, S respectively) reached just before the switch.

The simulations took into account the proportion of first percepts reported as segregation at each DF during the behavioral experiment as well as the latency period during each trial. The initial conditions of the accumulators were set to a DF-independent baseline value b and kept constant during the entire latency period (calculated in length of T_Lat triplets) of any given trial. We defined x_I(t) = x_S(t) = b for all triplets t between 1 and T_Lat; then at t = T_Lat the type of the current first percept, I or S, was imported from the behavioral data; the dynamics of the accumulators for t ≥ T_Lat were then determined according to Eq (6) and the associated reset conditions.

The choice of parameters (σ_f small) and of reset conditions ( where x_a, x_f are accumulators “against” and “for” the dominant percept) ensured that the switch was triggered by the dynamics of x_a. Rare events when x_f might have crossed the threshold ahead of x_a were disregarded. Another possible implementation of resetting would allow for correlations between consecutive percepts; it could depend on each accumulator state just before a switch, a simple interchange of roles such that ACC variables remained continuous, , .

Model analysis

Parameter values used in model simulations.

All figures for the full EVA model (Figs 1, 4—8, and S1—S3 and S5 Figs), were generated with the following parameter values:

N_in = 5, C_th = 4.21, N_sl = 20, b = 0.7, T_f = 0.6, σ_a = 0.085, σ_f = 0.03

and decision threshold θ = 1, unless otherwise stated in their caption. (For parameter values m_t,DF associated with IL, see Data-driven parameters for EVA model.) Target T_a was initially chosen equal to 0.9 and then was fine-tuned to best fit the mean dominance durations of the first and subsequent integration and segregation percepts from the behavioral data—within ±10% standard error (SE) of the experimental mean values (Figs 5 and 6; also S5 Fig). Its values changed with DF, and with classification (first or subsequent) and type (I or S) of the percept. We used the notation T_aI1 for “target against integration, first percept”, T_aS1 for “target against segregation, first percept”, T_aI2 for “target against integration, subsequent percepts”, and T_aS2 for “target against segregation, subsequent percepts”, respectively. Therefore, in the model, whenever acting as target-against, T_S = T_aI1 or T_aI2 while T_I = T_aS1 or T_aS2. In simulations we used the following values (see red diamonds in Fig 7):

At DF = 3: T_aI1 = 0.8273, T_aS1 = 0.9273, T_aI2 = 0.8924, T_aS2 = 0.8924;
At DF = 5: T_aI1 = 0.9000, T_aS1 = 0.8909, T_aI2 = 0.9288, T_aS2 = 0.9106;
At DF = 7: T_aI1 = 0.9348, T_aS1 = 0.8773, T_aI2 = 0.9242, T_aS2 = 0.9318.

Selection of target-against T_a values.

The mean duration μ obtained by numerical simulations of EVA model was compared to its behavioral counterpart μ_exp for first and subsequent I, S percepts, and each DF = 3, 5, 7. Behavioral results from 15 subjects were characterized by group mean data μ_exp and 95% CI with CI corresponding to 1.96 SE (Figs 5A and 6A, lower panel). Then the model was considered to provide a good approximation of the experimental data if μ belonged to a narrow band within 1 SE from μ_exp (Fig 7, green region and black dots). This was equivalent to the relative error where CV = std_exp/μ_exp was the coefficient of variation computed over the group of subjects. Parameters T_a, σ_a used for the numerical simulations of EVA model (Fig 7, red diamond) were chosen as follows: σ_a = 0.085 was kept fixed while values of T_a were determined by restricting the error magnitude to only 10% SE (i.e. |μ − μ_exp| ≤ 0.1 SE); then, among the latter set we selected the value T_a that yielded the least error in shape of the gamma-fit distributions (see Eq (3)).

Statistical properties of SL-activation.

At any fixed triplet position t in the ABA₋ sequence presentation, each of the N_sl samplers was equivalent to an independent Bernoulli process (during repeated trials) with probability of success p_1;t,DF and probability of failure p_0;t,DF = 1 − p_1;t,DF. Likewise, the state of SL described a binomial process equivalent to the random variable, over trials, N_sl p_I(t) where p_I(t) represented the percentage of samplers in class 1 that favored integration at triplet t. The stochastic process had mean N_sl p_1;t,DF and variance N_sl(1 − p_1;t,DF)p_1;t,DF. As a result, the first two moments of the output p_S(t) and p_I(t) of SL were well-approximated, for sufficiently large triplet-indexes (t ≥ 30; see Eq (5) and Fig 8B, 2nd panel), by (7) In particular, for large N_sl the variance of p_S(t) and p_I(t) became negligible while their means remained unchanged.

Selection of noise level σ_a.

Our EVA model features accumulation that could saturate, given that target-against T_a was assumed to be subthreshold (T_a < 1). Hence the noise level σ_a has to be sufficiently large in order for the trajectory of the accumulator drifting towards T_a to reach the ACC-threshold. A theoretical lower-bound estimate for σ_a was obtained by assuming N_sl large and focusing only on the properties of the subsequent percept durations. Under such assumptions, p_S(t) and p_I(t) were approximately constant as demonstrated by Eqs (5) and (7). Then, after each switch, both accumulators satisfied an equation of the form X_n+1 = X_n + (T − X_n)p + ε_n+1 with X₀ ≈ T_f; T, σ taken as either T_a, σ_a or T_f, σ_f; and independent random variables ; see also Eq (1) in section A basic state-dependent model. Such an equation describes a stationary first order autoregressive model with parameter λ = 1-p [41]. Therefore, at the nth triplet during a percept immediately following a switch, the states of the accumulators followed a normal distribution with mean E[X_n] = T-(T-T_f)λⁿ and variance Var[X_n] = σ²(1 − λ²ⁿ)/(1 − λ²). In particular, at the nth triplet during integration, the mean and variance of x_S (x_I) that accumulated evidence against (for) the percept were E[x_S] = E_a, Var[x_S] = V_a, E[x_I] = E_f, Var[x_I] = V_f where with p = 1 − p_seg;DF. Then at the nth triplet during segregation they were E[x_I] = E_a, Var[x_I] = V_a, E[x_S] = E_f, Var[x_S] = V_f with E_a, V_a, E_f, V_f defined as above but computed with p = p_seg;DF. Given that 3 times the standard deviation from the mean accounts for 99.73% of values in a normal distribution, if σ_a was too small the accumulators could cross the threshold 1 only with very small probabillity. From the calculation above, a lower bound for σ_a in the model was estimated at σ_a > σ_a,min with , where p_M was the maximum of p_seg;DF and 1 − p_seg;DF for all DF = 3, 5, 7. For example, if T_a = 0.9 then a necessary condition for switching was σ_a > 0.022.

Statistical properties of ACC-activation.

As explained in the previous section, for small σ_a the accumulators in the EVA model could not cross threshold 1 (Fig 7; na, gray region). Numerical simulations showed that when σ_a increased to the right of curve σ_a = σ_a,min in the (σ_a, T_a)-plane, alternations between percepts occurred and the dominant durations were distributed according to: normal distributions at σ_a small (the parameter α in Eq (3) was very large) or to exponential distributions at σ_a large (α in (3) was near 1); see Fig 7, regions labeled “N” and “E”, respectively. At intermediate values σ_a, the distributions were gamma-like with shape close to that found experimentally (Fig 7, region labeled “G” between the two white curves). In the latter case, parameter α in Eq (3) was either near 2 (for first percept durations) or near 2.6 (for subsequent durations), and it differed from α_exp by relative error up to 20%, |α/α_exp − 1| ≤ 0.2 (except for first and subsequent I at DF = 7 at which the range for α was extremely narrow and we allowed for a larger error, up to 30% instead). The range for σ_a that led to gamma-like distributions varied slightly with N_sl with the biggest difference being identified at N_sl = 1; see S3 Fig.

Data-driven parameters for EVA model.

Parameter mean values m_t,DF were used to generate Poisson spike counts of IL-units at each triplet t (t = 1, 2, …, 60) in the ABA₋ sequence and for each semitone difference DF = 3, 5, 7; see section Input Layer and Eq (4). They were derived from multi-unit spiking neural data recorded from macaque monkey primary auditory cortex A1 by [1], using a procedure that combined exponential fitting with numerical interpolation. First, mean spike counts m_t,DF at A-tone selective neurons in A1 during presentation of tones A, B and A in ABA₋ were extracted from [1] for each triplet t ≤ 20 in the sequence and for each DF = 1, 3, 6, 9; See scatter points in [1, (Fig.3 in Ref)]; also S4 Fig. The mean spike counts at each tone decreased from value m_1,DF measured at the first triplet to some level at which they stabilized after a few seconds since stimulus onset. They were fitted by functions (8) with parameters chosen to minimize the least-squares error between the extracted mean spike counts m_t,DF and the corresponding exponential curve (S4 Fig, solid curves). In particular, significant differences in mean spike counts at different DF values were observed only during tone-B presentation with fitting in Eq (8) achieved for parameter values m_1,1 = 7.25, m_1,3 = 6.25, m_1,6 = 6, m_1,9 = 5.25 (according to data from [1]) and , , , respectively. Secondly, simulations of EVA model were performed for DF = 3, 5, 7 instead of 1, 3, 6, 9, and for a total of 60 rather than 20 triplets. We implemented these constraints in two steps: for DF = 3 we chose the mean spike counts m_t,DF as in [1] for t ≤ 20 and as for t > 20. Then for DF = 5 and DF = 7 and each triplet t we defined the mean spike counts by interpolation using the power function whose coefficients a_t, b_t satisfied the least-squares fit between this curve and the points (DF, m_t,DF) defined by Eq (8) for all DF = 1, 3, 6, 9 at any fixed t.

The mean spike counts for B-tones of any pool of A-tone selective IL neuronal units, as well as the standard error to the mean, were computed from simulation of Poisson processes with parameter m_t,DF while varying N_in (Fig 8A). Then a threshold value C_th was chosen to minimize the squared differences error between the model-based probabilities from Eq (4) of a sampler to support the segregation percept for all DF = 3, 5, 7 and the behavioral buildup functions, applied to the last 30 triplets (15 seconds) of the stimulus (Fig 8B). The pair of parameter values N_in = 5, C_th = 4.21 that generated the minimum error was then used in numerical simulations of the EVA model.

EVA model versus classical drift-diffusion models

To gain some intuition about the accumulation process in our EVA model and about the timing of switch events, we considered an approximation of the stochastic Eq (6) in continuous time. For that, we assumed the drift in Eq (6) to be constant and neglected its dependence on activity x. Then percept durations corresponded to the first-passage time of the ACC-unit that accumulated evidence against the current percept. Its equation could be interpreted as the constant-drift continuous-time diffusion model (DDM) dx = γ_a dt+ σ_a dW_t with positive drift rate γ_a, noise amplitude σ_a, Gaussian white noise dW_t, and decision threshold θ = 1. In this DDM, the likelihood of first-passage at time t follows an inverse Gaussian distribution [28] with density function and mean 1/γ_a, variance , and coefficient of variation . Moreover, the inverse Gaussian resembles a gamma distribution for large CV but converges to a normal distribution as σ_a decreased in relation to drift rate γ_a [28]. To some extent, the dynamics of the discrete-time ACC from Eq (6) share similarities with DDM above. Numerical simulations of our EVA model showed that gamma-like distributions of percept durations were possible only for σ_a chosen in a restricted parameter range, given fixed targets T_a and T_f (see section Statistical properties of ACC-activation). Outside this range, percept durations followed either normal distributions (for lower values of σ_a) or exponential distributions (for larger values of σ_a). However, the accumulation process in the EVA model is different than in the DDM for several reasons: Eq (6) are discrete-time drift diffusion models; they include leakage; their deterministic version admits bistable non-oscillatory solutions (no threshold crossing); and the input drive from SL is itself stochastic with fluctuations described by Eq (7).

Supporting information

S1 Fig. The evidence accumulation (EVA) model captures experimental mean duration (μ) and shape of gamma-like distributions (α) for first and subsequent percept durations at other DF values.

Distributions of normalized phase durations are shown for A: DF = 3 and B: DF = 7. They are obtained from numerical simulations of the EVA model (columns 2,4) and compared to those derived from behavioral data (columns 1,3).

https://doi.org/10.1371/journal.pcbi.1008152.s001

(TIF)

S2 Fig. Exemplar time courses of accumulators in the EVA model, shown for DF = 3 (top) and DF = 7 (bottom).

In only a few trials, 103 out of 675 for DF = 3 and 220 out of 675 for DF = 7, the first percept is segregation (see panels 2,4). During a cycle, the suppressed unit accumulates evidence against the current percept until it reaches the switching threshold. Then, a perceptual switch occurs and accumulators are reset to the same value. In the noise free case, the accumulators stabilize to their corresponding target values and there are no alternations. Such trajectories are depicted by dashed lines.

https://doi.org/10.1371/journal.pcbi.1008152.s002

(TIF)

S3 Fig. Mean percept durations (μ) and shape parameter values (α) in the EVA model are largely unaffected by changes in N_sl, the number of units in the sampler layer.

Some important differences occur, however, at N_sl = 1 (e.g. for subsequent durations). A two-parameter response diagram of the dependence of μ and α on target-against T_a and noise strength σ_a is shown for DF = 5 and varying N_sl for A: first percepts and B: subsequent percepts. All parameters are chosen as described in Methods, except for N_sl (here N_sl = 1, 5 or 10). For comparison, see Fig 7, middle column; N_sl = 20 at DF = 5. Red diamonds correspond to same parameter choices as in Fig 7 for N_sl = 20, as well. The heat map represents the ratio μ/μ_exp between model-generated μ and mean duration μ_exp from the behavioral data. Regions of no alternations (na) are colored in gray. Mean durations are much longer than their experimental counterparts μ_exp (region in warm colors), much shorter than μ_exp (region in cool colors), or close to μ_exp (within one standard error to μ_exp; in green; black dots depict a discrete selection of values in the green region). The distributions of normalized percept durations are characterized by three distinct regions: for small σ_a the distributions are normal (region N, to the left of dashed-white line; α ≫ 3); for large σ_a the distributions are exponential (region E, to the right of solid-white line; α near 1); for intermediate values σ_a the distributions are gamma-like with shape close to that found experimentally (region G, between white contours; α ≈ 2 for first percepts and α ≈ 2.6 for subsequent percepts; α differs from α_exp by relative error up to 20% except for integration at DF = 7 where it is up to 30%). As in Fig 7, middle column, the intersection of white contours with the sheet of black dots identifies parameter values that yield well-fit data. Note that at N_sl = 1 this intersection is empty for both subsequent percepts I and S (panel B, first column).

https://doi.org/10.1371/journal.pcbi.1008152.s003

(TIF)

S4 Fig. The signal detection algorithm for constructing a neurometric function (the probability of segregation as a function of time) generates acceptable buildup fits at DF = 1, 3, 6, 9.

For comparison, see Micheyl et al (2005) [1]. Upper panel: mean spike counts m_t;DF (scatter points) at A-tone selective neurons in A1 during tone B were extracted from [1, (Fig.3A in Ref)]. They correspond to conditions DF = 1 (blue), 3 (green), 6 (red), 9 (cyan), based on 10 s (20 triplets) long trials. The mean spike counts decrease exponentially and stabilize within a few seconds (solid curves for the exponential fits). The algorithm generates spike counts during B-tone by using Poisson processes of means m_t;DF, and then average them over N_in neuronal units. The average values of the mean spike counts, including asymptotic values (written in parenthesis) at each DF, and the standard error to the mean (SEM) are computed over 675 trials. Lower panel: The signal detection algorithm constructs neurometric functions using numerical data from all N_in neuronal units. Parameters N_in and C_th are chosen to yield SEM similar to those observed in the spike count data [1, (Fig.3A in Ref)] and to yield the least-squares error of the experimental buildups (dashed, extracted from [1, (Fig.4 in Ref)] and the computer-simulated neurometric functions (solid) for DF = 1, 3, 6, 9. The best approximation is obtained for N_in = 30, C_th = 4.64. Note: Statistics of percept durations were not reported in [1]; this prevented us from comparing these aspects of behavioral data from [1] to our numerically-generated duration distributions at DF = 1, 3, 6, 9.

https://doi.org/10.1371/journal.pcbi.1008152.s004

(TIF)

S5 Fig. EVA model with fixed target-against value T_a across all conditions and percept types captures some, but not all, characteristics of perceptual alternations.

For comparison, mean durations and shape parameter α of gamma distributions are shown for A: Experimental data; B: EVA model with optimized values for target-against (see Methods, Parameter values used in model simulations). EVA-generated results are identical to those in Figs 5 and 6; and C: Non-optimized EVA simulated with T_a = 0.9 across all DF = 3, 5, 7 and first and subsequent I, S. All other parameters are as in panel B. The mean durations from simulations follow the trend of experimental data which is decreasing/increasing with DF for I/S respectively. However, they fail to approximate well the entire set of behavioral data (e.g. approximations of mean first durations at DF = 3 and DF = 7 are inaccurate). On the other hand, gamma-fit shape values α are comparable to those from panels A and B. This is not surprising given that α depends mostly on the noise-level σ_a, as shown in Fig 7.

https://doi.org/10.1371/journal.pcbi.1008152.s005

(TIF)

Acknowledgments

The authors thank Xiayi Wang, Haiming Chen and Kirill V. Nourski for help with data acquisition, and Dan Tranchina for suggestions on data analysis.

References

1. Micheyl C, Tian B, Carlyon R, Rauschecker J. Perceptual Organization of Tone Sequences in the Auditory Cortex of Awake Macaques. Neuron. 2005; 48:139–148. pmid:16202714
- View Article
- PubMed/NCBI
- Google Scholar
2. Pressnitzer D, Hupé JM. Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization. Current Biology. 2006; 16:1351–1357. pmid:16824924
- View Article
- PubMed/NCBI
- Google Scholar
3. Pressnitzer D, Sayles M, Micheyl C, Winter IM. Perceptual organization of sound begins in the auditory periphery. Current Biology. 2008; 18:1124–1128. pmid:18656355
- View Article
- PubMed/NCBI
- Google Scholar
4. Deike S, Heil P, Böckmann-Barthel M, Brechmann A. The build-up of auditory stream segregation: a different perspective. Frontiers in Psychology. 2012; 3:461:1–7.
- View Article
- Google Scholar
5. Barniv D, Nelken I. Auditory streaming as an online classification process with evidence accumulation. PLoS ONE. 2015; 10(12):e0144788:1–20.
- View Article
- Google Scholar
6. Rankin J, Sussman E, Rinzel J. Neuromechanistic model of auditory bistability. PLoS Computational Biology. 2015. pmid:26562507
- View Article
- PubMed/NCBI
- Google Scholar
7. Fishman YI, Reser DH, Arezzo JC, Steinschneider M. Neural correlates of auditory stream segregation in primary auditory cortex of the awake monkey. Hearing Research. 2001; 151:167–187. pmid:11124464
- View Article
- PubMed/NCBI
- Google Scholar
8. Fishman YI, Arezzo JC, Steinschneider M. Auditory stream segregation in monkey auditory cortex: effects of frequency separation, presentation rate, and tone duration. J Acoust Soc Am. 2004; 116(3):1656–1669. pmid:15478432
- View Article
- PubMed/NCBI
- Google Scholar
9. Curtu R, Wang X, Brunton BW, Nourski KV. Neural Signatures of Auditory Perceptual Bistability Revealed by Large-Scale Human Intracranial Recordings. Journal of Neuroscience. 2019; 39(33):6482–6497. pmid:31189576
- View Article
- PubMed/NCBI
- Google Scholar
10. van Noorden L. Temporal coherence in the perception of tone sequences. Doctoral Dissertation, Eindhoven University of Technology; 1975.
11. Beauvois M, Meddis R. Computer simulation of auditory stream segregation in alternating-tone sequences. J Acoust Soc Am. 1996; 99(4):2270–2280. pmid:8730073
- View Article
- PubMed/NCBI
- Google Scholar
12. Micheyl C, Carlyon RP, Gutschalk A, Melcher JR, Oxenham AJ, Rauschecker JP, et al. The role of auditory cortex in the formation of auditory streams. Hearing Research. 2007; 229:116–131. pmid:17307315
- View Article
- PubMed/NCBI
- Google Scholar
13. Krishnan L, Elhilali M, Shamma S. Segregating Complex Sound Sources through Temporal Coherence. PLoS Computational Biology. 2014; 10(12):e1003985:1–10.
- View Article
- Google Scholar
14. Mill RW, Böhm TM, Bendixen A, Winkler I, Denham SL. Modelling the Emergence and Dynamics of Perceptual Organisation in Auditory Streaming. PLoS Computational Biology. 2013; 9(3):e1002925. pmid:23516340
- View Article
- PubMed/NCBI
- Google Scholar
15. Almonte F, Jirsa V, Large E, Tuller B. Integration and segregation in auditory streaming. Physica D. 2005; 212:137–159.
- View Article
- Google Scholar
16. Wang D, Chang P. An oscillatory correlation model of auditory streaming. Cognitive Neurodynamics. 2008; 2(1):7–19. pmid:19003469
- View Article
- PubMed/NCBI
- Google Scholar
17. Steele S, Tranchina D, Rinzel J. An alternating renewal process describes the buildup of perceptual segregation. Frontiers in Computational Neuroscience. 2015; 8:166:1–13.
- View Article
- Google Scholar
18. Szabó BT, Denham SL, Winkler I. Computational Models of Auditory Scene Analysis: A Review. Frontiers in Neuroscience. 2016; 10:524:1–16.
- View Article
- Google Scholar
19. Rankin J, Rinzel J. Computational models of auditory perception from feature extraction to stream segregation and behavior. Current Opinion in Neurobiology. 2019; 58:46–53. pmid:31326723
- View Article
- PubMed/NCBI
- Google Scholar
20. Hill KT, Bishop CW, Miller LM. Auditory grouping mechanisms reflect a sound’s relative position in a sequence. Frontiers in Human Neuroscience. 2012; 6(158):1–7.
- View Article
- Google Scholar
21. Scholes C, Palmer A, Sumner CJ. Stream segregation in the anesthetized auditory cortex. Hearing Research. 2015; 328:48–58. pmid:26163899
- View Article
- PubMed/NCBI
- Google Scholar
22. Shpiro A, Moreno-Bote R, Rubin N, Rinzel J. Balance between noise and adaptation in competition models of perceptual bistability. Journal of Computational Neurosience. 2009; 27(1):37–54.
- View Article
- Google Scholar
23. Moreno-Bote R, Rinzel J, Rubin N. Noise-induced alternations in an attractor network model of perceptual bistability. Journal of Neurophysiology. 2007; 98:1125–1139. pmid:17615138
- View Article
- PubMed/NCBI
- Google Scholar
24. Cao R, Pastukhov A, Mattia M, Braun J. Collective activity of many bistable assemblies reproduces characteristic dynamics of multistable perception. Journal of Neuroscience. 2016; 36(26):6957–6972. pmid:27358454
- View Article
- PubMed/NCBI
- Google Scholar
25. Ermentrout GB, Terman DH. Mathematical Foundations of Neuroscience. Springer. 2010; 35.
- View Article
- Google Scholar
26. Green DM, Swets JA. Signal Detection Theory and Psychophysics. New York: Wiley; 1966.
27. Parker AJ, Newsome WT. Sense and the single neuron: probing the physiology of perception. Annual Review of Neuroscience. 1998; 21:227–277. pmid:9530497
- View Article
- PubMed/NCBI
- Google Scholar
28. Simen P, Rivest F, Ludvig EA, Balci F, Killeen P. Timescale Invariance in the Pacemaker-Accumulator Family of Timing Models. Timing &amp; Time Perception. 2013; 1:159–188.
- View Article
- Google Scholar
29. Cao R, Braun J, Mattia M. Stochastic accumulation by cortical columns may explain the scalar property of multi-stable perception. Physical Reviews Letters. 2014; 113(9):098103:1–5.
- View Article
- Google Scholar
30. Latimer K, Yates J, Meister M, Huk A, Pillow J. Single-trial spike trains in parietal cortex reveal discrete steps during decision-making. Science. 2015; 349 (6244):184–187. pmid:26160947
- View Article
- PubMed/NCBI
- Google Scholar
31. Shadlen MN, Kiani R, Newsome WT, Gold JI, Wolpert DM, Zylberberg A, et al. Comment on “Single-trial spike trains in parietal cortex reveal discrete steps during decision-making”. Science. 2016; 351 (6280):1406. pmid:27013723
- View Article
- PubMed/NCBI
- Google Scholar
32. Zoltowski DM, Latimer KW, Yates JL, Huk AC, Pillow JW. Discrete Stepping and Nonlinear Ramping Dynamics Underlie Spiking Responses of LIP Neurons during Decision-Making. Neuron. 2019; 102:1249–1258. pmid:31130330
- View Article
- PubMed/NCBI
- Google Scholar
33. Shpiro A, Curtu R, Rinzel J, Rubin N. Dynamical characteristics common to neuronal competition models. Journal of Neurophysiology. 2007; 97:462–473. pmid:17065254
- View Article
- PubMed/NCBI
- Google Scholar
34. Wang XJ, Rinzel J. Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Computation. 1992; 4:84–97.
- View Article
- Google Scholar
35. Skinner F, Kopell N, Marder E. Mechanisms for oscillation and frequency control in reciprocally inhibitory model neural networks. Journal of Computational Neurosience. 1994; 1:69–87.
- View Article
- Google Scholar
36. Curtu R, Shpiro A, Rubin N, Rinzel J. Mechanisms for frequency control in neuronal competition models. SIAM Journal on Applied Dynamical Systems. 2008; 7(2):609–649. pmid:20953287
- View Article
- PubMed/NCBI
- Google Scholar
37. Curtu R, Rubin J. Interaction of canard and singular Hopf mechanisms in a neural model. SIAM Journal on Applied Dynamical Systems. 2011; 10(4):1443–1479.
- View Article
- Google Scholar
38. Gigante G, Mattia M, Braun J, Giudice PD. Bistable Perception Modeled as Competing Stochastic Integrations at Two Levels. PLoS Computational Biology. 2009; 5 (7):e1000430:1–9.
- View Article
- Google Scholar
39. Rankin J, Osborn Popp PJ, Rinzel J. Simulus pauses and perturbations differentially delay or promote the segregation of auditory objects: psychoacoustics and modeling. Frontiers in Neuroscience. 2017; 11 (198):1–12.
- View Article
- Google Scholar
40. Bregman AS. Auditory scene analysis: the perceptual organization of sound. The MIT Press; 1990.
41. Novikov A, Kordzakhia N. Martingales and first passage times of AR(1) sequences. Stochastics: An International Journal of Probability and Stochastic Processes. 2008; 80 (2-3):197–210.
- View Article
- Google Scholar

[ref1] 1. Micheyl C, Tian B, Carlyon R, Rauschecker J. Perceptual Organization of Tone Sequences in the Auditory Cortex of Awake Macaques. Neuron. 2005; 48:139–148. pmid:16202714
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Pressnitzer D, Hupé JM. Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization. Current Biology. 2006; 16:1351–1357. pmid:16824924
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Pressnitzer D, Sayles M, Micheyl C, Winter IM. Perceptual organization of sound begins in the auditory periphery. Current Biology. 2008; 18:1124–1128. pmid:18656355
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Deike S, Heil P, Böckmann-Barthel M, Brechmann A. The build-up of auditory stream segregation: a different perspective. Frontiers in Psychology. 2012; 3:461:1–7.
View Article
Google Scholar

[14] View Article

[15] Google Scholar

[ref5] 5. Barniv D, Nelken I. Auditory streaming as an online classification process with evidence accumulation. PLoS ONE. 2015; 10(12):e0144788:1–20.
View Article
Google Scholar

[17] View Article

[18] Google Scholar

[ref6] 6. Rankin J, Sussman E, Rinzel J. Neuromechanistic model of auditory bistability. PLoS Computational Biology. 2015. pmid:26562507
View Article
PubMed/NCBI
Google Scholar

[20] View Article

[21] PubMed/NCBI

[22] Google Scholar

[ref7] 7. Fishman YI, Reser DH, Arezzo JC, Steinschneider M. Neural correlates of auditory stream segregation in primary auditory cortex of the awake monkey. Hearing Research. 2001; 151:167–187. pmid:11124464
View Article
PubMed/NCBI
Google Scholar

[24] View Article

[25] PubMed/NCBI

[26] Google Scholar

[ref8] 8. Fishman YI, Arezzo JC, Steinschneider M. Auditory stream segregation in monkey auditory cortex: effects of frequency separation, presentation rate, and tone duration. J Acoust Soc Am. 2004; 116(3):1656–1669. pmid:15478432
View Article
PubMed/NCBI
Google Scholar

[28] View Article

[29] PubMed/NCBI

[30] Google Scholar

[ref9] 9. Curtu R, Wang X, Brunton BW, Nourski KV. Neural Signatures of Auditory Perceptual Bistability Revealed by Large-Scale Human Intracranial Recordings. Journal of Neuroscience. 2019; 39(33):6482–6497. pmid:31189576
View Article
PubMed/NCBI
Google Scholar

[32] View Article

[33] PubMed/NCBI

[34] Google Scholar

[ref10] 10. van Noorden L. Temporal coherence in the perception of tone sequences. Doctoral Dissertation, Eindhoven University of Technology; 1975.

[ref11] 11. Beauvois M, Meddis R. Computer simulation of auditory stream segregation in alternating-tone sequences. J Acoust Soc Am. 1996; 99(4):2270–2280. pmid:8730073
View Article
PubMed/NCBI
Google Scholar

[37] View Article

[38] PubMed/NCBI

[39] Google Scholar

[ref12] 12. Micheyl C, Carlyon RP, Gutschalk A, Melcher JR, Oxenham AJ, Rauschecker JP, et al. The role of auditory cortex in the formation of auditory streams. Hearing Research. 2007; 229:116–131. pmid:17307315
View Article
PubMed/NCBI
Google Scholar

[41] View Article

[42] PubMed/NCBI

[43] Google Scholar

[ref13] 13. Krishnan L, Elhilali M, Shamma S. Segregating Complex Sound Sources through Temporal Coherence. PLoS Computational Biology. 2014; 10(12):e1003985:1–10.
View Article
Google Scholar

[45] View Article

[46] Google Scholar

[ref14] 14. Mill RW, Böhm TM, Bendixen A, Winkler I, Denham SL. Modelling the Emergence and Dynamics of Perceptual Organisation in Auditory Streaming. PLoS Computational Biology. 2013; 9(3):e1002925. pmid:23516340
View Article
PubMed/NCBI
Google Scholar

[48] View Article

[49] PubMed/NCBI

[50] Google Scholar

[ref15] 15. Almonte F, Jirsa V, Large E, Tuller B. Integration and segregation in auditory streaming. Physica D. 2005; 212:137–159.
View Article
Google Scholar

[52] View Article

[53] Google Scholar

[ref16] 16. Wang D, Chang P. An oscillatory correlation model of auditory streaming. Cognitive Neurodynamics. 2008; 2(1):7–19. pmid:19003469
View Article
PubMed/NCBI
Google Scholar

[55] View Article

[56] PubMed/NCBI

[57] Google Scholar

[ref17] 17. Steele S, Tranchina D, Rinzel J. An alternating renewal process describes the buildup of perceptual segregation. Frontiers in Computational Neuroscience. 2015; 8:166:1–13.
View Article
Google Scholar

[59] View Article

[60] Google Scholar

[ref18] 18. Szabó BT, Denham SL, Winkler I. Computational Models of Auditory Scene Analysis: A Review. Frontiers in Neuroscience. 2016; 10:524:1–16.
View Article
Google Scholar

[62] View Article

[63] Google Scholar

[ref19] 19. Rankin J, Rinzel J. Computational models of auditory perception from feature extraction to stream segregation and behavior. Current Opinion in Neurobiology. 2019; 58:46–53. pmid:31326723
View Article
PubMed/NCBI
Google Scholar

[65] View Article

[66] PubMed/NCBI

[67] Google Scholar

[ref20] 20. Hill KT, Bishop CW, Miller LM. Auditory grouping mechanisms reflect a sound’s relative position in a sequence. Frontiers in Human Neuroscience. 2012; 6(158):1–7.
View Article
Google Scholar

[69] View Article

[70] Google Scholar

[ref21] 21. Scholes C, Palmer A, Sumner CJ. Stream segregation in the anesthetized auditory cortex. Hearing Research. 2015; 328:48–58. pmid:26163899
View Article
PubMed/NCBI
Google Scholar

[72] View Article

[73] PubMed/NCBI

[74] Google Scholar

[ref22] 22. Shpiro A, Moreno-Bote R, Rubin N, Rinzel J. Balance between noise and adaptation in competition models of perceptual bistability. Journal of Computational Neurosience. 2009; 27(1):37–54.
View Article
Google Scholar

[76] View Article

[77] Google Scholar

[ref23] 23. Moreno-Bote R, Rinzel J, Rubin N. Noise-induced alternations in an attractor network model of perceptual bistability. Journal of Neurophysiology. 2007; 98:1125–1139. pmid:17615138
View Article
PubMed/NCBI
Google Scholar

[79] View Article

[80] PubMed/NCBI

[81] Google Scholar

[ref24] 24. Cao R, Pastukhov A, Mattia M, Braun J. Collective activity of many bistable assemblies reproduces characteristic dynamics of multistable perception. Journal of Neuroscience. 2016; 36(26):6957–6972. pmid:27358454
View Article
PubMed/NCBI
Google Scholar

[83] View Article

[84] PubMed/NCBI

[85] Google Scholar

[ref25] 25. Ermentrout GB, Terman DH. Mathematical Foundations of Neuroscience. Springer. 2010; 35.
View Article
Google Scholar

[87] View Article

[88] Google Scholar

[ref26] 26. Green DM, Swets JA. Signal Detection Theory and Psychophysics. New York: Wiley; 1966.

[ref27] 27. Parker AJ, Newsome WT. Sense and the single neuron: probing the physiology of perception. Annual Review of Neuroscience. 1998; 21:227–277. pmid:9530497
View Article
PubMed/NCBI
Google Scholar

[91] View Article

[92] PubMed/NCBI

[93] Google Scholar

[ref28] 28. Simen P, Rivest F, Ludvig EA, Balci F, Killeen P. Timescale Invariance in the Pacemaker-Accumulator Family of Timing Models. Timing &amp; Time Perception. 2013; 1:159–188.
View Article
Google Scholar

[95] View Article

[96] Google Scholar

[ref29] 29. Cao R, Braun J, Mattia M. Stochastic accumulation by cortical columns may explain the scalar property of multi-stable perception. Physical Reviews Letters. 2014; 113(9):098103:1–5.
View Article
Google Scholar

[98] View Article

[99] Google Scholar

[ref30] 30. Latimer K, Yates J, Meister M, Huk A, Pillow J. Single-trial spike trains in parietal cortex reveal discrete steps during decision-making. Science. 2015; 349 (6244):184–187. pmid:26160947
View Article
PubMed/NCBI
Google Scholar

[101] View Article

[102] PubMed/NCBI

[103] Google Scholar

[ref31] 31. Shadlen MN, Kiani R, Newsome WT, Gold JI, Wolpert DM, Zylberberg A, et al. Comment on “Single-trial spike trains in parietal cortex reveal discrete steps during decision-making”. Science. 2016; 351 (6280):1406. pmid:27013723
View Article
PubMed/NCBI
Google Scholar

[105] View Article

[106] PubMed/NCBI

[107] Google Scholar

[ref32] 32. Zoltowski DM, Latimer KW, Yates JL, Huk AC, Pillow JW. Discrete Stepping and Nonlinear Ramping Dynamics Underlie Spiking Responses of LIP Neurons during Decision-Making. Neuron. 2019; 102:1249–1258. pmid:31130330
View Article
PubMed/NCBI
Google Scholar

[109] View Article

[110] PubMed/NCBI

[111] Google Scholar

[ref33] 33. Shpiro A, Curtu R, Rinzel J, Rubin N. Dynamical characteristics common to neuronal competition models. Journal of Neurophysiology. 2007; 97:462–473. pmid:17065254
View Article
PubMed/NCBI
Google Scholar

[113] View Article

[114] PubMed/NCBI

[115] Google Scholar

[ref34] 34. Wang XJ, Rinzel J. Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Computation. 1992; 4:84–97.
View Article
Google Scholar

[117] View Article

[118] Google Scholar

[ref35] 35. Skinner F, Kopell N, Marder E. Mechanisms for oscillation and frequency control in reciprocally inhibitory model neural networks. Journal of Computational Neurosience. 1994; 1:69–87.
View Article
Google Scholar

[120] View Article

[121] Google Scholar

[ref36] 36. Curtu R, Shpiro A, Rubin N, Rinzel J. Mechanisms for frequency control in neuronal competition models. SIAM Journal on Applied Dynamical Systems. 2008; 7(2):609–649. pmid:20953287
View Article
PubMed/NCBI
Google Scholar

[123] View Article

[124] PubMed/NCBI

[125] Google Scholar

[ref37] 37. Curtu R, Rubin J. Interaction of canard and singular Hopf mechanisms in a neural model. SIAM Journal on Applied Dynamical Systems. 2011; 10(4):1443–1479.
View Article
Google Scholar

[127] View Article

[128] Google Scholar

[ref38] 38. Gigante G, Mattia M, Braun J, Giudice PD. Bistable Perception Modeled as Competing Stochastic Integrations at Two Levels. PLoS Computational Biology. 2009; 5 (7):e1000430:1–9.
View Article
Google Scholar

[130] View Article

[131] Google Scholar

[ref39] 39. Rankin J, Osborn Popp PJ, Rinzel J. Simulus pauses and perturbations differentially delay or promote the segregation of auditory objects: psychoacoustics and modeling. Frontiers in Neuroscience. 2017; 11 (198):1–12.
View Article
Google Scholar

[133] View Article

[134] Google Scholar

[ref40] 40. Bregman AS. Auditory scene analysis: the perceptual organization of sound. The MIT Press; 1990.

[ref41] 41. Novikov A, Kordzakhia N. Martingales and first passage times of AR(1) sequences. Stochastics: An International Journal of Probability and Stochastic Processes. 2008; 80 (2-3):197–210.
View Article
Google Scholar

[137] View Article

[138] Google Scholar

Figures

Abstract

Author summary

Introduction

Results

Auditory triplet-streaming

Experimental protocol.

Behavioral task performance.

Auditory streaming as an evidence accumulation process

A basic state-dependent model for evidence accumulation.

Linking neural data with behavioral data in the EVA framework.

The EVA model.

EVA model captures DF-dependence of mean durations.

EVA-modeled first and subsequent percept durations match observations.

EVA model captures DF-dependence of stream segregation buildup.

Computational advantages of the EVA model.

Signal detection algorithm yields fast buildup and unrealistic percept durations

Modeling the buildup with Micheyl’s model for auditory streaming.

Modeling percept durations with Micheyl’s model.

Discussion

Novel features

Duration distributions underlie buildup

Fitting of model to data

Comparison with other models

Limitations, extensions, predictions

Methods

Experimental design and statistical analyses

Participants.

Stimuli.

Experimental protocol.

Build-up functions and the latency period.

First durations and subsequent durations.

Distribution testing of phase durations.

Statistical analysis of model-generated data.

The evidence accumulation model

Input layer (IL).

Sampler layer (SL).

Accumulation layer (ACC).

Switches and resetting conditions.

Model analysis

Parameter values used in model simulations.

Selection of target-against Ta values.

Statistical properties of SL-activation.

Selection of noise level σa.

Statistical properties of ACC-activation.

Data-driven parameters for EVA model.

EVA model versus classical drift-diffusion models

Supporting information

S1 Fig. The evidence accumulation (EVA) model captures experimental mean duration (μ) and shape of gamma-like distributions (α) for first and subsequent percept durations at other DF values.

S2 Fig. Exemplar time courses of accumulators in the EVA model, shown for DF = 3 (top) and DF = 7 (bottom).

S3 Fig. Mean percept durations (μ) and shape parameter values (α) in the EVA model are largely unaffected by changes in Nsl, the number of units in the sampler layer.

S4 Fig. The signal detection algorithm for constructing a neurometric function (the probability of segregation as a function of time) generates acceptable buildup fits at DF = 1, 3, 6, 9.

S5 Fig. EVA model with fixed target-against value Ta across all conditions and percept types captures some, but not all, characteristics of perceptual alternations.

Acknowledgments

References

Selection of target-against T_a values.

Selection of noise level σ_a.

S3 Fig. Mean percept durations (μ) and shape parameter values (α) in the EVA model are largely unaffected by changes in N_sl, the number of units in the sampler layer.

S5 Fig. EVA model with fixed target-against value T_a across all conditions and percept types captures some, but not all, characteristics of perceptual alternations.