The audio-visual integration effect on music emotion: Behavioral and physiological evidence

Fada Pan; Li Zhang; Yuhong Ou; Xinni Zhang

doi:10.1371/journal.pone.0217040

Abstract

Previous research has indicated that, compared to audio-only presentation, audio-visual congruent presentation can lead to a more intense emotional response. In the present study, we investigated the audio-visual integration effect on emotions elicited by positive or negative music and the role of visual information presentation durations. The participants were presented with audio-only condition, audio-visual congruent condition, and audio-visual incongruent condition and then required to judge the intensity of emotional experience elicited by the music. Their emotional responses to the music were measured using self-ratings and physiological aspects, including heart rate, skin temperature, EMG root mean square and prefrontal EEG. Relative to the audio-only presentation, the audio-visual congruent presentation led to a more intense emotional response. More importantly, the audio-visual integration occurred both in the positive music and in the negative music. Furthermore, the audio-visual integration effect was larger for positive music than for negative music; meanwhile the audio-visual integration effect was strongest with the visual information presented within 80s for negative music, which indicated that this integration effect was more likely to occur in the negative music. These results suggest that when the music was positive, the effect of audio-visual integration was greater. When the music was negative, the modulation effect of the presentation durations of visual information on the music-induced emotion was more significant.

Figures

Citation: Pan F, Zhang L, Ou Y, Zhang X (2019) The audio-visual integration effect on music emotion: Behavioral and physiological evidence. PLoS ONE 14(5): e0217040. https://doi.org/10.1371/journal.pone.0217040

Editor: Lutz Jäncke, University of Zurich, SWITZERLAND

Received: November 14, 2018; Accepted: May 5, 2019; Published: May 30, 2019

Copyright: © 2019 Pan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: Fada Pan was supported by Grant number: 14SHC006, The Social Sciences Foundation of Jiangsu Province, China. http://jspopss.jschina.com.cn/; and Grant number: C-a/2013/01/003, The Program for Twelfth Five-Year Plan of Educational Science of Jiangsu Province, China. http://www.jssghb.cn/. Li Zhang was supported by Grant number: KYCX18_2389, Postgraduate Research & Practice Innovation Program of Jiangsu Province. http://jyt.jiangsu.gov.cn/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

In our daily life, we perceive the external world by processing information from multiple sensory modalities involving vision, hearing, touch, and so on. For example, when we hold a cup in our hands, both vision and touch provide meaningful information about the shape of the cup. This phenomenon is known as multisensory integration, whereby stimuli from multiple sensory modalities interact to form a coherent and meaningful representation [1]. The McGurk effect demonstrated the multisensory integration effect (i.e., audio-visual integration). McGurk and MacDonald found that when a film of a young woman repeating the syllable [ba] was dubbed on to the lip movements as [ga], individuals also reported hearing [da] [2]. After this study, many scholars started to examine the neural mechanism for audio-visual integration. Calvert et al. found that the audio-visual integration effects started at the early stage [3]. Exploring this problem further, Calvert et al. investigated the neural mechanism for audio-visual integration again and identified an area of the heteromodal cortex in the left superior temporal sulcus that exhibited significant supra-additive response enhancement to match audio-visual inputs [4]. In recent years, more and more electrophysiological studies have shown that audio-visual integration occurred at multiple neural areas, such as the superior temporal gyrus [5, 6], the left inferior frontal gyrus, the left somatosensory association cortex, and the left supramarginal cortex [7].

A large number of studies showed that the information, including emotional information, from different sensory modalities could be integrated. The McGurk effect was also suitable for the study of audio-visual integration in emotion. A study by de Gelder explored the combination of emotional information from facial expression and voice tone [8]. The results showed that the judgment of facial emotion was affected by the auditory information and that the effective integration would occur even though the emotional information from visual and auditory modalities was conflicted. The same effect was more obvious in the judgment of fear. When accompanied by a fearful voice, the subjective self-rating expressed more fear whether the visual stimulus was fearful or neutral. However, no such influence on the rating of faces was found for happy voices [9].

Research over the past two decades demonstrated that numerous cortical and subcortical brain regions, such as the superior colliculus (SC), the superior temporal sulcus (STS), and the parietal, premotor and prefrontal cortices, were involved in audio-visual integration [10]. However, at least two questions remained unanswered: 1) What was the neural mechanism for audio-visual integration in emotion? and 2) At which stage of cognitive processing did integration take place, and which brain structure was related to it? A large number of studies showed that the integrated area of emotional face-sound was related to the middle temporal gyrus (MTG) [9, 11, 12], the posterior superior temporal sulcus (pSTS) [9], and the posterior superior temporal gyrus (pSTG) [13]. Moreover, compared with stimuli from the single modality, audio-visual stimuli also activated the thalamus [13]. Furthermore, the activation for fear-face and fear-voice was the amygdala [14], whereas the activation for happy-face and happy-voice was the hippocampus, the claustrum, and the inferior parietal lobule. In sum, the integration of different emotional information activate common as well as specific brain regions [11].

Further studies showed that audio-visual integration has certain time properties. According to the experimental results, the integration of auditory and visual information was extremely fast and automated, and the interaction of visual and auditory information was likely to occur at an early stage [15, 16]. Frassinetti et al. found that the effect of audio-visual integration was enhanced only when stimuli in the two different modalities were presented synchronously [15]. In other words, the visual and auditory stimuli must be presented at the same time for the enhancement of audio-visual integration effect to occur. However, previous research suggested that it was not necessary to present the visual and auditory information synchronously to induce the audio-visual integration effect. This effect could also appeared when the stimuli from two sensory modalities subsequently input in a short interval [16, 17]. Moreover, the time window, which was typically between 150 and 450ms, depended on the experimental material and the experimental paradigm [16]. The latest study used BOLD-FMRI to research the neural substrates involved in processing temporal synchrony and asynchrony with audio-visual signals. The results showed that S-mSTC (Synchrony-defined multisensory superior temporal cortex) responded only to synchronous audiovisual stimulus presentations, with no activation observed at asynchrony levels greater than or equal to 100ms, while B-mSTC showed activation with any multisensory audiovisual speech stimulus and increased activation as stimulus asynchrony increased [18]. These results showed that the audio-visual integration effects were diverse at different time windows, which activated specific brain regions.

At the same time, the audio-visual integration of emotional information also occurred at the early stage of processing. Pourtois et al. suggested that the combination of the two sensory modalities occurred in a short time window (110ms) and translated as a specific enhancement in the N1 component [19]. Another study found that the combination of the audio-visual sensory modalities translated as a reduction in N1, P2, and N3 components [20, 21]. In other words, the audio-visual integration occurred at about 100ms. Similarly, Chen et al. pointed out that N1 and P3 amplitudes were larger for the bimodal-change conditions [22]. The results suggested that the facial-vocal integration during emotional change perception was subserved by at least two processes denoted by N1 and P3 [22] and that the time of audio-visual integration was about 300ms. Overall, the audio-visual integration happened within 300ms, but the time window was affected by the task requirements, such that the integration might take longer (i.e., over 300ms) if different tasks were employed.

Research over the past two decades has shown that the perception of musical emotion could be influenced by visual cues, such as body postures, gestures, and facial expressions [22, 23]. When the music and visual information were presented synchronously, it was unclear whether the emotional information from the two sensory modalities would be potentially integrated or not. Some studies demonstrated that the audio-visual integration would occur in the musical emotion expression, while other studies indicated that the two sensory modalities would not be integrated. Vines et al. [24] and Vuoskoski et al. [25] found that AO (audio-only presentation) elicted the strongest emotional reaction in the galvanic skin response and that there was no significant difference between AV (audio-visual presentation) and AO. These results indicated that audio-visual presentation did not induce a higher emotional response than unimodal (i.e., visual-only or audio-only) presentation. In other words, musical and visual information would not be integrated. However, additional research offered different results. Chapados and Levitin discovered that there was a significant effect between AV and AO, and between AV and VO (visual-only presentation) in the galvanic skin response. This suggested that the interaction effect between the multiple sensory modalities (audio and visual sensory modalities) was larger than the effect caused by the single sensory modality in the expression of music communication [23]. Moreover, Platz conducted a meta-analysis to examine the emotional expression elicited by music and supported the result that audio-visual presentations enhanced musical appreciation [26]. In other words, music and visual information were integrated. Some studies also discovered the integration effect even when musical emotion and visual emotion were different. Weijkamp and Sadakata pointed out that there was an interference effect when the stimuli from the visual sensory modality and the auditory sensory modality were incongruent [27]. Recently, researchers using FMRI technology analyzed this phenomenon. Jeong et al. observed that congruence and incongruence of auditory and visual stimuli might be integrated through increments and decrements in neuronal activity in the superior temporal gyrus (STG) and the fusiform gyrus (FG). There was increased activation in the STG under the congruent condition. However, there was decreased activation in the STG and increased activation in the FG under the incongruent condition [28]. To summarize, most research had verified that music-induced emotion was influenced by visual stimulation. However, future research is required to explain whether the two sensory modalities inputs would be integrated.

It should be mentioned that there are two limitations in the previous studies regarding the audio-visual integration effect on the emotion elicited by music. First, there is no separation between positive and negative emotions when studying whether visual information can influence music-induced emotion. Some studies on multisensory processing have indicated that all bimodal conditions could induce significantly strong activation in the superior temporal gyrus (STS), the inferior frontal gyrus (IFG), and the parahippocampal gyrus, including the amygdala. By contrast, some studies suggested that different emotions had different mechanisms of emotional processing. For instance, the structures activated by happy audio-visual pairs were mainly lateralized in the left hemisphere, whereas the structures activated by fearful audio-visual pairs were lateralized in the right hemisphere [12]. Above all, emotional activation regions imply that all types of emotional information not only have a common neural basis but also have unique integration characteristics. Therefore, the different types of emotion should be separately processed when we explore the audio-visual integration effects on emotions. Second, according to previous studies, we found that the audio-visual integration effect on both cognitive and emotional processing had certain time properties. Previous studies have indicated that when the two sensory modalities were presented at the same time, or the time window was at 300ms, the information from the two sensory modalities would be integrated. This is also verified in musical emotion. Thus, further research is needed to determine whether a stronger integration effect occurs when the presentation durations of visual information are longer.

In brief, the present study aims to investigate the audio-visual consistency effects in emotions elicited by positive and negative music. Furthermore, we will examine whether presentation durations of visual information can modulate the audio-visual integration effects elicited by music. In general, we hypothesize that the congruence of auditory and visual emotions will promote the inducing of the emotion. Additionally, we expect that the presentation durations of visual information will modulate the effect of music-induced emotion for audio-visual integration.

Experiment 1

“This study was approved by the Human Ethics Committee of Nantong University. A written informed consent was obtained from all participants, each of whom was given a small gift in return for their participation.”

Methods

Participants.

Thirty-two healthy undergraduate students (16 males and 16 females ranging from 17 to 26 years of age, with an average age of 20.75 years) volunteered to participate in the present study. All participants were right-handed, had no mental illness, color blindness or color weakness, and had normal or corrected-to-normal vision.

Stimulus material.

Visual stimuli. The pictures in the present study were selected from the Chinese Facial Affective Picture System [29]. We obtained the ratings of the Arousal and Valence from this database (Chinese Facial Affective Picture System) and then tested the significance of differences between the positive and negative pictures. We obtained the ratings of the Arousal and Valence from this database (Chinese Facial Affective Picture System) on a 9-point scale and then tested the significance of differences between the positive and negative pictures. According to t-test analysis, the average valence ratings between the positive (i.e., happy) pictures (M = 7.44) and the negative (i.e., sad) pictures (M = 2.20) were significantly different (t = 12.65, p < 0.01). In addition, the average arousal ratings of positive (i.e., happy) and negative (i.e., sad) pictures were 7.16 and 6.67, respectively. The result of the t-test showed no significant difference on the average arousal ratings between these two groups of pictures (t = 0.82, p > 0.05). When the participants did the audio-only task, the visual stimuli was a 35% gray level picture. All pictures were presented in jpg format, at a resolution of 260 × 300 pixels. These visual stimuli were presented for 120 seconds at the center of a 35% gray level computer background.

Auditory stimuli. Four pieces of music were selected, including two positive (i.e., happy) music clips and two negative (i.e., sad) music clips. The happy music was Piano Concerto No. 23 in A Major, K. 488-(2), by Wolfgang Amadeus Mozart, and Brandenburg concerto No. 2, by Johann Sebastian Bach. The sad music was Adagio, by Tomaso Albinoni, and Kol Nidrei, by Max Bruch. Before the formal experiment, 15 participants were randomly selected to evaluate the music pieces. The valence and arousal were scored with a 9-point scale for music pieces. According to the t-test analysis, the average valence ratings of positive (i.e., happy) and negative (i.e., sad) music clips were 6.43 and 2.80, respectively. The result of the t-test showed that the difference between them was significant (t = 9.19, p < 0.001). However, there was no significant difference on the average arousal ratings between the positive (i.e., happy) music clips (M = 6.83) and the negative (i.e., sad) music clips (M = 6.80) (t = -0.08, p > 0.05). All the music segments were presented for 120 seconds in MP4 format.

Subjective self-rating scale. On the 9-point self-rating scale for the degree of happiness, “1” represented “not happy at all,” “5” represented the middle degree of happiness, and “9” represented “very happy.” This is similar to the self-rating for the degree of sadness, where “1” represented “not sad at all,” “5” represented the middle degree of sadness, and “9” represented “very sad.”

Procedure.

In Experiment 1, a 2 (music emotion type: positive emotion, negative emotion) × 3 (audio-visual congruence: audio-visual congruent, audio-visual incongruent, control group) ×2 (valence: positive, negative) factorial design was applied. Valence is the extent to which an individual is made happy or sad by the stimuli. In this present, it included happiness and sadness. The experimental stimuli consisted of audio-visual congruence (AV-C), audio-visual incongruence (AV-I) and control group (audio-only: AO) versions. For the stimuli of audio-visual congruence, the audio and visual emotions were congruent (e.g., happy face and happy music). Specifically, when the happy music were presented, the participants would see a happy face at the center of 35% gray level computer background. For the stimuli of audio-visual incongruence, the audio and visual emotions were incongruent (e.g., happy face and sad music). Specifically, when the sad music were presented, the participants would see a happy face at the center of 35% gray level computer background. For the stimuli of control group (audio-only), when the music were presented, the participants would see a 35% gray level computer background. The duration of each stimulus was 120 seconds. The experimental procedure and stimulus presentation were edited with PowerPoint 2016.

Before the formal experiment, participants were required to sit and rest for 5 minutes in front of the computer screen. They were then informed about the experimental tasks and signed the consent form. Then, the physiological sensors were connected to the participant’s non-dominant hand. When all signal indicators were stable, asking the participants to press the “start” button to start the experiment. In the experiment, music was played through the “Mi” noise-canceling headphones (model: type-C), while it was played with emotional face or grey picture simultaneously. Then, the participants judged the emotion induced by music after playing the music, and scored 1–9 grades for the the happiness ratings and sadness ratings (lasting 60 seconds). At the end of judgment, the participants were asked to complete 10 arithmetic questions (lasting 60 seconds), which enabled them to return to a calm, natural state and prepare for the next stimulus.

To balance the differences of the stimuli and reduce practice effects, we set up that each piece of music corresponds to a happy facial picture, a sad facial picture and a 35% gray picture. Twelve trials were presented in the study, and the length of the study was approximately 48 minutes.

Bio-signal recording.

The equipment used was an MP150 (Biopac System, U.S) to measure all ECG (Electrocardiography), EMG (Electromyography), SKT, and EEG (electroencephalogram) signals. For ECG signals, the amplifier gain was set at 500, the high-pass filter was set at 0.5 Hz, and the low-pass filter was set at 35 Hz. Two electrodes were attached to the heart on the up and down. The white lead was slightly upward on the sword process. The red lead was attached to the left 3–4 ribs. The black lead was the ground pole and attached on the right side of the navel. For EMG signals, the amplifier was set at 2000, the high-pass filter was set at 1 Hz, and the low-pass filter was at 100 Hz HPN OFF. Two electrodes were attached to the left hand to collect electromyography. For SKT, the amplifier was set at 1000, the high-pass filter was set at DC, and the low-pass filter was at 10 Hz. The skin temperature sensor was attached behind the neck. For EEG signals, the amplifier was set at 5000, the high-pass filter was set at 0.5 Hz, and the low-pass filter was at 35 Hz. Two electrodes were attached at the frontal area.

Data analysis.

For ECG (Electrocardiography), EMG (Electromyography) and SKT, we chose the median of one minute in each musical segment to test the difference. For the electroencephalogram (EEG), we filtered the frequency of each type of brain wave, then copied the waves of the median of 1 minute in each musical segment. Then we took the waves to Fast Fourier Transform (FFT) to obtain the maximum power. Repeated-measures ANOVAs were conducted in SPSS 22.0.

Results

Subjective self-ratings.

A 2 (music type: positive music, negative music) ×3 (audio-visual congruence: audio-visual congruent, audio-visual incongruent, control group) ×2 (valence: positive, negative) repeated-measures ANOVA was performed on the subjective self-ratings. A significant effect of music type was observed, F(1,31) = 4.78, p < 0.05, η² = 0.14. The subjective self-ratings for positive music were significantly lower than those for negative music. The main effect of valence was not significant, F(1,31) = 0.18, p > 0.05. The main effect of audio-visual was also not significant, F(2,62) = 2.83, p > 0.05.

The three-way interaction of music type, audio-visual congruence and valence was significant, F(2, 62) = 38.67, p < 0.001, η² = 0.56. After studying the interaction further, the interaction between music type and audio-visual congruence was significant in the positive valence, F(2, 62) = 22.35, p < 0.001, η² = 0.42. Further examination of the interaction between music type and audio-visual congruence revealed a significant difference between audio-visual congruence when the music was negative, F(2, 62) = 7.02, p < 0.01, η² = 0.32. Specifically, the self-ratings of the condition of audio-visual incongruence and control group were higher than the condition of audio-visual congruence. Meanwhile, it showed an even larger effect of audio-visual congruence for positive music, F(2, 62) = 19.43, p < 0.001, η² = 0.56. Specifically, the self-ratings of the condition of audio-visual incongruence and audio-visual congruence were higher than the condition of control group.

Meanwhile, the interaction between music type and audio-visual congruence was significant in the negative valence, F(2, 62) = 35.04, p < 0.001, η² = 0.53. Further examination of the interaction between music type and audio-visual congruence revealed a significant difference between audio-visual congruence when the music was negative, F(2, 62) = 12.34, p < 0.001, η² = 0.45. Specifically, the self-ratings of the condition of audio-visual congruence and incongruence were lower than the condition of control group. Meanwhile, it showed an even larger effect of audio-visual congruence for positive music, F(2, 62) = 29.22, p < 0.001, η² = 0.66. Specifically, the self-ratings of the condition of audio-visual incongruence and control group were lower than the condition of audio-visual congruence.

These results revealed that the positive music induced significant positive emotion and the negative music induced significant negative emotion. Moreover, this indicated that the difference of audio-visual congruence was significant in both positive and negative music conditions, and the difference among the three conditions (audio-visual congruence, audio-visual incongruence and audio only) was larger in positive music than that in negative music. More importantly, the strength of individuals’ emotional experienced intensity when the facial emotion and musical emotion were congruent. However, when the music was positive, the negative facial emotion could also enhance individuals’ emotional experienced intensity, which the positive facial emotion didn’t have significant effect when the music was negative (Fig 1).

Download:

Fig 1. Means self-ratings of music-induced emotion as a function of music type and audio-visual congruence.

Fig 1A displays the results of Repeated measures analysis in the self-rating scores of happiness. Fig 1B displays the results of Repeated measures analysis in the self-rating scores of sadness.

https://doi.org/10.1371/journal.pone.0217040.g001

Heart rate.

A 2 (music type: positive emotion, negative emotion) × 3 (audio-visual congruence: audio-visual congruent, audio-visual incongruent, control group) repeated-measures ANOVA was performed on the heart rates. A significant effect of audio-visual congruence was observed, F(2,62) = 5.01, p < 0.05, η² = 0.14. The heart rates for the audio-visual congruent condition were significantly lower than those for the control group, while the heart rates for the audio-visual incongruent condition were significantly higher than those for the control group. This suggested that the visual emotional information had a significant effect on music-induced emotion. That means, when the valence of visual stimuli was congruent with the musical stimuli, it would decrease the heart rates and enhance the emotional experience. Oppositely, it would enhance the heart rates and decrease the experience. However, the main effect of music type was not significant, F(1, 31) = 3.29, p > 0.05.

The interaction between music type and audio-visual congruence was significant, F(2, 62) = 4.17, p < 0.05, η² = 0.12. Simple-effect tests showed a significant difference among the three conditions (audio-visual congruence, audio-visual incongruence and audio only) when the music was negative, F(2, 62) = 14.33, p < 0.001, η² = 0.49, with the heart rates lower for audio-visual congruence than for audio-visual incongruence. The audio-visual congruence effects were not significant for the positive music, F(2, 62) = 2.48, p > 0.05 (Fig 2A).

Download:

Fig 2. Means physiological indicators of music-induced emotion as a function of music type and audio-visual congruence.

Fig 2A displays the results of Repeated measures analysis in heart rates. Fig 2B displays the results of Repeated measures analysis in skin temperatures. Fig 2C displays the results of Repeated measures analysis in EMG root mean squares.

https://doi.org/10.1371/journal.pone.0217040.g002

Skin temperature.

The repeated measures ANOVAs on skin temperature showed that the main effect of music type was not significant, F(1, 31) = 0.001, p > 0.05, and the main effect of audio-visual congruence was also not significant, F(2, 62) = 0.69, p > 0.05. The interaction between music type and audio-visual congruence was not significant, F(2, 62) = 0.35, p > 0.05 (Fig 2B).

EMG root mean square.

The repeated measures ANOVAs on EMG root mean square showed the main effect of music type was not significant, F(1, 31) = 2.60, p > 0.05, and the main effect of audio-visual congruence was also not significant, F(2, 62) = 0.96, p > 0.05. The interaction between music type and audio-visual congruence was significant, F(2, 62) = 6.24, p < 0.01, η² = 0.17. Simple effect analysis showed that the audio-visual congruence effects were only significant for the positive music, F(2, 62) = 3.89, p < 0.05, η² = 0.21, with the root mean square larger for audio-visual congruence than for incongruence. The audio-visual congruence effects were not significant for the negative music, F(2, 62) = 2.96, p > 0.05 (Fig 2C).

Prefrontal EEG.

The analysis of alpha power showed a significant main effect of music type, F(1, 31) = 5.05, p < 0.05, η² = 0.14, and a significant interaction between music type and audio-visual congruence, F(2, 62) = 4.49, p < 0.05, η² = 0.13. However, the main effect of audio-visual congruence was not significant, F(2, 62) = 2.45, p > 0.05. Simple effect analysis showed the audio-visual congruence effects were only significant for the positive music, F(2, 62) = 9.08, p < 0.001, η² = 0.38, with the alpha power lower for audio-visual congruence than for audio-visual incongruence. The audio-visual congruence effects were not significant for the negative music, F(2, 62) = 2.68, p > 0.05 (Fig 3A).

Download:

Fig 3. Means EEG power of music-induced emotion as a function of music type and audio-visual congruence.

Fig 3A displays the results of Repeated measures analysis in prefrontal alpha power. Fig 3B displays the results of Repeated measures analysis in prefrontal beta power. Fig 3C displays the results of Repeated measures analysis in prefrontal theta power. Fig 3D displays the results of Repeated measures analysis in prefrontal delta power.

https://doi.org/10.1371/journal.pone.0217040.g003

The repeated measures ANOVAs on beta power showed the main effect of music type was not significant, F(1, 31) = 0.22, p > 0.05, and the main effect of audio-visual congruence was also not significant, F(2, 62) = 0.88, p > 0.05. The interaction between music type and audio-visual congruence was significant, F(2, 62) = 4.81, p < 0.05, η² = 0.13. Simple effect analysis showed that the audio-visual congruence effects were only significant for the positive music, F(2, 62) = 2.96, p = 0.06, η² = 0.17, with the beta power larger for audio-visual congruence than for incongruence. The audio-visual congruence effects were not significant for the negative music, F(2, 62) = 1.45, p > 0.05 (Fig 3B).

The repeated measures ANOVAs on theta power showed the main effect of music type was not significant, F(1, 31) = 0.04, p > 0.05, and the main effect of audio-visual congruence was also not significant, F(2, 62) = 1.53, p > 0.05. The interaction between music type and audio-visual congruence was significant, F(2, 62) = 3.63, p < 0.05, η² = 0.11. Simple effect analysis showed that the audio-visual congruence effects were only significant for the negative music, F(2, 62) = 6.01, p < 0.01, η² = 0.29, with the theta power lower for audio-visual congruence than for incongruence. The audio-visual congruence effects were not significant for the positive music, F(2, 62) = 0.94, p > 0.05 (Fig 3C).

The repeated measures ANOVAs on delta power showed the main effect of music type was not significant, F(1, 31) = 0.02, p > 0.05, and the main effect of audio-visual congruence was also not significant, F(2, 62) = 0.98, p > 0.05. The interaction between music type and audio-visual congruence was significant, F(2, 62) = 5.31, p < 0.01, η² = 0.15. Simple effect analysis showed that the audio-visual congruence effects were only significant for the negative music, F(2, 62) = 4.13, p < 0.05, η² = 0.22, with the delta power lower for audio-visual congruence than for incongruence. The audio-visual congruence effects were not significant for the positive music, F(2, 62) = 2.16, p > 0.05 (Fig 3D).

Discussion

Experiment 1 was designed to explore the effect of visual information on music-induced emotion in different emotional states. The results of Experiment 1 revealed that when presented with a visual stimulus (i.e., face) and an auditory stimulus (i.e., music) synchronously, participants began to combine the two sensory modalities of information. The combination was manifested in both the subjective responses and the physiological aspects. More concretely, the differences of audio-visual congruence were significant both in positive and negative emotional conditions, and there was a larger difference between the audio-visual congruent and incongruent conditions for positive music than for negative music in the subjective responses. Similar results were found in the alpha power, beta power of the prefrontal area, and the EMG root mean square, which means that the difference among audio-visual congruence, control group and audio-visual incongruence was significant when the auditory stimulus is positive rather than negative. However, we obtained the opposite results for heart rate where the difference the three conditions (audio-visual congruence, audio-visual incongruence and audio only) was significant when the auditory stimulus was negative rather than positive. Moreover, these results revealed that when the music and facial emotions were congruent, individuals would evoke more intense emotion. Except the theta power and delta power of prefrontal areas, the other indexes showed that when the music and facial emotions were incongruent, the visual information had no influence on the music-induced emotion.

On the one hand, these results revealed that the audio-visual integration effect occurred in the music-induced response. Additionally, the results were consistent with previous studies. For example, Chapados and Levitin found that the interaction effect of auditory and visual stimuli was greater than the effect of a single sensory modality in the musical performance [23]. Chen et al. also supported the view that bimodal emotional changes were detected with shorter response latencies compared to unimodal condition [22]. This suggested that bimodal emotional cues facilitated emotional change detection and that an integration effect of bimodal emotional cues occurred. On the other hand, we found that the audio-visual integration effect was larger for the positive music. According to the attention distribution theory, positive emotion can broaden an individual’s breadth of attention, while negative emotions tend to narrow an individual’s breadth of attention [30]. Thus, the positive music broadened the range of attention [31] and the participants were more likely to be attracted by the visual information. In other words, when the musical and facial emotions were both positive, enhancing the combination of visual and auditory sensory stimuli, it became easier for the participants to perceive the emotional information from the facial picture.

Experiment 2

“This study was approved by the Human Ethics Committee of Nantong University. Written informed consent was obtained from all participants, each of whom was given a small gift in return for his or her participation.”