manuscriptedpod - Tumblr blog

manuscriptedpod · 4 years ago

Audio

When scenes speak louder than words: Verbal encoding does not mediate the relationship between scene meaning and visual attention

Manuscript authors: Gwendolyn Rehrig, Taylor R. Hayes, John M. Henderson, and Fernanda Ferreira

Read aloud by the first author. Please refer to the manuscript documents linked below for in-text citations, references, correspondence information, author affiliations, and figures.

Published manuscript: https://link.springer.com/article/10.3758/s13421-020-01050-4

DOI: doi:10.3758/s13421-020-01050-4

Preprint: https://psyarxiv.com/3h7au

Supplemental material: https://osf.io/8mbyv/

Citation: Rehrig, G., Hayes, T. R., Henderson, J. M., & Ferreira, F. (2020). When scenes speak louder than words: Verbal encoding does not mediate the relationship between scene meaning and visual attention. Memory & Cognition, 48(7), 1181-1195.

Transcript

Gwendolyn Rehrig: When scenes speak louder than words: Verbal encoding does not mediate the relationship between scene meaning and visual attention By Gwendolyn Rehrig, Taylor R. Hayes, John M. Henderson, and Fernanda Ferreira.

Abstract: The complexity of the visual world requires that we constrain visual attention and prioritize some regions of the scene for attention over others. The current study investigated whether verbal encoding processes influence how attention is allocated in scenes. Specifically, we asked whether the advantage of scene meaning over image salience in attentional guidance is modulated by verbal encoding, given that we often use language to process information. In two experiments, 60 subjects studied scenes (30 in experiment 1 and 60 in experiment 2) for 12 seconds each in preparation for a scene recognition task. Half of the time, subjects engaged in a secondary articulatory suppression task concurrent with scene viewing. Meaning and saliency maps were quantified for each of the experimental scenes. In both experiments, we found that meaning explained more of the variance in visual attention than image salience did, particularly when we controlled for the overlap between meaning and salience, with and without the suppression task. Based on these results, verbal encoding processes do not appear to modulate the relationship between scene meaning and visual attention. Our findings suggest that semantic information in the scene steers the attentional ship, consistent with cognitive guidance theory.

Keywords: scene processing, visual attention, meaning, salience, language

Introduction

Because the visual world is information-rich, observers prioritize certain scene regions for attention over others to process scenes efficiently. While bottom-up information from the stimulus is clearly relevant, visual attention does not operate in a vacuum, but rather functions in concert with other cognitive processes to solve the problem at hand. What influence, if any, do extra-visual cognitive processes exert on visual attention?

Two opposing theoretical accounts of visual attention are relevant to the current study: saliency-based theories and cognitive guidance theory. According to saliency-based theories, salient scene regions—those that contrast with their surroundings based on low-level image features (for example, luminance, color, orientation)—pull visual attention across a scene, from the most salient location to the least salient location in descending order. Saliency-based explanations cannot explain that physical salience does not determine which scene regions are fixated and that top-down task demands influence attention more than physical salience does. Cognitive guidance theory can account for these findings: the cognitive system pushes visual attention to scene regions, incorporating stored knowledge about scenes to prioritize regions that are most relevant to the viewer’s goals. Under this framework, cognitive systems—for example, long- and short-term memory, executive planning, etc.—operate together to guide visual attention. Coordination of cognitive systems helps to explain behavioral findings where saliency-based attentional theories fall short. For example, viewers look preferentially at meaningful regions of a scene (for example, those containing task-relevant objects), even when they are not visually salient (for example, under shadow), despite the presence of a salient distractor.

Recent work has investigated attentional guidance by representing the spatial distribution of image salience and scene meaning comparably. Henderson and Hayes introduced meaning maps to quantify the distribution of meaning over a scene. Raters on mTurk saw small scene patches presented at two different scales and judged how meaningful or recognizable each patch was. Meaning maps were constructed by averaging the ratings across patch scales and smoothing the values. Image salience was quantified using Graph-Based Visual Salience. The feature maps were correlated with attention maps that were empirically derived from viewer fixations in scene memorization and aesthetic judgement tasks. Meaning explained greater variance in attention maps than salience did, both for linear and semipartial correlations, suggesting that meaning plays a greater role in guiding visual attention than image salience does. This replicated when attention maps constructed from the same dataset were weighted on fixation duration, when viewers described scenes aloud, during free-viewing of scenes, when meaning was not task-relevant, and even when image salience was task-relevant. In sum, scene meaning explained variation in attention maps better than image salience did across experiments and tasks, supporting the cognitive guidance theory of attentional guidance.

One question that remains unexplored is whether other cognitive processes indirectly influence cognitive guidance of attention. For example, it is possible that verbal encoding may modulate the relationship between scene meaning and visual attention: Perhaps the use of language, whether vocalized or not, pushes attention to more meaningful regions. While only two of the past experiments were explicitly linguistic in nature (scene description), the remaining tasks did not control for verbal encoding processes.

There is evidence that observers incidentally name objects silently during object viewing. Meyer et al. asked subjects to report whether a target object was present or not in an array of objects, which sometimes included competitors that were semantically related to the target or were semantically unrelated, but had a homophonous name (for example, bat the tool vs. bat the animal). The presence of competitors interfered with search, which suggests information about the objects (name, semantic information) became active during viewing, even though that information was not task-relevant. In a picture-picture interference study, Meyer and Damian presented target objects that were paired with distractor objects with phonologically similar names, and instructed subjects to name the target objects. Naming latency was shorter when distractor names were phonologically similar to the name of the target object, suggesting that activation of the distractor object’s name occurred and facilitated retrieval of the target object’s name. Together, the two studies demonstrate a tendency for viewers to incidentally name objects they have seen.

Cross-linguistic studies on the topic of linguistic relativity employ verbal interference paradigms to demonstrate that performance on perceptual tasks can be mediated by language processes. For example, linguistic color categories vary across languages even though the visual spectrum of colors is the same across language communities. A 2007 study showed that observers discriminated between colors faster when the colors belonged to different linguistic color categories, but the advantage disappeared with verbal interference. These findings indicate that language processes can mediate performance on perceptual tasks that are ostensibly not linguistic in nature, and a secondary verbal task that prevents task-incidental language use can disrupt the mediating influence of language. Similar influences of language on ostensibly non-linguistic processes, and the disruption thereof by verbal interference tasks, have been found for spatial memory, event perception, categorization, and numerical representations, to name a few.

The above literature suggests we use internal language during visual processing, and in some cases those language processes may mediate perceptual processes. Could the relationship between meaning and visual attention observed previously have been modulated by verbal encoding processes? To examine this possibility, we used an articulatory suppression manipulation to determine whether verbal encoding mediates attentional guidance in scenes.

In the current study, observers studied 30 scenes for 12 seconds each for a later recognition memory test. The scenes used in the study phase were mapped for meaning and salience. We conducted two experiments in which subjects performed a secondary articulatory suppression task half of the time in addition to memorizing scenes. In Experiment 1, the suppression manipulation was between-subjects, and the articulatory suppression task was to repeat a three digit sequence aloud during the scene viewing period. We chose this suppression task because we suspected subjects might adapt to and subvert simpler verbal interference such as a syllable repetition, and because digit sequence repetition imposes less cognitive load than n-back tasks. In Experiment 2, we implemented a within-subject design using two experimental blocks: one with the sole task of memorizing scenes, the other with an additional articulatory suppression task. Because numerical stimuli may be processed differently than other verbal stimuli, we instead asked subjects to repeat the names of a sequence of three shapes aloud during the suppression condition. In the recognition phase of both experiments, subjects viewed 60 scenes—30 that were present in the study phase, 30 foils—and indicated whether or not they recognized the scene from the study phase.

We tested two competing hypotheses about the relationship between verbal encoding and attentional guidance in scenes. If verbal encoding indeed mediated the relationship between meaning and attentional guidance in our previous work, we would expect observers to direct attention to meaningful scene regions only when internal verbalization strategies are available to them. Specifically, meaning should explain greater variance in attention maps than saliency in the control condition, and meaning should explain less or equal variance in attention as salience when subjects suppressed internal language use. Conversely, if verbal encoding did not mediate attentional guidance in scenes, the availability of verbalization strategies should not affect attention, and so we would expect to find an advantage of meaning over salience whether or not subjects engaged in a suppression task.

Experiment 1: Methods

Sixty-eight undergraduates enrolled at the University of California, Davis participated for course credit. All subjects were native speakers of English, at least 18 years old, and had normal or corrected-to-normal vision. They were naive to the purpose of the experiment and provided informed consent as approved by the University of California, Davis Institutional Review Board. Six subjects were excluded from analysis because their eyes could not be accurately tracked, 1 due to an equipment failure, and 1 due to experimenter error; data from the remaining 60 subjects were analyzed (30 subjects in each condition).

Scenes were 30 digitized and luminance-matched photographs of real-world scenes used in a previous experiment. Of these, 10 depicted outdoor environments, and 20 depicted indoor environments. People were not present in any scenes. Another set of 30 digitized images of comparable scenes (similar scene categories and time period, no people depicted) were selected from a Google image search and served as memory foils. Because we did not evaluate attentional guidance for the foils, meaning and salience were not quantified for these scenes, and the images were not luminance-matched.

Digit sequences were selected randomly without replacement from all three digit numbers ranging from 100 to 999 (900 numbers total), then segmented into 30 groups of 30 sequences each such that each digit sequence in the articulatory suppression condition was unique.

Eye movements were recorded with an SR Research EyeLink 1000+ tower mount eyetracker at a 1000 Hz sampling rate. Subjects sat 83 cm away from a monitor such that scenes subtended approximately 26° x 19° visual angle at a resolution of 1024 x 768 pixels, presented in 4:3 aspect ratio. Head movements were minimized using a chin and forehead rest integrated with the eyetracker’s tower mount. Subjects were instructed to lean against the forehead rest to reduce head movement while allowing them to speak during the suppression task. Although viewing was binocular, eye movements were recorded from the right eye. The experiment was controlled using SR Research Experiment Builder software. Data were collected on two systems that were identical except that one subject computer operated using Windows 10, and the other used Windows 7.

Subjects were told they would see a series of scenes to study for a later memory test. Subjects in the articulatory suppression condition were told each trial would begin with a sequence of 3 digits, and were instructed to repeat the sequence of digits aloud during the scene viewing period. After the instructions, a calibration procedure was conducted to map eye position to screen coordinates. Successful calibration required an average error of less than 0.49° and a maximum error below 0.99°.

Following successful calibration, there were 3 practice trials to familiarize subjects with the task prior to the experimental trials. In the suppression condition, during these practice trials participants studied three-digit sequences prior to viewing the scene. Practice digit sequences were 3 randomly sampled sequences from the range 1 to 99, in 3-digit format (for example, “0 3 6” for 36). Subjects pressed any button on a button box to advance throughout the task.

Each subject received a unique pseudo-random trial order that prevented two scenes of the same type (for example, a kitchen) from occurring consecutively. A trial proceeded as follows. First, a five-point fixation array was displayed to check calibration. The subject fixated the center cross and the experimenter pressed a key to begin the trial if the fixation was stable, or reran the calibration procedure if not. Before the scene, subjects in the articulatory suppression condition saw the instruction “Study the sequence of digits shown below. Your task is to repeat these digits over and over out loud for 12 seconds while viewing an image of the scene” along with a sequence of 3 digits separated by spaces (for example, “8 0 9”), and pressed a button to proceed. The scene was shown for 12 seconds, during which time eye-movements were recorded. After 12 seconds elapsed, subjects pressed a button to proceed to the next trial. The trial procedure repeated until all 30 trials were complete.

Figure 1 shows a schematic of the trial procedure. The first phase shows a fixation array against a gray background. Four peripheral fixations are black, and the central fixation is red. The experimenter presses a button to advance from this screen. In the articulatory suppression condition only, a digit sequence display is then shown, which displays a digit sequence to be rehearsed in white text against a gray background. Subjects press a button on a button box to advance. The button box, represented in the figure as 5 circles corresponding to each of the 5 buttons, has a yellow circle at the top, with a white circle directly below it. On either side of the white circle are a green circle on the left and a red circle on the right. Below the white circle is a blue circle. The third shows an example of a real-world scene. The example scene shows an indoor scene of an area near an entryway. There is a wooden dresser with electronics on it, and cleaning supplies adjacent to the wooden dresser. The walls are sea green, and the floor is tiled in black and white. There is an umbrella in the background in front of a radiator. The scene is shown for 12 seconds, after which an end of trial screen appears to let inform subjects they can press a button on the button box to proceed.

A recognition memory test followed the experimental trials, in which subjects were shown the 30 experimental scenes and 30 foil scenes they had not seen previously. Presentation order was randomized without replacement. Subjects were informed that they would see one scene at a time and instructed to use the button box to indicate as quickly and accurately as possible whether they had seen the scene earlier in the experiment. After the instruction screen, subjects pressed any button to begin the memory test. In a recognition trial, subjects saw a scene that was either a scene from the study phase or a foil image. The scene persisted until a “Yes” or “No” button press occurred, after which the next trial began. Response time and accuracy were recorded. This procedure repeated 60 times, after which the experiment terminated.

Fixations and saccades were parsed with EyeLink’s standard algorithm using velocity and acceleration thresholds. Eye movement data were imported offline into Matlab using the Visual EDF2ASC tool packaged with SR Research DataViewer software. The first fixation was excluded from analysis, as were saccade amplitude and fixation duration outliers.

Attention maps were generated by constructing a matrix of fixation counts with the same x,y dimensions as the scene, and counting the total fixations corresponding to each coordinate in the image. The fixation count matrix was smoothed with a Gaussian low pass filter with circular boundary conditions and a frequency cutoff of -6dB. For the scene-level analysis, all fixations recorded during the viewing period were counted. For the fixation analysis, separate attention maps were constructed for each ordinal fixation.

We generated meaning maps using the context-free rating method introduced in Henderson & Hayes (2017). Each 1024 x 768 pixel scene was decomposed into a series of partially overlapping circular patches at fine and coarse spatial scales. The decomposition resulted in 12,000 unique fine-scale patches and 4,320 unique coarse-scale patches, totaling 16,320 patches.

Raters were 165 subjects recruited from Amazon Mechanical Turk. All subjects were located in the United States, had a HIT approval rating of 99% or more, and participated once. Subjects provided informed consent and were paid $0.50.

All but one subject rated 300 random patches extracted from the 30 scenes. Subjects were instructed to rate how informative or recognizable each patch was using a 6-point Likert scale (‘very low’, ‘low’, ‘somewhat low’, ‘somewhat high’, ‘high’, ‘very high’). Prior to rating patches, subjects were given two examples each of low-meaning and high-meaning patches in the instructions to ensure they understood the task. Patches were presented in random order. Each patch was rated 3 times by 3 independent raters totaling 48,960 ratings per scene. Because there was high overlap across patches, each fine patch contained data from 27 independent raters and each coarse patch from 63 independent raters.

Meaning maps were generated from the ratings for each scene by averaging, smoothing, and combining the fine and coarse scale maps from the corresponding patch ratings. The ratings for each pixel at each scale in each scene were averaged, producing an average fine and coarse rating map for each scene. The fine and coarse maps were then averaged. Because subjects in the eyetracking task showed a consistent center bias in their fixations, we applied center bias to the maps using a multiplicative down-weighting of scores in the map periphery. “Center bias” is the tendency for fixations to cluster around the center of the scene and to be relatively absent in the periphery of the image. The final map was blurred using a Gaussian filter via the Matlab function ‘imgaussfilt’ with a sigma of 10.

Image-based saliency maps were constructed using the Graph-Based Visual Saliency toolbox in Matlab with default parameters. We used GBVS because it is a state-of-the-art model that uses only image-computable salience. While there are newer saliency models that predict attention better, these models incorporate high-level image features through training on viewer fixations and object features, which may index semantic information. We used GBVS to avoid incorporating semantic information in image-based saliency maps, which could confound the comparison with meaning.

Prior to analysis, feature maps were normalized to a common scale using image histogram matching via the Matlab function ‘imhistmatch’ in the Image Processing Toolbox. The corresponding attention map for each scene served as the reference image. Map normalization was carried out within task conditions: for the map-based analysis of the control condition, feature maps were normalized to the attention map derived from fixations in the control condition only, and likewise for the suppression condition. Results did not differ between the current analysis and a second analysis using feature maps normalized to the same attention map generated from fixations in the control condition.

Figure 2 shows a schematic of the meaning mapping procedure on the top row and representation saliency, meaning, and attention maps on the bottom row. The first panel of row 1 shows the same example real-world scene (the entryway). The second panel shows the fine-scale spatial grid, which consists of small overlapping circles on the scene. An example small-scale patch is shown on the grid. The patch shows a small group of objects on the dresser. The third panel shows the coarse-scale spatial grid, which consists of larger overlapping circles, and an example coarse-scale scene patch that overlaps with the example small-scale scene patch. It shows the same small group of objects, but additionally shows the top drawer of the dresser and adjacent objects (a phone and a modem) that were not visible in the small scale patch. The fourth panel shows six examples of patches that received high meaning ratings or low meaning ratings. The three high meaning patches show a phone on the dresser, a handle of a drawer on the dresser, and a candle on the dresser. The there low meaning patches all show only surfaces: the green wall in a corner of the room, the door, and several tiles from the floor. The second row shows the saliency, meaning, and attention maps, all of which are heatmaps of the same dimensions as the scene image. Dark colors (black and dark red) indicate low map values, and bright colors (white and yellow) indicate high map values. The saliency map shows high values that correspond to contrasts between objects and the tile floor, a dark gray vacuum cleaner against a white door, white crown molding contrasting with the green walls, and the outlines around each drawer or the dresser. The meaning map shows high values corresponding to the top of the dresser where many small objects are located and dresser drawers, with more diffuse middle values (orange-red) corresponding to objects around the dresser. The attention map for the control condition has several bright spots corresponding to higher fixation density. The hot spots overlap with the objects on the top of the dresser, the vacuum cleaner, other cleaning supplies next to the dresser, and the objects in the background. The attention map for the suppression condition is more or less identical.

We computed correlations (R2) across the maps of 30 scenes to determine the degree to which saliency and meaning overlap with one another. We excluded the peripheral 33% of the feature maps when determining overlap between the maps to control for the peripheral downweighting applied to both, which otherwise would inflate the correlation between them. On average, meaning and saliency were correlated, and this relationship differed from zero.

Experiment 1: Results

To determine what role verbal encoding might play in extracting meaning from scenes, we asked whether the advantage of meaning over salience in explaining variance in attention would hold in each condition. To answer this question, we conducted two-tailed paired t-tests within task conditions.

To determine whether we obtained adequate effect sizes for the primary comparison of interest, we conducted a sensitivity analysis using G*Power 3.1. We computed the effect size index dz—a standardized difference score—and the critical t statistic for a two-tailed paired t-test with 95% power and a sample size of 30 scenes. The analysis revealed a critical t value of 2.05 and a minimum dz of 0.68.

We correlated meaning and saliency maps with attention maps to determine the degree to which meaning or salience guided visual attention. Squared linear and semipartial correlations (R2) were computed within each condition for each of the 30 scenes. The relationship between meaning and salience, respectively, and visual attention was analyzed using t-tests. Cohen’s d was computed to estimate effect size, interpreted as small, medium, or large following Cohen (1988).

In the control condition, when subjects were only instructed to memorize scenes, meaning accounted for 34% of the average variance in attention and salience accounted for 21%. The advantage of meaning over salience was significant. In the articulatory suppression condition, when subjects additionally had to repeat a sequence of digits aloud, meaning accounted for 37% of the average variance in attention whereas salience accounted for 23%. The advantage of meaning over salience was also significant when the task prevented verbal encoding.

Because meaning and salience are correlated, we partialed out the shared variance explained by both meaning and salience. In the control condition, when the shared variance explained by salience was accounted for, meaning explained 15% of the average variance in attention, while salience explained only 2% of the average variance once the variance explained by meaning was accounted for. The advantage of meaning over salience was significant. In the articulatory suppression condition, meaning explained 16% of the average unique variance after shared variance was partialed out, while salience explained only 2% of the average variance after shared variance with meaning was accounted for, and the advantage was significant.

Figure 3a shows scatter box plots for linear correlations on the left panel and semipartial correlations that explain the unique variance explained by meaning and salience, respectively, on the right panel. The y-axis for both panels ranges from 0.00 to 1.00. In the scatter box plots, the mean is shown on the center line and 95% confidence intervals as boxes around the mean. Whiskers correspond to plus or minus one standard deviation. Dots correspond to individual data points. Between the control condition and the suppression condition, image salience—indicated in blue—explains essentially the same amount of variance, and the boxes are almost identical. The box (and central line) for meaning is slightly higher and larger in the suppression condition than in the control. On the right panel, which shows semipartial correlations, the picture is much the same except the box plots for the variance explained by salience are barely visible—thick, dark lines hovering just above 0 on the y-axis. The boxes for meaning look even more similar across conditions than they did for the linear correlations, but unlike the boxes for salience they are clearly visible and hover around 0.15 on the y-axis.

To summarize, we found a large advantage of meaning over salience in explaining variance in attention in both conditions, for both linear and semipartial correlations. For all comparisons, the value of the t statistic and dz exceeded the thresholds obtained in the sensitivity analysis.

Following our previous work, we examined early fixations to determine whether salience influences early scene viewing. We correlated each feature map (meaning, salience) with attention maps at each fixation. Squared linear and semipartial correlations (R2) were computed for each fixation, and the relationship between meaning and salience with attention, respectively, was assessed for the first three fixations using paired t-tests.

In the control condition, meaning accounted for 37% of the average variance in attention during the first fixation, and 14% and 13% during the second and third fixations, respectively. Salience accounted for 9%, 8%, and 7% of the average variance during the first, second, and third fixations, respectively. The advantage of meaning was significant for all three fixations. For subjects in the suppression condition, meaning accounted for 42% of the average variance during the first fixation, 21% during the second, and 17% during the third fixation. Salience accounted for 10% of the average variance during the first fixation and 9% during the second and third fixations. The advantage of meaning over salience was significant for all three fixations.

To account for the correlation between meaning and salience, we partialed out shared variance explained by both meaning and salience, then repeated the fixation analysis on the semipartial correlations. In the control condition, after the shared variance explained by both meaning and salience was partialed out, meaning accounted for 30% of the average variance at the first fixation, 10% of the variance during the second fixation, and 8% during the third fixation. After shared variance with meaning was partialed out, salience accounted for only 2% of the average unique variance at the first and third fixations and 3% at the second fixation. The advantage of meaning was significant for all three fixations. In the articulatory suppression condition, after the shared variance with salience was partialled out, meaning accounted for 34% of the average variance during the first fixation, 14% at the second fixation, and 10% during the third fixation. After the shared variance with meaning was partialled out, on average salience accounted for 2% of the variance at all three fixations. The advantage of meaning was significant for all three fixations.

Figure 3b shows line graphs for linear correlations on the top row and semipartial correlations that explain the unique variance explained by meaning and salience on the bottom row. The y-axis for both panels ranges from 0.00 to 1.00. Lines in the graph corresponding to the suppression condition are dashed. Error bars around each point indicate 95% confidence intervals. For linear correlations, blue lines corresponding to image salience hover at or below 0.1 for the entire period shown (fixations 1-38), and are almost completely overlapping between conditions. Red lines for meaning start out quite high—around 0.4—and decrease after the first fixation, but both red lines are higher than the blue lines for image salience throughout. The same trend is visible on the graph showing semipartial correlations, except the blue lines for salience are barely above 0 on the y-axis, and the red lines for meaning are more clearly distinguishable from those for salience, but not terribly distinguishable from one another (across conditions). In both graphs, the dashed red lines corresponding to meaning in the suppression condition are higher than the solid red lines for the control condition.

In sum, early fixations revealed a consistent advantage of meaning over salience, counter to the claim that salience influences attention during early scene viewing. The advantage was present for the first three fixations in both conditions, when we analyzed both linear and semipartial correlations, and all effect sizes were medium or large.

To confirm that subjects took the memorization task seriously, we totaled the number of hits, correct rejections, misses, and false alarms on the recognition task for each subject, each of which ranged from 0 to 30. Recognition performance was high in both conditions. On average, subjects in the control condition correctly recognized scenes shown in the memorization task 95% of the time, while subjects who engaged in the suppression task during memorization correctly recognized scenes 90% of the time. Subjects in the control conditions falsely reported that a foil scene had been present in the memorization scene set 3% of the time on average, and those in the suppression condition false alarmed an average of 4% of the time. Overall, subjects in the control condition had higher recognition accuracy, though the difference in performance was small.

Figure 4a shows recognition task performance for each subject using violin plots with data points superimposed. Red violin plots indicate hits, green violin plots indicate correct rejections, blue violins indicate misses, and purple violins show false alarms. Recognition performance for the control condition is shown on the left, and for the suppression condition on the right. In both conditions, there are more hits and correct rejections than misses or false alarms, reflecting high accuracy. However, in the suppression condition (on the right), the violins are thinner and taller for all conditions, indicating more variation in the data for the suppression condition than the control.

We then computed d’ with log-linear correction to handle extreme values (ceiling or floor performance) using the dprime function from the psycho package in R, resulting in 30 data points per condition (1 data point per subject). On average, d’ scores were higher in the control condition than the articulatory suppression condition. The difference in performance was not significant, and the effect size was small.

Figure 4b shows d’ scores for the control condition and the suppression condition as violin plots with data points superimposed. The gray violin corresponds to d’ for the control condition, and is much higher on the y-axis and wider than that of the suppression condition, showing less variation in the data and slightly better performance in that condition. The yellow violin corresponds to the suppression condition, and it is more narrow and tall, though most of the d’ scores (all but 2) fall within the same range for both conditions. In sum, recognition was numerically better for subjects who were only instructed to study the scenes as opposed to those who additionally completed an articulatory suppression task, but the difference was not significant.

Experiment 1: Discussion

The results of Experiment 1 suggest that incidental verbalization does not modulate the relationship between scene meaning and visual attention during scene viewing. However, the experiment had several limitations. First, we implemented the suppression manipulation between-subjects rather than within-subjects out of concern that subjects might infer the hypothesis in a within-subject paradigm and skew the results. Second, because numerical cognition is unique, it is possible that another type of verbal interference would affect the relationship between meaning and attention. Third, we tested relatively few scenes (only 30).

We conducted a second experiment to address these limitations and replicate the advantage of meaning over salience despite verbal interference. In Experiment 2, the verbal interference consisted of sequences of common shape names (for example, square, heart, circle) rather than digits, and the interference paradigm was implemented within-subject using a blocked design. We added 30 scenes to the Experiment 1 stimulus set, yielding 60 experimental items total.

We tested the same two competing hypotheses in Experiments 1 and 2: If verbal encoding mediates the relationship between meaning and attentional guidance, and the use of numerical interference in Experiment 1 was insufficient to disrupt that mediation, then the relationship between meaning and attention should be weaker when incidental verbalization is not available, in which case meaning and salience may explain comparable variance in attention. If verbal encoding does not mediate attentional guidance in scenes and our Experiment 1 results cannot be explained by numerical interference specifically, then we expect meaning to explain greater variance in attention both when shape names are used as interference and when there is no verbal interference.

The method for Experiment 2 was the same as Experiment 1, with the following exceptions.

Sixty-five undergraduates enrolled at the University of California, Davis participated for course credit. All were native speakers of English, at least 18 years old, and had normal or corrected-to-normal vision. They were naive to the purpose of the experiment and provided informed consent as approved by the University of California, Davis Institutional Review Board. Four subjects were excluded from analysis because their eyes could not be accurately tracked, and an additional subject was excluded due to excessive movement; data from the remaining 60 subjects were analyzed.

We selected the following common shapes for the suppression task: circle, cloud, club, cross, arrow, heart, moon, spade, square, and star. Names for the shapes were monosyllabic for eight shape names and disyllabic for two shape names. Shape sequences consisted of 3 shapes randomly sampled without replacement from the set of 10.

Scenes were 60 digitized and luminance-matched photographs of real-world scenes. Thirty were used in Experiment 1, and an additional 30 were drawn from another study. Of the additional scenes, 16 depicted outdoor environments, and 14 depicted indoor environments, and each of the 30 scenes belonged to a unique scene category. People and text were not present in any of the scenes.

Another set of 60 digitized images of comparable scenes (similar scene categories from the same time period, no people depicted) served as foils in the memory test. Thirty of these were used in Experiment 1, and an additional 30 were distractor images drawn from a previous study. The Experiment 1 scenes and the additional 30 scenes were equally distributed across blocks.

The apparatus was identical to that used in Experiment 1.

Subjects were informed that they would complete two separate experimental blocks, and that in one block each trial would begin with a sequence of 3 shapes that they would repeat aloud during the scene viewing period.

Following successful calibration, there were 4 practice trials to familiarize subjects with the task prior to the experimental trials. The first 2 practice trials were control trials, and the rest were articulatory suppression trials. These consisted of shape sequences (for example, cloud arrow cloud) that were not repeated in the experimental trials. Before the practice trials, subjects were shown all of the shapes used in the suppression task, alongside the names of each shape.

Figure 5a shows the shape familiarization screen, which depicts all 10 shapes in black against a gray background, accompanied by white text labels to provide the name we wanted subjects to use for each shape. From left to right and top to bottom, the shapes shown are circle, cloud, club, cross, arrow, heart, moon, spade, square, and star.

The trial procedure was identical to Experiment 1, except that the pre-scene articulatory suppression condition displayed the instruction “Study the sequence of shapes shown below. Your task is to repeat these shapes over and over out loud for 12 seconds while viewing an image of the scene”, followed by a sequence of 3 shapes (for example, square, heart, cross) until the subject pressed a button.

Figure 5b shows the shape sequence display in the suppression condition, which includes instructions in white text against a gray background, with an example shape sequence shown in black. The shape sequence shown is square, heart, cross.

Following the experimental trials in each block, subjects performed a recognition memory in which 30 experimental scenes they saw earlier in the block and 30 foil scenes that they had not seen previously were shown. The remainder of the recognition memory task procedure was identical to that of Experiment 1. The procedure repeated 60 times, after which the block terminated. Following completion of the first block, subjects started the second with another calibration procedure. In the second block, subjects saw the other 30 scenes (and 30 memory foils) that were not displayed during the first block, and participated in the other condition (suppression if the first block was the control, and vice versa). Each subject completed 60 experimental trials and 120 recognition memory trials total. The scenes shown in each block and the order of conditions were counterbalanced across subjects.

Attention maps were generated in the same manner as Experiment 1.

Meaning maps for 30 scenes added in Experiment 2 were generated using the same procedure as the scenes tested in Experiment 1, with the following exceptions. Raters were 148 UC Davis undergraduate students recruited through the UC Davis online subject pool. All were 18 years or older, had normal or corrected-to-normal vision, and reported no color blindness. Subjects received course credit for participation.

In each survey, catch patches showing solid surfaces (for example, a wall) served as an attention check. Data from 25 subjects who did not attend to the task (responded incorrectly on fewer than 85% of catch trials), or did not respond to more than 10% of the questions, were excluded. Data from the remaining 123 raters were used to construct meaning maps.

Saliency maps were generated in the same manner as in Experiment 1. Maps were normalized in the same manner as in Experiment 1.

We determined the degree to which saliency and meaning overlap for the 30 new scenes by computing feature map correlations across the maps of 30 scenes, excluding the periphery to control for the peripheral downweighting associated with center biasing operations. On average, meaning and saliency were correlated, and this relationship differed from zero.

We again conducted a sensitivity analysis, which revealed a critical t value of 2.00 and a minimum dz of 0.47.

We correlated meaning and saliency maps with attention maps in the same manner as in Experiment 1. Squared linear and semipartial correlations (R2) were computed within each condition for each of the scenes. The relationship between meaning and salience with visual attention was analyzed using t-tests. Cohen’s d was computed, and effect sizes were interpreted in the same manner as the Experiment 1 results.

We examined early fixations to replicate the early advantage of meaning over image salience observed in Experiment 1 and previous work. We correlated each feature map (meaning, salience) with attention maps at each fixation. Map-level correlations and t-tests were conducted in the same manner as Experiment 1.

Experiment 2: Results

We sought to replicate the results of Experiment 1 using a more robust experimental design. If verbal encoding is not required to extract meaning from scenes, we expected an advantage of meaning over salience in explaining variance in attention for both conditions. We again conducted paired t-tests within task conditions.

Meaning accounted for 36% of the average variance in attention in the control condition and salience accounted for 25%. The advantage of meaning over salience was significant and the effect size was large. Meaning accounted for 45% of the variance in attention in the suppression condition and salience accounted for 27%. Consistent with Experiment 1, the advantage of meaning over salience was significant even with verbal interference, and the effect size was large.

To account for the relationship between meaning and salience, we partialed out the shared variance explained by both. When the shared variance explained by salience was accounted for in the control condition, meaning explained 15% of the average variance in attention, while salience explained 3% of the average variance after accounting for the variance explained by meaning. The advantage of meaning over salience was significant, and the effect size was large. Meaning explained 20% of the unique variance on average after shared variance was partialed out in the articulatory suppression condition, and salience explained 2% of the average variance after shared variance with meaning was accounted for, and the advantage was significant with a large effect size.

Figure 6a shows scatter box plots for linear correlations on the left panel and semipartial correlations that explain the unique variance explained by meaning and salience, respectively, on the right panel. The y-axis for both panels ranges from 0.00 to 1.00. In the scatter box plots, the mean is shown on the center line and 95% confidence intervals as boxes around the mean. Whiskers correspond to plus or minus one standard deviation. Dots correspond to individual data points. Between the control condition and the suppression condition, image salience—indicated in blue—explains essentially the same amount of variance, hovering around 0.25 on the y-axis, and the boxes are almost identical. The box (and central line) for meaning is higher and larger in the suppression condition than in the control, both of which are higher than image salience. On the right panel, which shows semipartial correlations, the picture is much the same except the box plots for the variance explained by salience are barely visible—thick, dark lines hovering just above 0 on the y-axis. The box for meaning in the control condition hovers around 0.15, but in the suppression condition it is higher and hovers around 0.20. Both boxes for meaning are clearly visible and higher than the blue boxes for image salience.

Consistent with Experiment 1, we found a large advantage of meaning over salience in accounting for variance in attention in both conditions, for both linear and semipartial correlations, and the value of the t statistic and dz exceeded the thresholds obtained in the sensitivity analysis.

In the control condition, meaning accounted for 30% of the average variance in attention during the first fixation, 17% during the second, and 16% during the third. Salience accounted for 11% of the variance at the first fixation and 10% of the variance during the second and third fixations. The advantage of meaning was significant for all three fixations, and effect sizes were medium or large. In the suppression condition, meaning accounted for 45% of the average variance during the first fixation, 32% during the second, and 25% during the third. Salience accounted for 13% of the average variance during the first fixation,15% during the second, and 11% during the third. The advantage of meaning over salience was significant for all three fixations.

Because meaning and salience were correlated, we partialed out shared variance explained by both and analyzed semipartial correlations computed for each of the initial three fixations. In the control condition, after the shared variance explained by both meaning and salience was partialed out, meaning accounted for 23% of the average variance at the first fixation, 11% of the variance during the second, and 9% during the third. After shared variance with meaning was partialed out, salience accounted for 3% of the average unique variance at the first fixation and 4% at the second and third. The advantage of meaning was significant for all three fixations. In the suppression condition, after the shared variance with salience was partialled out, meaning accounted for 35% of the variance on average during the first fixation, 20% of the variance at the second, and 16% during the third. After the shared variance with meaning was partialled out, on average salience accounted for 2% of the variance at the first and third fixations and 3% of the variance at the second. The advantage of meaning was significant for all three fixations, with large effect sizes.

Figure 6b shows line graphs for linear correlations on the top row and semipartial correlations that explain the unique variance explained by meaning and salience on the bottom row. The y-axis for both panels ranges from 0.00 to 1.00. Lines in the graph corresponding to the suppression condition are dashed. Error bars around each point indicate 95% confidence intervals. For linear correlations, blue lines corresponding to image salience again hover at or below 0.1 for the entire period shown (fixations 1-38), and are almost completely overlapping between conditions. Red lines for meaning start out quite high—around 0.3 and 0.45 for the control and suppression conditions, respecitvely—and decrease after the first fixation, but both red lines are higher than the blue lines for image salience until time points 33-38. The same trend is visible on the graph showing semipartial correlations, except the blue lines for salience are barely above 0 on the y-axis. The red lines for meaning are very clearly distinguishable from those for salience. In both graphs, the dashed red lines corresponding to meaning in the suppression condition are higher than the solid red lines for the control condition, moreso than they were for the Experiment 1 data shown in Figure 3b.

The results of Experiment 2 replicated those of Experiment 1: meaning held a significant advantage over salience when the entire viewing period was considered and when we limited our analysis to early viewing, both for linear and semipartial correlations.

As an attention check, we totaled the number of hits, correct rejections, misses, and false alarms on the recognition task for each subject. The totals for each response category ranged from 0 to 30. Recognition performance was high in both conditions. In the control condition, subjects correctly recognized scenes shown in the memorization task 97% of the time on average, while subjects correctly recognized scenes 91% of the time after they had engaged in the suppression task during memorization. In the control condition, subjects falsely reported that a foil had been present in the memorization scene set 1% of the time on average, and in the suppression condition, the average false alarm rate was 2%. Overall, recognition accuracy was higher in the control condition than the suppression condition, though the difference was small.

Figure 7a shows recognition task performance for each subject using violin plots with data points superimposed. Red violin plots indicate hits, green violin plots indicate correct rejections, blue violins indicate misses, and purple violins show false alarms. Recognition performance for the control condition is shown on the left, and for the suppression condition on the right. In both conditions, there are more hits and correct rejections than misses or false alarms overall, reflecting high accuracy. However, in the suppression condition (on the right), the violins are thinner and taller for all conditions, indicating more variation in the data for the suppression condition than the control, and the difference is more apparent in Figure 7a than it was in Figure 4a for Experiment 1.

We then computed d’ in the same manner as Experiment 1. In the control condition, d’ scores were higher on average than in the suppression condition. To determine whether the difference in means was significant, we conducted a paired t-test, which revealed a significant difference with a large effect size.

Figure 7b shows d’ scores for the control condition and the suppression condition as violin plots with data points superimposed. The gray violin corresponds to d’ for the control condition. It is shaped like a funnel in that it is very wide at the top, indicating most data points corresponded to high recognition accuracy. It is much higher on the y-axis and wider than that of the suppression condition, is represented by a yellow violin plot which is shaped like an upside-down wine bottle indicating more data points corresponding to poorer accuracy in the suppression condition, and greater variation in the suppression condition.

For Experiment 2, while recognition accuracy was high overall, recognition was significantly better in the control condition, when subjects memorized scenes and did not engage in the suppression task.

Experiment 2: Discussion

The attention results of Experiment 2 replicated those of Experiment 1, providing further evidence that incidental verbalization does not modulate the relationship between scene meaning and visual attention during scene viewing. Recognition performance was significantly worse in the suppression condition than in the control condition, which we cannot attribute to individual differences given that the interference manipulation was implemented within-subject. One possibility is that the shape name interference imposed greater cognitive load than the digit sequence interference; however, we cannot determine whether that was the case based on the current experiment.

General Discussion

The current study tested two competing hypotheses concerning the relationship (or lack thereof) between incidental verbal encoding during scene viewing and attentional guidance in scenes. First, the relationship between scene meaning and visual attention could be mediated by verbal encoding, even when it occurs incidentally. Second, scene meaning guides attention regardless of whether incidental verbalization is available, and verbal encoding does not mediate use of scene meaning. We tested these hypotheses in two experiments using an articulatory suppression paradigm in which subjects studied scenes for a later memorization task and either engaged in a secondary task (digit or shape sequence repetition) to suppress incidental verbalization, or had no secondary task. In both experiments, we found an advantage of meaning over salience in explaining the variance in attention maps whether or not incidental verbalization was suppressed. Our results did not support the hypothesis that verbal encoding mediates attentional guidance by meaning in scenes. To the extent that observers use incidental verbalization during scene viewing, it does not appear to mediate the influence of meaning on visual attention, suggesting that meaning in scenes is not necessarily interpreted through the lens of language.

Our attentional findings do not support saliency-based theories of attentional guidance in scenes. Instead, they are consistent with prior work showing that regions with higher image salience are not fixated more and that top-down information, including task demands, plays a greater role than image salience in guiding attention from as early as the first fixation. Consistent with cognitive guidance theory, scene meaning—-which captures the distribution of information across the scene—-predicted visual attention better in both conditions than image salience did. Because our chosen suppression manipulation interfered with verbalization strategies without imposing undue executive load, our findings demonstrate that the advantage of meaning over salience was not modulated by the use of verbal encoding during scene viewing. Instead, we suggest that domain-general cognitive mechanisms (for example, a central executive) may push attention to meaningful scene regions, although additional work is required to test this idea.

Many of the previous studies that showed an effect of internal verbalization strategies (via interference paradigms) tested simpler displays, such as arrays of objects, color patches, or cartoon images, while our stimuli were real-world photographs. Unlike real-world scenes, observers cannot extract scene gist from simple arrays, and may process cartoons less efficiently than natural scenes. It is possible that verbal encoding exerts a greater influence on visual processing for simpler stimuli: the impoverished images may put visual cognition at a disadvantage because gist and other visual information that we use to efficiently process scenes are not available.

We cannot know with certainty whether observers in our suppression task were unable to use internal verbal encoding. However, we would expect the secondary verbal task to have at least impeded verbalization strategies, and that should have impacted the relationship between meaning and attention if verbal encoding is involved in processing scene meaning. Furthermore, the suppression tasks we used (3-digit or 3-shape sequences) were comparable to tasks that eliminated verbalization effects in related work, and so should have suppressed inner speech. We suspect that a more demanding verbal task would have imposed greater cognitive load, which could confound our results because we would not be able to separate effects of verbal interference from those of cognitive load.

Subjects in the control condition did not perform a secondary non-verbal task (for example, a visual working memory task). Given that our findings did not differ across conditions, we suspect controlling for the secondary task’s cognitive load would not have affected the outcome. Recall that prior work has shown digit repetition tasks do not pose excessive cognitive load, and we would have expected lower recognition accuracy in the suppression condition if the demands of the suppression task were too great. However, we cannot be certain the verbal task did not impose burdensome cognitive load in our paradigm, and therefore this remains an issue for further investigation.

Our results are limited to attentional guidance when memorizing scenes. It is possible that verbal encoding exerts a greater influence on other aspects of visual processing, or that the extent to which verbal encoding plays a role depends on the task. Verbal interference may be more disruptive in a scene categorization task, for example, than in scene memorization, given that categorization often involves verbal labels.

The current study investigated whether internal verbal encoding processes (for example, thought in the form of language) modulate the influence of scene meaning on visual attention. We employed a verbal interference paradigm to control for incidental verbalization during a scene memorization task, which did not diminish the relationship between scene meaning and attention. Our findings suggest that verbal encoding does not mediate scene processing, and contribute to a large body of empirical support for cognitive guidance theory.

Supplemental material available on the Open Science Framework under a project with the title as the current manuscript.

This work has been published in the journal Memory & Cognition, Volume 48, issue 7, inclusive pages 1181-1195. A preprint of the accepted version of the manuscript is available on PsyArxiv under the same name. Please refer to either document for correspondence information, author affiliations, references, statistics, and figures.

#podcast #research #visual attention #language #scene processing #meaning #salience #peer reviewed #social science #vision science #visual cognition #read speech

0 notes