
Deep convolutional networks do not classify based on global object shape

  • Nicholas Baker ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review & editing

    nbaker9@ucla.edu

    Affiliation Department of Psychology, University of California, Los Angeles, Los Angeles, California, United States of America

  • Hongjing Lu,

    Roles Conceptualization, Methodology, Writing – review & editing

    Affiliation Department of Psychology, University of California, Los Angeles, Los Angeles, California, United States of America

  • Gennady Erlikhman,

    Roles Conceptualization, Writing – review & editing

    Current address: Department of Psychology, University of California, Los Angeles, Los Angeles, California, United States of America

    Affiliation University of Nevada, Reno, Nevada, United States of America

  • Philip J. Kellman

    Roles Conceptualization, Data curation, Funding acquisition, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Department of Psychology, University of California, Los Angeles, Los Angeles, California, United States of America

Abstract

Deep convolutional networks (DCNNs) are achieving previously unseen performance in object classification, raising questions about whether DCNNs operate similarly to human vision. In biological vision, shape is arguably the most important cue for recognition. We tested the role of shape information in DCNNs trained to recognize objects. In Experiment 1, we presented a trained DCNN with object silhouettes that preserved overall shape but were filled with surface texture taken from other objects. Shape cues appeared to play some role in the classification of artifacts, but little or none for animals. In Experiments 2–4, DCNNs showed no ability to classify glass figurines or outlines but correctly classified some silhouettes. Aspects of these results led us to hypothesize that DCNNs do not distinguish objects’ bounding contours from other edges, and that DCNNs access some local shape features, but not global shape. In Experiment 5, we tested this hypothesis with displays that preserved local features but disrupted global shape, and vice versa. With disrupted global shape, which reduced human accuracy to 28%, DCNNs gave the same classification labels as with ordinary shapes. Conversely, local contour changes eliminated accurate DCNN classification but caused no difficulty for human observers. These results provide evidence that DCNNs have access to some local shape information in the form of local edge relations, but they have no access to global object shapes.

Author summary

“Deep learning” systems, specifically deep convolutional neural networks (DCNNs), have recently achieved near-human levels of performance in object recognition tasks. It has been suggested that the processing in these systems may model or explain object perception abilities in biological vision. For humans, shape is the most important cue for recognizing objects. We tested whether deep convolutional neural networks trained to recognize objects make use of object shape. Our findings indicate that other cues, such as surface texture, play a larger role in deep network classification than in human recognition. Most crucially, we show that deep learning systems have no sensitivity to the overall shape of an object. Whereas deep learning systems can access some local shape features, such as local orientation relations, they are not sensitive to the arrangement of these edge features or global shape in general, and they do not appear to distinguish bounding contours of objects from other edge information. These findings show a crucial divergence between artificial visual systems and biological visual processes.

Introduction

Machine vision is one of the most challenging problems in artificial intelligence. Task-general image understanding is so difficult that it constitutes an “AI complete” problem [1], that is, a problem of sufficient difficulty and generality that it requires intelligence on a par with humans. If solved, it would be considered equivalent to the first successful completion of a Turing test [2,3]. While the general problem of image understanding is still far outside the capabilities of modern artificial systems, algorithms are beginning to reach near human capabilities on certain specialized tasks. In particular, deep convolutional neural network (DCNN) algorithms are achieving previously unseen performance on object recognition tasks.

Since their first entrance [4] in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), deep convolutional networks have substantially outperformed other state-of-the-art recognition algorithms (e.g., [5]), to the point of the latter’s practical extinction. Modern ILSVRC competitions (including 1.2 million images associated with 1000 categories) almost exclusively feature deep convolutional networks, and their error rates have continuously fallen with more powerful hardware and more sophisticated engineering. The current winner has a top-five error rate of less than 3% on the image classification task, meaning that it fails to include the correct category out of 1000 object categories in its top five most likely choices less than 3% of the time, a rate even lower than the human error rate on the same task (~5.1%).

The impressive performance of DCNNs on natural image recognition tasks, and certain apparent similarities between human physiology and the architecture of these networks, raise the natural question of whether these systems explain the capabilities of human perception and acquire representations similar to those used by the human visual system. As DCNNs approach human performance in object recognition tasks, we may ask whether or in what ways their architecture and processing mirror those of human vision. In this paper, we take up this question with special focus on object shape. In a series of five experiments, we probe how DCNNs and humans cope with object classification, with the goal of finding out whether trained networks overtly or implicitly encode object shape and use it to perform classifications.

To anticipate our results: Deep learning networks lack shape representations and processing capabilities that form the primary bases of human object classification. Deep learning networks do have access to some relations of local orientations that may be considered local shape constituents, but they do not appear to form global shape representations. We also show that deep learning networks make no special use of the bounding contours of objects, which most reliably define shape in human and biological vision.

Background

DCNNs have attracted considerable attention, and several different approaches have been used to compare their performance to human object processing. Some of the similarities begin with the basic architecture. Deep convolutional neural networks perform a series of nonlinear transformations on input data such as an image in the case of object recognition. The final transformation outputs a vector of category probability values, one for each object category. Critically, early layers of these networks are not fully connected as in classical neural networks. Instead, they have convolutional windows that preserve spatial information in the image [6]. In modern DCNNs, early layers tend to operate on very local regions of the image, while deeper into the network, each node receives input from filters over a larger area of the image, allowing the network to access relations between more distant regions [4]. This network architecture has some obvious similarities with biological vision. Convolutional layers are analogous to receptive fields in visual cortex, which likewise consider more disparate regions together at higher levels of extrastriate cortex [7].
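To make the architecture described above concrete, here is a minimal PyTorch sketch, not either of the networks tested in this paper, of a convolutional network that maps an image to a vector of category probabilities; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyDCNN(nn.Module):
    """Illustrative only: convolutional stages followed by a fully
    connected classifier, mirroring the structure described above."""
    def __init__(self, n_categories=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Early layers: local filters that preserve spatial layout
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Deeper layer: each unit now pools over a larger image region
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, n_categories)

    def forward(self, x):                       # x: (batch, 3, 224, 224)
        h = torch.flatten(self.features(x), 1)
        return torch.softmax(self.classifier(h), dim=1)  # one probability per category

probs = TinyDCNN()(torch.randn(1, 3, 224, 224))
print(probs.shape, probs.sum().item())          # (1, 1000), sums to ~1.0
```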

Do the similarities between DCNNs and biological vision go deeper than this basic architectural feature? One way to evaluate this is by comparing physiological activity of neural units in biological systems with the activity of particular nodes in an artificial network. Pospisil, Pasupathy, and Bair presented AlexNet, a groundbreaking DCNN, with shape stimuli to which V4 cells are optimally tuned [8] and found some resemblance between node responses in intermediate layers of AlexNet and cell responses in V4, although the network response was quite sparse compared to biological systems, with many units responding to none or very few of the shape stimuli [9]. Other studies have looked at the correlation between a network’s classification accuracy and the similarity between network representations and representations in the inferotemporal gyrus (IT). Randomly varying parameters across several networks, these studies found that the activity of nodes in networks that perform better on the object classification task gives better predictions about the activity of clusters of neurons in IT when primates are presented with the same image [10,11].

Comparisons have also been made between a network’s performance and human behavior in similar tasks. In a sense, all performance measures on image classification are a comparison to human vision, as accuracy is being measured based on labels assigned by humans [12]. However, to evaluate similarities and differences between DCNNs and human vision, it can be instructive to examine network performance on tasks for which they were not explicitly trained. Several experiments have found similarities between convolutional networks and humans in such tasks. One study used features from a convolutional network to predict the memorability of certain object segments. Features extracted from the DCNN were predictive of objects’ memorability for human subjects, suggesting that humans and networks might be attending to similar features when viewing an object [13]. In another study, Peterson, Abbott, and Griffiths found a strong correlation between similarity judgments made by DCNNs and human similarity ratings [14].

Although it is interesting to observe similar performance levels for object recognition between DCNNs and biological vision, it is unclear whether the systems process information for object recognition in a similar manner. The present paper focuses on the perception of shape, the most important cue for object recognition [15]. Objects can be recognized accurately despite impoverishments across every other visual dimension, provided that global shape information is preserved [16, 17]. For example, consider the image pair presented in Fig 1. In Fig 1A, the information available for recognition has been significantly reduced across several feature dimensions. The object has no texture or background context, and the information along its contour has been simplified. Still, it is far more easily recognized as a bear than the object in Fig 1B, where cues like texture and context are preserved but object shape is interrupted. Similarly, an ordinary line drawing, or even a few well-chosen lines as in a Picasso sketch, readily allows object recognition via shape perception processes in humans.

Fig 1. Demonstration of the importance of global shape in object recognition.

(a) Silhouette of a bear; (b) Scrambled natural image of a bear (See text). Image URLs are in S2 File.

https://doi.org/10.1371/journal.pcbi.1006613.g001

If deep networks are to be taken as models of human perception, we would expect object shape to be a critical component of their classification decisions. Currently, it is unclear to what extent shape representations play a role in object recognition in DCNNs. Kubilius, Bracci, and Op de Beeck conducted several intriguing experiments that suggest deep networks do have shape representations that are reasonably similar to human shape representations [18]. Networks were able to classify object silhouettes with ~40% accuracy and had some sensitivity to non-accidental features of an object, which are thought to be important for recognition in human observers [16]. They also compared the impact of shape cues on recognition performance for the networks with different architectures (e.g., different number of layers) and found some evidence that deeper networks did better on tasks where object shape was important to performance.

On the other hand, some research on DCNNs is difficult to reconcile with the claim that they utilize the global shape of objects in detecting and recognizing them. Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, and Fergus found that perturbation of a small subset of image pixels could result in consistent misclassification of the image, across multiple DCNNs, despite the changes being undetectable to human observers [19]. In the perturbed images, the global shape of the object is unchanged, so the change in classification suggests that global shape information is likely not used for recognizing objects in these DCNNs. Another study used evolutionary algorithms to develop images that networks classified as certain objects with a high degree of confidence, despite a total absence of object shape in the images [20]. Zhu, Xie, and Yuille tested DCNN classification accuracy for images in which the object to be classified was removed [21]. They found that despite the object not actually being present in the image, networks performed reliably better than chance on the classification task, based purely on contextual information. These examples suggest that shape might be neither sufficient nor necessary for recognition in DCNNs.

More systematic tests are needed to understand whether, or in what ways, DCNNs process object shape. The fact that a network’s responses are sensitive to variables other than shape (such as texture or color) may reflect a valid use of information in its training history. The supervised learning method with which DCNNs are trained is agnostic about what information to consider when classifying an image, and in natural images texture and shape information are often highly correlated, so we can learn little about which cue matters most to the network’s classification without disentangling them. Reduction of performance by disruption of other variables may indicate that shape representations do not predominate, but such outcomes do not necessarily imply that shape information is not captured or potentially usable within the network: shape information may be latent in the system, overshadowed by other informational variables relevant for classification. We term this “the latency problem”. Showing that shape information is not implicitly captured or usable in a system therefore requires more systematic tests than simply showing that variables other than shape can be decisive in classification performance. Conversely, tests suggesting that DCNNs do use shape information have not distinguished local shape features from more global shape characteristics, nor have they disentangled responses to local orientation relations in visible contours from the more shape-defining contributions of the bounding contours of objects.

Questions about shape processing in DCNNs can be considered at three different levels: First, do deep networks trained to recognize objects use shape features in their classification decisions? Second, can deep networks be trained to use shape information in their classification decisions? Third, can formal analysis of deep networks’ computational processes tell us what kinds of shape features they can and cannot extract from an image? In the current study, we focus on the first question, aiming to understand the capabilities that deep networks automatically learn through training on natural images. We consider the study of trained networks to be particularly important for two reasons. First, the attention garnered by DCNNs relates to the success of trained networks in object classification tasks; both for theoretical understanding of these achievements and for practical applications, it is important to understand how DCNNs achieve their high classification accuracy. Second, comparisons between biological vision systems and DCNNs require understanding how trained networks are operating, and in particular the role of shape processing. A sizable literature has recently emerged finding similarity between DCNNs trained on object recognition tasks and human perception [9–11, 13, 14, 22–26]. These efforts highlight the importance of understanding the functioning of trained DCNNs that are successful in object recognition.

In the experiments reported here, we presented DCNNs trained for object recognition (and, where appropriate, human observers) with stimuli intended to reveal information about the usability of global shape and bounding contours. The series of experiments was designed to provide multiple sources of information with regard to the latency problem in characterizing shape capabilities, in the process clarifying the roles of texture information and object shape in deep convolutional networks’ classification decisions. To carry out these studies, we tested two commonly used deep convolutional networks: AlexNet [4] and VGG-19 [27]. AlexNet has eight layers and is the deep network that started the DCNN revolution for object recognition. VGG-19 is deeper, with 19 layers, and approaches the state of the art in object classification. Our approach was to use a variety of systematically modified stimuli to reveal the contribution of shape information to network responses. In Experiment 1, we examined the relative importance of overall shape and texture information by using objects in which the overall shape was preserved, but a different texture, taken from another object, was superimposed on the object’s silhouette. In Experiments 2–4, we tested the networks on images with impoverished or altered texture and context information by using glass figurines, object outlines, and silhouettes. Following up the results and hypotheses emerging from these experiments, we tested the networks on images with manipulations that altered shape at a global level while largely preserving local shape features, and vice versa (Experiment 5).
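As an illustration of the basic testing procedure used throughout these experiments, the following sketch loads torchvision’s pretrained AlexNet and VGG-19 as stand-ins for the trained networks (the exact weights the authors used are not specified here) and reports top-five classifications for a stimulus image; "stimulus.png" is a placeholder path.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing for both networks.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "stimulus.png" is a placeholder for one of the modified test images.
image = preprocess(Image.open("stimulus.png").convert("RGB")).unsqueeze(0)

for name, net in [("AlexNet", models.alexnet(pretrained=True)),
                  ("VGG-19", models.vgg19(pretrained=True))]:
    net.eval()
    with torch.no_grad():
        probs = torch.softmax(net(image), dim=1).squeeze(0)  # 1000 class probabilities
    top5 = torch.topk(probs, 5)
    print(name, [(idx.item(), round(p.item(), 4))
                 for p, idx in zip(top5.values, top5.indices)])
```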

Results

Experiment 1

As noted earlier, shape is of great importance in human perception of objects, and shape information predominates in human object recognition. Prior tests of DCNNs have yielded some evidence that they classify by means of shape, whereas other work revealed examples in which images with textural similarity, but no shape in common with an object, were classified as that object with a high degree of confidence [20]. In Experiment 1, we directly compared convolutional networks’ use of shape and texture information in their classification decisions. Using object silhouettes with no surface information, we overlaid a texture from a different object on top of the black figural region. We then compared the network’s preference for the object whose shape was displayed with its preference for the object whose texture was displayed.

Experiment 1 Method.

Test stimuli. Forty images of object silhouettes and 40 natural images were obtained from internet sources (for URLs of original images, see S2 File). Half of the object silhouettes were animals, and half manmade artifacts. In the ImageNet database, about 40% of categories are animals, and 50% are manmade objects (the last 10% are other inanimate objects like food). For each silhouette, a texture from one of the natural images was overlaid on the object. All shapes and textures were taken from the 1000 object categories on which the network was trained. See Fig 2 for examples.
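A minimal sketch of how such shape-texture hybrids can be composited, assuming black-on-white silhouette images; the file names are hypothetical placeholders, not the authors’ stimulus files.

```python
import numpy as np
from PIL import Image

# Load a black-on-white silhouette and a photograph supplying the texture.
silhouette = np.array(Image.open("teapot_silhouette.png").convert("L"))
texture = np.array(Image.open("golf_ball_photo.png").convert("RGB")
                   .resize(silhouette.shape[::-1]))  # PIL wants (width, height)

mask = silhouette < 128                               # True inside the black figure
hybrid = np.full(texture.shape, 255, dtype=np.uint8)  # white background
hybrid[mask] = texture[mask]                          # fill figure with other object's texture
Image.fromarray(hybrid).save("teapot_shape_golf_ball_texture.png")
```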

Fig 2. Sample stimuli used in Experiment 1.

The bounding shape of an object was combined with the texture of a different object to generate each image. a) Shape: Teapot | Texture: Golf ball; b) Shape: Vase | Texture: Gong; c) Shape: Airplane | Texture: Otter; d) Shape: Obelisk | Texture: Lobster; e) Shape: Cannon | Texture: Pineapple; f) Shape: Ram | Texture: Bison; g) Shape: Camel | Texture: Zebra; h) Shape: Orca | Texture: Kimono; i) Shape: Otter | Texture: Speedometer; j) Shape: Elephant | Texture: Sock. The full image set is displayed in Figs 3–6.

https://doi.org/10.1371/journal.pcbi.1006613.g002

Network. Tests were conducted on a pre-trained VGG-19 network.

Experiment 1 results.

For each of the displayed images, the network assigned a probability value to each of the 1000 object categories it had been trained to classify. The objects that received the five highest probability assignments for each image are shown, broken into four parts for size considerations, in Figs 3–6, along with the probability assigned to the correct shape and texture label. Based on shape, the network chose the correct answer as its highest probability classification for 5 of the 40 objects. Based on texture, it chose the correct answer as its highest probability classification for 4 of the 40 objects. In terms of including the correct answer among its top 5 possibilities, the network classified 8 of 40 objects within its top 5 choices by shape and 7 of 40 by texture. Overall, the assigned probability was lower than is typical for natural images for both the correct texture-object label and the correct shape-object label. For photographs of objects that include texture, shape, and context, 90% or more of the total probability across 1000 object categories will ordinarily be assigned to the correct object label. In this simulation, a few shape-based classifications were near natural-image performance, such as the abacus and the trombone, but on average shape-based classifications were nearer to 10%. Likewise, for textures, a few objects were assigned probabilities of 20% or greater, but average performance was quite poor. Human observers, for whom shape is predominant in object recognition, readily produce correct shape labels for all of these objects, as confirmed by pilot studies. By contrast, across the whole display set used here, the object whose shape was depicted in the displays was selected by the network, on average, as its 209th ranked choice.

Fig 3. Network classifications for the stimuli presented in Experiment 1 Part 1.

The leftmost column shows the image presented. The second column in each row names the object from which the shape was sampled. The third column names the object from which the texture was obtained. Probabilities assigned to the object names in columns 2 and 3 are shown as percentages below the object label. The remaining five columns show the probabilities (as percentages) produced by the network for its top five classifications, ordered left to right by probability. Correct shape classifications in the top five are shaded in blue and correct texture classifications are shaded in orange.

https://doi.org/10.1371/journal.pcbi.1006613.g003

Fig 4. Network classifications for the stimuli presented in Experiment 1 Part 2.

https://doi.org/10.1371/journal.pcbi.1006613.g004

Fig 5. Network classifications for the stimuli presented in Experiment 1 Part 3.

https://doi.org/10.1371/journal.pcbi.1006613.g005

Fig 6. Network classifications for the stimuli presented in Experiment 1 Part 4.

https://doi.org/10.1371/journal.pcbi.1006613.g006

Although there were indications of some use of shape information by the network for only about 20% of the displays tested in Experiment 1, an interesting pattern can be seen in the data. Shape information appeared to play some role in DCNN classification of artifacts but almost none for animals. The network had the object shape in its top-five classification selections for seven of the 20 artifacts, but only one of the 20 animals. The average probability assigned to the image’s shape label was 10 times higher for artifacts than for animals (17.90% vs. 1.75%). Texture appeared to be about equally weighted for both kinds of stimuli: there were three classifications in the top 5 choices for the texture-object in the 20 artifact images, and four in the 20 animal images, and the mean probability assigned to the texture-object was about equal for both kinds of images (3.22% vs. 3.73%).

The data from Experiment 1 were also analyzed by directly comparing the probability value associated with the object whose shape was described by the silhouette and the object whose texture was overlaid atop the shape. Overall, texture was preferred more often than shape (23 vs. 17). However, there was a large difference between network behavior in manmade objects versus animals. The network assigned higher probability to the shape-label in 14 of the 20 manmade objects, but only three of the 20 animal images (see Figs 7 and 8).

Fig 7. Comparison of probabilities assigned to image shapes and textures for animals.

On the x-axis, the shape and texture of each object are given as shape-texture. Filled black bars display the probability given by the network to the correct shape. Outlined bars display the probability given by the network for the correct texture.

https://doi.org/10.1371/journal.pcbi.1006613.g007

Fig 8. Comparison of probabilities assigned to image shapes and textures for artifacts.

On the x-axis, the shape and texture of each object are given as shape-texture. Filled black bars display the probability given by the network to the correct shape. Outlined bars display the probability given by the network for the correct texture.

https://doi.org/10.1371/journal.pcbi.1006613.g008

Finally, we measured the contribution of shape and texture to network classification by looking at the rank order of the correct shape-object and the correct texture-object for each display used. For the correct shape response, the mean rank among network outputs was 86.70 for artifacts, and 330.50 for animals. For the correct texture response, the mean rank was 249.95 for artifacts and 65.30 for animals.
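The rank-order measure used above can be computed directly from a network’s 1000-element output; a minimal sketch follows, with a hypothetical class index standing in for the correct shape or texture label.

```python
import torch

def label_rank(probs: torch.Tensor, label_idx: int) -> int:
    """Rank of `label_idx` when all 1000 probabilities are sorted
    from highest to lowest (rank 1 = the network's top choice)."""
    order = torch.argsort(probs, descending=True)
    return (order == label_idx).nonzero().item() + 1

# Example with a random probability vector and a hypothetical class index.
probs = torch.softmax(torch.randn(1000), dim=0)
print(label_rank(probs, label_idx=42))
```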

Experiment 1 Discussion.

Experiment 1 suggested a major difference between human observers and DCNNs. Whereas human observers readily classify objects by shape, even in the face of uncharacteristic texture or context information, VGG-19 showed no evidence that shape information plays a primary role in its classifications. The correct label based on shape was the network’s first-choice classification for only 5 of the 40 objects sampled, and the correct shape ranked, on average, 209th among network outputs.

Despite the lack of clear, general accessibility of object shape in DCNN classifications, there was some evidence suggesting use of shape information in certain cases. These cases were almost entirely confined to the 20 artifacts tested: for 5 of the 20 artifacts, the first-choice classification matched the correct shape, and for 7 of the 20 the correct shape appeared in the top 5 choices. In contrast, no first-choice classifications were correct for animal shapes, and only one of the 20 animal displays showed any shape match among the top five classifications.

Although the network appears to utilize shape for classifying artifacts but not animals, there are still inconsistent examples in the test. Some objects, even among the artifact stimuli, show no evidence that shape is involved in network classification. For example, the network assigned essentially zero probability (.000002) to the airplane label for the airplane with otter texture (where .001 would be the value obtained by random guessing), instead classifying the image as “hatchet”, “nail”, or “hook” as its top three choices, none of which shares any shape similarity with an airplane. While this is a particularly glaring failure on the network’s part, over half of the artifacts were misclassified in all of the network’s top-five selections, and the average rank order of the shape label for artifacts was 86.70. Many implausible shape misclassifications were assigned higher probability than the correct shape label. We mention this not to diminish the network’s success for some shapes on a task for which it was not explicitly trained, but to indicate that the results of Exp. 1 reveal differences in misclassification errors between the network and the human visual system. Humans have considerably less difficulty recognizing any of these objects by shape and would never consider some of the objects to which the network assigns high probability (e.g., “parachute” for hammer, or “electric guitar” for apron) as likely candidates.

Understanding why the network makes these kinds of misclassifications could be an important step to reveal the differences between network and human classification capabilities. It is possible that these erroneous responses are simply some intermediate landing point between shape and texture evidence, but classifications like “hatchet” for the airplane-otter suggest little consideration for the object’s texture in the network’s perceptual decision. It is also important to note that we tested objects separated from backgrounds and contexts. Typical tests of DCNNs include contextual information, which has been shown to be so important that networks perform reliably better than chance in classifying objects with images in which the object to be classified has been removed [21].

What could give rise to the differences we observed between artifacts and animals in networks that are not explicitly trained to recognize differences between superordinate categories like animacy? One possibility is that the network might learn to down-weight the contribution of texture cues in the recognition of many artifact categories during training, since there is a more diverse range of textures and colors associated with artifacts such as guitars and hammers. A sofa, for example, can be upholstered with any number of patterns, while a leopard’s fur will tend to be more consistent across exemplars. Another possibility is that DCNNs attribute less importance to shape cues in natural objects due to the large variability in the bounding contours of some natural objects. Animals are non-rigid, and their bounding contours vary considerably from image to image depending on pose, so it might be maladaptive to learn shape features for natural objects during training. Yet another possibility is that DCNNs really do not encode global shape but can nevertheless make use of some local shape features, which tend to be highly diagnostic for some artifacts but provide little discriminability between different kinds of animals. Later experiments shed light on these possibilities. After considering additional results, we return to these issues in Experiment 4 and the General Discussion.

Experiment 2

Experiment 1 showed that in displays that preserved overall object shape but altered texture, shape was a poor predictor of network classifications. There was some indication, however, of sensitivity to shape information in certain cases. The network made several accurate classifications of objects with non-canonical surface texture. This success appeared to be largely confined to artifacts, although even artifact classifications included many implausible top selections. On the other hand, shape information appeared to be largely irrelevant for classification responses generated for animal displays. In Experiments 2–4, we developed more detailed tests to examine whether networks could classify objects based on shape alone, with changed or absent surface texture and context information.

It is a remarkable fact, one attesting to the primacy of shape processing in human perception, that human observers readily recognize shapes in arbitrary materials (and construct and display them, etc.). In Experiment 2, we presented two deep networks with glass figurines. All figurines were pictures of real glass objects. Since glass figurines lack the natural surface colors and textures of the objects represented, we expected that accurate classification would be difficult without a representation of the object’s bounding shape. We expected that if the networks did have access to object shape, they might be able to accurately classify the glass figurines even in the absence of other, usually accompanying, cues for recognition. In other words, DCNN classifications that would resemble even a child’s intuitive response the first time they see a glass elephant would furnish evidence that shape information plays a role in DCNN classification.

Experiment 2 Method.

Images. Twenty images of glass figurines were obtained from the internet (see S2 File for URLs). Half of the figurines were of animals and half were of manmade objects. The images had some texture information, but the overall texture of each object was very different from a canonical instance of the represented object. The background in all but one of the 20 images was nondescript: either a homogeneous field or a color gradient. The one exception was the schooner figurine (see below), which was photographed on a table. See Fig 9 for examples.

Networks. Classification was tested on two networks: AlexNet, with eight layers, and VGG-19.

Experiment 2 results.

We assessed the networks as correct if they generated the names that a human observer would give to each image. (Human classification was verified in pilot work with human observers.) Neither network produced the correct label as its top choice for any of the 20 objects. Figs 10 and 11 show the top five classification choices for the 20 images for VGG-19 (the better-performing network; see below). Percentages in parentheses represent the probability assigned to each label by the network. In the absence of any evidence, the baseline probability of an object was 0.1%, as there were 1000 object categories.

Fig 10. VGG-19 classifications for glass figurines Part 1.

The leftmost column shows the image presented to the VGG-19 DCNN. The second column shows the correct object label and the probability generated by the network for that label. The other five columns show probabilities for the network’s top five classifications, ordered left to right from highest to lowest. Correct classifications are shaded in blue.

https://doi.org/10.1371/journal.pcbi.1006613.g010

Fig 11. VGG-19 classifications for glass figurines Part 2.

https://doi.org/10.1371/journal.pcbi.1006613.g011

Most of the top-choice responses seem bizarre to human perceivers, such as “web site” for goose, “oxygen mask” for otter, “can opener” for polar bear, and “chain” for fox. Although the stimuli here are (intentionally) different from what the networks were trained on (because they are glass figurines), the results clearly indicate that shape, if accessible at all to DCNNs, does not play the defining role in object recognition that it does for human perceivers. Comparing the networks’ responses to chance-level performance (0.1%), AlexNet assigned a below-chance probability to the correct shape for 18 of the 20 test objects, and VGG-19 did so for 15 of the 20 objects. Analysis of the rank order of correct labels revealed that the correct shape choice averaged, across the display set, a mean rank of 162.60.
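A sketch of the below-chance comparison reported above, assuming `all_probs` holds one 1000-element probability vector per test image and `correct_idx` holds hypothetical class indices for the correct labels.

```python
import torch

def below_chance_count(all_probs: torch.Tensor, correct_idx: list,
                       chance: float = 1 / 1000) -> int:
    """Count images whose correct label receives less probability than
    the 0.1% baseline of guessing among 1000 categories."""
    idx = torch.tensor(correct_idx)
    correct_p = all_probs[torch.arange(len(idx)), idx]
    return int((correct_p < chance).sum())

# Example with random outputs for 20 hypothetical test images.
all_probs = torch.softmax(torch.randn(20, 1000), dim=1)
print(below_chance_count(all_probs, correct_idx=list(range(20))))
```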

The criterion of finding the correct answer among the top five most probable responses is often used to assess performance of DCNNs. Using this criterion, VGG-19 correctly classified two of the 20 images (the shark figurine and the grand piano figurine), while AlexNet misclassified all 20 images. VGG-19 assigned a 20.58% probability to the correct grand piano response, second only to the 53.07% probability it assigned to “radio telescope”. For the shark, VGG-19 selected the correct label as its 4th choice, assigning a 2.58% probability. For many of the remaining images, the objects were misidentified as glass or metal kitchen objects, such as “water jug” or “can opener”.

As another way of measuring the networks’ sensitivity to object shape, we compared the probability a network gave to the object in the image with the probability it gave to the nine other objects of the same kind used in the experiment. For example, we compared the probability the network gave to “goose” when shown the goose figurine to the average of the probabilities it gave to the other nine animal labels (“otter”, “peacock”, and so on) for the same figurine. For this analysis, we kept glass animals and glass artifacts separate to ensure that higher probabilities could not be accounted for by low-level contour features like the presence or absence of a straight edge. For both animate and inanimate objects, the probability assigned to the correct shape label was not higher than the average for the other nine labels more often than would be expected by chance: five of the 10 inanimate objects, and seven of the 10 animate objects, were given a lower probability than the average of the other nine in their class.
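A sketch of this within-set comparison, assuming `probs` is a network’s 1000-element output for one figurine and `set_indices` holds hypothetical ImageNet indices for the ten animal (or ten artifact) labels in the test set.

```python
import torch

def beats_within_set_mean(probs, own_idx, set_indices):
    """Is the probability for the figurine's own label higher than the
    mean probability for the other nine labels from the same set?"""
    others = [i for i in set_indices if i != own_idx]
    return bool(probs[own_idx] > probs[others].mean())

# Hypothetical indices for the ten animal figurine labels.
animal_indices = [99, 339, 84, 291, 2, 385, 292, 148, 331, 75]
probs = torch.softmax(torch.randn(1000), dim=0)
print(beats_within_set_mean(probs, own_idx=99, set_indices=animal_indices))
```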

Experiment 2 Discussion.

The networks showed little capability of classifying glass figurines. Although glass figurines of animals remove the natural surface texture, they also introduce surface properties of their own. As in Experiment 1, surface features appear to play a major role in the network’s classification decisions. For example, the peacock figurine has “water jug” and “pitcher” in its top-five objects, despite having no shape similarity to either. “Goblet”, “vase”, “cup”, and “hourglass” are also common misclassifications made by both AlexNet and VGG-19. By contrast, the networks appear to make few misclassifications based on similarity between the shape of two objects. Aside from the piano and great white shark, only the tiger had a misclassification (“Egyptian cat”) that would be consistent with use of some sort of information about object shape. It appears that shape does not play an independent, predominant role in recognition, as it does in humans [28–30].

Do the network’s few successes point to a broader trend of utilizing both shape and texture in its classification decisions? It is difficult to determine what is different about these three images (the grand piano, the great white shark, and the tiger) from the other 17 that the network fails to classify. One thing to keep in mind is that there is a chance that a plausible shape classification appears in the network’s top-five selections without the network having any sensitivity to the shape of the image. The tiger seems like a likely candidate for this possibility. While Egyptian cats and tigers have some shape features in common, Egyptian cats and elephants have very few, yet the network names “Egyptian cat” as its top selection for both the elephant and the tiger glass figurines. In fact, it assigns a higher probability to the elephant being an Egyptian cat than to the tiger being one. It is possible that some surface feature, or conjunction of surface features and local edge properties, is driving classification in both cases.

This explanation is less satisfactory for the grand piano and great white shark, whose shape labels actually appear in their images’ top-five selections and do not appear in any of the other images’ top selections. In the case of the piano, one possibility we considered is that the texture of the keys drove classification, but a further test showed that the network performs well even after the keys have been occluded or blurred out. We tested the network on five additional glass grand piano images (see Fig 12), and it was unable to correctly classify any of them in its top-five selections. It is unclear what information is present in the image where the network does well that is absent in the other five, but it is likely a local shape feature, as global shape is very similar across the six images. Likewise, it is likely that local contour features, not global shape, are driving the network’s accurate classification of the great white shark. We discuss this hypothesis in greater detail and revisit these positive examples in Experiment 4, after we have considered more data regarding the network’s sensitivity to local and global shape information.

Fig 12. Five additional glass pianos.

VGG-19 incorrectly classified each of these five images despite correctly classifying the glass piano shown in Fig 11.

https://doi.org/10.1371/journal.pcbi.1006613.g012

The results of Experiment 2 clearly showed an absence of shape sensitivity for glass figurines. Human classification of such objects is affirmed by the fact that we make, display, and recognize such objects routinely. Although in natural scenes, an elephant is never made of glass, is never 4" high, and only rarely appears on anyone's desk or coffee table, human use of object shape makes recognition of a glass elephant on a desk effortless and routine. Not only is this predominance of shape not seen in DCNN performance, there is little to suggest that shape, independent of other information, is accessible at all in classification of these objects.

In this experiment, texture or surface quality information provided a stronger influence than object shape on object recognition by both AlexNet and VGG-19. We might suspect that the strength of surface texture cues pulled the networks’ top-five classifications towards texture-object labels. If object shape played any role at all, however, we would expect that the correct object label would be assigned a probability that is at least greater than chance. In most cases, the correct shape label was assigned a value less than chance, and objects of similar composition but different shapes tended to receive probabilities as high as the correct shape.

Experiment 3

A remarkable fact about human object perception is that we readily extract shape from outline drawings. This ability clearly depends on shape, as outlines omit surface information completely. Object outlines have the same texture within the bounding contour as outside it, and there is no variation in texture between the outlines of two different objects. We tested outlines in Experiment 3 to extend the earlier results and specifically to remove competing texture or surface information as much as possible. If, for example, the pictures of glass figurines somehow distracted deep networks from utilizing some encoded shape information due to competing surface texture, we expected that the problem would be substantially mitigated by using outlines. On the other hand, if deep networks cannot access shape from outline drawings of objects, we expected poor performance.

Experiment 3 Method.

Images. Forty images of object outlines were selected from the internet, half of which were manmade artifacts, and half of which were animals. In all images, the only contrast difference was at the object contour. Images were uniformly white at all other locations in the image. There was a degree of abstraction in the object contours, as the outlines were not boundaries of real natural objects. All were readily recognized and correctly named by human observers. Fig 13 shows examples of stimuli used in Experiment 3.

Networks. Tests were conducted using both AlexNet and VGG-19.

Experiment 3 results.

None of the correct shape labels for the 40 objects were chosen as the first-choice classification by either VGG-19 or AlexNet. Two of the 40 objects were named among the top-five choices by VGG-19, and only one was given as a top-five response by AlexNet. Figs 14–17 show the full set of responses. Twenty-eight of the probabilities assigned to the object label given by humans were below the chance rate of 0.1% for VGG-19, and 35 of 40 were below 0.1% for AlexNet. On average, the correct classification by shape was the network’s 328th most preferred choice. As in Experiment 2, we compared the probability assigned to the correct shape label to the shape labels of 19 other animate or inanimate objects. For animals, the probability was lower for the shape-object than for the mean of the 19 other animal images in eight of the 20 trials; for artifacts, it was lower in 10 of the 20 images. These results do not differ from chance, under which the target shape would exceed the mean of the other 19 in half the trials. All of the objects that were correctly classified were artifacts; none were animals.

Fig 14. VGG-19 classifications for object outlines Part 1.

The leftmost column is the image presented to the DCNN. The second column from the left is the correct object label and the classification probability produced for that label. The other five columns show probabilities for the VGG-19’s top five classifications, ordered left to right in terms of the probability given by the network. Correct classifications are shaded in blue.

https://doi.org/10.1371/journal.pcbi.1006613.g014

Fig 15. VGG-19 classifications for object outlines Part 2.

https://doi.org/10.1371/journal.pcbi.1006613.g015

Fig 16. VGG-19 classifications for object outlines Part 3.

https://doi.org/10.1371/journal.pcbi.1006613.g016

Fig 17. VGG-19 classifications for object outlines Part 4.

https://doi.org/10.1371/journal.pcbi.1006613.g017

For some reason, “corkscrew” was VGG-19’s first or second choice for 10 of the 40 objects, and it appeared among the top 5 choices for 15 objects. More understandable, perhaps, is the finding that “envelope” was the first or second choice for 14 of the objects and appeared in the top 5 for 19. These responses are most likely due to local similarities between training photographs of a white envelope, possibly including black lettering, and the tested outline images, which are also white with thin black lines. There is no evidence that the shape of the outline images is being extracted and classified using shape features acquired from training images with similar figure boundaries.

Experiment 3 Discussion.

Deep convolutional networks showed little capability to recognize objects with shapes defined by outline contours. Whereas the glass figurines had textures that were compatible with a different set of objects from their correct shape-object label, the texture in object outlines is not differentially diagnostic of any object classification, as it is identical to the surrounding background and identical throughout the object set. We hypothesized that if global shape information were extracted by the networks but not used for glass figurines due to the strength of texture cues, object shape might play a larger role for outline images. The data do not support this hypothesis; classification performance was just as poor for object outlines as it was for pictures of glass objects.

The failure to classify objects correctly based on outline shape marks a clear divergence from human perception. All of the displays used in this experiment were consistently and accurately classified by human observers. We verified this with behavioral tests in which the outlines were shown together, as well as interleaved with photographs, to confirm that humans have no trouble classifying object outlines, even in unexpected contexts. However, the results do not by themselves exclude some possibilities for use of shape information in DCNNs. That humans readily see shape in outline drawings is a remarkable fact in itself, and one that is not completely understood [31, 32]. It would certainly be possible to envision a shape processing system that balked at outlines and used only surface edges. As introductory drawing teachers often stress, outlines are ecologically anomalous. Within biological vision, some perceptual processes seem to treat outlines and surface edges differently, as in perceptual completion [33]. Humans can see shape given by ordinary surface edges or outlines, but if deep networks cannot utilize the latter, it does not necessarily imply that surface edges do not play some role in classification in conjunction with texture features. Humans’ fluent use of outlines to see object shape indicates the strong role of shape representation in human object recognition, and human interpretation of forms in outlines probably connects naturally to some stage of perceptual representation [34]. It appears that DCNNs differ from human processors in that they have little or no linkage between shape properties embodied in outlines and the classification labels in the output layer. DCNNs’ failure to use outlines does not, however, rule out the possibility that these systems may utilize some shape information in more natural cases.

These issues may relate to an important factor not yet mentioned. Figure-ground assignment, or equivalently, assignment of border ownership at occluding edges, is a well-known feature of human perceptual organization [35, 36]. It appears that outlines, especially closed outlines, are interpreted in human vision as owning their borders. (The enclosed area is taken to be the bounded object.) DCNNs do not have an obvious way of representing figure vs. ground or border ownership. These seem to be more explicitly representational aspects of human perceptual processing. At least some of the problem with outlines may involve figure-ground issues. Misclassifications of objects as “hook”, “safety pin”, and “syringe”, which all have empty interior regions, suggest that DCNNs might be interpreting the actual outline as the object, rather than seeing an object as occupying the region within the outline.

On the other hand, results for a few of the images presented in Exp. 3 (the beer bottle and the coffee mug), along with objects like the tandem bicycle and trombone that were not correctly classified but were assigned probabilities well above chance, suggest that deep networks do not exclusively treat the black outlines as the bodies of the objects themselves. If the networks were only treating the black outlines as the figure, these objects should have below-chance probability, as they do not have thin, black forms. Of course, we do not mean to imply that a DCNN employs any consistent approach or strategy; any good predictor from any of the many filters in the network may influence the outcomes toward a correct classification. Perhaps these images have certain local features that can be extracted and that facilitate recognition. DCNNs may not capture global shape, but may pick up some relatively local shape features, a possibility we discuss further in connection with later results.

Experiment 4

Experiments 2 and 3 found little evidence that deep convolutional networks access global shape in object recognition tasks. These results may seem surprising, as some recent reports have suggested that DCNNs do possess some shape classification abilities. Kubilius et al. [18] found that removing surface features from the Snodgrass and Vanderwart dataset of colored-in line drawings [37] did not totally destroy networks’ classification performance. With the regular line-drawing images, classification performance was 80–90%. Removal of color information reduced performance to around 70%, and removal of all inner surface gradients (black silhouettes) brought performance down to about 40%. It is arguable whether 40% classification accuracy represents a success or failure in shape-based classification for DCNNs. On the one hand, this marks a divergence from human performance, which is largely unaffected by the removal of color and inner surface gradient information from most objects. On the other hand, it seems almost impossible that the network would reach even 40% accuracy without some information about object shape. In contrast, our findings about recognition for glass objects and line drawings provided almost no evidence that shape representations are used for classification in DCNNs. In Experiment 4, we tried to replicate Kubilius et al.’s findings as a first step in clarifying this apparent discrepancy.

Experiment 4 Method.

Images. The same 40 object silhouettes found on the internet and used in Experiment 1 were used in Experiment 4, this time without any texture substitution. All images consisted of a single black figure on a white background. Half of the images were artifacts, and half were animals. The black figures were silhouettes of object drawings, rather than being taken from photographs of real objects with their textures removed, so some contour information that would typically be in a natural instance of the object was abstracted in the silhouette images. These abstractions have no effect on the objects’ recognizability to a human observer. See Fig 18 for examples. We also tested the network on the same 40 silhouettes with white figures on a black background and red figures on a white background to measure the influence of surface color on network classification performance.
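The color variants can be generated mechanically from the black-on-white silhouettes; a minimal sketch, with a hypothetical file name, follows.

```python
import numpy as np
from PIL import Image

sil = np.array(Image.open("bear_silhouette.png").convert("L"))
figure = sil < 128                                 # pixels inside the black figure

def recolor(figure_rgb, background_rgb):
    out = np.empty(sil.shape + (3,), dtype=np.uint8)
    out[...] = background_rgb                      # paint the background color
    out[figure] = figure_rgb                       # paint the figure color
    return Image.fromarray(out)

recolor((255, 255, 255), (0, 0, 0)).save("bear_white_on_black.png")
recolor((255, 0, 0), (255, 255, 255)).save("bear_red_on_white.png")
```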

Networks. As in Experiments 2 and 3, we tested AlexNet and VGG-19 on the silhouette images.

Experiment 4 results.

For the 40 black silhouettes on a white background, VGG-19 and AlexNet correctly classified 20 and 15 of the 40 presented images (in their top-five classifications), respectively. Figs 19–22 show the results for VGG-19. Performance was worse for images with white figures on black grounds, where the network classified seven of the 40 images correctly, and for images with red figures on white grounds, where it classified nine of the 40 images correctly.

Fig 19. VGG-19 classifications for black object silhouettes Part 1.

The leftmost column shows the image presented to VGG-19. The second column from the left shows the correct object label and the classification probability produced for that label. The other five columns show probabilities for the network’s top five classifications, ordered left to right in terms of the probability given by the network. Correct classifications are shaded in blue.

https://doi.org/10.1371/journal.pcbi.1006613.g019

Fig 20. VGG-19 classifications for black object silhouettes Part 2.

https://doi.org/10.1371/journal.pcbi.1006613.g020

Fig 21. VGG-19 classifications for black object silhouettes Part 3.

https://doi.org/10.1371/journal.pcbi.1006613.g021

Fig 22. VGG-19 classifications for black object silhouettes Part 4.

https://doi.org/10.1371/journal.pcbi.1006613.g022

Experiment 4 Discussion.

The results from Experiment 4 are largely consistent with the findings reported by Kubilius et al. In the absence of other information, deep convolutional networks could classify object silhouettes about 50% of the time. In the better-performing VGG-19, there were 12 out of 40 correct top choices, and another eight correct choices in the network’s top-five selections. Classification accuracy was once again higher for artifacts (13/20 correct) than for animals (7/20 correct).

Performance was notably worse for white-on-black and red-on-white figures than for black-on-white figures. One reason for this might be that there are more canonically black objects in the training set than white or red ones. For example, the cannon is correctly classified when presented as a black figure, but incorrectly classified when presented as a red figure; instead, “fire engine” appears in the network’s top-five choices, a selection obviously driven by surface properties. Another reason networks might be better at classifying black figures is that they more closely resemble photographic images that were used in network training. Objects will appear very dark or even black if they are between the camera and a bright light, as in actual silhouettes. Possibly, exposure to training examples like these makes the network more likely to accept dark figures as instances of an object, even one that is not canonically dark. The differences in network performance across these three testing sets point to the strong influence of surface information in classification. For humans, a homogeneous surface texture would likely not be considered at all in recognition, as the visual system would recognize that there is not enough surface information present to be diagnostic. The network makes no such evaluation and remains highly sensitive to such cues.

Regarding shape, this experiment showed a clear contribution of contour properties in classification of object silhouettes. Within a given display set, all of the test displays shared the same coloration; therefore, all differences in classification responses from the DCNNs involved contour information. Performance at this level demands explanations that go beyond a simple conclusion that DCNNs do or do not process object shape. If DCNNs had access to global shape information, we might have expected their performance to be similar to humans’, readily producing accurate classifications for all displays. Indeed, in about 50% of cases, responses seemed to be correct or close in shape to the target object. Still, the results fall far below 100%, even when scored as correct if a network generated a correct classification among its top five outputs. The results also contain some rather conspicuous failures to process overall shape. For the porcupine display, for example, VGG-19 gave “bald eagle” as its top choice. For the lion, AlexNet’s top choice was “goose”, with high confidence, more than twice the probability of any other response. These results contain important information regarding processing of overall shape. For humans, at least, the lion display is both recognizable as having the shape of a lion and clearly not shaped at all like a goose. It could be pointed out that failure to give a certain label may merely indicate that the silhouette captured the shape from a vantage point uncharacteristic of examples in the training set. The implications for global shape processing, however, hinge less on the selection of the correct name than on the incorrect answers furnished. “Bald eagle”, as well as the runners-up “vulture”, “ostrich”, and “buckeye”, are not a close shape match for any porcupine.

These results suggest that overall shape is elusive in DCNN responses, but also that something relating to shape allows success in some of the cases tested. Perhaps a deeper analysis of what is meant by shape is needed to understand both the successes and failures of DCNNs. We consider this in successive steps below.

First, why is classification so much better for object silhouettes than for glass figures and shape outlines? We have already commented that the ability to use outlines as depicting shape, although significant in human perception, is not a necessary condition for a DCNN to be a shape processor. To confirm that the difference between network performance on silhouettes and outlines was not item-specific, we sampled the outlines of the 40 silhouettes and tested the network on outline versions of the stimuli from Experiment 4. The results closely matched those reported in Experiment 3: the network correctly classified only three of the 40 images within its top-five selections.
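To make this manipulation concrete, the sketch below shows one way to derive an outline image from a silhouette in MATLAB with the Image Processing Toolbox. It is a minimal illustration, not our exact pipeline; the file name is hypothetical, and a black figure on a white background is assumed.

```matlab
% Minimal sketch: derive an outline image from a silhouette image.
% Assumes a black figure on a white background; 'silhouette.png' is a
% hypothetical file name.
S = imread('silhouette.png');
if size(S, 3) == 3
    S = rgb2gray(S);                           % collapse RGB to grayscale
end
figureMask = ~imbinarize(S);                   % true where the black figure is
outline = bwperim(figureMask);                 % one-pixel bounding contour
outline = imdilate(outline, strel('disk', 1)); % thicken for visibility
imwrite(uint8(~outline) * 255, 'outline.png'); % black outline on white
```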

What about the better performance of black silhouettes over glass objects? One reason may be that silhouettes successfully reduce distracting texture information, i.e., texture information that would tend to promote a classification other than the correct shape-based response. Glass objects may have contained more misleading surface information than black silhouettes: there are more objects with glass or other transparent or reflective surfaces among the 1000 categories on which the networks were trained than there are uniformly black objects. As mentioned, silhouettes also have the advantage of potentially looking similar to some photographs of the objects used in network training, if the photographs were taken at sunset or with a bright light behind the object. So, although we still see some misclassifications based on object color, such as “mortarboard” and “academic gown”, the network appears overall more robust to black surface textures.

Another important factor contributing to the differences across the previous three experiments is the involvement of figure-ground segmentation. Silhouettes are likely easier to classify than glass objects because the surface information they provide contains no internal contours. Although figure-ground segmentation is an important part of human shape processing, DCNNs likely produce object classifications for natural images without performing any explicit figure-ground segmentation. In human perception, bounding contours define shape, and shape descriptions are conferred based on the bounding contours of segmented objects [38]. Without any figure-ground segmentation mechanism, all contours in training examples probably have equal status: nothing designates a bounding contour as relevant to overall shape, as opposed to contour information that may be part of surface texture, noise, etc.

With displays stripped of all contour information except bounding contours, the networks do better. This suggests that the networks must have used some information about the forms of objects to achieve the performance observed in Experiment 4, even though performance with silhouettes still falls far short of the shape-based classification that humans readily accomplish. By comparison with the earlier experiments, it also suggests that the important (bounding) contour information is more influential when no other contours are present.

What use is made of contour information? An important insight may be provided by the results for the black bear silhouette. The silhouette is substantially simplified such that key points of concavity along a bear outline are connected, mostly by straight lines. This is similar to a classic demonstration in object recognition, Attneave’s cat [39]. Attneave observed that humans can robustly recognize objects whose contour has been changed significantly at a local level, provided that at the global level the spatial relationships between important points along the contour are preserved. Deep networks do not appear to have the same capability. We suspect that they are doing essentially the reverse of humans with regard to global and local aspects of shape. We refer to this as the local contour feature hypothesis. The top five responses given for the bear silhouette appear to support this conjecture. The silhouette is confused not with other quadrupedal mammals, but with manufactured objects like “warplane”, “stretcher”, and “studio couch”, suggesting that the network is using shape information, but at a local, not global, level. Locally, no parts of a bear’s contour are straight edges, so the network does not consider “black bear” a probable response, whereas an object like a warplane, though its global shape differs completely from the presented stimulus, has a similar set of local contour segments: straight edges and a few shorter rounded edges. Likewise, the network misclassifies the electric guitar as “shovel”, “hatchet”, or “assault rifle”. No human observer would make such errors, but it is easy to see how local curvature information might produce such responses if global form is disregarded.

The local contour feature hypothesis can also help clarify some positive results from Experiments 1–3. In Experiment 1, we found that deep networks use shape much more when classifying artifacts than when classifying animals. This was also observed in Experiment 4, where 13 of the 20 correctly identified objects were man-made. Artifacts will tend to be discriminable based on local features like curvature more often than animals, because they are functional and have different component parts depending on their intended purpose. The claw of a hammer, for example, is highly diagnostic in discriminating it from other shafted tools, as is the curve of a French horn or the long, thin tubes of a trombone. The local features of animals, on the other hand, tend to be very similar to one another, with some exceptions such as the horn of a ram or the legs of a flamingo. Two animals will tend to be more discriminable from each other based on global features, such as the ratio of neck length to body length, or of ear size to head size.

In Experiment 2, the only figurines correctly classified were the piano and the great white shark. The network’s successful classification of the piano could be attributable to recognition of local contour features. In particular, the curvature of the top board of a grand piano is highly regular and could be driving the network towards that classification. On the other hand, the curvature of the top board was visible in two of the five additional grand piano figurines we tested, and the network failed to classify these. For the great white shark, the local contour feature hypothesis fits well with the network’s pattern of response. “Warplane”, “hammerhead”, and “airliner” were all assigned higher probability than the correct object label for the shark figurine. These objects all have some local shape features in common. The fin and tail of a shark are fairly similar to the wings and tail of a plane, but the two objects are hardly confusable to humans. This is because humans group shape features into a unified whole, whereas deep networks appear to be influenced by local features, but much less so by their relations to the whole.

In Experiment 3, the beer bottle and coffee mug were classified as “wine bottle” and “cup”. These were scored as correct, since the global shapes of the objects are quite similar. Importantly, though, their local curvatures are also quite similar. In particular, the transition from the body to the neck of the beer bottle, and the handle of the coffee mug, could be important local shape cues driving the network’s good performance. This seems all the more likely when one considers that “sunglass” was the network’s top classification for the coffee mug. Sunglasses share little global shape similarity with a coffee mug, but the curvature of a sunglass frame is often locally quite similar to a cup handle. Likewise, for the tandem bike and trombone, which were not correctly classified but were assigned higher-than-chance probability, local contour features, such as thin straight lines for the trombone and constant curvature in the wheels for the bike, could be driving classification. We directly test this hypothesis in Experiment 5.

Experiment 5

In Experiments 2–4, only the silhouette condition provided some supportive evidence that DCNNs use shape. However, further analysis of the networks’ classification of individual items in Experiments 2–4 suggests that accurate classification decisions were likely based on local contour features, not global object shape. We tested this hypothesis in Experiment 5 by comparing the effects of changing local and global features on network classification performance.

Experiment 5a

In Experiment 5a, we explicitly tested the hypothesis that deep networks use local shape features, such as the curvature of contour segments, but not global shape, in their classification decisions. We selected object silhouettes that the network classified correctly and tested whether it could still classify them after changes to their global contour. Preserving most local curvatures, we scrambled the shapes so that the overall shape was radically changed. If the network were robust to these alterations, that would provide evidence that it uses local shape primitives as cues rather than the shape’s global contour. If the network utilizes global shape, these disruptions should impede accurate classification.

Experiment 5a Method.

Images. Six images were presented: two from the silhouette database that had been correctly classified, plus four new object silhouettes whose correct label appeared in the network’s top-five selections. Parts of each object were rearranged so that the global contour no longer matched the correct object label, but local edges were preserved. Fig 23 shows the stimulus objects before and after part-scrambling.

thumbnail
Fig 23. Stimuli used in Experiment 5a.

Top row: the original silhouette images, all correctly classified by VGG-19 (appearing in top-five). Bottom row: Scrambled images on which the network was tested.

https://doi.org/10.1371/journal.pcbi.1006613.g023
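The part-scrambling itself was done by hand, but its logic can be sketched in code. In the illustration below, `parts` is a hypothetical cell array of binary masks, one per object part, each smaller than the canvas; random rigid transforms stand in for our deliberate rearrangements. The key point is that a rigid transform relocates a part while leaving its own contour, and hence its local curvature, essentially intact.

```matlab
% Illustrative sketch of part-scrambling with random rigid transforms.
% 'parts' is a hypothetical cell array of binary part masks.
canvas = false(224, 224);
for k = 1:numel(parts)
    piece = imrotate(parts{k}, 360 * rand, 'nearest', 'loose');
    [h, w] = size(piece);
    r = randi(size(canvas, 1) - h + 1);         % random vertical placement
    c = randi(size(canvas, 2) - w + 1);         % random horizontal placement
    canvas(r:r+h-1, c:c+w-1) = canvas(r:r+h-1, c:c+w-1) | piece;
end
imwrite(uint8(~canvas) * 255, 'scrambled.png'); % black parts on white
```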

Network. The images were tested with VGG-19. We omitted simulations on AlexNet, since performance was reliably better in VGG-19 in Experiments 2–4.

Tests on human subjects. In addition to evaluating the DCNN’s performance on the six scrambled object silhouettes, we tested human subjects to see if they could recognize the objects after scrambling.

Participants. Ten human subjects (two male, eight female; mean age = 19.5 years) were recruited from the University of California, Los Angeles, and participated in the study for course credit.

Design and procedure. Subjects completed two experiments. In the first, on each trial subjects were shown one of 30 silhouettes (24 different unscrambled objects and the six part-scrambled objects) for one second. After each silhouette was presented, subjects wrote down on a piece of paper what they had been shown, after which they were free to continue to the next trial.

The second experiment was identical to the first, except that subjects’ exposure duration to each silhouette image was no longer limited to one second. Subjects could view the silhouette for as much time as they wanted before writing down what they believed the object to be on a piece of paper. Once they had recorded their response, they were free to continue to the next trial. As in the first experiment, the second experiment ended when all 30 silhouettes had been presented.

Experiment 5a results.

The top five responses and associated probabilities for part-scrambled objects are shown in Fig 24. Five of the six scrambled objects were “correctly” classified within the network’s top-five selections. The one correct object label that did not fall in the top five was “shirt”, but “sweatshirt” did, and the misclassifications made by the network (“suit”, “bulletproof vest”, and “sweatshirt”, for example) are clearly influenced by similar local edge features.

thumbnail
Fig 24. VGG-19 classifications for part-scrambled silhouettes.

The leftmost column shows the image presented to the DCNN. The second column shows the correct object label and the classification probability produced by the network for that label. The other five columns show probabilities for the network’s top five classifications, ordered left to right from highest to lowest. Correct classifications are shaded in blue.

https://doi.org/10.1371/journal.pcbi.1006613.g024

We also compared the probability associated with the correct response for the unscrambled silhouettes with that for the scrambled silhouettes. Results are shown in Fig 25. On average, the correct response was given a probability 2.26 times higher for unscrambled images than for scrambled images.
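For readers wishing to reproduce this comparison, a minimal sketch follows; `p_orig` and `p_scr` are assumed to be 6-by-1 vectors holding the softmax probability of the correct label for the original and part-scrambled silhouettes, and the mean of item-wise ratios is one natural reading of the figure reported above.

```matlab
% Hypothetical comparison of correct-label probabilities.
ratio = mean(p_orig ./ p_scr);   % mean item-wise unscrambled/scrambled ratio
bar([p_orig, p_scr]);            % one pair of bars per object
set(gca, 'YScale', 'log');       % log axis makes small values visible (Fig 25)
legend('original', 'part-scrambled');
```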

thumbnail
Fig 25. VGG-19 performance for unscrambled and part-scrambled images.

Bars show probabilities for correct responses for each of the objects. Probability is plotted on a logarithmic scale to make small values visible.

https://doi.org/10.1371/journal.pcbi.1006613.g025

Human object recognition. Human subjects correctly identified 23.33% of the part-scrambled objects and 96.67% of the unscrambled objects when viewing time was restricted to one second. When viewing time was unrestricted, subjects correctly identified 36.67% of part-scrambled objects and 94.8% of unscrambled objects. Table 1 shows subject accuracy for each of the individual part-scrambled objects.

thumbnail
Table 1. Human observers’ performance on individual items for part-scrambled objects.

https://doi.org/10.1371/journal.pcbi.1006613.t001

Experiment 5b

Experiment 5a tested deep networks’ ability to classify objects whose global shape was destroyed, with local curvature largely preserved. We hypothesized that if DCNNs did not classify based on global shape, performance would remain good, provided that enough local contour features were preserved in the scrambled objects. In Experiment 5b, we conducted a complementary study to test the hypothesis that if local contour features are disrupted, but global shape is preserved, network classification accuracy will suffer.

Experiment 5b Method.

Images. We used the same six silhouette images as in Exp. 5a; VGG-19 correctly classified all six in their original form. We disrupted the local contour features in each image by adding jagged edges to the bounding contour, creating a sawtooth effect. Fig 26 shows the six original images and the sawtooth images.

thumbnail
Fig 26. Stimuli used in Experiment 5b.

Top row: the original silhouette images, all correctly classified by the network. Bottom row: images with local contour features disrupted.

https://doi.org/10.1371/journal.pcbi.1006613.g026
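A sketch of the serration manipulation is given below, under the assumption that the figure is available as a binary mask. Boundary samples are displaced alternately outward and inward along the local contour normal, which perturbs local edges while the overall course of the contour, and thus the global shape, is preserved; the spacing and tooth-height parameters are hypothetical.

```matlab
% Sketch: add a sawtooth perturbation to a silhouette's bounding contour.
B = bwboundaries(figureMask);                  % trace bounding contour(s)
pts = B{1}(1:8:end, :);                        % subsample boundary every 8 px
amp = 5;                                       % tooth height in pixels
n = size(pts, 1);
tang = circshift(pts, -1) - circshift(pts, 1); % local tangent direction
nrm = [tang(:, 2), -tang(:, 1)];               % rotate 90 deg to get normal
nrm = nrm ./ max(sqrt(sum(nrm .^ 2, 2)), eps); % normalize to unit length
alt = repmat([1; -1], ceil(n / 2), 1);         % alternate out/in displacement
jag = pts + amp * alt(1:n) .* nrm;
serrated = poly2mask(jag(:, 2), jag(:, 1), ... % refill perturbed polygon
    size(figureMask, 1), size(figureMask, 2));
imwrite(uint8(~serrated) * 255, 'serrated.png');
```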

Network. The perturbed contour images were tested on VGG-19.

Test on human subjects. As in Experiment 5a, we tested humans’ ability to recognize images with global shape preserved and local contour information disrupted.

Participants. Ten participants (four male, six female; mean age = 20.0 years) were recruited from the University of California, Los Angeles, and participated in the study for course credit. None of the subjects from Exp. 5a participated in this experiment.

Design and procedure. The design for the study was identical to Experiment 5a in all respects except for the six disrupted stimuli that were presented. On those trials, subjects were shown the locally disrupted sawtooth images instead of the globally disrupted scrambled images.

Experiment 5b results.

The network’s top-five classification selections for each of the locally disrupted images are shown in Fig 27. None of the images were correctly classified, either as the first choice or anywhere among the network’s top-five choices, and the average rank of the correct label for the perturbed-contour images was 96.3.

thumbnail
Fig 27. VGG-19 classifications for serrated edge silhouettes.

The leftmost column shows the image presented to the DCNN. The second column shows the correct object label and the classification probability produced by the network for that label. The other five columns show probabilities for the network’s top five classifications, ordered left to right from highest to lowest. Correct classifications are shaded in blue.

https://doi.org/10.1371/journal.pcbi.1006613.g027

The network performed much worse on images where local contour was disrupted but global shape preserved than on either unperturbed or globally scrambled images. The average probability assigned to the correct label was 58.73 times higher for images with unperturbed local contours than for images with perturbed contours, and 26.0 times higher for globally scrambled images than for images with perturbed contours. Fig 28 shows a comparison of these three conditions.

thumbnail
Fig 28. Comparison of VGG-19 performance for locally perturbed contours with unscrambled and part-scrambled images.

Bars show probabilities for correct responses for each of the objects. Probability is plotted on a logarithmic scale to make small values visible.

https://doi.org/10.1371/journal.pcbi.1006613.g028

Human object recognition. Human participants recognized objects very accurately when local contours were perturbed. When viewing time was limited to one second, subjects’ recognition accuracy on locally disrupted contour silhouettes was 91.67%. With unlimited viewing time, subjects’ accuracy was 96.67%. Subjects’ accuracy on individual items is shown in Table 2. These results did not significantly differ from subjects’ performance on unperturbed objects, either for the brief presentation condition, t(18) = 0.98, p = .341, or for the unlimited presentation condition, t(18) = 0.82, p = .425.

thumbnail
Table 2. Human observers’ performance on individual items for perturbed-contour objects.

https://doi.org/10.1371/journal.pcbi.1006613.t002

Experiment 5 Discussion.

The results of Experiments 5a and 5b suggest robust and interesting differences between humans and DCNNs in the use of shape information. Humans mostly failed to recognize the objects presented in Experiment 5a. For one or two of the objects, they saw some recognizable parts (e.g., for the scrambled violin, one subject said, “I see some pieces of a broken guitar”). Even where some parts could be recognized, every human observer noted that the scrambled objects were not actual instances of the label that would be correct for the unscrambled versions. In contrast, when human subjects were shown objects whose global shape was preserved but whose local contour features were changed, they performed extremely well on the recognition task, both with short exposures and with unlimited viewing time.

The DCNN results showed the reverse pattern. For VGG-19, classification of objects whose global shape had been destroyed by part scrambling remained very strong. However, in the local contour disruption condition, the network’s classification accuracy fell off dramatically. Despite preservation of global shape that allowed essentially perfect human object classification for both brief exposures and unlimited viewing, VGG-19 did not place the correct label for any object in Exp. 5b among its top five choices. The influence of local contour information is clearly visible in many of the network’s top-five selections from this simulation. Animals like poodles, curly-coated retrievers, and Scotch terriers are often selected despite no similarity in global shape, likely because the local features resemble those of a curly-furred animal. These results suggest that the network’s processing of shape is restricted to relatively local features along an object’s bounding contour, not the shape as a whole. They also represent a clear difference between human and deep network recognition processes. Whereas local contour features seem to play a key role in deep network classification, global shape has primacy in human perception.

These results, along with those of the earlier experiments, push us to think about what is really meant by shape. The convolution operations that form the groundwork for DCNNs are certainly capable of responding to local oriented contrast, as such filters have long been used to model orientation sensitivity [35]. For a shape feature such as the “claw” part of the hammer image, it would be sufficient for the network to encode a few orientations in proximity and in a certain spatial relation. These are, undoubtedly, aspects of “shape”, and our results suggest that they are accessible within the trained AlexNet and VGG-19 networks. There are, however, other, more global notions of shape. Larger relations of parts, and overall characteristics such as aspect ratio, may be generally inaccessible. The network was remarkably undeterred by serious scrambling of parts, despite the fact that this scrambling undoubtedly also disrupted some local features of the sort we have noted here.
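The point about convolution and local oriented contrast can be illustrated directly. In the toy sketch below, a single Gabor filter tuned to 45° responds wherever the contour passes through that orientation, regardless of where the responding segment sits in the object’s global shape; the input file name is hypothetical and the wavelength is an arbitrary choice.

```matlab
% Toy illustration: a convolutional filter signals local oriented contrast.
I = im2single(imread('silhouette.png'));       % hypothetical input image
if size(I, 3) == 3, I = rgb2gray(I); end
g = gabor(4, 45);                              % wavelength 4 px, 45 degrees
resp = imgaborfilt(I, g);                      % magnitude of oriented response
imshow(mat2gray(resp));                        % bright where 45-deg edges occur
```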

The issue of local shape features vs. global shape descriptions bears an important relation to questions of subsymbolic and symbolic representations in visual perception [36]. A number of phenomena indicate that human vision produces abstract shape representations that capture the gist of objects and support similarity relationships, despite local noise and variations in local elements [40]. Even the perception of a clear object edge or the notion of a continuous contour is an abstraction from earlier inputs of local oriented contrast at various locations and at different spatial frequencies in the same location (for discussion, see [41]). Conversely, measuring oriented contrast at two nearby locations and using some conjunction of orientations in nearby positions can be done without conversion to a symbolic representation. These are the beginnings of shape descriptions, or perhaps of local shape descriptions. Deep networks may be accessing shape features that are conjunctions of local orientations and their relations; indeed, such representations may lie somewhere in the transition between subsymbolic and symbolic representations of an object’s shape (see [42] for a related proposal regarding the representation of contour shape). Taken together with the results of Experiments 1–4, the findings from Experiment 5 are consistent with the idea that DCNNs access aspects of local edge and curvature information, but not a representation of how local parts spatially relate to each other as a whole.

Discussion

The purpose of this work was to determine the extent to which shape information is represented and used for recognition in trained deep convolutional networks. We tested deep networks trained on ImageNet to determine if shape information is relevant at all to DCNN object recognition performance, and if it is, how shape cues are weighted compared to other information, such as texture and context.

In Experiment 1, we tested VGG-19 for sensitivity to shape apart from appropriate texture and surface information, showing the network 40 object silhouettes with a different object’s surface texture overlaid on each shape. Evidence for use of shape information was weak: the correct label based on shape was chosen as the first-choice classification for only 5 of the 40 objects sampled, and the average rank among network outputs for the correct shape was 209. Where evidence of shape influences on classification did appear, however, it seemed to depend highly on the kind of object being classified. While classification of artifacts and rigid objects appeared to depend on shape more than on texture cues, the opposite was true for images of animals. Even for artifacts, for which the network weights shape more strongly, the absence of typical surface texture greatly reduced the network’s classification accuracy, resulting in many spurious classifications. The importance of texture information in object recognition was also seen in Experiment 2, where glass objects with no shape similarity to the presented stimuli were selected preferentially over the object whose shape matched the glass figurine, and in Experiment 4, where performance changed dramatically depending on whether a silhouette was black, white, or red. These results differ greatly from what is observed in research on human vision. Several studies (e.g., [43, 44, 28–33]) have found that texture plays little role in facilitating human object recognition.

In Experiments 2 and 3, we presented two networks with images whose shape matched object categories that the network had been trained to recognize, while differing in terms of context and surface texture. Deep networks were in general unable to classify glass figurines or object outlines in the absence of other cues that are typically present in natural images. Across both datasets, only four of 80 presented images were correctly classified, and analyses of the probabilities assigned to each object label revealed that the network was assigning near-minimum probabilities to the correct objects in all but a few cases.

A possible explanation for the network’s poor performance in Experiment 3 is that the outline images are flat, lacking the volumetric cues present in photographs of real objects. If the network depends on volumetric cues to recognize objects, the flatness of the outline images might prevent it from matching these images to its trained categories. This is an intriguing hypothesis, but the network was no better with glass figurines than with object outlines, even though figurines have 3D structure quite similar to that of the objects the network is trained to recognize. Moreover, the network did comparatively well in Experiment 4 at classifying object silhouettes, which also lack any volumetric properties.

Experiment 4 found some evidence for use of shape information in object recognition by deep networks. Networks were presented with object silhouettes and correctly classified 20 of the 40 images (based on the top-five criterion), as well as assigning non-minimum probabilities to several others. One explanation for the networks’ superior performance for silhouettes is that these displays more fully eliminate competing surface texture information than do objects with glass surfaces, or objects whose texture exactly matched the background. They also remove orientation information that is not part of an object's bounding contour, thus preserving the orientation information most likely to be diagnostic of an object's shape. This selective preservation of orientation is probably especially helpful, as it seems unlikely that DCNNs trained for object classification have any differential representation of a bounding contour vs. any other contour.

Classification of object silhouettes would not be possible without some edge-based recognition capabilities, but it is unlikely that deep networks accessed global form for the task. Instead, the network appears to have extracted contour segments and local features based on relations of proximate contour orientations. Indirect evidence supporting this explanation appeared in some of the network’s misclassifications. For example, a great white shark figurine had “warplane” and “airliner” in its top-five labels, likely due to the local similarity between a shark’s fins and a plane’s wings. Likewise, a porcupine was misclassified as a bird, probably because its spines had features in common with feathers, although globally birds and porcupines differ greatly.

Experiment 5 sought direct evidence for the local contour feature hypothesis. In Experiment 5a, we used silhouette images, including some of the best-performing ones from Experiment 4, for which the networks produced accurate classification outcomes. These images were then scrambled in such a way that the curvature of local edge segments was preserved, but the shape as a whole was radically altered. Networks performed nearly as well on these scrambled objects as they did on the original images. In Experiment 5b, on the other hand, when global shape was preserved but local contour features were changed by adding jagged edges, network classification became extremely poor. These results suggest that the shape features that figure in deep network recognition involve curvature or orientation relations of highly local parts of the contour. The network recognizes the object by the mere presence of these features, not by how they globally relate to each other in space. The convolution operations at the earliest levels of DCNNs are well suited to extracting local oriented contrast, and conjunctions of nearby orientations could serve as a serviceable marker for local curvature and other local shape features. This is why rearranging existing parts of a silhouette has little effect on network performance (local feature extraction is unchanged), while the addition of a serrated border to the silhouette destroys network performance. The opposite is true in human perception.

In human vision, the transition from subsymbolic to symbolic processing depends critically on how encoded features fit together into a unified whole [36, 45]. Identifying the curvature of contour segments may be an important first step in this transition, but it cannot account for the robust capability of our visual system to classify objects based on their shape. For example, Attneave [39] showed that all local curvature segments in a cat outline can be straightened without detriment to its recognizability. We in fact tested one image, a bear, that had straightened segments much like Attneave’s cat, in Experiments 1 and 4. The probabilities assigned by the network were less than chance both with inconsistent texture in Exp. 1 (.04%) and in silhouette form in Exp. 4 (.03%). Consistent with these ideas, the bear image in both cases was unequivocally seen as a bear by human observers. This and many other phenomena of human perception suggest that relationships between contour segments and object parts are central to human object recognition and shape perception, more so than the encoding of local features.

In the introduction, we mentioned three levels of questions that might be asked regarding DCNNs and the use of shape information. The present results apply to the first level: understanding the role of shape in DCNNs trained to do object recognition. We believe that our results provide a good picture of the use of shape information by trained DCNNs, specifically suggesting a profound difference between the accessibility of local shape features and more global aspects of shape. The results strongly suggest that such models do not classify based on an object’s global shape. These findings regarding the shape sensitivity of trained networks are important for comparisons of human perception with DCNNs, which in other respects have been argued to mirror the human visual brain and human behavior in a remarkable number of ways.

We noted earlier a second level of question: whether DCNNs can be trained from scratch to classify based on global shape information. Our present results clearly suggest that standard training on the ImageNet database does not produce such capabilities, but they leave open the question of whether deep networks are in principle incapable of classification based on global shape features. The photographs in ImageNet tend to show a single object in the foreground of the image. It is possible that training conditions in which the object is displayed in a more complete scene would produce better shape sensitivity, because the network would learn to extract the spatial relationships between objects. Another way the network might be trained to encode global shape would be to include training examples with nondiagnostic texture and local shape properties. The network might have too many other rich cues for classification to develop global shape sensitivity under standard training conditions, but perhaps it could encode global shape if deprived of some of these other information streams. Our suspicion is that even under training conditions specifically selected to develop global shape representations, DCNNs will not be able to classify based on global shape alone, but these hypotheses can only be confirmed through experimentation on untrained neural networks.

Conclusion

In human vision, abstract representations of shape describe how the various parts of an object spatially relate to one another. They are critical for recognition across a variety of viewing conditions and are robust to perturbations of local contour features. Deep networks have impressive capabilities for object recognition, but they do not appear to handle the problem of recognition the way humans do. Unlike in human vision, surface texture appears to be as strong a cue for recognition as shape. Moreover, the shape information used by deep networks is highly limited: DCNNs appear to be capable of encoding local shape features, including local edge segments and their relations, but sensitivity to how these local features fit together as a whole is lacking. DCNNs trained for object recognition do not appear to represent global shape at all.

Methods

Ethics statement

All research on human subjects in this and subsequent experiments was under IRB approval (IRB#11-002079-CR-00001).

Network

Tests were conducted on VGG-19 [27] and AlexNet [4]. For most experiments, the results of the better-performing VGG-19 are reported. Both networks were trained on the ImageNet database prior to testing. Tests were conducted using the Neural Network Toolbox in MATLAB R2017b.
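As a rough guide for replication, the sketch below shows the basic test procedure as it would look with the pretrained VGG-19 from the toolbox’s support package; the image file name is hypothetical, and the top-five report mirrors the scoring criterion used throughout.

```matlab
% Sketch of the basic test procedure on a pretrained network.
net = vgg19;                                    % ImageNet-trained VGG-19
sz = net.Layers(1).InputSize;                   % expected input, [224 224 3]
I = imread('test_image.png');                   % hypothetical test image
I = imresize(I, sz(1:2));                       % match the input layer
if size(I, 3) == 1, I = repmat(I, [1 1 3]); end
[label, scores] = classify(net, I);             % softmax over 1000 classes
[p, idx] = sort(scores, 'descend');             % rank classes by probability
names = net.Layers(end).ClassNames;             % labels from the output layer
for k = 1:5                                     % report top-five selections
    fprintf('%-30s %.4f\n', names{idx(k)}, p(k));
end
```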

Materials

Testing images are described in the Methods description for individual experiments. In all experiments, the correct classification for the testing images was among the 1000 object categories that the networks had been trained to classify.

Supporting information

S1 File. Results from outlines sampled from Exp. 4 silhouettes.

The leftmost column shows the image presented to VGG-19. The second column from the left shows the correct object label and the classification probability produced for that label. The other five columns show probabilities for the network’s top five classifications, ordered left to right in terms of the probability given by the network. Correct classifications are shaded in blue.

https://doi.org/10.1371/journal.pcbi.1006613.s001

(DOCX)

References

  1. Yampolskiy R. Turing test as a defining feature of AI-completeness. Artificial Intelligence, Evolutionary Computing and Metaheuristics. 2013:3–17.
  2. Turing AM. Computing machinery and intelligence. Mind. 1950 Oct 1;59(236):433–60.
  3. Geman D, Geman S, Hallonquist N, Younes L. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences. 2015 Mar 24;112(12):3618–23.
  4. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, 2012 (pp. 1097–1105).
  5. Gunji N, Higuchi T, Yasumoto K, Muraoka H, Ushiku Y, Harada T, Kuniyoshi Y. Scalable multiclass object categorization with Fisher based features. ILSVRC 2012, The Univ. of Tokyo.
  6. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998 Nov;86(11):2278–324.
  7. Smith AT, Singh KD, Williams AL, Greenlee MW. Estimating receptive field size from fMRI data in human striate and extrastriate visual cortex. Cerebral Cortex. 2001 Dec 1;11(12):1182–90. pmid:11709489
  8. Pasupathy A, Connor CE. Shape representation in area V4: position-specific tuning for boundary conformation. Journal of Neurophysiology. 2001 Nov 1;86(5):2505–19. pmid:11698538
  9. Pospisil D, Pasupathy A, Bair W. Comparing the brain's representation of shape to that of a deep convolutional neural network. In: Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), 2016 May 24 (pp. 516–523). ICST.
  10. Yamins DL, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences. 2014 Jun 10;111(23):8619–24.
  11. Khaligh-Razavi SM, Kriegeskorte N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology. 2014 Nov 6;10(11):e1003915. pmid:25375136
  12. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009 Jun 20 (pp. 248–255). IEEE.
  13. Dubey R, Peterson J, Khosla A, Yang MH, Ghanem B. What makes an object memorable? In: Proceedings of the IEEE International Conference on Computer Vision, 2015 (pp. 1089–1097).
  14. Peterson JC, Abbott JT, Griffiths TL. Adapting deep network features to capture psychological representations. arXiv preprint arXiv:1608.02164. 2016 Aug 6.
  15. Palmer SE. Vision Science: Photons to Phenomenology. MIT Press; 1999 Apr 14.
  16. Biederman I. Recognition-by-components: a theory of human image understanding. Psychological Review. 1987 Apr;94(2):115. pmid:3575582
  17. Marr D, Nishihara HK. Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London B: Biological Sciences. 1978 Feb 23;200(1140):269–94. pmid:24223
  18. Kubilius J, Bracci S, de Beeck HP. Deep neural networks as a computational model for human shape sensitivity. PLoS Computational Biology. 2016 Apr 28;12(4):e1004896. pmid:27124699
  19. Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, Fergus R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. 2013 Dec 21.
  20. Nguyen A, Yosinski J, Clune J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015 (pp. 427–436).
  21. Zhu Z, Xie L, Yuille AL. Object recognition with and without objects. arXiv preprint arXiv:1611.06596. 2016 Nov 20.
  22. Cichy RM, Khosla A, Pantazis D, Torralba A, Oliva A. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports. 2016 Jun 10;6:27755. pmid:27282108
  23. Cadieu CF, Hong H, Yamins DL, Pinto N, Ardila D, Solomon EA, Majaj NJ, DiCarlo JJ. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology. 2014 Dec 18;10(12):e1003963. pmid:25521294
  24. Güçlü U, van Gerven MA. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience. 2015 Jul 8;35(27):10005–14. pmid:26157000
  25. Kümmerer M, Theis L, Bethge M. Deep Gaze I: Boosting saliency prediction with feature maps trained on ImageNet. arXiv preprint arXiv:1411.1045. 2014 Nov 4.
  26. Ithapu VK. Decoding the deep: Exploring class hierarchies of deep representations using multiresolution matrix factorization. In: CVPR Workshop on Explainable Computer Vision and Job Candidate Screening Competition, 2017 Jul 1 (Vol. 2).
  27. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014 Sep 4.
  28. Biederman I, Ju G. Surface versus edge-based determinants of visual recognition. Cognitive Psychology. 1988 Jan 31;20(1):38–64. pmid:3338267
  29. Davidoff JB, Ostergaard AL. The role of colour in categorial judgements. The Quarterly Journal of Experimental Psychology Section A. 1988 Aug;40(3):533–44.
  30. Elder JH, Velisavljević L. Cue dynamics underlying rapid detection of animals in natural scenes. Journal of Vision. 2009 Jul 1;9(7):7. pmid:19761322
  31. Bergevin R, Levine MD. Generic object recognition: Building and matching coarse descriptions from line drawings. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1993 Jan;15(1):19–36.
  32. Lloyd-Jones TJ, Luckhurst L. Outline shape is a mediator of object recognition that is particularly important for living things. Memory & Cognition. 2002 Jun 1;30(4):489–98.
  33. Kellman PJ, Shipley TF. A theory of visual interpolation in object perception. Cognitive Psychology. 1991 Apr 30;23(2):141–221. pmid:2055000
  34. Hochberg J, Brooks V. Pictorial recognition as an unlearned ability: A study of one child's performance. The American Journal of Psychology. 1962 Dec 1;75(4):624–8.
  35. Marr D. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. 1982.
  36. Koffka K. Principles of Gestalt Psychology. International Library of Psychology, Philosophy and Scientific Method; 1935.
  37. Snodgrass JG, Vanderwart M. A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. Journal of Experimental Psychology: Human Learning and Memory. 1980 Mar;6(2):174.
  38. Kellman PJ, Garrigan P, Erlikhman G. Challenges in understanding visual shape perception and representation: Bridging subsymbolic and symbolic coding. In: Shape Perception in Human and Computer Vision, 2013 (pp. 249–274). Springer, London.
  39. Attneave F. Some informational aspects of visual perception. Psychological Review. 1954 May;61(3):183. pmid:13167245
  40. Baker N, Kellman PJ. Abstract shape representation in human visual perception. Journal of Experimental Psychology: General. 2018 Sep;147(9):1295.
  41. Kellman PJ, Massey CM. Perceptual learning, cognition, and expertise. The Psychology of Learning and Motivation. 2013 Jan 1;58:117–65.
  42. Kellman PJ, Garrigan P. Segmentation, grouping, and shape: some Hochbergian questions. In: Peterson MA, editor. Perception: Essays in Honor of Julian Hochberg. New York: Oxford University Press; 2006 (pp. 542–54).
  43. Tanaka JW, Presnell LM. Color diagnosticity in object recognition. Attention, Perception, & Psychophysics. 1999 Aug 1;61(6):1140–53.
  44. Rossion B, Pourtois G. Revisiting Snodgrass and Vanderwart's object pictorial set: The role of surface detail in basic-level object recognition. Perception. 2004 Feb;33(2):217–36. pmid:15109163
  45. Wertheimer M. Laws of organization in perceptual forms. A Source Book of Gestalt Psychology. 1923.