SCIENCE ADVANCES | RESEARCH ARTICLE

COGNITIVE NEUROSCIENCE

Emerged human-like facial expression representation in a deep convolutional neural network

Liqin Zhou1, Anmin Yang1, Ming Meng2,3*, Ke Zhou1*

1Beijing Key Laboratory of Applied Experimental Psychology, Faculty of Psychology, Beijing Normal University, Beijing 100875, China. 2Philosophy and Social Science Laboratory of Reading and Development in Children and Adolescents (South China Normal University), Ministry of Education, Guangzhou 510631, China. 3Guangdong Key Laboratory of Mental Health and Cognitive Science, School of Psychology, South China Normal University, Guangzhou 510631, China.
*Corresponding author. Email: mingmeng@m.scnu.edu.cn (M.M.); kzhou@bnu.edu.cn (K.Z.)

Copyright © 2022 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

Recent studies found that deep convolutional neural networks (DCNNs) trained to recognize facial identities spontaneously learned features that support facial expression recognition, and vice versa. Here, we showed that the self-emerged expression-selective units in a VGG-Face trained for facial identification were tuned to distinct basic expressions and, importantly, exhibited hallmarks of human expression recognition (i.e., facial expression confusion and categorical perception). We then investigated whether the emergence of expression-selective units is attributable to face-specific experience or to domain-general processing by conducting the same analysis on a VGG-16 trained for object classification and on an untrained VGG-Face without any visual experience, both having the same architecture as the pretrained VGG-Face. Although similar expression-selective units were found in both DCNNs, they did not exhibit reliable human-like characteristics of facial expression perception. Together, these findings reveal the necessity of domain-specific visual experience of face identity for the development of facial expression perception, highlighting the contribution of nurture to forming human-like facial expression perception.

INTRODUCTION

Facial identity and expression play important roles in daily life and social communication. When interacting with others, we can easily recognize who they are from their facial identity information and access their emotions from their facial expressions. An influential early model proposed that face identity and expression are processed separately via parallel pathways (1, 2), and the configural information for encoding face identity and expression differs (3). Findings from several neuropsychological studies supported this view. Patients with impaired facial expression recognition still retained the ability to recognize famous faces (4, 5), whereas patients with prosopagnosia (an inability to recognize the identity of others from their faces) could still recognize facial expressions (4–6). Haxby et al. (7) further proposed a distributed neural system for face perception, which emphasized a distinction between the representation of invariant aspects (e.g., identity) and changeable aspects (e.g., expression) of faces. According to this model, in the core system, the lateral inferior occipitotemporal cortex [i.e., the fusiform face area (FFA) and occipital face area (OFA)] and the superior temporal sulcus (STS) may contribute to the recognition of facial identity and expression, respectively (8, 9). Patients with OFA/FFA damage have deficits in face identity recognition, and those with damage to the posterior STS (pSTS) suffer impairments in expression recognition (10).

On the other hand, the processing mechanisms of the human visual system for facial identity and expression recognition normally share face stimuli as inputs. That is, naturally, a face contains both identity and expression information. Early visual processing of the same face stimuli would be the same for both identity and expression recognition, but it is unclear at what stage the two processes may start to split.
Amid increasing evidence suggesting an interdependence or interaction between face identity and expression processing (11–15), we hypothesized that any computational model that simulates human performance in facial identity and expression recognition must share common inputs for training. Moreover, if domain-specific face input is necessary to train a computational model that simulates human performance in facial identity and expression recognition, this would suggest that the split of identity and expression processing occurs after the domain-general visual processing stages. However, if no training, or no domain-specific training on face inputs, were needed for such a model, this would suggest a dissociation between identity and expression processing at the domain-general stages of visual processing.

Deep convolutional neural networks (DCNNs) have achieved human-level performance in object recognition of natural images. Investigations combining DCNNs with cognitive neuroscience have further discovered similar functional properties between artificial and biological systems. For instance, there is a strong similarity between the hierarchy of DCNNs and that of the primate ventral visual pathway (16, 17). Research relevant to this study revealed a similarity of activation patterns between face identity–pretrained DCNNs and human FFA/OFA (18). Thus, DCNNs could be a useful model for simulating the processes of biological neural systems. More recently, several seminal studies found that DCNNs trained to recognize facial expression spontaneously developed facial identity recognition ability, and vice versa, suggesting that integrated representations of identity and expression may arise naturally within neural networks, as they do in humans (19, 20). However, a recent study found that face identity–selective units could spontaneously emerge in an untrained DCNN (21), which seemed to cast substantial doubt on the role of nurture in developing face perception and on the abovementioned speculation.

When adopting a computational approach to examine human cognitive function, success in classifying different expressions only establishes a weak equivalence between DCNNs and humans at the input-output behavior level of Marr's three-level framework; it does not necessarily mean that DCNNs and humans adopt similar representational mechanisms (i.e., algorithms) to achieve the same computational goal (22).
Therefore, to explore whether a common mechanism may be shared by artificial and biological intelligent systems, a much stronger equivalence should be tested by establishing additional relationships between models and humans, i.e., similarity in their algorithms (23). In the present study, we therefore borrowed cognitive approaches developed in human research to explore whether human-like facial expression recognition relies on face identity recognition, using VGG-Face, a typical DCNN pretrained for the face identity recognition task (hereafter referred to as the pretrained VGG-Face). The pretrained VGG-Face was chosen because of its relatively simple architecture and the evidence supporting the similarity of its face identity representations to those in the human ventral pathway (18). The training process of VGG-Face has already determined the units' selectivity for various features so as to optimize the network's face identity recognition performance. If the pretrained VGG-Face could simulate the interdependence between facial identity and expression in the human brain, then it should spontaneously generate expression-selective units, and these selective units should be able to predict the expressions of new face images. However, as mentioned above, an ability to correctly classify different expressions does not necessarily imply a human-like perception of expressions. Here, we introduced morphed expression continua to test whether these units perceived morphed expressions categorically in a human-like way.

Then, to answer the question of what the human-like expression perception depends on, we introduced two additional DCNNs. The first is VGG-16, a DCNN that has an almost identical architecture to the pretrained VGG-Face but was trained only for natural object classification. The other is an untrained VGG-Face, which has an identical architecture to the pretrained VGG-Face, but whose weights are randomly assigned with no training (hereafter referred to as the untrained VGG-Face). Comparisons among the three DCNNs would clarify whether the human-like expression perception relies on face (identity) recognition–specific experience, on general object recognition experience, or merely on the architecture of the network.

RESULTS

Expression-selective units spontaneously emerge in the pretrained VGG-Face

We first explored whether expression-selective units could spontaneously emerge in the pretrained VGG-Face. The pretrained VGG-Face was trained with more than 2 million face images to recognize 2622 identities (24). It consists of 13 convolutional (conv) layers and 3 fully connected (FC) layers (Fig. 1A). The first 13 convolutional layers form a feature extraction network that transforms images into a goal-directed high-level representation, and the following 3 FC layers form a classification network that classifies images by converting the high-level representation into classification probabilities (25).
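For concreteness, the sketch below illustrates one way to read out unit activations from the final convolutional layer (conv5-3) of a VGG-style network, as done in the analyses that follow. It uses a torchvision VGG-16 backbone as a stand-in for VGG-Face; the weight loading, input preprocessing, and use of pre- versus post-ReLU activations are assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: extract conv5-3 activations with a forward hook.
# Assumes a torchvision-style VGG-16 backbone; VGG-Face weights, if
# available locally, would be loaded in place of the random init below.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.vgg16()  # placeholder; load VGG-Face weights here if available
model.eval()

activations = {}

def hook(module, inputs, output):
    # Flatten the conv5-3 feature map: 512 x 14 x 14 = 100,352 units
    activations["conv5_3"] = output.detach().flatten(start_dim=1)

# features[28] is conv5-3, the 13th and last conv layer in torchvision's VGG-16
model.features[28].register_forward_hook(hook)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # preprocessing details are assumptions
    transforms.ToTensor(),
])

img = preprocess(Image.open("face.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(img)

print(activations["conv5_3"].shape)  # torch.Size([1, 100352])
```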
Since the final layer (conv5-3) of the feature extraction network carries the highest-level representation (26, 27) and has the largest receptive field among all convolutional layers, we tested the expression selectivity of each unit in this layer using stimulus set 1 to explore whether a DCNN could spontaneously generate facial expression–selective "neurons" (see Materials and Methods for details). Stimulus set 1 consisted of 104 different facial identities selected from the Karolinska Directed Emotional Faces (KDEF) (28) and NimStim (29) databases, and each identity has six basic expressions (i.e., anger, disgust, fear, happiness, sadness, and surprise) (30, 31). All 624 images in stimulus set 1 were presented to the pretrained VGG-Face, and their activations in the conv5-3 layer were extracted.

First, we conducted a two-way nonrepeated analysis of variance (ANOVA) with identity and expression as factors to detect units selective to facial expression (P ≤ 0.01) but not to face identity (P > 0.01). Units meeting these criteria were defined as expression-selective units. Of the total 100,352 units in the conv5-3 layer, 1259 (1.25%) were found to be expression selective. Then, for each expression-selective unit, its tuning value (32) for each expression category was calculated to measure whether, and to what extent, it preferred a specific expression. As shown in Fig. 1B, almost all units responded selectively to only one specific expression and exhibited a tuning effect. Last, to test whether the responses of these expression-selective units provide sufficient information for successful expression recognition, we performed principal components analysis (PCA) on the activations of these units to all images in stimulus set 1 and selected the first 600 principal components (PCs) to perform an expression classification task using support vector classification (SVC) with 104-fold cross-validation. The 600 PCs explained nearly 100% of the variance of the expression-selective features (fig. S1). We found that the classification accuracy of the expression-selective units (mean ± SE, 76.76 ± 1.59%) was much higher than the chance level (16.67%) and much higher than the classification accuracy for images with randomly shuffled expression labels (P = 1.8 × 10⁻³⁵, Mann-Whitney U test) (Fig. 1C). These results indicate that expression-selective units spontaneously emerged in the VGG-Face pretrained for face identity recognition, which echoes previous findings (19, 20).

Human-like expression confusion effect of the expression-selective units in the pretrained VGG-Face

To examine the reliability of the expression-selective units, we used the classification model trained on stimulus set 1 to predict the expressions of images selected from the Radboud Faces Database (RaFD) (33). The RaFD is an independent facial expression database including 67 face identities with different head and gaze directions; only the front-view expressions of each identity were used in the present study (i.e., stimulus set 2). The prediction accuracy for the expressions in stimulus set 2 was significantly higher than the chance level [accuracy = 67.91%; 95% confidence interval (CI), 63.18 to 72.39%, bootstrapped with 10,000 iterations] (Fig. 2A). We also varied the number of PCs from 50 to 600 to explore whether the number of PCs influenced the prediction performance. As shown in fig. S2A, the prediction accuracy remained relatively stable as the number of PCs changed.
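A minimal sketch of this selection-and-classification pipeline is given below, assuming `acts` is a (624, 100352) array of conv5-3 activations and `identity` / `expression` are the corresponding label arrays. The additive (no-interaction) ANOVA model follows from there being one image per identity-expression cell; the linear SVC kernel and the leave-one-identity-out reading of the 104-fold cross-validation are assumptions, not the authors' exact implementation.

```python
# Sketch: per-unit two-way nonrepeated ANOVA, then PCA + SVC classification.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GroupKFold

def select_expression_units(acts, identity, expression, alpha=0.01):
    """Two-way (identity x expression) ANOVA per unit, without interaction
    (one observation per cell); keep units with P <= alpha for expression
    but P > alpha for identity."""
    keep = []
    for u in range(acts.shape[1]):
        df = pd.DataFrame({"y": acts[:, u], "id": identity, "ex": expression})
        table = sm.stats.anova_lm(ols("y ~ C(id) + C(ex)", data=df).fit())
        p_id, p_ex = table["PR(>F)"]["C(id)"], table["PR(>F)"]["C(ex)"]
        if p_ex <= alpha and p_id > alpha:
            keep.append(u)
    return np.array(keep)

units = select_expression_units(acts, identity, expression)
feats = PCA(n_components=600).fit_transform(acts[:, units])

# 104-fold cross-validation, here read as leaving one identity out per fold
scores = cross_val_score(SVC(kernel="linear"), feats, expression,
                         groups=identity, cv=GroupKFold(n_splits=104))
print(scores.mean())
```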
These results indicated that the expression-selective units in the pretrained VGG-Face had reliable expression discriminability. Subsequently, to test whether the expression representation of these units was similar to that of humans, we presented the same face images of stimulus set 2 to both human participants (experiment 1, see Materials and Methods for details) and the pretrained VGG-Face and calculated the confusion matrices of facial expression recognition for each (Fig. 2, B and C). Although the mean classification accuracy of the human participants (73.47%) was significantly higher than that of the pretrained VGG-Face, the error patterns of the two confusion matrices were highly correlated (Kendall's τ = 0.48, P = 5.4 × 10⁻⁴). For instance, in both confusion matrices, fear and surprise tended to be confused with each other, disgust was frequently mistaken for anger, and anger was often mistaken for sadness.
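As a sketch, the error-pattern correlation reported above can be computed as follows; restricting the comparison to the off-diagonal (error) cells of the two confusion matrices is our assumption about the analysis, and the function name is illustrative.

```python
# Sketch: Kendall's tau between the error patterns of two 6x6 confusion matrices.
import numpy as np
from scipy.stats import kendalltau

def error_pattern_correlation(cm_dcnn, cm_human):
    cm_dcnn, cm_human = np.asarray(cm_dcnn), np.asarray(cm_human)
    # Compare only off-diagonal cells, i.e., the misclassifications (assumption)
    off_diag = ~np.eye(cm_dcnn.shape[0], dtype=bool)
    tau, p = kendalltau(cm_dcnn[off_diag], cm_human[off_diag])
    return tau, p
```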
Overall, the results suggested a similar expression confusion effect between the expression-selective units in the pretrained VGG-Face and humans.

Ecological validity of the expression selectivity that emerged in the pretrained VGG-Face

The facial expressions in stimulus set 1 and stimulus set 2 were collected from the same identities in a laboratory-controlled environment and thus had limited ecological validity. If the expression-selective units can recognize expressions, they should also be able to recognize real-life facial expressions with ecological validity. To verify this, we generated stimulus set 3 by selecting 4800 images with manually annotated expressions from the AffectNet database, a large real-world facial expression database (34). Each basic expression included 800 images. Note that, in stimulus set 3, the face identities across expressions are different.

Fig. 1. Expression-selective units emerged in the pretrained VGG-Face. (A) The architecture of the VGG-Face. An example face image (for demonstration purposes only) is shown. Photo credit: Liqin Zhou, Beijing Normal University. ReLU, rectification linear unit. (B) The tuning value map of the expression-selective units in the pretrained VGG-Face. (C) The expression classification performance of the expression-selective units. The black dashed line represents the chance level. Error bars indicate SE. ***P ≤ 0.001.

Fig. 2. Human-like expression confusion effect of the expression-selective units in the pretrained VGG-Face for stimulus set 2. (A) The expression discriminability of the expression-selective units that emerged in the pretrained VGG-Face. The black dashed line represents the chance level. (B) The confusion matrix of the expression-selective units in the pretrained VGG-Face for stimulus set 2. (C) Human confusion matrix for stimulus set 2.
Using the same SVC model trained on stimulus set 1, we found that the prediction accuracy for the expressions in stimulus set 3 was also significantly higher than the chance level (accuracy = 29.56%; 95% CI, 28.31 to 30.85%, bootstrapped with 10,000 iterations) (Fig. 3A). Similarly, we obtained the confusion matrices for both the human participants (experiment 2, see Materials and Methods for details) and the pretrained VGG-Face (Fig. 3, B and C). Again, the error patterns of the two confusion matrices were highly correlated (Kendall's τ = 0.27, P = 0.037), although the mean classification accuracy of the human participants (46.76%) was higher than that of the pretrained VGG-Face. The reliable human-like confusion effect in facial expression recognition suggested that the expression-selective units in the pretrained VGG-Face can recognize facial expressions the way humans do, even for real-life face images.

Fig. 3. Expression recognition of the expression-selective units in the pretrained VGG-Face, VGG-16, and untrained VGG-Face for stimulus set 3. (A) The expression discriminability of the expression-selective units in each DCNN. Expression classification of the expression-selective units in the pretrained VGG-Face is much better than in the VGG-16 and untrained VGG-Face. The black dashed line represents the chance level. (B) Human confusion matrix for stimulus set 3. (C to E) The confusion matrices of the expression-selective units in the pretrained VGG-Face (C), VGG-16 (D), and untrained VGG-Face (E) for stimulus set 3. (F) The goodness of fit (R²) of each fit type for each DCNN. Logistic regression fits better for the pretrained VGG-Face than for the other two DCNNs, whereas linear regression fits worst for the pretrained VGG-Face. Error bars indicate SE. **P ≤ 0.01. (G) The identification rates for the seven continua in the VGG-16 and the untrained VGG-Face, respectively. Black dots represent true identification rates. Blue solid lines indicate fits of the logistic function. HA, happiness; AN, anger; FE, fear; DI, disgust; and SA, sadness.
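The bootstrapped 95% CIs reported for stimulus sets 2 and 3 can be sketched as below, assuming `y_true` and `y_pred` are the true and predicted label arrays; resampling whole images with replacement is our assumption about the resampling unit.

```python
# Sketch: percentile bootstrap CI for classification accuracy (10,000 iterations).
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    accs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample images with replacement
        accs[i] = np.mean(y_true[idx] == y_pred[idx])
    # Point estimate plus the 2.5th and 97.5th percentiles of the bootstrap
    return np.mean(y_true == y_pred), np.percentile(accs, [2.5, 97.5])
```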
The expression-selective units in the pretrained VGG-Face showed human-like categorical perception for morphed facial expressions

One may argue that a similarity in the expression confusion effect does not necessarily mean that the expression-selective units perceive expressions in a human-like way; it might instead result from similarities in the physical properties of the expression images, since image-based PCA (i.e., PCs based on pixel intensities and shapes) can also yield a confusion matrix similar to that of humans (35). Therefore, to further confirm whether these units exhibit a human-like psychophysical response to facial expressions, we tested whether their responses showed categorical perception of facial expressions by using morphed expression continua. Considering the generality of categorical emotion perception in humans, we systematically tested the categorical effect in seven expression continua: happiness-anger, happiness-fear, anger-disgust, happiness-sadness, anger-fear, disgust-fear, and disgust-sadness. All of them have been tested in humans (36–40).

In detail, we designed a morphed expression discrimination task (Fig. 4A) that resembled the ABX discrimination task designed for humans (36, 39, 40). The prototypic expressions were selected from stimulus set 1. For each expression continuum, images of the two prototypic expressions were used to train an SVC model, and the trained SVC model was then applied to identify the expressions of the morphed images. At each morph level of the continuum, the identification frequency of one of the two expressions was defined as the units' identification rate at that morph level. We hypothesized that if the selective units perceived expressions like humans do, i.e., showing a categorical effect, then the identification curves should be S-shaped. As predicted, for all continua, the identification curves of the expression-selective units in the pretrained VGG-Face were S-shaped (Fig. 4B). To quantify this effect, we fitted linear, quadratic (Poly2), and logistic functions to each identification curve. If the units exhibited a human-like categorical effect, the goodness of fit (R²) of the logistic function to the curves should be the best; otherwise, if the units' responses simply followed the physical changes in the images, the goodness of fit of the linear function should be the best. As illustrated in Fig. 4 (C and D), we found that all seven identification curves showed typical S-like patterns (logistic versus linear: P = 0.002; logistic versus Poly2: P = 0.002, Mann-Whitney U test).

The human-like expression perception only spontaneously emerged in the DCNN with domain-specific experience (pretrained VGG-Face), but not in those with domain-general visual experience (VGG-16) or without any visual experience (untrained VGG-Face)

Thus far, we have demonstrated that human-like perception of expression can spontaneously emerge in a DCNN pretrained for face identity recognition. However, how did these expression-selective units achieve human-like expression perception? Specifically, it was still unknown whether the spontaneous emergence of the human-like

Fig. 4. Categorical perception of facial expressions of the expression-selective units in the pretrained VGG-Face. (A) Example facial stimuli used in a morph continuum (happiness-anger). An example face image (for demonstration purposes only) is shown.
Photo credit: Liqin Zhou, Beijing Normal University. (B) The identification rates for the seven continua. The identification rates refer to the identification frequency of one of the two expressions. Labels along the x axis indicate the percentage of this expression in the facial stimuli. Black dots represent true identification rates. Blue solid lines indicate fits of the logistic function. (C) Goodness of fit (R²) of each regression type for each expression continuum. The black dashed lines represent R² at 0.95 and 1.00, separately. (D) Mean goodness of fit (R²) among expression continua. The R² of the logistic regression was much higher than that of the other two regressions. Error bars indicate SE. *P ≤ 0.05 and **P ≤ 0.01.
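A minimal sketch of the curve-fitting comparison described above follows, assuming an 11-point morph continuum and a hypothetical identification-rate vector; the logistic parameterization, the number of morph levels, and the starting values are assumptions.

```python
# Sketch: fit linear, quadratic (Poly2), and logistic functions to an
# identification curve and compare goodness of fit (R^2).
import numpy as np
from scipy.optimize import curve_fit

def r_squared(y, y_hat):
    """Coefficient of determination for a fitted curve."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def logistic(x, upper, slope, midpoint):
    return upper / (1 + np.exp(-slope * (x - midpoint)))

morph = np.linspace(0, 100, 11)  # morph level: % of one prototype expression
# Hypothetical identification rates forming an S-shaped curve
rate = np.array([0.02, 0.03, 0.05, 0.10, 0.30, 0.55,
                 0.80, 0.92, 0.96, 0.98, 0.99])

fits = {
    "linear": np.polyval(np.polyfit(morph, rate, 1), morph),
    "poly2": np.polyval(np.polyfit(morph, rate, 2), morph),
}
popt, _ = curve_fit(logistic, morph, rate, p0=[1.0, 0.1, 50.0], maxfev=10_000)
fits["logistic"] = logistic(morph, *popt)

# A categorical (S-shaped) curve should favor the logistic fit
for name, y_hat in fits.items():
    print(f"{name}: R^2 = {r_squared(rate, y_hat):.3f}")
```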