Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring

Zhiwei Jiang∗† (jzw@nju.edu.cn), Meng Liu∗ (mf1933061@smail.nju.edu.cn), Yafeng Yin (yafeng@nju.edu.cn), Hua Yu (huayu.yh@smail.nju.edu.cn), Zifeng Cheng (chengzf@smail.nju.edu.cn), Qing Gu (guq@nju.edu.cn)
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

ABSTRACT
One-shot automated essay scoring (AES) aims to assign scores to a set of essays written for a specific prompt, given only one manually scored essay per distinct score. Compared to the previously studied prompt-specific AES, which usually requires a large number of manually scored essays for model training (e.g., about 600 manually scored essays out of a total of 1000), one-shot AES can greatly reduce the workload of manual scoring. In this paper, we propose a Transductive Graph-based Ordinal Distillation (TGOD) framework to tackle the task of one-shot AES. Specifically, we design a transductive graph-based model as a teacher model to generate pseudo labels for unlabeled essays based on the one-shot labeled essays. Then, we distill the knowledge of the teacher model into a neural student model by learning from the high-confidence pseudo labels. Different from general knowledge distillation, we propose an ordinal-aware unimodal distillation which places a unimodal distribution constraint on the output of the student model, so as to tolerate minor errors in the pseudo labels. Experimental results on the public ASAP dataset show that TGOD can improve the performance of existing neural AES models under the one-shot AES setting and achieves an acceptable average QWK of 0.69.

CCS CONCEPTS
• Computing methodologies → Natural language processing; • Information systems → Clustering and classification.

KEYWORDS
Essay Scoring, One-Shot, Graph Propagation, Ordinal Distillation

ACM Reference Format:
Zhiwei Jiang, Meng Liu, Yafeng Yin, Hua Yu, Zifeng Cheng, and Qing Gu. 2021. Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring. In Proceedings of the Web Conference 2021 (WWW '21), April 19-23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3442381.3450017

∗Both authors contributed equally to this research.
†Corresponding author.

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '21, April 19-23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3450017

1 INTRODUCTION
Automated Essay Scoring (AES) aims to summarize the quality of a student essay with a score or grade based on factors such as grammaticality, organization, and coherence. It is commercially valuable to be able to automate the scoring of millions of essays.
In fact, AES has been developed and deployed in large-scale standardized tests such as TOEFL, GMAT, and GRE [2]. Beyond essay scoring, as a general technique for evaluating text quality, AES can also be conveniently applied to various Web texts (e.g., news, responses, and posts).

Research on automated essay scoring has spanned the last 50 years [25] and continues to draw much attention in the natural language processing community [17]. Traditional AES methods mainly rely on various handcrafted features and score essays with regression methods [2, 19, 26, 32, 48]. Recently, with the development of deep learning, many models based on LSTMs and CNNs have been proposed [7, 8, 10, 39, 41]. These models automatically learn essay features and achieve better performance than traditional methods.

However, training an effective neural AES model often requires a large number of manually scored essays (e.g., about 600 manually scored essays out of a total of 1000 essays in a test), which is labor intensive. This limits its application in some real-world scenarios. To this end, some recent work considers using scored essays from other prompts (i.e., other writing topics) to alleviate the burden of manual scoring under the target prompt. But due to differences among prompts such as genre, score range,
topic, and difficulty, these cross-prompt methods often perform worse than prompt-specific methods [9]. Tackling the domain adaptation among prompts is a challenging problem, and some recent studies focus on this line of work [5, 14].

In this paper, we consider another way that does not use data from other prompts. Given a set of essays written for a target prompt, we ask whether we can score all of them based on only a few manually scored essays. In the extreme, we consider the one-shot scenario, in which only one manually scored essay per distinct score is given. In practical writing tests, scoring staff usually evaluate essays by first designing criteria specific to the current test and then applying the criteria to score the essays. To alleviate the burden on scoring staff, we expect the staff to first express the criteria through one-shot manual scoring, and then let a specially designed AES model score the remaining essays based on the one-shot data.

One-shot AES is a challenging task, since the one-shot labeled data is insufficient to train an effective neural AES model. To solve this problem, our intuition is to augment the one-shot labeled data with some pseudo labeled data, and then train the model on the augmented labeled data. There are two obvious challenges: one is how to acquire the pseudo labeled data, and the other is how to alleviate the disturbance caused by erroneous pseudo labels during model training.

To this end, we propose a Transductive Graph-based Ordinal Distillation (TGOD) framework for one-shot automated essay scoring, which is designed on a teacher-student mechanism (i.e., knowledge distillation) [13]. Specifically, we employ a transductive graph-based model [52, 53] as the teacher model to generate pseudo labels, and then train the neural AES model (the student model) on a combination of the pseudo labels and the one-shot labels. Considering that many of the pseudo labels may be erroneous, we select the pseudo labels with high confidence to improve their overall quality. Besides, considering that scores are on an ordinal scale and an essay is easily assigned a score near its ground-truth score (e.g., 3 is easily predicted as 2 or 4), we propose an ordinal-aware unimodal distillation strategy to tolerate pseudo labels with minor errors.

The major contributions of this paper are summarized as follows:
• For one-shot automated essay scoring, we propose a distillation framework based on graph propagation, which alleviates the need of supervised neural AES models for labeled data by exploiting unlabeled data.
• We propose the label selection and ordinal-aware unimodal distillation strategies to alleviate the effect of erroneous pseudo labels on the final AES model.
• The TGOD framework places no limitation on the architecture of the student model and can thus be applied to many existing neural AES models. Experimental results on a public dataset demonstrate that our framework can effectively improve the performance of several classical neural AES models under the one-shot AES setting.

2 PROBLEM DEFINITION
We first introduce some notation and formalize the one-shot automated essay scoring (AES) problem.
Let $\mathcal{X} = \{x_i\}_{i=1}^{N}$ denote a set of essays written for a certain prompt, $\mathcal{Y} = \{1, 2, \ldots, K\}$ denote a set of pre-defined scores (labels) on an ordinal scale, and $(x, y)$ denote an essay and its ground-truth score (label), respectively. For one-shot AES, we assume that we are given a set of one-shot labeled data $\mathcal{D}_o = \{(x_i, y_i = i)\}_{i=1}^{K}$, where the set $\mathcal{X}_o = \{x_i \mid (x_i, y_i) \in \mathcal{D}_o\}$ is a subset of $\mathcal{X}$ (i.e., $\mathcal{X}_o \subset \mathcal{X}$), and the essay $x \in \mathcal{X}_o$ with $y = i$ is the one-shot essay for the distinct score (label) $i \in \mathcal{Y}$. Apart from the one-shot labeled essays $\mathcal{X}_o$, the remaining essays in $\mathcal{X}$ constitute the unlabeled essay set $\mathcal{X}_u = \{x_i\}_{i=1}^{N_u}$, so that $\mathcal{X}_u \cup \mathcal{X}_o = \mathcal{X}$. The goal of one-shot AES is to learn a function $\mathcal{F}$ that predicts the scores (labels) of the unlabeled essays $x \in \mathcal{X}_u$, based on the one-shot labeled data $\mathcal{D}_o$ and the essay set $\mathcal{X}$:

  $\hat{y} = \mathcal{F}(x; \mathcal{D}_o, \mathcal{X})$.   (1)

Typical AES approaches based on supervised learning would remove $\mathcal{X}$ and replace $\mathcal{D}_o$ with a statistic $\theta^* = \theta^*(\mathcal{D}_o)$ in Eq. 1, since they can usually learn a statistic $\theta^*$ sufficient for the prediction $p_{\theta^*}(y|x)$ from the labeled data $\mathcal{D}_o$ alone. However, this is never the case in the one-shot setting, where the few labeled examples in $\mathcal{D}_o$ are insufficient to train a statistic $\theta^*$ that generalizes well. We therefore exploit both the one-shot labeled data $\mathcal{D}_o$ and the unlabeled essays $\mathcal{X}_u \subset \mathcal{X}$ to learn the prediction function $\mathcal{F}$, and thus adopt the more general form of $\mathcal{F}$ in Eq. 1.

3 THE TGOD FRAMEWORK
In this section, we introduce the proposed TGOD framework, followed by its technical details.

3.1 An Overview of TGOD
TGOD is designed on the teacher-student mechanism. It enables a supervised neural student model to benefit from a semi-supervised teacher model under the one-shot essay scoring setting. While the one-shot labeled data is insufficient to train the supervised neural student model directly, the student model can be trained by distilling the knowledge of the semi-supervised teacher model on the unlabeled essays. Through a specially designed ordinal distillation strategy, the supervised neural student model can even outperform the semi-supervised teacher model.

Specifically, as shown in Figure 1, TGOD contains three main components: the Teacher Model, which exploits the manifold structure among labeled and unlabeled essays based on graphs and generates pseudo labels of unlabeled essays for distillation; the Student Model, which treats essay scoring as an ordinal classification problem and predicts a unimodal distribution for each essay; and the Ordinal Distillation, which distills the unimodally smoothed outputs of the Teacher Model into the Student Model. In the following, we describe these components in technical detail.

3.2 Graph-Based Label Propagation (Teacher)
We introduce the Teacher Model illustrated in Figure 1, which is a graph-based label propagation model and consists of three components: multiple graph construction, which models the relationships among essays from multiple aspects; label propagation, which spreads labels from the one-shot essays to the unlabeled essays; and label guessing, which generates the pseudo labels of unlabeled essays from the results of multiple graph propagation.
[Figure 1: Architecture of the Transductive Graph-Based Ordinal Distillation (TGOD) framework. The Teacher Model propagates the one-shot labels over multiple essay graphs and guesses pseudo labels; the Student Model couples an essay encoder with a unimodal ordinal classifier (sigmoid, copy expansion, log CMB PMF, softmax); the Ordinal Distillation connects the two.]

3.2.1 Multiple Graphs Construction. To construct a graph on the essay set $\mathcal{X}$, we first need to extract the feature embedding of each essay $x_i \in \mathcal{X}$. Specifically, we employ an embedding layer followed by a mean pooling layer as the essay encoder $f_e(\cdot)$ to extract the feature embedding $f_e(x_i)$ of essay $x_i$.

Based on the feature embeddings, we then construct a neighborhood graph $G = (V, E, W)$ for the essay set $\mathcal{X}$, where $V = \mathcal{X}$ denotes the node set, $E$ denotes the edge set, and $W$ denotes the adjacency matrix. To construct an appropriate graph, we employ the Gaussian kernel function [53] to calculate the adjacency matrix $W$:

  $W_{ij} = \exp\left(-\frac{d(f_e(x_i), f_e(x_j))}{2\sigma^2}\right)$,   (2)

where $d(\cdot, \cdot)$ is a distance measure (e.g., Euclidean distance) and $\sigma$ is a length scale parameter.

To construct a k-nearest neighbor graph, we keep only the $k$ largest values in each row of $W$, and then apply the normalized graph Laplacian [6] to $W$:

  $S = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$,   (3)

where $D$ is a diagonal matrix whose $(i, i)$-th entry is the sum of the $i$-th row of $W$.

Since using different pre-trained word embeddings in the embedding layer may result in different k-nearest neighbor graphs, we can construct $B$ graphs by using $B$ types of pre-trained word embeddings (e.g., Word2Vec [20], GloVe [28], ELMo [31], BERT [43]).

3.2.2 Label Propagation. We now describe how to obtain predictions for the unlabeled essay set $\mathcal{X}_u$ using label propagation [23]. Let $\mathcal{F}$ denote the set of $N \times K$ matrices with nonnegative entries. We define a label matrix $Y \in \mathcal{F}$ with $Y_{ij} = 1$ if $x_i$ is from the one-shot essays $\mathcal{X}_o$ and labeled as $y_i = j$, and $Y_{ij} = 0$ otherwise. Starting from $Y$, label propagation iteratively determines the unknown labels of the essays in $\mathcal{X}_u$ according to the graph structure:

  $F^{t+1} = \alpha S F^t + (1 - \alpha) Y$,   (4)

where $F^t \in \mathcal{F}$ denotes the predicted labels at timestamp $t$, $S$ denotes the normalized weight matrix, and $\alpha \in (0, 1)$ controls the amount of propagated information. It is well known that the sequence $\{F^t\}$ has a closed-form solution:

  $F^* = (I - \alpha S)^{-1} Y$,   (5)

where $I$ is the identity matrix [52].
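To make Eqs. 2-5 concrete, the following is a minimal NumPy sketch of one teacher graph. It assumes precomputed essay embeddings, takes $d(\cdot, \cdot)$ as squared Euclidean distance, and the k-NN sparsification detail, the symmetrization step, and the default values of k, sigma, and alpha are our own choices rather than values given in the paper.

import numpy as np

def build_graph(embeddings, k=10, sigma=1.0):
    # Gaussian-kernel affinities (Eq. 2), with d taken as squared
    # Euclidean distance; k and sigma are placeholder defaults.
    sq = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # k-NN sparsification: keep the k largest weights per row, then
    # symmetrize (one common choice; the paper does not spell this out).
    drop = np.argsort(W, axis=1)[:, :-k]
    W[np.arange(W.shape[0])[:, None], drop] = 0.0
    W = np.maximum(W, W.T)
    # Normalization S = D^{-1/2} W D^{-1/2} (Eq. 3).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    return W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def propagate(S, Y, alpha=0.5):
    # Closed-form label propagation F* = (I - alpha S)^{-1} Y (Eq. 5);
    # Y is (N, K), one-hot on the K one-shot rows and zero elsewhere.
    N = S.shape[0]
    return np.linalg.solve(np.eye(N) - alpha * S, Y)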
3.2.3 Label Guessing. For each unlabeled essay in $\mathcal{X}_u$, we produce a "guess" for its label based on the predictions of label propagation on the multiple graphs. This guess is later used as the pseudo label of the unlabeled essay for knowledge distillation.

To do so, we first average the label distributions predicted by label propagation on all $B$ graphs:

  $Y' = \frac{1}{B} \sum_{b=1}^{B} F^*_{G_b}$,   (6)

where $Y'$ denotes the averaged label distribution matrix and $F^*_{G_b}$ denotes the final label distribution matrix produced by label propagation on graph $G_b$. Then, for each unlabeled essay $x_i \in \mathcal{X}_u$, its pseudo label $y'_i$ is obtained as

  $y'_i = \arg\max_{1 \le j \le K} Y'_{ij}$,   (7)

where $Y'_{ij}$ denotes the $j$-th element of the $i$-th row vector of $Y'$.
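Label guessing over the $B$ graphs (Eqs. 6-7) then reduces to an average and an argmax. A sketch reusing the helpers above; embeddings_per_view is a hypothetical list of $B$ embedding matrices, one per pre-trained word embedding.

import numpy as np

def guess_labels(F_stars):
    # Average the B propagation results (Eq. 6) and take the argmax
    # (Eq. 7); scores are returned 1-based to match Y = {1, ..., K}.
    Y_prime = np.mean(F_stars, axis=0)
    pseudo = np.argmax(Y_prime, axis=1) + 1
    return pseudo, Y_prime

# Hypothetical usage, where embeddings_per_view holds B embedding
# matrices (e.g., Word2Vec, GloVe, ELMo, BERT) and Y is the one-shot
# label matrix:
#   F_stars = [propagate(build_graph(E), Y) for E in embeddings_per_view]
#   pseudo, Y_prime = guess_labels(F_stars)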
3.3 Ordinal-Aware Neural Network (Student)
We introduce the Student Model illustrated in Figure 1, which is an ordinal-aware neural network consisting of two main components: an essay encoder that extracts the feature embedding of the input essay, and an ordinal classifier that predicts a unimodal label distribution over the pre-defined scores for each input essay.

3.3.1 Essay Encoder. We employ a neural network $f_\phi(\cdot)$ to extract features of an input $x_i$, where $f_\phi(x_i; \phi)$ is the essay embedding and $\phi$ denotes the parameters of the network. This module is not limited to a specific architecture and can be any of various existing AES encoders. To demonstrate the universality of our framework and to provide fairer comparisons in the experiments, we adopt the encoders used in recent work (e.g., CNN-LSTM-Att [9], HA-LSTM [5], BERT [5]).

3.3.2 Unimodal Ordinal Classifier. Unlike previous neural network based AES models, which predict the score of the input essay with a regression layer (i.e., a one-unit layer), we view essay scoring as an ordinal classification problem and adopt an ordinal classifier [3] for prediction.

To capture the ordinal relationship among classes, a unimodal probability distribution (i.e., a distribution that peaks at class $k$ and decreases as the class moves away from $k$) is usually used to restrict the shape of the predicted label distributions. According to previous studies [3, 22], certain exponential functions and the probability mass functions (PMFs) of both the Poisson and the binomial distribution can be used to enforce a discrete unimodal probability distribution.

In our framework, we choose an extension of the binomial distribution, the Conway-Maxwell binomial (CMB) distribution [16], as the base distribution, and employ the PMF of the CMB to generate the predicted unimodal probability distribution for essay $x_i$:

  $P(y_i = k) = \frac{1}{S(p, \upsilon)} \binom{K-1}{k-1}^{\upsilon} p^{k-1} (1-p)^{K-k}$,   (8)

where

  $S(p, \upsilon) = \sum_{k=1}^{K} \binom{K-1}{k-1}^{\upsilon} p^{k-1} (1-p)^{K-k}$.   (9)

Here $k \in \mathcal{Y} = \{1, 2, \ldots, K\}$, $0 \le p \le 1$, and $-\infty \le \upsilon \le \infty$. The parameter $\upsilon$ controls the variance of the distribution; the case $\upsilon = 1$ is the usual binomial distribution.

More specifically, we now describe the neural network architecture of the ordinal classifier based on the PMF of the CMB. As shown in Figure 1, the essay encoder is followed by a linear layer which transforms the essay embedding into a number $\upsilon \in \mathbb{R}$ and a probability $p \in [0, 1]$ (via a sigmoid activation). The linear layer is followed by a 'copy expansion' layer which expands the probability $p$ into $K$ probabilities corresponding to the $K$ distinct scores, that is, $p_{k=1} = p_{k=2} = \cdots = p_{k=K}$. The next layer applies the 'Log CMB PMF' transformation to these probabilities for each $k$:

  $\mathrm{LCP}(k; \upsilon, p) = \upsilon \log \binom{K-1}{k-1} + (k-1) \log p + (K-k) \log(1-p)$,   (10)

where the log operation is used to address numeric stability issues. Finally, a softmax layer is applied to the logits $\mathrm{LCP}(k; \upsilon, p)$ to produce a unimodal probability distribution $\hat{Y}_i$ for essay $x_i$:

  $\hat{Y}_{ik} = \frac{e^{\mathrm{LCP}(k; \upsilon, p)}}{\sum_{k=1}^{K} e^{\mathrm{LCP}(k; \upsilon, p)}}$,   (11)

where $\hat{Y}_{ik}$ denotes the $k$-th element of $\hat{Y}_i$. Based on $\hat{Y}_i$, the final predicted label $\hat{y}_i$ of essay $x_i$ is

  $\hat{y}_i = \arg\max_{1 \le k \le K} \hat{Y}_{ik}$.   (12)
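The classifier head of Eqs. 8-12 can be sketched as a small PyTorch module. This is an illustration under our reading of the 'copy expansion' and 'Log CMB PMF' layers, with the essay encoder assumed given; the clamping of $p$ is a numerical safeguard we add, not a detail from the paper.

import math
import torch
import torch.nn as nn

class CMBOrdinalHead(nn.Module):
    # Maps an essay embedding to (v, p) and then, via the log CMB PMF
    # (Eq. 10) and a softmax (Eq. 11), to a K-way unimodal distribution.
    def __init__(self, emb_dim, K):
        super().__init__()
        self.K = K
        self.linear = nn.Linear(emb_dim, 2)  # -> (v, logit of p)
        # log C(K-1, k-1) for k = 1..K, precomputed once.
        lb = [math.lgamma(K) - math.lgamma(k) - math.lgamma(K - k + 1)
              for k in range(1, K + 1)]
        self.register_buffer("log_binom", torch.tensor(lb))
        self.register_buffer("ks", torch.arange(1, K + 1, dtype=torch.float))

    def forward(self, essay_emb):
        out = self.linear(essay_emb)
        v = out[:, 0:1]                                       # v in R
        p = torch.sigmoid(out[:, 1:2]).clamp(1e-6, 1 - 1e-6)  # p in (0, 1)
        lcp = (v * self.log_binom                 # 'copy expansion': the same
               + (self.ks - 1) * torch.log(p)    # p is reused for every k
               + (self.K - self.ks) * torch.log(1.0 - p))
        return torch.softmax(lcp, dim=1)          # unimodal Y_hat over 1..K

# Predicted score (Eq. 12): probs.argmax(dim=1) + 1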
3.4 Ordinal Distillation
We introduce the Ordinal Distillation illustrated in Figure 1, which distills the pseudo-label knowledge of the Teacher Model into the Student Model and consists of three main steps: label selection, which selects high-confidence pseudo labels for later distillation; unimodal smoothing, which turns the label distribution of each pseudo label into a unimodal probability distribution; and unimodal distillation, which minimizes the KL divergence between the predicted label distribution of the Student Model and the unimodally smoothed label distribution of the Teacher Model.

3.4.1 Label Selection. Considering that only one-shot labeled data is available for label propagation, the pseudo labels generated by the Teacher Model may be noisy. We therefore propose a label selection strategy that selects a subset of pseudo labels with high confidence.

Specifically, for each distinct score $k \in \mathcal{Y}$, we first collect all corresponding pseudo labels, $C_k = \{y'_i \mid y'_i = k, x_i \in \mathcal{X}_u\}$, and then rank the pseudo labels in $C_k$ by their confidence. We measure the confidence of a pseudo label $y'_i$ as the negative Shannon entropy of its label distribution, so that a peaked distribution tends to receive a high confidence:

  $\mathrm{Confidence}(y'_i) = -H(Y'_i) = \sum_{j=1}^{K} Y'_{ij} \log_2 Y'_{ij}$.   (13)

After that, we select the top $m_k$ pseudo labels with the highest confidence from $C_k$, where

  $m_k = \min\left(|C_k|, \max(a, |C_k| \times \gamma)\right)$,   (14)

and the threshold ratio $\gamma$ and the threshold number $a$ are set to ensure that a sufficient number of pseudo labels is selected while avoiding a serious class imbalance problem.

3.4.2 Unimodal Smoothing. Previous studies on knowledge distillation [13, 49] have shown that a soft or smoothed probability distribution from the teacher model is more suitable for knowledge distillation than a one-hot distribution. Considering that essay scoring is an ordinal classification problem and that an essay is more likely to be mispredicted as a score close to its ground-truth score, we enforce the distribution of pseudo labels produced by the teacher model to be a unimodal smoothed probability distribution.

As mentioned before, certain exponential functions [22] can be used to enforce a discrete unimodal probability distribution. We therefore employ an exponential function to perform unimodal smoothing on both the one-shot labels and the pseudo labels:

  $q'(y_i = k \mid x_i) = \begin{cases} \dfrac{\exp(-|k - y_i| / \tau)}{\sum_{j=1}^{K} \exp(-|j - y_i| / \tau)} & x_i \in \mathcal{X}_o \\[2ex] \dfrac{\exp(-|k - y'_i| / \tau)}{\sum_{j=1}^{K} \exp(-|j - y'_i| / \tau)} & x_i \in \mathcal{X}_u \end{cases}$,   (15)

where $k \in \mathcal{Y}$ and $\tau$ is a parameter that controls the variance of the distribution.
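A NumPy sketch of the selection step (Eqs. 13-14) follows. The row normalization of $Y'$ (so that the entropies are well defined) and the default values of gamma and a are our assumptions; the values used in the paper are not given in this section.

import numpy as np

def select_pseudo_labels(pseudo, Y_prime, gamma=0.5, a=10):
    # Confidence = negative Shannon entropy of each label distribution
    # (Eq. 13); rows of Y_prime are normalized first (our assumption).
    P = Y_prime / Y_prime.sum(axis=1, keepdims=True)
    conf = np.sum(P * np.log2(P + 1e-12), axis=1)
    K = Y_prime.shape[1]
    keep = []
    for k in range(1, K + 1):
        C_k = np.flatnonzero(pseudo == k)                    # essays guessed as k
        m_k = min(len(C_k), max(a, int(len(C_k) * gamma)))   # Eq. 14
        keep.extend(C_k[np.argsort(-conf[C_k])[:m_k]])       # top-m_k by confidence
    return np.sort(np.array(keep, dtype=int))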
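Unimodal smoothing (Eq. 15) applies the same exponential kernel to one-shot labels and selected pseudo labels alike. A minimal sketch; the temperature default is a placeholder.

import numpy as np

def unimodal_smooth(labels, K, tau=1.0):
    # Exponential unimodal smoothing (Eq. 15): each hard score y becomes
    # a distribution proportional to exp(-|k - y| / tau) over k = 1..K.
    ks = np.arange(1, K + 1)
    q = np.exp(-np.abs(ks[None, :] - np.asarray(labels)[:, None]) / tau)
    return q / q.sum(axis=1, keepdims=True)

# e.g. unimodal_smooth([3], K=5) peaks at score 3 and decays
# symmetrically toward scores 1 and 5.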
3.4.3 Unimodal Distillation. Since the one-shot labeled data $\mathcal{D}_o$ is not sufficient to train a neural network, we use the pseudo labels produced by the teacher model as a supplement to train the student model. Specifically, we train the student model by matching the output label distribution of the student model, $\hat{q}(x_i) = \hat{Y}_i$, to the unimodally smoothed pseudo label of the teacher model, $q'(x_i)$, via a KL-divergence loss:

  $\mathcal{L}_{OD} = \sum_{x_i \in \mathcal{X}_s} D_{KL}\left(\hat{q}(x_i) \,\|\, q'(x_i)\right)$,   (16)

where $\mathcal{X}_s$ denotes the set of essays from either the one-shot data or the selected essays after label selection.
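The distillation objective is then a plain KL divergence between the student's unimodal prediction and the smoothed target, summed over the selected essays. A sketch, keeping the direction $D_{KL}(\hat{q} \| q')$ exactly as written in Eq. 16.

import torch

def ordinal_distillation_loss(student_probs, smoothed_targets, eps=1e-12):
    # L_OD of Eq. 16: D_KL(q_hat || q') summed over the essays in X_s.
    # student_probs and smoothed_targets are both (M, K) distributions.
    kl = (student_probs
          * (torch.log(student_probs + eps) - torch.log(smoothed_targets + eps))
          ).sum(dim=1)
    return kl.sum()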
3.5 Training Flow of TGOD
In summary, TGOD trains the Student Model under the one-shot setting in two steps: it first generates pseudo labels for the unlabeled essays by running the Teacher Model, and then trains the Student Model by Ordinal Distillation. The whole training flow of TGOD is illustrated in Figure 1 and Algorithm 1.

Algorithm 1 The Training Flow of TGOD
Input: The whole set of essays $\mathcal{X}$, one-shot labeled data $\mathcal{D}_o$.
Output: An optimized student model.
Run the Teacher Model:
  Construct multiple graphs $G^* = \{G_1, G_2, \ldots, G_B\}$ on $\mathcal{X}$.
  for each $G_b \in G^*$ do
    Apply the label propagation algorithm on $G_b$ as in Eq. 5.
  end for
  Generate pseudo labels by label guessing as in Eqs. 6 and 7.
Train the Student Model by Ordinal Distillation:
  Select the pseudo labels with high confidence by Eqs. 13 and 14.
  Smooth the selected labels as in Eq. 15.
  Split the selected essays into a training set $D_t$ and a validation set $D_v$.
  for iter = 1, ..., MaxIter do
    Optimize the student model on $D_t$ by minimizing Eq. 16.
    Validate the student model on $D_v$.
  end for
  return the student model with the best performance on $D_v$.

In particular, considering that model selection is difficult under the one-shot supervised setting, we design a model selection strategy based on pseudo labels, which validates the model on a subset of the pseudo labels.

4 EXPERIMENTS
In this section, we first introduce the dataset and the evaluation metric. Then we describe the experimental settings, the implementation details, and the performance comparison. Finally, we conduct an ablation study and model analysis to investigate the effectiveness of the proposed approach.

4.1 Dataset and Evaluation Metric
We conduct experiments on the public ASAP (Automated Student Assessment Prize, https://www.kaggle.com/c/asap-aes/data) dataset, a widely used benchmark for automated essay scoring. ASAP contains eight sets of essays corresponding to eight different prompts, with a total of 12,978 scored essays. The eight essay sets vary in essay number, genre, and score range; details are listed in Table 1.

Table 1: Statistics of the ASAP datasets. In the Genre column, ARG denotes argumentative essays, RES denotes response essays, and NAR denotes narrative essays. The last column lists the score ranges.

Prompt | #Essays | Genre | Avg. Len. | Range
1 | 1,783 | ARG | 350 | 2-12
2 | 1,800 | ARG | 350 | 1-6
3 | 1,726 | RES | 150 | 0-3
4 | 1,772 | RES | 150 | 0-3
5 | 1,805 | RES | 150 | 0-4
6 | 1,800 | RES | 150 | 0-4
7 | 1,569 | NAR | 250 | 0-30
8 | 723 | NAR | 650 | 0-60

To evaluate the performance of AES methods, we employ the quadratic weighted kappa (QWK), the official metric of the ASAP dataset. For each set of essays with possible scores $\mathcal{Y} = \{1, 2, \ldots, K\}$, the QWK measures the agreement between the automatically predicted scores (Rater A) and the resolved human scores (Rater B):

  $\kappa = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}}$,   (17)

where $w_{i,j} = \frac{(i-j)^2}{(K-1)^2}$ is calculated from the difference between the raters' scores, $O$ is a $K \times K$ histogram matrix, $O_{i,j}$ is the number of essays that received score $i$ from Rater A and score $j$ from Rater B, and $E$ is the normalized outer product of the two raters' histogram vectors of scores.

4.2 Experimental Settings
For the 'one-shot' setting, we conduct experiments by randomly sampling the one-shot labeled data to train the model, and we test the model on the remaining unlabeled essays. To reduce randomness, in each case we repeat the sampling of one-shot labeled data 20 times and report the average results. For our proposed framework, we perform model selection on the pseudo validation set. For the other baseline methods, since the one-shot labeled data is used for training and no extra labeled data is available as a validation set for model selection, we report their best performance on the test set as an upper bound for comparison.

For the 'one-shot + history prompt' setting, we combine the one-shot labeled data with the labeled data from a history prompt with a similar score range (e.g., P1 → P2, P2 → P1, P3 → P4, P4 → P3, and