Enriching Word Embeddings with Domain Knowledge for Readability Assessment

Zhiwei Jiang, Qing Gu*, Yafeng Yin and Daoxu Chen
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
jiangzhiwei@outlook.com, {guq,yafeng,cdx}@nju.edu.cn
*Corresponding author. This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/

Abstract

In this paper, we present a method which learns the word embedding for readability assessment. Existing word embedding models typically focus on the syntactic or semantic relations of words while ignoring reading difficulty, and thus may not be suitable for readability assessment. Hence, we provide the knowledge-enriched word embedding (KEWE), which encodes knowledge on reading difficulty into the representation of words. Specifically, we extract knowledge on word-level difficulty from three perspectives to construct a knowledge graph, and develop two word embedding models that incorporate the difficulty context derived from the knowledge graph to define the loss functions. Experiments are designed to apply KEWE for readability assessment on both English and Chinese datasets, and the results demonstrate both the effectiveness and the potential of KEWE.

1 Introduction

Readability assessment is a classic problem in natural language processing which has attracted many researchers' attention in recent years (Todirascu et al., 2016; Schumacher et al., 2016; Cha et al., 2017). The objective is to evaluate the readability of texts by levels or scores. The majority of recent readability assessment methods are based on the framework of supervised learning (Schwarm and Ostendorf, 2005) and build classifiers from hand-crafted features extracted from the texts. The performance of these methods depends on designing effective features to build high-quality classifiers.

Designing hand-crafted features is essential but labor-intensive. It is desirable to learn representative features from the texts automatically. For document-level readability assessment, an effective feature learning method is to construct the representation of documents by combining the representations of the words they contain (Kim, 2014). For the representation of words, a useful technique is to learn a dense and low-dimensional vector for each word, which is called word embedding. Existing word embedding models (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014) can be used for readability assessment, but their effectiveness is compromised by the fact that these models typically focus on the syntactic or semantic relations of words while ignoring reading difficulty. As a result, words with similar functions or topics, such as "man" and "gentleman", are mapped into close vectors although their reading difficulties are different. This calls for incorporating knowledge on reading difficulty when training the word embedding.

In this paper, we provide the knowledge-enriched word embedding (KEWE) for readability assessment, which encodes knowledge on reading difficulty into the representation of words. Specifically, we define the word-level difficulty from three perspectives, and use the extracted knowledge to construct a knowledge graph. After that, we derive the difficulty context of words from the knowledge graph, and develop two word embedding models that incorporate the difficulty context to define the loss functions.
We apply KEWE to document-level readability assessment under the supervised framework. The experiments are conducted on four datasets, in either English or Chinese. The results demonstrate that
our method can outperform other well-known readability assessment methods, as well as the classic text-based word embedding models, on all the datasets. By concatenating our knowledge-enriched word embedding with the hand-crafted features, the performance can be further improved.

The rest of the paper is organized as follows. Section 2 reviews the related work for readability assessment. Section 3 describes the details of KEWE. Section 4 presents the experiments and results. Finally, Section 5 concludes the paper with future work.

2 Related Work

In this section, we briefly introduce three research topics relevant to our work: readability assessment, word embedding, and graph embedding.

Readability Assessment. Research on readability assessment has a relatively long history, dating from the beginning of the last century (Collins-Thompson, 2014). Early studies mainly focused on designing readability formulas to evaluate the reading scores of texts. Some of the well-known readability formulas include the SMOG formula (McLaughlin, 1969), the FK formula (Kincaid et al., 1975), and the Dale-Chall formula (Chall, 1995). At the beginning of the 21st century, supervised approaches were introduced and then explored for readability assessment (Si and Callan, 2001; Collins-Thompson and Callan, 2004; Schwarm and Ostendorf, 2005). Researchers have focused on improving the performance by designing highly effective features (Pitler and Nenkova, 2008; Heilman et al., 2008; Feng et al., 2010; Vajjala and Meurers, 2012) and employing effective classification models (Heilman et al., 2007; Kate et al., 2010; Ma et al., 2012; Jiang et al., 2015; Cha et al., 2017). While most studies are conducted for English, there are studies for other languages, such as French (François and Fairon, 2012), German (Hancke et al., 2012), Bangla (Sinha et al., 2014), Basque (Gonzalez-Dios et al., 2014), Chinese (Jiang et al., 2014), and Japanese (Wang and Andersen, 2016).

Word Embedding. Researchers have proposed various methods for word embedding, which fall into two broad categories: neural network based methods (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013) and co-occurrence matrix based methods (Turney and Pantel, 2010; Levy and Goldberg, 2014b; Pennington et al., 2014). Neural network based methods learn word embedding by training neural network models, which include NNLM (Bengio et al., 2003), C&W (Collobert and Weston, 2008), and word2vec (Mikolov et al., 2013). Co-occurrence matrix based methods learn word embedding from co-occurrence matrices, which include LSA (Deerwester, 1990), Implicit Matrix Factorization (Levy and Goldberg, 2014b), and GloVe (Pennington et al., 2014). Besides general word embedding learning methods, researchers have also proposed methods to learn word embedding with certain properties (Liu et al., 2015; Shen and Liu, 2016) or for certain domains (Tang et al., 2014; Ren et al., 2016; Alikaniotis et al., 2016; Wu et al., 2017).

Graph Embedding. Graph embedding aims to learn continuous representations of the nodes or edges of a graph based on its structure. Graph embedding methods can be classified into three categories (Goyal and Ferrara, 2017): factorization based (Roweis and Saul, 2000; Belkin and Niyogi, 2001), random walk based (Perozzi et al., 2014; Grover and Leskovec, 2016), and deep learning based (Wang et al., 2016).
Among them, the random walk based methods are easy to comprehend and can effectively preserve the centrality and similarity of the nodes. DeepWalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016) are two representatives of the random walk based methods. The basic idea of DeepWalk is to view random walk paths as sentences and feed them to a general word embedding model. node2vec is similar to DeepWalk, although it simulates a biased random walk over the graph and often provides efficient random walk paths.

3 Learning Knowledge-Enriched Word Embedding for Readability Assessment

In this section, we present the details of Knowledge-Enriched Word Embedding (KEWE) for readability assessment. By incorporating the word-level readability knowledge, we extend the existing word embedding model and design two models with different learning structures. As shown in Figure 1, one is the knowledge-only word embedding model (KEWE_k), which only takes in the domain knowledge, and the other is the hybrid word embedding model (KEWE_h), which compensates the domain knowledge with the text corpus.
[Figure 1: Illustration of the knowledge-enriched word embedding models. KEWE_k is based on the difficulty context, while KEWE_h is based on both the difficulty and text contexts.]

3.1 The Knowledge-only Word Embedding Model (KEWE_k)

In the classic word embedding models, such as C&W, CBOW, and Skip-Gram, the context of a word is represented by its surrounding words in the text corpus. Levy and Goldberg (2014a) have incorporated the syntactic context from dependency parse-trees and found that the trained word embedding captures more functional and less topical similarity. For readability assessment, reading difficulty, rather than function or topic, becomes more important. Hence, we introduce a kind of difficulty context and try to learn a difficulty-focused word embedding, which leads to KEWE_k. In the following, we describe this model in three steps: domain knowledge extraction, knowledge graph construction, and graph-based word embedding learning. The former two steps focus on modeling the relationship among words on reading difficulty, and the final step on deriving the difficulty context and learning the word embedding.

3.1.1 Domain Knowledge Extraction

To model the relationship among words on reading difficulty, we first introduce how to extract the knowledge on word-level difficulty from different perspectives. Specifically, we consider three types of word-level difficulty: acquisition difficulty, usage difficulty, and structure difficulty.

Acquisition difficulty. Word acquisition refers to the temporal stage at which children learn the meaning of new words. Researchers have shown that information on word acquisition is useful for readability assessment (Kidwell et al., 2009; Schumacher et al., 2016). Generally, the words acquired at primary school are easier than those acquired at high school. We call the reading difficulty reflected by word acquisition the acquisition difficulty. Formally, given a word w, its acquisition difficulty is described by a distribution K^A_w over the age-of-acquisition (AoA) (Kidwell et al., 2009).

Since the rating on AoA is an unsolved problem in cognitive science (Brysbaert and Biemiller, 2016) and is not available for many languages, we explore extra materials to describe the acquisition difficulty. In particular, we collect three kinds of teaching materials, i.e., in-class teaching material, extra-curricular teaching material, and proficiency test material. These materials are arranged as lists of words, each of which contains words learned in the same time period and hence corresponds to a certain level of acquisition difficulty. For example, given a set of a word lists, we can define K^A_w ∈ R^a, where K^A_{w,i} = 1 if the word w belongs to the i-th list, and K^A_{w,i} = 0 otherwise.

Usage difficulty. Researchers used to count the usage frequency to measure the difficulty of words (Dale and Chall, 1948), which can separate the words that are frequently used from those rarely used. We call the difficulty reflected by usage preference the usage difficulty. Formally, given a word w, its usage difficulty is described by a distribution K^U_w over the usage preferences.
We provide two ways to measure the usage difficulty. One way is to estimate the level of a word's usage frequency by counting word frequency lists from the text corpus. The other way is to estimate the probability distribution of words over the sentence-level difficulties, which is motivated by Jiang et al. (2015). Usage difficulty is defined on both. By discretizing the range of word frequency into b intervals of equal size, the usage frequency level of a word w is i if its frequency resides in the i-th interval. By estimating the probability distribution vector P_w from sentence-level difficulties, we can define K^U_w ∈ R^{1+|P_w|} as K^U_w = [i, P_w].

Structure difficulty. When building readability formulas, researchers have found that the structure of words can imply their difficulty (Flesch, 1948; Gunning, 1952; McLaughlin, 1969). For example, words with more syllables are usually more difficult than words with fewer syllables. We call the difficulty reflected by the structure of words the structure difficulty. Formally, given a word w, its structure difficulty can be described by a distribution K^S_w over the word structures.

Words in different languages may have their own special structural characteristics. For example, in English, the structural characteristics of words relate to syllables, characters, affixes, and subwords, whereas in Chinese, they relate to strokes and radicals of Chinese characters. Here we use the number of syllables (strokes for Chinese) and the number of characters in a word w to describe its structure difficulty. By discretizing the range of each number into intervals, K^S_w is obtained by recording the interval in which w resides, respectively.
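To make the three knowledge vectors concrete, the following Python sketch builds K^A_w, K^U_w, and K^S_w for a single word and concatenates them. The graded word lists, frequency counts, sentence-level difficulty distribution, and interval boundaries are made-up placeholders, and the helper names are ours; it is a minimal illustration of the definitions above under these assumptions, not the authors' implementation.

import numpy as np

def acquisition_vector(word, graded_lists):
    # K^A_w: one-hot over a graded word lists; K^A_{w,i} = 1 iff w appears in list i.
    return np.array([1.0 if word in lst else 0.0 for lst in graded_lists])

def usage_vector(freq, max_freq, num_bins, sentence_level_dist):
    # K^U_w = [i, P_w]: the frequency level i (index over b equal-width intervals)
    # concatenated with the word's distribution P_w over sentence-level difficulties.
    level = min(int(freq / (max_freq / num_bins)), num_bins - 1)
    return np.concatenate(([float(level)], np.asarray(sentence_level_dist, dtype=float)))

def structure_vector(num_syllables, num_chars, syll_edges, char_edges):
    # K^S_w: interval indicators for the syllable (stroke) count and the character count.
    s = np.zeros(len(syll_edges) + 1); s[np.digitize(num_syllables, syll_edges)] = 1.0
    c = np.zeros(len(char_edges) + 1); c[np.digitize(num_chars, char_edges)] = 1.0
    return np.concatenate((s, c))

# Toy example; the lists, counts, and distribution below are invented for illustration.
graded_lists = [{"man", "dog"}, {"gentleman", "notion"}]      # two acquisition levels
K_w = np.concatenate((
    acquisition_vector("gentleman", graded_lists),            # K^A_w
    usage_vector(freq=120, max_freq=10_000, num_bins=5,
                 sentence_level_dist=[0.1, 0.3, 0.6]),        # K^U_w
    structure_vector(num_syllables=3, num_chars=9,
                     syll_edges=[2, 4], char_edges=[5, 8]),   # K^S_w
))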
3.1.2 Knowledge Graph Construction

After extracting the domain knowledge on word-level difficulty, we quantitatively represent the knowledge by a graph. We define the knowledge graph as an undirected graph G = (V, E), where V is the set of vertices, each of which represents a word, and E is the set of edges, each of which represents the relation (i.e., similarity) between two words on difficulty. Each edge e ∈ E is a vertex pair (w_i, w_j) and is associated with a weight z_{ij}, which indicates the strength of the relation. If no edge exists between w_i and w_j, the weight z_{ij} = 0. We define two edge types in the graph: Sim_edge and Dissim_edge. The former indicates that its end words have similar difficulty and is associated with a positive weight. The latter indicates that its end words have significantly different difficulty and is associated with a negative weight. We derive the edges from the similarities computed between pairs of the words' knowledge vectors. Formally, given the extracted knowledge vector K_w = [K^A_w, K^U_w, K^S_w] of a word w, E can be constructed using the similarity between pairs of words (w_i, w_j) as follows:

z_{ij} = \begin{cases} \mathrm{sim}(K_{w_i}, K_{w_j}) & w_j \in N_p(w_i) \\ -\mathrm{sim}(K_{w_i}, K_{w_j}) & w_j \in N_n(w_i) \\ 0 & \text{otherwise} \end{cases}    (1)

where sim(·) is a similarity function (e.g., cosine similarity), N_p(w_i) refers to the set of the k most similar (i.e., greatest similarity) neighbors of w_i, and N_n(w_i) refers to the set of the k most dissimilar (i.e., least similarity) neighbors of w_i.
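The sketch below is one possible realization of Eq. (1): it computes pairwise cosine similarities over the knowledge vectors and keeps, for every word, the k most similar neighbors as Sim_edges (positive weights) and the k most dissimilar neighbors as Dissim_edges (negative weights). The function name and the dictionary-of-edges representation are our own illustrative choices.

import numpy as np

def build_knowledge_graph(K, k):
    # Eq. (1): connect every word w_i to its k most similar words (Sim_edge, weight +sim)
    # and its k most dissimilar words (Dissim_edge, weight -sim), using cosine similarity
    # over the knowledge vectors.
    Kn = K / (np.linalg.norm(K, axis=1, keepdims=True) + 1e-12)
    S = Kn @ Kn.T                                  # pairwise cosine similarities
    sim_edges, dissim_edges = {}, {}
    for i in range(K.shape[0]):
        order = np.argsort(S[i])                   # neighbors by ascending similarity
        order = order[order != i]                  # exclude self-loops
        for j in order[-k:]:                       # N_p(w_i): k most similar
            sim_edges[(i, int(j))] = float(S[i, j])
        for j in order[:k]:                        # N_n(w_i): k most dissimilar
            dissim_edges[(i, int(j))] = -float(S[i, j])
    return sim_edges, dissim_edges

# Usage: rows of K are the concatenated knowledge vectors [K^A_w, K^U_w, K^S_w], e.g.
# sim_edges, dissim_edges = build_knowledge_graph(K, k=10)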
3.1.3 Knowledge Graph-based Word Embedding

After constructing the knowledge graph, which models the relationship among words on difficulty, we can derive the difficulty context from the graph and train a word embedding focused on reading difficulty. For the graph-based difficulty context, given a word w, we define its difficulty context as the set of other words that are relevant to w on difficulty. Specifically, we define two types of difficulty context, positive context and negative context, corresponding to the two types of edges in the knowledge graph (i.e., Sim_edge and Dissim_edge).

Unlike the context defined on texts, which can be sampled by sliding windows over consecutive words, the context defined on a graph requires special sampling strategies. Different sampling strategies may define the context differently. For the difficulty context, we design two relatively intuitive strategies, the random walk strategy and the immediate neighbors strategy, for sampling the positive and the negative context, respectively.

From the edges of type Sim_edge, we sample the positive target-context pairs, in which the target word and the context words are similar on difficulty. Since similarity is generally transitive, we adopt the random walk strategy to sample the positive context. Following the idea of node2vec (Grover and Leskovec, 2016), we sample the positive contexts of words by simulating a 2nd-order random walk on the knowledge graph with only Sim_edges. After that, by applying a sliding window of fixed length s over the sampled random walk paths, we obtain the positive target-context pairs {(w_t, w_c)}.

From the edges of type Dissim_edge, we sample the negative target-context pairs, in which the target word and the context words are dissimilar on difficulty. Since dissimilarity is generally not transitive, we adopt the immediate neighbor strategy to sample the negative context. Specifically, on the knowledge graph with only Dissim_edges, we collect the negative context from the immediate neighbors of the target node w_t and obtain the negative context list C_n(w_t).

By replacing the text-based linear context with our graph-based difficulty context, we can train the word embedding using the classic word embedding models, such as C&W, CBOW, and Skip-Gram. Here we use the Skip-Gram model with Negative Sampling (SGNS) proposed by Mikolov et al. (2013). Specifically, given N positive target-context pairs (w_t, w_c) and the negative context list C_n(w_t) of each target word, the objective of KEWE_k is to minimize the loss function L_k, which is defined as follows:

L_k = -\frac{1}{N} \sum_{(w_t, w_c)} \left[ \log \sigma(u_{w_c}^{\top} v_{w_t}) + \mathbb{E}_{w_i \in C_n(w_t)} \log \sigma(-u_{w_i}^{\top} v_{w_t}) \right]    (2)

where v_w and u_w are the "input" and "output" vector representations of w, and σ is the sigmoid function defined as σ(x) = 1/(1 + e^{-x}). This loss function enables the positive context (e.g., w_c) to be distinguished from the negative context (e.g., w_i).
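A compact sketch of this training procedure is given below. For brevity it uses a uniform (1st-order) random walk over the Sim_edge subgraph instead of the 2nd-order node2vec walk, and it applies one stochastic-gradient step of the loss in Eq. (2) per positive pair, with negatives drawn from the Dissim_edge neighbors. The helper names and hyper-parameter values are illustrative assumptions, not the authors' code.

import numpy as np

rng = np.random.default_rng(0)

def random_walks(sim_adj, num_walks, walk_len):
    # Uniform random walks over the Sim_edge subgraph (a simplification of the
    # 2nd-order node2vec walk used in the paper).
    walks = []
    for _ in range(num_walks):
        for start in sim_adj:
            walk = [start]
            while len(walk) < walk_len and sim_adj.get(walk[-1], []):
                walk.append(int(rng.choice(sim_adj[walk[-1]])))
            walks.append(walk)
    return walks

def positive_pairs(walks, window):
    # Slide a window of length `window` over each walk to collect (w_t, w_c) pairs.
    pairs = []
    for walk in walks:
        for t, wt in enumerate(walk):
            for wc in walk[max(0, t - window): t + window + 1]:
                if wc != wt:
                    pairs.append((wt, wc))
    return pairs

def sgns_step(U, V, wt, wc, negatives, lr=0.025):
    # One gradient step on Eq. (2): raise sigma(u_wc . v_wt) for the positive context
    # and lower sigma(u_wi . v_wt) for each negative (Dissim_edge) context.
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    g_pos = 1.0 - sigmoid(U[wc] @ V[wt])
    grad_v = g_pos * U[wc].copy()
    U[wc] += lr * g_pos * V[wt]
    for wi in negatives:
        g_neg = -sigmoid(U[wi] @ V[wt])
        grad_v += g_neg * U[wi].copy()
        U[wi] += lr * g_neg * V[wt]
    V[wt] += lr * grad_v

def train_kewe_k(sim_adj, dissim_adj, num_words, dim=100, window=5,
                 num_walks=10, walk_len=40, epochs=5):
    # sim_adj / dissim_adj map each word id to the list of its neighbor ids.
    V = (rng.random((num_words, dim)) - 0.5) / dim   # "input" vectors v_w
    U = np.zeros((num_words, dim))                   # "output" vectors u_w
    for _ in range(epochs):
        for wt, wc in positive_pairs(random_walks(sim_adj, num_walks, walk_len), window):
            sgns_step(U, V, wt, wc, dissim_adj.get(wt, []))
    return V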
3.2 The Hybrid Word Embedding Model (KEWE_h)

The classic text-based word embedding models yield word embeddings that focus on syntactic and semantic contexts while ignoring word difficulty. By contrast, KEWE_k trains the word embedding to focus on word difficulty, while leaving out the syntactic and semantic information. Since readability may also relate to both syntax and semantics, we develop a hybrid word embedding model (KEWE_h) to incorporate both the domain knowledge and the text corpus. The loss function L_h of the hybrid model can be expressed as follows:

L_h = \lambda L_k + (1 - \lambda) L_t    (3)

where L_k is the loss of predicting the knowledge graph-based difficulty contexts, L_t is the loss of predicting the text-based syntactic and semantic contexts, and λ ∈ [0, 1] is a weighting factor. Clearly, the case of λ = 1 reduces the hybrid model to the knowledge-only model.

As there are many text-based word embedding models, the text-based loss L_t can be defined in various ways. To be consistent with KEWE_k, we formalize L_t based on the Skip-Gram model. Given a text corpus, the Skip-Gram model aims to find word representations that are good at predicting the context words. Specifically, given a sequence of training words, denoted as w_1, w_2, ..., w_T, the objective of the Skip-Gram model is to minimize the log loss of predicting the context using the target word embedding, which can be expressed as follows:

L_t = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-s \le j \le s,\, j \ne 0} \log p(w_{t+j} \mid w_t)    (4)

where s is the window size of the context sampling. Since the full softmax function used to define p(w_{t+j} | w_t) is computationally expensive, we employ the negative sampling strategy (Mikolov et al., 2013) and replace every log p(w_c | w_t) in L_t by the following formula:

\log p(w_c \mid w_t) = \log \sigma(u_{w_c}^{\top} v_{w_t}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \log \sigma(-u_{w_i}^{\top} v_{w_t})    (5)

where v_w, u_w, and σ have the same meanings as in Eq. (2), k is the number of negative samples, and P_n(w) is the noise distribution. This strategy enables the actual context w_c to be distinguished from the noise contexts w_i drawn from the noise distribution P_n(w).
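One way to optimize the combined objective of Eq. (3) is to interleave updates from the two context sources in proportion to λ. The sketch below follows that idea and reuses the sgns_step helper from the KEWE_k sketch in Section 3.1.3; the interleaving schedule and the callback names difficulty_pairs and text_pairs are our assumptions for illustration, since the paper only defines the combined loss.

import numpy as np

rng = np.random.default_rng(1)

def train_kewe_h(difficulty_pairs, text_pairs, num_words, dim=100, lam=0.5,
                 steps=1_000_000, lr=0.025):
    # L_h = lam * L_k + (1 - lam) * L_t (Eq. 3): at each step a target-context pair is
    # drawn from the knowledge-graph contexts with probability lam and from the text
    # contexts otherwise, then a single SGNS update is applied (Eq. 2 for the former,
    # Eq. 5 for the latter).
    V = (rng.random((num_words, dim)) - 0.5) / dim    # "input" vectors v_w
    U = np.zeros((num_words, dim))                    # "output" vectors u_w
    for _ in range(steps):
        if rng.random() < lam:
            wt, wc, negatives = difficulty_pairs()    # negatives: Dissim_edge neighbors
        else:
            wt, wc, negatives = text_pairs()          # negatives: k draws from P_n(w)
        sgns_step(U, V, wt, wc, negatives, lr)        # helper from the KEWE_k sketch
    return V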