GRAW+: A Two-View Graph Propagation Method With Word Coupling for Readability Assessment

Zhiwei Jiang, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: jiangzhiwei@outlook.com
Qing Gu, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: guq@nju.edu.cn
Yafeng Yin, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: yafeng@nju.edu.cn
Jianxiang Wang, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: wjxnju@outlook.com
Daoxu Chen, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: cdx@nju.edu.cn

Received March 26, 2017; revised June 20, 2018; accepted July 14, 2018. © 2019 ASIS&T. Published online February 18, 2019 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.24123. JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 70(5):433–447, 2019.

Existing methods for readability assessment usually construct inductive classification models to assess the readability of single text documents based on extracted features, and have been demonstrated to be effective. However, they rarely make use of the interrelationship among documents on readability, which can help increase the accuracy of readability assessment. In this article, we adopt a graph-based classification method to model and utilize the relationship among documents using the coupled bag-of-words model. We propose a word coupling method to build the coupled bag-of-words model by estimating the correlation between words on reading difficulty. In addition, we propose a two-view graph propagation method to make use of both the coupled bag-of-words model and the linguistic features. Our method employs a graph merging operation to combine graphs built according to different views, and improves the label propagation by incorporating the ordinal relation among reading levels. Experiments were conducted on both English and Chinese data sets, and the results demonstrate both the effectiveness and the potential of the method.

Introduction

Readability assessment evaluates the reading difficulties of text documents, which are normally represented as discrete reading levels. Automatic readability assessment is a challenging task, which has attracted researchers' attention since the beginning of the last century (Collins-Thompson, 2014). Traditionally, it can be used by educationists to choose appropriate reading materials for students of different education or grade levels. In modern times, it can be used by web search engines to perform personalized searches based on web users' educational backgrounds.

Existing methods for readability assessment usually concentrate on feature engineering and then apply inductive classification models to utilize the features. In the early stages, researchers proposed readability formulas to measure the readability of texts (Zakaluk & Samuels, 1988). These formulas are usually obtained by linear regression on several easy-to-compute text features relevant to reading difficulty. Recently, by employing machine-learning techniques, classification-based methods have been proposed and demonstrated to be more effective than readability formulas (Benjamin, 2012; Collins-Thompson, 2014). These methods combine the rich representation of texts with sophisticated prediction models.
Most current methods assess the readability of text documents individually, and ignore the interrelationship among documents on readability, which can be useful in assessing the readability of unlabeled documents based on the labeled ones. For example, two documents can be of the same reading level if they consist of words that have similar reading difficulty. Hence, we propose a graph propagation method for readability assessment, which can model and utilize the interrelationship among text documents.

To measure the relationship among documents, we use the bag-of-words (BoW) model, which is commonly used for text classification and clustering (Huang, 2008; Sebastiani, 2002). However, to measure the relationship on readability, the basic BoW model requires improvements, since it ignores the fact that different words may have similar reading difficulties. Figure 1 illustrates the improved use of the BoW model for readability assessment using a simple example. In Figure 1, the left matrix is built from the basic BoW model for three documents (that is, D1, D2, and D3) consisting of four tokens (that is, school, law, syllabus, and decree). Among the three documents, D1 and D2 are two relatively difficult documents both containing two easy words (school or law) and two difficult words (syllabus or decree), while D3 is an easy document that contains two easy words (school). By calculating the cosine similarities based on the basic BoW model (the bottom left subfigure), the result shows that D1 is more similar to D3 than to D2, which is inconsistent with their similarities on reading difficulty.

To overcome the shortcoming of the basic BoW model, we designed a word coupling method. As shown in Figure 1, the word coupling method first measures the similarities among words on reading difficulty (the word coupling matrix). Then the method makes words of similar difficulties (for example, school and law) share their occurrence frequencies with each other (by matrix multiplication), which leads to the coupled BoW model (the coupled BoW matrix). In this way, documents will be similar on readability if their words have similar distributions on reading difficulty.

FIG. 1. A motivation example of the word coupling method. The left matrix is a basic BoW matrix. The central matrix is a word coupling matrix. The right matrix is the coupled BoW matrix. [Color figure can be viewed at wileyonlinelibrary.com]

To build the coupled BoW model, the key point is to construct the word coupling matrix. For this purpose, we first estimate the occurrence distributions of words in sentences of different reading difficulties, and then compute their similarities on reading difficulty based on these distributions.

Besides the coupled BoW model, linguistic features can also be adopted by our method. On the one hand, we use the linguistic features as a complement to the coupled BoW model to construct graphs from multiple views. On the other hand, the linguistic features are used to reinforce the label propagation algorithm by providing prior knowledge.

In this article, we propose a two-view graph propagation method with word coupling for readability assessment. Our contributions are as follows (a preliminary version of this work appeared in Jiang, Sun, Gu, Bai, & Chen, 2015). (i) We apply the graph-based method for readability assessment, which can make use of the interrelationship among documents to estimate their readability. (ii) We propose the coupled BoW model, which can be used to measure the similarity of documents on reading difficulty. (iii) We propose a two-view graph building strategy to make use of both the coupled BoW model and the linguistic features.
(iv) We propose a reinforced label propagation algorithm, which can make use of the ordinal relation among reading levels. Extensive experiments were carried out on data sets of both English and Chinese. Compared with state-of-the-art methods, the results demonstrate both the effectiveness and the potential of our method.

Background and Related Work

Readability Assessment

Research on automatic readability assessment has spanned the last 70 years (Benjamin, 2012). Early research mainly focused on the design of readability formulas (Zakaluk & Samuels, 1988). Many well-known readability formulas have been developed, such as the SMOG formula
(McLaughlin, 1969) and the FK formula (Kincaid, Fishburne, Rogers, & Chissom, 1975). A key observation in these studies is that the vocabulary used in a document usually determines its readability (Pitler & Nenkova, 2008). A general way of using the vocabulary is the statistical language model (Collins-Thompson & Callan, 2004; Kidwell, Lebanon, & Collins-Thompson, 2009). More recently, researchers have explored complex linguistic features combined with classification models to obtain robust and effective readability prediction methods (Denning, Pera, & Ng, 2016; Feng, Jansche, Huenerfauth, & Elhadad, 2010; Schwarm & Ostendorf, 2005). While most studies are conducted for English, there are also studies for other languages, such as French (François & Fairon, 2012), German (Hancke, Vajjala, & Meurers, 2012), and Bangla (Sinha, Dasgupta, & Basu, 2014). In addition, researchers have applied representation learning techniques to readability assessment (Cha, Gwon, & Kung, 2017; Tseng, Hung, Sung, & Chen, 2016).

The Bag-of-Words Model

The BoW model has been widely used for document classification owing to its simplicity and general applicability. It constructs a feature space that contains all the distinct words of a language (or text corpus). Traditionally, it assumes that words are independent, but recently, capturing the coupling relationships between words has attracted much attention (Cao, 2015). Billhardt, Borrajo, and Maojo (2002) studied the coupling relationship based on the co-occurrence of words in the same documents. Kalogeratos and Likas (2012) generalized the relationship by taking into account the distance among words in the sentences of each document. Cheng, Miao, Wang, and Cao (2013) estimated the co-occurrence of words by mining transitive properties. Inspired by these studies, this article modifies the BoW model for readability assessment, and provides the coupled BoW model, which incorporates the co-occurrence of words at different levels of reading difficulty.

Graph-Based Label Propagation

Graph-based label propagation operates on a graph to propagate class labels from labeled nodes to unlabeled ones (Subramanya, Petrov, & Pereira, 2010). It has been successfully applied in various applications, such as dictionary construction (Kim, Verma, & Yeh, 2013), word segmentation and tagging (Zeng, Wong, Chao, & Trancoso, 2013), and sentiment classification (Ponomareva & Thelwall, 2012). Typically, a graph-based label propagation method consists of two main steps: graph construction and label propagation. During the first step, some form of edge weight estimation and edge pruning is required to build an efficient graph (Jebara, Wang, & Chang, 2009; Ponomareva & Thelwall, 2012). In addition, nodes in the graph can be heterogeneous; for example, both instance nodes and class nodes can coexist in a birelational graph (Jiang, 2011), and nodes (words) of English can be linked to nodes (words) of the target language (Chinese) in a bilingual graph (Gao, Wei, Li, Liu, & Zhou, 2015). During the second step, propagation algorithms are required to propagate the label distributions to all the nodes (Kim et al., 2013; Subramanya et al., 2010) so that the classes of unlabeled nodes can be predicted.

The Proposed Method

In this section, we first present an overview of the proposed method (GRAW+). Then we describe its two main parts in detail: feature representation and readability classification.

An Overview of the Method

The framework of GRAW+ is depicted in Figure 2.
FIG. 2. The framework of GRAW+. [Color figure can be viewed at wileyonlinelibrary.com]

GRAW+ takes an auxiliary text corpus and a target document set as inputs. The auxiliary text corpus contains unlabeled sentences used to construct the word coupling matrix. The target document set contains both labeled and unlabeled documents on readability. The objective of GRAW+ is to predict the reading levels of the unlabeled documents in the target set based on the labeled ones. GRAW+ includes two main stages: feature representation and readability classification.

During the first stage, the documents in the data set are mapped into feature vectors from two distinct views: the cBoW (coupled BoW) view and the linguistic view. From the cBoW view, the coupled BoW model is required, and from the linguistic view, suitable linguistic features can be borrowed from previous studies. By representing documents from these two views, both word-level difficulty distributions and document-level readability features can be measured, which together provide extensive information for readability assessment.

During the second stage, a two-view graph propagation method is proposed for readability classification, which consists of three main steps: graph construction, graph merging, and label propagation. In the first step, both labeled and unlabeled documents are used to build graphs, where each document is represented by a node, and their similarities on reading difficulty are represented by edge weights. In the second step, the intra-view merging operation is used to merge the homogeneous graphs within the same view, and the inter-view merging operation is used to merge the heterogeneous graphs from different views. In the third step, a label propagation algorithm is designed on the merged graph.
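To make the classification stage concrete, the sketch below illustrates the general graph-propagation pattern that GRAW+ builds on: documents become nodes, pairwise similarities become edge weights of a pruned k-nearest-neighbor graph, and labels are spread iteratively from labeled to unlabeled nodes. This is a minimal, generic illustration only: it shows neither the two-view graph merging nor the ordinal reinforcement of the propagation described above, and all function and variable names (knn_graph, propagate_labels, k, n_iter) are ours rather than the authors'.

```python
import numpy as np

def knn_graph(X, k=10):
    """Build a symmetric k-NN similarity graph from row-wise document feature vectors.

    Each document is a node; edge weights are cosine similarities, and only the
    k strongest neighbors of each node are kept (a common edge-pruning step).
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T                              # cosine similarity matrix
    np.fill_diagonal(S, 0.0)
    W = np.zeros_like(S)
    for i in range(S.shape[0]):
        nbrs = np.argsort(S[i])[-k:]           # indices of the k most similar documents
        W[i, nbrs] = S[i, nbrs]
    return np.maximum(W, W.T)                  # symmetrize the pruned graph

def propagate_labels(W, y, n_levels, labeled_mask, n_iter=100):
    """Plain iterative label propagation on graph W.

    y holds reading levels (0..n_levels-1) for labeled documents; the rows of
    labeled documents are clamped back to their known labels after every step.
    """
    n = W.shape[0]
    F = np.full((n, n_levels), 1.0 / n_levels)         # uniform start for unlabeled nodes
    F[labeled_mask] = np.eye(n_levels)[y[labeled_mask]]
    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)     # row-normalized transition matrix
    for _ in range(n_iter):
        F = P @ F                                      # spread label mass along edges
        F[labeled_mask] = np.eye(n_levels)[y[labeled_mask]]  # clamp labeled nodes
    return F.argmax(axis=1)                            # predicted reading levels
```

With document feature vectors X (for example, rows of the coupled BoW matrix introduced below), a call such as propagate_labels(knn_graph(X), y, n_levels, labeled_mask) would yield predicted levels for the unlabeled documents.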
The Coupled Bag-of-Words Model

To build the cBoW model, we first construct the word coupling matrix. Then we transform the basic BoW model into the coupled BoW model using the word coupling matrix.

Constructing the word coupling matrix. As stated before, the word coupling matrix is constructed to represent the similarities of word pairs on reading difficulty. For this purpose, we assume the simple fact that easy words tend to appear in easy sentences, while difficult words tend to appear in difficult sentences. Hence, we can estimate the reading difficulty of a word by its distribution of occurrence probabilities in sentences of different difficulty levels. The difficulty distributions can then be used to estimate the similarities among words on reading difficulty by computing the distribution divergence. Since sentences with labeled difficulty levels are hard to acquire, we use unlabeled sentences instead, and label the sentences by estimating their difficulty levels with heuristic functions.

Figure 3 presents the three steps required to construct the word coupling matrix: per-sentence reading difficulty estimation, per-word difficulty distribution estimation, and word coupling matrix construction. In the first step, each sentence in the text corpus is assigned a weak label, which is a reading score computed by a heuristic function. In the second step, based on the weak labels, the difficulty distribution of each word is estimated according to its occurrences in these sentences. In the final step, the similarities among words are calculated using the distribution divergence.

FIG. 3. The process of representing documents using the coupled bag-of-words model. [Color figure can be viewed at wileyonlinelibrary.com]

Step 1: Per-sentence reading difficulty estimation. The precise estimation of sentence-level readability is a hard problem and has recently attracted the attention of many researchers (Pilán, Volodina, & Johansson, 2014; Schumacher, Eskenazi, Frishkoff, & Collins-Thompson, 2016; Vajjala & Meurers, 2014). For efficiency, we use heuristic functions to make a rough estimation.
Specifically, we consider the linguistic features designed for readability assessment that have been demonstrated to be effective in previous studies (Feng et al., 2010; Schumacher et al., 2016), and choose the most commonly used features that can be computed at the sentence level to build the heuristic functions. In total, eight heuristic functions h ∈ {len, ans, anc, lv, atr, ntr, pth, anp}, corresponding to eight distinct features from three aspects, are used to compute the reading score of a sentence, as shown in Table 1.

Let S denote the set of all the sentences ready for constructing the word coupling matrix. Given a sentence s ∈ S, its reading difficulty can be quantified as a reading score r(s) = h(s) by using one of the eight functions. The more difficult s is, the greater r(s) will be.

Considering that the reading score r(s) may be continuous, we discretize r(s) into several difficulty levels. Let η denote the predetermined number of difficulty levels, and let r^h_max and r^h_min denote, respectively, the maximum and minimum reading score of all the sentences in S, where h refers to one of the eight functions. To determine the difficulty level l^h(s) (l^h(s) ∈ [1, η]) of a sentence s, the range [r^h_min, r^h_max] is evenly divided into η intervals. l^h(s) will be i if the reading score r(s) resides in the i-th interval. For each of the three aspects, we compute one l^*(s) for a sentence s by combining the heuristic functions using the following equations:

l^{sur}(s) = \max(l^{len}(s), l^{ans}(s), l^{anc}(s))
l^{lex}(s) = \max(l^{lv}(s), l^{atr}(s), l^{ntr}(s))        (1)
l^{syn}(s) = \max(l^{pth}(s), l^{anp}(s))
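As a concrete illustration of Step 1, the sketch below scores sentences with two surface heuristics (len and anc), discretizes the scores into η equal-width intervals over [r^h_min, r^h_max], and combines the per-function levels with the max operator as in Equation 1. It is a simplified sketch: the lexical and syntactic functions of Table 1 would additionally require POS tagging and syntactic parsing, the tokenization is naive, and the helper names (discretize, sentence_levels, n_levels) are ours, not the authors'.

```python
def discretize(scores, n_levels):
    """Map raw reading scores to difficulty levels 1..n_levels by splitting
    [min(scores), max(scores)] into n_levels equal-width intervals."""
    r_min, r_max = min(scores), max(scores)
    width = (r_max - r_min) / n_levels or 1.0          # guard against zero-width intervals
    levels = [int((r - r_min) / width) + 1 for r in scores]
    return [min(l, n_levels) for l in levels]          # the maximum score falls into level n_levels

def sentence_levels(sentences, n_levels=5):
    """Assign each sentence a surface-aspect difficulty level l_sur(s).

    Only two surface heuristics are shown; per Equation 1 the aspect-level
    difficulty is the max over the per-function levels.
    """
    tokens = [s.split() for s in sentences]            # naive whitespace tokenization
    len_scores = [len(w) for w in tokens]              # len(s): sentence length
    anc_scores = [sum(len(t) for t in w) / max(len(w), 1) for w in tokens]  # anc(s)
    l_len = discretize(len_scores, n_levels)
    l_anc = discretize(anc_scores, n_levels)
    return [max(a, b) for a, b in zip(l_len, l_anc)]   # l_sur(s) = max(l_len(s), l_anc(s), ...)
```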
Step 2: Per-word difficulty distribution estimation. The difficulty distribution of each word is computed based on the sentence-level reading difficulty. Since each sentence contains many words and each word may appear in many sentences, we estimate the difficulty distributions of words according to their distributions of occurrences in sentences. Let V denote the set of all the words appearing in S, and let p_t denote the difficulty distribution of a word (term) t ∈ V. p_t is a vector containing η values (that is, one per difficulty level), the i-th element of which can be calculated by Equation 2:

p_t(i) = \frac{1}{n_t} \sum_{s \in S} \delta(t \in s) \, \delta(l(s) = i)        (2)

where n_t refers to the number of sentences containing t. The indicator function \delta(x) returns 1 if x is true and 0 otherwise.

Step 3: Word coupling matrix construction. Given the set of words V, a word coupling matrix is defined as C ∈ R^{|V| × |V|}, the elements of which reflect the correlations between pairs of words (that is, terms). The correlation between each pair of words can be computed according to the similarity of their difficulty distributions. Given two words (terms) t_1 and t_2, whose difficulty distributions are p_{t_1} and p_{t_2}, respectively, we use a symmetric version of the Kullback–Leibler divergence (Kullback & Leibler, 1951) to measure their distribution difference, which averages the values of the divergence computed from both directions:

cKL(t_1, t_2) = \frac{1}{2} \left( KL(p_{t_1} \| p_{t_2}) + KL(p_{t_2} \| p_{t_1}) \right)        (3)

where KL(p \| q) = \sum_i p(i) \log \frac{p(i)}{q(i)} and i is the element index. After that, the logistic function is applied to get the normalized distribution similarity, that is:

sim(t_1, t_2) = \frac{2}{1 + e^{cKL(t_1, t_2)}}        (4)

Given a word t_i, only the λ other words with the highest correlation (similarity) are selected to build the neighbor set of t_i, denoted as N(t_i). If a word t_j is not selected (that is, t_j ∉ N(t_i)), the corresponding sim(t_i, t_j) will be assigned 0. After that, the word coupling matrix C^*, with sim(t_i, t_j) as its elements, is normalized along the rows so that the sum of each row is 1. Based on the three different l^*(s), we construct three distinct word coupling matrices C^{sur}, C^{lex}, and C^{syn}.

Since a large vocabulary makes the construction of the word coupling matrix time-consuming, we provide a strategy to filter out less informative words based on their distributions on reading difficulty. The filtering measure is the entropy of a word, which can be calculated by Equation 5. By sorting the words in ascending order of entropy, the last α ∈ [0, 1] proportion will be filtered out.

Ent(t) = H(p_t) = -\sum_{i=1}^{\eta} p_t(i) \log p_t(i)        (5)
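A minimal sketch of Steps 2 and 3 (plus the entropy filter of Equation 5) might look as follows, assuming sentences are already tokenized and assigned difficulty levels as in Step 1. It follows Equations 2-4 directly: per-word difficulty distributions, symmetric KL divergence, the logistic similarity, λ-nearest-neighbor pruning, and row normalization. The variable names, the small smoothing constant, and the default values of lam and alpha are ours; the paper does not fix them here. The dense pairwise computation is only practical for a modest (for example, entropy-filtered) vocabulary.

```python
import numpy as np

def word_distributions(sentences, levels, n_levels, eps=1e-12):
    """Equation 2: p_t(i) = fraction of sentences containing word t whose level is i.

    sentences: list of token lists; levels: the per-sentence difficulty levels (1..n_levels).
    """
    vocab = sorted({t for s in sentences for t in s})
    index = {t: j for j, t in enumerate(vocab)}
    P = np.zeros((len(vocab), n_levels))
    for s, l in zip(sentences, levels):
        for t in set(s):                            # delta(t in s): count each sentence once
            P[index[t], l - 1] += 1.0
    P /= P.sum(axis=1, keepdims=True) + eps         # divide by n_t, the number of sentences with t
    return vocab, P + eps                           # tiny epsilon keeps the logs below finite

def coupling_matrix(P, lam=20):
    """Equations 3-4, followed by lambda-neighbor pruning and row normalization."""
    logP = np.log(P)
    # KL(p_i || p_j) for every ordered pair of word distributions (dense: |V| x |V| x eta)
    kl = (P[:, None, :] * (logP[:, None, :] - logP[None, :, :])).sum(-1)
    ckl = 0.5 * (kl + kl.T)                         # symmetric divergence, Equation 3
    sim = 2.0 / (1.0 + np.exp(ckl))                 # logistic similarity, Equation 4
    np.fill_diagonal(sim, 0.0)                      # a word is not its own neighbor
    C = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nbrs = np.argsort(sim[i])[-lam:]            # keep only the lambda most similar words
        C[i, nbrs] = sim[i, nbrs]
    return C / (C.sum(axis=1, keepdims=True) + 1e-12)   # each row sums to 1

def entropy_filter(vocab, P, alpha=0.2):
    """Equation 5: drop the alpha proportion of words with the highest entropy."""
    ent = -(P * np.log(P)).sum(axis=1)
    keep = np.argsort(ent)[: int(len(vocab) * (1 - alpha))]
    return [vocab[i] for i in keep], P[keep]
```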
TABLE 1. Three aspects of estimating reading difficulty of sentences using heuristic functions.

Aspect      Function   Description
Surface     len(s)     the length of the sentence s.
            ans(s)     the average number of syllables (or strokes for Chinese) per word (or character for Chinese) in s.
            anc(s)     the average number of characters per word in s.
Lexical     lv(s)      the number of distinct types of POS, that is, part of speech, in s.
            atr(s)     the ratio of adjectives in s.
            ntr(s)     the ratio of nouns in s.
Syntactic   pth(s)     the height of the syntax parser tree of s.
            anp(s)     the average number of (noun, verb, and preposition) phrases in s.

Generating the Coupled Bag-of-Words Model. In the basic BoW model, words are treated as being independent of each other, and the corresponding BoW matrix is sparse and ignores the similarity among words on reading difficulty. For readability assessment, the coupled BoW model can be implemented by multiplying the word coupling matrix and the basic BoW matrix, and the resulting coupled BoW matrix will be dense and focus on similarities on reading difficulty.
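To make the multiplication concrete, the snippet below reproduces the spirit of the Figure 1 example: with a row-normalized coupling matrix in which school/law and syllabus/decree share weight, the coupled BoW vectors of the two difficult documents D1 and D2 become identical, while the easy document D3 stays distinct. Documents are rows here, so the BoW matrix is post-multiplied by the coupling matrix. The exact counts are illustrative assumptions chosen to be consistent with the similarities reported in Figure 1 rather than copied from it.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rows: D1, D2, D3; columns: school, law, syllabus, decree (basic BoW counts).
bow = np.array([[1, 0, 1, 0],    # D1: easy word + difficult word
                [0, 1, 0, 1],    # D2: easy word + difficult word
                [2, 0, 0, 0]],   # D3: easy words only
               dtype=float)

# Row-normalized word coupling matrix: school~law and syllabus~decree are coupled.
C = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5]])

coupled = bow @ C   # coupled BoW matrix: frequencies shared among similar-difficulty words

print(cosine(bow[0], bow[1]), cosine(coupled[0], coupled[1]))   # 0.0   -> 1.0
print(cosine(bow[0], bow[2]), cosine(coupled[0], coupled[2]))   # 0.707 -> 0.707
print(cosine(bow[1], bow[2]), cosine(coupled[1], coupled[2]))   # 0.0   -> 0.707
```

After coupling, the two difficult documents are maximally similar to each other, and the easy document is equally (and less) similar to both, which matches the behavior Figure 1 is meant to illustrate.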