Bayesian Disambiguation Algorithm

Training:
for all senses sk of w do
    for all vj in vocabulary do
        P(vj | sk) = C(vj, sk) / C(sk)
    end
end
for all senses sk of w do
    P(sk) = C(sk) / C(w)
end

Disambiguation:
for all senses sk of w do
    score(sk) = log P(sk)
    for all vj in context window c do
        score(sk) = score(sk) + log P(vj | sk)
    end
end
choose argmax_sk score(sk)

Gale, Church, and Yarowsky obtain 90% correct disambiguation on 6 ambiguous nouns in the Hansard corpus using this approach (e.g., drug as a medication vs. an illicit substance).
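The training and disambiguation loops above translate almost directly into code. The Python sketch below is an illustration rather than the authors' implementation: the sense-tagged training pairs, the variable names, and the add-one smoothing of P(vj | sk) are assumptions added here to keep the toy example self-contained and runnable.

import math
from collections import Counter, defaultdict

def train_naive_bayes(instances):
    """instances: list of (context_words, sense) pairs for one ambiguous word w.
    Returns estimates of the prior P(sk) and the conditionals P(vj|sk).
    Add-one smoothing is an assumption beyond the slide's raw relative frequencies."""
    sense_counts = Counter()              # C(sk)
    word_counts = defaultdict(Counter)    # C(vj, sk)
    vocab = set()
    for context, sense in instances:
        sense_counts[sense] += 1
        for v in context:
            word_counts[sense][v] += 1
            vocab.add(v)
    total = sum(sense_counts.values())    # C(w)
    prior = {s: c / total for s, c in sense_counts.items()}
    cond = {}
    for s in sense_counts:
        denom = sum(word_counts[s].values()) + len(vocab)
        cond[s] = {v: (word_counts[s][v] + 1) / denom for v in vocab}
    return prior, cond, vocab

def disambiguate(context, prior, cond, vocab):
    """score(sk) = log P(sk) + sum of log P(vj|sk) over context words; pick the argmax."""
    best_sense, best_score = None, float("-inf")
    for s in prior:
        score = math.log(prior[s])
        for v in context:
            if v in vocab:
                score += math.log(cond[s][v])
        if score > best_score:
            best_sense, best_score = s, score
    return best_sense

# Toy data echoing the drug example (the contexts are invented for illustration).
train = [
    (["prices", "prescription", "patients"], "medication"),
    (["doctor", "prescription", "dose"], "medication"),
    (["illegal", "trafficking", "abuse"], "illicit substance"),
    (["abuse", "dealers", "street"], "illicit substance"),
]
prior, cond, vocab = train_naive_bayes(train)
print(disambiguate(["prescription", "dose"], prior, cond, vocab))  # -> medication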
Supervised Disambiguation: An Information-Theoretic Approach

◼ (Brown et al., 1991) attempt to find a single contextual feature that reliably indicates which sense of an ambiguous word is being used.
◼ For example, the French verb prendre has two different readings that are affected by the word appearing in object position (mesure → to take, décision → to make), but the verb vouloir's reading is affected by tense (present → to want, conditional → to like).
◼ To make good use of an informant, its values need to be categorized as to which sense they indicate (e.g., mesure → to take, décision → to make); Brown et al. use the Flip-Flop algorithm to do this.
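To make the notion of an informant concrete, the toy lookup below is assumed here for illustration (it is not Brown et al.'s system): each ambiguous word is paired with a single informant feature and a table that categorizes the informant's values by the sense they indicate.

# One informant per ambiguous word, with its values categorized by sense.
# Words, features, and mappings are taken from the prendre/vouloir example above.
informants = {
    "prendre": ("object", {"mesure": "to take", "décision": "to make"}),
    "vouloir": ("tense",  {"present": "to want", "conditional": "to like"}),
}

def sense_of(word, context_features):
    """context_features: dict of feature name -> observed value, e.g. {"object": "mesure"}."""
    feature, value_to_sense = informants[word]
    return value_to_sense.get(context_features.get(feature), "unknown")

print(sense_of("prendre", {"object": "décision"}))    # -> to make
print(sense_of("vouloir", {"tense": "conditional"}))  # -> to like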
Supervised Disambiguation: An Information-Theoretic Approach

◼ Let t1, …, tm be translations for an ambiguous word and x1, …, xn be possible values of the indicator.
◼ The Flip-Flop algorithm is used to disambiguate between the different senses of a word using mutual information:
        I(X; Y) = Σx∈X Σy∈Y p(x, y) log [ p(x, y) / (p(x) p(y)) ]
◼ See Brown et al. for an extension to more than two senses.
◼ The algorithm works by searching for partitions of the translations and of the indicator values that maximize the mutual information between them; it stops when the increase in mutual information becomes insignificant.
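The sketch below illustrates the Flip-Flop idea on toy (translation, indicator-value) pairs. It is a deliberately simplified, assumption-laden version: exhaustive search over two-way partitions stands in for the efficient splitting step used in practice, and a fixed iteration count replaces the stopping test on the increase in mutual information.

from itertools import combinations
from math import log2
from collections import Counter

def mutual_information(pairs, t_part, x_part):
    """I between the two-way partition of translations (t_part = one group) and the
    two-way partition of indicator values (x_part), estimated from (t, x) pairs:
    I = sum over cells of p(a, b) log2[ p(a, b) / (p(a) p(b)) ]."""
    n = len(pairs)
    joint = Counter((t in t_part, x in x_part) for t, x in pairs)
    pt = Counter(t in t_part for t, _ in pairs)
    px = Counter(x in x_part for _, x in pairs)
    return sum((c / n) * log2((c / n) / ((pt[a] / n) * (px[b] / n)))
               for (a, b), c in joint.items())

def proper_subsets(items):
    """All non-empty proper subsets, i.e. candidate two-way partitions."""
    items = list(items)
    return [set(s) for r in range(1, len(items)) for s in combinations(items, r)]

def flip_flop(pairs, iterations=10):
    """Alternately choose the indicator partition and the translation partition
    that maximize mutual information (brute force; fine for toy data)."""
    translations = {t for t, _ in pairs}
    values = {x for _, x in pairs}
    t_part = proper_subsets(translations)[0]   # arbitrary initial partition
    x_part = None
    for _ in range(iterations):
        x_part = max(proper_subsets(values),
                     key=lambda q: mutual_information(pairs, t_part, q))
        t_part = max(proper_subsets(translations),
                     key=lambda p: mutual_information(pairs, p, x_part))
    return t_part, x_part

# Toy data: prendre's translation paired with the word in object position.
pairs = [("to take", "mesure"), ("to take", "initiative"), ("to take", "mesure"),
         ("to make", "décision"), ("to make", "parole"), ("to make", "décision")]
print(flip_flop(pairs))  # groups {mesure, initiative} with "to take", {décision, parole} with "to make"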
Mutual Information

◼ I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X), the mutual information between X and Y, is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.
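A quick numeric check, using an illustrative joint distribution assumed here, shows that the double-sum definition of I(X; Y) and the entropy-difference form H(X) - H(X|Y) give the same value.

from math import log2

# Illustrative joint distribution p(x, y) over two binary variables.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(v for (a, _), v in p.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in p.items() if b == y) for y in (0, 1)}

def H(dist):
    """Entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# H(X|Y) = sum over y of p(y) * H(X | Y = y)
H_X_given_Y = sum(py[y] * H({x: p[(x, y)] / py[y] for x in (0, 1)}) for y in (0, 1))

mi_from_entropy = H(px) - H_X_given_Y
mi_from_sum = sum(p[(x, y)] * log2(p[(x, y)] / (px[x] * py[y]))
                  for x in (0, 1) for y in (0, 1))
print(round(mi_from_entropy, 6), round(mi_from_sum, 6))  # both ~ 0.278072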
Mutual Information (cont)

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

◼ I(X; Y) is a symmetric, non-negative measure of the common information of two variables.
◼ Some see it as a measure of dependence between two variables, but it is better to think of it as a measure of independence.
◼ I(X; Y) is 0 only when X and Y are independent: H(X|Y) = H(X).
◼ For two dependent variables, I grows not only according to the degree of dependence but also according to the entropy of the two variables.
◼ H(X) = H(X) - H(X|X) = I(X; X) ➔ This is why entropy is called self-information.
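Both the independence property and the self-information identity can be checked numerically; the distributions in the sketch below are illustrative assumptions, not taken from the slides.

from math import log2

def mi(p, xs, ys):
    """I(X; Y) from a joint distribution {(x, y): prob}; zero cells contribute nothing."""
    px = {x: sum(p.get((x, y), 0) for y in ys) for x in xs}
    py = {y: sum(p.get((x, y), 0) for x in xs) for y in ys}
    return sum(q * log2(q / (px[x] * py[y])) for (x, y), q in p.items() if q > 0)

# Independent X and Y: p(x, y) = p(x) p(y)  ->  I(X; Y) = 0
independent = {(x, y): 0.5 * 0.5 for x in (0, 1) for y in (0, 1)}
print(mi(independent, (0, 1), (0, 1)))  # 0.0

# Y is an exact copy of X: I(X; X) = H(X) = 1 bit for a fair coin (self-information)
copy = {(0, 0): 0.5, (1, 1): 0.5}
print(mi(copy, (0, 1), (0, 1)))         # 1.0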