Supervised Disambiguation
◼ Training set: exemplars where each occurrence of the ambiguous word w is annotated with a semantic label. This becomes a statistical classification problem: assign w some sense s_k in context c.
◼ Approaches:
  ◼ Bayesian Classification: the context of occurrence is treated as a bag of words without structure, but it integrates information from many words in a context window.
  ◼ Information Theory: looks only at the most informative feature in the context, which may be sensitive to text structure.
◼ There are many more approaches (see Chapter 16 or a text on Machine Learning (ML)) that could be applied.
Supervised Disambiguation: Bayesian Classification
◼ Bayesian classification (Gale et al., 1992): look at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier does no feature selection; it simply combines the evidence from all features, assuming they are independent.
◼ Bayes decision rule: decide s' if P(s'|c) > P(s_k|c) for all s_k ≠ s' (see the sketch below).
◼ This rule is optimal because it minimizes the probability of error: for each individual case it selects the class with the highest conditional probability (and hence the lowest error rate).
◼ The error rate for a sequence of decisions is therefore also minimized.
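A minimal sketch of the decision rule, assuming the posteriors P(s_k|c) have already been estimated somehow; the senses and probabilities below are invented for illustration.

```python
# Bayes decision rule: pick the sense with the highest posterior P(s_k|c).
# The senses and probability values here are hypothetical placeholders.
posteriors = {"bank/FINANCE": 0.72, "bank/RIVER": 0.28}

decided = max(posteriors, key=posteriors.get)
print(decided)  # -> bank/FINANCE
```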
Supervised Disambiguation: Bayesian Classification
◼ We do not usually know P(s_k|c), but we can use Bayes' rule to compute it:
  P(s_k|c) = (P(c|s_k) / P(c)) × P(s_k)
◼ P(s_k) is the prior probability of s_k, i.e., the probability of sense s_k without any contextual information.
◼ When we update the prior with evidence from the context (i.e., P(c|s_k)/P(c)), we obtain the posterior probability P(s_k|c).
◼ If all we want to do is select the correct class, we can ignore P(c), since it is the same for every sense. We also use logs to simplify computation.
◼ Assign word w the sense
  s' = argmax_{s_k} P(s_k|c)
     = argmax_{s_k} P(c|s_k) × P(s_k)
     = argmax_{s_k} [log P(c|s_k) + log P(s_k)]
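To make the log-space argmax concrete, here is a toy computation, assuming we already have priors P(s_k) and likelihoods P(c|s_k) for one observed context; all numbers are made up.

```python
import math

# Hypothetical model values for two senses of an ambiguous word.
priors = {"s1": 0.8, "s2": 0.2}          # P(s_k)
likelihoods = {"s1": 0.001, "s2": 0.02}  # P(c|s_k) for the observed context c

# s' = argmax_{s_k} [log P(c|s_k) + log P(s_k)]; P(c) is omitted
# because it is the same for every candidate sense.
scores = {s: math.log(likelihoods[s]) + math.log(priors[s]) for s in priors}
print(max(scores, key=scores.get))  # -> s2: strong contextual evidence beats the prior
```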
Bayesian Classification: Naïve Bayes
◼ Naïve Bayes:
  ◼ is widely used in ML due to its ability to efficiently combine evidence from a wide variety of features.
  ◼ can be applied if the state of the world on which we base our classification can be described as a series of attributes.
  ◼ in this case, we describe the context of w in terms of the words v_j that occur in the context.
◼ Naïve Bayes assumption:
  ◼ The attributes used for classification are conditionally independent (as in the sketch after this slide):
    P(c|s_k) = P({v_j | v_j in c} | s_k) = Π_{v_j in c} P(v_j|s_k)
◼ Two consequences:
  ◼ The structure and linear ordering of words is ignored: a bag-of-words model.
  ◼ The presence of one word is treated as independent of another, which is clearly untrue in text.
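A minimal sketch of the independence assumption at work: the context likelihood becomes a product of per-word probabilities (a sum in log space), with word order ignored. The per-word probabilities are hypothetical.

```python
import math

# Hypothetical P(v_j|s_k) for one sense s_k; a real model would
# estimate these from a labeled corpus.
p_word_given_sense = {"money": 0.05, "deposit": 0.02, "river": 0.001}

# The context is a bag of words: order does not matter, and a repeated
# word simply contributes its factor more than once.
context = ["deposit", "money", "money"]

# log P(c|s_k) = sum over v_j in c of log P(v_j|s_k)
log_likelihood = sum(math.log(p_word_given_sense[v]) for v in context)
print(log_likelihood)
```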
Bayesian Classification: Naïve Bayes
◼ Although the Naïve Bayes assumption is incorrect for text, the classifier often does quite well, partly because its decisions can be optimal even when the independence assumption is inaccurate.
◼ Decision rule for Naïve Bayes: decide s' if
  s' = argmax_{s_k} [log P(s_k) + Σ_{v_j in c} log P(v_j|s_k)]
◼ P(v_j|s_k) and P(s_k) are computed via Maximum-Likelihood Estimation, perhaps with appropriate smoothing, from a labeled training corpus (see the sketch below):
  ◼ P(v_j|s_k) = C(v_j, s_k) / C(s_k)
  ◼ P(s_k) = C(s_k) / C(w)
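Putting the pieces together, a compact end-to-end sketch: MLE counts from a tiny invented training set, add-one smoothing so unseen words do not zero out a sense, and the log-space decision rule. Note the smoothed word estimate here divides by the number of word tokens seen with the sense plus the vocabulary size, a common variant of the unsmoothed C(v_j, s_k)/C(s_k) estimate above.

```python
import math
from collections import Counter, defaultdict

# Tiny invented training set: (context words, annotated sense of "bank").
training = [
    (["money", "deposit", "loan"], "FINANCE"),
    (["interest", "money"], "FINANCE"),
    (["water", "shore", "fishing"], "RIVER"),
]

sense_counts = Counter(sense for _, sense in training)  # C(s_k)
word_counts = defaultdict(Counter)                      # C(v_j, s_k)
vocab = set()
for context, sense in training:
    word_counts[sense].update(context)
    vocab.update(context)

def classify(context):
    """Return argmax_{s_k} [log P(s_k) + sum_{v_j in c} log P(v_j|s_k)]."""
    total_instances = sum(sense_counts.values())        # C(w)
    scores = {}
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total_instances)  # log P(s_k)
        denom = sum(word_counts[sense].values()) + len(vocab)    # add-one smoothing
        for v in context:
            score += math.log((word_counts[sense][v] + 1) / denom)
        scores[sense] = score
    return max(scores, key=scores.get)

print(classify(["money", "loan"]))  # -> FINANCE
```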