Classification: Naïve Bayes

The Naïve Bayes Classifier
Example: class C = Flu, with features X1, X2, X3, X4, X5 = runny nose, sinus, cough, fever, muscle-ache.
▪ Conditional Independence Assumption: the features detect term presence and are independent of each other given the class:
  P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)
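The assumption above can be sketched in code: under conditional independence, the class-conditional likelihood of an observation is just the product of per-feature conditionals. The feature names and probability values below are illustrative, not taken from the slides.

```python
# Sketch: Naive Bayes joint likelihood under conditional independence.
# The per-feature probabilities are made-up illustrative numbers.

# P(X_i = True | C = flu) for the five symptom features
p_feature_given_flu = {
    "runny_nose": 0.8, "sinus": 0.7, "cough": 0.6,
    "fever": 0.5, "muscle_ache": 0.4,
}

def joint_likelihood(observed, cond_probs):
    """P(X1, ..., X5 | C) = product over features of P(Xi | C)."""
    prod = 1.0
    for feat, present in observed.items():
        p = cond_probs[feat]
        prod *= p if present else (1.0 - p)  # P(Xi = f | C) = 1 - P(Xi = t | C)
    return prod

obs = {"runny_nose": True, "sinus": True, "cough": False,
       "fever": True, "muscle_ache": False}
print(joint_likelihood(obs, p_feature_given_flu))  # 0.8*0.7*0.4*0.5*0.6 = 0.0672
```

Note that five binary features need only five stored parameters per class under this assumption, instead of 2^5 joint entries.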
Learning the Model
▪ First attempt: maximum likelihood estimates
▪ Simply use the frequencies in the data:
  P̂(c_j) = N(C = c_j) / N
  P̂(x_i | c_j) = N(X_i = x_i, C = c_j) / N(C = c_j)
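As a minimal sketch, the two maximum likelihood estimates above are plain relative frequencies over a labeled training set. The toy data below are invented for illustration.

```python
# Sketch: maximum likelihood estimates from labeled examples.
# Each example is (class_label, {feature: value}); the data are made up.
from collections import Counter

data = [
    ("flu",     {"cough": True,  "fever": True}),
    ("flu",     {"cough": True,  "fever": False}),
    ("not_flu", {"cough": False, "fever": False}),
    ("not_flu", {"cough": True,  "fever": False}),
]

class_counts = Counter(c for c, _ in data)
N = len(data)

def p_class(c):
    # P̂(c_j) = N(C = c_j) / N
    return class_counts[c] / N

def p_feature_given_class(feat, value, c):
    # P̂(x_i | c_j) = N(X_i = x_i, C = c_j) / N(C = c_j)
    joint = sum(1 for cls, feats in data if cls == c and feats[feat] == value)
    return joint / class_counts[c]

print(p_class("flu"))                               # 2/4 = 0.5
print(p_feature_given_class("cough", True, "flu"))  # 2/2 = 1.0
```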
Problem with Maximum Likelihood
P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)
▪ What if we have seen no training documents with the word "muscle-ache" classified in the topic Flu?
  P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0
▪ Zero probabilities cannot be conditioned away, no matter the other evidence!
  ĉ = argmax_c P̂(c) ∏_i P̂(x_i | c)
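A quick sketch of why this matters: because the classifier scores each class by a product, a single zero factor forces the whole score to zero, regardless of how strongly the other features support the class. The probability values are illustrative.

```python
# Sketch: one zero conditional probability zeroes out the whole class score,
# no matter how strong the other evidence is. Numbers are illustrative.
import math

# Four strongly supportive features, one feature never seen with Flu:
probs_given_flu = [0.9, 0.9, 0.9, 0.9, 0.0]

score = math.prod(probs_given_flu)
print(score)  # 0.0: the four strong features cannot compensate
```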
Smoothing to Avoid Overfitting
▪ Laplace smoothing:
  P̂(x_i | c_j) = (N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + k)
  where k = number of values of X_i
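The smoothed estimate can be sketched directly from the formula; for a binary feature, k = 2. The counts used below are illustrative.

```python
# Sketch: Laplace (add-one) smoothing for the conditional estimates.
# k is the number of values the feature can take (2 for a binary feature).
def laplace_estimate(count_joint, count_class, k):
    # P̂(x_i | c_j) = (N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + k)
    return (count_joint + 1) / (count_class + k)

# A feature never observed with the class no longer gets probability 0:
print(laplace_estimate(0, 10, 2))  # 1/12, not 0
```

Adding 1 to every count acts as if each feature value had been seen once in every class, so no estimate can be exactly zero while the estimates still sum to 1 over the k values.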
Naïve Bayes: Learning
Running example: document classification
▪ From training corpus, extract Vocabulary
▪ Calculate required P(c_j) and P(x_k | c_j) terms
▪ For each c_j in C do:
  ▪ docs_j ← subset of documents for which the target class is c_j
  ▪ P(c_j) ← |docs_j| / |total # documents|
  ▪ Text_j ← single document containing all docs_j
  ▪ For each word x_k in Vocabulary:
    ▪ n_jk ← number of occurrences of x_k in Text_j
    ▪ n_j ← number of words in Text_j
    ▪ P(x_k | c_j) ← (n_jk + 1) / (n_j + |Vocabulary|)
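The training procedure above can be sketched end to end on a toy corpus. The documents, class names, and whitespace tokenization below are assumptions for illustration, not part of the slides.

```python
# Sketch of the training loop above for document classification
# with Laplace smoothing; the corpus is made up for illustration.
from collections import Counter

corpus = [
    ("sports",   "the match was a great match"),
    ("sports",   "the team won the match"),
    ("politics", "the vote was close"),
]

vocabulary = sorted({w for _, text in corpus for w in text.split()})
classes = sorted({c for c, _ in corpus})

priors, cond = {}, {}
for c in classes:
    docs_c = [text for cls, text in corpus if cls == c]  # docs_j
    priors[c] = len(docs_c) / len(corpus)                # P(c_j)
    text_c = " ".join(docs_c).split()                    # Text_j
    n_c = len(text_c)                                    # n_j
    counts = Counter(text_c)                             # n_jk per word
    for w in vocabulary:
        # P(x_k | c_j) <- (n_jk + 1) / (n_j + |Vocabulary|)
        cond[(w, c)] = (counts[w] + 1) / (n_c + len(vocabulary))

print(priors["sports"])           # 2/3
print(cond[("match", "sports")])  # (3 + 1) / (11 + 9) = 0.2
```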