we see,
\[
\begin{aligned}
\tfrac{1}{2}\Big[&\,x^{T}\Sigma^{-1}x - x^{T}\Sigma^{-1}\mu_{2} - \mu_{2}^{T}\Sigma^{-1}x + \mu_{2}^{T}\Sigma^{-1}\mu_{2} - x^{T}\Sigma^{-1}x + x^{T}\Sigma^{-1}\mu_{1} + \mu_{1}^{T}\Sigma^{-1}x - \mu_{1}^{T}\Sigma^{-1}\mu_{1}\Big]\\
&= \tfrac{1}{2}\Big[\,2\mu_{1}^{T}\Sigma^{-1}x - 2\mu_{2}^{T}\Sigma^{-1}x + \mu_{2}^{T}\Sigma^{-1}\mu_{2} - \mu_{1}^{T}\Sigma^{-1}\mu_{1}\Big] \qquad (2.25)\\
&= (\mu_{1}-\mu_{2})^{T}\Sigma^{-1}x - \tfrac{1}{2}(\mu_{1}-\mu_{2})^{T}\Sigma^{-1}(\mu_{1}+\mu_{2}). \qquad (2.26)
\end{aligned}
\]
Consequently, after taking the natural log of (2.18), the desired allocation rule follows. □

However, Theorem (2.1.2) supposes that the values of $\mu_1$, $\mu_2$, $\Sigma$ are known, which is almost never the case. Thus, we need to use estimates from our data. Suppose, then, that we have a set of training data $\{x_{ij}\}$, where $i$ indexes the population ($i = 1$ or $2$) and $j$ indexes the data value ($j = 1, \dots, n_i$). Then the sample mean vector is calculated by averaging the data vectors from each class, i.e.
\[
\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}.
\]
Moreover, we calculate the sample covariance matrix for each population by
\[
S_i = \frac{1}{n_i-1}\sum_{j=1}^{n_i} (x_{ij}-\bar{x}_i)(x_{ij}-\bar{x}_i)^T.
\]
(Note that $\bar{x}_i$ is an $(m\times 1)$ vector and $S_i$ is an $(m\times m)$ matrix.)

Since we assume that the two covariance matrices are equal, we pool the two sample covariance matrices to derive an unbiased estimate of $\Sigma$:
\[
S_{\mathrm{pooled}} = \left[\frac{n_1-1}{(n_1-1)+(n_2-1)}\right]S_1 + \left[\frac{n_2-1}{(n_1-1)+(n_2-1)}\right]S_2 = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2}.
\]
Substituting our estimates into our allocation rule yields:

Theorem 2.1.3. [4] Allocate $x$ to $\pi_1$ if
\[
(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}x - \tfrac{1}{2}(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}(\bar{x}_1+\bar{x}_2) > \ln\!\left(\frac{c(1|2)}{c(2|1)}\times\frac{p_2}{p_1}\right). \qquad (2.27)
\]

Note that once we use parameter estimates rather than the known population quantities, the derived allocation rule is itself only an estimate of the optimal allocation rule. However, if the sample sizes are large, it is reasonable to expect this estimate to be close to the optimal rule.
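To make the estimation step concrete, here is a minimal sketch of Theorem 2.1.3 in NumPy. The synthetic data and the helper names (pooled_covariance, lda_allocate, and the cost/prior arguments) are illustrative assumptions introduced here, not part of [4].

```python
import numpy as np

def pooled_covariance(X1, X2):
    """Unbiased pooled covariance estimate; rows of X1, X2 are observations."""
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False, ddof=1)   # S_1
    S2 = np.cov(X2, rowvar=False, ddof=1)   # S_2
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def lda_allocate(x, X1, X2, c12=1.0, c21=1.0, p1=0.5, p2=0.5):
    """Sample allocation rule (2.27): return 1 for pi_1, otherwise 2."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    Sp_inv = np.linalg.inv(pooled_covariance(X1, X2))
    lhs = (xbar1 - xbar2) @ Sp_inv @ x - 0.5 * (xbar1 - xbar2) @ Sp_inv @ (xbar1 + xbar2)
    return 1 if lhs > np.log((c12 / c21) * (p2 / p1)) else 2

# Illustrative data: two Gaussian samples sharing the covariance matrix Sigma.
rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], Sigma, size=50)
X2 = rng.multivariate_normal([2.0, 1.0], Sigma, size=50)
print(lda_allocate(np.array([0.2, 0.1]), X1, X2))   # a point near xbar1; typically prints 1
```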
Lastly, let us consider the case in which the product of the cost ratio and the prior probability ratio equals one, i.e.
\[
\frac{c(1|2)}{c(2|1)}\times\frac{p_2}{p_1} = 1,
\]
which changes the inequality in Theorem (2.1.3) to
\[
(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}x - \tfrac{1}{2}(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}(\bar{x}_1+\bar{x}_2) > \ln(1), \qquad (2.28)
\]
that is, since $\ln(1)=0$,
\[
(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}x - \tfrac{1}{2}(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}(\bar{x}_1+\bar{x}_2) > 0, \qquad (2.29)
\]
\[
\Rightarrow\quad (\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}x > \tfrac{1}{2}(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}(\bar{x}_1+\bar{x}_2). \qquad (2.30)
\]
If we denote the row vector $(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}$ by $\hat{l}$, then we can write (2.30) as
\[
\hat{l}\,x > \tfrac{1}{2}\hat{l}\,(\bar{x}_1+\bar{x}_2). \qquad (2.31)
\]
In essence, we can think of the decision rule as reducing the data to one dimension by projecting it onto the vector $\hat{l}$, and allocating a new point $x$ according to where its projection lies relative to the midpoint of the two projected class means, $\tfrac{1}{2}\hat{l}(\bar{x}_1+\bar{x}_2)$.
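Seen this way, rule (2.31) is simply a one-dimensional threshold test. The sketch below (under the same assumed NumPy setup as the previous sketch; the names discriminant_direction and classify_by_midpoint are introduced here for illustration) obtains $\hat{l}$ by solving $S_{\mathrm{pooled}}\,l = \bar{x}_1-\bar{x}_2$, which gives the same vector as the row vector $(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}$ because $S_{\mathrm{pooled}}$ is symmetric, and then compares the projection of a new point with the midpoint $\tfrac{1}{2}\hat{l}(\bar{x}_1+\bar{x}_2)$.

```python
import numpy as np

def discriminant_direction(X1, X2):
    """l_hat: solve S_pooled l = xbar1 - xbar2 (equivalent since S_pooled is symmetric)."""
    n1, n2 = len(X1), len(X2)
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    return np.linalg.solve(S_pooled, X1.mean(axis=0) - X2.mean(axis=0))

def classify_by_midpoint(x, X1, X2):
    """Rule (2.31): project x onto l_hat and compare with the projected midpoint m_hat."""
    l_hat = discriminant_direction(X1, X2)
    m_hat = 0.5 * l_hat @ (X1.mean(axis=0) + X2.mean(axis=0))
    return 1 if l_hat @ x > m_hat else 2

# With the synthetic X1, X2 of the previous sketch, classify_by_midpoint agrees with
# lda_allocate whenever the cost and prior ratios cancel, as in (2.28)-(2.31).
```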
2.2 Fisher's Discriminant Analysis

Figure 2.1: Fisher's Discriminant Analysis: Striking a balance between class separation and in-class variance, [2].

R.A. Fisher proposed another method for linear discriminant analysis that does not presuppose any known distribution of the training data. He does, however, begin with the idea of dimensionality reduction shown above. Again, suppose that we have our training data set $\{x_{ij}\}$ as defined above, where the populations $\pi_1$ and $\pi_2$ do not necessarily have the same distribution but have equal covariance matrices $\Sigma_1 = \Sigma_2 = \Sigma$. Then, given a vector $w$ with unit length, we can project our data in the direction of $w$ by taking the dot product of each data vector with $w$: $\{w \cdot x_{ij}\}$.

Onto what kind of vector $w$ do we want to project our data? To make a good classifier, the two projected class means ought to be far apart, i.e. we want to make $w^T(\bar{x}_2 - \bar{x}_1)$ large. (To obtain a bounded solution, we require $\|w\| = 1$.) That way the differences between the two classes are maximally highlighted, and evaluating the projection of a new point against the projected sample means can accurately classify the largest number of points. The goal of classifying the largest number of unknown points accurately on the basis of the training data is called generalization ability, and it motivates Support Vector Machines, the subject of the next chapter.

But separating the projected class means is only one concern: if the classes have a large variance along the projection vector, then class overlap can still make classification inaccurate. Refer to Figure 2.1. The picture on the left projects the data onto the vector connecting the two sample means, which guarantees that the distance between the two projected means is maximized. However, since the two projected classes are not partitioned along the projection vector, the problem of classifying new points is no easier. In the picture on the right, by contrast, the chosen projection vector both separates the projected sample means and partitions the two classes along the projection vector. Classification of a new point $x_0$ is then based on comparing the projection of $x_0$ to the midpoint between the two projected sample means. To tend toward a vector that partitions our classes, we want to minimize the within-class variance, i.e. we want to minimize
\[
w^T S_{\mathrm{pooled}}\, w = \frac{w^T\Big[\sum_{j=1}^{n_1}(x_{1j}-\bar{x}_1)(x_{1j}-\bar{x}_1)^T + \sum_{j=1}^{n_2}(x_{2j}-\bar{x}_2)(x_{2j}-\bar{x}_2)^T\Big]w}{n_1+n_2-2}.
\]
(Recall that $w^T S_{\mathrm{pooled}}\, w$ is the weighted sum of the variances of the two projected populations.)

We can balance the separation of the projected means against the within-class variance in a single criterion in order to find our desired projection vector:
\[
\text{Maximize } J(w) = \frac{\big(w^T(\bar{x}_2-\bar{x}_1)\big)^2}{w^T S_{\mathrm{pooled}}\, w} \quad \text{such that } \|w\| = 1. \qquad (2.32)
\]
We now prove a maximization lemma which shows that Fisher's Discriminant Analysis gives $w \propto S_{\mathrm{pooled}}^{-1}(\bar{x}_1-\bar{x}_2)$, i.e. $w^T \propto (\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}$.

Lemma 2.2.1 (Maximization Lemma, [4]). For $B$ a positive definite matrix, $d$ a given vector, and $x$ an arbitrary nonzero vector,
\[
\max_{x \neq 0} \frac{(x^T d)^2}{x^T B x} = d^T B^{-1} d, \qquad (2.33)
\]
with the maximum attained when $x = cB^{-1}d$ for some constant $c \neq 0$.

Proof. By the extended Cauchy-Schwarz inequality, $(x^T d)^2 \le (x^T B x)(d^T B^{-1} d)$. Dividing both sides by $x^T B x$ yields $\frac{(x^T d)^2}{x^T B x} \le d^T B^{-1} d$, so $d^T B^{-1} d$ is an upper bound. Equality in the extended Cauchy-Schwarz inequality holds precisely when $x = cB^{-1}d$ for some $c \neq 0$, so the bound is attained and the maximum is $d^T B^{-1} d$, as desired. □

Applying the lemma with $d = \bar{x}_2 - \bar{x}_1$ and $B = S_{\mathrm{pooled}}$ shows that the ratio in (2.32) is maximized by any $w = cS_{\mathrm{pooled}}^{-1}(\bar{x}_2-\bar{x}_1)$ with $c \neq 0$; since the criterion is unchanged by a sign flip, we may take $w \propto S_{\mathrm{pooled}}^{-1}(\bar{x}_1-\bar{x}_2)$, with the constraint $\|w\| = 1$ fixing the scale.
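As a quick numerical illustration of the lemma (a sketch on synthetic data chosen here, not an example from [4]), the Fisher direction $S_{\mathrm{pooled}}^{-1}(\bar{x}_2-\bar{x}_1)$ should attain a value of $J$ in (2.32) at least as large as that of the naive choice $w \propto \bar{x}_2-\bar{x}_1$, which is precisely the contrast drawn in Figure 2.1.

```python
import numpy as np

def J(w, X1, X2):
    """Fisher's criterion (2.32): squared projected mean gap over pooled projected variance."""
    w = w / np.linalg.norm(w)                       # enforce ||w|| = 1
    n1, n2 = len(X1), len(X2)
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    gap = w @ (X2.mean(axis=0) - X1.mean(axis=0))
    return gap**2 / (w @ S_pooled @ w)

rng = np.random.default_rng(1)
Sigma = np.array([[3.0, 2.5], [2.5, 3.0]])          # elongated shared covariance
X1 = rng.multivariate_normal([0.0, 0.0], Sigma, size=200)
X2 = rng.multivariate_normal([2.0, 0.0], Sigma, size=200)

d = X2.mean(axis=0) - X1.mean(axis=0)
S_pooled = (199 * np.cov(X1, rowvar=False) + 199 * np.cov(X2, rowvar=False)) / 398  # n1 = n2 = 200
w_fisher = np.linalg.solve(S_pooled, d)             # proportional to S_pooled^{-1} d

print(J(d, X1, X2), J(w_fisher, X1, X2))            # the second value is at least as large
```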
Note that this is not a classification method in and of itself, because Fisher's procedure only finds the optimal vector onto which to project the data. However, as mentioned above, classification should compare the projection of a new point to the midpoint between the two projected sample means:

Theorem 2.2.2 (Allocation Rule Based on FDA). Classify an unclassified point $x_0$ to $\pi_1$ if
\[
y_0 = (\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}x_0 > \hat{m} = \tfrac{1}{2}(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}(\bar{x}_1+\bar{x}_2), \qquad (2.34)
\]
to $\pi_2$ if
\[
y_0 = (\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}x_0 < \hat{m} = \tfrac{1}{2}(\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}(\bar{x}_1+\bar{x}_2), \qquad (2.35)
\]
and treat $x_0$ as unclassifiable if equality holds.

Recall that in the previous section we defined the vector $\hat{l} = (\bar{x}_1-\bar{x}_2)^T S_{\mathrm{pooled}}^{-1}$ and showed that, given no cost or prior probability concerns, Linear Discriminant Analysis can be interpreted as projecting an unknown datum onto a vector and then comparing that projection to the midpoint between the two projected sample means. Thus Fisher's method produces exactly the allocation rule obtained by Linear Discriminant Analysis under the assumption that the two populations are similarly distributed with equal covariance matrices, provided there are no cost or prior probability concerns.

We should mention that the assumption of equal covariance matrices is important, because it justifies our choice of the midpoint as the classification critical value. Consider the trivial example shown in Figure 2.2. Here the two populations clearly have different variances, and thus one population occupies a greater portion of the projection vector. The midpoint is then a poor choice of critical value, and an accurate decision function will have its critical value closer to $\mu_1$.

2.3 Further Concerns

In this chapter we have looked at two basic methods of linear discrimination, one parametric and the other non-parametric. However, we only covered cases where the two within-class covariance matrices were equal ($\Sigma_1 = \Sigma_2 = \Sigma$). We have just seen that Fisher's Discriminant Analysis fails when the two class covariances differ, but Linear Discriminant Analysis can account for this by plugging the two covariance matrix estimates separately into the pdfs $f_1(x)$ and $f_2(x)$. Another issue not covered in this section is the difference between the estimated allocation rule and the optimal allocation rule when using LDA. The interested reader can find plenty of literature outlining this difference; a good place to start is Section 11.4 in [4].
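Returning to the covariance remedy mentioned above, the sketch below (an illustration only; it assumes SciPy, and the name quadratic_allocate is introduced here) estimates a separate covariance matrix for each class and plugs the resulting Gaussian density estimates directly into the density-ratio comparison, which is the usual route to quadratic discriminant analysis.

```python
import numpy as np
from scipy.stats import multivariate_normal

def quadratic_allocate(x, X1, X2, c12=1.0, c21=1.0, p1=0.5, p2=0.5):
    """Fit a separate Gaussian to each class and allocate x to pi_1 when
    f1(x)/f2(x) >= (c(1|2)/c(2|1)) * (p2/p1), otherwise to pi_2."""
    f1 = multivariate_normal(mean=X1.mean(axis=0), cov=np.cov(X1, rowvar=False))
    f2 = multivariate_normal(mean=X2.mean(axis=0), cov=np.cov(X2, rowvar=False))
    return 1 if f1.pdf(x) / f2.pdf(x) >= (c12 / c21) * (p2 / p1) else 2

# Illustrative data with very different class variances, in the spirit of Figure 2.2.
rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0.0, 0.0], 0.25 * np.eye(2), size=100)   # tight class
X2 = rng.multivariate_normal([3.0, 0.0], 4.00 * np.eye(2), size=100)   # diffuse class
print(quadratic_allocate(np.array([1.5, 0.0]), X1, X2))   # the midpoint of the means is
# typically assigned to pi_2, showing why the midpoint is a poor cut-off here.
```

Because the two fitted densities now have different spreads, the implied critical value moves toward the tighter class, consistent with the discussion around Figure 2.2.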
Figure 2.2: A trivial example of two unequal class variances.