School of Software Engineering, Tongji University
Chi-Square Calculation: An Example

|                          | Play chess | Not play chess | Sum (row) |
| Like science fiction     | 250 (90)   | 200 (360)      | 450       |
| Not like science fiction | 50 (210)   | 1000 (840)     | 1050      |
| Sum (col.)               | 300        | 1200           | 1500      |

◼ χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

◼ It shows that like_science_fiction and play_chess are correlated in the group.
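The calculation on this slide can be reproduced in a few lines. The sketch below assumes the usual rule for expected counts (row sum times column sum, divided by the grand total), which matches the values in parentheses in the table; the names `observed` and `chi_square` are illustrative and not part of the slides.

```python
# Observed counts from the contingency table on this slide.
observed = [
    [250, 200],   # like science fiction: play chess, not play chess
    [50, 1000],   # not like science fiction: play chess, not play chess
]

def chi_square(table):
    """Return (chi-square statistic, expected counts) for a 2-D table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, obs in enumerate(row):
            # Expected count under the assumption that rows and columns
            # are independent.
            e = row_sums[i] * col_sums[j] / total
            expected_row.append(e)
            chi2 += (obs - e) ** 2 / e
        expected.append(expected_row)
    return chi2, expected

chi2, expected = chi_square(observed)
print(expected)
print(chi2)
```

The expected counts come out as 90, 360, 210, and 840, and the statistic is approximately 507.94, which the slide reports truncated as 507.93.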
Correlation Analysis (Numeric Data)
◼ Correlation coefficient (also called Pearson's product moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(a_i b_i)$ is the sum of the AB cross-product.
◼ If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
◼ $r_{A,B} = 0$: no linear correlation (this alone does not imply independence); $r_{A,B} < 0$: negatively correlated.
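As a quick check that the two forms of the formula agree, here is a minimal sketch. It assumes that the standard deviations are sample standard deviations (denominator n - 1), which is what makes the (n - 1) factor in the formula work out; the function name `pearson_r` and the data are made up for illustration.

```python
from statistics import mean, stdev  # stdev is the sample (n - 1) version

def pearson_r(a, b):
    """Return r computed with both forms of the formula on this slide."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    denom = (n - 1) * stdev(a) * stdev(b)

    # Form 1: sum of products of deviations from the means.
    centered = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Form 2: sum of cross-products minus n * mean(A) * mean(B).
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b

    return centered / denom, cross / denom

# Illustrative data (not from the slides).
a = [2.0, 4.0, 6.0, 8.0, 10.0]
b = [1.0, 3.0, 2.0, 5.0, 4.0]
r1, r2 = pearson_r(a, b)
print(r1, r2)
```

For this data both forms give r of about 0.8, a fairly strong positive correlation.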
Visually Evaluating Correlation
[Figure: scatter plots showing the similarity from -1 to 1]
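Correlation (viewed as linear relationship)
◼ Correlation measures the linear relationship between objects.
◼ To compute correlation, we standardize the data objects A and B and then take their dot product:

$$a'_k = (a_k - \mathrm{mean}(A)) / \mathrm{std}(A)$$
$$b'_k = (b_k - \mathrm{mean}(B)) / \mathrm{std}(B)$$
$$\mathrm{correlation}(A, B) = A' \cdot B'$$

When std is the sample standard deviation, the dot product is n - 1 times the Pearson coefficient $r_{A,B}$, so it is divided by n - 1 to obtain a value in [-1, 1].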
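The standardize-then-dot-product view can be sketched as follows. This assumes the sample standard deviation and divides the dot product by n - 1 so that the result lies in [-1, 1]; `standardize` and `correlation` are illustrative names, and the data are made up.

```python
from statistics import mean, stdev  # stdev is the sample (n - 1) version

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(a, b):
    """Dot product of the standardized vectors, scaled by 1 / (n - 1)."""
    a_std, b_std = standardize(a), standardize(b)
    dot = sum(x * y for x, y in zip(a_std, b_std))
    return dot / (len(a) - 1)

# A perfectly increasing and a perfectly decreasing linear relationship.
up = correlation([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
down = correlation([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
print(up, down)
```

The two results are 1 and -1 up to floating-point rounding, the extremes of the correlation range.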
Covariance (Numeric Data)
◼ Covariance is similar to correlation:

$$Cov(A, B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
◼ Positive covariance: If $Cov_{A,B} > 0$, then A and B tend to be above their expected values at the same time.
◼ Negative covariance: If $Cov_{A,B} < 0$, then when A is larger than its expected value, B is likely to be smaller than its expected value.
◼ Independence: If A and B are independent, then $Cov_{A,B} = 0$, but the converse is not true:
◆ Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
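A small example makes the last point concrete. The sketch below assumes the expectation form of covariance (sum of products of deviations divided by n); the name `covariance` and the data are illustrative. Here B is completely determined by A (B = A squared), yet the covariance is 0.

```python
from statistics import mean

def covariance(a, b):
    """Average product of deviations from the means (divides by n)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

# B depends entirely on A, but not linearly.
a = [-2.0, -1.0, 0.0, 1.0, 2.0]
b = [x * x for x in a]

cov_ab = covariance(a, b)   # zero covariance despite full dependence
cov_aa = covariance(a, a)   # covariance of A with itself is its variance
print(cov_ab, cov_aa)
```

Because the positive and negative deviations of A cancel symmetrically against B, the covariance vanishes even though knowing A fixes B exactly.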