Copyright © 2019 by Xiaoyu Li.

(1) Data Integration
● Data integration: combining data from multiple data sources into a single consistent data store.
● Schema integration: integrate the metadata of the different data sources.
● Entity identification problem: match records from multiple data sources that refer to the same real-world entity, e.g. A.cust-id = B.customer_no.
● Semantic integration problem.
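The entity identification example A.cust-id = B.customer_no can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the `integrate` helper, and the unified attribute name `customer_id` are hypothetical, and the sketch assumes that equal key values really do denote the same customer.

```python
# Hypothetical rows from two sources; only the key names come from the slide.
source_a = [
    {"cust-id": 101, "name": "Ann"},
    {"cust-id": 102, "name": "Bob"},
]
source_b = [
    {"customer_no": 102, "city": "Oslo"},
    {"customer_no": 103, "city": "Paris"},
]


def integrate(a_rows, b_rows, a_key="cust-id", b_key="customer_no"):
    """Join rows whose key attributes are assumed to identify the same entity."""
    # Index source B by its key so each lookup is a dictionary access.
    b_index = {row[b_key]: row for row in b_rows}
    merged = []
    for a in a_rows:
        b = b_index.get(a[a_key])
        if b is None:
            continue  # no matching entity in source B
        # Store the shared key once, under a unified (assumed) name.
        record = {"customer_id": a[a_key]}
        record.update({k: v for k, v in a.items() if k != a_key})
        record.update({k: v for k, v in b.items() if k != b_key})
        merged.append(record)
    return merged


matched = integrate(source_a, source_b)
print(matched)
```

Only customer 102 appears in both sources, so it is the only record in the integrated result. Real entity identification usually also has to handle differing formats and near-matches, which this sketch ignores.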
(1) Data Integration
[Figure: Data Source A, Data Source B, and Data Source C each connect through a Wrapper to the Mediated Schema ("Virtual Database").]
Fig. 1 Simple schematic for a data-integration solution. A system designer constructs a mediated schema against which users can run queries.
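The architecture in Fig. 1 might be sketched as follows. The class names `Wrapper` and `MediatedSchema`, the attribute mappings, and the sample rows are all assumptions made for illustration; the slide only names the components, not their interfaces.

```python
class Wrapper:
    """Translates one source's rows into the mediated schema's attribute names."""

    def __init__(self, rows, mapping):
        self.rows = rows
        self.mapping = mapping  # source attribute name -> mediated attribute name

    def fetch(self):
        for row in self.rows:
            # Keep only mapped attributes, renamed to the mediated names.
            yield {self.mapping[k]: v for k, v in row.items() if k in self.mapping}


class MediatedSchema:
    """A 'virtual database': stores nothing, forwards queries to the wrappers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate):
        return [row for w in self.wrappers for row in w.fetch() if predicate(row)]


# Hypothetical sources that name the same attributes differently.
wrapper_a = Wrapper(
    [{"cust-id": 1, "revenue": 50}],
    {"cust-id": "customer_id", "revenue": "revenue"},
)
wrapper_b = Wrapper(
    [{"customer_no": 2, "annual_rev": 120}],
    {"customer_no": "customer_id", "annual_rev": "revenue"},
)

virtual_db = MediatedSchema([wrapper_a, wrapper_b])
high_revenue = virtual_db.query(lambda row: row["revenue"] > 100)
print(high_revenue)
```

The user writes the query once, against the mediated attribute names, and never sees the per-source naming differences.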
(2) Redundant Data
● An attribute (annual revenue, for instance) may be redundant if it can be derived from another attribute or set of attributes.
● Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
● Correlation analysis: for nominal data, we use the χ² (chi-square) test; for numeric attributes, we use the correlation coefficient and covariance.
(3) Correlation Analysis
● For nominal data, we use the χ² (chi-square) test.
● For numeric attributes, we use the correlation coefficient and covariance.
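For numeric attributes, a minimal sketch of the two measures is given below. It assumes the population form of covariance (dividing by N) and the Pearson correlation coefficient; the slides do not state which variants are intended, and the sample values are hypothetical.

```python
from math import sqrt


def covariance(a, b):
    """Population covariance: mean of the products of deviations from the means."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def correlation(a, b):
    """Pearson correlation: covariance scaled by both standard deviations."""
    # covariance(a, a) is the variance of a.
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))


# Hypothetical attributes: annual revenue is derived as 12 * monthly revenue.
monthly_revenue = [10, 20, 30, 40]
annual_revenue = [120, 240, 360, 480]

cov = covariance(monthly_revenue, annual_revenue)
r = correlation(monthly_revenue, annual_revenue)
print(cov, r)
```

A coefficient at or near 1 (or -1) suggests that one attribute carries little information beyond the other, so it is a candidate for removal as redundant.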
1) Nominal: χ² (chi-square) test

Contingency table: the c distinct values a_1, a_2, ..., a_c of attribute A label the columns, and the r distinct values b_1, b_2, ..., b_r of attribute B label the rows. Each cell corresponds to the joint event (A = a_i, B = b_j).

$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

$$e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{N}$$

● $o_{ij}$ is the observed frequency of the joint event $(a_i, b_j)$.
● $e_{ij}$ is the expected frequency of $(a_i, b_j)$.
● $N$ is the number of tuples.
● Degrees of freedom: $(c-1) \times (r-1)$.
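The χ² statistic and its degrees of freedom can be computed directly from a contingency table of observed counts. The sketch below assumes the table is given as a list of rows (one per value of B) with one column per value of A; the function name `chi_square` and the counts are hypothetical.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    observed[j][i] is the observed count of the joint event (A = a_i, B = b_j).
    """
    row_totals = [sum(row) for row in observed]         # count(B = b_j)
    col_totals = [sum(col) for col in zip(*observed)]   # count(A = a_i)
    n = sum(row_totals)                                 # number of tuples N

    chi2 = 0.0
    for j, row in enumerate(observed):
        for i, observed_count in enumerate(row):
            # Expected count if A and B were independent.
            expected_count = col_totals[i] * row_totals[j] / n
            chi2 += (observed_count - expected_count) ** 2 / expected_count

    dof = (len(col_totals) - 1) * (len(row_totals) - 1)
    return chi2, dof


# Hypothetical 2 x 2 table of observed counts.
table = [
    [10, 20],
    [20, 10],
]
chi2, dof = chi_square(table)
print(chi2, dof)
```

Here every expected count is 15, so each of the four cells contributes 25/15 and the statistic is about 6.67 with 1 degree of freedom. Whether that rejects independence depends on the chosen significance level and a χ² table or library, neither of which this sketch covers.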