Web-log Mining: from Pages to Relations
Qiang Yang
HKUST, Hong Kong, China
2021/1/26
A Spectrum of Web-Log Miners
◼ Knowledge Rich:
  ◼ Have logical relations over page contents
  ◼ Database-generated pages
  ◼ Can make accurate predictions even without data!
◼ Knowledge Middle:
  ◼ Some hierarchical representations of ontology and content
  ◼ Most cases! Interesting!
  ◼ Can predict based on similarity
◼ Knowledge Poor:
  ◼ Have only page-level logs
  ◼ No relational knowledge
  ◼ Can predict observed pages only when data is plentiful
[Figure: "Knowledge" axis spanning the three classes]
1. Knowledge-poor -- Web-log mining
◼ Association-rule-based predictive model
  ◼ Current observations: A B C (Size = m)
  ◼ Window of prediction: ?
◼ Logged sessions:
  A, B, C, D
  A, B, C, F
  A, B, E
  B, C
  B, C, D, G
  C, D
◼ Rules (Left Hand Side → Right Hand Side):
  A, B → C
  B, C, D → G
◼ Steps:
  1. Extract rules
  2. Select rules
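The "extract rules" step can be sketched in a few lines. This is only an illustration under stated assumptions, not the method from the talk: it treats every contiguous run of pages as a candidate left-hand side and the page that immediately follows as the right-hand side, and it defines confidence as the fraction of sessions containing the left-hand side (followed by any page) in which that particular page follows. The names `extract_rules` and `min_support` are hypothetical.

```python
from collections import Counter

# The six sessions shown on the slide.
sessions = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "F"],
    ["A", "B", "E"],
    ["B", "C"],
    ["B", "C", "D", "G"],
    ["C", "D"],
]

def extract_rules(sessions, min_support=1):
    """Return {(lhs, rhs): (support, confidence)}.

    Assumption: lhs is a contiguous run of pages and rhs is the next page.
    Support counts sessions, so a rule is counted once per session.
    """
    lhs_counts = Counter()   # sessions in which lhs is followed by some page
    rule_counts = Counter()  # sessions in which lhs is followed by rhs
    for session in sessions:
        seen_lhs, seen_rules = set(), set()
        for end in range(1, len(session)):
            for start in range(end):
                lhs = tuple(session[start:end])
                seen_lhs.add(lhs)
                seen_rules.add((lhs, session[end]))
        lhs_counts.update(seen_lhs)
        rule_counts.update(seen_rules)
    return {
        rule: (count, count / lhs_counts[rule[0]])
        for rule, count in rule_counts.items()
        if count >= min_support
    }

rules = extract_rules(sessions)
print(rules[(("A", "B"), "C")])       # support and confidence of A,B -> C
print(rules[(("B", "C", "D"), "G")])  # support and confidence of B,C,D -> G
```

Under these assumptions, A,B → C is supported by two of the three sessions in which A,B is followed by another page, while B,C,D → G is supported by a single session.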
Rule-Representation Methods (min sup = 2)
◼ Subset: {A, C} → C
◼ Substring: "BC" → C
◼ Latest Substring: "C" → C
◼ Subsequence
◼ Latest Subsequence

      W1        W2
  A1  A2  A3  P1  P2
  A   B   C   A   C
  B   C   A   C   D
  C   A   C   D   C
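The five representations differ only in how a rule's left-hand side is matched against W1. The sketch below makes that concrete under several assumptions that the slide does not state: the fifteen letters on the slide are read as three rows of (A1 A2 A3 | P1 P2); a rule's right-hand side counts as a hit when it appears anywhere in W2; a "latest" match must end at the last page of W1; and a subsequence allows gaps but preserves order. The names `matches` and `support` are hypothetical.

```python
# Windows read off the slide as (W1, W2) pairs; the row layout is an assumption.
WINDOWS = [("ABC", "AC"), ("BCA", "CD"), ("CAC", "DC")]

def is_subsequence(lhs, w1):
    """True if the items of lhs occur in w1 in order, gaps allowed."""
    it = iter(w1)
    return all(item in it for item in lhs)

def matches(method, lhs, w1):
    """Does the left-hand side match window W1 under the given representation?"""
    lhs, w1 = list(lhs), list(w1)
    if method == "subset":
        return set(lhs) <= set(w1)
    if method == "substring":
        n = len(lhs)
        return any(w1[i:i + n] == lhs for i in range(len(w1) - n + 1))
    if method == "latest_substring":
        return len(lhs) <= len(w1) and w1[len(w1) - len(lhs):] == lhs
    if method == "subsequence":
        return is_subsequence(lhs, w1)
    if method == "latest_subsequence":
        return bool(lhs) and bool(w1) and lhs[-1] == w1[-1] and is_subsequence(lhs, w1)
    raise ValueError(f"unknown method: {method}")

def support(method, lhs, rhs, windows=WINDOWS):
    """Number of windows where lhs matches W1 and rhs occurs in W2 (assumed)."""
    return sum(1 for w1, w2 in windows if matches(method, lhs, w1) and rhs in w2)

for method, lhs in [("subset", "AC"), ("substring", "BC"), ("latest_substring", "C")]:
    print(method, lhs, "-> C, support =", support(method, lhs, "C"))
```

With this reading, all three rules on the slide meet min sup = 2, which is consistent with the slide listing them.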
Rule-Selection Criteria
◼ Among the rules whose LHS matches W1:
  ◼ Longest-Match Selection
    ◼ Select the rule whose left-hand side is the longest
    ◼ Corresponds to using the strongest signature to predict
  ◼ Most Confident
    ◼ Select the rule with the highest confidence
  ◼ Pessimistic Selection
    ◼ U_CF(E, N) is the upper bound on the estimated error for a given confidence value, assuming a normal distribution of error

$$ conf_p = 1 - \frac{U_{CF}(E, N)}{N} $$
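The three criteria can be compared side by side on a toy rule set. The sketch below is hedged in several ways: the slide does not define U_CF, so a common normal-approximation upper bound on the error rate (the Wilson upper limit, scaled by N to give an error count) stands in for it; E is taken to be the number of wrong predictions out of N matches; CF = 0.25 is an assumed default; and the `Rule` class, the helper names, and the example rules are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist

@dataclass(frozen=True)
class Rule:
    lhs: tuple
    rhs: str
    n: int       # N: number of times the LHS matched (assumed meaning)
    errors: int  # E: number of those matches where the RHS was wrong (assumed)

    @property
    def confidence(self):
        return 1 - self.errors / self.n

def upper_error_count(errors, n, cf=0.25):
    """Stand-in for U_CF(E, N): normal-approximation upper bound on errors."""
    z = NormalDist().inv_cdf(1 - cf)
    f = errors / n
    rate = (f + z * z / (2 * n)
            + z * sqrt(f * (1 - f) / n + z * z / (4 * n * n))) / (1 + z * z / n)
    return n * rate

def pessimistic_confidence(rule, cf=0.25):
    """conf_p = 1 - U_CF(E, N) / N."""
    return 1 - upper_error_count(rule.errors, rule.n, cf) / rule.n

def select(rules, how):
    """Pick one rule; all rules are assumed to already match W1."""
    keys = {
        "longest": lambda r: len(r.lhs),
        "confident": lambda r: r.confidence,
        "pessimistic": pessimistic_confidence,
    }
    return max(rules, key=keys[how])

# Hypothetical rules, not taken from the slides.
r1 = Rule(("A", "B", "C"), "D", n=2, errors=1)
r2 = Rule(("C",), "D", n=1, errors=0)
r3 = Rule(("B", "C"), "G", n=100, errors=5)
candidates = [r1, r2, r3]

for how in ("longest", "confident", "pessimistic"):
    print(how, "->", select(candidates, how))
```

On this toy set the three criteria disagree: longest-match prefers the three-page rule, most-confident prefers the rule seen once with no errors, and pessimistic selection prefers the rule backed by many observations, because the upper bound penalizes small N.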