11 SGWrap Rule Language
⚫ How can a mapping wrapper be described formally?
12 SGWrap Rule Language
⚫ A formal language for describing the user's intent is important for web data extraction systems. It should be:
  ○ Exact. This is the basic constraint. Because a wrapper program must produce exact results for automatic extraction, the language describing the wrapper's intent must also be exact.
  ○ Expressive. The language should be able to describe the user's typical intentions and considerations. In our case, it should be able to express DOM tree navigation and the construction of structured results.
  ○ Compact. The language should be simple yet powerful. It should describe the problem in a short script and provide facilities for common operations, such as string operations.
  ○ Understandable. A Rule is written not only for the computer but also for humans, so the language should be human-readable, because users may need to customize and adjust Rules.
13 SGWrap Rule Language
⚫ SGWrap's Rule is designed to be that type of language. It is exact because it uses XPath as the basic method for describing the DOM tree. It is expressive because it adopts XQuery's FLWR expression for result construction. It is also compact and understandable.
⚫ A Rule consists of three parts:
  ○ (a) an assign clause,
  ○ (b) a variable name for returning the result, and
  ○ (c) a return clause, which can be a variable name, a function clause, or a Rule array containing other Rules.
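The three-part structure above can be pictured as a small recursive data type. The following Python sketch is only an illustration and is not taken from SGWrap itself: the names `Rule`, `FunctionClause`, and `depth` are hypothetical, and the field layout is an assumption based on the three parts listed on the slide.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class FunctionClause:
    """Hypothetical stand-in for a function call used as a return clause."""
    name: str
    args: List[str]


@dataclass
class Rule:
    assign: str    # (a) assign clause, e.g. "LET $name:=$robot/TD[0]/A"
    variable: str  # (b) name of the variable that carries the result
    # (c) return clause: a variable name, a function clause, or nested Rules
    returns: Union[str, FunctionClause, List["Rule"]]


def depth(rule: Rule) -> int:
    """Count nesting levels: 1 for a leaf Rule, plus 1 per nested Rule array."""
    if isinstance(rule.returns, list):
        return 1 + max((depth(child) for child in rule.returns), default=0)
    return 1


# A Rule whose return clause is a Rule array holding one leaf Rule.
name_rule = Rule("LET $name:=$robot/TD[0]/A", "$name", "$name")
robot_rule = Rule(
    "FOR $robot IN $Web_robots/HTML/BODY/TABLE/TBODY/TR", "$robot", [name_rule]
)
```

Because part (c) may itself be a Rule array, Rules nest to arbitrary depth, which is what allows a single Rule to describe a tree-shaped result.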
14 SGWrap Rule Language - example
    {
      LET $Web_robots:=document($d)
      // document($d) is an expression reserved by SGWrap Rule; it
      // represents the "root" of a document.
      RETURN <Web_robots>
      {
        FOR $robot IN $Web_robots/HTML/BODY/TABLE/TBODY/TR
        // What follows is an array of Rules, which means that the result
        // consists of a series of child nodes.
        RETURN <robot>
        {
          LET $name:=$robot/TD[0]/A
          RETURN <name>$name</name>
        }
        {
          LET $Platform:=$robot/TD[1]/TABLE/TBODY/TR[contains(./TH, "Platform:")]/TD
          RETURN <Platform>$Platform</Platform>
        }
        </robot>
      }
      </Web_robots>
    }
Refer to http://idke.ruc.edu.cn/sgwrap/doc/RuleSpecification.html#Rule-Specification for the specification.
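To make the intent of the example Rule concrete, the sketch below reproduces roughly the same extraction in plain Python using only the standard library. It is not SGWrap's implementation. The sample page `PAGE` and the function `extract` are made up for illustration, the page is assumed to be well-formed XML, and `TD[0]` / `TD[1]` are assumed to be zero-based indices, as the Rule text suggests. Since `xml.etree.ElementTree` does not support `contains()` in its path syntax, the predicate is applied in Python instead.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page (an assumption, not data from the real site).
PAGE = """
<HTML><BODY><TABLE><TBODY>
  <TR>
    <TD><A>ExampleBot</A></TD>
    <TD><TABLE><TBODY>
      <TR><TH>Purpose:</TH><TD>indexing</TD></TR>
      <TR><TH>Platform:</TH><TD>java</TD></TR>
    </TBODY></TABLE></TD>
  </TR>
</TBODY></TABLE></BODY></HTML>
"""


def extract(page: str) -> ET.Element:
    root = ET.fromstring(page)         # LET $Web_robots:=document($d)
    result = ET.Element("Web_robots")  # RETURN <Web_robots>
    # FOR $robot IN $Web_robots/HTML/BODY/TABLE/TBODY/TR
    for robot in root.findall("./BODY/TABLE/TBODY/TR"):
        cells = robot.findall("./TD")  # direct TD children of this row
        out = ET.SubElement(result, "robot")
        # LET $name:=$robot/TD[0]/A
        ET.SubElement(out, "name").text = cells[0].findtext("A")
        # LET $Platform:=$robot/TD[1]/TABLE/TBODY/TR[contains(./TH, "Platform:")]/TD
        for row in cells[1].findall("./TABLE/TBODY/TR"):
            if "Platform:" in (row.findtext("TH") or ""):
                ET.SubElement(out, "Platform").text = row.findtext("TD")
    return result


if __name__ == "__main__":
    print(ET.tostring(extract(PAGE), encoding="unicode"))
```

The hand-written version needs an explicit loop and a manual filter for each step, whereas the Rule states the same navigation and result construction declaratively in a few lines.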
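15 SGWrap Rule Language
⚫ Applying the SGWrap Rule Language to extraction from HTML pages exposed several problems:
  ○ HTML pages themselves make extraction difficult.
  ○ Rules have no conditional branch statement, so they cannot make conditional selections.
  ○ Rules are built on the W3C DOM model, but the W3C DOM standard does not match the de facto standard (IE DOM).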