Abdelguerfi, M., Eskicioglu, R., and Liebowitz, J. “Knowledge Engineering.” The Electrical Engineering Handbook, ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
94 Knowledge Engineering

M. Abdelguerfi and R. Eskicioglu
University of New Orleans

Jay Liebowitz
George Washington University

94.1 Databases
Database Abstraction • Data Models • Relational Databases • Hierarchical Databases • Network Databases • Architecture of a DBMS • Data Integrity and Security • Emerging Trends

94.2 Rule-Based Expert Systems
Problem Selection • Knowledge Acquisition • Knowledge Representation • Knowledge Encoding • Knowledge Testing and Evaluation • Implementation and Maintenance

94.1 Databases

M. Abdelguerfi and R. Eskicioglu

In the past, file processing techniques were used to design information systems. These systems usually consist of a set of files and a collection of application programs. Permanent records are stored in the files, and application programs are used to update and query the files. The application programs were in general developed individually to meet the needs of different groups of users. In many cases, this approach leads to a duplication of data among the files of different users. Also, the lack of coordination between files belonging to different users often leads to a lack of data consistency. In addition, changes to the underlying data requirements usually necessitate major changes to existing application programs. Among other major problems that arise with the use of file processing techniques are lack of data sharing, reduced programming productivity, and increased program maintenance. Because of their inherent difficulties and lack of flexibility, file processing techniques have lost a great deal of their popularity and are being replaced by database management systems (DBMS).

A DBMS is designed to efficiently manage a shared pool of interrelated data (a database). This includes the existence of features such as a data definition language for the definition of the logical structure of the database (the database schema), a data manipulation language to query and update the database, a concurrency control mechanism to keep the database consistent when shared by several users, a crash recovery strategy to avoid any loss of information after a system crash, and safety mechanisms against unauthorized access.
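To make these features concrete, the minimal sketch below exercises a data definition language and a data manipulation language through Python's built-in sqlite3 module. The table, columns, and data are invented for illustration; a full-scale DBMS would provide the concurrency control and crash recovery mentioned above behind this same interface.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Data definition language (DDL): declare the logical structure (schema).
conn.execute("CREATE TABLE faculty (lname TEXT PRIMARY KEY, dept TEXT)")

# Data manipulation language (DML): update and query the database.
conn.execute("INSERT INTO faculty VALUES ('Severn', 'EE')")
for row in conn.execute("SELECT lname FROM faculty WHERE dept = 'EE'"):
    print(row)  # ('Severn',)

# A transaction groups updates so a crash cannot leave partial results;
# this is the visible face of the recovery strategy described above.
with conn:  # commits on success, rolls back on an exception
    conn.execute("UPDATE faculty SET dept = 'CS' WHERE lname = 'Severn'")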
Database Abstraction

A DBMS is expected to provide for data independence, i.e., user requests are made at a logical level without any need for knowledge of how the data is stored in actual files. This implies that the internal file structure could be modified without any change to the user's perception of the database. To achieve data independence, the Standards Planning and Requirements Committee (SPARC) of the American National Standards Institute (ANSI) in its 1977 report recommended three levels of database abstraction (see Fig. 94.1). The lowest level in the abstraction is the internal level. Here, the database is viewed as a collection of files organized according to one of several possible internal data organizations (e.g., the B+-tree data organization). In the conceptual level, the database is viewed at an abstract level. The user at this level is shielded from the internal storage details. At the external level, each group of users has its own perception or view of the database. Each view is derived from the conceptual database and is designed to meet the needs of a particular group of users. Such a group can only have access to the data specified by its particular view. This, of course, ensures both privacy and security.

FIGURE 94.1 Data abstraction: the external (view), conceptual, and internal levels.

The mapping between the three levels of abstraction is the task of the DBMS. When changes to the internal level (such as a change in file organization) do not affect the conceptual and external levels, the system is said to provide for physical data independence. Logical data independence prevents changes to the conceptual level from affecting users' views. Both types of data independence are desired features in a database system.

Data Models

A data model refers to an integrated set of tools used to describe the data and its structure, data relationships, and data constraints. Some data models provide a set of operators that is used to update and query the database. Data models can be classified in two main categories: record-based and object-based. Both classes are used to describe the database at the conceptual and external levels. With object-based data models, constraints on the data can be specified more explicitly.

There are three main record-based data models: the relational, network, and hierarchical models. In the relational model, data at the conceptual level is represented as a collection of interrelated tables. These tables are normalized so as to minimize data redundancy and update anomalies. In this model, data relationships are implicit and are derived by matching columns in tables. In the hierarchical and network models, the data is represented as a collection of records, and data relationships are explicit and are represented by links. The difference between the last two models is that in the hierarchical model data is represented as a tree structure, while in the network model it is represented as a generalized graph.

In the hierarchical and network models, the existence of physical pointers (links) between related records allows an application program to retrieve a single record at a time by following the pointer chain. The process of following the pointer chain and selecting one record at a time is referred to as navigation. In nonnavigational models such as the relational model, records are not related through pointer chains; instead, relationships are established by matching columns in different tables.

The hierarchical and network models require the application programmer to be aware of the internal structure of the database. The relational model, on the other hand, allows for a high degree of physical and logical data independence. Earlier DBMSs were for the most part navigational systems. Because of its simplicity and strong theoretical foundations, the relational model has since received wide acceptance. Today, most DBMSs are based on the relational model.
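The contrast between navigational and nonnavigational access can be sketched in a few lines of Python. The records and links below are invented stand-ins: the relational style relates records purely by matching column values, while the navigational style follows an explicit link one record at a time.

faculty = [{"lname": "Severn", "dept": "EE"}, {"lname": "Matari", "dept": "CS"}]
chairs = [{"dept": "EE", "chair": "Hanoura"}, {"dept": "CS", "chair": "Abdel"}]

# Relational (nonnavigational): relate records by matching column values.
joined = [{**f, **c} for f in faculty for c in chairs if f["dept"] == c["dept"]]

# Navigational: follow an explicit link from one record to a related one.
chair_of = {c["dept"]: c for c in chairs}  # stands in for physical pointers
record = faculty[0]
linked = chair_of[record["dept"]]          # follow the link, one record at a time
print(joined[0]["chair"], linked["chair"])  # Hanoura Hanoura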
Other data models include a popular high-level conceptual data model known as the Entity-Relationship (ER) model. The ER model is mainly used for the conceptual design of databases and their applications. The ER model describes data as entities, attributes, and relationships. An entity is an "object" in the real world with an independent existence. Each entity has a set of properties, called attributes, that describes it. A relationship is an association between entities. For example, a professor entity may be described by its name, age, and salary and can be associated with a department entity by the relationship "works for".

With the advent of advanced database applications, the ER modeling concepts became insufficient. This has led to the enhancement of the ER model with additional concepts, such as generalization, categories, and inheritance, leading to the Enhanced-ER, or EER, model.

Relational Databases

The relational model was introduced by E. F. Codd [1970]. Since the theoretical underpinnings of the relational model have been well defined, it has become the focus of most commercial DBMSs.

In the relational model, the data is represented as a collection of relations. To a large extent, each relation can be thought of as a table. The example of Fig. 94.2 shows part of a university database composed of two relations.
FIGURE 94.2 An example of two relations: FAC_INFO and DEP_CHAIR.

FAC_INFO gives personal information (last name, social security number, street and city of residence, and department) for each faculty member. DEP_CHAIR gives the last name of the chairman of each department. A faculty member is not allowed to belong to two departments. Each row in a relation is referred to as a tuple. A column name is called an attribute name. The data type of each attribute name is known as its domain. A relation scheme is a set of attribute names. For instance, the relation scheme (or scheme for short) of the relation FAC_INFO is (lname, social_sec#, street, city, dept).

A key is a set of attribute names whose composite value is distinct for all tuples. In addition, no proper subset of the key is allowed to have this property. It is not unusual for a scheme to have several possible keys. In FAC_INFO, both lname and social_sec# are possible keys. In this case, each possible key is known as a candidate key, and the one selected to act as the relation's key, say lname, is referred to as the primary key. A superkey is defined like a key except that minimality is not required. In a relation, an attribute name (or a set of attribute names) is referred to as a foreign key if it is the primary key of another relation. In FAC_INFO, the attribute name dept is a foreign key, since the same attribute is a key in DEP_CHAIR. Because of updates to the database, the content of a relation is dynamic. For this reason, the data in a relation at a given time instant is called an instance of the relation.

There are three integrity constraints that are usually imposed on each instance of a relation: primary key integrity, entity integrity, and referential integrity. The key integrity constraint requires that no two tuples of a relation have the same key value. The entity integrity constraint specifies that the key value of each tuple should have a known value (i.e., no null values are allowed for primary keys). The referential integrity constraint specifies that if a relation r1 contains a foreign key that matches the primary key of a relation r2, then each value of the foreign key in r1 must either match a value of the primary key in r2 or must be null. For the database of Fig. 94.2 to be consistent, each value of dept in FAC_INFO must match a value of dept in DEP_CHAIR.
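As an illustration, the sketch below declares the two relations of Fig. 94.2 in SQL through Python's sqlite3 module and shows referential integrity being enforced. The sample values are invented, and SQLite is only one possible host for these constraints.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request

conn.execute("""CREATE TABLE dep_chair (
    dept  TEXT PRIMARY KEY,   -- key integrity: no two tuples share a dept value
    lname TEXT)""")

conn.execute("""CREATE TABLE fac_info (
    lname      TEXT NOT NULL PRIMARY KEY,  -- entity integrity: the key is never null
    social_sec TEXT,
    street     TEXT,
    city       TEXT,
    dept       TEXT REFERENCES dep_chair(dept))""")  # referential integrity

conn.execute("INSERT INTO dep_chair VALUES ('EE', 'Hanoura')")
conn.execute("INSERT INTO fac_info VALUES "
             "('Severn', '000-00-0000', 'Annabel St.', 'New Orleans', 'EE')")

# 'CS' matches no dept value in dep_chair, so this insert is rejected.
try:
    conn.execute("INSERT INTO fac_info VALUES "
                 "('Matari', '000-00-0001', 'St. Charles Ave.', 'New Orleans', 'CS')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)  # rejected: FOREIGN KEY constraint failed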
Relational Database Design

Relational database design [Maier, 1983] refers to the process of generating a set of relation schemes that minimizes data redundancy and removes update anomalies. One of the most popular approaches is the use of normalization theory. Normalization theory is based on the notion of functional dependencies, which are constraints imposed on a database.

The notion of superkey, introduced in the previous section, can be formulated as follows: a subset of a relation scheme is a superkey if, in any instance of the relation, no two distinct tuples have the same superkey value. If r(R) is used to denote a relation r on a schema R, K ⊆ R a superkey, and t(K) the K-value of tuple t, then no two tuples t1 and t2 in r(R) are such that t1(K) = t2(K).

The notion of a functional dependency can be seen as a generalization of the notion of superkey. Let X and Y be two subsets of R; the functional dependency X → Y exists in r(R) if whenever two tuples in r(R) have the same X-value, their Y-value is also the same. That is, if t1(X) = t2(X), then t1(Y) = t2(Y). Using functional dependencies, one can define the notion of a key more precisely. A key k of a relation r(R) is such that k → R and no proper subset of k has this property. Note that if the schema R is composed of attribute names {A1, A2, . . ., An}, then each attribute name Ai is functionally determined by the key k, i.e., k → Ai, i = 1, . . ., n.
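These definitions translate directly into a small test. The sketch below (the function and argument names are ours) checks whether a functional dependency X → Y holds in one relation instance, represented as a list of dicts. Note that a single instance can only refute a dependency, never establish it, since a functional dependency is a constraint on all instances.

def holds(r, X, Y):
    """Test whether the functional dependency X -> Y holds in instance r."""
    seen = {}  # first Y-value observed for each X-value
    for t in r:
        x, y = tuple(t[a] for a in X), tuple(t[a] for a in Y)
        if seen.setdefault(x, y) != y:
            return False  # two tuples agree on X but differ on Y
    return True

def is_superkey(r, K, R):
    """K is a superkey of scheme R exactly when K -> R."""
    return holds(r, K, R)

fac_info = [
    {"lname": "Severn", "social_sec#": "000-00-0000", "dept": "EE"},
    {"lname": "Matari", "social_sec#": "000-00-0001", "dept": "CS"},
]
R = ["lname", "social_sec#", "dept"]
print(holds(fac_info, ["lname"], ["dept"]))  # True: lname -> dept holds here
print(is_superkey(fac_info, ["lname"], R))   # True, at least in this instance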
An attribute name that is part of a key is referred to as a prime attribute. In the example of Fig. 94.2, both attribute names street and city are nonprime attributes.

The normalization process can be thought of as the process of decomposing a scheme with update anomalies and data redundancy into smaller schemes in which these undesirable properties are to a large extent eliminated. Depending on the severity of these undesirable properties, schemes are classified into normal forms. Originally, Codd defined three normal forms: first normal form (1NF), second normal form (2NF), and third normal form (3NF). Thereafter, a stronger version of the 3NF, known as Boyce-Codd normal form (BCNF), was suggested. These four normal forms are based on the concept of functional dependencies. The 1NF requires that attribute name values be atomic; that is, composite values for attribute names are not allowed. A 2NF scheme is a 1NF scheme in which all nonprime attributes are fully dependent on the key.

FIGURE 94.3 Instance of PRODUCT (supplier_name, product_name, price, location).

Consider the relation of Fig. 94.3. Each tuple in PRODUCT gives the name of a supplier, a product name, its price, and the supplier's location. The scheme (supplier_name, product_name, price, location) is in 1NF since each attribute name is atomic. It is assumed that many products can be supplied by a single supplier, that a given product can be supplied by more than one supplier, and that a supplier has only one location. So, (supplier_name, product_name) is the relation's key, and the functional dependency supplier_name → location should hold for any instance of PRODUCT.

The structure of the relation of Fig. 94.3 does not allow a supplier to appear in the relation unless it offers at least one product. Even the use of null values is not of much help in this case, as product_name is part of the key and therefore cannot be assigned a null value. Another anomaly can be encountered during the deletion process. For instance, deleting the last tuple in the relation results in the loss of the information that Rudd is a supplier located in Metairie. It is seen that the relation PRODUCT suffers from insertion and deletion anomalies.

Modifications can also be a problem in the relation PRODUCT. Suppose that the location of the supplier Martin is moved from Kenner to Slidell. In order not to violate the functional dependency supplier_name → location, the location value in all tuples where the supplier is Martin needs to be changed from Kenner to Slidell. This modification anomaly has a negative effect on performance. In addition, the relation PRODUCT suffers from data redundancy. For example, although Martin has only one location, Kenner, this location appears in all three tuples where the supplier_name is Martin.

The update anomalies and data redundancy encountered in PRODUCT are all due to the functional dependency supplier_name → location. The right-hand side of this dependency, location, is a nonprime attribute, and the left-hand side represents part of the key. Therefore, we have a nonprime attribute that is only partially dependent on the key (supplier_name, product_name). As a consequence, the scheme (supplier_name, product_name, price, location) is not in 2NF. The removal of the partial dependency supplier_name → location will eliminate all the above anomalies. This removal is achieved by decomposing the scheme (supplier_name, product_name, price, location) into two 2NF schemes: (supplier_name, product_name, price) and (supplier_name, location). This decomposition results in the relations PRO_INFO and SUP_LOC shown in Fig. 94.4. The keys of PRO_INFO and SUP_LOC are (supplier_name, product_name) and supplier_name, respectively.
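The decomposition itself is just a pair of projections of PRODUCT. The following sketch, over an invented instance consistent with the discussion above (three Martin tuples, one Rudd tuple), computes both projections and rejoins them on supplier_name to confirm that nothing is lost: the join is lossless because the shared attribute supplier_name is a key of SUP_LOC.

product = [
    {"supplier_name": "Martin", "product_name": "bolt",  "price": 1.25,   "location": "Kenner"},
    {"supplier_name": "Martin", "product_name": "nut",   "price": 0.95,   "location": "Kenner"},
    {"supplier_name": "Martin", "product_name": "screw", "price": 0.50,   "location": "Kenner"},
    {"supplier_name": "Rudd",   "product_name": "cam",   "price": 500.99, "location": "Metairie"},
]

def project(r, attrs):
    """Projection: keep only attrs, dropping duplicate tuples."""
    return [dict(t) for t in {tuple((a, row[a]) for a in attrs) for row in r}]

pro_info = project(product, ["supplier_name", "product_name", "price"])
sup_loc = project(product, ["supplier_name", "location"])  # one tuple per supplier

# The natural join on supplier_name reconstructs PRODUCT exactly.
rejoined = [{**p, **s} for p in pro_info for s in sup_loc
            if p["supplier_name"] == s["supplier_name"]]
assert all(t in rejoined for t in product) and len(rejoined) == len(product)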
Normalizing schemes into 2NF removes all update anomalies due to nonprime attributes being partially dependent on keys. Anomalies of a different nature, however, are still possible. Update anomalies and data redundancy can also originate from transitive dependencies. A nonprime attribute Ai is said to be transitively dependent on a key k via attribute name Aj if k → Aj, Aj → Ai, and Aj does not functionally determine k. A 3NF scheme is a 1NF scheme in which no nonprime attribute is transitively dependent on a key.
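As a sketch of a transitive dependency, consider merging the chapter's FAC_INFO and DEP_CHAIR relations into a single scheme (lname, dept, chair), where the attribute name chair is ours, standing for the chairman's last name. Then lname → dept and dept → chair, so chair is transitively dependent on the key lname; the scheme is in 2NF (the key is a single attribute, so no partial dependencies are possible) but not in 3NF, and the repair is the original two-relation design. The data below is invented.

merged = [
    {"lname": "Severn", "dept": "EE", "chair": "Hanoura"},
    {"lname": "Matari", "dept": "EE", "chair": "Hanoura"},  # chair value repeated
]

# Redundancy and a modification anomaly: changing the EE chair means
# updating every EE tuple, just as with supplier_name -> location above.
for t in merged:
    if t["dept"] == "EE":
        t["chair"] = "Abdel"

def project(r, attrs):
    """Projection with duplicate elimination, as in the previous sketch."""
    return [dict(t) for t in {tuple((a, row[a]) for a in attrs) for row in r}]

# The 3NF repair: split along the transitive dependency.
fac_info = project(merged, ["lname", "dept"])   # lname -> dept
dep_chair = project(merged, ["dept", "chair"])  # dept -> chair, stored once
print(dep_chair)  # [{'dept': 'EE', 'chair': 'Abdel'}]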