Chapter 1 Introduction Data redundancy and inconsistency.Since different programmers create the files and application programs over a long period,the various files are likely to have different structures and the programs may be written in several programming languages Morover thsmnformation may be duplicated veral pl (files)For exa mple,if a student has a do ble major(sa music atics)the add nd teleph umbe of tha may appear nthe Mu that consist of stuc studen ts in the Mathematics department.This redundancy leads to h gher storage and access cost.In addition,it may lead to data inconsistency;that is,the various copies of the same data may no longer agree.For example,a changed student address may be reflected in the Music department records but not elsewhere in the system. Difficulty in accessing data.Su bose that one of the university clerks needs to find out the names ofall studen area.The clerk asks the data-processing department to Because the designers of the did nota ipate this req appl it atior n progra ram nera no ices:either obtain the list of all studen an needed information manually or ask a programmer to write the necessary application program.Both alternatives are obviously unsatisfactory.Suppose that such a program is written,and that,several days later,the same clerk needs to trim that list to include only those students who have taken at least 60 credit hours.As expected,a program to generate such a list does not exist.Again,the clerk has the preceding two options,neither of which is satisfactory. The point here is that conventional file-processing environments do not allow needed data tobe etri ved in a cony nient and efficient manner.More responsive data-retrieval systems are required for general us Data isolation.Because data are scattered in various files,and files may be in differen t forma writing new application programs to retrieve the appropriate data is difficult. .Integrity problems.The data values stored in the database must satisfy cer- tain types of consistency constraints.Suppose the university maintains an account for each department,and records the balance amount in each ac- count.Suppose also that the university requires that the account balance of a department may never fall below zero.Developers enforce these constraints in the sve stem by adding a ineaeateamoia2abemePe Ppro oriate code in the various a ver.when ne am n.The inded when involve several data items f t files Atomicity problems. A compute system like e any o other device,is subject to failure.In many applications,it is crucial that,if a failure occurs,the data
4 Chapter 1 Introduction • Data redundancy and inconsistency. Since different programmers create the files and application programs over a long period, the various files are likely to have different structures and the programs may be written in several programming languages. Moreover, the same information may be duplicated in several places (files). For example, if a student has a double major (say, music and mathematics) the address and telephone number of that student may appear in a file that consists of student records of students in the Music department and in a file that consists of student records of students in the Mathematics department. This redundancy leads to higher storage and access cost. In addition, it may lead to data inconsistency; that is, the various copies of the same data may no longer agree. For example, a changed student address may be reflected in the Music department records but not elsewhere in the system. • Difficulty in accessing data. Suppose that one of the university clerks needs to find out the names of all students who live within a particular postal-code area. The clerk asks the data-processing department to generate such a list. Because the designers of the original system did not anticipate this request, there is no application program on hand to meet it. There is, however, an application program to generate the list of all students. The university clerk has now two choices: either obtain the list of all students and extract the needed information manually or ask a programmer to write the necessary application program. Both alternatives are obviously unsatisfactory. Suppose that such a program is written, and that, several days later, the same clerk needs to trim that list to include only those students who have taken at least 60 credit hours. As expected, a program to generate such a list does not exist. Again, the clerk has the preceding two options, neither of which is satisfactory. The point here is that conventional file-processing environments do not allow needed data to be retrieved in a convenient and efficient manner. More responsive data-retrieval systems are required for general use. • Data isolation. Because data are scattered in various files, and files may be in different formats, writing new application programs to retrieve the appropriate data is difficult. • Integrity problems. The data values stored in the database must satisfy certain types of consistency constraints. Suppose the university maintains an account for each department, and records the balance amount in each account. Suppose also that the university requires that the account balance of a department may never fall below zero. Developers enforce these constraints in the system by adding appropriate code in the various application programs. However, when new constraints are added, it is difficult to change the programs to enforce them. The problem is compounded when constraints involve several data items from different files. • Atomicity problems. A computer system, like any other device, is subject to failure. In many applications, it is crucial that, if a failure occurs, the data
1.2 Purpose of Database Systems 5 be to the ts $500 from or to the failure.Consider count ce of department A lance of depa rtmen I a system ailure occurs d ring the execution of the program,it is possi h le that the $500 was removed from the balance of department A but was not credited to the balance of department B, resulting in an inconsistent database state.Clearly,it is essential to database consistency that either both the credit and debit occur,or that neither occur. That is,the funds transfer must be atomic-it must happen in its entirety or not at all.It is difficult to ensure atomicity in a conventional file-processing system. Concurrent-access anomalies.For the sake of overall performance of the sys. ster s allow to update the the la Interne ns of acc per day to the updates lata by shoppers In such n environ ment, tion c s possible an sider department A,with two department clerks debit the account balance (by say $500 and $100,re spectively)of department A at almost exactly the same time,the result of the concurrent executions may leave the budget in an incorrect(or inconsistent) state.Suppose that the programs executing on behalf of each withdrawal read the old balance,reduce that value by the amount being withdrawn,and write the result back.If the two programs run concurrently,they may both read the value s10000.and write l ack $9500 and $9900,res ctively.Dep which one writes the value last the account baland of de artment A ma contain he $9500 g9900 r tha th Bgainst this correct valu data may be orm of superv different application programs that have not been coordina ted previously As another example,suppose a registration program maintains a count of students registered for a course,in order to enforce limits on the number of students registered.When a student registers,the program reads the current count for the courses.verifies that the count is not already at the limit adds one to the count,and stores the count back in the database.Suppose two students register concurrently,with the count at(say)39.The two program executions may both read the value 39,and both would then write back 40, rect increase of only 1,even though two students suc. cessfully registered for the course and theco ant should be 41.Furthermor on limit was in the above students would be able to register,leading to a violation of the limit of 40 students .Security problems.Not every user of the database syster m should be able to access all the data.For example,in a university,payroll personnel need to see only that part of the database that has financial information.They do not need access to information about academic records.But,since applica- tion programs are added to the file-processing system in an ad hoc manner, enforcing such security constraints is difficult
1.2 Purpose of Database Systems 5 be restored to the consistent state that existed prior to the failure. Consider a program to transfer $500 from the account balance of department A to the account balance of department B. If a system failure occurs during the execution of the program, it is possible that the $500 was removed from the balance of department A but was not credited to the balance of department B, resulting in an inconsistent database state. Clearly, it is essential to database consistency that either both the credit and debit occur, or that neither occur. That is, the funds transfer must be atomic—it must happen in its entirety or not at all. It is difficult to ensure atomicity in a conventional file-processing system. • Concurrent-access anomalies. For the sake of overall performance of the system and faster response, many systems allow multiple users to update the data simultaneously. Indeed, today, the largest Internet retailers may have millions of accesses per day to their data by shoppers. In such an environment, interaction of concurrent updates is possible and may result in inconsistent data. Consider department A, with an account balance of $10,000. If two department clerks debit the account balance (by say $500 and $100, respectively) of department A at almost exactly the same time, the result of the concurrent executions may leave the budget in an incorrect (or inconsistent) state. Suppose that the programs executing on behalf of each withdrawal read the old balance, reduce that value by the amount being withdrawn, and write the result back. If the two programs run concurrently, they may both read the value $10,000, and write back $9500 and $9900, respectively. Depending on which one writes the value last, the account balance of department A may contain either $9500 or $9900, rather than the correct value of $9400. To guard against this possibility, the system must maintain some form of supervision. But supervision is difficult to provide because data may be accessed by many different application programs that have not been coordinated previously. As another example, suppose a registration program maintains a count of students registered for a course, in order to enforce limits on the number of students registered. When a student registers, the program reads the current count for the courses, verifies that the count is not already at the limit, adds one to the count, and stores the count back in the database. Suppose two students register concurrently, with the count at (say) 39. The two program executions may both read the value 39, and both would then write back 40, leading to an incorrect increase of only 1, even though two students successfully registered for the course and the count should be 41. Furthermore, suppose the course registration limit was 40; in the above case both students would be able to register, leading to a violation of the limit of 40 students. • Security problems. Not every user of the database system should be able to access all the data. For example, in a university, payroll personnel need to see only that part of the database that has financial information. They do not need access to information about academic records. But, since application programs are added to the file-processing system in an ad hoc manner, enforcing such security constraints is difficult
Chapter 1 Introduction These difficulties,among others,prompted the development of database sys- ata em ove the e m mt processing systems.In most of this book,we use a university organization as a running e xample of a typical data-pro ing application 1.3 View of Data A database system is a collection of interrelated data and a set of programs that allow users to access and modify these data.A major purpose of a databm system is to rovide users with an abstract view of the hides how the data are stored and maintained. 1.3.1 Data Abstraction For thesystem to be usable,it must retrieve data efficiently.The need for efficiency has led designers to use complex data structures to represent data in the database Since many database-system users are not computer trained,developers hide the complexity from users through several levels of abstraction,to simplify users interactions with the system: 。h sical le vel.The lowest level of abstraction describes how the data are ac- tual tored.The physical level describes complex low-level data structures in detail. .Logical level.The next-higher level of abstraction describes what data are stored in the database,and what relationships exist among those data.The logical level thus describes the entire database in terms of a small number of relatively simple structures.Although implementation of the simple struc tures at the lo gical level may involy con hysical-level structures.the of the log ical level does not ne d to re of this co plexity Thi sical data inde pende s,who keep in the database.us the logical leve View level.The highest level of abstraction describes only part of the entire database.Even though the logical level uses simpler structures,complexity remains because of the variety of information stored in a large database Many users of the database system do not need all this information;instead, they need to access only a part of the database.The view level of abstraction exists to simplify their interaction with the system.The system may provide many views for the same database Figure 1.1 shows the An ana onship among the three levels of abstractio ogy to the concept ing languages may clarify the distinction among levels of abstraction.Many high-level programming
6 Chapter 1 Introduction These difficulties, among others, prompted the development of database systems. In what follows, we shall see the concepts and algorithms that enable database systems to solve the problems with file-processing systems. In most of this book, we use a university organization as a running example of a typical data-processing application. 1.3 View of Data A database system is a collection of interrelated data and a set of programs that allow users to access and modify these data. A major purpose of a database system is to provide users with an abstract view of the data. That is, the system hides certain details of how the data are stored and maintained. 1.3.1 Data Abstraction For the system to be usable, it must retrieve data efficiently. The need for efficiency has led designers to use complex data structures to represent data in the database. Since many database-system users are not computer trained, developers hide the complexity from users through several levels of abstraction, to simplify users’ interactions with the system: • Physical level. The lowest level of abstraction describes how the data are actually stored. The physical level describes complex low-level data structures in detail. • Logical level. The next-higher level of abstraction describes what data are stored in the database, and what relationships exist among those data. The logical level thus describes the entire database in terms of a small number of relatively simple structures. Although implementation of the simple structures at the logical level may involve complex physical-level structures, the user of the logical level does not need to be aware of this complexity. This is referred to as physical data independence. Database administrators, who must decide what information to keep in the database, use the logical level of abstraction. • View level. The highest level of abstraction describes only part of the entire database. Even though the logical level uses simpler structures, complexity remains because of the variety of information stored in a large database. Many users of the database system do not need all this information; instead, they need to access only a part of the database. The view level of abstraction exists to simplify their interaction with the system. The system may provide many views for the same database. Figure 1.1 shows the relationship among the three levels of abstraction. An analogy to the concept of data types in programming languages may clarify the distinction among levels of abstraction. Many high-level programming
1.3 View of Data 7 view level view1 view 2 view n logical level Figure 1.1 The three levels of data abstraction. languages support the notion of a structured type.For example,we may describe a record as follows: type instructor=record ID:char(⑤: name:char (20) 0m ary:numeric (82) end; has a name and a type several such record types,including .department,with fields deptname,building,and budget .course,with fields course id,title,dept name,and credits student,with fields ID,name,dept name,and tot cred At the physical level,an instructor,department,or student record can be de- scribed as a block of consecutive storage locations.The compiler hides this level of detail from programmers.Similarly,the database system hides many of the lowest-level storage details from database programmers.Database administra- tors,on the other hand,may be aware of certain details of the physical organiza- tion of the data. and C++usestruct declarations.Java doesnot have
1.3 View of Data 7 view 1 view 2 logical level physical level . view n view level Figure 1.1 The three levels of data abstraction. languages support the notion of a structured type. For example, we may describe a record as follows:1 type instructor = record ID : char (5); name : char (20); dept name : char (20); salary : numeric (8,2); end; This code defines a new record type called instructor with four fields. Each field has a name and a type associated with it. A university organization may have several such record types, including • department, with fields dept name, building, and budget • course, with fields course id, title, dept name, and credits • student, with fields ID, name, dept name, and tot cred At the physical level, an instructor, department, or student record can be described as a block of consecutive storage locations. The compiler hides this level of detail from programmers. Similarly, the database system hides many of the lowest-level storage details from database programmers. Database administrators, on the other hand, may be aware of certain details of the physical organization of the data. 1The actual type declaration depends on the language being used. C and C++ use struct declarations. Java does not have such a declaration, but a simple class can be defined to the same effect
8 Chapter1 Introduction At the logical level,each such record is described by a type definition,as in the previous code segment,and the interrelationship of thes defined as well Programmers using a programming language work at this leve of abstraction.Similarly,database abstraction that hide at the view level computer use of the ssee a set of applicatior programs ta types.At the view are a ine or alof thes views In addition to hiding details of the logical level of the database,the views als provide a security mechanism to prevent users from accessing certain parts of the database For example,clerks in the university registrar office can see only that part of the database that has information about students;they cannot access information about salaries of instructors. 1.3.2 Instances and Schemas Databases change over time as information is inserted and deleted.The collection of info d in the a particula called an ir Th the ase schema moaRpo2adthbaesdhemasanmdntaCan nfrequently,if at all. The concept of e unde erstood by analogy to a program written in a programming language.A database schema corresponds to the variable declarations(along with associated type definitions)in a program. Each variable has a particular value at a given instant.The values of the variables in a program at a point in time correspond to an instance of a database schema. Database systems have several schemas,partitioned according to the levels of abstraction.The physical schema describes the database design at the physical evel,while thschema describes the database designat the logical le a database may also have several schemas at the view lev mes called subschemas,that t describe diffe of the database Of th the e logical is by he mportant,in ter erms of its effect since sical sc appl s by using th logical Ihe phys usually be changed easily without affecting application programs.Application programs are said to exhibit physical data independence if they do not depend on the physical schema,and thus need not be rewritten if the physical schema changes. We study languages for describing schemas after introducing the notion of data models in the next section. 1.3.3 Data Models Underlying tools 心o ing da data aints and away to debe the den of adatabase at
8 Chapter 1 Introduction At the logical level, each such record is described by a type definition, as in the previous code segment, and the interrelationship of these record types is defined as well. Programmers using a programming language work at this level of abstraction. Similarly, database administrators usually work at this level of abstraction. Finally, at the view level, computer users see a set of application programs that hide details of the data types. At the view level, several views of the database are defined, and a database user sees some or all of these views. In addition to hiding details of the logical level of the database, the views also provide a security mechanism to prevent users from accessing certain parts of the database. For example, clerks in the university registrar office can see only that part of the database that has information about students; they cannot access information about salaries of instructors. 1.3.2 Instances and Schemas Databases change over time as information is inserted and deleted. The collection of information stored in the database at a particular moment is called an instance of the database. The overall design of the database is called the database schema. Schemas are changed infrequently, if at all. The concept of database schemas and instances can be understood by analogy to a program written in a programming language. A database schema corresponds to the variable declarations (along with associated type definitions) in a program. Each variable has a particular value at a given instant. The values of the variables in a program at a point in time correspond to an instance of a database schema. Database systems have several schemas, partitioned according to the levels of abstraction. The physical schema describes the database design at the physical level, while the logical schema describes the database design at the logical level. A database may also have several schemas at the view level, sometimes called subschemas, that describe different views of the database. Of these, the logical schema is by far the most important, in terms of its effect on application programs, since programmers construct applications by using the logical schema. The physical schema is hidden beneath the logical schema, and can usually be changed easily without affecting application programs. Application programs are said to exhibit physical data independence if they do not depend on the physical schema, and thus need not be rewritten if the physical schema changes. We study languages for describing schemas after introducing the notion of data models in the next section. 1.3.3 Data Models Underlying the structure of a database is the data model: a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints. A data model provides a way to describe the design of a database at the physical, logical, and view levels